Dropout tales

Insights into a popular regularization technique


Dropout is an effective regularization technique used to reduce overfitting in neural networks. It works like this: given a feedforward neural network, at training time randomly remove (drop) neurons at each non-output layer with a certain probability. For example, if the probability is 0.5 (it can vary across layers), you flip a coin for each neuron and decide whether it stays in or is left out. The picture below shows the original network (called the base or parent network) on the left and the network after applying dropout on the right.

As you may have noticed, at training time the network on the right (picture above) is a simpler network (fewer units) and therefore tends to express a simpler model.

At test time no units are dropped, so the full network is used to make predictions. The picture below shows what happens at training time and at test time.

At training time, a unit (neuron) is present with a certain probability p and is connected to units in the next layer with weights w. At test time, the unit is always present and weights are multiplied by p. This is because we would like the outputs of units during test time to be equivalent to their expected outputs at training time.

In fact, dropout retains a unit with probability p and removes a unit (the output of a unit is set to 0) with probability 1 − p. This means that if the output of a unit prior to dropout was x, then after dropout the expected output would be E[output] = px + (1 − p) · 0 = px. Therefore, to ensure that the outputs have the same expectation at test time as they did during training, we have to multiply weights by p at test time.
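To make the expectation argument concrete, here is a minimal sketch (ours, not from the original article) of this "classic" dropout scheme, using arbitrary example values p = 0.8 and a constant pre-dropout output of 2.0:

import torch

torch.manual_seed(0)

p = 0.8                                   # probability of retaining a unit
x = torch.full((100_000,), 2.0)           # pre-dropout output, repeated to estimate the mean

# Training time: keep each value with probability p, set it to 0 with probability 1 - p.
mask = torch.bernoulli(torch.full_like(x, p))
print((mask * x).mean())                  # ~ p * x = 1.6

# Test time: the unit is always present and its outgoing weight is multiplied by p,
# which reproduces the same expected value.
print(p * 2.0)                            # 1.6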

However, this implementation of dropout is undesirable because it requires scaling the neuron outputs at test time. This is bad for test-time performance, so it is preferable to use inverted dropout, where the scaling occurs at training time instead of at test time. In inverted dropout, the output of any retained unit is divided by p before the value is propagated to the next layer. In this case

\displaystyle \text{E}[\text{output}] = p \cdot \frac{x}{p} + (1 - p) \cdot 0 = x\,,

so the expectation already matches the test-time output and no scaling is needed at test time.
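As a minimal sketch (again ours) of inverted dropout under the same convention, where p is the probability of retaining a unit:

import torch

def inverted_dropout(x, p):
    # Keep each unit with probability p and divide the retained values by p
    # at training time, so no scaling is needed at test time.
    mask = torch.bernoulli(torch.full_like(x, p))
    return mask * x / p

torch.manual_seed(0)
x = torch.full((100_000,), 2.0)
print(inverted_dropout(x, p=0.8).mean())   # ~ 2.0: already matches the test-time output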

Dropout as a regularizer

Because units can go away at random, each neuron may lose one or more important inputs from the previous layer, so it cannot rely on any single input. The neuron has to spread out its weights across its incoming connections, which causes the weights to shrink. This shrinking lowers the squared norm of the weights. Hence dropout is, in some respects, similar to L2 regularization. This explanation can be found in this video lecture by Andrew Ng.

The fact that some form of L2 regularization was hiding behind the dropout technique was already discussed in a 2013 article. One of the study's findings is that dropout can be seen as an attempt to apply an L2 penalty after normalizing the feature vector by a quantity that depends on the diagonal of an estimate of the Fisher information matrix.

The picture above compares two L2 regularizers (take a look at this page if you need a quick recap on regularization). The solid ellipses are level surfaces of the likelihood and the dashed curves are level surfaces of the regularizer. The top panel shows a classic spherical L2 regularizer. Let I be the Fisher information matrix: if I were a multiple of the identity matrix, these level surfaces would be perfectly spherical. With dropout, the level surfaces are non-spherical (bottom panel) because the features are first normalized by diag(I)⁻¹ᐟ²; the L2 penalty is applied after this scaling, once the features have been balanced out.

Dropout as a bagging algorithm

There is an intuitive link between regularization and the size/complexity of the network (see picture below). Smaller networks correspond to rigid, simple models. To avoid overfitting, it can be useful to exploit a method that reduces complexity and returns a better-performing model. Intuitively, fewer neurons (units) in action correspond to simpler models.

As you may have noticed, at training time the network (after applying dropout) is a simpler network (fewer units) that tends to express a simpler model, possibly reducing overfitting. The network is trained to produce accurate predictions on unseen data even in unfriendly conditions where some neurons are missing.

Recall that to learn with bagging, we define t different learners (the members of the ensemble), construct t different datasets by sampling from the training set with replacement, and then train model i on dataset i. The bagging meta-algorithm is depicted below: (1) create multiple data sets Dᵢ through sampling with replacement; (2) employ multiple learners Lᵢ in parallel; (3) combine all learners using an averaging or majority-vote strategy.
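As an illustration only (the toy dataset and the least-squares learners below are our arbitrary choices, not taken from any referenced work), the bagging meta-algorithm can be sketched like this:

import numpy as np

rng = np.random.default_rng(0)

# Toy regression data, used only to keep the sketch self-contained.
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

t = 10
models = []
for i in range(t):
    idx = rng.integers(0, len(X), size=len(X))             # (1) sample D_i with replacement
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)    # (2) train learner L_i on D_i
    models.append(w)

x_new = rng.normal(size=5)
print(np.mean([x_new @ w for w in models]))                # (3) average the predictions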

Dropout aims to approximate this process, but with an exponentially large number of neural networks. Dropout trains the ensemble consisting of (possibly all) subnetworks that can be formed by removing non-output units from a given base network (see figure below). A base network with n non-output units can generate up to 2ⁿ thinned subnetworks.

When training with dropout, we use minibatches, and each time we load an example into a minibatch we randomly sample a different binary mask (0 = out, 1 = in) to apply to all of the input and hidden units in the network.
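For instance (a sketch of ours, not the internals of any library), masks sampled independently for every example in a minibatch might look like this:

import torch

torch.manual_seed(0)

batch = torch.ones(4, 6)                        # a minibatch of 4 examples with 6 units each
keep_prob = 0.5

masks = torch.bernoulli(torch.full_like(batch, keep_prob))   # one binary mask per example
print(masks)                                    # each row is a different mask (0 = out, 1 = in)
print(batch * masks / keep_prob)                # apply the masks with inverted-dropout scaling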

There is a significant difference between bagging and dropout. Bagging models are all independent. Dropout models, instead, share parameters: each model inherits a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory. Moreover, dropout training differs from bagging in that each sampled subnetwork is typically trained for only a single step.

In bagging, the prediction of the ensemble is given by the arithmetic mean of all of the resulting predictions. In the case of dropout, it is not feasible at test time to explicitly average the predictions from exponentially many thinned models. However, there is a simple approximate averaging method that works well in practice. There is no strong theoretical justification (at the moment) for the accuracy of this approximate averaging method, but empirically it performs very well. The idea is to use a single neural net at test time without dropout. This neural net is obtained by adjusting the weights as shown before, i.e. the outgoing weights of a retained unit are multiplied by p at test time. We already observed that this ensures that, for any hidden unit, the actual output at test time matches the expected output at training time. Through this scaling, a large number of networks with shared weights can be combined into a single neural network to be used at test time.
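The following sketch (ours; it uses PyTorch's nn.Dropout, introduced in the next section, with an arbitrary small architecture) compares explicit Monte Carlo averaging over many randomly thinned networks with the single weight-scaled network used at test time:

import torch
from torch import nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(),
                      nn.Dropout(p=0.5),
                      nn.Linear(100, 50), nn.ReLU(),
                      nn.Linear(50, 1))
x = torch.rand(1, 10)

model.train()    # dropout active: every forward pass samples a different thinned network
with torch.no_grad():
    mc_average = torch.stack([model(x) for _ in range(10_000)]).mean()

model.eval()     # dropout disabled: the single full network with the scaling built in
with torch.no_grad():
    single_pass = model(x).squeeze()

print(mc_average, single_pass)   # typically close but not identical, illustrating the approximation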

Dropout in practice

Dropout is implemented in PyTorch through the nn.Dropout class. nn.Dropout randomly zeroes some of the elements of the input tensor with probability p, using samples from a Bernoulli distribution; during training the retained elements are also scaled by a factor of 1/(1 − p), which is exactly the inverted dropout scheme described above. Note that here p is the probability of dropping a unit; this is different from our previous usage (so far we have denoted by p the probability of retaining a unit).

Below is a minimal example showing how Dropout sets several entries of the matrix x to zero (with p = 0.75, about 3 units out of 4 are dropped).

import torch
from torch.nn import Dropout

x = torch.full((3, 5), 1.0)   # a 3x5 matrix of ones
print(x)
dropout = Dropout(p=0.75)     # each element is zeroed with probability 0.75
y = dropout(x)
print(y)                      # retained entries are scaled to 1 / (1 - 0.75) = 4.0

The TensorFlow analogue is tf.keras.layers.Dropout. Below is a small neural network example with nn.Dropout modules interspersed between Linear layers.

import torch
from torch.nn import Sequential, Linear, ReLU, Dropout

# Dropout() uses the default drop probability p = 0.5.
model = Sequential(Linear(10, 100), ReLU(),
                   Dropout(),
                   Linear(100, 50), ReLU(),
                   Dropout(),
                   Linear(50, 2))
t = torch.rand(10)
# The model is in training mode by default, so dropout is active here;
# calling model.eval() would disable it at inference time.
print(model(t))

If the neural network is defined as a class, nn.Dropout can be applied inside the forward method.
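For example (the layer sizes below simply mirror the Sequential model above; this is one possible arrangement, not the only one):

import torch
from torch import nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(10, 100)
        self.fc2 = nn.Linear(100, 50)
        self.fc3 = nn.Linear(50, 2)
        self.dropout = nn.Dropout(p=p)          # declared once in __init__

    def forward(self, x):
        x = self.dropout(F.relu(self.fc1(x)))   # applied in forward, after each hidden layer
        x = self.dropout(F.relu(self.fc2(x)))
        return self.fc3(x)

model = Net()
print(model(torch.rand(10)))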

What are the best values for p? There is no single value that works in all situations; the key is to repeat experiments until a satisfactory value is reached. As initial values to be refined later, some sources suggest retaining an input unit with probability 0.8 (drop probability p = 0.2) and a hidden unit with probability 0.5 (p = 0.5).
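As a concrete starting point only (these are just the initial guesses mentioned above, to be tuned experimentally), such a configuration could look like this:

import torch
from torch.nn import Sequential, Linear, ReLU, Dropout

model = Sequential(Dropout(p=0.2),               # drop inputs with probability 0.2
                   Linear(10, 100), ReLU(),
                   Dropout(p=0.5),                # drop hidden units with probability 0.5
                   Linear(100, 2))
print(model(torch.rand(10)))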

Useful links

Dropout: A Simple Way to Prevent Neural Networks from Overfitting
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov
Journal of Machine Learning Research 15 (1929–1958), 2014 [link].

Fundamentals of Deep Learning
N. Buduma, N. Locascio
36–37, O’Reilly Media, Inc., 2017.

Dropout Training as Adaptive Regularization
S. Wager, S. Wang, P. Liang
arXiv:1307.1493 [stat.ML], 2013.

Deep Learning
I. Goodfellow, Y. Bengio, A. Courville
Chapter 7 (224–270), MIT Press, 2016 [link].

Dropout — PyTorch docs page.

How does dropout work during testing in neural network? [Prylipko]
