Overfitting in neural networks, just as in other models, occurs when the model memorizes the training set rather than learning patterns that generalize, which does not help when it comes to new data. The first go-to solution is to gather more data or augment the existing dataset; the second is to regularize the model.
This topic introduces some of the classic regularization techniques used during the training of neural networks.
L1 and L2 regularization
We will start off with the most common type of regularization: L2 regularization (also known as weight decay). L2 regularization introduces a penalty term into the loss function to limit the model's complexity, that is, its ability to model the training data too closely. The penalty keeps the weights as small as possible (but not exactly zero), so that the output function is smoother.
To be more specific, for every weight $w$ in the network, the term $\frac{1}{2}\lambda w^2$ (where $\lambda$ is the regularization strength) is added to the cost function (the $\frac{1}{2}$ at the front of the penalty simply makes the gradient of the term equal to $\lambda w$).
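As a minimal NumPy sketch (the variable names and values are illustrative, not from the text; `lam` stands for $\lambda$), the penalty and its gradient look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)  # a hypothetical weight vector
lam = 0.01               # regularization strength (lambda)

l2_penalty = 0.5 * lam * np.sum(w ** 2)  # term added to the data loss
l2_grad = lam * w                        # its gradient w.r.t. each weight

# In a gradient descent step this gradient shrinks every weight toward zero,
# which is why L2 regularization is also called weight decay:
lr = 0.1
w -= lr * l2_grad  # (the data-loss gradient is omitted for brevity)
```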
With L2 regularization, we are primarily concerned with the value of $\lambda$. Considering a synthetic network example (where the black curve represents the true function and the blue curve is the fitted model), let's look at the impact of different $\lambda$ values:
For small values of $\lambda$ (a & b), the model perfectly overfits the data (since the values are so close to 0, virtually no regularization occurs). For intermediate values (c & d), the fitted function is closer to the ground truth, and for large values, the fit becomes worse again. These specific values of $\lambda$ should not be generalized, but they provide some intuition. L2 regularization encourages the network to use all of its inputs in a balanced way rather than relying heavily on a specific subset, which has a stabilizing effect.
L1 regularization instead adds the term $\lambda |w|$ for each weight $w$ to the cost function. L1 makes the weights sparse: the network ends up using only the most important inputs (everything else is set to 0), which reduces the noise sensitivity of the network.
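A matching sketch for the L1 term (same illustrative setup as above); note that, unlike the L2 gradient, its magnitude does not shrink as the weight gets smaller, which is what pushes small weights all the way to zero:

```python
import numpy as np

w = np.array([0.5, -0.2, 0.0, 1.3])   # hypothetical weights
lam = 0.01

l1_penalty = lam * np.sum(np.abs(w))  # term added to the data loss
l1_grad = lam * np.sign(w)            # subgradient; at w = 0 we take it as 0
```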
Unless feature selection is needed, L2 regularization is generally expected to yield better results.
Early stopping
While we can try to force our model to be simpler, we can also try to catch the moment when the model starts overfitting and stop the training process right before it, which is known as early stopping. The general idea of early stopping is to terminate the training if there have been no improvements on the validation set for some time.
In more detail, training takes place over a certain number of iterations (alternatively, epochs), and after each iteration, the predictive performance is evaluated on the validation set. Whenever the error on the validation set drops, a copy of the parameters is saved. Termination happens when there are no validation error improvements for a specified number of iterations. This number, often denoted as $p$ and called the patience, is the only hyperparameter in early stopping. In the end, the last best saved set of parameters is returned (instead of the actual latest parameters).
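A self-contained sketch of this procedure (the model, training step, and validation errors below are toy placeholders, not a real API):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error, patience):
    """Stop once the validation error hasn't improved for `patience` epochs."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    bad_epochs = 0
    while bad_epochs < patience:
        train_one_epoch(model)
        error = validation_error(model)
        if error < best_error:  # improvement: save a copy and reset the counter
            best_error, best_model, bad_epochs = error, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
    return best_model  # the last best saved parameters, not the latest ones

# Toy demo: the validation error falls, then starts rising as overfitting sets in.
errors = iter([0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.8])
model = {"epoch": 0}
best = train_with_early_stopping(
    model,
    train_one_epoch=lambda m: m.update(epoch=m["epoch"] + 1),
    validation_error=lambda m: next(errors),
    patience=3,
)
print(best)  # {'epoch': 3}, the snapshot taken when the error was lowest (0.6)
```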
Early stopping might terminate the training process before reaching full convergence, ideally at the point where the model has captured the basic pattern of the function but hasn't yet fitted the noise. Because training starts from small initial weights and is halted early, the weights don't get a chance to grow large, which makes early stopping similar to L2 regularization (it has been proven that in the case of simple gradient descent with a quadratic loss function, L2 regularization is equivalent to early stopping).
Although rather intuitive, early stopping might halt the training before reaching the global minimum.
Dropout
Next, we will consider dropout, which makes the network learn a redundant representation of the data by disabling a fraction of its neurons. Dropout works by setting a random subset of the units in a hidden layer (or, less frequently, the input layer) to zero at each iteration.
At a high level, during the training phase every neuron is deactivated at each iteration with probability $p$ (a hyperparameter known as the dropout probability, often set to 0.5 by default). This operation makes the network less dependent on any specific set of hidden units (since they might be deactivated at any point) and lowers the magnitude of the weights, so that the learned function is less affected by the absence or presence of specific units. In turn, this makes the network more robust.
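As a minimal NumPy sketch of the training-time behaviour (illustrative values; `p` is the drop probability from above):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                    # dropout probability
h = rng.normal(size=8)     # hidden-layer activations
mask = rng.random(8) >= p  # each unit survives with probability 1 - p
h_dropped = h * mask       # dropped units contribute nothing downstream
```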
When it comes to inference, dropout is not applied, because it introduces randomness and we would like the predictions to be consistent given the learned weights. Predictions are instead made with the full network, where all units are active. This introduces an issue: the network now has more active hidden units than it was trained with (the weights have been adjusted under the assumption that a certain fraction of the neurons could be dropped out):
So, to compensate for the fact that all neurons are now active, the weights are multiplied by $1 - p$, the probability that a unit was kept during training. This is known as the weight scaling inference rule (e.g., if $p$ was set to $0.5$ during training, the neuron outputs at inference will be multiplied by $0.5$).
Rescaling the weights during inference isn't all that convenient, so, in practice, most implementations use inverted dropout, where the activations are rescaled during the training phase instead (e.g., if $p = 0.5$, the surviving activations would be multiplied by $\frac{1}{1 - p} = 2$).
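Extending the sketch above, inverted dropout moves the rescaling into the training pass so that inference needs no correction at all (again, illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
h = rng.normal(size=8)

# Training: drop units AND scale the survivors up by 1 / (1 - p)...
mask = rng.random(8) >= p
h_train = h * mask / (1.0 - p)

# ...so that at inference the full network is used as-is,
# with no rescaling (unlike the weight scaling inference rule):
h_test = h
```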
Intuitively, dropout behaves like an ensemble of smaller networks, and ensembles generally result in better performance. Unlike a regular ensemble, though, instead of training multiple models and averaging their outputs, dropout effectively trains many weight-sharing sub-networks at once (a different one is sampled at each iteration).
Conclusion
As a result, you are now familiar with:
- How L1 and L2 regularization add penalties on the network's weights to the cost function;
- Early stopping, which halts the training once there are no significant improvements in the validation set error;
- Dropout, which deactivates a fraction of the neurons to ensure that the network is less dependent on the connections between specific units.