The learning rate is perhaps the most important hyperparameter to tune when training neural networks. The story goes like this: if the selected fixed learning rate is too large, you diverge and get nowhere. If it's too small, training is slow and may settle in a suboptimal solution.
In the earlier days, one would probably run a grid or random search and hope that a good learning rate would be found somewhere along the way, which was time-consuming and potentially inaccurate. One possible solution is to introduce scheduling: adjust the learning rate automatically over the course of training based on some condition (for example, reduce the learning rate by a constant factor if the validation error stops improving).
This topic introduces some of the most common learning rate scheduling strategies in their basic forms.
Step decay
As a brief reminder of the most basic setting, the learning rate ($\eta$) is the step size that stochastic gradient descent (SGD) uses during back-propagation, appearing in the update rule:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$

Here, we just set the learning rate ($\eta$) once and deal with issues such as slow convergence or getting stuck in local minima further down the line.
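To make the role of $\eta$ concrete, here is a minimal sketch of a single SGD update in plain Python; the parameter values, gradients, and the learning rate of 0.1 are made-up numbers for illustration only.

```python
# A single SGD step with a fixed learning rate: theta <- theta - eta * gradient.
# All numbers below are made up purely for illustration.
eta = 0.1  # fixed learning rate

def sgd_step(params, grads, eta):
    """Apply the update rule element-wise to a list of parameters."""
    return [theta - eta * g for theta, g in zip(params, grads)]

params = [0.5, -1.2, 3.0]
grads = [0.4, -0.1, 0.9]   # pretend these came from back-propagation
params = sgd_step(params, grads, eta)
print(params)              # roughly [0.46, -1.19, 2.91]
```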
First, let's look at step decay, which reduces the learning rate by some factor every few epochs (e.g., the learning rate drops by half every 20 epochs). The general intuition behind step decay can be summed up as "let's move as fast as possible for the first epochs, and drop the speed only a few times to converge after the majority of the path has been covered". Formally, step decay can be defined as

$$\eta_t = \eta_0 \cdot \gamma^{\left\lfloor t / s \right\rfloor},$$

where $t$ is the current epoch and $\eta_0$ is the initial learning rate.
There are two hyperparameters (technically three, if the initial learning rate $\eta_0$ is included): the decay factor $\gamma$, which multiplies the learning rate at each drop and lies in the interval $(0, 1)$, and the step size $s$ (here we operate in epochs, but it could also be iterations), which controls how often the learning rate is decayed. Both $\gamma$ and $s$ depend on the specific model. The step size is usually chosen so that there are 2 or 3 learning rate drops over the course of training.
As a (not so fun) fact, in the original Xception paper, $\gamma$ is set to 0.94 and the step size is 2 epochs for the ImageNet dataset. For MobileNetV3, the learning rate is reduced by 0.01 every 3 epochs, etc. In practice, step decay is particularly suitable for long training runs (100+ epochs) and tends to improve the final accuracy.
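As a quick illustration, the schedule above can be written as a short Python function; the initial learning rate, the decay factor $\gamma = 0.5$, and the step size of 20 epochs are assumed example values rather than settings from any of the papers mentioned. In PyTorch, `torch.optim.lr_scheduler.StepLR` provides the same behavior out of the box.

```python
# Step decay: eta_t = eta_0 * gamma ** floor(t / s)
# eta_0, gamma, and step_size are assumed example values.
def step_decay(epoch, eta_0=0.1, gamma=0.5, step_size=20):
    return eta_0 * gamma ** (epoch // step_size)

for epoch in (0, 19, 20, 40, 60):
    print(epoch, round(step_decay(epoch), 6))
# 0  -> 0.1     (initial rate)
# 19 -> 0.1     (still within the first step)
# 20 -> 0.05    (first drop)
# 40 -> 0.025
# 60 -> 0.0125
```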
Exponential decay
Another approach is exponential decay. Instead of dropping the rate after some fixed number of epochs, the learning rate is reduced after every epoch, which makes the decay much smoother. Exponential decay is defined as follows:

$$\eta_t = \eta_0 \cdot \gamma^{t}$$
$\gamma$ here is consistent with the step decay notation and also lies in the interval $(0, 1)$.
By definition, step and exponential decay look very similar. Exponential decay, just like step decay, stabilizes convergence. However, exponential decay has a higher tendency to drop the test accuracy (and thus, step decay is usually preferred in practice).
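For comparison, here is the same kind of sketch for exponential decay, again with assumed values for $\eta_0$ and $\gamma$; in PyTorch, `torch.optim.lr_scheduler.ExponentialLR` implements this schedule.

```python
# Exponential decay: eta_t = eta_0 * gamma ** t  (a small reduction at every epoch)
# eta_0 and gamma are assumed example values.
def exponential_decay(epoch, eta_0=0.1, gamma=0.95):
    return eta_0 * gamma ** epoch

for epoch in (0, 1, 10, 50):
    print(epoch, round(exponential_decay(epoch), 6))
# 0  -> 0.1
# 1  -> 0.095
# 10 -> ~0.0599
# 50 -> ~0.0077
```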
Cosine annealing
So far, we have consistently reduced the learning rate in both schedulers. Cosine annealing does something very different by changing the learning rate in cycles. It starts with a large learning rate that drops to the minimum and then increases again. At first, this kind of behavior might seem strange (didn't we want to go fast in the initial stages and slow down later on?).
In reality, the loss surface might be a bit more complicated than a single-minimum bowl. That is to say, the loss surface can be extremely non-convex and have many local minima, and a straightforward learning rate reduction might end up in a sub-optimal solution. If the learning rate fluctuates within a limited range, there is a higher chance of reaching the global minimum, because the larger learning rates can push the weights into a new region of the loss surface, while the smaller ones refine the solution within the current one. The learning rate reset simulates a "warm" restart of the learning process, with a set of good weights as a starting point instead of random weights (the latter is known as a "cold" restart).
Thus, cosine annealing is defined as

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{t}{T}\pi\right)\right),$$

where

$\eta_{\min}$ and $\eta_{\max}$ are the lower and upper bounds of the learning rate;

$T$ is the number of epochs in one cycle (between warm restarts), and $t$ counts the epochs since the last restart.

The cosine term follows the regular cosine function trajectory but is re-scaled so that the schedule has a maximum of $\eta_{\max}$ (instead of 1) and a minimum of $\eta_{\min}$ (instead of -1). Towards the end of each cycle, the learning rate with cosine annealing gets very close to $\eta_{\min}$, which is usually set close to 0.
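A minimal sketch of cosine annealing with warm restarts is shown below; the bounds $\eta_{\min}$, $\eta_{\max}$ and the cycle length $T$ are assumed example values. In PyTorch, `torch.optim.lr_scheduler.CosineAnnealingLR` and `CosineAnnealingWarmRestarts` offer ready-made implementations.

```python
import math

# Cosine annealing: eta = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))
# eta_min, eta_max, and T are assumed example values, not tuned for any model.
def cosine_annealing(epoch, eta_min=0.001, eta_max=0.1, T=50):
    t = epoch % T  # epochs since the last warm restart
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

for epoch in (0, 25, 49, 50, 75):
    print(epoch, round(cosine_annealing(epoch), 6))
# 0  -> 0.1       (start of a cycle, eta_max)
# 25 -> ~0.0505   (halfway down)
# 49 -> ~0.0011   (close to eta_min)
# 50 -> 0.1       (warm restart: back to eta_max)
# 75 -> ~0.0505
```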
Cosine annealing tends to be more unstable, but often leads to higher accuracy (and more advanced versions can bring an even greater accuracy improvement).
A comparative table of the LR schedulers
| Scheduler | The main takeaway | Misc |
| --- | --- | --- |
| Step decay | Significantly drops the learning rate after a fixed number of epochs. | Consistent results, improves accuracy, widely used. |
| Exponential decay | Drops the learning rate after each epoch. | Tends to have lower accuracy when compared to other schedulers. |
| Cosine annealing | The learning rate fluctuates between two boundaries in a cyclical fashion. | Tends to be more unstable than step and exponential decay, but often results in higher accuracy; doesn't always work and needs more hyperparameter tuning. |
Conclusion
As a result, you are now familiar with the following:
The learning rate scheduler is a mechanism to adjust the learning rate during the course of training based on a certain condition (such as reaching a specific epoch);
Step decay, which reduces the learning rate by a factor $\gamma$ every $s$ epochs;
Exponential decay, which is similar to step decay, but the reduction happens at every single epoch;
Cosine annealing, which varies the learning rate between the lower and upper bounds, allowing for more exploration compared to exponential and step decay.