The major disadvantage of gradient descent with a fixed learning rate is its approach to parameter updates: significant adjustments are made for larger gradients, and minor ones for smaller gradients. This becomes an issue when the gradient of the loss surface is substantially steeper in one direction compared to another, making it difficult to select an appropriate fixed learning rate. In the previous topics, you learned about ways to make the learning rate adaptive in some manner, namely Momentum GD and RMSProp.
In this topic, we take a look at the combination of Momentum and RMSProp — the Adam optimizer.
Adam
Just like RMSProp, Adam (Adaptive Moment Estimation) updates each parameter individually and uses the exponentially decaying average of the past squared gradients ($v_t$) along with the exponentially decaying average of the past gradients ($m_t$), similar to Momentum GD:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
where
$t$ — the time step;
$m_t$ — the mean (also known as the first moment estimate), a decayed running average of all gradients for $t$ iterations. $m_t$ keeps track of the direction of the gradients;
$v_t$ — the variance (also known as the second moment estimate), a decayed running average of the squared gradients for $t$ iterations. $v_t$ tracks the magnitude of the gradients and represents the gradient's volatility ($v_t$ will be higher if the gradients change severely at each step, and lower if they are more stable);
$g_t$ — the gradient of the cost function at step $t$ ($g_t = \nabla_\theta J(\theta_t)$);
$\beta_1$ — the exponential decay rate associated with the running average of the gradients ($m_t$); it represents the amount of momentum used to update the weights (e.g., if $\beta_1$ is close to $1$, the model will heavily consider the past gradients for the current update);
$\beta_2$ — the exponential decay rate associated with the running average of the squared gradients; it controls how fast the estimate of the second moment (variance) of the gradients decays (e.g., if $\beta_2$ is close to $1$, the past squared gradients have a great influence on the learning rate).
$m_t$ and $v_t$ are initialized as zero vectors, making them biased towards $0$, which introduces the bias-corrected first and second moment estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The Adam parameter update then becomes

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
where
$\alpha$ — the initial learning rate;
$\epsilon$ — a small constant to avoid division by zero.
The full simplified Adam parameter update can be presented as
m = beta1*m + (1-beta1)*dx                     # first moment: running mean of gradients
mt = m / (1-beta1**t)                          # bias-corrected first moment
v = beta2*v + (1-beta2)*(dx**2)                # second moment: running mean of squared gradients
vt = v / (1-beta2**t)                          # bias-corrected second moment
x -= learning_rate * mt / (np.sqrt(vt) + eps)  # parameter update

With Adam, you have to set the following hyperparameters:
$\alpha$ — the initial learning rate;
$\beta_1$ — the decay rate associated with the running average of the gradients;
$\beta_2$ — the decay rate associated with the running average of the squared gradients;
$\epsilon$ — a constant to avoid division by zero.
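To tie the formulas and the snippet above together, here is a minimal, self-contained NumPy sketch that runs the full Adam loop on a toy quadratic loss; the loss function, starting point, step count, and hyperparameter values here are illustrative choices, not recommendations:

import numpy as np

# Toy problem: minimize f(x) = x[0]**2 + 10 * x[1]**2, whose gradient is easy to write down.
def grad(x):
    return np.array([2.0 * x[0], 20.0 * x[1]])

x = np.array([5.0, 5.0])                 # initial parameters
m = np.zeros_like(x)                     # first moment estimate
v = np.zeros_like(x)                     # second moment estimate
learning_rate, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx              # running mean of gradients
    v = beta2 * v + (1 - beta2) * dx**2           # running mean of squared gradients
    mt = m / (1 - beta1**t)                       # bias-corrected first moment
    vt = v / (1 - beta2**t)                       # bias-corrected second moment
    x -= learning_rate * mt / (np.sqrt(vt) + eps)

print(x)   # the parameters end up close to the minimum at [0, 0]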
Hyperparameter tuning
That's it.
On a more serious note, you mainly have to tune the initial learning rate. Epsilon is almost never tuned and doesn't have a dramatic effect on the performance (being a constant set to 1e-08 by default in the PyTorch implementation). However, it's worth mentioning that
The default value for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. (source)
The default $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (again, as per the PyTorch Adam implementation) seem to yield good results in practice, and it's safe to leave them at that for most tasks (also, if we consider that the decay rate lies in the $[0, 1)$ range, these defaults are pretty close to $1$). Rather small values for $\beta_1$, e.g., $0$ or $0.5$, are rarely used in practice, and, when $\beta_2$ is set to a large enough value in the context of the specific task with a small $\beta_1$, Adam will converge (more on that here). Lower $\beta_1$ use cases are present in some original GAN implementations; for example, ProGAN sets $\beta_1 = 0$ and $\beta_2 = 0.99$, while DCGAN and Pix2Pix set $\beta_1 = 0.5$ and $\beta_2 = 0.999$.
The initial learning rate is set to 1e-3 by default in PyTorch. A scheduler (which adjusts the learning rate during training so that it varies following a predefined schedule, e.g., changing the learning rate every few epochs as a basic example) can be combined with Adam (and, in fact, scheduling can lead to a significant performance improvement).
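As a sketch of how this looks in PyTorch (the linear model, the random dummy batch, and the step-decay schedule below are placeholders; the optimizer arguments are just the defaults discussed above):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # placeholder model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                                  # default initial learning rate
    betas=(0.9, 0.999),                       # default (beta1, beta2)
    eps=1e-8,                                 # default epsilon
)
# Multiply the learning rate by 0.1 every 30 epochs (an arbitrary schedule).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    for inputs, targets in [(torch.randn(32, 10), torch.randn(32, 1))]:  # dummy batch
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # update the learning rate once per epoch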
An alternative approach is to use the Optuna library or a similar package for automatic hyperparameter search. As an overview, one would create an objective function (which includes the model itself, the possible hyperparameter space, and the objective to optimize), create a study, and retrieve the best parameters. Optuna could be used to find the optimal $\beta$s, the initial learning rate, or both.
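A minimal Optuna sketch along those lines (the search ranges and trial count are illustrative, and train_and_evaluate is a hypothetical helper that trains a model with the sampled settings and returns the validation loss):

import optuna

def objective(trial):
    # Sample candidate Adam hyperparameters (ranges are illustrative).
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    beta1 = trial.suggest_float("beta1", 0.5, 0.95)
    beta2 = trial.suggest_float("beta2", 0.9, 0.9999)
    # train_and_evaluate is a hypothetical helper: it should train the model
    # with these settings and return the validation loss to minimize.
    return train_and_evaluate(lr=lr, betas=(beta1, beta2))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)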
A note on Momentum and RMSProp
In this section, let's try to see how exactly Adam combines RMSProp with Momentum step-by-step. We will end up with a simplified version of Adam (to be specific, there will be no bias correction, but it will be pretty similar).
As a reminder, the Momentum GD update has the following form:

$$m_t = \beta m_{t-1} + (1 - \beta) g_t$$
$$\theta_{t+1} = \theta_t - \alpha\, m_t$$
Momentum notation
$\beta$ — the momentum coefficient;
$g_t$ — the gradient at step $t$;
$m_t$ — the momentum at time step $t$, the running average of past gradients;
$\alpha$ — the initial learning rate.
For RMSProp, the update is:

$$v_t = \gamma v_{t-1} + (1 - \gamma) g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t} + \epsilon}\, g_t$$
RMSProp notation
$v_t$ — the running average of the squared gradients;
$\gamma$ — the decay factor.
Note that the Momentum term $m_t$ corresponds to the first moment estimate (the decayed running average of all gradients for $t$ iterations), i.e., $m_t$ in Adam with $\beta = \beta_1$, and the RMSProp term $v_t$ is the second moment estimate (the decayed running average of the squared gradients for $t$ iterations), i.e., $v_t$ in Adam with $\gamma = \beta_2$. So, combining the two, with the step direction taken from Momentum and the per-parameter scaling taken from RMSProp, results in the Adam update without the bias correction term:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t} + \epsilon}\, m_t$$
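To make the combination explicit in code, here is a sketch of one step of each rule in the style of the earlier snippet (x, dx, m, v, learning_rate, beta1, beta2, and eps are assumed to be defined as before):

m = beta1 * m + (1 - beta1) * dx          # Momentum: running average of gradients
# Momentum GD alone would update with: x -= learning_rate * m

v = beta2 * v + (1 - beta2) * dx**2       # RMSProp: running average of squared gradients
# RMSProp alone would update with: x -= learning_rate * dx / (np.sqrt(v) + eps)

# Simplified Adam (no bias correction): Momentum's direction, RMSProp's scaling.
x -= learning_rate * m / (np.sqrt(v) + eps)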
Conclusion
As a result, you are now familiar with how the Adam optimizer combines Momentum GD with RMSProp, its main components, and some considerations for choosing the hyperparameter values.