
Adam


The major disadvantage of gradient descent with a fixed learning rate is its approach to parameter updates: significant adjustments are made for larger gradients, and minor ones for smaller gradients. This becomes an issue when the loss surface is substantially steeper in one direction than in another, making it difficult to select an appropriate fixed learning rate. In the previous topics, you learned about two ways to make the updates adaptive: Momentum GD, which smooths the update direction, and RMSProp, which adapts the effective learning rate for each parameter.

In this topic, we take a look at the combination of Momentum and RMSProp — the Adam optimizer.

Adam

Just like RMSProp, Adam (Adaptive Moment Estimation) updates each parameter individually and uses an exponentially decaying average of the past squared gradients (v_t); in addition, similar to Momentum GD, it keeps an exponentially decaying average of the past gradients (m_t):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

where

  • t — the time step;

  • m_t — the mean, also known as the first moment estimate: a decayed running average of all gradients over t iterations. m_t keeps track of the direction of the gradients;

  • v_t — the variance, also known as the second moment estimate: a decayed running average of the squared gradients over t iterations. v_t tracks the magnitude of the gradients and represents their volatility (v_t is higher if the gradients change severely at each step, and lower if they are more stable);

  • g_t — the gradient of the cost function at step t (g_t = \nabla_\theta J(\theta_t));

  • \beta_1 — the exponential decay rate associated with the running average of the gradients (m_t); it represents the amount of momentum used to update the weights (e.g., if \beta_1 = 0.9, the model will heavily weight the past gradients in the current update);

  • \beta_2 — the exponential decay rate associated with the running average of the squared gradients; it controls how fast the estimate of the second moment (variance) of the gradients decays (e.g., if \beta_2 = 0.999, the past squared gradients have a great influence on the effective learning rate).

Since m_t and v_t are initialized as zero vectors, they are biased towards 0, especially during the first steps. Adam therefore uses bias-corrected first and second moment estimates:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
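
To see why this correction matters, consider the very first step (t = 1), with m_0 = 0 and the common default \beta_1 = 0.9:

m_1 = \beta_1 \cdot 0 + (1 - \beta_1) g_1 = 0.1\, g_1, \qquad \hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{0.1\, g_1}{0.1} = g_1

Without the correction, the first update would use only 10% of the actual gradient; with it, the estimate matches the gradient exactly. The same reasoning applies to v_t and \beta_2.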

The Adam parameter update then becomes

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

where

  • \eta — the initial learning rate;

  • \epsilon — a small constant to avoid division by zero.

Putting it all together, the simplified Adam parameter update can be written in code as

import numpy as np

# One simplified Adam step at iteration t (t starts at 1); m and v are initialized as zero arrays
m = beta1 * m + (1 - beta1) * dx               # first moment estimate (running mean of gradients)
mt = m / (1 - beta1**t)                        # bias-corrected first moment
v = beta2 * v + (1 - beta2) * (dx**2)          # second moment estimate (running mean of squared gradients)
vt = v / (1 - beta2**t)                        # bias-corrected second moment
x -= learning_rate * mt / (np.sqrt(vt) + eps)  # parameter update
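
To see these updates in action, here is a minimal, self-contained NumPy sketch (the toy quadratic objective, the variable names, and the hyperparameter values are illustrative assumptions, not part of the snippet above):

import numpy as np

# Toy objective: 0.5 * sum(A * x**2) with an ill-conditioned diagonal A,
# so the gradient is much steeper in one coordinate than in the other
A = np.array([100.0, 1.0])
x = np.array([1.0, 1.0])                 # parameters to optimize

m = np.zeros_like(x)                     # first moment estimate
v = np.zeros_like(x)                     # second moment estimate
learning_rate, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):                 # t starts at 1 for the bias correction
    dx = A * x                           # gradient of the toy objective
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx**2
    mt = m / (1 - beta1**t)              # bias-corrected first moment
    vt = v / (1 - beta2**t)              # bias-corrected second moment
    x -= learning_rate * mt / (np.sqrt(vt) + eps)

print(x)  # both coordinates end up near 0 despite very different gradient scales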

A comparison between Adam and other optimizers

With Adam, you have to set the following hyperparameters:

  • \eta — the initial learning rate;

  • \beta_1 — the decay rate associated with the running average of the gradients;

  • \beta_2 — the decay rate associated with the running average of the squared gradients;

  • \epsilon — a constant to avoid division by zero.
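
As a concrete illustration, here is how these four hyperparameters map onto the PyTorch implementation referenced later in this topic (a minimal sketch: the tiny linear model is just a placeholder, and the values shown are PyTorch's defaults):

import torch

model = torch.nn.Linear(10, 1)    # placeholder model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                      # eta, the initial learning rate
    betas=(0.9, 0.999),           # (beta1, beta2), decay rates for m_t and v_t
    eps=1e-8,                     # epsilon, avoids division by zero
)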

Hyperparameter tuning

Karpathy's tweet on setting the optimal learning rate in Adam

That's it.

On a more serious note, you mainly have to tune the initial learning rate. Epsilon is almost never tuned and doesn't have a dramatic effect on performance (it is a constant set to 1e-08 by default in the PyTorch implementation). However, it's worth mentioning that

  • The default value for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. (source)

The defaults \beta_1 = 0.9 and \beta_2 = 0.999 (again, as per the PyTorch Adam implementation) seem to yield good results in practice, and it's safe to leave them at that for most tasks (also, considering that the decay rates lie in the [0, 1) range, these defaults are pretty close to 1). Rather small values of \beta_1, e.g., \beta_1 = 0 or 0.1, are rarely used in practice, and, when \beta_2 is set to a large enough value in the context of the specific task, Adam will still converge with a small \beta_1 (more on that here). Lower \beta_1 values do appear in some original GAN implementations: for example, ProGAN sets \beta_1 = 0 and \beta_2 = 0.99, while DCGAN and Pix2Pix set \beta_1 = 0.5, keeping \beta_2 at its default of 0.999.

The initial learning rate \eta is set to 1e-3 by default in PyTorch. A learning rate scheduler (which adjusts the learning rate during training so that it follows a predefined schedule, e.g., lowering it every n epochs as a basic example) can be combined with Adam, and, in fact, scheduling can lead to a significant performance improvement.
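
As a sketch of combining Adam with a scheduler in PyTorch (the model, the dummy data, and the schedule below are illustrative assumptions), StepLR, for example, multiplies the learning rate by gamma every step_size epochs:

import torch

model = torch.nn.Linear(10, 1)                        # placeholder model
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

x, y = torch.randn(64, 10), torch.randn(64, 1)        # dummy data

for epoch in range(30):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()        # Adam update with the current learning rate
    scheduler.step()        # halve the learning rate every 10 epochs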

An alternative approach is to use the Optuna library or a similar package for automatic hyperparameter search. As an overview, one would create an objective function (which includes the model itself, the possible hyperparameter space, and the objective to optimize), create a study, and retrieve the best parameters. Optuna could be used to find the optimal β\betas or the initial learning rate or both.
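
As a rough sketch of what such a search could look like with Optuna (the search ranges below are arbitrary assumptions, and a simple analytic expression stands in for the real training-and-validation loop):

import optuna

def objective(trial):
    # Hypothetical search space for Adam's hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    beta1 = trial.suggest_float("beta1", 0.5, 0.95)
    beta2 = trial.suggest_float("beta2", 0.9, 0.9999)

    # Stand-in for "train with torch.optim.Adam(..., lr=lr, betas=(beta1, beta2))
    # and return the validation metric to minimize"
    return (lr - 1e-3) ** 2 + (beta1 - 0.9) ** 2 + (beta2 - 0.999) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)    # e.g. {'lr': ..., 'beta1': ..., 'beta2': ...}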

A note on Momentum and RMSProp

In this section, let's see step by step how exactly Adam combines RMSProp with Momentum. We will end up with a simplified version of Adam (to be specific, there will be no bias correction, but it will be pretty similar).

As a reminder, the momentum update has the following form:

\begin{align}
v_{t+1} &= \beta v_t + (1 - \beta) g(t) \tag{1} \\
\theta_{t+1} &= \theta_t - \eta v_{t+1} \tag{2}
\end{align}

Momentum notation:
  • \beta — the momentum coefficient;

  • g(t) — the gradient at step t;

  • v_t — the momentum at time step t, the running average of past gradients;

  • \eta — the initial learning rate.

For RMSProp, the update is:

\begin{align}
E[g^2]_t &= \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2 \tag{3} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t \tag{4}
\end{align}

RMSProp notation:
  • E[g^2]_t — the running average of the squared gradients;

  • \gamma — the decay factor.

Note that (1) corresponds to the first moment estimate (the decayed running average of all gradients over t iterations), or m_t in Adam, and (3) is the second moment estimate (the decayed running average of the squared gradients over t iterations), which is v_t in Adam. So, combining (1) and (3), that is, plugging the momentum term into the RMSProp step (4) in place of the raw gradient, results in the Adam update without the bias correction term:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} v_{t+1}
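
In code, this combination amounts to feeding the momentum term into the RMSProp-style step in place of the raw gradient (a minimal sketch with illustrative names and values; bias correction is deliberately left out, matching the equation above):

import numpy as np

momentum = 0.0        # (1): running average of gradients, m_t in Adam
sq_avg = 0.0          # (3): running average of squared gradients, v_t in Adam
beta, gamma, eta, eps = 0.9, 0.999, 0.01, 1e-8
theta = 5.0           # parameter of the toy objective theta**2

for _ in range(1000):
    g = 2 * theta                                        # gradient of theta**2
    momentum = beta * momentum + (1 - beta) * g          # momentum update, equation (1)
    sq_avg = gamma * sq_avg + (1 - gamma) * g**2         # RMSProp accumulator, equation (3)
    theta -= eta / (np.sqrt(sq_avg) + eps) * momentum    # RMSProp step with momentum instead of g

print(theta)  # ends up near the minimum at 0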

Conclusion

As a result, you are now familiar with how the Adam optimizer combines Momentum GD with RMSProp, its main components, and some considerations for choosing the hyperparameter values.
