
Adam


The major disadvantage of gradient descent with a fixed learning rate is its approach to parameter updates: significant adjustments are made for larger gradients, and minor ones for smaller gradients. This becomes an issue when the loss surface is substantially steeper in one direction than in another, making it difficult to select an appropriate fixed learning rate. In the previous topics, you learned about two ways to make the updates adaptive: Momentum GD, which smooths the update direction, and RMSProp, which adapts the effective learning rate for each parameter.

In this topic, we take a look at the combination of Momentum and RMSProp — the Adam optimizer.

Adam

Just like RMSProp, Adam (Adaptive Moment Estimation) updates each parameter individually and uses an exponentially decaying average of the past squared gradients (v_t); in addition, similar to Momentum GD, it keeps an exponentially decaying average of the past gradients (m_t):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

where

  • t — the time step;

  • m_t — the mean, also known as the first moment estimate: a decayed running average of all gradients over t iterations. m_t keeps track of the direction of the gradients;

  • v_t — the variance, also known as the second moment estimate: a decayed running average of the squared gradients over t iterations. v_t tracks the magnitude of the gradients and represents their volatility (v_t is higher if the gradients change severely at each step, and lower if they are more stable);

  • g_t — the gradient of the cost function at step t (g_t = \nabla_\theta J(\theta_t));

  • \beta_1 — the exponential decay rate associated with the running average of the gradients (m_t); it represents the amount of momentum used to update the weights (e.g., if \beta_1 = 0.9, the model will heavily weight the past gradients in the current update);

  • \beta_2 — the exponential decay rate associated with the running average of the squared gradients; it controls how fast the estimate of the second moment (variance) of the gradients decays (e.g., if \beta_2 = 0.999, the past squared gradients have a great influence on the effective learning rate).

Since m_t and v_t are initialized as zero vectors, they are biased towards 0, especially during the first steps. Adam therefore uses bias-corrected first and second moment estimates:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
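
To see why this correction matters, consider the very first step (t = 1), with m_0 = 0 and the common default \beta_1 = 0.9:

m_1 = \beta_1 \cdot 0 + (1 - \beta_1) g_1 = 0.1\, g_1, \qquad \hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{0.1\, g_1}{0.1} = g_1

Without the correction, the first update would use only 10% of the actual gradient; with it, the estimate matches the gradient exactly. The same reasoning applies to v_t and \beta_2.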

The Adam parameter update then becomes

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

where

  • \eta — the initial learning rate;

  • \epsilon — a small constant to avoid division by zero.

Putting it all together, the simplified Adam parameter update can be written in code as

import numpy as np

# One simplified Adam step at iteration t (t starts at 1); m and v are initialized as zero arrays
m = beta1 * m + (1 - beta1) * dx               # first moment estimate (running mean of gradients)
mt = m / (1 - beta1**t)                        # bias-corrected first moment
v = beta2 * v + (1 - beta2) * (dx**2)          # second moment estimate (running mean of squared gradients)
vt = v / (1 - beta2**t)                        # bias-corrected second moment
x -= learning_rate * mt / (np.sqrt(vt) + eps)  # parameter update
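
To see these updates in action, here is a minimal, self-contained NumPy sketch (the toy quadratic objective, the variable names, and the hyperparameter values are illustrative assumptions, not part of the snippet above):

import numpy as np

# Toy objective: 0.5 * sum(A * x**2) with an ill-conditioned diagonal A,
# so the gradient is much steeper in one coordinate than in the other
A = np.array([100.0, 1.0])
x = np.array([1.0, 1.0])                 # parameters to optimize

m = np.zeros_like(x)                     # first moment estimate
v = np.zeros_like(x)                     # second moment estimate
learning_rate, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):                 # t starts at 1 for the bias correction
    dx = A * x                           # gradient of the toy objective
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx**2
    mt = m / (1 - beta1**t)              # bias-corrected first moment
    vt = v / (1 - beta2**t)              # bias-corrected second moment
    x -= learning_rate * mt / (np.sqrt(vt) + eps)

print(x)  # both coordinates end up near 0 despite very different gradient scales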

A comparison between Adam and other optimizers

With Adam, you have to set the following hyperparameters:

  • \eta — the initial learning rate;

  • \beta_1 — the decay rate associated with the running average of the gradients;

  • \beta_2 — the decay rate associated with the running average of the squared gradients;

  • \epsilon — a constant to avoid division by zero.
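
As a concrete illustration, here is how these four hyperparameters map onto the PyTorch implementation referenced later in this topic (a minimal sketch: the tiny linear model is just a placeholder, and the values shown are PyTorch's defaults):

import torch

model = torch.nn.Linear(10, 1)    # placeholder model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                      # eta, the initial learning rate
    betas=(0.9, 0.999),           # (beta1, beta2), decay rates for m_t and v_t
    eps=1e-8,                     # epsilon, avoids division by zero
)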

Hyperparameter tuning

Karpathy's tweet on setting the optimal learning rate in Adam

That's it.

On a more serious note, you mainly have to tune the initial learning rate. Epsilon is almost never tuned and doesn't have a dramatic effect on performance (it is a constant set to 1e-08 by default in the PyTorch implementation). However, it's worth mentioning that

  • The default value for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. (source)

The defaults \beta_1 = 0.9 and \beta_2 = 0.999 (again, as per the PyTorch Adam implementation) seem to yield good results in practice, and it's safe to leave them at that for most tasks (also, considering that the decay rates lie in the [0, 1) range, these defaults are pretty close to 1). Rather small values of \beta_1, e.g., \beta_1 = 0 or 0.1, are rarely used in practice, and, when \beta_2 is set to a large enough value in the context of the specific task, Adam will still converge with a small \beta_1 (more on that here). Lower \beta_1 values do appear in some original GAN implementations: for example, ProGAN sets \beta_1 = 0 and \beta_2 = 0.99, while DCGAN and Pix2Pix set \beta_1 = 0.5, keeping \beta_2 at its default of 0.999.

The initial learning rate \eta is set to 1e-3 by default in PyTorch. A learning rate scheduler (which adjusts the learning rate during training so that it follows a predefined schedule, e.g., lowering it every n epochs as a basic example) can be combined with Adam, and, in fact, scheduling can lead to a significant performance improvement.
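
As a sketch of combining Adam with a scheduler in PyTorch (the model, the dummy data, and the schedule below are illustrative assumptions), StepLR, for example, multiplies the learning rate by gamma every step_size epochs:

import torch

model = torch.nn.Linear(10, 1)                        # placeholder model
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

x, y = torch.randn(64, 10), torch.randn(64, 1)        # dummy data

for epoch in range(30):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()        # Adam update with the current learning rate
    scheduler.step()        # halve the learning rate every 10 epochs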

An alternative approach is to use the Optuna library or a similar package for automatic hyperparameter search. As an overview, one would create an objective function (which includes the model itself, the possible hyperparameter space, and the objective to optimize), create a study, and retrieve the best parameters. Optuna could be used to find the optimal β\betas or the initial learning rate or both.
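
As a rough sketch of what such a search could look like with Optuna (the search ranges below are arbitrary assumptions, and a simple analytic expression stands in for the real training-and-validation loop):

import optuna

def objective(trial):
    # Hypothetical search space for Adam's hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    beta1 = trial.suggest_float("beta1", 0.5, 0.95)
    beta2 = trial.suggest_float("beta2", 0.9, 0.9999)

    # Stand-in for "train with torch.optim.Adam(..., lr=lr, betas=(beta1, beta2))
    # and return the validation metric to minimize"
    return (lr - 1e-3) ** 2 + (beta1 - 0.9) ** 2 + (beta2 - 0.999) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)    # e.g. {'lr': ..., 'beta1': ..., 'beta2': ...}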

A note on Momentum and RMSProp

In this section, let's see step by step how exactly Adam combines RMSProp with Momentum. We will end up with a simplified version of Adam (to be specific, there will be no bias correction, but it will be pretty similar).

As a reminder, the momentum update has the following form:

\begin{align}
v_{t+1} &= \beta v_t + (1 - \beta) g(t) \tag{1} \\
\theta_{t+1} &= \theta_t - \eta v_{t+1} \tag{2}
\end{align}

Momentum notation:
  • \beta — the momentum coefficient;

  • g(t) — the gradient at step t;

  • v_t — the momentum at time step t, the running average of past gradients;

  • \eta — the initial learning rate.

For RMSProp, the update is:

\begin{align}
E[g^2]_t &= \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2 \tag{3} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t \tag{4}
\end{align}

RMSProp notation:
  • E[g^2]_t — the running average of the squared gradients;

  • \gamma — the decay factor.

Note that (1) corresponds to the first moment estimate (the decayed running average of all gradients over t iterations), or m_t in Adam, and (3) is the second moment estimate (the decayed running average of the squared gradients over t iterations), which is v_t in Adam. So, combining (1) and (3), that is, plugging the momentum term into the RMSProp step (4) in place of the raw gradient, results in the Adam update without the bias correction term:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} v_{t+1}
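
In code, this combination amounts to feeding the momentum term into the RMSProp-style step in place of the raw gradient (a minimal sketch with illustrative names and values; bias correction is deliberately left out, matching the equation above):

import numpy as np

momentum = 0.0        # (1): running average of gradients, m_t in Adam
sq_avg = 0.0          # (3): running average of squared gradients, v_t in Adam
beta, gamma, eta, eps = 0.9, 0.999, 0.01, 1e-8
theta = 5.0           # parameter of the toy objective theta**2

for _ in range(1000):
    g = 2 * theta                                        # gradient of theta**2
    momentum = beta * momentum + (1 - beta) * g          # momentum update, equation (1)
    sq_avg = gamma * sq_avg + (1 - gamma) * g**2         # RMSProp accumulator, equation (3)
    theta -= eta / (np.sqrt(sq_avg) + eps) * momentum    # RMSProp step with momentum instead of g

print(theta)  # ends up near the minimum at 0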

Conclusion

As a result, you are now familiar with how the Adam optimizer combines Momentum GD with RMSProp, its main components, and some considerations for choosing the hyperparameter values.
