
Adagrad and RMSProp


Gradient descent is central to the process of training neural networks, but vanilla gradient descent comes with a set of caveats, such as the risk of getting stuck in local optima or slow convergence speed.

In this topic, we will consider approaches that help mitigate the challenges of standard gradient descent, specifically two of the first adaptive learning rate algorithms: Adagrad and RMSProp.

The motivation behind adaptive learning rates

At its core, gradient descent minimizes an objective function J(θ), updating the model’s parameters (weights) in the direction of the steepest descent. The learning rate η determines the step size to take in order to reach the (local) minimum.

One of the simpler versions of the parameter update (with SGD) can be represented as

# Schematic pseudocode: sample_training_data and evaluate_gradient stand in for the
# data sampling and gradient computation of a real training loop.
while True:
  data_batch = sample_training_data(data, 1)  # sample 1 example
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights -= step_size * weights_grad  # perform the parameter update

Figure: the SGD process

With gradient descent, we want to find the global optimum of the loss function as quickly as possible and avoid getting stuck in local optima. In the most straightforward scenario, you choose a single learning rate for the entire training run and hope for the best. There are several issues with this approach:

  • You set the learning rate manually and universally (i.e., all parameters are updated with the same step size). A small learning rate results in slow convergence, while a larger one can overshoot and hinder convergence, with no guarantee of reaching the global optimum. This issue was partially addressed by momentum gradient descent, Nesterov accelerated gradient (NAG), and learning rate schedulers (where the learning rate changes over time, so the learning rate at the end of training differs from the one at the beginning). However, all these approaches still perform the same update for all parameters.
  • In optimization, features can be sparse. The average gradient for these sparse features tends to be small, which results in slower updates for the parameters associated with the sparse features. If the features have vastly different frequencies, we want to perform larger updates for the parameters associated with rare features and smaller updates for the frequent ones. "Parameter importance" refers to the idea that parameters with larger gradients correspond to more important features, and vice versa for parameters with smaller gradients.

Below, we will discuss two parameter update rules that address the global parameter update issue.

Adagrad

Adagrad (Adaptive Gradient) adapts the learning rate to each parameter individually, assigning a smaller learning rate (and thus performing smaller updates) to parameters associated with frequent features and larger updates to parameters associated with infrequent features. It does this by keeping track of the sum of squared gradients for each parameter.

The Adagrad update looks like

\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{G_{t} + \epsilon}} g_{t}

where

  • θ_{t+1} — the individual parameter being updated at the current iteration;
  • t — the time step;
  • η — the initial learning rate;
  • G_t — the accumulated sum of squared gradients over t iterations;
  • g_t — the gradient of the cost function at step t (g_t = ∇_θ J(θ_t));
  • ε — a small constant to avoid division by zero.

Or, simplified,

import numpy as np  # needed for np.sqrt

grad_squared = 0
while True:
    dx = compute_gradient(x)     # gradient of the loss w.r.t. the parameters x
    grad_squared += dx * dx      # the accumulator only ever grows (element-wise for vectors)
    x -= learning_rate * dx / np.sqrt(grad_squared + epsilon)

Figure: a comparison between SGD and Adagrad

With Adagrad, you set the following hyperparameters (see the sketch right after this list):

  • The initial learning rate (most implementations set it to 0.01 by default);
  • Epsilon (a small value added to the denominator to avoid division by zero).
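
Purely as an illustration (not part of the original topic), here is how these two hyperparameters might be passed to an off-the-shelf implementation, assuming PyTorch and a throwaway linear model:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # a placeholder model, used only for this example
optimizer = torch.optim.Adagrad(
    model.parameters(),
    lr=0.01,    # the initial learning rate
    eps=1e-10,  # epsilon, added to the denominator to avoid division by zero
)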

Adagrad accumulates the squared gradients in the denominator. This sum keeps growing during training, which causes the effective learning rate to diminish to the point where the learning process (the search for the minimum) effectively grinds to a halt.
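
A small, self-contained sketch of this effect (the toy quadratic loss and all the numbers below are made up for illustration): two parameters with gradients of very different magnitudes each get their own effective step size η/√(G_t + ε), and both step sizes shrink as the accumulators grow.

import numpy as np

# Toy loss J(x) = 5*x[0]**2 + 0.05*x[1]**2: one steep direction and one shallow one.
x = np.array([5.0, 5.0])
learning_rate, epsilon = 1.0, 1e-8
grad_squared = np.zeros(2)

for t in range(1, 101):
    dx = np.array([10.0, 0.1]) * x          # per-parameter gradients of the toy loss
    grad_squared += dx * dx                 # the accumulators only ever grow
    effective_lr = learning_rate / np.sqrt(grad_squared + epsilon)
    x -= effective_lr * dx
    if t in (1, 10, 100):
        print(f"step {t:3d}: effective learning rates = {np.round(effective_lr, 4)}")

The steep direction ends up with a much smaller effective learning rate than the shallow one, and both keep shrinking over time, which is exactly the behavior described above.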

RMSProp

Adagrad addresses the challenge of individual parameter updates by accumulating squared gradients in the denominator of the update. However, there is a drawback: the learning rate diminishes over time, which slows down the optimization process. RMSProp (Root Mean Square Propagation) addresses this by adding a decay factor. The decay factor introduces a kind of time awareness: instead of taking the entire squared gradient history into account, the update focuses on a window of recent squared gradients (a moving average), prioritizing newer gradients and gradually forgetting older ones.

First, the moving average of the squared gradients is computed for each weight:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2

where

  • E[g²]_t — the moving average of the squared gradients;
  • γ — the decay factor, discussed in more detail below.

After that, the parameter update is performed (the Adagrad notation is preserved):

\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t}

Or, again, simplified:

mean_squared = 0
while True:
    dx = compute_gradient(x)  # gradient of the loss w.r.t. the parameters x
    mean_squared = decay_rate * mean_squared + (1 - decay_rate) * dx * dx  # moving average
    x -= learning_rate * dx / np.sqrt(mean_squared + epsilon)

Figure: a comparison between SGD and RMSProp

The decay rate lies in the [0, 1] range, and typical values are 0.9, 0.99, and 0.999. Values close to 0 make RMSProp focus only on the most recent gradients, which can introduce instability, while values closer to 1 retain a longer gradient history. Hinton recommends setting the decay rate to 0.9.
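
To get a feel for how the decay rate controls the length of that history, note that a squared gradient from k steps ago enters the moving average with a relative weight of γ^k compared to the newest one. A quick back-of-the-envelope calculation (purely for intuition, not from the original text):

import math

# Find the number of steps after which an old gradient's relative weight (gamma**k)
# drops below 1% of the newest gradient's weight.
for gamma in (0.9, 0.99, 0.999):
    k = math.ceil(math.log(0.01) / math.log(gamma))
    print(f"gamma = {gamma}: gradients older than ~{k} steps contribute less than 1%")

With γ = 0.9 the average effectively remembers only the last few dozen gradients, while γ = 0.999 keeps several thousand steps in play.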

Due to the decay rate, the squared gradients are kept in check, which makes RMSProp converge faster than Adagrad.

RMSProp requires setting the following hyperparameters (illustrated right after this list):

  • The initial learning rate (again, as per Hinton, the recommended value is 0.001),
  • The decay rate (described above),
  • Epsilon (same as for Adagrad).
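
As with Adagrad above, here is a purely illustrative mapping of these hyperparameters onto an off-the-shelf implementation (assuming PyTorch, where the decay rate is called alpha; the model is again just a placeholder):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # a placeholder model, used only for this example
optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.001,   # the initial learning rate (Hinton's recommended value)
    alpha=0.9,  # the decay rate of the moving average of squared gradients
    eps=1e-8,   # epsilon, added to the denominator to avoid division by zero
)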

In practice, a variation of RMSProp called Adam is often used; it combines the decay rate with momentum, which typically results in performance superior to both RMSProp and Adagrad.
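
For a rough idea of what that combination looks like, here is a simplified sketch only, run on a made-up one-dimensional quadratic; the beta values below are common textbook defaults rather than anything from this topic:

import numpy as np

# Adam in miniature: a momentum-style moving average of gradients (m) combined with an
# RMSProp-style moving average of squared gradients (v), both corrected for their zero init.
x = 5.0
learning_rate, beta1, beta2, epsilon = 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0

for t in range(1, 201):
    dx = 2 * x                              # gradient of the toy loss J(x) = x**2
    m = beta1 * m + (1 - beta1) * dx        # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dx * dx   # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    x -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

print(f"x after 200 Adam steps: {x:.4f}")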

Conclusion

As a result, you are now familiar with:

  • The motivation behind adaptive learning rates;
  • Adagrad, which was one of the first adaptive learning rate techniques. It adapts the learning rate for each parameter individually but tends to suffer from the accumulation of squared gradients in the denominator of the learning rate update;
  • RMSProp, which introduces the decay rate to Adagrad, making it consider the recent gradient history instead of the entire history, leading to a faster convergence speed.