
Gradient descent with momentum


Vanilla gradient descent has certain limitations. Because the gradient is very small in regions with gentle slopes, progress there is slow, which prolongs the optimization process and can keep the algorithm from reaching a good solution.

In this topic, we will take a look at gradient descent with momentum, which addresses these issues by accelerating convergence, escaping local minima, and reducing oscillations.

A reminder on gradient descent

Gradient descent is an iterative optimization technique for finding a local minimum of a differentiable function. Its goal is to identify parameter values that minimize the loss function $J(\theta)$.

In essence, the gradient of a function points in the direction of steepest ascent. Negating the gradient therefore gives the direction of steepest descent, so gradient descent takes iterative steps in the direction opposite to the gradient.

The gradient and the resulting update rule are computed as

\begin{align*} \nabla_{\theta}J(\theta) &= \begin{bmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{bmatrix} \\ \theta_{t+1} &= \theta_{t} - \epsilon \nabla_{\theta_t}J(\theta_t) \end{align*}

Here's a breakdown of the notation:

  • $\theta_t$ — the parameter vector of the cost function $J(\theta)$ at iteration $t$.
  • $\nabla_{\theta_t}$ — the gradient operator with respect to the parameters $\theta$ at the current iteration $t$.
  • $\epsilon$ — the learning rate, which determines the size of each step.
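
As a quick illustration, here is a minimal NumPy sketch of this update rule. The quadratic test function, starting point, and hyperparameter values are arbitrary choices for the example:

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, n_steps=100):
    """Vanilla gradient descent: repeatedly step against the gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - lr * grad(theta)  # theta_{t+1} = theta_t - eps * grad J(theta_t)
    return theta

# Example: minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
print(gradient_descent(lambda th: 2 * (th - 3), theta0=[0.0]))  # approaches [3.]
```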

The algorithm

In the equation above, we saw the gradient update rule. Gradient descent with momentum adds a momentum term to this standard update rule. The momentum helps smooth out the updates and accelerates convergence, especially when the cost function has many small oscillations or noisy gradients.

Here's the update rule for gradient descent with momentum:

\begin{align} v_{t+1} &= \beta v_{t} + (1 - \beta) \nabla_{\theta_t}J(\theta_t) \\ \theta_{t+1} &= \theta_{t} - \epsilon v_{t+1} \end{align}

  • $v_t$ — the velocity, or momentum, at time step $t$. It is a moving average of past gradients.
  • $\beta$ — the momentum coefficient, a hyperparameter between 0 and 1. It controls how much of the previous velocity is retained versus how much of the current gradient is incorporated. Typical values are 0.9 or 0.99; lower values of $\beta$ result in more oscillations in the optimization process.
  • $\nabla_{\theta_t}J(\theta_t)$ — the gradient of the cost function with respect to the parameters at the current values $\theta_t$, as explained earlier.
  • $\epsilon$ — the learning rate, as in standard gradient descent.
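
Translated directly into code, these two equations look as follows; a minimal sketch reusing the same toy quadratic, with arbitrary hyperparameter values:

```python
import numpy as np

def momentum_gd(grad, theta0, lr=0.1, beta=0.9, n_steps=100):
    """Gradient descent with momentum (moving-average form of the velocity)."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)  # v_0 = 0
    for _ in range(n_steps):
        v = beta * v + (1 - beta) * grad(theta)  # v_{t+1} = beta * v_t + (1 - beta) * grad
        theta = theta - lr * v                   # theta_{t+1} = theta_t - eps * v_{t+1}
    return theta

# Same toy problem as before: J(theta) = (theta - 3)^2.
print(momentum_gd(lambda th: 2 * (th - 3), theta0=[0.0]))  # approaches [3.]
```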

Update rule in detail

Simplifying equations 1 and 2 by folding the constant factor into a single coefficient $\eta = \epsilon (1 - \beta)$, we can write the update as

\begin{align} \text{update}_{t+1} &= \beta \cdot \text{update}_{t} + \eta \nabla_{\theta_t}J(\theta_t) \\ \theta_{t+1} &= \theta_{t} - \text{update}_{t+1} \end{align}

starting with $\text{update}_{0} = 0$. At time step 0, the update is initialized to zero because we haven't started the optimization process yet.

Moving to time step 1, the update starts to take shape: it is simply a scaled copy of the current gradient. Note that the minus sign indicating the opposite direction is not part of the update itself; it is applied in the parameter update (equation 4).

\begin{align*} \text{update}_{1} &= \beta \cdot \text{update}_{0} + \eta \cdot \nabla_{\theta_0}J(\theta_0) \\ &= \eta \cdot \nabla_{\theta_0}J(\theta_0) \end{align*}

At time step 2, the update is a weighted combination of two components: $\beta$ times the update from the previous step, plus the scaled gradient at the current step. In other words, we move in the current gradient direction plus a fraction of the direction taken in the previous step.

\begin{align*} \text{update}_{2} &= \beta \cdot \text{update}_{1} + \eta \cdot \nabla_{\theta_1}J(\theta_1) \\ &= \beta \cdot \eta \cdot \nabla_{\theta_0}J(\theta_0) + \eta \cdot \nabla_{\theta_1}J(\theta_1) \end{align*}

Proceeding similarly to time step 3,

\begin{align*} \text{update}_{3} &= \beta \cdot \text{update}_{2} + \eta \cdot \nabla_{\theta_2}J(\theta_2) \\ &= \beta \cdot (\beta \cdot \eta \cdot \nabla_{\theta_0}J(\theta_0) + \eta \cdot \nabla_{\theta_1}J(\theta_1)) + \eta \cdot \nabla_{\theta_2}J(\theta_2) \\ &= \beta^2 \cdot \eta \cdot \nabla_{\theta_0}J(\theta_0) + \beta \cdot \eta \cdot \nabla_{\theta_1}J(\theta_1) + \eta \cdot \nabla_{\theta_2}J(\theta_2) \end{align*}

This breakdown clarifies how the update rule evolves over time: each update is a weighted sum of all past gradients, and the weights decay exponentially, so the gradient from $k$ steps ago is scaled by $\beta^k$. It also shows that the negative sign indicating the direction opposite to the gradient enters only through the parameter update.
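
We can sanity-check this exponential weighting numerically. In the sketch below, the gradient values are made up for illustration; it confirms that unrolling the recurrence gives the same result as the closed-form weighted sum:

```python
import math

# Hypothetical gradients observed at time steps 0, 1, 2 (arbitrary numbers).
grads = [0.8, 0.5, 0.3]
beta, eta = 0.9, 0.1

# Recurrence: update_{t+1} = beta * update_t + eta * g_t, starting from update_0 = 0.
update = 0.0
for g in grads:
    update = beta * update + eta * g

# Closed form for update_3: the gradient from k steps ago is weighted by beta^k.
closed_form = sum(eta * beta ** (len(grads) - 1 - i) * g for i, g in enumerate(grads))

print(math.isclose(update, closed_form))  # True
```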

Visualizing the update rule

Let's visualize how the updates take place under gradient descent with momentum and how the cost function value changes. To begin with, we take this function:

\begin{align*} f(x) = 2x^4 - 3.5x^3 - 5.5x^2 + 2.5x \end{align*}

The function plot

Now, we perform the iterations of momentum gradient descent on this error surface.

A plot of cost function and velocity changes over the iterations

The left plot shows how the cost function value changes over iterations. The goal is to minimize this cost function, and each iteration moves us closer to that minimum. As iterations progress, the cost decreases, indicating the optimization process is working. The right plot illustrates how the velocity (speed and direction of parameter updates) changes with each iteration. It represents how quickly the algorithm is converging.
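
To reproduce plots like these, one can run the update rule on $f(x)$ and record the cost and velocity at every step. Here is a minimal sketch; the starting point and hyperparameters are arbitrary, and the recorded lists can be handed to any plotting library:

```python
def f(x):
    return 2 * x**4 - 3.5 * x**3 - 5.5 * x**2 + 2.5 * x

def df(x):  # analytic derivative of f
    return 8 * x**3 - 10.5 * x**2 - 11 * x + 2.5

x, v = -0.5, 0.0          # arbitrary starting point, zero initial velocity
beta, eps = 0.9, 0.01
costs, velocities = [], []

for _ in range(50):
    v = beta * v + (1 - beta) * df(x)
    x = x - eps * v
    costs.append(f(x))        # left plot: cost per iteration
    velocities.append(v)      # right plot: velocity per iteration

print(f"x = {x:.4f}, f(x) = {f(x):.4f}")
```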

The behavior of GD with and without momentum

Iterations of regular gradient descent

Here we illustrate a situation where gradient descent gets stuck at a local minimum during optimization. As the algorithm approaches the local minimum, the gradient becomes small, and the optimization process halts prematurely, preventing the algorithm from reaching the global minimum.

Iterations of gradient descent with momentum

In contrast, this illustration shows the effectiveness of gradient descent with momentum. With the addition of momentum, the algorithm gains the ability to escape local minima. As it approaches the local minimum, the momentum allows the algorithm to continue its progress, eventually reaching the global minimum.
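
This contrast can be reproduced in a few lines. In the sketch below, the starting point and hyperparameters are arbitrary values that happen to demonstrate the effect on the function from the previous section: plain gradient descent settles in the local minimum, while the momentum version carries past it.

```python
def df(x):  # derivative of f(x) = 2x^4 - 3.5x^3 - 5.5x^2 + 2.5x
    return 8 * x**3 - 10.5 * x**2 - 11 * x + 2.5

def run(use_momentum, x0=-2.0, lr=0.01, beta=0.9, n_steps=500):
    x, v = x0, 0.0
    for _ in range(n_steps):
        v = beta * v + (1 - beta) * df(x) if use_momentum else df(x)
        x -= lr * v
    return x

print(f"vanilla GD:  x = {run(False):.3f}")  # stalls near the local minimum (~ -0.8)
print(f"momentum GD: x = {run(True):.3f}")   # carries on toward the global minimum (~ 1.9)
```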

Benefits of GD with momentum

  1. One of the primary advantages of GD with momentum is its ability to accelerate convergence. By incorporating momentum, GD accumulates information from past gradients, allowing it to maintain and build on its velocity. This enables the algorithm to make more significant updates, particularly when navigating through regions with small but consistent gradients. As a result, convergence to the optimal solution occurs more rapidly.
  2. Vanilla GD can exhibit oscillations or zigzagging behavior, especially when the cost function has noisy gradients or small irregularities. Gradient descent with momentum mitigates these oscillations by smoothing out the updates. The momentum term effectively acts as a moving average of past gradients, which helps to stabilize the optimization process and reduces unnecessary back-and-forth movement.

    A comparison between momentum and regular gradient descent

  3. One of the most significant advantages of GD with momentum is its ability to escape local minima in the optimization landscape. In traditional GD, the algorithm can get stuck in local minima due to small gradients. With momentum, the algorithm gains enough velocity to overcome these barriers and explore regions of the cost function that may contain a better, possibly global, minimum.
  4. Momentum GD is less sensitive to the choice of the initial learning rate. Unlike regular GD, it adds a fraction of the previous update to the current one. This does not eliminate the need to tune the learning rate, but it does reduce the impact of the initial choice.

Some limitations

We discussed earlier that momentum gradient descent converges faster; however, this does not hold when the error surface is complex. Momentum GD is designed primarily for smooth, differentiable loss functions. On non-smooth or discontinuous surfaces, and on highly non-convex ones, it can still get stuck in poor local minima or start to oscillate. While momentum helps overcome some local minima, especially shallow ones, it does not guarantee escape from deeper or narrower ones.

Another limitation is the difficulty of setting the right step size. There is a trade-off between the step size (i.e., the learning rate) and the convergence rate in momentum gradient descent: lowering the step size can reduce the impact of stochastic noise but may slow down convergence, while increasing the momentum can lead to compounding errors, potentially hindering convergence.

So, what's the solution? One can leverage adaptive learning rate methods like RMSprop or Adam to adjust the learning rate for each parameter based on historical gradient information. These methods can automatically reduce the learning rate for parameters that have received large updates, helping to reduce the risk of overshooting the optimal solution while still making quick progress.

As a side note, Adam and RMSprop incorporate momentum to improve efficiency (the classical RMSprop formulation does not include momentum; however, various implementations, such as torch.optim.RMSprop(), offer an optional momentum term).
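
In PyTorch, for example, these options are a single argument away. A minimal sketch (the model and learning rates are placeholder choices):

```python
import torch

model = torch.nn.Linear(10, 1)  # any model's parameters work here

# Plain SGD with PyTorch's built-in momentum term.
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# RMSprop with the optional momentum term mentioned above.
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001, momentum=0.9)

# Adam combines momentum-like and RMSprop-like behavior through its two betas.
adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
```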

Conclusion

In summary, gradient descent with momentum offers an efficient way to optimize functions, especially in deep learning with non-convex loss landscapes. By utilizing momentum, the method accelerates convergence, reduces the risk of getting trapped in local minima, and dampens oscillations. Note that even more advanced optimization algorithms, such as Adam, further enhance convergence and stability.
