MathComputational mathOptimization

Pitfalls of gradient descent

Provided by: Edvancium

13 minutes read

The gradient descent method can be a very useful tool to optimize multivariable functions. However, it has its issues, too. In this topic, you will touch on potential difficulties that might arise when working with this method, as well as ways to overcome them.

Stationary points

If we were to draw a straight line between any two points of a strictly convex function, this line would always be above the function for that interval. For example:

Straight line between any two points of a strictly convex function

$f(x)=x^2$ and $f(x, y) = x ^2 -xy + y^2$ , examples of strictly convex functions.

A stationary point is the one where all partial derivatives of a function are equal to zero. When the function is not strictly convex, the stationary point could be a local minimum or a saddle point.

Examples of functions with saddle points

$f(x)=x^3$ and $f(x, y) = x ^2 -xy - y^2$ , examples of functions with saddle points.

A sequence generated by the gradient descent method converges on a stationary point for the objective function. It follows that the gradient method works best for strictly convex functions, as there is usually a single well-defined stationary point: its global minimum.

The method can become very inefficient — and even fail — for functions with regions that plateau. The gradient will be equal to the zero vector at these points, and this would mean that the sequence will get stuck for all subsequent iterations.

The gradient descent method depends on the starting point you choose. By starting the sequence at a different point, you are often able to avoid regions with stationary points and the issues that come with those. Furthermore, a good staring point can even improve convergence!

For example, let's consider the following function: $f(x,y) = \frac{7xy}{e^{x^2 + y^2}}$ $\bold{x_0} = (-0.5, -0.5)$ will serve as a starting point. Then, by applying the gradient descent method, you will get this:

Gradient descent method (Figure 1)

You can see in the animation how the method arrives at the wrong stationary point.

If instead you choose $\bold{x_0} = (-0.6, -0.5)$ as a starting point and apply the method again, you will see the following picture:

Gradient descent method (Figure 2)

This time, the method arrives at the right stationary point: the minimum!

Step size

As you've seen previously, the gradient descent method is also heavily dependent on the chosen step size. Sometimes an ill-chosen constant step size for a particular function can get the sequence stuck in a back-and-forth situation. In the example below, you can see how fairly inefficient zig-zag patterns occur as the direction of the gradient oscillates unchecked.

Zig-zag patterns

In addition, instead of getting stuck between two opposite points, you can see that the sequence grows farther and farther away from the minimum for a given step size. This means even though the gradient vector points in the direction of the steepest descent, following it doesn't guarantee descent!

The sequence grows farther away from the minimum for a given step size

Like any other vector, the search vector $- \gamma \nabla f(\bold{x}_k)$ has both direction and magnitude.

$- \gamma \nabla f(\bold{x}_k) = \underbrace{- \nabla f(\bold{x}_k)}_\text{direction} \cdot \underbrace{\gamma}_\text{magnitude}$

Then, the apparently contradictory statement from above makes sense when you consider that the negative gradient vector points to the direction where the minimizer is, but it doesn't tell us how far away it is. By evaluating the gradient at each step, the vector is continuously adjusted to point in the correct direction. Then, you need a way to adjust its size to guarantee it is correctness as well.

Formally, the descent method should follow two conditions to avoid overshoots and slow convergence. These are the Wolfe conditions:

$\begin{matrix} 1. & f(\bold{x}_k - \gamma_k \cdot \nabla f(\bold{x}_k)) \leq f(\bold{x}_k) - c_1 \cdot \gamma_k \cdot D_{\nabla f(\bold{x}_k)}f(\bold{x}_k) \\ \\ 2. & | D_{\nabla f(\bold{x}_k)}f(\bold{x}_k - \gamma_k \cdot \nabla f(\bold{x}_k)) | ≤ c_2 |D_{\nabla f(\bold{x}_k)}f(\bold{x}_k)| \end{matrix}$ With $0 < c1 < c2 < 1$ .

The first condition is known as the Armijo rule. It breaks down as follows:

On the left-hand side of the inequality, you have the value that the objective function will take for the next step $k+1$ , expressed in terms of the current step $k$ :

$\bold{x_{k+1}} = \bold{x}_k - \gamma_k \cdot \nabla f(\bold{x}_k) \implies f(\bold{x_{k+1}}) = f(\bold{x}_k - \gamma_k \cdot \nabla f(\bold{x}_k))$ On the right-hand side, you subtract a term proportional to the amount of change expected after a step in the direction of the steepest descent, from the value of the objective function at the current step:

$f(\bold{x}_k) - c_1 \cdot \gamma_k \cdot D_{\nabla f(\bold{x}_k)}f(\bold{x}_k) \\ \ \\ \begin{align*} c_1 : & \ \text{constant} \\ \gamma_k : & \ \text{step size for the k-th iteration} \\ f(\bold{x}_k) : & \ \text{current value of the objective function for the k-th iteration} \\ D_{\nabla f(\bold{x}_k)}f(\bold{x}_k) :& \ \text{directional derivative of the objective function along its gradient for the k-th iteration} \end{align*}$ Recall that the directional derivative gives us information of the rate at which the objective function would change if you were to follow the direction given by some vector.

By keeping $0 < c1 < 1$ , the right-hand side constrains the left-hand side to a maximum value that should only be a small proportion of the expected change. The Armijo rule guarantees that the next step will result in a significant decrease with respect to the current step. This secures that the descent method actually descends.

The second condition is known as the curvature condition. It constrains the expected change of the next iteration on the right-hand side of the inequality, to a proportion of the expected change for the current iteration on the left-hand side. It guarantees that each next step will be closer to the minimizer, as the expected rate of change decreases as you approach the minimum or a similar stationary point.

As an example, let's consider the following objective function:

$f(x,y) = x^2 + y^2 − 2y$ with the following values:

$\begin{align*} c_1&=0.1 \\ c_2 &= 0.9 \\ \bold{x}_k &= (-2, 1) \end{align*}$ for both constants from the Wolfe conditions, as well as the current step value.

You can compute the gradient at this point like this:

$\nabla f(x,y)= \begin{bmatrix} 2x \\ 2y-2 \end{bmatrix} \implies \nabla f(x,y) |_\bold{x_0}= \begin{bmatrix} 2 \ (-2) \\ 2 \ (1)-2 \end{bmatrix} = \begin{bmatrix} -4 \\ 0 \end{bmatrix}$ Then, you can compute the directional derivative at the current step: $D_\bold{v}f(\bold{x}) = \nabla f(\bold{x}) \cdot\bold{v} \implies D_{\nabla f(\bold{x}_k)}f(\bold{x}_k) = \nabla f(\bold{x}_k)\cdot \nabla f(\bold{x}_k)=\begin{bmatrix} -4 \quad 0 \end{bmatrix} \cdot \begin{bmatrix} -4 \\ 0 \end{bmatrix} = 16$

Now, you can substitute these values for both Wolfe conditions, starting with the Armijo rule:

$\begin{align*} 1. \qquad \qquad f ( -2 + 4 \cdot \gamma_k, \ 1 + 0 \cdot \gamma_k) &\leq f(-2,1) - (0.1) \cdot \gamma_k \cdot (16) \\ f(-2+4\gamma_k,\ 1) &\leq 3 - 1.6 \gamma_k \\ (-2+4\gamma_k)^2 + (1)^2 − (2)(1) &\leq 3-1.6\gamma_k \\ 4 + 2\cdot(-2)(4\gamma_k) + 16 \gamma_k^2 - 1 &\leq 3-1.6\gamma_k \\ 3 - 16\gamma_k + 16 \gamma_k^2 &\leq 3 -1.6 \gamma_k \\ \gamma_k^2 - 0.9\gamma_k &\leq 0 \implies \gamma_k \cdot (\gamma_k - 0.9) \leq 0 \implies 0 \leq \gamma_k \leq 0.9 \end{align*}$ and following with the curvature condition:

$\begin{align*} 2. \quad \quad | \nabla f( -2 + 4 \cdot \gamma_k, \ 1 + 0 \cdot \gamma_k) \cdot \nabla f(\bold{x}_k) | & \leq (0.9) \cdot (16) \\ \left| \begin{bmatrix} (2) \ (-2+4\gamma_k) \quad 2 \ (1)-2 \end{bmatrix} \cdot \begin{bmatrix} -4 \\ 0 \end{bmatrix} \right| &\leq 14.4 \\ |(-4)(-4+8\gamma_k) + (0)(0)| &\leq 14.4 \\ |1-2\gamma_k| &\leq 0.9 \\ \sqrt{(1-2\gamma_k)^2} &\leq 0.9 \\ (1-2\gamma_k)^2 &\leq 0.81 \implies 0.05 \leq \gamma_k \leq 0.95 \end{align*}$ The intersection of both intervals gives us the range of step values, for which significant descent is guaranteed, without any overshoots:

$0.05 \leq \gamma_k \leq0.9$ You can narrow this interval by choosing different values of $c_1$ and $c_2$ . Typically, $c_1 \approx 10^{-4}$ and $c_2 \approx 10^{-1}$ .

Choosing the best step size is an optimization problem on its own. It is your job to make sure that the step size optimization algorithm you pick fulfills both Wolfe conditions!

Laying the groundwork

As you've seen, the gradient descent method requires a step size optimization at every iteration. Usually, this represents an iterative process by itself, thus the method can become computationally costly and often slow to converge. However, it provides a solid foundation, from which more refined methods that can provide much faster convergence stem, like the conjugate gradient, or the Nesterov Accelerated Gradient descent algorithms.

Conclusion

The gradient descent method is very sensitive to the choice of step size. A poor choice might lead to a slow convergence or no convergence at all.
The gradient descent method depends heavily on the shape of the function. It works best on strictly convex functions, but it can also work on quasi-convex functions, as long as you avoid saddle points.
By tweaking both the starting point and the step size optimization algorithm, you can achieve reasonably fast convergence.
The step size optimization algorithm should fulfill both Wolfe conditions to avoid divergence and slow convergence.
The method is the foundation for more advanced optimization algorithms.

How did you like the theory?

Report a typo

Pitfalls of gradient descent

Stationary points

Step size

Laying the groundwork

Conclusion

Related topics