
Gradient Descent


We already know how computers can use descent methods to optimize single-variable objective functions. However, in higher-dimensional domains like those used in machine learning models, we need to define a search direction that takes into account change in more than one coordinate. In this topic, we will explore how the gradient operator can help us find this direction, as well as some advantages and disadvantages of the method.

The direction of steepest change

Let's say a friend of ours is standing at some point on a mountain where the ground level is described by a function z=h(x,y), but our friend is unsure about the path taken to get there. So, our friend decides to walk down to the base of the mountain. However, our buddy is also quite tired at this point, so naturally, we'd like each step to make good progress down the mountain. In which direction should the next step be taken?

A friend in the mountains

In order to answer this question, let's recall the gradient operator.

As we learned previously, the gradient operator gives us a vector-valued function composed of the function's partial derivatives with respect to each coordinate:

\nabla f(\bold{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\bold{x}) \\ \\ \frac{\partial f}{\partial x_2}(\bold{x}) \\ \\ \vdots \\ \\ \frac{\partial f}{\partial x_n}(\bold{x}) \end{bmatrix} ; \qquad f: \mathbb{R}^n \to \mathbb{R}, \quad \bold{x} \in \mathbb{R}^n

It turns out that if we multiply the gradient of a function evaluated at a given point \bold{x}_0 by another vector \bold{v} \in \mathbb{R}^n, we get a measure of the rate at which the function would change if we were to follow the direction given by that vector:

D_\bold{v}f(\bold{x})|_{\bold{x}=\bold{x}_0} = \nabla f(\bold{x})|_{\bold{x}=\bold{x}_0} \cdot \bold{v}

This is called a directional derivative, and it can be denoted as either D_\bold{v} f or \nabla_\bold{v} f. We can think of it as a weighted sum of the contributions of change from each coordinate.

For example, let's say that we have a gradient \nabla g(x,y,z) such that, when evaluated at a point \bold{x}_0 = (x_0, y_0, z_0), we have:

\nabla g(\bold{x})|_{\bold{x}=\bold{x}_0} = \begin{bmatrix} 3 \\ 52 \\ 0 \end{bmatrix}

Then, for a given vector \bold{v} \in \mathbb{R}^3,

D_\bold{v}g(\bold{x})|_{\bold{x}=\bold{x}_0} = \begin{bmatrix} 3 \\ 52 \\ 0 \end{bmatrix} \cdot \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} = (3)\cdot v_1 + (52) \cdot v_2 + (0) \cdot v_3

We can see that at the point \bold{x}_0, the second coordinate contributes the largest amount of change, the first coordinate contributes a smaller amount, and the third coordinate contributes no change at all.
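To make this concrete, here is a quick numerical sketch of the dot product above. The function name `directional_derivative` and the use of NumPy are our own choices, not part of any standard API:

```python
import numpy as np

# Gradient of g evaluated at the point x0 (values from the example above)
grad_g = np.array([3.0, 52.0, 0.0])

def directional_derivative(grad, v):
    """Directional derivative: the dot product of the gradient and v."""
    return grad @ v

# Each basis direction recovers that coordinate's contribution of change
print(directional_derivative(grad_g, np.array([1.0, 0.0, 0.0])))  # 3.0
print(directional_derivative(grad_g, np.array([0.0, 1.0, 0.0])))  # 52.0
print(directional_derivative(grad_g, np.array([0.0, 0.0, 1.0])))  # 0.0
```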

From the dot product definition, we can say the following:

D_\bold{v}f(\bold{x}) = || \nabla f(\bold{x})|| \cdot || \bold{v}|| \cdot \cos(\theta)

with \theta being the angle between the gradient evaluated at a given point and the vector \bold{v}. Assuming \bold{v} is a unit vector, the maximum possible positive value of the directional derivative is achieved when this angle equals 0°. This tells us that the direction of maximum increase for a function at a given point is the direction of the gradient evaluated at that same point. In other words, the gradient points in the direction of steepest change at a given point. Not only that, but the maximum possible negative value of the directional derivative occurs when \theta = 180°, which tells us that the direction of maximum decrease is given by the vector pointing opposite to the gradient of the function at that point.
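We can sketch this relationship numerically: for a unit vector at angle \theta from the gradient, the directional derivative is just the gradient's norm times \cos(\theta). The helper name below is our own:

```python
import numpy as np

grad = np.array([3.0, 52.0, 0.0])  # gradient from the example above

def dir_deriv_at_angle(theta):
    """D_v f for a UNIT vector v at angle theta from the gradient."""
    return np.linalg.norm(grad) * np.cos(theta)

print(dir_deriv_at_angle(0.0))      # theta = 0: maximum increase, +||grad||
print(dir_deriv_at_angle(np.pi))    # theta = 180 deg: maximum decrease, -||grad||
```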

So, in which direction should our friend take the next step to go down the mountain? In the direction pointed to by -\nabla h(x,y) at the current coordinates, of course!

A friend comes down from the mountain

Method of steepest descent

Let's recall the rule that defines a descent method:

\bold{x}_{k+1} = \bold{x}_k + \gamma \cdot \bold{p}_{k}

where k is the current iteration, \gamma is the step size, and \bold{p}_k is the search direction. The method of steepest descent incorporates the notion of the direction of steepest change into the descent rule by choosing the search direction \bold{p}_k = -\nabla f(\bold{x}_k).

Then, we have

\bold{x}_{k+1} = \bold{x}_k - \gamma \cdot \nabla f(\bold{x}_k)

This way, we can ensure we are going in the direction of the fastest descent at each iteration.
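The update rule above can be sketched as a short loop. This is a minimal sketch; the function name and signature are our own, not a standard API:

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, n_iters):
    """Iterate x_{k+1} = x_k - gamma * grad_f(x_k) and return the final point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - gamma * grad_f(x)
    return x

# Example: minimize f(x) = x^2, whose gradient is 2x
x_min = gradient_descent(lambda x: 2 * x, np.array([4.0]), 0.1, 100)
print(x_min)  # approaches the minimizer, 0
```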

As an example, let's consider the following function to minimize:

z = f(x,y) = x^2 + y^2 - 2y

From its plot, we can see there's only one distinct minimum value:

There is only one distinct minimum value

Let's say we want to start our iteration process at some point on the surface described by our function, like (-2, 1, f(\bold{x}_0)) = (-2, 1, 3) for the arbitrary starting point \bold{x}_0 = (-2, 1).

A point on the surface described by the function

Then, we can compute the gradient at this point, as well as its opposite vector:

\nabla f(x,y) = \begin{bmatrix} 2x \\ 2y-2 \end{bmatrix} \implies \nabla f(x,y)|_{\bold{x}_0} = \begin{bmatrix} 2(-2) \\ 2(1)-2 \end{bmatrix} = \begin{bmatrix} -4 \\ 0 \end{bmatrix} \implies -\nabla f(x,y)|_{\bold{x}_0} = \begin{bmatrix} 4 \\ 0 \end{bmatrix}

We can plot this vector by restricting z = 3, which results in a 2D plot like so:

Vector on the plane

Now, we know which direction to take for the next iteration. For the step size, let's consider a constant \gamma = 0.1. Then,

\bold{x}_1 = \bold{x}_0 - \gamma \cdot \nabla f(\bold{x}_0) = (-2, 1) - (0.1) \cdot (-4, 0) = (-1.6, 1)

which means the new point on the surface is

(\bold{x}_1, f(\bold{x}_1)) = (-1.6, 1, 1.56)
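We can reproduce this first step numerically. This is a sketch using NumPy; the function names `f` and `grad_f` are our own:

```python
import numpy as np

def f(x):
    # f(x, y) = x^2 + y^2 - 2y
    return x[0] ** 2 + x[1] ** 2 - 2 * x[1]

def grad_f(x):
    # Gradient computed above: (2x, 2y - 2)
    return np.array([2 * x[0], 2 * x[1] - 2])

x0 = np.array([-2.0, 1.0])
gamma = 0.1

x1 = x0 - gamma * grad_f(x0)
print(x1, f(x1))  # x1 = (-1.6, 1), with f(x1) = 1.56: we have descended from 3
```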

Two points on the surface described by the function

We have effectively descended!

By repeating this process, we can approach the minimizer closer and closer each time:

Method of steepest descent

To get an idea of how close we can get: \bold{x}_k will be within a distance of 0.088 units from the true minimizer \bold{x}^* at iteration k = 15.
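A quick sketch to check this claim, iterating the same update 15 times. We assume the minimizer is \bold{x}^* = (0, 1), which follows from setting the gradient (2x, 2y - 2) to zero:

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x, y) = x^2 + y^2 - 2y
    return np.array([2 * x[0], 2 * x[1] - 2])

x_star = np.array([0.0, 1.0])   # true minimizer (gradient vanishes here)
x = np.array([-2.0, 1.0])       # starting point from the example
gamma = 0.1

for _ in range(15):
    x = x - gamma * grad_f(x)

print(np.linalg.norm(x - x_star))  # within 0.088 of the minimizer
```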

Scaling the gradient

We can use the gradient to ensure we are taking the direction of the fastest descent at each step. However, we still need to figure out how to choose a good step size. Let's take another look at the gradient descent rule:

\bold{x}_{k+1} = \bold{x}_k - \gamma \cdot \nabla f(\bold{x}_k)

We can think of the step size as a factor that scales the effect the gradient has on each iteration. To see this scaling effect in action, let's run the previous example again, but with \gamma = 0.01 and \gamma = 1.1 this time.

With \gamma = 0.01, we get:

Scaling the gradient (1)

We can see that this time it took many more points to descend. In fact, in order to reach an \bold{x}_k located within a distance of 0.08 units from the true minimizer \bold{x}^*, it would take about k = 155 iterations (more than 10 times as many!).

With \gamma = 1.1, we get:

Scaling the gradient (2)

For this one, we can see that we are not even going down! After three iterations, the value of z keeps increasing, and it will continue to do so indefinitely. This happens because the step is too large: it always overshoots and misses the minimum, jumping from one side of the paraboloid to the other while z grows as the surface expands upwards. In fact, z = 658.38 for the same k = 15 as before (with z_{k=15} almost 660 units above the true minimum value!)
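The three step sizes can be compared with a small sketch. The helper `run` is our own; it returns the value of z after a given number of update steps:

```python
import numpy as np

def f(x):
    return x[0] ** 2 + x[1] ** 2 - 2 * x[1]

def grad_f(x):
    return np.array([2 * x[0], 2 * x[1] - 2])

def run(gamma, n_iters=15, x0=(-2.0, 1.0)):
    """Apply n_iters gradient descent updates and return the final z value."""
    x = np.array(x0)
    for _ in range(n_iters):
        x = x - gamma * grad_f(x)
    return f(x)

print(run(0.1))   # close to the minimum value, -1
print(run(0.01))  # still descending, but much more slowly
print(run(1.1))   # overshooting: z has blown up instead of decreasing
```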

Good steps

As we saw, this method is very sensitive to changes in the step size. A poor choice of this parameter could result in painfully slow convergence of the sequence, or even in no convergence at all.

So, how do we find a good step size? First, we want to make sure that f(\bold{x}_{k+1}) < f(\bold{x}_k) at each iteration (that way we know we are actually descending). Later on, we will see a stricter version of this condition, formally known as the Armijo rule.

In practice, one way to fulfill this condition is to pick an initial value \gamma = \gamma_0 and compute f(\bold{x}_k - \gamma_0 \cdot \nabla f(\bold{x}_k)). Then, we try again with a new \gamma = \gamma_1 = \frac{\gamma_0}{2}, then with \gamma = \gamma_2 = \frac{\gamma_1}{2}, and so on, until

f(\bold{x}_k - \gamma \cdot \nabla f(\bold{x}_k)) < f(\bold{x}_k)

making sure to stop when \gamma gets too small (this way we prevent a slow convergence). We repeat this process at each iteration k.
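The halving strategy above can be sketched as follows. This is a minimal, hypothetical implementation with names of our own choosing; practical line searches usually add a sufficient-decrease test on top of plain descent:

```python
import numpy as np

def backtracking_step(f, grad_f, x, gamma0=1.0, gamma_min=1e-8):
    """Halve gamma until the step actually decreases f, then take that step."""
    gamma = gamma0
    g = grad_f(x)
    while f(x - gamma * g) >= f(x) and gamma > gamma_min:
        gamma = gamma / 2
    return x - gamma * g

def f(x):
    return x[0] ** 2 + x[1] ** 2 - 2 * x[1]

def grad_f(x):
    return np.array([2 * x[0], 2 * x[1] - 2])

x = np.array([-2.0, 1.0])
for _ in range(20):
    x = backtracking_step(f, grad_f, x)
print(x)  # approaches the minimizer (0, 1)
```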

However, in order to achieve the exact steepest descent, we need to solve the following single-variable optimization problem:

\gamma_k = \argmin_{\gamma} f(\bold{x}_k - \gamma \cdot \nabla f(\bold{x}_k))

This gives us the step size that reaches the minimum value the function can take along the search direction for the next iteration (out of all the possibilities associated with each possible value of \gamma), securing f(\bold{x}_{k+1}) \leq f(\bold{x}_k) at that step. This single-variable optimization problem can be solved with any of the methods we are already familiar with.
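One way to sketch this inner minimization is a simple ternary search over \gamma, which is our own choice of single-variable method and assumes the function is unimodal along the search direction on the chosen interval:

```python
import numpy as np

def exact_line_search_step(f, grad_f, x, gamma_hi=2.0, tol=1e-8):
    """One steepest-descent step, choosing gamma by minimizing
    phi(gamma) = f(x - gamma * grad_f(x)) over [0, gamma_hi]
    with ternary search (assumes phi is unimodal on that interval)."""
    g = grad_f(x)
    phi = lambda gamma: f(x - gamma * g)
    lo, hi = 0.0, gamma_hi
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2  # minimum lies in [lo, m2]
        else:
            lo = m1  # minimum lies in [m1, hi]
    return x - (lo + hi) / 2 * g

def f(x):
    return x[0] ** 2 + x[1] ** 2 - 2 * x[1]

def grad_f(x):
    return np.array([2 * x[0], 2 * x[1] - 2])

x = exact_line_search_step(f, grad_f, np.array([-2.0, 1.0]))
print(x)  # for this quadratic, one exact step lands on the minimizer (0, 1)
```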

Conclusion

In this topic, we have learned the following:

  • The directional derivative can tell us how much change will result from moving in a specific direction from a given point.
  • The gradient vector evaluated at a given location will point to the direction of maximum increase from that location.
  • The gradient descent method uses the negative gradient vector as the search direction for each step, ensuring the direction of steepest descent at each step.
  • The method is very sensitive to the choice of step size. A poor choice might lead to slow convergence, or to no convergence at all. Choosing a good step size is an optimization problem in itself.