We already know how computers can use descent methods to optimize single-variable objective functions. However, when it comes to higher-dimensional domains like the ones used in machine learning models, we need to define a search direction that accounts for change in more than one coordinate. In this topic, we will explore how the gradient operator can help us find this direction, as well as some advantages and disadvantages of the resulting method.
The direction of steepest change
Let's say a friend of ours is standing at some point on a mountain whose ground level is described by a function $f$, but he is unsure of the path he took to get there. So, our friend decides to walk down to the base of the mountain. However, our buddy is also quite tired at this point, so naturally, he would like to make sure that each step he takes makes good progress down the mountain. In which direction should the next step be taken?
In order to answer this question, let's recall the gradient operator.
As we learned previously, the gradient operator gives us a vector-valued function comprised of a function's partial derivatives with respect to each coordinate:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$
It turns out that if we take the dot product of the gradient of a function evaluated at a given point $x$ with another vector $v$, we get a measure of the rate at which the function would change if we were to move in the direction given by that vector:

$$\nabla f(x) \cdot v$$
This is called a directional derivative, and it can be denoted as either $D_v f$ or $\nabla_v f$. We can think of it as a weighted sum of the contributions of change from each coordinate.
For example, let's say that we have a function $f$ whose gradient, evaluated at a point $p$, is:

$$\nabla f(p) = (2, 5, 0)$$

Then, for a given vector $v = (1, 1, 1)$,

$$\nabla f(p) \cdot v = 2(1) + 5(1) + 0(1) = 7$$

We can see that at point $p$, the second coordinate contributes the largest amount of change, whilst the first coordinate contributes a smaller amount, and there's no change contribution from the third coordinate.
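To make this concrete, here is a quick numerical check of the example above (the gradient and direction values are the illustrative ones used here, not prescribed by any particular function):

```python
import numpy as np

# Gradient of f evaluated at the point p (illustrative values from the example)
grad_f_p = np.array([2.0, 5.0, 0.0])

# Direction vector v
v = np.array([1.0, 1.0, 1.0])

# The directional derivative is a dot product: a weighted sum of
# per-coordinate contributions of change
contributions = grad_f_p * v
directional_derivative = contributions.sum()

print(contributions)            # [2. 5. 0.]: the second coordinate dominates
print(directional_derivative)   # 7.0
```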
From the definition of the dot product, we can say the following:

$$\nabla f(x) \cdot v = \lVert \nabla f(x) \rVert \, \lVert v \rVert \cos\theta$$
with $\theta$ being the angle between the gradient evaluated at the given point and the vector $v$. Assuming $v$ is a unit vector, we can see that the maximum possible positive value for the directional derivative is achieved when this angle is equal to $0$, which tells us that the direction resulting in the maximum increase of a function at a given point is the same as the direction of the gradient of the function evaluated at that same point. In other words, the gradient points in the direction of steepest change at a given point. Not only that, but the maximum possible negative value for the directional derivative happens when $\theta = \pi$, which tells us that the direction resulting in the maximum decrease is given by the vector pointing opposite to the gradient of the function at that point.
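We can verify this numerically: among unit vectors, the dot product with the gradient is largest along the gradient itself and most negative along its opposite. A small sketch, using the same illustrative gradient as before:

```python
import numpy as np

grad = np.array([2.0, 5.0, 0.0])

# Unit vector along the gradient: the steepest-ascent direction (theta = 0)
u_up = grad / np.linalg.norm(grad)

# Any other unit direction gives a smaller directional derivative
rng = np.random.default_rng(0)
for _ in range(5):
    u = rng.normal(size=3)
    u /= np.linalg.norm(u)
    assert grad @ u <= grad @ u_up

print(grad @ u_up)    # equals ||grad||: the maximum possible increase
print(grad @ -u_up)   # equals -||grad||: the maximum possible decrease (theta = pi)
```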
So, in which direction should our friend take the next step to go down the mountain? In the direction pointed to by $-\nabla f$ evaluated at the given coordinates, of course!
Method of steepest descent
Let's recall the rule that defines a descent method:

$$x_{k+1} = x_k + \alpha_k d_k$$
where $k$ is the current iteration, $\alpha_k$ is the step size, and $d_k$ is the search direction. The method of steepest descent incorporates the notion of the direction of steepest change into the descent method rule by choosing the search direction

$$d_k = -\nabla f(x_k)$$
Then, we have

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$$
This way, we can ensure we are going in the direction of the fastest descent at each iteration.
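As a minimal sketch of the method, assuming a fixed step size and a user-supplied gradient function (the names here are illustrative):

```python
import numpy as np

def steepest_descent(grad_f, x0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Minimize f by repeatedly stepping against its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # gradient is ~0: near a stationary point
            break
        x = x - alpha * g             # x_{k+1} = x_k - alpha * grad f(x_k)
    return x
```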
As an example, let's consider the following paraboloid to minimize:

$$f(x, y) = x^2 + y^2$$
From its plot, we can see that there's only one distinct minimum value:
Let's say we want to start our iteration process at some point on the surface described by our function, like $x_0 = (a, b)$ for some arbitrary values $a$ and $b$.
Then, we can compute the gradient at this point, as well as its opposite vector:

$$\nabla f(x_0) = (2a, 2b), \qquad -\nabla f(x_0) = (-2a, -2b)$$
We can plot this vector by restricting ourselves to the $xy$-plane, which results in a 2D plot like so:
Now, we know which direction to take for the next iteration. For the step size, let's consider a constant $\alpha$. Then,

$$x_1 = x_0 - \alpha \nabla f(x_0)$$

which means

$$f(x_1) < f(x_0)$$
We have effectively descended!
By repeating this process, we get closer and closer to the minimizer each time:
To get an idea of how close we can get: $x_k$ will be within a distance of 0.088 units from the true minimizer after just a few iterations.
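The iteration above is easy to reproduce. Here is a sketch of the run; the starting point and step size are illustrative assumptions, since the exact values are not shown in the text:

```python
import numpy as np

# The paraboloid from the example: f(x, y) = x^2 + y^2, minimized at (0, 0)
f = lambda p: p @ p
grad_f = lambda p: 2 * p

x = np.array([3.0, 4.0])   # illustrative starting point
alpha = 0.25               # illustrative constant step size

for k in range(1, 8):
    x = x - alpha * grad_f(x)
    # With alpha = 0.25, x_{k+1} = 0.5 * x_k, so the distance to the
    # minimizer halves at every iteration
    print(k, x, np.linalg.norm(x))
```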
Scaling the gradient
We can use the gradient to ensure we are taking the direction of the fastest descent at each step. However, we still need to figure out how to choose a good step size. Let's take another look at the gradient descent rule:

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$$
We can think of the step size as a factor that scales the effect the gradient has on each iteration. To see this scaling effect in action, let's run the previous example again, but with a much smaller and a much larger value of $\alpha$ this time.
With the smaller step size, we get:
We can see that this time it took many more iterations to descend. In fact, in order to reach an $x_k$ located within a distance of 0.08 units from the true minimizer, it would take us more than ten times as many iterations as before!
With the larger step size, we get:
For this one, we can see that we are not even going down! After three iterations, the value of $f(x_k)$ keeps increasing, and it will continue to increase indefinitely. This happens because the step is too large: it always overshoots and misses the minimum, jumping from one side of the paraboloid to the other, with $f$ growing as the surface expands upwards. In fact, after the same number of iterations as before, $x_k$ ends up almost 660 units away from the true minimizer!
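We can reproduce both failure modes with a few lines, again using illustrative step sizes (one too small, one too large):

```python
import numpy as np

grad_f = lambda p: 2 * p          # gradient of f(x, y) = x^2 + y^2
x0 = np.array([3.0, 4.0])

for alpha in (0.01, 1.1):         # illustrative "too small" and "too large" steps
    x = x0.copy()
    for _ in range(3):
        x = x - alpha * grad_f(x)
    print(alpha, np.linalg.norm(x))

# alpha = 0.01: x barely moves, so convergence is painfully slow
# alpha = 1.1 : each update multiplies x by (1 - 2*alpha) = -1.2, so the
#               iterates jump across the paraboloid and grow without bound
```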
Good steps
As we saw, this method is very sensitive to changes in step size. A poor choice for this parameter could result in painfully slow convergence of the sequence, or even in no convergence at all.
So, how do we find a good step size? First, we want to make sure that $f(x_{k+1}) < f(x_k)$ for each iteration (that way we know we are actually descending). Later on, we will see how this requirement is refined into what is formally known as the Armijo rule.
In practice, one way to fulfill this rule is to pick an initial value for $\alpha$ and compute $f(x_{k+1})$. If the function did not decrease, we try again with a new step size $\alpha/2$, then with $\alpha/4$, and so on, until

$$f(x_{k+1}) < f(x_k)$$

making sure to stop if $\alpha$ gets too small (this way we prevent a slow convergence). We then repeat this process for each iteration $k$.
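A minimal sketch of this halving scheme, assuming $f$ and its gradient are available as functions (the names are illustrative):

```python
import numpy as np

def halving_step(f, grad_f, x, alpha0=1.0, alpha_min=1e-10):
    """Shrink alpha until the step actually decreases f, or give up."""
    g = grad_f(x)
    alpha = alpha0
    while alpha > alpha_min:
        x_new = x - alpha * g
        if f(x_new) < f(x):    # descent achieved: accept this step size
            return x_new, alpha
        alpha /= 2             # otherwise, halve the step and try again
    return x, 0.0              # alpha got too small: stop without moving
```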
However, in order to achieve the exact steepest descent, we need to solve the following single-variable optimization problem:

$$\min_{\alpha > 0} \; f(x_k - \alpha \nabla f(x_k))$$
This will give us the step size that reaches the lowest value the function can take for the next iteration (out of all the possibilities associated with each possible value of $\alpha$), securing $f(x_{k+1}) \leq f(x_k)$ at that step. This single-variable optimization problem can be solved with any of the methods we are already familiar with.
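For instance, here is a sketch that uses SciPy's scalar minimizer to perform this exact line search on the paraboloid from the example (the bounds on $\alpha$ are an assumption made for the demo):

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda p: p @ p               # the paraboloid from the example
grad_f = lambda p: 2 * p

def exact_line_search_step(x):
    """Choose alpha by minimizing f along the descent ray from x."""
    g = grad_f(x)
    phi = lambda alpha: f(x - alpha * g)   # single-variable function of alpha
    alpha = minimize_scalar(phi, bounds=(0.0, 1.0), method="bounded").x
    return x - alpha * g

# For this f, the optimal alpha is 0.5, which lands exactly on the minimizer
print(exact_line_search_step(np.array([3.0, 4.0])))   # ~ [0. 0.]
```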
Conclusion
In this topic, we have learned the following:
- The directional derivative can tell us how much change will result from moving in a specific direction from a given point.
- The gradient vector evaluated at a given location will point to the direction of maximum increase from that location.
- The gradient descent method uses the negative gradient vector as the search direction, ensuring we move in the direction of steepest descent at each step.
- The method is very sensitive to the choice of step size. A poor choice might lead to slow convergence, or to no convergence at all. Choosing a good step size is an optimization problem in itself.