We already know how the first derivative test can help us with optimization problems. However, this only works if we can manipulate the algebraic expression for both the objective function and its derivative, and sometimes a function's derivative is unknown or is very difficult to compute. Furthermore, computers can have a hard time with algebra, as it is easier for them to deal with numbers than to manipulate symbolic expressions. In this topic, you will learn about different algorithms used by computers to optimize a given function numerically.
Golden-section search method
A naive way of finding a minimum (or maximum) for a function over a given interval is to sort all the values the function takes over that interval by comparing them over and over again. Ignoring the fact that this would mean chopping up a continuous function into discrete points, it would still be impractical due to the sheer number of computations, especially if the function itself is costly to evaluate. If, instead, we were to partition the interval into smaller and smaller sub-intervals iteratively such that they all contain the optimal solution, the number of computations required to approximate the solution would be reduced significantly.
Let's consider both an interval $[a, b]$ and a function $f$ which has a minimizer $x^* \in [a, b]$. We say that $f$ satisfies the unimodal property if it only decreases from $a$ to $x^*$, and only increases from $x^*$ to $b$.
We can split this interval into two overlapping sub-intervals of equal width by choosing two interior points $a < x_1 < x_2 < b$: $[a, x_2]$ and $[x_1, b]$.
From here, we need to choose the interval that is guaranteed to contain the minimum. We can see that $f(x_1) > f(x_2)$; since $f$ is unimodal, we know it can't take values lower than $f(x_2)$ for $x \leq x_1$. Thus, we can say the minimizer $x^*$ must be inside the interval $[x_1, b]$.
Then, we redefine the search interval:

$$[a, b] \leftarrow [x_1, b]$$
And split it again into two sub-intervals $[a, x_2]$ and $[x_1, b]$, using a new pair of interior points $x_1$ and $x_2$.
Now, we can see that $f(x_2) > f(x_1)$ and, since $f$ is unimodal, we know it can't take values lower than $f(x_1)$ for $x \geq x_2$. Thus, we can say the minimizer $x^*$ must be inside the interval $[a, x_2]$. Then, we redefine $[a, b] \leftarrow [a, x_2]$ accordingly.
We repeat the process iteratively until we find a good enough approximation for the minimizer $x^*$.
If we call $d_k$ the width of the search interval at iteration $k$, the width of the two candidate sub-intervals must be equal every time, then

$$d_{k+1} = x_2 - a = b - x_1$$

for each iteration $k$.
In addition, since the discarded part of the interval and the chosen sub-interval together make up the whole interval,

$$d_k = (x_1 - a) + (b - x_1),$$

we have

$$d_k = (x_1 - a) + d_{k+1}.$$

But also, we know that one of the two interior points should be reused as an interior point of the next sub-interval, so that only one new evaluation of $f$ is needed per iteration; this forces the discarded part to have exactly the width of the sub-interval after the next one, $x_1 - a = d_{k+2}$. Therefore,

$$d_k = d_{k+1} + d_{k+2}$$
In order to keep the proportion between intervals and sub-intervals constant for each iteration, we must define a common ratio:

$$\frac{d_k}{d_{k+1}} = \frac{d_{k+1}}{d_{k+2}} = c$$

Where $c$ is a constant value.
Now, if we take our previous expression for $d_k$ and divide both sides by the width of the smaller section for that iteration, $d_{k+2}$, we get

$$\frac{d_k}{d_{k+2}} = \frac{d_{k+1}}{d_{k+2}} + 1$$

Then, we can break down the left-hand side in two factors, like so:

$$\frac{d_k}{d_{k+2}} = \frac{d_k}{d_{k+1}} \cdot \frac{d_{k+1}}{d_{k+2}}$$
Substituting the common ratio $c$ for each of these quotients, we can define a new constant $\varphi = c$, such that

$$\varphi^2 = \varphi + 1 \quad\Longrightarrow\quad \varphi^2 - \varphi - 1 = 0$$
Using the quadratic formula (and keeping the positive root, since widths are positive), we obtain

$$\varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618$$
This quantity is known as the golden ratio. It appears frequently in geometry and in nature itself, and it gives its name to this method, as it corresponds to the proportion by which we reduce intervals into sub-intervals. Then,

$$d_{k+1} = \frac{d_k}{\varphi}$$

for each iteration $k$.
The main advantage of this method is that it is computationally inexpensive, as we never deal with the function's derivative or perform any sort of algebraic manipulation. However, depending on the function and the chosen interval, sometimes it can be slow as it converges linearly.
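To make the procedure concrete, here is a minimal sketch of golden-section search in Python. The function name `golden_section_search` and the stopping tolerance are our own choices, and the test function is the same one used later in the Newton's method example. For simplicity, this version evaluates $f$ at both interior points on every iteration; a more careful implementation reuses one of the two evaluations, which is exactly what the golden ratio makes possible.

```python
import math

def golden_section_search(f, a, b, tol=1e-5):
    """Approximate the minimizer of a unimodal function f over [a, b]."""
    phi = (1 + math.sqrt(5)) / 2          # the golden ratio, ~1.618
    while (b - a) > tol:
        # Interior points: each sub-interval is the current one shrunk by 1/phi
        x1 = b - (b - a) / phi
        x2 = a + (b - a) / phi
        if f(x1) > f(x2):
            a = x1                        # the minimizer must lie in [x1, b]
        else:
            b = x2                        # the minimizer must lie in [a, x2]
    return (a + b) / 2

# Example: f(x) = 3x - ln(x) is unimodal on (0, 1] with minimizer x* = 1/3
print(golden_section_search(lambda x: 3 * x - math.log(x), 0.01, 1))  # ~0.3333
```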
Parabolic interpolation
Linear interpolation is often used to find a point in between two other points $x_1$ and $x_2$ that are sufficiently close to each other, such that a line can be used to approximate $f$ from $x_1$ to $x_2$. We can build on this concept to find a given function's local minimum iteratively, but instead of fitting a line between two points, we will fit parabolas through three points.
Let's consider a function $f(x)$.
We can fit a parabola through three arbitrary points $x_1$, $x_2$, and $x_3$ that belong to our curve.
Now, by minimizing this parabola we can get closer to the minimizer of $f$. To do that, we first pinpoint the parabola's minimizer, $x_4$.
Remember that a parabola's minimizer will be equal to the x-coordinate of its vertex, $x_v = -\frac{b}{2a}$ for a parabola $y = ax^2 + bx + c$. Then, for the parabola passing through $(x_1, f(x_1))$, $(x_2, f(x_2))$ and $(x_3, f(x_3))$,

$$x_4 = x_2 - \frac{1}{2}\,\frac{(x_2 - x_1)^2\,[f(x_2) - f(x_3)] - (x_2 - x_3)^2\,[f(x_2) - f(x_1)]}{(x_2 - x_1)\,[f(x_2) - f(x_3)] - (x_2 - x_3)\,[f(x_2) - f(x_1)]}$$
Now, we take the parabola's minimizer $x_4$ and evaluate the objective function at that point, $f(x_4)$.
Using this new point $x_4$, we update our three points by discarding the one with the maximum function value of the three, i.e. the one that is presumably farthest from the minimizer of the objective function. We then fit a parabola through this new set of three points.
This process is repeated until the objective function's minimizer has been found.
Even though this method can achieve superlinear convergence, and is therefore somewhat faster than the golden-section method, it is not guaranteed to converge. Furthermore, if the three points become collinear, a new parabola can't be fitted, resulting in the method "getting stuck" and repeating the same step in subsequent iterations.
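Here is a minimal sketch of the method in Python, using the vertex formula from above. The function name, the stopping rule, and the fallback return value are our own choices; the loop also stops if the three points become collinear, since no new parabola can be fitted in that case.

```python
import math

def parabolic_interpolation(f, x1, x2, x3, tol=1e-8, max_iter=100):
    """Approximate a local minimizer of f by successive parabolic interpolation."""
    for _ in range(max_iter):
        f1, f2, f3 = f(x1), f(x2), f(x3)
        # x-coordinate of the vertex of the parabola through the three points
        num = (x2 - x1) ** 2 * (f2 - f3) - (x2 - x3) ** 2 * (f2 - f1)
        den = (x2 - x1) * (f2 - f3) - (x2 - x3) * (f2 - f1)
        if den == 0:                      # collinear points: the method stalls
            break
        x4 = x2 - 0.5 * num / den
        # Discard the point with the largest function value, keep the new one
        best, middle, _worst = sorted([(x1, f1), (x2, f2), (x3, f3)],
                                      key=lambda p: p[1])
        if abs(x4 - best[0]) < tol:       # the vertex barely moves: good enough
            return x4
        x1, x2, x3 = best[0], middle[0], x4
    return x1                             # fall back to the last kept point

# Example: minimize f(x) = 3x - ln(x); the true minimizer is 1/3
print(parabolic_interpolation(lambda x: 3 * x - math.log(x), 0.1, 0.2, 0.9))
```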
Descent methods
There is a family of optimization algorithms that can find a minimizer for a function by computing steps over several iterations that get closer to the minimum until it is reached, using the following rule:

$$x_{k+1} = x_k + \alpha_k p_k$$
Where $k$ is the current iteration, $\alpha_k$ is the step size, and $p_k$ is the search direction. As their names imply, $p_k$ controls where to move next, and $\alpha_k$ controls by how much. Both $\alpha_k$ and $p_k$ are usually chosen to make $f(x_{k+1}) < f(x_k)$, which makes the entire trajectory look like it is descending with each step. Despite the name, the algorithm can be used to find maxima by adopting the opposite criterion.
Each parameter contributes to the method in its own way. If the step size is too large, new iterations may miss the minimum over and over again (sometimes indefinitely), and if the step size is too small, the number of iterations required for convergence might get too big. The search direction can be determined in various ways, but almost always the derivative of the function is involved since it can tell us whether the function is increasing or decreasing at a given point.
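To make the rule concrete, here is a minimal sketch of a descent method in Python. The choice of the negative derivative as the search direction (plain gradient descent) and the fixed step size are illustrative assumptions rather than the only options, and the function name and tolerance are our own.

```python
def descent(df, x0, step_size=0.01, tol=1e-8, max_iter=1000):
    """Generic descent iteration x_{k+1} = x_k + step_size * p_k,
    with the search direction chosen as p_k = -df(x_k)."""
    x = x0
    for _ in range(max_iter):
        p = -df(x)                    # move in the direction where f decreases
        x_next = x + step_size * p    # take a step of the chosen size
        if abs(x_next - x) < tol:     # stop once the steps become negligible
            return x_next
        x = x_next
    return x

# Example: minimize f(x) = 3x - ln(x), whose derivative is f'(x) = 3 - 1/x
print(descent(lambda x: 3 - 1 / x, x0=0.1))  # approaches the minimizer 1/3
```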
In the next section, we will see a more concrete example of a descent method: Newton's method for optimization.
Newton's method for optimization
Traditionally, Newton's method consists of applying the following rule iteratively:

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$

With $f'(x_k)$ being the derivative of a function $f$ evaluated at the point obtained at iteration $k$. Newton's method is used to determine the values at which the function becomes zero, i.e. the roots of the function.
Now, we know that for a minimum (or a maximum), the derivative of a function must be zero at that point. Therefore, we must look for the values that make the derivative of the function equal to zero, i.e. the roots of the derivative of the function. Thus, by applying the traditional Newton's method to the function's derivative instead of the function itself, we can find the roots of the derivative, which correspond to the minimizer (or maximizer) of the function:

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$$
With $f''(x_k)$ being the derivative of the derivative of $f$ (its second derivative) evaluated at the point obtained at iteration $k$. This is known as Newton's method for optimization.
As an example, let's consider the function:

$$f(x) = 3x - \ln(x)$$
Its first and second derivatives are:

$$f'(x) = 3 - \frac{1}{x}, \qquad f''(x) = \frac{1}{x^2}$$
Then, the sequence obtained from Newton's method is defined by:

$$x_{k+1} = x_k - \frac{3 - \frac{1}{x_k}}{\frac{1}{x_k^2}} = 2x_k - 3x_k^2$$
If we start iterating from $x_0 = 0.1$, we will be within 0.001 units of the true minimizer $x^* = \frac{1}{3}$ by the fifth iteration:
| $k$ | $x_k$ | $f(x_k)$ |
| --- | --- | --- |
| 0 | 0.100 | 2.603 |
| 1 | 0.170 | 2.282 |
| 2 | 0.253 | 2.133 |
| 3 | 0.314 | 2.100 |
| 4 | 0.332 | 2.099 |
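For reference, the table above can be reproduced with a few lines of Python by iterating the update rule directly:

```python
import math

def f(x):
    return 3 * x - math.log(x)     # the objective function from the example

x = 0.1                            # starting point x_0
for k in range(5):
    print(f"{k} | {x:.3f} | {f(x):.3f}")
    x = 2 * x - 3 * x ** 2         # Newton's update for this particular f
```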
However, we may get very different sequences for different values of $x_0$:
| $k$ | $x_k$ ($x_0 = 0.1$) | $x_k$ ($x_0 = 0.01$) | $x_k$ ($x_0 = 0.8$) |
| --- | --- | --- | --- |
| 0 | 0.100 | 0.010 | 0.800 |
| 1 | 0.170 | 0.020 | -0.320 |
| 2 | 0.253 | 0.038 | -0.947 |
| 3 | 0.314 | 0.072 | -4.586 |
| 4 | 0.332 | 0.129 | -72.265 |
| 5 | 0.333 | 0.208 | -15 844.269 |
| 6 | 0.333 | 0.286 | -7.500 × 10⁸ |
| 7 | 0.333 | 0.327 | -1.688 × 10¹⁸ |
As we can see, the sequence that corresponds to $x_0 = 0.01$ has a slower convergence, whilst the one that corresponds to $x_0 = 0.8$ doesn't converge at all. This is because Newton's method is based on following a quadratic approximation of the objective function, which introduces a limited range of convergence. The closer we are to the actual minimum when choosing the starting point, the more valid this approximation will be.
By comparing the descent method general rule to Newton's method for optimization, we can see that the latter is a particular case of the former, in which:

$$\alpha_k = \frac{1}{f''(x_k)}, \qquad p_k = -f'(x_k)$$
As we saw, Newton's method for optimization is sensitive to the choice of a starting point. This is a characteristic it shares with other descent methods. In addition, the method fails when the function's second derivative is equal to zero, which may happen if an inflection point is reached at some iteration $k$.
Even though there's no manipulation of algebraic expressions, note that this method requires computing the first and second derivatives over several iterations. As such, Newton's method for optimization usually results in very rapid but computationally costly convergence.
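As a closing illustration, here is a minimal sketch of Newton's method for optimization in Python. The function name `newton_optimize`, the tolerance, and the iteration cap are our own choices; note the explicit guard against a vanishing second derivative, the failure mode mentioned above.

```python
def newton_optimize(df, d2f, x0, tol=1e-6, max_iter=50):
    """Find a stationary point of f given its first (df) and second (d2f) derivatives."""
    x = x0
    for _ in range(max_iter):
        curvature = d2f(x)
        if curvature == 0:             # inflection point: the update is undefined
            raise ZeroDivisionError("second derivative vanished, cannot continue")
        step = df(x) / curvature
        x = x - step                   # x_{k+1} = x_k - f'(x_k) / f''(x_k)
        if abs(step) < tol:            # stop once the updates become negligible
            return x
    return x

# Example from above: f(x) = 3x - ln(x), f'(x) = 3 - 1/x, f''(x) = 1/x^2
print(newton_optimize(lambda x: 3 - 1 / x, lambda x: 1 / x ** 2, x0=0.1))  # ~0.3333
```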
Conclusion
In this topic, we've learned about different ways in which computers solve optimization problems without using symbolic algebra. Namely, we've learned about:
- The golden-section search method is a way to partition a given interval into sub-intervals until the optimal solution is found. We use the golden ratio when we transition from an interval to a sub-interval: $d_{k+1} = \frac{d_k}{\varphi}$, with $\varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618$.
- Parabolic interpolation is a way to minimize an objective function by repeatedly fitting parabolas through 3 different points of the function.
- Descent methods are used to reach the optimal solution by taking successive steps along a given search direction with a given step size.
- A general descent method can be characterized with a formula as such: $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is responsible for the direction of a step and $\alpha_k$ for its size.
- Newton's method for optimization is a particular case of a descent method. We approach our solution by using the following formula: $x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$.