
Numerical optimization methods


We already know how the first derivative test can help us with optimization problems. However, this only works if we can manipulate the algebraic expression for both the objective function and its derivative, and sometimes a function's derivative is unknown or is very difficult to compute. Furthermore, computers can have a hard time with algebra, as it is easier for them to deal with numbers than to manipulate symbolic expressions. In this topic, you will learn about different algorithms used by computers to optimize a given function numerically.

Golden-section search method

A naive way of finding a minimum (or maximum) of a function over a given interval is to compare all the values the function takes over that interval, over and over again. Ignoring the fact that this would mean chopping up a continuous function into discrete points, it would still be impractical due to the sheer number of computations, especially if the function itself is expensive to compute. If, instead, we were to partition the interval into smaller and smaller sub-intervals iteratively, such that they all contain the optimal solution, the number of computations required to approximate the solution would be reduced significantly.

Let's consider both an interval $[a, b]$ and a function $f$ which has a minimizer $x^* \in [a, b]$. We say that $f$ satisfies the unimodal property if it only decreases from $a$ to $x^*$, and only increases from $x^*$ to $b$.

Golden-section search method (step 1)

We can split this interval into two overlapping sub-intervals of equal width, $[a, b_1]$ and $[a_1, b]$, where $a < a_1 < b_1 < b$:

Golden-section search method (step 2)

Two intervals

From here, we need to choose the interval that is guaranteed to contain the minimum. We can see that $f(b_1) < f(a_1)$; since $f$ is unimodal, we know it can't take values lower than $f(b_1)$ for $x \in [a, a_1]$. Thus, we can say the minimizer must be inside the interval $[a_1, b]$.

Then, we redefine $a = a_1$:

Golden-section search method (step 3)

And partition it again into two sub-intervals, $[a, b_2]$ and $[a_2, b]$:

Golden-section search method (step 4)

Now, we can see that $f(a_2) < f(b_2)$ and, since $f$ is unimodal, we know it can't take values lower than $f(a_2)$ for $x \in [b_2, b]$. Thus, we can say the minimizer must be inside the interval $[a, b_2]$. Then, we redefine $b = b_2$ accordingly.

We repeat the process iteratively until we find a good enough approximation for the minimizer $x^*$.


The width of the sub-intervals must be equal every time, so

$$b - a_i = b_i - a$$

for each iteration $i$.

In addition, since

$$b_i - a = (a_i - a) + (b_i - a_i)$$

we have

$$b - a_i = (a_i - a) + (b_i - a_i)$$

But we also know that

$$b - a = (b - a_i) + (a_i - a)$$

In order to keep the proportion between intervals and sub-intervals constant for each iteration, we must define a common ratio:

$$\frac{b_i - a_i}{a_i - a} = \frac{a_i - a}{b - a_i} = \frac{b - a_i}{b - a} = k$$

where $k$ is a constant value.

Now, if we take our previous expression for $b - a$ and divide both sides by the width of the smaller section for that iteration, $a_i - a$, we get

$$\frac{b - a}{a_i - a} = \frac{b - a_i}{a_i - a} + 1$$

Then, we can break the left-hand side down into two factors, like so:

$$\frac{b - a}{b - a_i} \cdot \frac{b - a_i}{a_i - a} = \frac{b - a_i}{a_i - a} + 1$$

Substituting for $k$,

$$\frac{1}{k} \cdot \frac{1}{k} = \frac{1}{k} + 1 \implies \left(\frac{1}{k}\right)^2 - \frac{1}{k} = 1$$

We can define a new constant $\varphi = \frac{1}{k}$, such that

$$\varphi^2 - \varphi - 1 = 0$$

Using the quadratic formula, we obtain

$$\varphi = \frac{1+\sqrt{5}}{2}$$

taking the positive root, since $\varphi$ is a ratio of positive lengths. This quantity is known as the golden ratio. It appears frequently in geometry and in nature itself, and it gives its name to this method, as it corresponds to the proportion by which we reduce intervals into sub-intervals. Then,

$$\frac{b - a}{b_i - a} = \varphi \quad ; \quad \frac{b - a}{b - a_i} = \varphi \implies a_i = b - \frac{b - a}{\varphi} \quad ; \quad b_i = a + \frac{b - a}{\varphi}$$

for each iteration $i$.

The main advantage of this method is that it is computationally inexpensive, as we never deal with the function's derivative or perform any sort of algebraic manipulation. However, it only converges linearly, so depending on the function and the chosen interval it can be slow.
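To make the procedure concrete, here is a minimal Python sketch of golden-section search, assuming the objective is unimodal on $[a, b]$. The function name, the tolerance parameter `tol`, and the test interval are illustrative choices rather than part of the original text, and for simplicity the sketch re-evaluates both interior points at every iteration instead of reusing one of them.

```python
import math

def golden_section_search(f, a, b, tol=1e-6):
    """Approximate the minimizer of a unimodal function f on [a, b]."""
    phi = (1 + math.sqrt(5)) / 2
    while b - a > tol:
        a_i = b - (b - a) / phi      # left interior point
        b_i = a + (b - a) / phi      # right interior point
        if f(b_i) < f(a_i):
            a = a_i                  # minimizer must lie in [a_i, b]
        else:
            b = b_i                  # minimizer must lie in [a, b_i]
    return (a + b) / 2

# Example: f(x) = 3x - ln(x) is unimodal on [0.1, 1] with minimizer x* = 1/3
print(golden_section_search(lambda x: 3 * x - math.log(x), 0.1, 1.0))  # ~0.3333
```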

Parabolic interpolation

Linear interpolation is often used to approximate a function between two points $(x_1, f(x_1))$ and $(x_2, f(x_2))$ that are sufficiently close to each other that a straight line can be used to approximate $f(x)$ from $x_1$ to $x_2$. We can build on this concept to find a given function's local minimum iteratively, but instead of fitting a line through two points, we will fit parabolas through three points.

Let's consider a function $f$:

Graph of another function

We can fit a parabola through three arbitrary points that belong to our curve, like so:

Parabolic interpolation (step 1)

Parabolic interpolation (step 2)

Now, by minimizing this parabola, we can get closer to the minimizer of $f$. To do that, we first pinpoint the parabola's minimizer:

Parabolic interpolation (step 3)

Remember that a parabola's minimizer will be equal to the x-coordinate of its vertex, $v_x$. Then,

$$v_x = \frac{-b}{2a} \quad ; \quad p(x) = ax^2 + bx + c \implies x_p^* = \frac{-b}{2a}$$

Now, we take the parabola's minimizer $x_p^*$ and evaluate the objective function at that point, $f(x_p^*)$:

Parabolic interpolation (step 4)

Using this new point, we update our set of three points: we discard the one with the largest function value, i.e. the one that is farthest from the minimum of the objective function, and keep $x_p^*$ in its place. We then fit a parabola through this new set of three points:

Parabolic interpolation (step 5)

This process is repeated until a good enough approximation of the objective function's minimizer has been found.

Parabolic interpolation

Even though this method can achieve superlinear convergence, and is therefore somewhat faster than the golden-section method, it is not guaranteed to converge. Furthermore, if the three points become collinear, a new parabola can't be fitted through them, resulting in the method "getting stuck" and repeating the same parabola for subsequent iterations.
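Below is a minimal Python sketch of the procedure just described. It fits each parabola with `numpy.polyfit` and uses the vertex formula $x_p^* = -b/(2a)$ from above; the function name, starting points, tolerance, and iteration cap are illustrative assumptions, and a production implementation would need extra safeguards against the degenerate cases mentioned above.

```python
import numpy as np

def parabolic_interpolation(f, x1, x2, x3, tol=1e-6, max_iter=100):
    """Successive parabolic interpolation through three points (sketch)."""
    points = [x1, x2, x3]
    for _ in range(max_iter):
        values = [f(x) for x in points]
        # Fit p(x) = a*x^2 + b*x + c through the three current points
        a, b, c = np.polyfit(points, values, 2)
        if a <= 0:                          # collinear (or concave) fit: give up
            break
        x_p = -b / (2 * a)                  # the parabola's minimizer, -b / (2a)
        if min(abs(x_p - x) for x in points) < tol:
            return x_p                      # the new point barely moves: done
        # Discard the point with the largest function value, keep x_p instead
        points[values.index(max(values))] = x_p
    return min(points, key=f)               # best of the final three points

# Example (illustrative starting points): f(x) = 3x - ln(x), minimizer x* = 1/3
print(parabolic_interpolation(lambda x: 3 * x - np.log(x), 0.1, 0.5, 1.0))  # ~0.3333
```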

Descent methods

There is a family of optimization algorithms that can find a minimizer $x^*$ for a function $f$ by computing steps over several iterations that get closer to the minimum until it is reached, using the following rule:

$$x_{k+1} = x_k + \gamma \cdot p_k$$

where $k$ is the current iteration, $\gamma$ is the step size, and $p_k$ is the search direction. As their names imply, $p_k$ controls in which direction to move next, and $\gamma$ by how much. Both $\gamma$ and $p_k$ are usually chosen to make $f(x_{k+1}) < f(x_k)$, which makes the entire trajectory look like $x$ is descending with each step. Despite the name, the algorithm can be used to find maxima by adopting the opposite criterion.
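As an illustration of this rule, here is a minimal Python sketch of one common descent method, gradient descent (not covered by name in this topic), which takes $p_k = -f'(x_k)$ together with a fixed step size $\gamma$. The function name, step size, tolerance, and iteration cap are illustrative assumptions.

```python
def descent(df, x0, gamma=0.05, tol=1e-8, max_iter=10_000):
    """Descent rule x_{k+1} = x_k + gamma * p_k with p_k = -f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        x_new = x - gamma * df(x)     # step against the sign of the derivative
        if abs(x_new - x) < tol:      # stop once the steps become negligible
            return x_new
        x = x_new
    return x

# Example: f(x) = 3x - ln(x), f'(x) = 3 - 1/x, minimizer x* = 1/3
print(descent(lambda x: 3 - 1 / x, x0=0.5))  # ~0.3333
```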

Each parameter contributes to the method in its own way. If the step size is too large, new iterations may miss the minimum over and over again (sometimes indefinitely), and if the step size is too small, the number of iterations required for convergence may become too large. The search direction can be determined in various ways, but the derivative of the function is almost always involved, since it can tell us whether the function is increasing or decreasing at a given point.

In the next section, we will look at a concrete descent method in more detail: Newton's method for optimization.

Newton's method for optimization

Traditionally, Newton's method consists of applying the following rule iteratively:

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$

with $f'(x_k)$ being the derivative of a function $f$ evaluated at iteration $k$. Newton's method is used to determine the values at which the function becomes zero, i.e. the roots of the function.

Now, we know that at a minimum (or a maximum) the derivative of a function must be zero. Therefore, we must look for the values that make the derivative of the function equal to zero, i.e. the roots of the derivative. Thus, by applying the traditional Newton's method to the function's derivative instead of the function itself, we can find the roots of the derivative, which correspond to the minimizer (or maximizer) of the function:

$$x_{k+1} = x_k - \frac{f'(x_k)}{\left(f'(x_k)\right)'} \implies x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$$

with $f''(x_k)$ being the second derivative of $f$ evaluated at iteration $k$. This is known as Newton's method for optimization.

As an example, let's consider the function:

$$f(x) = 3x - \ln(x)$$

Its first and second derivatives are:

$$\begin{align*} f'(x) &= 3-\frac{1}{x} \\ f''(x) &= \frac{1}{x^2} \end{align*}$$

Then, the sequence obtained from Newton's method is defined by:

$$x_{k+1} = x_k - x_k^2 \cdot \left( 3 - \frac{1}{x_k} \right) = x_k + x_k - 3x_k^2 \implies x_{k+1} = 2x_k - 3x_k^2$$
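Here is a minimal Python sketch that applies this rule to the example above and reproduces the table that follows; the helper's name, the number of iterations, and the use of explicit lambdas for $f'$ and $f''$ are illustrative choices rather than part of the original text.

```python
import math

def newton_optimize(df, d2f, x0, n_iter=4):
    """Newton's method for optimization: x_{k+1} = x_k - f'(x_k) / f''(x_k)."""
    xs = [x0]
    for _ in range(n_iter):
        x = xs[-1]
        xs.append(x - df(x) / d2f(x))
    return xs

# f(x) = 3x - ln(x), f'(x) = 3 - 1/x, f''(x) = 1/x^2
f = lambda x: 3 * x - math.log(x)
for k, x in enumerate(newton_optimize(lambda x: 3 - 1 / x, lambda x: 1 / x**2, x0=0.1)):
    print(k, round(x, 3), round(f(x), 3))   # matches the table below
```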

If we start iterating from $x_0 = 0.1$, we will be within 0.001 units of the true minimizer $x^* = \frac{1}{3}$ by the fifth iteration:

| $k$ | $x_k$ | $f(x_k)$ |
|---|---|---|
| 0 | 0.100 | 2.603 |
| 1 | 0.170 | 2.282 |
| 2 | 0.253 | 2.133 |
| 3 | 0.314 | 2.100 |
| 4 | 0.332 | 2.099 |

Newton's method for optimization

However, we may get very different sequences for different values of $x_0$:

| $k$ | $x_k$ ($x_0 = 0.1$) | $x_k$ ($x_0 = 0.01$) | $x_k$ ($x_0 = 0.8$) |
|---|---|---|---|
| 0 | 0.100 | 0.010 | 0.800 |
| 1 | 0.170 | 0.020 | -0.320 |
| 2 | 0.253 | 0.038 | -0.947 |
| 3 | 0.314 | 0.072 | -4.586 |
| 4 | 0.332 | 0.129 | -72.265 |
| 5 | 0.333 | 0.208 | -15 844.269 |
| 6 | 0.333 | 0.286 | -7.500 × 10⁸ |
| 7 | 0.333 | 0.327 | -1.688 × 10¹⁸ |

As we can see, the sequence that corresponds to $x_0 = 0.01$ converges more slowly, whilst the one that corresponds to $x_0 = 0.8$ doesn't converge at all. This is because Newton's method is based on following a quadratic approximation of the objective function, which introduces a limited range of convergence. The closer the starting point is to the actual minimum, the more valid this approximation will be.

By comparing the general descent rule to Newton's method for optimization, we can see that the latter is a particular case of the former, in which:

$$\gamma \cdot p_k = - \frac{f'(x_k)}{f''(x_k)}$$

As we saw, Newton's method for optimization is sensitive to the choice of starting point, a characteristic it shares with other descent methods. In addition, the method fails when the second derivative is equal to zero, which may happen if an inflection point is reached at some iteration $k$.

Even though no algebraic manipulation of expressions is involved, note that this method requires computing the first and second derivatives at every iteration. As such, Newton's method for optimization usually converges very rapidly, but at a higher computational cost per iteration.

Conclusion

In this topic, we've learned about different ways in which computers solve optimization problems without using symbolic algebra. Namely, we've learned about:

  • The golden-section search method is a way to partition a given interval into sub-intervals until the optimal solution is found. The golden ratio $\varphi = \frac{1+\sqrt{5}}{2}$ gives the proportion by which we pass from an interval to its sub-intervals.
  • Parabolic interpolation is a way to minimize an objective function by repeatedly fitting parabolas through 3 different points of the function.
  • Descent methods reach the optimal solution by taking successive steps along a given search direction with a given step size.
  • A general descent method can be characterized by the formula $x_{k+1} = x_k + \gamma \cdot p_k$, where $p_k$ is responsible for the direction and $\gamma$ for the size of a step.
  • Newton's method for optimization is a particular case of a descent method. We approach the solution by using the formula $x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$.