
Numerical optimization methods


We already know how the first derivative test can help us with optimization problems. However, this only works if we can manipulate the algebraic expression for both the objective function and its derivative, and sometimes a function's derivative is unknown or is very difficult to compute. Furthermore, computers can have a hard time with algebra, as it is easier for them to deal with numbers than to manipulate symbolic expressions. In this topic, you will learn about different algorithms used by computers to optimize a given function numerically.

Golden-section search method

A naive way of finding a minimum (or maximum) of a function over a given interval is to compare all the values the function takes over that interval, over and over again. Ignoring the fact that this would mean chopping up a continuous function into discrete points, it would still be impractical due to the sheer number of computations, especially if the function itself is expensive to compute. If, instead, we were to partition the interval into smaller and smaller sub-intervals iteratively, such that they all contain the optimal solution, the number of computations required to approximate the solution would be reduced significantly.

Let's consider both an interval $[a, b]$ and a function $f$ which has a minimizer $x^* \in [a, b]$. We say that $f$ satisfies the unimodal property if it only decreases from $a$ to $x^*$, and only increases from $x^*$ to $b$.

Golden-section search method (step 1)

We can split this interval into two overlapping sub-intervals of equal width, $[a, b_1]$ and $[a_1, b]$, where $a < a_1 < b_1 < b$:

Golden-section search method (step 2)

Two intervals

From here, we need to choose the interval that is guaranteed to contain the minimum. We can see that $f(b_1) < f(a_1)$; since $f$ is unimodal, we know it can't take values lower than $f(b_1)$ for $x \in [a, a_1]$. Thus, we can say the minimizer must be inside the interval $[a_1, b]$.

Then, we redefine $a = a_1$:

Golden-section search method (step 3)

And partition it again into two sub-intervals, $[a, b_2]$ and $[a_2, b]$:

Golden-section search method (step 4)

Now, we can see that $f(a_2) < f(b_2)$ and, since $f$ is unimodal, we know it can't take values lower than $f(a_2)$ for $x \in [b_2, b]$. Thus, we can say the minimizer must be inside the interval $[a, b_2]$. Then, we redefine $b = b_2$ accordingly.

We repeat the process iteratively until we find a good enough approximation for the minimizer $x^*$.


The width of the sub-intervals must be equal every time, so

$$b - a_i = b_i - a$$

for each iteration $i$.

In addition, since

$$b_i - a = (a_i - a) + (b_i - a_i)$$

we have

$$b - a_i = (a_i - a) + (b_i - a_i)$$

But we also know that

$$b - a = (b - a_i) + (a_i - a)$$

In order to keep the proportion between intervals and sub-intervals constant for each iteration, we must define a common ratio:

$$\frac{b_i - a_i}{a_i - a} = \frac{a_i - a}{b - a_i} = \frac{b - a_i}{b - a} = k$$

where $k$ is a constant value.

Now, if we take our previous expression for $b - a$ and divide both sides by the width of the smaller section for that iteration, $a_i - a$, we get

$$\frac{b - a}{a_i - a} = \frac{b - a_i}{a_i - a} + 1$$

Then, we can break the left-hand side down into two factors, like so:

$$\frac{b - a}{b - a_i} \cdot \frac{b - a_i}{a_i - a} = \frac{b - a_i}{a_i - a} + 1$$

Substituting for $k$,

$$\frac{1}{k} \cdot \frac{1}{k} = \frac{1}{k} + 1 \implies \left(\frac{1}{k}\right)^2 - \frac{1}{k} = 1$$

We can define a new constant $\varphi = \frac{1}{k}$, such that

$$\varphi^2 - \varphi - 1 = 0$$

Using the quadratic formula, we obtain

$$\varphi = \frac{1+\sqrt{5}}{2}$$

taking the positive root, since $\varphi$ is a ratio of positive lengths. This quantity is known as the golden ratio. It appears frequently in geometry and in nature itself, and it gives its name to this method, as it corresponds to the proportion by which we reduce intervals into sub-intervals. Then,

$$\frac{b - a}{b_i - a} = \varphi \quad ; \quad \frac{b - a}{b - a_i} = \varphi \implies a_i = b - \frac{b - a}{\varphi} \quad ; \quad b_i = a + \frac{b - a}{\varphi}$$

for each iteration $i$.

The main advantage of this method is that it is computationally inexpensive, as we never deal with the function's derivative or perform any sort of algebraic manipulation. However, it only converges linearly, so depending on the function and the chosen interval it can be slow.
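To make the procedure concrete, here is a minimal Python sketch of golden-section search, assuming the objective is unimodal on $[a, b]$. The function name, the tolerance parameter `tol`, and the test interval are illustrative choices rather than part of the original text, and for simplicity the sketch re-evaluates both interior points at every iteration instead of reusing one of them.

```python
import math

def golden_section_search(f, a, b, tol=1e-6):
    """Approximate the minimizer of a unimodal function f on [a, b]."""
    phi = (1 + math.sqrt(5)) / 2
    while b - a > tol:
        a_i = b - (b - a) / phi      # left interior point
        b_i = a + (b - a) / phi      # right interior point
        if f(b_i) < f(a_i):
            a = a_i                  # minimizer must lie in [a_i, b]
        else:
            b = b_i                  # minimizer must lie in [a, b_i]
    return (a + b) / 2

# Example: f(x) = 3x - ln(x) is unimodal on [0.1, 1] with minimizer x* = 1/3
print(golden_section_search(lambda x: 3 * x - math.log(x), 0.1, 1.0))  # ~0.3333
```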

Parabolic interpolation

Linear interpolation is often used to approximate a function between two points $(x_1, f(x_1))$ and $(x_2, f(x_2))$ that are sufficiently close to each other that a straight line can be used to approximate $f(x)$ from $x_1$ to $x_2$. We can build on this concept to find a given function's local minimum iteratively, but instead of fitting a line through two points, we will fit parabolas through three points.

Let's consider a function $f$:

Graph of another function

We can fit a parabola through three arbitrary points that belong to our curve, like so:

Parabolic interpolation (step 1)

Parabolic interpolation (step 2)

Now, by minimizing this parabola, we can get closer to the minimizer of $f$. To do that, we first pinpoint the parabola's minimizer:

Parabolic interpolation (step 3)

Remember that a parabola's minimizer will be equal to the x-coordinate of its vertex, $v_x$. Then,

$$v_x = \frac{-b}{2a} \quad ; \quad p(x) = ax^2 + bx + c \implies x_p^* = \frac{-b}{2a}$$

Now, we take the parabola's minimizer $x_p^*$ and evaluate the objective function at that point, $f(x_p^*)$:

Parabolic interpolation (step 4)

Using this new point, we update our set of three points: we discard the one with the largest function value, i.e. the one that is farthest from the minimum of the objective function, and keep $x_p^*$ in its place. We then fit a parabola through this new set of three points:

Parabolic interpolation (step 5)

This process is repeated until a good enough approximation of the objective function's minimizer has been found.

Parabolic interpolation

Even though this method can achieve superlinear convergence, and is therefore somewhat faster than the golden-section method, it is not guaranteed to converge. Furthermore, if the three points become collinear, a new parabola can't be fitted through them, resulting in the method "getting stuck" and repeating the same parabola for subsequent iterations.
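Below is a minimal Python sketch of the procedure just described. It fits each parabola with `numpy.polyfit` and uses the vertex formula $x_p^* = -b/(2a)$ from above; the function name, starting points, tolerance, and iteration cap are illustrative assumptions, and a production implementation would need extra safeguards against the degenerate cases mentioned above.

```python
import numpy as np

def parabolic_interpolation(f, x1, x2, x3, tol=1e-6, max_iter=100):
    """Successive parabolic interpolation through three points (sketch)."""
    points = [x1, x2, x3]
    for _ in range(max_iter):
        values = [f(x) for x in points]
        # Fit p(x) = a*x^2 + b*x + c through the three current points
        a, b, c = np.polyfit(points, values, 2)
        if a <= 0:                          # collinear (or concave) fit: give up
            break
        x_p = -b / (2 * a)                  # the parabola's minimizer, -b / (2a)
        if min(abs(x_p - x) for x in points) < tol:
            return x_p                      # the new point barely moves: done
        # Discard the point with the largest function value, keep x_p instead
        points[values.index(max(values))] = x_p
    return min(points, key=f)               # best of the final three points

# Example (illustrative starting points): f(x) = 3x - ln(x), minimizer x* = 1/3
print(parabolic_interpolation(lambda x: 3 * x - np.log(x), 0.1, 0.5, 1.0))  # ~0.3333
```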

Descent methods

There is a family of optimization algorithms that can find a minimizer $x^*$ for a function $f$ by computing steps over several iterations that get closer to the minimum until it is reached, using the following rule:

$$x_{k+1} = x_k + \gamma \cdot p_k$$

where $k$ is the current iteration, $\gamma$ is the step size, and $p_k$ is the search direction. As their names imply, $p_k$ controls in which direction to move next, and $\gamma$ by how much. Both $\gamma$ and $p_k$ are usually chosen to make $f(x_{k+1}) < f(x_k)$, which makes the entire trajectory look like $x$ is descending with each step. Despite the name, the algorithm can be used to find maxima by adopting the opposite criterion.
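As an illustration of this rule, here is a minimal Python sketch of one common descent method, gradient descent (not covered by name in this topic), which takes $p_k = -f'(x_k)$ together with a fixed step size $\gamma$. The function name, step size, tolerance, and iteration cap are illustrative assumptions.

```python
def descent(df, x0, gamma=0.05, tol=1e-8, max_iter=10_000):
    """Descent rule x_{k+1} = x_k + gamma * p_k with p_k = -f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        x_new = x - gamma * df(x)     # step against the sign of the derivative
        if abs(x_new - x) < tol:      # stop once the steps become negligible
            return x_new
        x = x_new
    return x

# Example: f(x) = 3x - ln(x), f'(x) = 3 - 1/x, minimizer x* = 1/3
print(descent(lambda x: 3 - 1 / x, x0=0.5))  # ~0.3333
```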

Each parameter contributes to the method in its own way. If the step size is too large, new iterations may miss the minimum over and over again (sometimes indefinitely), and if the step size is too small, the number of iterations required for convergence may become too large. The search direction can be determined in various ways, but the derivative of the function is almost always involved, since it can tell us whether the function is increasing or decreasing at a given point.

In the next section, we will look at a concrete descent method in more detail: Newton's method for optimization.

Newton's method for optimization

Traditionally, Newton's method consists of applying the following rule iteratively:

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$

with $f'(x_k)$ being the derivative of a function $f$ evaluated at iteration $k$. Newton's method is used to determine the values at which the function becomes zero, i.e. the roots of the function.

Now, we know that at a minimum (or a maximum) the derivative of a function must be zero. Therefore, we must look for the values that make the derivative of the function equal to zero, i.e. the roots of the derivative. Thus, by applying the traditional Newton's method to the function's derivative instead of the function itself, we can find the roots of the derivative, which correspond to the minimizer (or maximizer) of the function:

$$x_{k+1} = x_k - \frac{f'(x_k)}{\left(f'(x_k)\right)'} \implies x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$$

with $f''(x_k)$ being the second derivative of $f$ evaluated at iteration $k$. This is known as Newton's method for optimization.

As an example, let's consider the function:

$$f(x) = 3x - \ln(x)$$

Its first and second derivatives are:

$$\begin{align*} f'(x) &= 3-\frac{1}{x} \\ f''(x) &= \frac{1}{x^2} \end{align*}$$

Then, the sequence obtained from Newton's method is defined by:

$$x_{k+1} = x_k - x_k^2 \cdot \left( 3 - \frac{1}{x_k} \right) = x_k + x_k - 3x_k^2 \implies x_{k+1} = 2x_k - 3x_k^2$$
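Here is a minimal Python sketch that applies this rule to the example above and reproduces the table that follows; the helper's name, the number of iterations, and the use of explicit lambdas for $f'$ and $f''$ are illustrative choices rather than part of the original text.

```python
import math

def newton_optimize(df, d2f, x0, n_iter=4):
    """Newton's method for optimization: x_{k+1} = x_k - f'(x_k) / f''(x_k)."""
    xs = [x0]
    for _ in range(n_iter):
        x = xs[-1]
        xs.append(x - df(x) / d2f(x))
    return xs

# f(x) = 3x - ln(x), f'(x) = 3 - 1/x, f''(x) = 1/x^2
f = lambda x: 3 * x - math.log(x)
for k, x in enumerate(newton_optimize(lambda x: 3 - 1 / x, lambda x: 1 / x**2, x0=0.1)):
    print(k, round(x, 3), round(f(x), 3))   # matches the table below
```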

If we start iterating from $x_0 = 0.1$, we will be within 0.001 units of the true minimizer $x^* = \frac{1}{3}$ by the fifth iteration:

| $k$ | $x_k$ | $f(x_k)$ |
|---|---|---|
| 0 | 0.100 | 2.603 |
| 1 | 0.170 | 2.282 |
| 2 | 0.253 | 2.133 |
| 3 | 0.314 | 2.100 |
| 4 | 0.332 | 2.099 |

Newton's method for optimization

However, we may get very different sequences for different values of $x_0$:

| $k$ | $x_k$ ($x_0 = 0.1$) | $x_k$ ($x_0 = 0.01$) | $x_k$ ($x_0 = 0.8$) |
|---|---|---|---|
| 0 | 0.100 | 0.010 | 0.800 |
| 1 | 0.170 | 0.020 | -0.320 |
| 2 | 0.253 | 0.038 | -0.947 |
| 3 | 0.314 | 0.072 | -4.586 |
| 4 | 0.332 | 0.129 | -72.265 |
| 5 | 0.333 | 0.208 | -15 844.269 |
| 6 | 0.333 | 0.286 | -7.500 × 10⁸ |
| 7 | 0.333 | 0.327 | -1.688 × 10¹⁸ |

As we can see, the sequence that corresponds to $x_0 = 0.01$ converges more slowly, whilst the one that corresponds to $x_0 = 0.8$ doesn't converge at all. This is because Newton's method is based on following a quadratic approximation of the objective function, which introduces a limited range of convergence. The closer the starting point is to the actual minimum, the more valid this approximation will be.

By comparing the general descent rule to Newton's method for optimization, we can see that the latter is a particular case of the former, in which:

$$\gamma \cdot p_k = - \frac{f'(x_k)}{f''(x_k)}$$

As we saw, Newton's method for optimization is sensitive to the choice of starting point, a characteristic it shares with other descent methods. In addition, the method fails when the second derivative is equal to zero, which may happen if an inflection point is reached at some iteration $k$.

Even though no algebraic manipulation of expressions is involved, note that this method requires computing the first and second derivatives at every iteration. As such, Newton's method for optimization usually converges very rapidly, but at a higher computational cost per iteration.

Conclusion

In this topic, we've learned about different ways in which computers solve optimization problems without using symbolic algebra. Namely, we've learned about:

  • The golden-section search method is a way to partition a given interval into sub-intervals until the optimal solution is found. The golden ratio $\varphi = \frac{1+\sqrt{5}}{2}$ gives the proportion by which we pass from an interval to its sub-intervals.
  • Parabolic interpolation is a way to minimize an objective function by repeatedly fitting parabolas through 3 different points of the function.
  • Descent methods reach the optimal solution by taking successive steps along a given search direction with a given step size.
  • A general descent method can be characterized by the formula $x_{k+1} = x_k + \gamma \cdot p_k$, where $p_k$ is responsible for the direction and $\gamma$ for the size of a step.
  • Newton's method for optimization is a particular case of a descent method. We approach the solution by using the formula $x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$.