
Least squares


In the world of science and engineering, you often encounter data that appears chaotic or noisy. However, behind this apparent confusion may lie fundamental patterns and relationships you can reveal using the techniques you’ve developed. In this topic, you will explore how the pseudoinverse becomes the driving engine behind the least squares method, an essential technique for fitting models to data.

You’ll take advantage of your mathematical skills to interpret the least squares problem in matrix form. Although the problem arises naturally in two dimensions, your knowledge of linear algebra will allow you to reformulate it in its most general context. You’ll develop a criterion for what counts as the best possible solution and see that the pseudoinverse delivers it, so all that remains is to reap the fruits of your effort. It’s time to move from theory to practice!

A cloud of points

Suppose you have n data points in the plane (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n). Suppose these data represent the money invested in advertising and the corresponding profits earned by various companies in some industry. A reasonable assumption would be that the more you invest in advertising, the more profits are generated, so you could start your research assuming that the relationship between the two has a linear trend:

Data points in the plane

In an ideal case, all the points are on the same line. As any line is determined by its slope m and intercept b, the problem translates into finding the numbers m and b satisfying the linear system of equations:

\begin{align*} y_1 &= mx_1 + b \\ y_2 &= mx_2 + b \\ &\vdots \\ y_n &= mx_n + b \end{align*}

But there are too many equations for only two unknowns, so it's unlikely that the system has a solution. Geometrically, this means there's no line that fits the raw data perfectly. The best line to describe this dataset isn't one that passes through every point but the one that best captures how the cloud is oriented, its increasing or decreasing pattern. Since y is expressed in terms of x, we say that x is a predictor and y is the target. Before thinking about a way to attack this problem, let's look at it in higher dimensions.
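Before moving on, here is a minimal NumPy sketch (the three points are made up for illustration) showing how quickly the system becomes unsolvable: the target vector fails to lie in the column space of the coefficient matrix, which you can detect by comparing matrix ranks.

```python
import numpy as np

# Three hypothetical points that are roughly, but not exactly, on a line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.1, 2.9])

# Coefficient matrix of the system y_i = m*x_i + b with unknowns (b, m).
A = np.column_stack([np.ones_like(x), x])

# The system is consistent only if appending y as a column does not raise the rank.
print(np.linalg.matrix_rank(A))                        # 2
print(np.linalg.matrix_rank(np.column_stack([A, y])))  # 3 -> no exact solution
```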

The general problem

Now, suppose you're given n data points in space (x_1, y_1, z_1), (x_2, y_2, z_2), \dots, (x_n, y_n, z_n). Your goal is to find a plane that best approximates all these points:

Data points in the space

Analogously to the two-dimensional problem, this poses a system of equations:

\begin{align*} z_1 &= m x_1 + l \, y_1 + b \\ z_2 &= m x_2 + l \, y_2 + b \\ &\vdots \\ z_n &= m x_n + l \, y_n + b \end{align*}

So, the problem reduces to finding the values of m, l, and b. When n is large, it's nearly impossible for the system to have a solution. Again, since z is expressed in terms of x and y, z is called the target while x and y are predictors.

You already know linear algebra, so you can work with the most general scenario involving many variables and linear equations. In general, you have n data points with p predictors \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p and a target \mathbf{y}.

Let x_{ij} be the i-th observation of the j-th predictor and y_i the i-th observation of the target for every 1 \leq i \leq n and 1 \leq j \leq p. This implies that the data points have the form (x_{11}, x_{12}, \dots, x_{1p}, y_1), (x_{21}, x_{22}, \dots, x_{2p}, y_2), \dots, (x_{n1}, x_{n2}, \dots, x_{np}, y_n). Under these conditions, the initial problem can be written as a linear system of equations:

\begin{align*} y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \dots + \beta_p x_{1p} \\ y_2 &= \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \dots + \beta_p x_{2p} \\ &\vdots \\ y_n &= \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \dots + \beta_p x_{np} \end{align*}

As in lower dimensions, it is very unlikely that such a system (where n is usually much larger than p) has any solution. But not everything is lost. Let's see how you can work around this.

An optimality criterion

You're working with a lot of variables and equations at the same time, and it's easy to get confused. Time to introduce some matrices!

X = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix} \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}

Thanks to them, the linear system becomes a simple matrix equation:

X \mathbf{\beta} = \mathbf{y}

So, for any proposed vector of solutions \hat{\beta}, you get an approximation \hat{\mathbf{y}} = X \hat{\beta} to \mathbf{y}. Since the system has no solution, it's clear that \hat{\mathbf{y}} \neq \mathbf{y}, so the distance \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert between both vectors is non-zero. You can think of this distance as an error associated with \hat{\beta}. In summary, for any \hat{\beta}, its associated estimation error is:

e(\hat{\beta}) = \lVert \mathbf{y} - X\hat{\beta} \rVert

The smaller the error, the better the approximation. This suggests that the best parameter vector \hat{\beta} is the one that minimizes e(\hat{\beta}) = \lVert \mathbf{y} - X\hat{\beta} \rVert. You might be wondering whether a unique vector with this property even exists and, worse, how you could find it. Don't worry: this is precisely the problem that the pseudoinverse solves!
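To make the criterion concrete before solving it, here is a minimal NumPy sketch with made-up numbers: it builds the matrix X with a leading column of ones and compares the error e(\hat{\beta}) of two candidate parameter vectors.

```python
import numpy as np

# Hypothetical data: one predictor x and a target y.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.1, 2.9, 4.2])

# Design matrix: a column of ones (for the intercept) next to the predictor.
X = np.column_stack([np.ones_like(x), x])

def error(beta):
    """Estimation error e(beta) = || y - X beta ||."""
    return np.linalg.norm(y - X @ beta)

print(error(np.array([1.0, 1.0])))  # ~0.245, a good candidate (b=1, m=1)
print(error(np.array([0.0, 1.0])))  # ~2.112, a worse candidate (b=0, m=1)
```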

The optimal line

Recall that you are given X and \mathbf{y}; think of them as fixed. Your goal is to compute \hat{\beta}. With it, you can estimate the target values as \hat{\mathbf{y}} = X \hat{\beta}.

The best approximation

Well, as you already know, the vector you're looking for is simply:

\hat{\mathbf{\beta}} = X^{\dag} \mathbf{y}

where X^{\dag} is the pseudoinverse of X. This is because \lVert X \hat{\mathbf{\beta}} - \mathbf{y} \rVert \leq \lVert X \mathbf{\beta} - \mathbf{y} \rVert for any other \beta. In our current problem, this means that e(\hat{\beta}) \leq e(\beta). Once you've computed the best parameter vector, you can estimate the target:

\hat{\mathbf{y}} = X X^{\dag} \mathbf{y} = X\hat{\mathbf{\beta}}

Before putting the theory to work, let's discuss one more point. The vector \hat{\mathbf{y}} is the closest to \mathbf{y} among all the vectors in the column space of X. In fact, \hat{\mathbf{y}} is the orthogonal projection of \mathbf{y} onto the column space of X. Thus, by multiplying the vector \mathbf{y} by the matrix X X^{\dag}, you obtain its projection. For this reason, H = X X^{\dag} is known as the projection matrix.
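In code, this whole recipe is short. Here is a minimal NumPy sketch that assumes X and \mathbf{y} are already available as arrays and relies on np.linalg.pinv for the pseudoinverse:

```python
import numpy as np

def least_squares_via_pinv(X, y):
    """Return (beta_hat, y_hat, H) for the overdetermined system X @ beta = y."""
    X_pinv = np.linalg.pinv(X)   # pseudoinverse of X
    beta_hat = X_pinv @ y        # optimal parameter vector
    H = X @ X_pinv               # projection matrix onto the column space of X
    y_hat = H @ y                # the estimate; equivalently X @ beta_hat
    return beta_hat, y_hat, H
```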

A few remarks on how this compares with the standard, calculus-based derivation of least squares:

  • In the standard approach, the optimization is carried out with derivatives: you set the partial derivatives of the error function to zero and then use a second-derivative condition to confirm the minimum, which is tedious.

  • Furthermore, in the standard approach it is not the error itself that is minimized but the squared error; this is done because the squared error is easier to differentiate, and both lead to the same minimizer.

  • For the standard approach to work, it is usually assumed that the columns of X are linearly independent. In these circumstances, the pseudoinverse reduces to (X^{T} X)^{-1} X^{T} and the projection matrix to X (X^{T} X)^{-1} X^{T}. When you have many predictors, this assumption can easily fail, and other problems arise. Our strategy doesn't suffer from this problem, and the projection matrix H = X X^{\dag} is much easier to remember (a short comparison sketch follows this list).

  • The least squares method is usually derived with calculus, but here we're introducing an alternative, more straightforward way of solving the problem. In statistics, this formulation also lets you go deeper: you can estimate other parameters, perform hypothesis tests, and evaluate the quality of fit, all in a unified way.
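To back up the point about linearly independent columns, here is a short comparison sketch (the numbers are made up): when X has full column rank, the normal-equations formula (X^{T} X)^{-1} X^{T} \mathbf{y} from the calculus-based approach and the pseudoinverse produce the same parameter vector.

```python
import numpy as np

# Made-up design matrix with linearly independent columns, and a target vector.
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 2.1, 2.9, 4.2])

beta_pinv = np.linalg.pinv(X) @ y                # pseudoinverse route
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations; needs independent columns
print(np.allclose(beta_pinv, beta_normal))       # True
```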

The best line

Let's start off with a simple example in two dimensions. Suppose the data points are (-1, 3), (0, 1), (1, -1) and (2, 2). In order to find the best estimation, the first step is to identify the data matrix, the target values, and the parameters:

X = \begin{pmatrix} 1 & -1 \\ 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} \qquad \mathbf{y} = \begin{pmatrix} 3 \\ 1 \\ -1 \\ 2 \end{pmatrix} \qquad \beta = \begin{pmatrix} b \\ m \end{pmatrix}

So, you're trying to find the closest thing to a solution for the system X \mathbf{\beta} = \mathbf{y}. Here, the key step is to compute the pseudoinverse of X. Although it isn't necessary, you can also calculate the projection matrix:
X^{\dag} = \frac{1}{10} \begin{pmatrix} 4 & 3 & 2 & 1 \\ -3 & -1 & 1 & 3 \end{pmatrix} \qquad H = X X^{\dag} = \frac{1}{10} \begin{pmatrix} 7 & 4 & 1 & -2 \\ 4 & 3 & 2 & 1 \\ 1 & 2 & 3 & 4 \\ -2 & 1 & 4 & 7 \end{pmatrix}

Now you have everything to calculate the parameter vector \hat{\beta} and the corresponding estimate of \mathbf{y}:

\hat{\beta} = X^{\dag}\mathbf{y} = \frac{1}{2} \begin{pmatrix} 3 \\ -1 \end{pmatrix} \qquad \hat{\mathbf{y}} = X\hat{\beta} = \frac{1}{2} \begin{pmatrix} 4 \\ 3 \\ 2 \\ 1 \end{pmatrix}

In this case, the error is e(\hat{\beta}) = \lVert \mathbf{y} - X\hat{\beta} \rVert \approx 2.73861. The error associated with any other vector \beta is greater than or equal to this value. Geometrically, the slope of the best line is -\frac{1}{2} and its intercept is \frac{3}{2}.

Fitting the best line
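You can verify all these numbers with a few lines of NumPy:

```python
import numpy as np

X = np.array([[1.0, -1.0],
              [1.0,  0.0],
              [1.0,  1.0],
              [1.0,  2.0]])
y = np.array([3.0, 1.0, -1.0, 2.0])

X_pinv = np.linalg.pinv(X)
beta_hat = X_pinv @ y              # [ 1.5 -0.5] -> intercept 3/2, slope -1/2
y_hat = X @ beta_hat               # [2.  1.5  1.  0.5]
error = np.linalg.norm(y - y_hat)  # ~2.73861
print(beta_hat, y_hat, error)
```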

The best plane

Now, let's move on to a slightly more realistic example. There is more than one predictor, and the pseudoinverse doesn't look pretty. The data points are (-3, -6, 16), (1, -4, 13), (0, 1, -3), (2, 5, -13) and (5, 7, -18). As before, the required pieces are:

X = \begin{pmatrix} 1 & -3 & -6 \\ 1 & 1 & -4 \\ 1 & 0 & 1 \\ 1 & 2 & 5 \\ 1 & 5 & 7 \end{pmatrix} \qquad \mathbf{y} = \begin{pmatrix} 16 \\ 13 \\ -3 \\ -13 \\ -18 \end{pmatrix} \qquad \beta = \begin{pmatrix} b \\ m \\ l \end{pmatrix}

The corresponding pseudoinverse is:

X^{\dag} = \begin{pmatrix} 0.317 & 0.0539 & 0.294 & 0.258 & 0.0764 \\ -0.117 & 0.230 & -0.132 & -0.108 & 0.127 \\ -0.000357 & -0.140 & 0.0621 & 0.0835 & -0.00571 \end{pmatrix}

This implies that the projection matrix is:

H = X X^{\dag} = \begin{pmatrix} 0.671 & 0.202 & 0.317 & 0.0814 & -0.271 \\ 0.202 & 0.842 & -0.0857 & -0.184 & 0.226 \\ 0.317 & -0.0857 & 0.357 & 0.342 & 0.0707 \\ 0.0814 & -0.184 & 0.342 & 0.459 & 0.302 \\ -0.271 & 0.226 & 0.0707 & 0.302 & 0.672 \end{pmatrix}

Finally, the parameter vector and the corresponding best approximation are:

\hat{\beta} = X^{\dag}\mathbf{y} = \begin{pmatrix} 0.165 \\ 0.629 \\ -2.99 \end{pmatrix} \qquad \hat{\mathbf{y}} = X\hat{\beta} = \begin{pmatrix} 16.2 \\ 12.7 \\ -2.82 \\ -13.5 \\ -17.6 \end{pmatrix}

Fitting the best plane
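As before, a few lines of NumPy reproduce the fitted plane:

```python
import numpy as np

X = np.array([[1.0, -3.0, -6.0],
              [1.0,  1.0, -4.0],
              [1.0,  0.0,  1.0],
              [1.0,  2.0,  5.0],
              [1.0,  5.0,  7.0]])
y = np.array([16.0, 13.0, -3.0, -13.0, -18.0])

beta_hat = np.linalg.pinv(X) @ y  # ~[ 0.165  0.629 -2.99 ]
y_hat = X @ beta_hat              # ~[ 16.2  12.7  -2.82 -13.5 -17.6 ]
print(np.round(beta_hat, 3))
print(np.round(y_hat, 2))
```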

Conclusion

  • In the least squares problem, you have n data points of p predictors and a target variable: (x_{11}, x_{12}, \dots, x_{1p}, y_1), (x_{21}, x_{22}, \dots, x_{2p}, y_2), \dots, (x_{n1}, x_{n2}, \dots, x_{np}, y_n).

  • Your goal is to find a linear function describing your data. The closest thing to a solution of the linear system X \mathbf{\beta} = \mathbf{y} is the instrument you use to reach this goal.

  • The best solution is the vector \hat{\beta} whose error e(\hat{\beta}) = \lVert \mathbf{y} - X\hat{\beta} \rVert is as small as possible.

  • The best solution is given by \hat{\mathbf{\beta}} = X^{\dag} \mathbf{y}, and the estimation for the target is \hat{\mathbf{y}} = X\hat{\mathbf{\beta}}; a compact NumPy recap of the whole recipe follows below.
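Putting the whole recipe into one place, here is a compact sketch (the helper name fit_linear_model is just an illustrative choice, not from the lesson): it assembles the design matrix from raw predictor columns, applies the pseudoinverse, and returns the fitted parameters and targets.

```python
import numpy as np

def fit_linear_model(predictors, y):
    """Hypothetical helper: least squares via the pseudoinverse.

    predictors -- array of shape (n, p), one column per predictor
    y          -- array of shape (n,), the target
    """
    X = np.column_stack([np.ones(len(y)), predictors])  # prepend the intercept column
    beta_hat = np.linalg.pinv(X) @ y                     # best parameter vector
    y_hat = X @ beta_hat                                 # estimated target
    return beta_hat, y_hat

# Usage with the plane-fitting data from the example above:
predictors = np.array([[-3.0, -6.0], [1.0, -4.0], [0.0, 1.0], [2.0, 5.0], [5.0, 7.0]])
y = np.array([16.0, 13.0, -3.0, -13.0, -18.0])
print(fit_linear_model(predictors, y)[0])  # ~[ 0.165  0.629 -2.99 ]
```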
