
Least squares


In the world of science and engineering, you often encounter data that appears chaotic or noisy. However, behind this apparent confusion may lie fundamental patterns and relationships you can reveal using the techniques you’ve developed. In this topic, you will explore how the pseudoinverse becomes the driving engine behind the least squares method, an essential technique for fitting models to data.

You’ll take advantage of your mathematical skills to interpret the least squares problem in matrix form. Although the problem arises naturally in two dimensions, your knowledge of linear algebra will allow you to reformulate it in its most general context. You’ll develop a criterion for what counts as the best possible solution and see that the pseudoinverse delivers it, so all that remains is to reap the fruits of your effort. It’s time to move from theory to practice!

A cloud of points

Suppose you have n data points in the plane (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n). Suppose these data represent the money invested in advertising and the corresponding profits earned by various companies in some industry. A reasonable assumption would be that the more you invest in advertising, the more profits are generated, so you could start your research assuming that the relationship between the two has a linear trend:

Data points in the plane

In an ideal case, all the points are on the same line. As any line is determined by its slope m and intercept b, the problem translates into finding the numbers m and b satisfying the linear system of equations:

\begin{align*} y_1 &= mx_1 + b \\ y_2 &= mx_2 + b \\ &\vdots \\ y_n &= mx_n + b \end{align*}

But there are too many equations for only two unknowns, so it's unlikely that the system has a solution. Geometrically, this means there's no line that fits the raw data perfectly. The best line to describe this dataset isn't one that passes through every point but the one that best captures how the cloud is oriented, its increasing or decreasing pattern. Since y is expressed in terms of x, we say that x is a predictor and y is the target. Before thinking about a way to attack this problem, let's look at it in higher dimensions.
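Before moving on, here is a minimal NumPy sketch (the three points are made up for illustration) showing how quickly the system becomes unsolvable: the target vector fails to lie in the column space of the coefficient matrix, which you can detect by comparing matrix ranks.

```python
import numpy as np

# Three hypothetical points that are roughly, but not exactly, on a line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.1, 2.9])

# Coefficient matrix of the system y_i = m*x_i + b with unknowns (b, m).
A = np.column_stack([np.ones_like(x), x])

# The system is consistent only if appending y as a column does not raise the rank.
print(np.linalg.matrix_rank(A))                        # 2
print(np.linalg.matrix_rank(np.column_stack([A, y])))  # 3 -> no exact solution
```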

The general problem

Now, suppose you're given n data points in space (x_1, y_1, z_1), (x_2, y_2, z_2), \dots, (x_n, y_n, z_n). Your goal is to find a plane that best approximates all these points:

Data points in the space

Analogously to the two-dimensional problem, this poses a system of equations:

\begin{align*} z_1 &= m x_1 + l \, y_1 + b \\ z_2 &= m x_2 + l \, y_2 + b \\ &\vdots \\ z_n &= m x_n + l \, y_n + b \end{align*}

So, the problem reduces to finding the values of m, l, and b. When n is large, it's nearly impossible for the system to have a solution. Again, since z is expressed in terms of x and y, z is called the target while x and y are predictors.

You already know linear algebra, so you can work with the most general scenario involving many variables and linear equations. In general, you have n data points with p predictors \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p and a target \mathbf{y}.

Let x_{ij} be the i-th observation of the j-th predictor and y_i the i-th observation of the target for every 1 \leq i \leq n and 1 \leq j \leq p. This implies that the data points have the form (x_{11}, x_{12}, \dots, x_{1p}, y_1), (x_{21}, x_{22}, \dots, x_{2p}, y_2), \dots, (x_{n1}, x_{n2}, \dots, x_{np}, y_n). Under these conditions, the initial problem can be written as a linear system of equations:

\begin{align*} y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \dots + \beta_p x_{1p} \\ y_2 &= \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \dots + \beta_p x_{2p} \\ &\vdots \\ y_n &= \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \dots + \beta_p x_{np} \end{align*}

As in lower dimensions, it is very unlikely that such a system (where n is usually much larger than p) has any solution. But not everything is lost. Let's see how you can work around this.

An optimality criterion

You're working with a lot of variables and equations at the same time, and it's easy to get confused. Time to introduce some matrices!

X = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix} \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}

Thanks to them, the linear system becomes a simple matrix equation:

X \mathbf{\beta} = \mathbf{y}

So, for any proposed vector of solutions \hat{\beta}, you get an approximation \hat{\mathbf{y}} = X \hat{\beta} to \mathbf{y}. Since the system has no solution, it's clear that \hat{\mathbf{y}} \neq \mathbf{y}, so the distance \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert between both vectors is non-zero. You can think of this distance as an error associated with \hat{\beta}. In summary, for any \hat{\beta}, its associated estimation error is:

e(\hat{\beta}) = \lVert \mathbf{y} - X\hat{\beta} \rVert

The smaller the error, the better the approximation. This suggests that the best parameter vector \hat{\beta} is the one that minimizes e(\hat{\beta}) = \lVert \mathbf{y} - X\hat{\beta} \rVert. You might be wondering whether a unique vector with this property even exists and, worse, how you could find it. Don't worry: this is precisely the problem that the pseudoinverse solves!
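To make the criterion concrete before solving it, here is a minimal NumPy sketch with made-up numbers: it builds the matrix X with a leading column of ones and compares the error e(\hat{\beta}) of two candidate parameter vectors.

```python
import numpy as np

# Hypothetical data: one predictor x and a target y.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.1, 2.9, 4.2])

# Design matrix: a column of ones (for the intercept) next to the predictor.
X = np.column_stack([np.ones_like(x), x])

def error(beta):
    """Estimation error e(beta) = || y - X beta ||."""
    return np.linalg.norm(y - X @ beta)

print(error(np.array([1.0, 1.0])))  # ~0.245, a good candidate (b=1, m=1)
print(error(np.array([0.0, 1.0])))  # ~2.112, a worse candidate (b=0, m=1)
```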

The optimal line

Recall that you are given X and \mathbf{y}; think of them as fixed. Your goal is to compute \hat{\beta}. With it, you can estimate the target values as \hat{\mathbf{y}} = X \hat{\beta}.

The best approximation

Well, as you already know, the vector you're looking for is simply:

\hat{\mathbf{\beta}} = X^{\dag} \mathbf{y}

where X^{\dag} is the pseudoinverse of X. This is because \lVert X \hat{\mathbf{\beta}} - \mathbf{y} \rVert \leq \lVert X \mathbf{\beta} - \mathbf{y} \rVert for any other \beta. In our current problem, this means that e(\hat{\beta}) \leq e(\beta). Once you've computed the best parameter vector, you can estimate the target:

\hat{\mathbf{y}} = X X^{\dag} \mathbf{y} = X\hat{\mathbf{\beta}}

Before putting the theory to work, let's discuss one more point. The vector \hat{\mathbf{y}} is the closest to \mathbf{y} among all the vectors in the column space of X. In fact, \hat{\mathbf{y}} is the orthogonal projection of \mathbf{y} onto the column space of X. Thus, by multiplying the vector \mathbf{y} by the matrix X X^{\dag}, you obtain its projection. For this reason, H = X X^{\dag} is known as the projection matrix.
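In code, this whole recipe is short. Here is a minimal NumPy sketch that assumes X and \mathbf{y} are already available as arrays and relies on np.linalg.pinv for the pseudoinverse:

```python
import numpy as np

def least_squares_via_pinv(X, y):
    """Return (beta_hat, y_hat, H) for the overdetermined system X @ beta = y."""
    X_pinv = np.linalg.pinv(X)   # pseudoinverse of X
    beta_hat = X_pinv @ y        # optimal parameter vector
    H = X @ X_pinv               # projection matrix onto the column space of X
    y_hat = H @ y                # the estimate; equivalently X @ beta_hat
    return beta_hat, y_hat, H
```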

A few remarks on how this compares with the standard, calculus-based derivation of least squares:

  • In the standard approach, the optimization is carried out with derivatives: you set the partial derivatives of the error function to zero and then use a second-derivative condition to confirm the minimum, which is tedious.

  • Furthermore, in the standard approach it is not the error itself that is minimized but the squared error; this is done because the squared error is easier to differentiate, and both lead to the same minimizer.

  • For the standard approach to work, it is usually assumed that the columns of X are linearly independent. In these circumstances, the pseudoinverse reduces to (X^{T} X)^{-1} X^{T} and the projection matrix to X (X^{T} X)^{-1} X^{T}. When you have many predictors, this assumption can easily fail, and other problems arise. Our strategy doesn't suffer from this problem, and the projection matrix H = X X^{\dag} is much easier to remember (a short comparison sketch follows this list).

  • The least squares method is usually derived with calculus, but here we're introducing an alternative, more straightforward way of solving the problem. In statistics, this formulation also lets you go deeper: you can estimate other parameters, perform hypothesis tests, and evaluate the quality of fit, all in a unified way.
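To back up the point about linearly independent columns, here is a short comparison sketch (the numbers are made up): when X has full column rank, the normal-equations formula (X^{T} X)^{-1} X^{T} \mathbf{y} from the calculus-based approach and the pseudoinverse produce the same parameter vector.

```python
import numpy as np

# Made-up design matrix with linearly independent columns, and a target vector.
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 2.1, 2.9, 4.2])

beta_pinv = np.linalg.pinv(X) @ y                # pseudoinverse route
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations; needs independent columns
print(np.allclose(beta_pinv, beta_normal))       # True
```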

The best line

Let's start off with a simple example in two dimensions. Suppose the data points are (-1, 3), (0, 1), (1, -1) and (2, 2). In order to find the best estimation, the first step is to identify the data matrix, the target values, and the parameters:

X = \begin{pmatrix} 1 & -1 \\ 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} \qquad \mathbf{y} = \begin{pmatrix} 3 \\ 1 \\ -1 \\ 2 \end{pmatrix} \qquad \beta = \begin{pmatrix} b \\ m \end{pmatrix}

So, you're trying to find the closest thing to a solution for the system X \mathbf{\beta} = \mathbf{y}. Here, the key step is to compute the pseudoinverse of X. Although it isn't necessary, you can also calculate the projection matrix:
X^{\dag} = \frac{1}{10} \begin{pmatrix} 4 & 3 & 2 & 1 \\ -3 & -1 & 1 & 3 \end{pmatrix} \qquad H = X X^{\dag} = \frac{1}{10} \begin{pmatrix} 7 & 4 & 1 & -2 \\ 4 & 3 & 2 & 1 \\ 1 & 2 & 3 & 4 \\ -2 & 1 & 4 & 7 \end{pmatrix}

Now you have everything to calculate the parameter vector \hat{\beta} and the corresponding estimate of \mathbf{y}:

\hat{\beta} = X^{\dag}\mathbf{y} = \frac{1}{2} \begin{pmatrix} 3 \\ -1 \end{pmatrix} \qquad \hat{\mathbf{y}} = X\hat{\beta} = \frac{1}{2} \begin{pmatrix} 4 \\ 3 \\ 2 \\ 1 \end{pmatrix}

In this case, the error is e(\hat{\beta}) = \lVert \mathbf{y} - X\hat{\beta} \rVert \approx 2.73861. The error associated with any other vector \beta is greater than or equal to this value. Geometrically, the slope of the best line is -\frac{1}{2} and its intercept is \frac{3}{2}.

Fitting the best line
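You can verify all these numbers with a few lines of NumPy:

```python
import numpy as np

X = np.array([[1.0, -1.0],
              [1.0,  0.0],
              [1.0,  1.0],
              [1.0,  2.0]])
y = np.array([3.0, 1.0, -1.0, 2.0])

X_pinv = np.linalg.pinv(X)
beta_hat = X_pinv @ y              # [ 1.5 -0.5] -> intercept 3/2, slope -1/2
y_hat = X @ beta_hat               # [2.  1.5  1.  0.5]
error = np.linalg.norm(y - y_hat)  # ~2.73861
print(beta_hat, y_hat, error)
```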

The best plane

Now, let's move on to a slightly more realistic example. There is more than one predictor, and the pseudoinverse doesn't look pretty. The data points are (-3, -6, 16), (1, -4, 13), (0, 1, -3), (2, 5, -13) and (5, 7, -18). As before, the required pieces are:

X = \begin{pmatrix} 1 & -3 & -6 \\ 1 & 1 & -4 \\ 1 & 0 & 1 \\ 1 & 2 & 5 \\ 1 & 5 & 7 \end{pmatrix} \qquad \mathbf{y} = \begin{pmatrix} 16 \\ 13 \\ -3 \\ -13 \\ -18 \end{pmatrix} \qquad \beta = \begin{pmatrix} b \\ m \\ l \end{pmatrix}

The corresponding pseudoinverse is:

X^{\dag} = \begin{pmatrix} 0.317 & 0.0539 & 0.294 & 0.258 & 0.0764 \\ -0.117 & 0.230 & -0.132 & -0.108 & 0.127 \\ -0.000357 & -0.140 & 0.0621 & 0.0835 & -0.00571 \end{pmatrix}

This implies that the projection matrix is:

H = X X^{\dag} = \begin{pmatrix} 0.671 & 0.202 & 0.317 & 0.0814 & -0.271 \\ 0.202 & 0.842 & -0.0857 & -0.184 & 0.226 \\ 0.317 & -0.0857 & 0.357 & 0.342 & 0.0707 \\ 0.0814 & -0.184 & 0.342 & 0.459 & 0.302 \\ -0.271 & 0.226 & 0.0707 & 0.302 & 0.672 \end{pmatrix}

Finally, the parameter vector and the corresponding best approximation are:

\hat{\beta} = X^{\dag}\mathbf{y} = \begin{pmatrix} 0.165 \\ 0.629 \\ -2.99 \end{pmatrix} \qquad \hat{\mathbf{y}} = X\hat{\beta} = \begin{pmatrix} 16.2 \\ 12.7 \\ -2.82 \\ -13.5 \\ -17.6 \end{pmatrix}

Fitting the best plane
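As before, a few lines of NumPy reproduce the fitted plane:

```python
import numpy as np

X = np.array([[1.0, -3.0, -6.0],
              [1.0,  1.0, -4.0],
              [1.0,  0.0,  1.0],
              [1.0,  2.0,  5.0],
              [1.0,  5.0,  7.0]])
y = np.array([16.0, 13.0, -3.0, -13.0, -18.0])

beta_hat = np.linalg.pinv(X) @ y  # ~[ 0.165  0.629 -2.99 ]
y_hat = X @ beta_hat              # ~[ 16.2  12.7  -2.82 -13.5 -17.6 ]
print(np.round(beta_hat, 3))
print(np.round(y_hat, 2))
```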

Conclusion

  • In the least squares problem, you have n data points of p predictors and a target variable: (x_{11}, x_{12}, \dots, x_{1p}, y_1), (x_{21}, x_{22}, \dots, x_{2p}, y_2), \dots, (x_{n1}, x_{n2}, \dots, x_{np}, y_n).

  • Your goal is to find a linear function describing your data. The closest thing to a solution of the linear system X \mathbf{\beta} = \mathbf{y} is the instrument you use to reach this goal.

  • The best solution is the vector \hat{\beta} whose error e(\hat{\beta}) = \lVert \mathbf{y} - X\hat{\beta} \rVert is as small as possible.

  • The best solution is given by \hat{\mathbf{\beta}} = X^{\dag} \mathbf{y}, and the estimation for the target is \hat{\mathbf{y}} = X\hat{\mathbf{\beta}}; a compact NumPy recap of the whole recipe follows below.
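Putting the whole recipe into one place, here is a compact sketch (the helper name fit_linear_model is just an illustrative choice, not from the lesson): it assembles the design matrix from raw predictor columns, applies the pseudoinverse, and returns the fitted parameters and targets.

```python
import numpy as np

def fit_linear_model(predictors, y):
    """Hypothetical helper: least squares via the pseudoinverse.

    predictors -- array of shape (n, p), one column per predictor
    y          -- array of shape (n,), the target
    """
    X = np.column_stack([np.ones(len(y)), predictors])  # prepend the intercept column
    beta_hat = np.linalg.pinv(X) @ y                     # best parameter vector
    y_hat = X @ beta_hat                                 # estimated target
    return beta_hat, y_hat

# Usage with the plane-fitting data from the example above:
predictors = np.array([[-3.0, -6.0], [1.0, -4.0], [0.0, 1.0], [2.0, 5.0], [5.0, 7.0]])
y = np.array([16.0, 13.0, -3.0, -13.0, -18.0])
print(fit_linear_model(predictors, y)[0])  # ~[ 0.165  0.629 -2.99 ]
```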
