
As you already know, simple linear regression models a linear dependency between the target, or dependent variable, and a single input, or independent variable. For example, you can use such a model to predict what salary $\hat{y}$ an employee will get based on their years of work experience $x$:

$$\hat{y} = w_0 + w_1x$$

However, we might want to include more than one input variable in our model. For example, taking into account additional factors such as work area, current position, number of successful projects, etc., can help us obtain more accurate predictions.

In fact, you will want to include more than one predictor in the vast majority of real-world problems. So how do we generalize our simple linear regression model above to be able to do so?

In this topic, you will get familiar with multiple linear regression (MLR), a linear regression model with multiple independent variables.

Definition

Unlike in the simple linear regression case, multiple linear regression models the value of the target variable $y$ as a linear combination of the values of multiple input variables $x_1, x_2, \dots, x_n$:

$$\hat{y} = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$$

Let's go over all the terms here:

  • $\hat{y}$ is the predicted value of our target, also called a dependent variable.
    In the salary prediction example, that would be the predicted salary of an employee.

  • $x_1, x_2, \dots, x_n$ are the values of our predictors, or independent variables.
    In our example, those can be the employee's years of experience, age, number of people in direct supervision, etc.

  • $w_0, w_1, \dots, w_n$ are numeric coefficients, or the weights of the model.
    Those weights are estimated from the data. To fit a linear regression model means to find the optimal values of the model weights $w_0, w_1, \dots, w_n$.

You can see that simple linear regression is nothing more than a special case of the multiple linear regression model with only one predictor. In practice, the term 'linear regression' is often used to refer to any such model irrespective of the number of inputs.
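To make the formula concrete, here is a minimal Python sketch of how such a prediction could be computed for a single employee. The weight values used here are made up purely for illustration:

```python
def predict(x, w):
    """Compute w0 + w1*x1 + ... + wn*xn for one observation x = (x1, ..., xn)."""
    return w[0] + sum(w_j * x_j for w_j, x_j in zip(w[1:], x))

# Hypothetical weights: intercept, years of experience, people in supervision
w = [30.0, 2.5, 1.2]
print(predict([5, 8], w))  # 30 + 2.5*5 + 1.2*8 = 52.1
```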

A more compact notation

Using a bit of linear algebra, we can rewrite the linear regression equation above in a more compact way.

Let's collect the values of the inputs, $x_1, x_2, \dots, x_n$, in a single vector $\overrightarrow{\bf{x}} = (x_1, x_2, \dots, x_n)$. This vector contains a full description of our employee.

Similarly, we can also introduce a vector of the model's coefficients $\overrightarrow{\bf{w}} = (w_1, w_2, \dots, w_n)^{\top}$. Note that for now we have deliberately left out the intercept $w_0$.

Then, the linear regression equation for the employee's salary $\hat{y}$ can be written as the sum of the intercept and the scalar product of the vectors $\overrightarrow{\bf{x}}$ and $\overrightarrow{\bf{w}}$:

$$\hat{y} = \langle \overrightarrow{\bf{x}}, \overrightarrow{\bf{w}} \rangle + w_0$$

Cool, right? To make things even more convenient, we can introduce an auxiliary input variable $x_0 = 1$. Then, if $\overrightarrow{\bf{x}} = (x_0=1, x_1, \dots, x_n)$ and $\overrightarrow{\bf{w}} = (w_0, w_1, \dots, w_n)^{\top}$, the equation for $\hat{y}$ can be written just as a scalar product of the two vectors:

$$\hat{y} = \langle \overrightarrow{\bf{x}}, \overrightarrow{\bf{w}} \rangle$$

To make sure that the equation remains the same, just calculate the scalar product above:

$$\hat{y} = \langle \overrightarrow{\bf{x}}, \overrightarrow{\bf{w}} \rangle = (1, x_1, \dots, x_n) \cdot \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{pmatrix} = w_0 + w_1x_1 + \dots + w_nx_n$$
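Here is the same idea in NumPy, again with illustrative numbers: once the auxiliary 1 is prepended to the input vector, the whole prediction is a single dot product.

```python
import numpy as np

w = np.array([30.0, 2.5, 1.2])   # (w0, w1, w2)
x = np.array([1.0, 5.0, 8.0])    # (x0 = 1, x1, x2), note the auxiliary 1

y_hat = x @ w                    # <x, w> = w0*1 + w1*x1 + w2*x2
print(y_hat)                     # 52.1, same as the explicit sum
```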

Neat! Now, let's move to fitting a multiple linear regression model.

Matrix notation

How do we estimate the optimal values of those coefficients $w_0, w_1, \dots, w_n$ from the data?

Imagine that we have collected data about $m$ employees in our company. For every employee $i = 1, \dots, m$, we know their salary, which we can denote as $y_i$. We can collect all those $m$ salaries in one column vector $\overrightarrow{\bf{y}}$:

$$\overrightarrow{\bf{y}} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$$

Besides, for each of the $m$ employees, we know the values of the $n+1$ input variables ($n$ "real" inputs and one auxiliary input variable whose value is always 1, remember?). We can denote the value of the input variable $j = 0, \dots, n$ of the employee $i = 1, \dots, m$ as $x_{ij}$.

Naturally, we can represent this data in the form of a matrix:

$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1n} \\ 1 & x_{21} & x_{22} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \dots & x_{mn} \end{pmatrix}$$

Here, each row $\overrightarrow{\bf{x}}_i = (1, x_{i1}, x_{i2}, \dots, x_{in})$ corresponds to a particular employee, and there are $m$ rows in total. Each column, in turn, corresponds to a particular input variable. There are $n+1$ columns in total, with the first column consisting entirely of 1's because it corresponds to the auxiliary feature we introduced for convenience.
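As a small illustration, this is how such a design matrix could be assembled in NumPy; the feature values here are the same tiny dataset used in the toy example later in this topic.

```python
import numpy as np

# Raw features: one row per employee, one column per input variable
features = np.array([
    [2.0, 1.0],   # years of experience, people in supervision
    [2.0, 0.0],
    [3.0, 4.0],
])
m = features.shape[0]

# Prepend the auxiliary column of ones so the intercept w0 is handled uniformly
X = np.column_stack([np.ones(m), features])
print(X)   # shape (m, n + 1)
```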

Finally, let's introduce a column vector of model coefficients that are to be estimated:

$$\overrightarrow{\bf{w}} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{pmatrix}$$

If we take the scalar product of the $i$-th row of the matrix $X$ with the coefficients vector $\overrightarrow{\bf{w}}$, we will get the following sum. Does it remind you of something? Right, this is how our model estimates the salary of the $i$-th employee $y_i$:

$$\hat{y}_i = w_0 + w_1x_{i1} + w_2x_{i2} + \dots + w_nx_{in}$$

Unless there is a perfect linear dependency between the employee's salary and our model's inputs, the estimate $\hat{y}_i$ might be different from the actual salary $y_i$. That is, there is a prediction error, or model residual, $e_i$:

$$y_i = \hat{y}_i + e_i = w_0 + w_1x_{i1} + w_2x_{i2} + \dots + w_nx_{in} + e_i$$

Let's introduce the residual vector $\overrightarrow{\bf{e}}$:

$$\overrightarrow{\bf{e}} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_m \end{pmatrix}$$

Then, we can compactly write all $m$ equations for the employees' salaries, just like the one above, in the form of a single matrix equation:

$$\overrightarrow{\bf{y}} = X\overrightarrow{\bf{w}} + \overrightarrow{\bf{e}}$$

Why did we do all that? Oh, right, we want to find the optimal values of the model coefficients $w_0, w_1, \dots, w_n$. How? We want to choose those values in such a way that all the target values $y_1, \dots, y_m$ are predicted as accurately as possible or, in other words, we want to keep the residuals $e_1, \dots, e_m$ small.
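In code, the matrix equation above amounts to a single matrix-vector product. The sketch below uses the toy data from this topic and an arbitrary candidate weight vector just to show how the residuals appear:

```python
import numpy as np

X = np.array([[1., 2., 1.],
              [1., 2., 0.],
              [1., 3., 4.]])
y = np.array([2., 1., 4.])
w = np.array([1.0, 0.5, 0.5])   # arbitrary candidate weights, not the optimal ones

y_hat = X @ w                   # predictions for all m employees at once
e = y - y_hat                   # residual vector
print(y_hat)                    # [2.5 2.  4.5]
print(e)                        # [-0.5 -1.  -0.5]
```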

Least squares method

Just like in the simple linear regression case, the coefficients of the multiple linear regression model are found via the least squares method.

We want to minimize the sum of the squares of the differences between the real values of the target variable (denoted by $y_i$) and the predicted ones (denoted by $\hat{y}_i$):

$$Err = \sum_{i=1}^m e_i^2 = \sum_{i=1}^m (y_i - \hat{y}_i)^2 = \sum_{i=1}^m \left(y_i - \langle \overrightarrow{\bf{x}}_i, \overrightarrow{\bf{w}} \rangle \right)^2 \to \min$$

We can rewrite this in matrix form:

$$Err = \sum_{i=1}^m e_i^2 = \overrightarrow{\bf{e}}^\top \overrightarrow{\bf{e}} = (\overrightarrow{\bf{y}} - X\overrightarrow{\bf{w}})^\top (\overrightarrow{\bf{y}} - X\overrightarrow{\bf{w}}) \to \min$$

Solving the minimization problem above gives us an exact formula for the optimal coefficients. Here it is:

$$\overrightarrow{\bf{w}} = (X^\top X)^{-1}X^\top \overrightarrow{\bf{y}}$$

To use this formula, you must be sure that all columns of $X$ are linearly independent; otherwise, the matrix $(X^\top X)^{-1}$ will not exist. In practice, this means that the input variables of the model must not be linearly dependent.
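A minimal NumPy sketch of this closed-form solution is shown below. Note that explicitly inverting $X^\top X$ is numerically fragile when the columns of $X$ are close to linearly dependent, so in practice a least-squares solver is usually preferred:

```python
import numpy as np

def fit_ols(X, y):
    """Closed-form least squares: w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

def fit_ols_stable(X, y):
    """Same estimate via a numerically more stable least-squares solver."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```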

Toy example

Let's now apply everything we've just learned about building a multiple linear regression model to solve a tiny problem below.

Imagine that you want to apply for a job at HyperMarket. You have 5 years of experience, and the job implies leading a group of 8 people. What kind of salary could you expect?

You decide to take a scientific approach and model the salary of HyperMarket employees based on their years of experience and the number of people in supervision:

$$\hat{y} = w_0 + w_1x_1 + w_2x_2$$

We have data about $m=3$ employees:

Employee ID | Years of experience | People in supervision | Salary
1           | 2                   | 1                     | 2
2           | 2                   | 0                     | 1
3           | 3                   | 4                     | 4

Now, all you need to do is to estimate the optimal values of the model coefficients $\overrightarrow{\bf{w}} = (w_0, w_1, w_2)$ and use the resulting model to get an estimate of the salary they might offer you.

Let's first rewrite the data from the table above in terms of the input matrix $X$ and target vector $\overrightarrow{\bf{y}}$:

$$X = \begin{pmatrix} 1 & 2 & 1 \\ 1 & 2 & 0 \\ 1 & 3 & 4 \end{pmatrix},\quad \overrightarrow{\bf{y}} = \begin{pmatrix} 2 \\ 1 \\ 4 \end{pmatrix}$$

We use the least-squares method to find the optimal values of the coefficients. They can be computed as follows:

$$\overrightarrow{\bf{w}} = (X^\top X)^{-1}X^\top \overrightarrow{\bf{y}}$$

Let's compute this expression step-by-step:

$$X^\top X = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 2 & 3 \\ 1 & 0 & 4 \end{pmatrix} \times \begin{pmatrix} 1 & 2 & 1 \\ 1 & 2 & 0 \\ 1 & 3 & 4 \end{pmatrix} = \begin{pmatrix} 3 & 7 & 5 \\ 7 & 17 & 14 \\ 5 & 14 & 17 \end{pmatrix}$$

$$(X^\top X)^{-1} = \begin{pmatrix} 93 & -49 & 13 \\ -49 & 26 & -7 \\ 13 & -7 & 2 \end{pmatrix}$$

$$X^\top \overrightarrow{\bf{y}} = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 2 & 3 \\ 1 & 0 & 4 \end{pmatrix} \times \begin{pmatrix} 2 \\ 1 \\ 4 \end{pmatrix} = \begin{pmatrix} 7 \\ 18 \\ 18 \end{pmatrix}$$

$$\overrightarrow{\bf{w}} = (X^\top X)^{-1} X^\top \overrightarrow{\bf{y}} = \begin{pmatrix} 93 & -49 & 13 \\ -49 & 26 & -7 \\ 13 & -7 & 2 \end{pmatrix} \times \begin{pmatrix} 7 \\ 18 \\ 18 \end{pmatrix} = \begin{pmatrix} 3 \\ -1 \\ 1 \end{pmatrix}$$

So, this means that our model looks as follows:

$$\hat{y} = 3 - x_1 + x_2$$

Thus, according to our model, as someone with 5 years of experience ($x_1$) leading a group of 8 people ($x_2$), you can expect a salary of $\hat{y} = 3 - 5 + 8 = 6$ conventional units.

Even though the coefficient before the years of experience (first input variable) is negative, it doesn't necessarily mean that more experience means a lower salary. In general, you should be careful with interpreting the model: the values of the coefficients depend, among other things, on the scale of the model's variables.

Note that you can get the same result using the vector notation:

$$\hat{y} = \langle \overrightarrow{\bf{x}}, \overrightarrow{\bf{w}} \rangle = \langle (1, 5, 8), (3, -1, 1) \rangle = 1\times3 - 5\times1 + 8\times1 = 6$$
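As a sanity check, the whole toy example can be reproduced in a few lines of NumPy; this simply repeats the hand calculation above:

```python
import numpy as np

X = np.array([[1., 2., 1.],
              [1., 2., 0.],
              [1., 3., 4.]])
y = np.array([2., 1., 4.])

# Closed-form least-squares estimate of the coefficients
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)                          # [ 3. -1.  1.]

# Expected salary for 5 years of experience and 8 people in supervision
x_new = np.array([1., 5., 8.])
print(x_new @ w)                  # 6.0
```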

A few assumptions of multiple linear regression

After reviewing the background, let's look at the assumptions of MLR. The linearity assumption states that the relationship between the dependent variable and each of the independent variables is linear. If the relationship is not linear, the model will fail to capture the true relationship between the variables, leading to poor predictive performance.

Also, MLR assumes that the errors (residuals) are independent of each other. This is known as the independence assumption.

Multicollinearity, which occurs when two or more independent variables are highly correlated, is a potential issue in MLR. Multicollinearity can lead to unreliable estimates of the regression coefficients, making it difficult to interpret the individual effects of each independent variable.
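A rough way to spot potential multicollinearity is to inspect the pairwise correlations between the input variables. The snippet below is only an illustrative sketch with made-up numbers; in practice, dedicated diagnostics such as the variance inflation factor are commonly used.

```python
import numpy as np

# Columns are input variables, rows are observations (illustrative numbers)
features = np.array([[2., 1.],
                     [2., 0.],
                     [3., 4.]])

# Correlation matrix of the inputs: off-diagonal values close to +/-1
# hint at strongly correlated (possibly collinear) variables
print(np.corrcoef(features, rowvar=False))
```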

Conclusion

  • The multiple linear regression model allows one to model the target variable as a linear combination of the input variables.

  • We can use the least-squares method to estimate the model's coefficients.

  • The exact expression to estimate the model's coefficients is as follows: $\overrightarrow{\bf{w}} = (X^\top X)^{-1}X^\top \overrightarrow{\bf{y}}$.
