
Simple linear regression

Regression problems are common in real life. For example, you might want to predict how your salary at a company will change with years of experience, and a real estate agency might want to estimate the cost of an apartment based on its distance to the city center. Linear regression is one of the simplest models for solving such problems.
In this topic, we will consider a simple, one-dimensional linear regression model, which predicts the value of the target based on the value of just one input feature.

Problem statement

Imagine that you have just started to work in a company. You are wondering how your salary will change as you get more experience, so you head to the financial department and ask for the data on the salaries of the employees (in conventional units) and the number of months they've been working for your company. Here is what you've got:

Work experience (months)	Salary (conventional units)
3	0.5
6	1.5
9	1.5
12	4
15	6

The simplest way to model the dependency between the two is to assume that it's linear. If we denote salary as $y$ and months of experience as $x$, the salary of an employee can be computed with the following function:

$y=ax+b$ where

$y$ — dependent variable (salary);
$x$ — factor, regressor, independent variable (work experience);
$a,b$ — numerical coefficients ( $b$ is often called an intercept).

The task is to find the coefficients $a$ and $b$ for which the model describes the dependence as accurately as possible. Once we have this dependence, we can use it to predict values of the dependent variable.
As mentioned in the topic "Introduction to regression", the accuracy can be evaluated by various metrics. These are mainly $\text{MSE},\ \text{RMSE}$ and $\text{MAE}$ .
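
As a quick reminder of what these metrics measure, here is a minimal sketch in plain Python. The function names and the `y_pred` values are assumptions made up for illustration; only the `y_true` salaries come from the table above.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error: the average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Salaries from the table above
y_true = [0.5, 1.5, 1.5, 4, 6]
# Hypothetical predictions, invented only to exercise the functions
y_pred = [1.0, 1.5, 2.5, 4.0, 5.0]

print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

RMSE is expressed in the same units as the target, which often makes it easier to interpret than MSE.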

Least squares method

The main method for finding the coefficients of the dependence is the least squares method (LS).
Its essence is to minimize the sum of the squared differences between the real values $y_i$ and the values $\hat{y_i}$ predicted by the target dependence. Formally, it is written as follows:

$\sum_{i=1}^n (y_i-\hat{y_i})^2 = \sum_{i=1}^n (y_i-ax_i-b)^2 \to \min$

Sometimes the difference $(y_i-\hat{y_i})$ is denoted as $e_i$ (the residual); then the formula takes the form $\sum\limits_{i=1}^n e_i^2 \to \min$. For a better understanding, take a look at the picture:

A linear regression plot with the differences between the true and the predicted values
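
To make the minimized quantity concrete, the sketch below computes the sum of squared differences for two candidate lines on the salary data. Both candidate lines are arbitrary guesses chosen for illustration, and the function name is hypothetical; the point is only that the least squares method prefers the line with the smaller sum.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared differences between ys and the line a * x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Data from the table in the problem statement
months = [3, 6, 9, 12, 15]
salary = [0.5, 1.5, 1.5, 4, 6]

# Two hypothetical candidate lines (arbitrary guesses, not fitted values)
loss_first = sum_squared_residuals(months, salary, a=0.3, b=0.0)
loss_second = sum_squared_residuals(months, salary, a=0.45, b=-1.5)

# The candidate with the smaller sum describes the data better
# in the least squares sense.
print(loss_first, loss_second)
```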

In the case when we have only one independent variable, exact formulas for the optimal coefficients are known.

$a = {n \cdot \sum\limits_{i=1}^n x_i y_i - (\sum\limits_{i=1}^n x_i) \cdot (\sum\limits_{i=1}^n y_i ) \over n \cdot \sum\limits_{i=1}^n x_i^2-(\sum\limits_{i=1}^n x_i)^2}$

$b =\frac { \sum\limits_{i=1}^n y_i - a \cdot \sum\limits_{i=1}^n x_i }{n}$

Using these values we can get the target dependence.
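
For readers curious where these formulas come from, here is a brief sketch of the standard derivation. It is not spelled out in this topic, and it assumes that the $x_i$ are not all equal, so that the denominator is non-zero. Denote the minimized sum by $S(a,b)$ (a name introduced here only for illustration) and set both partial derivatives to zero:

$S(a,b)=\sum\limits_{i=1}^n (y_i-ax_i-b)^2$

$\frac{\partial S}{\partial a}=-2\sum\limits_{i=1}^n x_i(y_i-ax_i-b)=0, \quad \frac{\partial S}{\partial b}=-2\sum\limits_{i=1}^n (y_i-ax_i-b)=0$

Rearranging gives a system of two linear equations in $a$ and $b$:

$a\sum\limits_{i=1}^n x_i^2 + b\sum\limits_{i=1}^n x_i = \sum\limits_{i=1}^n x_i y_i$
$a\sum\limits_{i=1}^n x_i + bn = \sum\limits_{i=1}^n y_i$

Solving the second equation for $b$ gives the formula for $b$ above. Substituting that expression into the first equation and multiplying by $n$ gives

$a\left(n \cdot \sum\limits_{i=1}^n x_i^2-(\sum\limits_{i=1}^n x_i)^2\right)=n \cdot \sum\limits_{i=1}^n x_i y_i-(\sum\limits_{i=1}^n x_i) \cdot (\sum\limits_{i=1}^n y_i)$

which, after dividing by the expression in parentheses, is the formula for $a$.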

Example

Let's continue with the same example, with work experience in months and salary in conventional units.
Our task is to predict what salary you can expect with 2 years (24 months) of work experience.

Work experience (months)	Salary (conventional units)
3	0.5
6	1.5
9	1.5
12	4
15	6

By plotting this data on a graph, we can see that the relationship is indeed close to linear.

The data plot from the table above

Let's use the least squares method and immediately apply the formulas we know.
First, we calculate all the necessary components. In our case the number of entries is $n=5$ .

$\sum\limits_{i=1}^5 x_i = 3 + 6 + ... + 15 = 45$
$\sum\limits_{i=1}^5 y_i = 0.5 + ... + 6 = 13.5$
$\sum\limits_{i=1}^5 x_i y_i = 3 \cdot 0.5 + ... + 15 \cdot 6 = 162$
$\sum\limits_{i=1}^5 x_i^2 = 3 \cdot 3 + ... + 15 \cdot 15 = 495$

Now we can just substitute all these values into formulas for calculating the coefficients.

$a={5 \cdot 162 - 45 \cdot 13.5 \over 5 \cdot 495 - 45^2}={202.5 \over 450}=0.45$

$b = { 13.5-0.45 \cdot 45 \over 5} = {-6.75 \over 5} = -1.35$

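So we get the dependency $y=0.45x-1.35$.

Now substitute $x=24$ into it to get the result:

$y=0.45 \cdot 24-1.35=9.45$
Based on this, we can expect that an employee with 2 years of experience will receive about 9.45 conventional units. Keep in mind that 24 months lies outside the observed range of 3 to 15 months, so this prediction is an extrapolation.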
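
The hand calculation above can be double-checked with a few lines of plain Python. This is a minimal sketch of the closed-form formulas from the previous section; the function name `fit_simple_linear_regression` is hypothetical and introduced only for this illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) for y = a * x + b using the closed-form LS formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # Happens when all x values are identical: the slope is undefined.
        raise ValueError("all x values are identical")

    a = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - a * sum_x) / n
    return a, b


# Data from the table above
months = [3, 6, 9, 12, 15]
salary = [0.5, 1.5, 1.5, 4, 6]

a, b = fit_simple_linear_regression(months, salary)
prediction = a * 24 + b  # salary estimate for 24 months of experience

print(f"a = {a:.2f}, b = {b:.2f}, prediction = {prediction:.2f}")
# prints: a = 0.45, b = -1.35, prediction = 9.45
```

If NumPy is available, `numpy.polyfit(months, salary, 1)` should return approximately the same slope and intercept, which is a convenient independent check.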

For further study

Of course, this model is very simple and is rarely used on its own in practice. With the knowledge you have gained, however, you can move on to multiple linear regression and regularized regression, which are applicable to many straightforward real-world tasks. Multiple linear regression, which uses several input features instead of one, is more flexible and more frequently used.

Take a look at the example for multiple linear regression:

A plot of multiple linear regression with 2 different functions

Conclusion

A simple linear regression model describes the dependency of the target on a single factor ( $y=ax+b$ ).
The most commonly used method for finding coefficients $a,b$ is the least squares method.
