Regression problems are common in real life. For example, you might want to predict how your salary in a company will change depending on years of experience, and a real estate agency might want to estimate the cost of an apartment depending on the distance to the city center. Linear regression is one of the simplest models to solve such problems.
In this topic, we will consider a simple, one-dimensional linear regression model, which predicts the value of the target based on the value of just one input feature.
Problem statement
Imagine that you have just started to work in a company. You are wondering how your salary will change as you get more experience, so you head to the financial department and ask for the data on the salaries of the employees (in conventional units) and the number of months they've been working for your company. Here is what you've got:
| Work experience (months) | Salary (conventional units) |
|---|---|
| 3 | 0.5 |
| 6 | 1.5 |
| 9 | 1.5 |
| 12 | 4 |
| 15 | 6 |
The simplest way to model the dependency between the two is to assume that it's linear. If we denote salary as $y$ and work experience (in months) as $x$, the salary of an employee can be computed as the following function:
$$y = a + bx,$$
where
- $y$: dependent variable (salary);
- $x$: factor, regressor, independent variable (work experience);
- $a$, $b$: numerical coefficients ($a$ is often called the intercept, $b$ the slope).
The task is to find the coefficients $a$ and $b$ at which the model describes the dependence as accurately as possible. Based on this dependence, we can predict the values of the dependent variable.
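As a minimal sketch, the linear model can be written as a small Python function. The names `predict`, `a_guess`, and `b_guess` are assumptions made for illustration, and the coefficient values are a hand-picked guess rather than the optimal ones.

```python
# Data from the table: experience in months, salary in conventional units.
experience = [3, 6, 9, 12, 15]
salary = [0.5, 1.5, 1.5, 4, 6]


def predict(x, a, b):
    """Linear model: intercept a plus slope b times the feature x."""
    return a + b * x


# A hand-picked guess for the coefficients (illustrative, not optimal).
a_guess, b_guess = 0.0, 0.3

# Compare what the guessed line predicts with what was actually observed.
for x, y in zip(experience, salary):
    y_hat = predict(x, a_guess, b_guess)
    print(f"x={x:>2}  observed={y:<4}  predicted={y_hat:.2f}")
```

Changing `a_guess` and `b_guess` moves the line, and some choices clearly follow the data better than others. Finding the best choice is exactly the task stated above.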
As mentioned in the topic "Introduction to regression", the accuracy can be evaluated by various metrics. These are mainly and .
Least squares method
The main method for finding dependence coefficients is the least squares method (LS).
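Its essence is to minimize the sum of the squares of the differences between the real values $y_i$ and the values $\hat{y}_i = a + bx_i$ predicted by the target dependence with coefficients $a$ and $b$. For $n$ observations, it is formally written this way:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \to \min_{a,\, b}$$
Sometimes the difference $y_i - \hat{y}_i$ is denoted as $e_i$; then the formula takes the form
$$\sum_{i=1}^{n} e_i^2 \to \min_{a,\, b}$$
Geometrically, $e_i$ is the vertical distance between the $i$-th observed point and the fitted line, as the picture illustrates.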
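To make the objective concrete, here is a minimal sketch that computes the sum of squared differences for a given line. The function name `sum_squared_errors` and the two candidate lines are assumptions chosen for illustration; neither candidate is claimed to be optimal.

```python
def sum_squared_errors(xs, ys, a, b):
    """Sum of squared residuals e_i = y_i - (a + b * x_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Data from the table: experience in months, salary in conventional units.
experience = [3, 6, 9, 12, 15]
salary = [0.5, 1.5, 1.5, 4, 6]

# Two hand-picked candidate lines (illustrative values).
sse_flat = sum_squared_errors(experience, salary, a=2.7, b=0.0)
sse_sloped = sum_squared_errors(experience, salary, a=-1.0, b=0.4)

print(f"flat line:   {sse_flat:.2f}")
print(f"sloped line: {sse_sloped:.2f}")
```

The sloped candidate gives a much smaller sum than the flat one, so by the least squares criterion it describes the data better. The method looks for the line where this sum is as small as possible.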
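In the case when we have only one independent variable, exact formulas for the optimal coefficients are known. For $n$ observations $(x_i, y_i)$ and the line $y = a + bx$:
$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}, \qquad a = \frac{\sum_{i=1}^{n} y_i - b\sum_{i=1}^{n} x_i}{n}$$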
Using these values we can get the target dependence.
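The standard closed-form least squares expressions for one feature can be sketched in Python as follows. The function name `fit_line` and the toy points are assumptions for illustration; the points were chosen to lie exactly on the line $y = 1 + 2x$, so the fit should recover those coefficients.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b * x.

    b = (n * sum(x*y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
    a = (sum(y) - b * sum(x)) / n
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # Happens when every x is the same: no unique slope exists.
        raise ValueError("all x values are identical; the slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denominator
    a = (sum_y - b * sum_x) / n
    return a, b


# Toy points lying exactly on y = 1 + 2x (illustrative data).
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

The guard on the denominator is a design choice of this sketch: when all feature values coincide, the formula would divide by zero, so raising an error is clearer than returning a meaningless slope.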
Example
Let's continue with the same example with work experience in months and salary in conventional units.
Our task is to predict what salary you can expect with 2 years (24 months) of work experience.
| Work experience (months) | Salary (conventional units) |
|---|---|
| 3 | 0.5 |
| 6 | 1.5 |
| 9 | 1.5 |
| 12 | 4 |
| 15 | 6 |
Plotting this data on a graph shows that the relationship is close to linear.
Let's use the least squares method and immediately apply the formulas we know.
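First, we calculate all the necessary components. In our case the number of entries is $n = 5$.
$$\sum x_i = 45, \quad \sum y_i = 13.5, \quad \sum x_i y_i = 162, \quad \sum x_i^2 = 495$$
Now we can just substitute all these values into the formulas for calculating the coefficients of the line $y = a + bx$.
$$b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{5 \cdot 162 - 45 \cdot 13.5}{5 \cdot 495 - 45^2} = \frac{202.5}{450} = 0.45$$
$$a = \frac{\sum y_i - b\sum x_i}{n} = \frac{13.5 - 0.45 \cdot 45}{5} = -1.35$$
So we get the dependency $y = -1.35 + 0.45x$.
Now substitute $x = 24$ into this and get the result:
$$y = -1.35 + 0.45 \cdot 24 = 9.45$$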
Based on this, we can suppose that a worker with 2 years of experience will receive about 9.45 conventional units.
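The arithmetic of this example can be double-checked with a short script. The sketch below deliberately uses the mean-centered form of the least squares formulas, which is algebraically equivalent to the sum-based form, so it serves as an independent check. All variable names are illustrative.

```python
# Data from the table: experience in months, salary in conventional units.
experience = [3, 6, 9, 12, 15]
salary = [0.5, 1.5, 1.5, 4, 6]

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n

# Mean-centered form: b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
variance_sum = sum((x - mean_x) ** 2 for x in experience)

b = covariance_sum / variance_sum  # slope
a = mean_y - b * mean_x            # intercept

# Predicted salary for 24 months of experience.
prediction = a + b * 24

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"predicted salary at 24 months: {prediction:.2f}")
```

Keep in mind that 24 months lies outside the observed range of 3 to 15 months, so the prediction assumes the linear trend continues beyond the data.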
For further study
Of course, this model is very simple and rarely used in practice. But now, with the knowledge you have gained, you can read about multiple linear regression and regularized regression, which are already quite applicable to uncomplicated real-world tasks. Multiple linear regression is a more flexible and frequently used model.
Take a look at the example for multiple linear regression:
Conclusion
- A simple linear regression model is a linear dependency $y = a + bx$ on only one factor ($x$).
- The most commonly used method for finding coefficients is the least squares method.