
Linear Regression with scikit-learn


As you already know, Linear Regression models the output $Y$ as a linear combination of the inputs $X_1, X_2, \dots, X_m$:

Y = \alpha_0 + \alpha_1 \cdot X_1 + \alpha_2 \cdot X_2 + \dots + \alpha_m \cdot X_m

The model coefficients $\alpha_0, \alpha_1, \dots, \alpha_m$ are chosen in such a way that the Mean Squared Error (MSE) of the prediction across the available training examples is minimized. In other words, training a linear regression model means solving the following optimization problem:

\min \ \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \ \ \text{with respect to} \ \alpha_0, \dots, \alpha_m
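To make this objective concrete, here is a tiny illustration of evaluating the MSE for a handful of made-up target values and predictions (the numbers below are purely illustrative and not from any real dataset):

import numpy as np

# made-up true values and predictions, just to illustrate the formula
y_true = np.array([3.0, 1.5, 2.0])
y_pred = np.array([2.5, 1.0, 2.5])

mse = np.mean((y_true - y_pred) ** 2)
print(mse)

# 0.25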

Luckily, you don't have to solve it manually, since Linear Regression is already implemented in sklearn. In this topic, you'll learn how to build such a model using a simple example.

Loading the data

sklearn already comes with some built-in datasets that one can use to experiment with different ML models. Let's load one of them, namely the California house prices dataset.

from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
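The returned object is a dictionary-like container holding the feature matrix and the target values. A quick sanity check of their dimensions (the shapes follow from the dataset description below):

print(data.data.shape)
print(data.target.shape)

# (20640, 8)
# (20640,)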

This dataset contains information about housing in California. Along with the median house value for California districts (MedHouseVal), expressed in hundreds of thousands of dollars ($100,000), the following 8 features are available for every object (you can also confirm these names programmatically, as shown after the list):

  1. MedInc — median income in block group;

  2. HouseAge — median house age in block group;

  3. AveRooms — average number of rooms per household;

  4. AveBedrms — average number of bedrooms per household;

  5. Population — block group population;

  6. AveOccup — average number of household members;

  7. Latitude — block group latitude;

  8. Longitude — block group longitude.
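These names are also stored on the loaded dataset object itself, so you can confirm them programmatically (this sketch assumes the feature_names and target_names attributes of the returned container):

print(data.feature_names)

# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

print(data.target_names)

# ['MedHouseVal']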

Here are the first 5 rows of the dataset:

+----+----------+------------+------------+-------------+--------------+------------+------------+-------------+---------------+
|    |   MedInc |   HouseAge |   AveRooms |   AveBedrms |   Population |   AveOccup |   Latitude |   Longitude |   MedHouseVal |
|----+----------+------------+------------+-------------+--------------+------------+------------+-------------+---------------|
|  0 |   8.3252 |         41 |    6.98413 |     1.02381 |          322 |    2.55556 |      37.88 |     -122.23 |         4.526 |
|  1 |   8.3014 |         21 |    6.23814 |     0.97188 |         2401 |    2.10984 |      37.86 |     -122.22 |         3.585 |
|  2 |   7.2574 |         52 |    8.28814 |     1.07345 |          496 |    2.80226 |      37.85 |     -122.24 |         3.521 |
|  3 |   5.6431 |         52 |    5.81735 |     1.07306 |          558 |    2.54795 |      37.85 |     -122.25 |         3.413 |
|  4 |   3.8462 |         52 |    6.28185 |     1.08108 |          565 |    2.18147 |      37.85 |     -122.25 |         3.422 |
+----+----------+------------+------------+-------------+--------------+------------+------------+-------------+---------------+
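If you'd like to reproduce a preview like this yourself, one option (a minimal sketch, assuming pandas is installed) is to wrap the arrays in a DataFrame:

import pandas as pd

# combine the features and the target into a single table
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target

print(df.head())

In recent sklearn versions you can also pass as_frame=True to fetch_california_housing() to get a pandas DataFrame directly.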

The task is to predict the median house value $Y$ (in $100,000) as a linear combination of the features listed above:

Y = \alpha_0 + \alpha_1 \cdot \text{MedInc} + \alpha_2 \cdot \text{HouseAge} + \alpha_3 \cdot \text{AveRooms} + \alpha_4 \cdot \text{AveBedrms} + \alpha_5 \cdot \text{Population} + \alpha_6 \cdot \text{AveOccup} + \alpha_7 \cdot \text{Latitude} + \alpha_8 \cdot \text{Longitude}

Let's save the data corresponding to these input features to X and the target attribute to y:

# Extracting the features
X = data.data
# Extracting the target attribute
y = data.target

The full dataset contains 20,640 samples. Let's split this dataset into train and test subsets, leaving 80% of the data for training and the remaining 20% for testing:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
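You can verify the resulting split sizes: 80% of the 20,640 samples gives 16,512 training examples, leaving 4,128 for testing:

print(X_train.shape, X_test.shape)

# (16512, 8) (4128, 8)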

Alright, now we are ready to train our Linear Regression model!

Training a Linear Regression model

Linear Regression is implemented in the linear_model module of sklearn. We can therefore import it like this:

from sklearn.linear_model import LinearRegression

To build a Linear Regression model we should first create a model instance:

model = LinearRegression()

Then, we can call the fit() method to fit the model to the training data available. The method takes in the features and the values of the target. In our example, those are the arrays X_train and y_train respectively:

model.fit(X_train, y_train)

# LinearRegression()

And that's it, your Linear Regression model is trained! Cool, right? Let's inspect the resulting model in more detail.

Inspecting a Linear Regression model

Building a Linear Regression model means estimating the optimal values of the model parameters, that is, the coefficients $\alpha_0, \alpha_1, \dots, \alpha_m$.

After the model has been fit with the fit() method, you can see the obtained values of the coefficients $\alpha_1, \dots, \alpha_m$ by accessing the coef_ attribute of the model. It contains a numpy array with the coefficient for every input feature. In our case, there are 8 of them:

+------------+----------------+
|            |   Coefficients |
|------------+----------------|
| MedInc     |    0.448675    |
| HouseAge   |    0.00972426  |
| AveRooms   |   -0.123323    |
| AveBedrms  |    0.783145    |
| Population |   -2.02962e-06 |
| AveOccup   |   -0.00352632  |
| Latitude   |   -0.419792    |
| Longitude  |   -0.433708    |
+------------+----------------+
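The table above simply pairs every entry of the coef_ array with the corresponding feature name. You could print such a pairing yourself, for example like this:

for name, coef in zip(data.feature_names, model.coef_):
    print(name, coef)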

You might have noticed that one coefficient hasn't been included in the coef_ array, namely $\alpha_0$, also called the intercept. Not all Linear Regression models have it (we'll learn how to avoid modeling the intercept in a minute), which is why its value is stored in a separate attribute called intercept_:

print(model.intercept_)

# -37.02327770606391

Alright, you probably can't wait to actually use the model we've just built. Let's do it!

Making predictions

As you already know, to make predictions with our model, we can use the predict() method, passing to it the values of the input features of the instances for which we want to predict the target.

For example, let's make predictions for all the real estate objects from the training data:

predictions_train = model.predict(X_train)

We'll get predictions of the price of every single object from the training set, 16,512 estimates in total:

print(predictions_train.shape)

# (16512,)

Similarly, we can predict the real estate prices for the test samples which were not used for training the model:

predictions_test = model.predict(X_test)
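If you want to convince yourself that predict() computes exactly the linear combination defined by coef_ and intercept_, you can reconstruct a single prediction by hand (a minimal check, assuming numpy is imported):

import numpy as np

# manual prediction for the first test sample: intercept plus the dot product of coefficients and features
manual = model.intercept_ + X_test[0] @ model.coef_

print(np.isclose(manual, model.predict(X_test[:1])[0]))

# True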

Here are the plotted residuals of the model on the test data, which are defined as the difference between the true and predicted values of the test samples:

[Figure: scatter plot of the residuals, with 'Test sample' on the X axis and 'Residual' on the Y axis]
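A plot like this could be produced, for example, with matplotlib (a minimal sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

# residuals: true values minus predicted values
residuals = y_test - predictions_test

plt.scatter(range(len(residuals)), residuals, s=5)
plt.xlabel('Test sample')
plt.ylabel('Residual')
plt.show()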

How good is our model? Does it make accurate predictions? Let's compute some common evaluation metrics to find that out!

Evaluating the model

As you remember, common evaluation metrics for assessing the quality of regression models are the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), and the Mean Absolute Error (MAE).

You can easily compute them yourself, but the corresponding functions are also implemented in the metrics module of sklearn. Let's import them:

from sklearn.metrics import mean_squared_error, mean_absolute_error

Now, we can compute the MSE of the prediction on the training and test sets with the mean_squared_error() function:

mse_train = mean_squared_error(y_train, predictions_train)
print(mse_train)

# 0.5179331255246699

mse_test = mean_squared_error(y_test, predictions_test)
print(mse_test)

# 0.5558915986952422

You can compute the RMSE by taking the square root of the computed MSE (a small sketch is shown at the end of this section). Since MSE scores are somewhat difficult to interpret, you might also want to compute the MAE score. This can be done with the mean_absolute_error() function:

mae_train = mean_absolute_error(y_train, predictions_train)
print(mae_train)

# 0.5286283596582376

mae_test = mean_absolute_error(y_test, predictions_test)
print(mae_test)

# 0.533200130495698

So, on average the prices predicted by our model are about 53 thousand dollars off. Is it any good? Are the predictions accurate enough?

Well, it's impossible to answer this question from the MSE or MAE score alone, since the answer depends on the end goal of the modeling. For example, if you are planning to use this model to get a rough estimate of the price, it's probably good enough. However, if the profit your company makes depends strongly on the quality of the prediction, you would want a better model.
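As for the RMSE mentioned earlier, the most direct way to get it is to take the square root of the MSE you have already computed (a minimal sketch, assuming numpy is imported; depending on your sklearn version, there may also be a dedicated helper function for this):

import numpy as np

rmse_test = np.sqrt(mse_test)
print(rmse_test)

# approximately 0.7456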

Fitting a model with no intercept

By default, sklearn includes the intercept in the Linear Regression equation. However, as mentioned before, sometimes you might prefer to train a linear model without it:

Y = \alpha_1 \cdot X_1 + \alpha_2 \cdot X_2 + \dots + \alpha_m \cdot X_m

To do so, you need to set the value of the fit_intercept parameter to False when creating a LinearRegression object:

model = LinearRegression(fit_intercept=False)

In this case, the model will be fit without the intercept term or, in other words, the value of the corresponding parameter will be explicitly set to 0.
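For example, after fitting this model you can check that the intercept is indeed fixed at zero (a quick sketch; the learned coefficients would, of course, differ from the ones shown earlier):

model.fit(X_train, y_train)

# with fit_intercept=False, the intercept is set to zero
print(model.intercept_)

# 0.0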

Note, however, that in principle you should not force the intercept to be zero unless you are certain that this should be the case; doing so otherwise introduces bias into the model and decreases the quality of its predictions.

Conclusions

  • To train a Linear Regression model, use the fit() method.

  • Once the model has been fit, you can make predictions with the predict() method.

  • To access the model's parameters, use the intercept_ attribute for the intercept and coef_ for the other coefficients.
