
Linear Regression with scikit-learn


As you already know, Linear Regression models the output $Y$ as a linear combination of the inputs $X_1, X_2, \dots, X_m$:

Y = \alpha_0 + \alpha_1 \cdot X_1 + \alpha_2 \cdot X_2 + \dots + \alpha_m \cdot X_m

The model coefficients $\alpha_0, \alpha_1, \dots, \alpha_m$ are chosen in such a way that the Mean Squared Error (MSE) of the prediction across the available training examples is minimized. In other words, training a linear regression model means solving the following optimization problem:

\min \ \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \ \ \text{with respect to} \ \alpha_0, \dots, \alpha_m
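To make this objective concrete, here is a tiny illustration of evaluating the MSE for a handful of made-up target values and predictions (the numbers below are purely illustrative and not from any real dataset):

import numpy as np

# made-up true values and predictions, just to illustrate the formula
y_true = np.array([3.0, 1.5, 2.0])
y_pred = np.array([2.5, 1.0, 2.5])

mse = np.mean((y_true - y_pred) ** 2)
print(mse)

# 0.25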

Luckily, you don't have to solve it manually, since Linear Regression is already implemented in sklearn. In this topic, you'll learn how to build such a model using a simple example.

Loading the data

sklearn already comes with some built-in datasets that one can use to experiment with different ML models. Let's load one of them, namely the California house prices dataset.

from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
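The returned object is a dictionary-like container holding the feature matrix and the target values. A quick sanity check of their dimensions (the shapes follow from the dataset description below):

print(data.data.shape)
print(data.target.shape)

# (20640, 8)
# (20640,)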

This dataset contains information about housing in California. Along with the median house value for California districts (MedHouseVal), expressed in hundreds of thousands of dollars ($100,000), the following 8 features are available for every object (you can also confirm these names programmatically, as shown after the list):

  1. MedInc — median income in block group;

  2. HouseAge — median house age in block group;

  3. AveRooms — average number of rooms per household;

  4. AveBedrms — average number of bedrooms per household;

  5. Population — block group population;

  6. AveOccup — average number of household members;

  7. Latitude — block group latitude;

  8. Longitude — block group longitude.
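These names are also stored on the loaded dataset object itself, so you can confirm them programmatically (this sketch assumes the feature_names and target_names attributes of the returned container):

print(data.feature_names)

# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

print(data.target_names)

# ['MedHouseVal']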

Here are the first 5 rows of the dataset:

+----+----------+------------+------------+-------------+--------------+------------+------------+-------------+---------------+
|    |   MedInc |   HouseAge |   AveRooms |   AveBedrms |   Population |   AveOccup |   Latitude |   Longitude |   MedHouseVal |
|----+----------+------------+------------+-------------+--------------+------------+------------+-------------+---------------|
|  0 |   8.3252 |         41 |    6.98413 |     1.02381 |          322 |    2.55556 |      37.88 |     -122.23 |         4.526 |
|  1 |   8.3014 |         21 |    6.23814 |     0.97188 |         2401 |    2.10984 |      37.86 |     -122.22 |         3.585 |
|  2 |   7.2574 |         52 |    8.28814 |     1.07345 |          496 |    2.80226 |      37.85 |     -122.24 |         3.521 |
|  3 |   5.6431 |         52 |    5.81735 |     1.07306 |          558 |    2.54795 |      37.85 |     -122.25 |         3.413 |
|  4 |   3.8462 |         52 |    6.28185 |     1.08108 |          565 |    2.18147 |      37.85 |     -122.25 |         3.422 |
+----+----------+------------+------------+-------------+--------------+------------+------------+-------------+---------------+
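If you'd like to reproduce a preview like this yourself, one option (a minimal sketch, assuming pandas is installed) is to wrap the arrays in a DataFrame:

import pandas as pd

# combine the features and the target into a single table
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedHouseVal'] = data.target

print(df.head())

In recent sklearn versions you can also pass as_frame=True to fetch_california_housing() to get a pandas DataFrame directly.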

The task is to predict the median house value $Y$ (in $100,000) as a linear combination of the features listed above:

Y = \alpha_0 + \alpha_1 \cdot \text{MedInc} + \alpha_2 \cdot \text{HouseAge} + \alpha_3 \cdot \text{AveRooms} + \alpha_4 \cdot \text{AveBedrms} + \alpha_5 \cdot \text{Population} + \alpha_6 \cdot \text{AveOccup} + \alpha_7 \cdot \text{Latitude} + \alpha_8 \cdot \text{Longitude}

Let's save the data corresponding to these input features to X and the target attribute to y:

# Extracting the features
X = data.data
# Extracting the target attribute
y = data.target

The full dataset contains 20,640 samples. Let's split this dataset into train and test subsets, leaving 80% of the data for training and the remaining 20% for testing:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
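You can verify the resulting split sizes: 80% of the 20,640 samples gives 16,512 training examples, leaving 4,128 for testing:

print(X_train.shape, X_test.shape)

# (16512, 8) (4128, 8)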

Alright, now we are ready to train our Linear Regression model!

Training a Linear Regression model

Linear Regression is implemented in the linear_model module of sklearn. We can therefore import it like this:

from sklearn.linear_model import LinearRegression

To build a Linear Regression model we should first create a model instance:

model = LinearRegression()

Then, we can call the fit() method to fit the model to the training data available. The method takes in the features and the values of the target. In our example, those are the arrays X_train and y_train respectively:

model.fit(X_train, y_train)

# LinearRegression()

And that's it, your Linear Regression model is trained! Cool, right? Let's inspect the resulting model in more detail.

Inspecting a Linear Regression model

Building a Linear Regression model means estimating the optimal values of the model parameters, that is, the coefficients $\alpha_0, \alpha_1, \dots, \alpha_m$.

After the model has been fit with the fit() method, you can see the obtained values of the coefficients $\alpha_1, \dots, \alpha_m$ by accessing the coef_ attribute of the model. It contains a numpy array with the coefficient for every input feature. In our case, there are 8 of them:

+------------+----------------+
|            |   Coefficients |
|------------+----------------|
| MedInc     |    0.448675    |
| HouseAge   |    0.00972426  |
| AveRooms   |   -0.123323    |
| AveBedrms  |    0.783145    |
| Population |   -2.02962e-06 |
| AveOccup   |   -0.00352632  |
| Latitude   |   -0.419792    |
| Longitude  |   -0.433708    |
+------------+----------------+
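The table above simply pairs every entry of the coef_ array with the corresponding feature name. You could print such a pairing yourself, for example like this:

for name, coef in zip(data.feature_names, model.coef_):
    print(name, coef)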

You might have noticed that one coefficient hasn't been included in the coef_ array, namely $\alpha_0$, also called the intercept. Not all Linear Regression models have it (we'll learn how to avoid modeling the intercept in a minute), which is why its value is stored in a separate attribute called intercept_:

print(model.intercept_)

# -37.02327770606391

Alright, you probably can't wait to actually use the model we've just built. Let's do it!

Making predictions

As you already know, to make predictions with our model, we can use the predict() method, passing to it the values of the input features of the instances for which we want to predict the target.

For example, let's make predictions for all the real estate objects from the training data:

predictions_train = model.predict(X_train)

We'll get predictions of the price of every single object from the training set, 16,512 estimates in total:

print(predictions_train.shape)

# (16512,)

Similarly, we can predict the real estate prices for the test samples which were not used for training the model:

predictions_test = model.predict(X_test)
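If you want to convince yourself that predict() computes exactly the linear combination defined by coef_ and intercept_, you can reconstruct a single prediction by hand (a minimal check, assuming numpy is imported):

import numpy as np

# manual prediction for the first test sample: intercept plus the dot product of coefficients and features
manual = model.intercept_ + X_test[0] @ model.coef_

print(np.isclose(manual, model.predict(X_test[:1])[0]))

# True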

Here are the plotted residuals of the model on the test data, which are defined as the difference between the true and predicted values of the test samples:

[Figure: scatter plot of the residuals, with 'Test sample' on the X axis and 'Residual' on the Y axis]
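A plot like this could be produced, for example, with matplotlib (a minimal sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

# residuals: true values minus predicted values
residuals = y_test - predictions_test

plt.scatter(range(len(residuals)), residuals, s=5)
plt.xlabel('Test sample')
plt.ylabel('Residual')
plt.show()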

How good is our model? Does it make accurate predictions? Let's compute some common evaluation metrics to find that out!

Evaluating the model

As you remember, common evaluation metrics for assessing the quality of regression models are the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), and the Mean Absolute Error (MAE).

You can easily compute them yourself, but the corresponding functions are also implemented in the metrics module of sklearn. Let's import them:

from sklearn.metrics import mean_squared_error, mean_absolute_error

Now, we can compute the MSE of the prediction on the training and test sets with the mean_squared_error() function:

mse_train = mean_squared_error(y_train, predictions_train)
print(mse_train)

# 0.5179331255246699

mse_test = mean_squared_error(y_test, predictions_test)
print(mse_test)

# 0.5558915986952422

You can compute the RMSE by taking the square root of the computed MSE (a small sketch is shown at the end of this section). Since MSE scores are somewhat difficult to interpret, you might also want to compute the MAE score. This can be done with the mean_absolute_error() function:

mae_train = mean_absolute_error(y_train, predictions_train)
print(mae_train)

# 0.5286283596582376

mae_test = mean_absolute_error(y_test, predictions_test)
print(mae_test)

# 0.533200130495698

So, on average the prices predicted by our model are about 53 thousand dollars off. Is it any good? Are the predictions accurate enough?

Well, it's impossible to answer this question from the MSE or MAE score alone, since the answer depends on the end goal of the modeling. For example, if you are planning to use this model to get a rough estimate of the price, it's probably good enough. However, if the profit your company makes depends strongly on the quality of the prediction, you would want a better model.
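As for the RMSE mentioned earlier, the most direct way to get it is to take the square root of the MSE you have already computed (a minimal sketch, assuming numpy is imported; depending on your sklearn version, there may also be a dedicated helper function for this):

import numpy as np

rmse_test = np.sqrt(mse_test)
print(rmse_test)

# approximately 0.7456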

Fitting a model with no intercept

By default, sklearn includes the intercept in the Linear Regression equation. However, as mentioned before, sometimes you might prefer to train a linear model without it:

Y = \alpha_1 \cdot X_1 + \alpha_2 \cdot X_2 + \dots + \alpha_m \cdot X_m

To do so, you need to set the value of the fit_intercept parameter to False when creating a LinearRegression object:

model = LinearRegression(fit_intercept=False)

In this case, the model will be fit without the intercept term or, in other words, the value of the corresponding parameter will be explicitly set to 0.
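For example, after fitting this model you can check that the intercept is indeed fixed at zero (a quick sketch; the learned coefficients would, of course, differ from the ones shown earlier):

model.fit(X_train, y_train)

# with fit_intercept=False, the intercept is set to zero
print(model.intercept_)

# 0.0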

Note, however, that in principle you should not force the intercept to be zero unless you are certain that this should be the case; doing so otherwise introduces bias into the model and decreases the quality of its predictions.

Conclusions

  • To train a Linear Regression model, use the fit() method.

  • Once the model has been fit, you can make predictions with the predict() method.

  • To access the model's parameters, use the intercept_ attribute for the intercept and coef_ for the other coefficients.
