
Regularized regression with scikit-learn


By now, you've learned to build linear regression models with sklearn. However, vanilla linear regression is likely to overfit noise in the data and may be skewed by outliers. As you probably remember, there's a solution to this problem: regularized regression. In this topic, we'll find out how to use the Ridge and Lasso regression models in sklearn to prevent overfitting.

The problem

First, let’s take a look at the toy mtcars dataset to illustrate the problem.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv")
df.head()
                model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0           Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1       Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2          Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3      Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4   Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2

The dataset contains information about different car models and their characteristics. In this topic, we will predict the fuel consumption in mpg (miles per gallon) from the other numerical features. First, we need to prepare and split the data:

y = df.mpg
X = df.drop(['model', 'mpg'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=10, test_size=0.5)
X_train.shape, X_test.shape

>>> ((16, 10), (16, 10))

If we train an ordinary linear regression, it achieves a high $R^2$ score on the training data (a perfect model would yield 1, a constant predictor would yield 0):

model = linear_model.LinearRegression(normalize=True)
model.fit(X_train, y_train)
model.score(X_train, y_train)

>>> 0.9045104417964017

However, on unseen test data the predictions are useless: the $R^2$ score is even negative, meaning the model performs worse than a constant predictor:

model.score(X_test, y_test)

>>> -0.041647227814543886

Regularization can help tackle this problem by penalizing the size of the learned coefficients, thus decreasing model complexity. A simpler model is likely to be more robust on unseen test data.

Ridge regression

As you already know, Ridge regression minimizes the residual sum of squares and, at the same time, penalizes the sum of squares of the individual regression coefficients:

$$\min_w \frac{1}{N} \sum_{i=0}^{N} (X_i w - y_i)^2 + \alpha \sum_{j=0}^{k} w_j^2$$

In this formula, $\alpha$ is the only hyperparameter to tune. It controls the strictness of weight regularization. With $\alpha = 0$, Ridge regression becomes ordinary linear regression.

Ridge regression follows the sklearn model API. First, you initialize the model:

model = linear_model.Ridge(alpha=1, normalize=True, fit_intercept=True)

We need to normalize the data because the feature ranges differ a lot in this dataset; without normalization, the penalty would treat the coefficients of differently scaled features unequally, and the model would concentrate on the columns with larger values. fit_intercept=True tells the model to fit an intercept term. Otherwise, the data is assumed to be centered and the intercept is set to 0.
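Note that in recent scikit-learn releases (1.2 and newer), the normalize parameter has been removed from the linear models. If the constructor above raises an error on your version, a minimal equivalent sketch is to scale the features with a StandardScaler inside a pipeline (the scaling is not exactly the same as the old normalize=True, so the scores may differ slightly):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features first, then fit Ridge; alpha plays the same role as before.
# With a pipeline, the fitted Ridge step is available as model[-1] (e.g., model[-1].coef_).
model = make_pipeline(StandardScaler(), linear_model.Ridge(alpha=1, fit_intercept=True))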

Then, you pass the training data to the fit method:

model.fit(X_train, y_train)

The model will store learned weights and intercept as its properties: coef_ and intercept_.
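For instance, you can pair each learned weight with its feature name to see how every characteristic affects the predicted mpg:

# One learned weight per feature, plus a separate intercept term
for name, weight in zip(X_train.columns, model.coef_):
    print(f"{name}: {weight:.3f}")
print("intercept:", model.intercept_)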

Once the model is trained, you may apply it to other data to obtain the predictions:

pred = model.predict(X_test)

Alternatively, you can call the score method to get the coefficient of determination ($R^2$) of the model's predictions right away and evaluate their quality:

model.score(X_train, y_train)

>>> 0.8102495051641695

model.score(X_test, y_test)

>>> 0.7985522674994942
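These scores are just the coefficient of determination computed from the model's predictions, so score is equivalent to applying the r2_score metric to the output of predict; a quick sketch illustrating the equivalence on the test set:

from sklearn.metrics import r2_score

# For regressors, score(X, y) returns the same value as r2_score(y, model.predict(X))
print(model.score(X_test, y_test))
print(r2_score(y_test, model.predict(X_test)))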

We have already obtained a much better test score than without regularization!
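To see how $\alpha$ controls the amount of shrinkage, you can also fit Ridge with several values and compare the overall magnitude of the learned weights. This is only an illustrative sketch; the exact numbers depend on the train/test split and the scikit-learn version:

# Larger alpha -> stronger penalty -> smaller coefficients overall;
# alpha close to 0 behaves like ordinary linear regression
for alpha in [0.01, 0.1, 1, 10]:
    ridge = linear_model.Ridge(alpha=alpha, normalize=True)  # drop normalize on scikit-learn >= 1.2
    ridge.fit(X_train, y_train)
    print(alpha, round(np.abs(ridge.coef_).sum(), 3), round(ridge.score(X_test, y_test), 3))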

Lasso regression

When we expect the model to be sparse (i.e., some features presumably bear only noise and have zero influence on the target variable), Lasso regression is the model of choice. It penalizes the sum of absolute values of weights:

$$\min_w \frac{1}{N} \sum_{i=0}^{N} (X_i w - y_i)^2 + \alpha \sum_{j=0}^{k} |w_j|$$

As in Ridge, $\alpha$ here controls the weight shrinkage.

The usage is also very similar:

model = linear_model.Lasso(alpha=0.2, normalize=True, fit_intercept=True)
model.fit(X_train, y_train)
model.score(X_train, y_train)

>>> 0.8335532812498246

model.score(X_test, y_test)

>>> 0.7511002616979463

In our case, Lasso performs worse than Ridge but does a decent job anyway.

The learned weights and intercept are stored in coef_ and intercept_. If we check coef_ for our model, we will see a lot of zeros. This happens because the L1 penalty drives the weights of uninformative features exactly to zero, which is what makes Lasso produce sparse models.

model.coef_

>>> array([-0.92676425, -0. , -0. , 0. , -2.53494976, 0. , 0. , 0. , 0. , -0.41875697])

One Lasso-specific property is sparse_coef_, which keeps a sparse representation of the learned weights and may help with feature selection.
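For example, you could keep only the columns with non-zero Lasso weights before training the next model; the snippet below is just one possible sketch of such a selection step:

# sparse_coef_ is a read-only sparse (CSR) representation of coef_
print(model.sparse_coef_)

# Keep only the features whose Lasso weight is non-zero
selected = X_train.columns[model.coef_ != 0]
print(list(selected))
X_train_small = X_train[selected]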

Parameter tuning for regularization

Both Lasso and Ridge regression in sklearn have built-in parameter tuning functionality implemented in LassoCV and RidgeCV: cross-validation is used to find the best regularization parameter.

First, you should initialize the model with a range of regularization alphas to try and a cross-validation strategy:

ridge_cv = linear_model.RidgeCV(alphas=np.linspace(0.1, 10, 1000), normalize=True, fit_intercept=True, cv=None)
lasso_cv = linear_model.LassoCV(alphas=np.linspace(0.01, 1, 1000), normalize=True, fit_intercept=True, cv=None)

For LassoCV, there is an alternative way to specify a range of alphas:

lasso_cv = linear_model.LassoCV(n_alphas=1000, eps=0.01, normalize=True, fit_intercept=True, cv=None)

Here, the alphas will be selected automatically so that the minimal one is 100 times smaller than the maximal one (set eps to adjust this ratio). The algorithm will check 1000 alphas in total along the way.

The cv parameter specifies the cross-validation strategy. By default (cv=None), RidgeCV uses leave-one-out cross-validation, but you can also set the number of folds (e.g., cv=10 for ten-fold cross-validation) or pass an explicit splitting strategy. LassoCV behaves differently and uses a five-fold strategy when cv=None. Be aware of this when trying different regularizations with the same parameter set!
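One way to avoid this mismatch is to pass the same strategy to both models explicitly, for example an identical five-fold split; a sketch assuming you want the two models to see exactly the same folds:

from sklearn.model_selection import KFold

# The same shuffled 5-fold split for both models instead of their differing defaults
folds = KFold(n_splits=5, shuffle=True, random_state=10)
ridge_cv = linear_model.RidgeCV(alphas=np.linspace(0.1, 10, 1000), normalize=True, cv=folds)
lasso_cv = linear_model.LassoCV(alphas=np.linspace(0.01, 1, 1000), normalize=True, cv=folds)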

Next, we train our models and check their performance:

ridge_cv.fit(X_train, y_train)
ridge_cv.score(X_test, y_test)

>>> 0.810110110515111

lasso_cv.fit(X_train, y_train)
lasso_cv.score(X_test, y_test)

>>> 0.7698660591624589

The best alpha from the range was selected automatically. You can find its value in alpha_:

ridge_cv.alpha_

>>> 0.24864864864864866

lasso_cv.alpha_

>>> 0.13053239441306014

You may notice that the selected alphas differ from our initial arbitrary guesses and achieve better test scores!

After training, you can obtain the prediction as usual:

pred_ridge = ridge_cv.predict(X_test)
pred_lasso = lasso_cv.predict(X_test)
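LassoCV additionally keeps the grid of alphas it actually evaluated and the per-fold errors, which is handy for checking how sensitive the model is to the choice of $\alpha$ (RidgeCV does not expose a comparable mse_path_, so this sketch applies to the Lasso model only):

# alphas_ is the evaluated grid; mse_path_ holds one MSE per (alpha, fold) pair
print(lasso_cv.alphas_.shape, lasso_cv.mse_path_.shape)

# Average over the folds and plot the CV error curve; alpha_ corresponds to its lowest point
mean_mse = lasso_cv.mse_path_.mean(axis=1)
plt.plot(lasso_cv.alphas_, mean_mse)
plt.xlabel("alpha")
plt.ylabel("mean CV MSE")
plt.show()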

Conclusion

To sum up:

  • sklearn contains the main regularized regression algorithms, Ridge and Lasso;
  • the alpha parameter controls the strength of regularization;
  • it is recommended to use cross-validation to search for the best regularization parameter;
  • default parameter values differ between algorithms, so pay attention to them.

Now let's do some practice!
