By now, you've learned to build linear regression models with sklearn. However, vanilla linear regression is likely to overfit noise in the data and may be skewed towards outliers. As you probably remember, there's a solution to this problem: regularized regression. In this topic, we'll find out how to use the Ridge and Lasso regression models in sklearn to prevent overfitting.
The problem
First, let’s take a look at the toy mtcars dataset to illustrate the problem.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
df = pd.read_csv("https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv")
df.head()
| | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
The dataset contains information about different car models and their characteristics. In this topic, we will predict the fuel consumption in mpg (miles per gallon) from the other numerical features. First, we need to prepare and split the data:
y = df.mpg
X = df.drop(['model', 'mpg'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, test_size=0.5)
X_train.shape, X_test.shape
>>> ((16, 10), (16, 10))
If we train an ordinary linear regression, it achieves a high R² score on the training data (for reference, a constant predictor would yield 0):
model = linear_model.LinearRegression(normalize=True)
model.fit(X_train, y_train)
model.score(X_train, y_train)
>>> 0.9045104417964017
However, the predictions on unseen test data are poor: the R² score is even negative, meaning the model does worse than a constant predictor:
model.score(X_test, y_test)
>>> -0.041647227814543886
Regularization can help tackle this problem by penalizing the size of the learned coefficients and thus decreasing the model complexity. A simpler model is likely to be more robust on unseen test data.
Ridge regression
As you already know, Ridge regression minimizes the residual sum of squares and at the same time penalizes the sum of squares of the individual regression coefficients:

$$\min_{w} \|Xw - y\|_2^2 + \alpha \|w\|_2^2$$

In this formula, $\alpha$ is the only hyperparameter to tune. It controls the strictness of weight regularization. With $\alpha = 0$, Ridge regression becomes an ordinary linear regression.
Ridge regression follows the sklearn model API. First, you initialize the model:
model = linear_model.Ridge(alpha=1, normalize=True, fit_intercept=True)
We need to normalize the data because the feature ranges differ a lot in this dataset. Without normalization, the model will concentrate on the columns with larger values. fit_intercept=True tells the model to learn an intercept; otherwise, the data is assumed to be centered and the intercept is set to 0.
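Note that newer scikit-learn releases (1.2 and later) no longer accept the normalize argument. If you are on such a version, a minimal sketch of an equivalent setup scales the features explicitly, for example with StandardScaler in a pipeline (not numerically identical to the old normalize behavior, but the usual replacement; scaled_ridge is just an illustrative name):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features first, then fit Ridge; the pipeline applies the same
# scaling to any data passed to fit, predict, or score.
scaled_ridge = make_pipeline(StandardScaler(), linear_model.Ridge(alpha=1, fit_intercept=True))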
Then, you pass the training data to the fit method:
model.fit(X_train, y_train)
The model will store learned weights and intercept as its properties: coef_ and intercept_.
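For example, you can inspect them right after fitting (output omitted; the exact numbers depend on your split):

print(model.coef_)       # one learned weight per feature column
print(model.intercept_)  # the learned bias term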
Once the model is trained, you may apply it to other data to obtain the predictions:
pred = model.predict(X_test)
Alternatively, you can get the coefficient of determination (R²) of the model's predictions right away to evaluate their quality:
model.score(X_train, y_train)
>>> 0.8102495051641695
model.score(X_test, y_test)
>>> 0.7985522674994942
We already obtained a much better score than without regularization!
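If you want to see the effect of the penalty directly, you can compare the magnitude of the weights learned with and without regularization. A quick sketch, refitting a plain LinearRegression for comparison (exact numbers depend on the split):

# Refit an unregularized model and compare the total weight magnitude with Ridge.
ols = linear_model.LinearRegression(normalize=True).fit(X_train, y_train)
ridge = linear_model.Ridge(alpha=1, normalize=True).fit(X_train, y_train)
print(np.abs(ols.coef_).sum())    # typically much larger
print(np.abs(ridge.coef_).sum())  # shrunk towards zero by the penalty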
Lasso regression
When we expect the model to be sparse (i.e., some features presumably carry only noise and have zero influence on the target variable), Lasso regression is the model of choice. It penalizes the sum of absolute values of the weights:

$$\min_{w} \frac{1}{2 n_{\text{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1$$

As in Ridge, $\alpha$ here controls the weight shrinkage.
The usage is also very similar:
model = linear_model.Lasso(alpha=0.2, normalize=True, fit_intercept=True)
model.fit(X_train, y_train)
model.score(X_train, y_train)
>>> 0.8335532812498246
model.score(X_test, y_test)
>>> 0.7511002616979463
In our case, Lasso performs worse than Ridge but does a decent job anyway.
The learned weights and intercept are stored in coef_ and intercept_. If we check coef_ for our model, we will see a lot of zeros: the L1 penalty drives the weights of uninformative features exactly to zero.
model.coef_
>>> array([-0.92676425, -0. , -0. , 0. , -2.53494976, 0. , 0. , 0. , 0. , -0.41875697])
One Lasso-specific property is sparse_coef_, which stores a sparse representation of the learned weights and can help with feature selection.
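For example, a minimal sketch of using the zero/non-zero pattern to select features (the column names come from our X_train DataFrame):

print(model.sparse_coef_)                     # weights as a SciPy sparse matrix
selected = X_train.columns[model.coef_ != 0]  # features with non-zero weights
print(list(selected))                         # candidates to keep for a simpler model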
Parameter tuning for regularization
Both Lasso and Ridge regression in sklearn have built-in parameter tuning functionality implemented in LassoCV and RidgeCV. Cross-validation allows you to find the best regularization parameter.
First, you should initialize the model with a range of regularization alphas to try and a cross-validation strategy:
ridge_cv = linear_model.RidgeCV(alphas=np.linspace(0.1, 10, 1000), normalize=True, fit_intercept=True, cv=None)
lasso_cv = linear_model.LassoCV(alphas=np.linspace(0.01, 1, 1000), normalize=True, fit_intercept=True, cv=None)
For LassoCV, there is an alternative way to specify the range of alphas:
lasso_cv = linear_model.LassoCV(n_alphas=1000, eps=0.01, normalize=True, fit_intercept=True, cv=None)
Here, the alphas will be selected automatically so that the minimal one is 100 times smaller than the maximal one (set eps to adjust this ratio). The algorithm will check 1000 alphas in total along the way.
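Conceptually, this produces a log-spaced grid between eps * alpha_max and alpha_max, where alpha_max is derived from the training data (it is the smallest alpha that zeroes out all weights). A rough illustration of the grid shape; the alpha_max value below is just a placeholder, not what LassoCV actually computes:

# Illustration only: with eps=0.01 the smallest alpha is 100 times smaller than the largest.
alpha_max = 1.0                # placeholder; LassoCV derives the real value from the data
eps = 0.01
alphas = np.geomspace(eps * alpha_max, alpha_max, num=1000)
print(alphas.min() / alphas.max())   # equals eps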
The cv parameter specifies the cross-validation strategy. By default (cv=None), RidgeCV uses leave-one-out cross-validation, but you can also set the number of folds (e.g., cv=10 for ten-fold cross-validation) or explicitly pass train and validation indices. LassoCV behaves differently and uses a five-fold strategy when cv=None. Be aware of this when trying different regularizations with the same parameter set!
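If you want both searches to use the same strategy, you can pass cv explicitly. A sketch with an explicit five-fold split (with only 16 training rows, keep the fold count small; the variable names are illustrative):

# Make both searches use the same explicit 5-fold split instead of their
# different defaults (leave-one-out for RidgeCV, 5-fold for LassoCV).
ridge_cv5 = linear_model.RidgeCV(alphas=np.linspace(0.1, 10, 1000),
                                 normalize=True, fit_intercept=True, cv=5)
lasso_cv5 = linear_model.LassoCV(alphas=np.linspace(0.01, 1, 1000),
                                 normalize=True, fit_intercept=True, cv=5)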
Next, we train our models and check their performance:
ridge_cv.fit(X_train, y_train)
ridge_cv.score(X_test, y_test)
>>> 0.810110110515111
lasso_cv.fit(X_train, y_train)
lasso_cv.score(X_test, y_test)
>>> 0.7698660591624589
The best alpha from the range was selected automatically. You can find its value in alpha_:
ridge_cv.alpha_
>>> 0.24864864864864866
lasso_cv.alpha_
>>> 0.13053239441306014
You may notice that the selected alphas differ from our initial arbitrary guesses and achieve better scores!
After training, you can obtain the prediction as usual:
pred_ridge = ridge_cv.predict(X_test)
pred_lasso = lasso_cv.predict(X_test)
Conclusion
To sum up:
- sklearn contains the main regularized regression algorithms, Ridge and Lasso;
- the parameter alpha controls the strength of regularization;
- it is recommended to use cross-validation to search for the best regularization parameter;
- default values of different algorithms may vary and you should pay attention to them.
Now let's do some practice!