
Regularized regression


Imagine that a hospital hired you to build a regression model that predicts the duration of specific treatments based on patient data. Physicians included blood test results, patient age and gender, moon phase, and air temperature in the data table. Would you use all of these features to build your model? You would probably exclude the last two predictors in the pre-processing stage as irrelevant to the problem at hand. But what would you do if physicians obfuscated the data to protect the patients' privacy? With column names removed and numerical values normalized, you no longer know which features are useful and which are just noise that ended up in the table by mistake. Luckily, regularized regression will help you build a model that still performs well. Let's find out how.

Why regularization

Real-life data are often much more complicated than in the example above. Regularization allows us to disregard unimportant (and even harmful) features automatically. The main idea of regularization is to penalize complex models that pay attention to many features. This attention is often redundant: if two models make similar errors on the training data, the simpler one will usually perform better on unseen test data.

In other words, simple models are less prone to overfitting — memorizing the noise.

We build a linear regression model by finding a set of weights to minimize an error (e.g. a mean squared error) on the training set.

$$\min_w MSE = \min_w \frac{1}{N} \sum_{i=0}^{N} (X_i w - y_i)^2$$

We can measure the complexity of this model by the magnitude of its weights $w$. Consider a case when our target variable does not depend on the features. The best predictor here would be a constant mean predictor, and we do not need the weights at all! However, if you try to fit an ordinary least squares regression, it will do its best to memorize the noise and perform exceptionally poorly on future data. On the other hand, if some columns bear useful signals, we would like the model to have non-zero weights for them.

In this case, the mean predictor will not be an optimal choice — this is an example of underfitting.
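
To see this concretely, here is a minimal sketch (assuming NumPy, scikit-learn, and a synthetic dataset, none of which the text prescribes) where the target does not depend on the features at all, yet ordinary least squares still assigns a non-zero weight to every noise column:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# 100 samples, 20 features of pure noise; the target does not depend on them
X = rng.normal(size=(100, 20))
y = rng.normal(loc=5.0, scale=1.0, size=100)

ols = LinearRegression().fit(X, y)

# Every learned weight is numerically non-zero: OLS memorizes random
# fluctuations in the training data instead of falling back to the mean.
print(np.count_nonzero(ols.coef_))   # 20 -- no weight is exactly zero
print(np.abs(ols.coef_).max())       # the largest weight magnitude
```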

In regularization, we want to find a trade-off between weight magnitudes and training set error. This trade-off is usually denoted in formulas as $\alpha$ or $\lambda$. The higher it is, the stronger the regularization we apply. With $\alpha = 0$, we obtain an ordinary least squares regression.

Ridge regression

One way to apply regularization is to penalize the squared weights of our model. This method is called Ridge regression. Here, we minimize not only the mean squared error of the model but also the sum of squares of its coefficients:

$$\min_w \frac{1}{N} \sum_{i=0}^{N} (X_i w - y_i)^2 + \alpha \sum_{j=0}^{k} w_j^2$$

The Ridge regression task has a closed-form solution analogous to that of ordinary least squares regression ($\mathbf{X}^\top$ denotes the matrix transpose, $\mathbf{I}$ denotes an identity matrix with ones on the main diagonal and zeros elsewhere, and $\mathbf{X}^{-1}$ denotes the matrix inverse):

$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1}\mathbf{X}^\top \mathbf{y}$$

Ridge regression is also known as linear regression with Tikhonov or L2-norm regularization. It is effective in reducing overfitting. However, the resulting coefficients will always be non-zero, even if the true model is sparse. You can see how $\alpha$ affects the weight magnitudes $w$ in an example model: the larger $\alpha$ is, the more strongly the regularization pushes the coefficients toward zero.

[Figure: Ridge regression model coefficients $w$ (Y axis) versus the regularization parameter $\alpha$ (X axis).]
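
As a quick check of the formula above, the sketch below (again assuming NumPy, scikit-learn, and synthetic data) computes the closed-form solution directly and compares it with scikit-learn's Ridge estimator; it also shows how larger values of $\alpha$ shrink the coefficients:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

alpha = 1.0

# Closed-form Ridge solution: w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# scikit-learn's Ridge solves the same problem numerically
# (fit_intercept=False so the estimator matches the formula above)
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(w_closed, ridge.coef_))   # True: the weights agree

# Larger alpha shrinks all coefficients toward zero, as in the plot above
for a in (0.1, 10.0, 1000.0):
    print(a, np.abs(Ridge(alpha=a, fit_intercept=False).fit(X, y).coef_).max())
```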

Lasso regression

Another approach to regularize a regression model is called Lasso (Least Absolute Shrinkage and Selection Operator). It penalizes the sum of absolute values of weights:

$$\min_w \frac{1}{N} \sum_{i=0}^{N} (X_i w - y_i)^2 + \alpha \sum_{j=0}^{k} |w_j|$$

Lasso regression can produce sparse solutions, setting some of the learned coefficients exactly to zero. Thus, you can use this model for feature selection as well: features that receive zero coefficients can be filtered out before training more complex models. However, you should be careful when selecting $\alpha$: if its value is too large, all the weights will be zeroed. Also, if several features are highly correlated, Lasso arbitrarily keeps only one of them.

[Figure: Lasso regression model coefficients $w$ (Y axis) versus the regularization parameter $\alpha$ (X axis).]

Unlike Ridge and ordinary least squares, Lasso regression cannot be solved in closed-form and must be optimized numerically.
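
In practice you would rely on a library solver rather than implement the optimization yourself. The sketch below (an illustrative setup with synthetic data, where only the first two of ten features carry signal, and an arbitrarily chosen $\alpha$) shows how Lasso zeroes out the noise features:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Ten features, but only the first two actually influence the target
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(2))

# Features whose coefficients are exactly zero can be dropped
selected = np.flatnonzero(lasso.coef_ != 0)
print(selected)   # typically [0 1]: the noise features are zeroed out
```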

How to choose $\alpha$?

For regularized regression, $\alpha$ is the only hyperparameter to tune. One common way to choose it is to train several models with different $\alpha$ values and see which one performs best on unseen validation data. Another way is to run a cross-validation scheme and aggregate the per-fold results, for example by averaging the validation scores or by voting.
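
For example, scikit-learn provides cross-validated estimators that perform this search automatically; the candidate $\alpha$ grids and synthetic data in the sketch below are arbitrary choices, not prescribed values:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Both estimators try every candidate alpha with cross-validation
# and keep the value that gives the best average validation score.
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X, y)
lasso = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5).fit(X, y)

print(ridge.alpha_, lasso.alpha_)   # the selected regularization strengths
```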

Conclusion

  • Regularization prevents overfitting by penalizing the model’s complexity.
  • Ridge regression penalizes the sum of squared weights and keeps all the weights non-zero.
  • Lasso regression penalizes the sum of absolute values of weights and sets some of the weights to zero.
  • You can select the regularization hyperparameter $\alpha$ by trying different values and comparing the model’s performance on the validation set.