In data preprocessing, we aim to transform raw data into a form that helps a machine learning algorithm perform better. Feature scaling is one such technique: it transforms the independent features in a dataset so that the model can interpret them on the same scale. In this topic, we will look at the most popular ways to perform feature scaling with the scikit-learn package.
Initial setup
The scikit-learn library comes with the sklearn.preprocessing module, which provides utilities for feature scaling. We will be using the California housing dataset, available in the sklearn.datasets module, keeping 2 of its 8 features:
import pandas as pd
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing(as_frame=True)
df = data.frame[['MedInc', 'Population']]
Here are the first few rows of our data:
+----+----------+--------------+
| | MedInc | Population |
|----+----------+--------------|
| 0 | 8.3252 | 322 |
| 1 | 8.3014 | 2401 |
| 2 | 7.2574 | 496 |
+----+----------+--------------+
Let's take a look at descriptive statistics for the two selected features by calling df.describe(). MedInc is the median income in the block group, and Population is the block group population.
Descriptive statistics
+-------+-------------+--------------+
| | MedInc | Population |
|-------+-------------+--------------|
| count | 20640 | 20640 |
| mean | 3.87067 | 1425.48 |
| std | 1.89982 | 1132.46 |
| min | 0.4999 | 3 |
| 25% | 2.5634 | 787 |
| 50% | 3.5348 | 1166 |
| 75% | 4.74325 | 1725 |
| max | 15.0001 | 35682 |
+-------+-------------+--------------+
As the table shows, the feature ranges differ drastically: MedInc spans roughly 0.5 to 15, while Population spans 3 to 35,682.
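You can also pull the ranges out directly; a minimal sketch:
# Show the minimum and maximum of each feature
print(df.agg(['min', 'max']))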
A note on fit_transform()
In this topic, we use the .fit_transform() method on the training data. It combines .fit(), which computes the parameters for a particular scaler and saves them as internal state, with .transform(), which applies those parameters to transform the data. To scale the test data, you only need to call .transform(), since the parameters are already available from calling .fit() on the training data.
An important note: you should never scale the entire dataset before splitting it into the train and test sets. Scaling the entire data before the split may result in the leakage of the mean and variance from the test set into the training process. Always split the data before scaling, then use the statistics from the train split to scale the other splits.
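Here is a minimal sketch of the correct order of operations; the 80/20 split and the random_state are our own illustrative choices:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then fit the scaler on the training portion only
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from the train split
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics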
StandardScaler
The StandardScaler assumes the data is normally distributed within each feature and scales it so that the distribution is centered around 0 (with a mean, μ, of 0) and has a standard deviation, σ, of 1. For each feature, the mean and standard deviation are computed, and then every value is transformed as z = (x − μ) / σ. StandardScaler doesn't change the distribution shape.
from sklearn.preprocessing import StandardScaler
scaler_std = StandardScaler()
df_standard = scaler_std.fit_transform(df)
df_standard = pd.DataFrame(df_standard, columns=df.columns)
We can see that now the mean is 0 and the standard deviation is 1 after standardization:
+----+------------+--------+-------+
| | feature | mean | std |
|----+------------+--------+-------|
| 0 | MedInc | 0 | 1 |
| 1 | Population | 0 | 1 |
+----+------------+--------+-------+
To name a few applications: standardization is a necessary preprocessing step for PCA, and it is also applied in clustering, where feature comparison is based on distance measures.
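As a sanity check, you can reproduce the standardization by hand; a minimal sketch (StandardScaler uses the population standard deviation, hence ddof=0):
import numpy as np

# Standardize manually: subtract the column mean, divide by the population std
df_manual = (df - df.mean()) / df.std(ddof=0)
print(np.allclose(df_manual, df_standard))  # expect True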
Scaling features to a range
Feature scaling terminology is somewhat ambiguous. In this topic, we refer to normalization as scaling features into a fixed range (typically [0, 1]) with the MinMaxScaler and the MaxAbsScaler, which operate on features, or columns. However, the term "normalization" is also used in the context of normalizing the samples, or rows, with the Normalizer class.
The MinMaxScaler works by transforming the values into a specified range. By default, it rescales the data into the [0, 1] range using x_scaled = (x − min) / (max − min), but that can be changed by passing the required range to the feature_range parameter, as shown in the sketch below. The MinMaxScaler does what we previously defined as normalization. Neither MinMaxScaler nor MaxAbsScaler affects the distribution shape.
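For instance, here is a minimal sketch of rescaling into [-1, 1]; the target range is chosen purely for illustration:
from sklearn.preprocessing import MinMaxScaler

# Squeeze both features into [-1, 1] instead of the default [0, 1]
scaler_range = MinMaxScaler(feature_range=(-1, 1))
df_range = scaler_range.fit_transform(df)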
Below is the basic usage of the MinMaxScaler class with the default range:
from sklearn.preprocessing import MinMaxScaler
scaler_minmax = MinMaxScaler()
df_minmax = scaler_minmax.fit_transform(df)
df_minmax = pd.DataFrame(df_minmax, columns=df.columns)
We can see that our output now ranges between 0 and 1:
+----+------------+-------+-------+
| | feature | min | max |
|----+------------+-------+-------|
| 0 | MedInc | 0 | 1 |
| 1 | Population | 0 | 1 |
+----+------------+-------+-------+
The MaxAbsScaler scales each feature by its maximum absolute value. It transforms each feature individually so that the maximal absolute value of that feature in the training set becomes 1.0.
from sklearn.preprocessing import MaxAbsScaler
scaler_maxabs = MaxAbsScaler()
df_maxabs = scaler_maxabs.fit_transform(df)
df_maxabs = pd.DataFrame(df_maxabs, columns=df.columns)
The maximum of the transformed data is 1 and the minimum is 0, since the feature values here are strictly positive. If the data had only negative values, it would be scaled to a minimum of -1.0 and a maximum of 0; if it contained both negative and positive values, the result would lie between -1 and 1. MaxAbsScaler is well suited to sparse data, where most of the values are zeros: it doesn't shift or center the data, so the sparsity is preserved. One possible application is time series analysis.
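Here is a minimal sketch of the sign behavior and the sparsity preservation; the toy matrix is made up for illustration:
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A tiny sparse matrix with mixed signs; most entries are zero
X = csr_matrix(np.array([[ 1.0, -2.0,  0.0],
                         [ 0.0,  4.0,  0.0],
                         [-5.0,  0.0,  3.0]]))

X_scaled = MaxAbsScaler().fit_transform(X)
print(X_scaled.toarray())  # each column divided by its maximum absolute value
print(X_scaled.nnz)        # still 5 stored values: sparsity is preserved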
Normalizer
The Normalizer rescales each sample (each row) individually so that its norm (ℓ1, ℓ2, or max) equals 1, with ℓ2 being the default. The counter-intuitive part here is that the rows, not the columns, are scaled to have a unit norm. With the default ℓ2 norm, the squares of the elements in each row sum to 1; for example, the row [3, 4] has the ℓ2 norm 5 and becomes [0.6, 0.8]. For non-negative data such as ours, the values end up between 0 and 1. The Normalizer changes the distribution shape.
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
df_norm = normalizer.fit_transform(df)
df_norm = pd.DataFrame(df_norm, columns=df.columns)
The Normalizer helps to avoid gradient explosion during training, since the features decrease in range and magnitude. It is also widely used in information retrieval and clustering.
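You can verify that every row now has a unit ℓ2 norm; a minimal sketch:
import numpy as np

# The Euclidean norm of each row should be (numerically) equal to 1
row_norms = np.linalg.norm(df_norm.to_numpy(), axis=1)
print(np.allclose(row_norms, 1.0))  # expect True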
A comparative table of the transformations
| The transformation class in sklearn.preprocessing | The main takeaway | Does the distribution change? |
|---|---|---|
| StandardScaler | Less sensitive to outliers; one of the most widely used transformations | No |
| MinMaxScaler, MaxAbsScaler | Suitable for data without outliers; transforms the features to lie in a certain range | No |
| Normalizer | Rescales each sample individually to have a unit norm | Yes |
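To see the three behaviors side by side, here is a minimal sketch on a made-up two-feature matrix:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

X = np.array([[ 1.0, 10.0],
              [ 5.0, 20.0],
              [10.0, 30.0]])  # two toy features

print(StandardScaler().fit_transform(X))  # each column: mean 0, std 1
print(MinMaxScaler().fit_transform(X))    # each column squeezed into [0, 1]
print(Normalizer().fit_transform(X))      # each row scaled to unit l2 norm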
Conclusion
We have covered some of the most popular techniques for feature scaling using the scikit-learn library. It is an important step of data preprocessing when dealing with machine learning algorithms such as:
Distance-based algorithms (K-nearest neighbors, K-Means)
Clustering and principal component analysis (PCA)
Gradient-based algorithms, logistic regression, and SVMs.
Rule-based algorithms, such as decision trees, are among the few that are not affected by scaling.