
Additional feature scaling techniques in scikit-learn


In this topic, we will look at more transformation methods available in scikit-learn's sklearn.preprocessing module. The methods discussed here are applicable to features with outliers, with the differences lying in the resulting data distributions.

Initial setup

The sklearn.preprocessing module provides utilities for feature scaling. We will be using the California housing dataset with 2 out of 8 features, available in the sklearn.datasets module:

import pandas as pd
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
df = data.frame[['MedInc', 'Population']]

Here are the first few rows of our data:

+----+----------+--------------+
|    |   MedInc |   Population |
|----+----------+--------------|
|  0 |   8.3252 |          322 |
|  1 |   8.3014 |         2401 |
|  2 |   7.2574 |          496 |
+----+----------+--------------+

Let's take a look at the descriptive statistics for the two selected features by calling df.describe(). MedInc is the median income in a block group, and Population is the block group population.

Descriptive statistics
+-------+-------------+--------------+
|       |      MedInc |   Population |
|-------+-------------+--------------|
| count | 20640       |     20640    |
| mean  |     3.87067 |      1425.48 |
| std   |     1.89982 |      1132.46 |
| min   |     0.4999  |         3    |
| 25%   |     2.5634  |       787    |
| 50%   |     3.5348  |      1166    |
| 75%   |     4.74325 |      1725    |
| max   |    15.0001  |     35682    |
+-------+-------------+--------------+

The feature ranges look like this:

The distribution of the unscaled features

RobustScaler

RobustScaler is used to scale data with many outliers, or when the algorithm is prone to overfitting, without changing the distribution shape. It relies on the interquartile range (IQR). It scales each feature individually by removing the median (centering can be disabled via the with_centering parameter, which defaults to True) and scaling the data by the range between the first quartile (25th percentile) and the third quartile (75th percentile). These percentiles can be overridden with the quantile_range parameter.

The RobustScaler transformation is given by

X_scaled = (X - X_median) / (Q3 - Q1)

Here is an example of using the RobustScaler in scikit-learn:

from sklearn.preprocessing import RobustScaler

scaler_robust = RobustScaler(quantile_range=(25, 75))
df_rob = scaler_robust.fit_transform(df)
df_rob = pd.DataFrame(df_rob, columns=df.columns)
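To see that RobustScaler really applies the formula above, here is a quick sketch that compares its output to a manual median/IQR computation. A small synthetic column with one outlier is used instead of the housing data, for brevity:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# A small synthetic feature with one outlier (stand-in for the housing data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

q1, median, q3 = np.percentile(X, [25, 50, 75])
manual = (X - median) / (q3 - q1)          # (X - X_median) / (Q3 - Q1)

auto = RobustScaler().fit_transform(X)
print(np.allclose(manual, auto))           # True: same formula
```

Note that the outlier (100.0) is still far from the rest of the values after scaling; robust scaling shrinks the bulk of the data but does not clip outliers.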

Distribution plots for before and after RobustScaler on 2 selected features

The resulting data will have a zero median, and the outliers, if present, will be preserved. The transformed features can also be given unit variance by setting the unit_variance parameter to True.
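A short sketch of the unit_variance option: with unit_variance=True, the IQR divisor is adjusted so that a normally distributed feature ends up with a variance close to 1. Here a synthetic standard-normal sample is used for illustration:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 1))           # roughly standard-normal feature

scaled = RobustScaler(unit_variance=True).fit_transform(X)
print(round(float(scaled.std()), 2))       # close to 1 for Gaussian data
```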

Non-linear transformations

QuantileTransformer is a robust preprocessing scheme that scales features independently to a uniform distribution, so it is suitable for dealing with outliers. All data, including the outliers, is mapped to a uniform distribution within the range [0, 1]. QuantileTransformer uses quantile information to spread out the most frequent values, and it changes the distribution shape.

from sklearn.preprocessing import QuantileTransformer

q_transformer = QuantileTransformer(n_quantiles=5000, random_state=0)
df_qt = q_transformer.fit_transform(df)
df_qt = pd.DataFrame(df_qt, columns=df.columns)

The n_quantiles parameter sets the number of quantiles to be computed (the default is 1000). It has to be less than or equal to the number of samples (rows) in your dataset; larger values are reduced to the number of samples.
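A minimal sketch confirming that the transformed values are bounded to [0, 1], even for heavily skewed input (a synthetic lognormal sample is used here instead of the housing data):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))          # skewed data with a long right tail

# n_quantiles must not exceed the number of samples (1000 here)
qt = QuantileTransformer(n_quantiles=1000, random_state=0)
X_qt = qt.fit_transform(X)

print(float(X_qt.min()), float(X_qt.max()))   # bounded to [0, 1]
```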

Distribution plots for before and after QuantileTransformer on 2 selected features

PowerTransformer scales data toward normality by making it more Gaussian-like: the values end up following an approximately normal distribution, with an equal number of measurements above and below the mean. Two variants are available: the Box-Cox transform (which applies only to positive values) and the Yeo-Johnson transform (the default, which works on both negative and positive values).

Note that by default, the standardize parameter is set to True, so the transformed data has zero mean and unit variance. PowerTransformer changes the distribution shape.

from sklearn.preprocessing import PowerTransformer

power_transformer = PowerTransformer()
df_pt = power_transformer.fit_transform(df)
df_pt = pd.DataFrame(df_pt, columns=df.columns)
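To illustrate the standardize=True default, a quick check on a synthetic right-skewed sample (again a stand-in for the housing data): after the transform, the mean is approximately zero and the standard deviation approximately one.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))          # right-skewed feature

pt = PowerTransformer()                    # Yeo-Johnson, standardize=True
X_pt = pt.fit_transform(X)

# standardize=True yields (approximately) zero mean and unit variance
print(round(float(X_pt.mean()), 6), round(float(X_pt.std()), 2))
```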

Distribution plots for before and after PowerTransformer on 2 selected features

If 'box-cox' is passed as the method parameter of PowerTransformer, the resulting distributions for these particular (strictly positive) features will look the same as with Yeo-Johnson. We can see that the variance has been stabilized.
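A sketch of the Box-Cox variant on strictly positive data (lognormal data is used here precisely because it is always positive; for a lognormal sample, the fitted lambda should be close to 0, which corresponds to a log transform):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))          # strictly positive, so Box-Cox applies

bc = PowerTransformer(method='box-cox')
X_bc = bc.fit_transform(X)
print(bc.lambdas_)                         # fitted lambda per feature
```

Passing data with zero or negative values to method='box-cox' raises a ValueError; use the default Yeo-Johnson method in that case.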

A comparative table of the transformations

+----------------------+------------------------------------------------------+-------------------------------+
| Transformation class |                    Main takeaway                     | Does the distribution change? |
|----------------------+------------------------------------------------------+-------------------------------|
| RobustScaler         | Works well for data with outliers or a skewed        | No                            |
|                      | distribution                                         |                               |
| QuantileTransformer  | Scales features independently to a uniform           | Yes                           |
|                      | distribution; suitable for dealing with outliers     |                               |
| PowerTransformer     | Rescales the values to a normal distribution         | Yes                           |
+----------------------+------------------------------------------------------+-------------------------------+

Conclusion

In this topic, you learned about:

  • Using the QuantileTransformer to transform data with many outliers to a uniform distribution;
  • Using power transformations to rescale to a normal distribution;
  • Applying RobustScaler to preserve the original distribution.