In this topic, we will look at more transformation methods available in scikit-learn's sklearn.preprocessing module. The methods discussed here are applicable to features with outliers; they differ in the shape of the resulting data distributions.
Initial setup
The sklearn.preprocessing module provides utilities for feature scaling. We will be using the California housing dataset with 2 out of 8 features, available in the sklearn.datasets module:
import pandas as pd
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing(as_frame=True)
df = data.frame[['MedInc', 'Population']]
Here are the first few rows of our data:
+----+----------+--------------+
| | MedInc | Population |
|----+----------+--------------|
| 0 | 8.3252 | 322 |
| 1 | 8.3014 | 2401 |
| 2 | 7.2574 | 496 |
+----+----------+--------------+
Let's take a look at the descriptive statistics for the two selected features by calling df.describe(). MedInc is the median income in a block group, and Population is the block group population.
Descriptive statistics
+-------+-------------+--------------+
| | MedInc | Population |
|-------+-------------+--------------|
| count | 20640 | 20640 |
| mean | 3.87067 | 1425.48 |
| std | 1.89982 | 1132.46 |
| min | 0.4999 | 3 |
| 25% | 2.5634 | 787 |
| 50% | 3.5348 | 1166 |
| 75% | 4.74325 | 1725 |
| max | 15.0001 | 35682 |
+-------+-------------+--------------+
RobustScaler
RobustScaler is used to scale data with many outliers, or when the algorithm is prone to overfitting, without changing the distribution shape. It relies on the interquartile range (IQR) for scaling. It scales each feature individually by removing the median (centering can be disabled via the with_centering parameter, which defaults to True) and dividing by the range between the first quartile (25th percentile) and the third quartile (75th percentile). These percentiles can be overridden with the quantile_range parameter.
The RobustScaler transformation is given by

x_scaled = (x - median(x)) / (Q3(x) - Q1(x)),

where Q1 and Q3 are the first and third quartiles of the feature.
Here is an example of using the RobustScaler in scikit-learn:
from sklearn.preprocessing import RobustScaler
scaler_robust = RobustScaler(quantile_range=(25, 75))
df_rob = scaler_robust.fit_transform(df)
df_rob = pd.DataFrame(df_rob, columns=df.columns)
The resulting data will have a zero median (though not necessarily a zero mean), and the outliers, if present, will be preserved. You can also give the transformed features unit variance by setting the unit_variance parameter to True.
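As a sanity check, the scaler can be reproduced by hand with the median and IQR formula above. The tiny sample below is made up for illustration and is not taken from the housing data:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Small synthetic sample with one obvious outlier (illustrative values only)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Defaults: with_centering=True, quantile_range=(25.0, 75.0)
scaler = RobustScaler()
x_scaled = scaler.fit_transform(x)

# Manual computation: (x - median) / (Q3 - Q1)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(np.allclose(x_scaled, manual))  # True
```

Note that the outlier is still far from the rest of the data after scaling; RobustScaler only ensures that the median and IQR are no longer dominated by it.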
Non-linear transformations
QuantileTransformer is a robust preprocessing scheme that scales features independently to a uniform distribution, which makes it suitable for dealing with outliers. All data, including the outliers, is mapped to a uniform distribution within the range [0, 1]. QuantileTransformer uses quantile information to spread out the most frequent values, and it changes the distribution shape.
from sklearn.preprocessing import QuantileTransformer
q_transformer = QuantileTransformer(n_quantiles=5000, random_state=0)
df_qt = q_transformer.fit_transform(df)
df_qt = pd.DataFrame(df_qt, columns=df.columns)
You have to specify the n_quantiles parameter, the number of quantiles to be computed; it must be less than or equal to the number of samples (rows) in your dataset.
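To see the bounded output in isolation, here is a minimal sketch on synthetic, heavily skewed data with a few extreme outliers (the sample and the n_quantiles value are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# 995 skewed values plus 5 extreme outliers (synthetic data)
x = np.concatenate([rng.exponential(size=995),
                    [50.0, 80.0, 120.0, 500.0, 1000.0]]).reshape(-1, 1)

# n_quantiles must not exceed the number of samples (here, 1000)
qt = QuantileTransformer(n_quantiles=1000, random_state=0)
x_qt = qt.fit_transform(x)

print(x_qt.min(), x_qt.max())  # both within [0, 1]
```

Even the extreme outliers end up inside [0, 1]; they are simply mapped to the highest quantiles rather than being removed.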
PowerTransformer scales data towards normality by making it more Gaussian-like. The resulting values follow an approximately normal distribution, with a roughly equal number of measurements above and below the mean. There are two variants available: the Box-Cox transform (which applies only to strictly positive values) and the Yeo-Johnson transform (the default, which works on both negative and positive values).
Note that by default, the standardize parameter is set to True, so that the transformed data has zero mean and unit variance. PowerTransformer changes the distribution shape.
from sklearn.preprocessing import PowerTransformer
power_transformer = PowerTransformer()
df_pt = power_transformer.fit_transform(df)
df_pt = pd.DataFrame(df_pt, columns=df.columns)
If 'box-cox' is passed as the method parameter of PowerTransformer, the results for these particular features will be very similar, since both features are strictly positive. We can see that the variance has been stabilized.
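Here is a minimal sketch of the Box-Cox variant on synthetic, strictly positive right-skewed data (the lognormal sample is an assumption for illustration; the housing features would also qualify, since they are positive):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Strictly positive, right-skewed data, as Box-Cox requires (synthetic)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# standardize=True by default, so the output also has
# zero mean and unit variance
pt = PowerTransformer(method='box-cox')
x_pt = pt.fit_transform(x)

print(round(x_pt.mean(), 6), round(x_pt.std(), 6))  # approximately 0 and 1
```

For lognormal data, Box-Cox with a lambda near zero is essentially a log transform, which is why the skew disappears.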
A comparative table of the transformations
+---------------------+----------------------------------------------------------------+-------------------------------+
| Transformation class | The main takeaway                                             | Does the distribution change? |
|---------------------+----------------------------------------------------------------+-------------------------------|
| RobustScaler         | Works well for data with outliers or a skewed distribution    | No                            |
| QuantileTransformer  | Scales features independently to a uniform distribution,      | Yes                           |
|                      | suitable for dealing with outliers                            |                               |
| PowerTransformer     | Rescales the values to a normal distribution                  | Yes                           |
+---------------------+----------------------------------------------------------------+-------------------------------+
Conclusion
In this topic, you learned about:
- Using the QuantileTransformer to transform data with many outliers to a uniform distribution;
- Using power transformations to rescale to a normal distribution;
- Applying RobustScaler to preserve the original distribution.