
In today's topic, we will take a look at feature scaling, which is one of the most important steps in data preprocessing. It has a direct impact on the performance of many ML algorithms. Feature scaling usually transforms the features to lie in a predefined range, e.g., [-1, 1], or to possess certain statistical characteristics, such as zero mean.

Why do we need feature scaling in the first place?

Generally speaking, feature scaling affects the algorithms where distance or some similarity measure factors in, for instance, clustering or K-nearest neighbors. Other examples of algorithms that benefit from scaling include, but are not limited to, logistic regression, principal component analysis, support vector machines, and gradient descent.

Let's consider one of the most important methods in optimization: gradient descent. Gradient descent finds a minimum of a differentiable function and is widely used when training neural networks. Suppose there are two features: one lies in the [0, 1] range, and the other in the [100, 1000] range. During the derivative calculation, the feature with the larger range is likely to produce vastly larger derivatives than the feature with the smaller range, which leads to unequal weight updates and introduces bias, even if both features are equally important. Thus, scaled features lead to faster convergence, while unscaled ones slow the convergence down, as illustrated below:

Two examples of gradient descent on the unscaled and scaled data
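To make the effect concrete, here is a minimal NumPy sketch (not taken from the figure above; the feature ranges, sample size, and true weights are arbitrary assumptions) that compares the gradient of a mean squared error loss on raw and on Min-Max-scaled features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: one feature roughly in [0, 1], the other in [100, 1000]
X = np.column_stack([rng.uniform(0, 1, 200),
                     rng.uniform(100, 1000, 200)])
w_true = np.array([2.0, 3.0])
y = X @ w_true + rng.normal(0, 0.1, size=200)

def mse_gradient(X, y, w):
    """Gradient of the mean squared error of a linear model at weights w."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

w0 = np.zeros(2)
print("Gradient on raw features:   ", mse_gradient(X, y, w0))

# Min-Max scale each column to [0, 1] and compute the gradient again
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print("Gradient on scaled features:", mse_gradient(X_scaled, y, w0))
```

On the raw data, the gradient component for the wide-range feature is orders of magnitude larger than the other one, so no single learning rate suits both weights; after scaling, the two components are of comparable size.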

Here is another example of an algorithm that is affected by feature scaling: K-nearest neighbors (KNN). KNN is a supervised algorithm that operates on the idea that similar points are close to each other. Suppose we keep the same two features with the same ranges as in the gradient descent example. Since KNN relies on the Euclidean distance between a test point p and a training point q over their n features, which is defined as:

\sqrt{\sum_{i=1}^n (p_i - q_i)^2},

where p_i and q_i are the i-th feature values of the test point and the training point, respectively, one can observe that the feature with the smaller range becomes insignificant: the distance assigns a higher weight to the variable with the larger magnitude, so the algorithm relies almost solely on the second feature.

Suppose there are 2 observations: (1, 900) and (0.5, 700). The Euclidean distance between these observations is

\sqrt{(1 - 0.5)^2 + (900 - 700)^2} \approx 200

We can see that the second feature contributes almost all of the resulting distance, while the first feature practically doesn't factor in.
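The same numbers can be checked in code; the sketch below also Min-Max scales both points, assuming the feature ranges [0, 1] and [100, 1000] from the example above, to show that both features contribute to the distance after scaling:

```python
import numpy as np

p = np.array([1.0, 900.0])
q = np.array([0.5, 700.0])

# Per-feature squared contributions to the Euclidean distance
print((p - q) ** 2)                      # [0.25, 40000.0]
print(np.sqrt(np.sum((p - q) ** 2)))     # ~200.0006

# Min-Max scale both points using the assumed feature ranges:
# feature 1 in [0, 1], feature 2 in [100, 1000]
lo = np.array([0.0, 100.0])
hi = np.array([1.0, 1000.0])
p_scaled = (p - lo) / (hi - lo)
q_scaled = (q - lo) / (hi - lo)

print((p_scaled - q_scaled) ** 2)                   # [0.25, ~0.049] -- both features matter now
print(np.sqrt(np.sum((p_scaled - q_scaled) ** 2)))  # ~0.547
```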

However, there are a few machine learning algorithms that do not benefit from feature scaling, such as decision trees and tree-based ensemble methods (e.g., random forests), since they are scale-invariant. Probabilistic models such as Naive Bayes are also unaffected by feature scaling.

A brief note on definitions

There is some terminological confusion when it comes to the two most popular approaches to scaling: normalization and standardization. In this topic, normalization (also sometimes called Min-Max scaling) refers to scaling the features, or the columns, into a specific range, most often [0, 1], which is the behavior of the MinMaxScaler in the sklearn.preprocessing module.

When speaking about scaling, some authors use normalization in the sense of transforming an individual sample, or row, to a unit norm, which corresponds to the Normalizer from the sklearn.preprocessing module.

Standardization, in the current context, refers to rescaling a feature so that it has a mean of 0 and a standard deviation of 1, like the standard normal distribution. In some sources, standardization might be referred to as Z-score normalization, but usually, the exact meaning can be derived from the context.

Standardization

The result of standardization is that the features are rescaled so that they have a mean (μ) of 0 and a standard deviation (σ) equal to 1, matching the first two moments of the standard normal distribution. The transformation is given by:

z = \frac{x - \mu}{\sigma}

Let's see how standardization affects the median income feature from the California housing dataset:

Distribution plots for the MedInc feature of the California housing dataset before and after standardization

We can observe that the shape of the distribution is preserved: it is only shifted and rescaled. Standardization keeps the outliers but is less sensitive to them than Min-Max scaling. The feature ranges after the transformation will vary from feature to feature. Typical applications include SVM, logistic regression, and neural networks. Standardization is more widely applicable than normalization.
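Here is a rough sketch of how the numbers behind such a comparison can be reproduced, assuming scikit-learn's built-in fetch_california_housing loader; it standardizes MedInc with StandardScaler and checks the resulting mean and standard deviation:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Load the California housing dataset as a pandas DataFrame
housing = fetch_california_housing(as_frame=True).frame
med_inc = housing[["MedInc"]]                 # median income feature

scaler = StandardScaler()
med_inc_std = scaler.fit_transform(med_inc)   # z = (x - mean) / std, column-wise

print("Before:", med_inc["MedInc"].mean(), med_inc["MedInc"].std())
print("After: ", med_inc_std.mean(), med_inc_std.std())   # ~0 and ~1
```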

Feature normalization

Normalization (Min-Max scaling) is mostly about transforming the features into the [0, 1], or sometimes [-1, 1], range. The transformation is defined as:

X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}

Consider the example of normalizing the HouseAge feature (in years) from the California housing dataset:

Distribution plots for the HouseAge feature of the California housing dataset before and after feature normalization

From the plots above, it can be observed that normalization doesn't change the shape of the original distribution, but simply rescales it to have a minimum of 0 and a maximum of 1.
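A similar sketch for Min-Max scaling (again assuming the built-in fetch_california_housing loader): MinMaxScaler maps HouseAge onto the [0, 1] range:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import MinMaxScaler

housing = fetch_california_housing(as_frame=True).frame
house_age = housing[["HouseAge"]]             # median house age in a block, in years

scaler = MinMaxScaler()                       # default feature_range=(0, 1)
house_age_norm = scaler.fit_transform(house_age)

print("Before:", house_age["HouseAge"].min(), house_age["HouseAge"].max())
print("After: ", house_age_norm.min(), house_age_norm.max())   # 0.0 and 1.0
```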

Scaling to a range is suitable when the following two conditions apply:

  1. The lower and the upper bounds are known, and there are few or no outliers;

  2. The data roughly follows a normal distribution between the lower and the upper bound.

After Min-Max scaling, the mean and the variance will differ across the features. If there are extreme outliers, the majority of the data will end up squeezed into a small sub-range. Normalization is applicable when we want the features to lie in a specific range, e.g., in image preprocessing, where pixel intensities are rescaled to fit within a fixed range such as [0, 1].

Row normalization

There is a different form of normalization, which acts on the individual samples and scales each of them to a unit norm. Unlike the column-wise methods above, row normalization changes the feature distributions. For each sample, a norm is calculated (usually the L^2 norm, but the L^1 or the L^∞ norm could also be used). The L^2 norm is given by:

\lVert x \rVert_{2} = \sqrt{\sum_{i=1}^n x_i^2}

Then, each value in the sample is divided by the norm.

Row normalization could be applied when dealing with sparse data, i.e., data that mostly contains zeroes.
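Here is a minimal sketch of row normalization, done both manually and with sklearn's Normalizer; the toy matrix is an arbitrary example:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 900.0]])

# Manual L2 row normalization: divide each row by its L2 norm
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
print(X / row_norms)                 # first row becomes [0.6, 0.8]

# The same transformation with sklearn's Normalizer (norm="l2" is the default)
print(Normalizer(norm="l2").fit_transform(X))

# Every row now has unit length
print(np.linalg.norm(X / row_norms, axis=1))   # [1. 1.]
```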

Which one should be chosen?

It depends on the dataset and the algorithm in question. It might be a good idea to compare multiple scalers on a case-by-case basis. PCA, for example, greatly benefits from standardization: PCA aims to find the directions that maximize the variance, and features with larger variances skew the components toward them, so it's beneficial for all features to have comparable variance. In general, standardizing the features won't hurt.
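As an illustration of this point, the sketch below runs PCA on synthetic data with two independent features on very different scales (an arbitrary toy example), with and without standardization:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent features on very different scales
X = np.column_stack([rng.normal(0, 1, 500),      # standard deviation ~1
                     rng.normal(0, 100, 500)])   # standard deviation ~100

print(PCA().fit(X).explained_variance_ratio_)
# Almost all the variance is attributed to the large-scale feature

X_std = StandardScaler().fit_transform(X)
print(PCA().fit(X_std).explained_variance_ratio_)
# After standardization, the variance is split roughly evenly
```

Without scaling, nearly all of the explained variance is attributed to the large-scale feature; after standardization, it is split roughly evenly between the two.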

Here is a comparative table of scalers with the main information about each of them:

| Transformation | Range | Mean | Distribution |
| --- | --- | --- | --- |
| Feature normalization | Fixed | Varies | Preserved |
| Standardization | Varies | 0 | Preserved |
| Row normalization | Varies | Varies | Changed, unit norm |

Conclusion

As a result, you are now familiar with the following:

  • Normalization (Min-Max scaling) refers to transforming the data so that it lies in a specified range; it should be applied when outliers don't play a significant role and the data roughly follows a normal or a uniform distribution;

  • Standardization transforms the features such that after the scaling they have a zero mean and a unit variance;

  • Feature scaling is beneficial to many machine learning algorithms and only a few of them are unaffected by the scales.
