
In this topic, we will look at two widely adopted metrics for measuring the performance of regression models: the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE). We will consider their main properties and walk through the process of calculating them on a synthetic dataset.

MSE

We will start with the Mean Squared Error (MSE), which is nothing but the average of the squared errors. Suppose $Y = \{y_1, y_2, \dots, y_n\}$ is the set of true target values, and $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n\}$ is the set of predicted target values. Then, we introduce the MSE as follows:

$$\text{MSE}(Y, \hat{Y}) = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2 = \frac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_n - \hat{y}_n)^2}{n}$$

Because the errors are squared, the metric's values are always non-negative. Squaring also squares the units: for example, MSE operates in squared US dollars instead of regular US dollars. That complicates the interpretation of MSE and doesn't allow us to compare datasets with different units.
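As a quick illustration, here is a minimal NumPy sketch of the definition above (the helper name `mse` is our own choice, not a standard API):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # (0.25 + 0 + 1) / 3 = 0.41666...
```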

MSE penalizes models that make considerable errors: because the errors are squared, large errors carry more weight in the final score, while smaller errors tend to diminish, which makes the metric sensitive to outliers. For example, if the difference between the ground truth ($y$) and the predicted label ($\hat{y}$) is 10, then $|y - \hat{y}| = 10 \rightarrow (y - \hat{y})^2 = 100$, so the penalty is 10 times bigger than the error.

But if the difference between the ground truth and the predicted value is 0.1,

$$|y - \hat{y}| = 0.1 \rightarrow (y - \hat{y})^2 = 0.01,$$

the penalty becomes 10 times smaller than the error and practically vanishes.
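To see this sensitivity in numbers, here is a small sketch in which a single large error dominates the whole score:

```python
import numpy as np

errors_small = np.array([0.1, 0.1, 0.1, 0.1, 0.1])     # five small errors
errors_outlier = np.array([0.1, 0.1, 0.1, 0.1, 10.0])  # same, but the last error is large

# MSE is the mean of the squared errors, so the single outlier dominates:
print(np.mean(errors_small ** 2))    # 0.01
print(np.mean(errors_outlier ** 2))  # (4 * 0.01 + 100) / 5 = 20.008
```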

MSE is easily differentiable (it has a continuous derivative), and computing derivatives is essential when training ML models with gradient-based methods.

A note on the MSE derivatives

In general, the full MSE derivative with respect to the prediction vector, $\hat{y}$, takes the following form:

$$\text{MSE}' = \frac{2}{n}\left(\hat{y} - y\right)$$

When training sample by sample, we instead differentiate a single squared-error term with respect to its prediction, $\hat{y}_i$:

$$\frac{\partial (y_i - \hat{y}_i)^2}{\partial \hat{y}_i} = -2(y_i - \hat{y}_i) = 2(\hat{y}_i - y_i)$$

Note the emphasis: since we differentiate a single term rather than the full average, the result is not divided by $n$.
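Here is a minimal sketch of the full gradient in NumPy, together with a finite-difference sanity check (the function name `mse_gradient` is our own):

```python
import numpy as np

def mse_gradient(y_true, y_pred):
    """Gradient of MSE with respect to the prediction vector: (2 / n) * (y_hat - y)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 2.0 / len(y_true) * (y_pred - y_true)

y = np.array([0.06, 0.21, 0.80])
y_hat = np.array([0.0, 0.5, 1.0])
grad = mse_gradient(y, y_hat)

# Sanity check: compare the first component with a finite-difference estimate
eps = 1e-6
bumped = y_hat.copy()
bumped[0] += eps
numeric = (np.mean((y - bumped) ** 2) - np.mean((y - y_hat) ** 2)) / eps
print(grad[0], numeric)  # both approximately -0.04
```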

RMSE and nRMSE

To avoid the huge values MSE can produce, we can take the square root of it. Taking the square root of MSE comes with the advantage of putting the errors on the same scale as the targets, which improves the interpretability of the metric.

This resulting metric is called root mean squared error (RMSE):

$$\text{RMSE}(Y, \hat{Y}) = \sqrt{\text{MSE}(Y, \hat{Y})} = \sqrt{\frac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_n - \hat{y}_n)^2}{n}}$$

It behaves like MSE in the sense that both metrics attain their minimum at the same point: $\sqrt{x}$ increases monotonically, so taking the square root doesn't move the minimum of the sum of squares. RMSE, just like MSE, is also sensitive to outliers in the data.

RMSE is scale-dependent: since we take the root of the previously squared units, the results have the same units as the data being analyzed. To facilitate comparison between datasets or models with different scales, the RMSE score is often normalized, for example, by the average value of the target. The resulting score is referred to as the normalized RMSE (nRMSE),

$$\text{nRMSE} = \frac{\text{RMSE}}{\bar{y}},$$

where

$$\bar{y} = \frac{1}{n}\sum_{i = 1}^n y_i$$
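Following the definitions above, RMSE and nRMSE can be sketched in NumPy as follows (again, the helper names are our own):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: the square root of MSE, in the target's units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def nrmse(y_true, y_pred):
    """RMSE normalized by the average value of the true target."""
    return rmse(y_true, y_pred) / np.mean(np.asarray(y_true, dtype=float))
```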

A calculation example

Let's consider the following dataset with the model being $f(x) = 0.5x$:

| $x$ | $y$ | $\hat{y} = f(x) = 0.5x$ |
|---|---|---|
| 0 | 0.06 | 0.0 |
| 1 | 0.21 | 0.5 |
| 2 | 0.80 | 1.0 |
| 3 | 0.16 | 1.5 |
| 4 | 0.24 | 2.0 |
| 5 | 0.18 | 2.5 |

$$\text{MSE} = \frac{1}{6}\left[(0.06 - 0)^2 + (0.21 - 0.5)^2 + (0.8 - 1)^2 + (0.16 - 1.5)^2 + (0.24 - 2)^2 + (0.18 - 2.5)^2\right] \approx 1.733883$$

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{1.733883} \approx 1.3167$$

For nRMSE, we first calculate the average value of the target and then substitute it into the formula:

$$\bar{y} = \frac{1}{6}\left(0.06 + 0.21 + 0.8 + 0.16 + 0.24 + 0.18\right) = 0.275$$

$$\text{nRMSE} = \frac{1.3167}{0.275} \approx 4.788$$
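We can verify these numbers with a short, self-contained script:

```python
import numpy as np

x = np.arange(6)  # 0, 1, 2, 3, 4, 5
y = np.array([0.06, 0.21, 0.80, 0.16, 0.24, 0.18])
y_hat = 0.5 * x   # the model f(x) = 0.5x

mse_score = np.mean((y - y_hat) ** 2)
rmse_score = np.sqrt(mse_score)
nrmse_score = rmse_score / np.mean(y)

print(mse_score)    # ≈ 1.733883
print(rmse_score)   # ≈ 1.31677
print(nrmse_score)  # ≈ 4.78826
```

In real projects you would typically rely on a library implementation, such as mean_squared_error from sklearn.metrics, rather than hand-rolled helpers.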

[Figure: a graph that demonstrates the behaviour of MSE and RMSE]

If we plot both metrics as functions of the error, we can see that MSE grows quadratically and much faster than RMSE, which grows only linearly with the magnitude of the error.
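For a single observation, MSE contributes the squared error $(y - \hat{y})^2$, while RMSE reduces to the absolute error $|y - \hat{y}|$; a minimal matplotlib sketch of the two curves could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

e = np.linspace(-3, 3, 200)  # a single error y - y_hat
plt.plot(e, e ** 2, label="squared error (MSE grows quadratically)")
plt.plot(e, np.abs(e), label="absolute error (RMSE grows linearly)")
plt.xlabel("error")
plt.ylabel("penalty")
plt.legend()
plt.show()
```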

Conclusion

  • The lower the MSE, RMSE, and nRMSE scores, the higher the quality of the model. However, a score of exactly 0 is not realistic in practice.

  • The discussed metrics are all easily differentiable, which is convenient during the training stage.

  • MSE and RMSE are both sensitive to outliers in the data.

  • MSE squares the units, while RMSE preserves them.

  • We can normalize RMSE by the average value of the target to get a metric that allows us to compare models across different scales or datasets.
