
In this topic, you'll get familiar with one of the most popular and interpretable metrics in regression evaluation — mean absolute error (MAE), look at its pros and cons, and perform a sample MAE calculation on a synthetic dataset.

MAE

Let's say $Y = \{y_1, y_2, \dots, y_n\}$ is the set of true target values, and $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n\}$ is the set of predicted target values. Then MAE, the average of the absolute errors, can be introduced as follows:

$$\text{MAE}(Y, \hat{Y}) = \frac{1}{n}\sum_{i = 1}^n|y_i - \hat{y}_i| = \frac{|y_1 - \hat{y}_1| + |y_2 - \hat{y}_2| + \dots + |y_n - \hat{y}_n|}{n}$$

Because the absolute value of each error is taken, MAE is always non-negative. MAE preserves the units of the data under analysis (the score uses the same scale as the underlying data), and this property is known as scale dependency. Scale dependency makes the scores easy to interpret, but scores computed on datasets with different scales can't be compared directly. Since MAE doesn't take the sign of the errors into account (only their absolute values are used), it can't tell you whether the model tends to underestimate or overestimate the data: the direction of the model's bias is invisible to this metric.

This metric is not very sensitive to outliers, which is useful when we don't want large errors to dominate the score. The errors are measured linearly, and both small and large errors contribute proportionally towards the final score.
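To make the formula concrete, here's a minimal NumPy sketch; the function name `mae` and the toy arrays are our own, chosen purely for illustration:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average of |y_i - y_hat_i|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# A tiny example: the score is in the same units as the targets.
print(mae([1.0, 2.0, 3.0], [1.5, 1.5, 2.0]))  # (0.5 + 0.5 + 1.0) / 3 ≈ 0.667
```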

A minor elaboration on the outlier insensitivity

Let's illustrate the outlier insensitivity with a small example. We have a set of ground-truth values, $y$, and two sets of predictions made by two models, $a_1$ and $a_2$:

| $y$ | $a_1$ | $a_2$ |
|----:|------:|------:|
|   1 |     2 |     4 |
|   2 |     1 |     5 |
|   3 |     2 |     6 |
|   4 |     5 |     7 |
| 100 |     6 |    13 |
|   6 |     7 |    10 |

Let's calculate the mean squared error for $a_1$ and $a_2$:

$$\text{MSE}(a_1) = 1473.5, \qquad \text{MSE}(a_2) \approx 1270.17$$

MSE reports better performance for the second model. The fifth sample in the ground truth is an outlier, and MSE favors the model that predicts the outlier more closely, at the cost of performance on the other samples. We can observe that $a_1$ predicts the regular samples more closely than $a_2$: $a_2$ is further off on the regular samples but models the outlier better than $a_1$.

Calculating MAE for $a_1$ and $a_2$:

$$\text{MAE}(a_1) = 16.5, \qquad \text{MAE}(a_2) \approx 17.17$$

In this case, the first model shows better results, since MAE doesn't single out the outlier: an error of a given size counts the same whether it occurs on the outlier or on a regular sample.
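These numbers are easy to reproduce in code; here's a short NumPy sketch (the variable names are ours) that computes both metrics for the table above:

```python
import numpy as np

y  = np.array([1, 2, 3, 4, 100, 6], dtype=float)   # ground truth (with an outlier)
a1 = np.array([2, 1, 2, 5, 6, 7], dtype=float)     # model 1 predictions
a2 = np.array([4, 5, 6, 7, 13, 10], dtype=float)   # model 2 predictions

mse = lambda y, p: np.mean((y - p) ** 2)
mae = lambda y, p: np.mean(np.abs(y - p))

print(mse(y, a1), mse(y, a2))  # 1473.5, ~1270.17 -> MSE prefers a2
print(mae(y, a1), mae(y, a2))  # 16.5,   ~17.17   -> MAE prefers a1
```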

MAE depends on the values predicted by the model and on the dataset. The dataset is a constant, and the predictions depend on the parameters of the model. Suppose our model is described as $f_\omega(x) = \omega \cdot x$, where $\omega$ is the only parameter. That means we can make MAE a function of the model parameters:

$$\text{MAE}(\omega) = \frac{1}{n}\sum_{i = 1}^n |y_i - f_{\omega}(x_i)|$$

When computing the derivative, each absolute-value term contributes either $-1$ or $1$, and MAE is not differentiable at $y_{\text{pred}} = y_{\text{true}}$, which might pose a challenge in cases where derivatives are required for training. It's still possible to optimize MAE, it's just more complicated compared with other metrics (for example, subgradients or smooth approximations can be used, but we won't cover them in depth in this topic).
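As one illustration of such a workaround, here's a sketch of subgradient descent for a one-parameter model $f_\omega(x) = \omega x$; the toy data, step size, and iteration count are arbitrary choices of ours:

```python
import numpy as np

# Toy data (arbitrary, for illustration): model f_w(x) = w * x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 7.0, 8.0])

def mae_subgradient(w):
    # For each term |w*x_i - y_i|, a valid subgradient w.r.t. w is
    # sign(w*x_i - y_i) * x_i; np.sign returns 0 at the kink, which
    # lies inside the subdifferential [-x_i, x_i].
    return np.mean(np.sign(w * x - y) * x)

w = 0.0
for _ in range(1000):
    w -= 0.01 * mae_subgradient(w)  # fixed-step subgradient descent

print(w)  # hovers near the MAE-minimizing value (w = 2 for this data)
```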

Lower MAE values indicate better model performance. You might ask which loss should be considered too big, and the answer boils down to the particular problem scenario. For example, let's say we are predicting human weight from some dataset and use MAE as a loss function. A loss of 0.5 (kg, since the units are preserved) would be much better than a loss of 10. So, the acceptable inaccuracies are determined on a case-by-case basis. As a baseline, you can calculate the MAE for the case when all the predictions are equal to the median of the target values, and then make adjustments based on that and the scenario in question.
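Such a median-baseline check might look like the following sketch (the target array is made up for illustration):

```python
import numpy as np

y = np.array([55.0, 62.0, 70.5, 81.0, 90.0, 68.0])  # e.g., weights in kg

# Baseline: predict the median of the targets for every sample.
# The median minimizes MAE among all constant predictions.
baseline = np.full_like(y, np.median(y))
baseline_mae = np.mean(np.abs(y - baseline))
print(baseline_mae)  # ≈ 9.42 kg; a useful model should beat this number
```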

Calculating the MAE

Let's consider the following dataset with the model being $f(x) = 0.5x$:

| $x$ | $y$  | $\hat{y} = f(x) = 0.5x$ |
|----:|-----:|------------------------:|
|   0 | 0.06 |                     0.0 |
|   1 | 0.21 |                     0.5 |
|   2 | 0.80 |                     1.0 |
|   3 | 0.16 |                     1.5 |
|   4 | 0.24 |                     2.0 |
|   5 | 0.18 |                     2.5 |

$$\text{MAE} = \frac{1}{6}\left( |0.06 - 0| + |0.21 - 0.5| + |0.80 - 1| + |0.16 - 1.5| + |0.24 - 2| + |0.18 - 2.5| \right) = 0.995$$
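You can verify this result with scikit-learn's `mean_absolute_error`, which implements the same formula:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([0.06, 0.21, 0.80, 0.16, 0.24, 0.18])
y_hat = 0.5 * x  # the model f(x) = 0.5x

print(mean_absolute_error(y, y_hat))  # ≈ 0.995
```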

Let's review MAE in a more general setting and plot it as a function of the model parameter $\omega$ (we still consider our model to be $f_\omega(x) = \omega \cdot x$):

$$\text{MAE}(\omega) = \frac{1}{6}\left[0.06 + |0.21 - \omega| + |0.80 - 2\omega| + |0.16 - 3\omega| + |0.24 - 4\omega| + |0.18 - 5\omega|\right]$$

Our plot will look like this:

[Figure: a graph depicting the behaviour of MAE($\omega$) as defined above]

From the plot above, we can tell that MAE is a piecewise-linear function (a function composed of multiple linear segments) of the parameter $\omega$ with a unique global minimum, and we can see that the errors contribute proportionally towards the score.
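If you'd like to reproduce the plot yourself, a short matplotlib sketch (the plotting range is our choice) could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([0.06, 0.21, 0.80, 0.16, 0.24, 0.18])

# Evaluate MAE(w) on a grid of parameter values.
omegas = np.linspace(-0.5, 1.0, 500)
mae_values = [np.mean(np.abs(y - w * x)) for w in omegas]

plt.plot(omegas, mae_values)
plt.xlabel(r"$\omega$")
plt.ylabel(r"MAE$(\omega)$")
plt.title("MAE as a piecewise-linear function of the model parameter")
plt.show()
```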

Conclusion

Here's what you need to know about MAE:

  • MAE is relatively insensitive to outliers in the data.

  • Lower MAE scores indicate better model performance, but a score of 0 is often unattainable.

  • MAE is not differentiable where a prediction exactly matches the true value, so differentiating it is less straightforward than with some other metrics.

  • MAE preserves the units of the underlying data and is easy to interpret, but scores can't be compared across datasets with different scales.

  • MAE grows linearly, with both large and small errors contributing proportionally to the final score.
