In today's topic, we will dive into two popular metrics for evaluating regression model performance — mean absolute percentage error (MAPE) and symmetric mean absolute percentage error (sMAPE) — weigh their advantages against their drawbacks, and work through calculations of both metrics on a synthetic dataset.
MAPE
Let's say that $A = \{a_1, a_2, \ldots, a_n\}$ is the set of ground truth values, and $F = \{f_1, f_2, \ldots, f_n\}$ is the set of predictions. Then, the mean absolute percentage error will look like this:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{a_i - f_i}{a_i} \right|$$
Lower MAPE scores indicate better performance.
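As a quick reference, here is a minimal NumPy sketch of the formula above (the function name and signature are our own, not taken from any particular library):

```python
import numpy as np

def mape(a, f):
    """Mean absolute percentage error, in percent; assumes no target is zero."""
    a, f = np.asarray(a, dtype=float), np.asarray(f, dtype=float)
    return 100.0 * np.mean(np.abs((a - f) / a))
```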
MAPE has a significant advantage — it is a scale-independent metric (as the name implies, the error is expressed in percentages), which means that the scale of the data doesn't affect the final score, making it suitable for comparisons between different datasets or models. Contrary to what the name might suggest, however, MAPE can exceed 100%, which we'll discuss below. Finally, MAPE is interpretable and easy to explain.
However, MAPE is known to suffer from various drawbacks:
- The calculation will run into division by zero when a target equals zero and will produce very large scores when the targets are smaller than 1. So, MAPE's possible range isn't from 0% to 100%, but from 0% to infinity, although the scores can be capped at a threshold. For example, we can report $\min(\mathrm{MAPE}, 100\%)$, with a score of $100\%$ indicating a very inaccurate prediction, and we simply don't care for any score that might exceed $100\%$. The lack of an upper bound might hinder MAPE's usage when the values we are dealing with are close to 0, since we won't be able to tell the quality of the prediction (see the snippet after this list).
- MAPE treats undershooting and overshooting the ground truths differently: when MAPE is minimized, it might lead to a preference for a model that undershoots the ground truths (more on that below).
- MAPE does not have a continuous derivative, which might complicate the gradient computations at the training stage.
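To illustrate the first point, here is a small sketch (with made-up numbers) showing how targets below 1 inflate the score even when the absolute errors are modest:

```python
import numpy as np

a = np.array([0.001, 0.01, 1000.0])  # ground truths, two of them well below 1
f = np.array([0.011, 0.02, 1010.0])  # predictions; absolute errors: 0.01, 0.01, 10
print(100.0 * np.abs((a - f) / a))   # per-point errors: [1000. 100. 1.] percent
```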
To demonstrate the second point, suppose the prediction is fixed at $f = 150$, and the real target in the first case is $a = 100$ (with an absolute error of 50). Then,

$$\mathrm{MAPE} = \left| \frac{100 - 150}{100} \right| \cdot 100\% = 50\%.$$

Alternatively, let's now say that the real target is $a = 200$ and the prediction is still fixed at $f = 150$ (the absolute error here is also 50):

$$\mathrm{MAPE} = \left| \frac{200 - 150}{200} \right| \cdot 100\% = 25\%.$$
The trick here lies in the MAPE minimization, which will make the predictions biased low: if the real targets have an equal chance of being either 200 or 100, the expected MAPE will be minimized by the lower prediction value. We will skip the full derivation for brevity.
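As a sanity check of that claim, the sketch below sweeps a grid of candidate predictions for targets that are equally likely to be 100 or 200 and locates the minimizer of the expected MAPE (the variable names are ours):

```python
import numpy as np

# Targets are 100 or 200 with equal probability; sweep a grid of predictions.
preds = np.linspace(50, 250, 2001)
expected = 0.5 * np.abs(100 - preds) / 100 + 0.5 * np.abs(200 - preds) / 200
print(preds[np.argmin(expected)])  # 100.0 -- the lower target, not the midpoint
```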
A note on the MAPE derivative
If we compute the first derivative with respect to a single prediction $f_i$, keeping the definitions above, the result will be the following:

$$\frac{\partial\, \mathrm{MAPE}}{\partial f_i} = \frac{100\%}{n} \cdot \frac{\operatorname{sgn}(f_i - a_i)}{|a_i|}, \quad f_i \neq a_i$$
This can be tied back to MAPE preferring predictions that are smaller than the actual values: if the prediction underestimates the actual value ($f_i < a_i$), then increasing $f_i$ by one unit will reduce the MAPE by $\frac{100\%}{n\,|a_i|}$, and, on the other hand, if $f_i$ is reduced by one unit, the MAPE will be increased by $\frac{100\%}{n\,|a_i|}$. Because this per-unit rate is inversely proportional to $|a_i|$, errors against small targets are penalized more heavily than equally sized errors against large targets, which is what pushes the minimizing predictions toward the low end.
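A finite-difference check of this derivative, under the definitions above, might look as follows:

```python
import numpy as np

def mape(a, f):
    return 100.0 * np.mean(np.abs((a - f) / a))

a = np.array([100.0, 200.0, 50.0])
f = np.array([150.0, 150.0, 60.0])
n, i, eps = len(a), 0, 1e-6

# Central finite difference with respect to f[i].
f_hi, f_lo = f.copy(), f.copy()
f_hi[i] += eps
f_lo[i] -= eps
numeric = (mape(a, f_hi) - mape(a, f_lo)) / (2 * eps)

# Analytic form: (100% / n) * sgn(f_i - a_i) / |a_i|.
analytic = (100.0 / n) * np.sign(f[i] - a[i]) / abs(a[i])
print(numeric, analytic)  # both ~0.3333, since f[0] overshoots a[0]
```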
sMAPE
Now, let's take a look at a metric that addresses some of the issues with MAPE — the symmetric mean absolute percentage error. Keeping the $A$ and $F$ definitions from above, sMAPE was originally introduced as follows:

$$\mathrm{sMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{2\,|f_i - a_i|}{a_i + f_i}$$
However, a widespread sMAPE modification is used in practice, with absolute values taken in the denominator so as to avoid cancellation and negative scores. Also, the multiplication by 2 is dropped (so that the sMAPE lies in the 0%-100% interval instead of 0%-200%):

$$\mathrm{sMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|f_i - a_i|}{|a_i| + |f_i|}$$
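A minimal NumPy sketch of this modified variant (again, the function name is our own) could look like this:

```python
import numpy as np

def smape(a, f):
    """Modified sMAPE in the 0%-100% range; undefined if a_i = f_i = 0."""
    a, f = np.asarray(a, dtype=float), np.asarray(f, dtype=float)
    return 100.0 * np.mean(np.abs(f - a) / (np.abs(a) + np.abs(f)))
```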
sMAPE fixes MAPE's infinite upper bound, but it still treats higher and lower predictions differently.
Suppose the real target is $a = 100$ and the prediction is $f = 110$. Then,

$$\mathrm{sMAPE} = \frac{|110 - 100|}{100 + 110} \cdot 100\% \approx 4.8\%.$$

Alternatively, let's now say that the real target is $a = 100$ and the prediction is $f = 90$:

$$\mathrm{sMAPE} = \frac{|90 - 100|}{100 + 90} \cdot 100\% \approx 5.3\%.$$
In this case, higher predictions actually result in a lower error score.
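Using the smape sketch from above, the asymmetry is easy to reproduce:

```python
print(smape([100], [110]))  # ~4.76 -- overshooting the target by 10
print(smape([100], [90]))   # ~5.26 -- undershooting it by the same 10
```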
One of the sMAPE limitations is that if either the prediction or the target is equal to zero, the score will automatically hit its maximum. In turn, this means that the model's quality can't be assessed when dealing with predictions and targets close to or equal to zero — we simply won't see the difference between good and bad performance when everything is maxed out, regardless of the actual prediction quality. Beyond that, sMAPE suffers from drawbacks similar to MAPE's, except for the upper bound issue.
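The same sketch shows the zero-target behavior: any nonzero prediction for a zero target scores the maximum, no matter how close it is:

```python
print(smape([0.0], [0.5]))   # 100.0
print(smape([0.0], [1e-6]))  # 100.0 -- nearly perfect, yet still maxed out
```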
Calculation example
Let's calculate the MAPE and the sMAPE for the following synthetic dataset, with the model being described as $f(x) = \frac{x}{2}$:
| $x_i$ | $a_i$ (target) | $f_i = f(x_i)$ (prediction) |
|---|---|---|
| 0 | 0.06 | 0.0 |
| 1 | 0.21 | 0.5 |
| 2 | 0.80 | 1.0 |
| 3 | 0.16 | 1.5 |
| 4 | 0.24 | 2.0 |
| 5 | 0.18 | 2.5 |
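Reading the columns above as $x$, the target $a$, and the prediction $f(x) = x/2$, a short script to check the arithmetic might be:

```python
import numpy as np

x = np.arange(6)
a = np.array([0.06, 0.21, 0.80, 0.16, 0.24, 0.18])  # ground truth
f = x / 2                                            # model predictions

mape_score = 100.0 * np.mean(np.abs((a - f) / a))
smape_score = 100.0 * np.mean(np.abs(f - a) / (np.abs(a) + np.abs(f)))
print(round(mape_score, 1), round(smape_score, 1))  # roughly 520.5 and 66.3
```

Note how MAPE blows far past 100% because every target is below 1, while sMAPE stays within its bound.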
The more general setting of the $f(x) = \frac{x}{2}$ model could be written as $f_w(x) = w \cdot x$, where $w$ represents the model parameter, and a single prediction is $f_i = w \cdot x_i$ (the table above corresponds to $w = 0.5$).
Let's plot MAPE and sMAPE with respect to $w$:
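The plot can be reproduced with a sketch along these lines (the grid range for $w$ is chosen arbitrarily):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(6)
a = np.array([0.06, 0.21, 0.80, 0.16, 0.24, 0.18])

ws = np.linspace(-0.5, 1.5, 2001)
mape_w = [100.0 * np.mean(np.abs((a - w * x) / a)) for w in ws]
smape_w = [100.0 * np.mean(np.abs(w * x - a) / (np.abs(a) + np.abs(w * x))) for w in ws]

plt.plot(ws, mape_w, label="MAPE")
plt.plot(ws, smape_w, label="sMAPE")
plt.xlabel("w")
plt.ylabel("error, %")
plt.legend()
plt.show()
```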
In the resulting graph, we can see that MAPE is piecewise linear in $w$ and lacks an upper bound, while sMAPE stays bounded but shows a sharp spike around $w = 0$, where all predictions collapse to zero and the score jumps to its maximum.
Conclusion
- Both MAPE and sMAPE are scale-independent metrics, which means that they can be used to compare different datasets and different models.
- MAPE might reach arbitrarily large values, while sMAPE has an upper bound (either 200% or 100%, depending on the implementation).
- Both metrics are known to assign unequal weights to overshooting and undershooting.
- Neither MAPE nor sMAPE has a continuous derivative, and both metrics have issues with values close to 0.
- Neither MAPE nor sMAPE would make much sense when the zero point of the scale is arbitrary, for example, with temperatures on the Fahrenheit or Celsius scales.