In today's topic, we will dive into two popular metrics for evaluating regression model performance — mean absolute percentage error (MAPE) and symmetric mean absolute percentage error (sMAPE) — weigh their advantages against their drawbacks, and work through calculations of both metrics on a synthetic dataset.
MAPE
Let's say that $A = \{a_1, a_2, \ldots, a_n\}$ is the set of ground truth values, and $F = \{f_1, f_2, \ldots, f_n\}$ is the set of predictions. Then, the mean absolute percentage error will look like this:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{a_i - f_i}{a_i} \right|$$
Lower MAPE scores indicate better performance.
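As a quick reference, here is a minimal NumPy sketch of the formula above (the function name and signature are our own, not taken from any particular library):

```python
import numpy as np

def mape(a, f):
    """Mean absolute percentage error, in percent; assumes no target is zero."""
    a, f = np.asarray(a, dtype=float), np.asarray(f, dtype=float)
    return 100.0 * np.mean(np.abs((a - f) / a))
```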
MAPE has a significant advantage — it is a scale-independent metric (as the name implies, the error is expressed in percentages), which means that the scale of the data doesn't affect the final score, making it suitable for comparisons between different datasets or models. Contrary to what the name might suggest, however, MAPE can exceed 100%, which we'll discuss below. Finally, MAPE is interpretable and easy to explain.
However, MAPE is known to suffer from various drawbacks:
- The calculation will run into division by zero when a target equals zero and will produce very large scores when the targets are smaller than 1. So, MAPE's possible range isn't from 0% to 100%, but from 0% to infinity, although the scores can be capped at a threshold. For example, we can report $\min(\mathrm{MAPE}, 100\%)$, with a score of $100\%$ indicating a very inaccurate prediction, and we simply don't care for any score that might exceed $100\%$. The lack of an upper bound might hinder MAPE's usage when the values we are dealing with are close to 0, since we won't be able to tell the quality of the prediction (see the snippet after this list).
- MAPE treats undershooting and overshooting the ground truths differently: when MAPE is minimized, it might lead to a preference for a model that undershoots the ground truths (more on that below).
- MAPE does not have a continuous derivative, which might complicate the gradient computations at the training stage.
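To illustrate the first point, here is a small sketch (with made-up numbers) showing how targets below 1 inflate the score even when the absolute errors are modest:

```python
import numpy as np

a = np.array([0.001, 0.01, 1000.0])  # ground truths, two of them well below 1
f = np.array([0.011, 0.02, 1010.0])  # predictions; absolute errors: 0.01, 0.01, 10
print(100.0 * np.abs((a - f) / a))   # per-point errors: [1000. 100. 1.] percent
```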
To demonstrate the second point, suppose the prediction is fixed at $f = 150$, and the real target in the first case is $a = 100$ (with an absolute error of 50). Then,

$$\mathrm{MAPE} = \left| \frac{100 - 150}{100} \right| \cdot 100\% = 50\%.$$

Alternatively, let's now say that the real target is $a = 200$ and the prediction is still fixed at $f = 150$ (the absolute error here is also 50):

$$\mathrm{MAPE} = \left| \frac{200 - 150}{200} \right| \cdot 100\% = 25\%.$$
The trick here lies in the MAPE minimization, which will make the predictions biased low: if the real targets have an equal chance of being either 200 or 100, the expected MAPE will be minimized by the lower prediction value. We will skip the full derivation for brevity.
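As a sanity check of that claim, the sketch below sweeps a grid of candidate predictions for targets that are equally likely to be 100 or 200 and locates the minimizer of the expected MAPE (the variable names are ours):

```python
import numpy as np

# Targets are 100 or 200 with equal probability; sweep a grid of predictions.
preds = np.linspace(50, 250, 2001)
expected = 0.5 * np.abs(100 - preds) / 100 + 0.5 * np.abs(200 - preds) / 200
print(preds[np.argmin(expected)])  # 100.0 -- the lower target, not the midpoint
```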
A note on the MAPE derivative
If we compute the first derivative with respect to a single prediction $f_i$, keeping the definitions above, the result will be the following:

$$\frac{\partial\, \mathrm{MAPE}}{\partial f_i} = \frac{100\%}{n} \cdot \frac{\operatorname{sgn}(f_i - a_i)}{|a_i|}, \quad f_i \neq a_i$$
This can be tied back to MAPE preferring predictions that are smaller than the actual values: if the prediction underestimates the actual value ($f_i < a_i$), then increasing $f_i$ by one unit will reduce the MAPE by $\frac{100\%}{n\,|a_i|}$, and, on the other hand, if $f_i$ is reduced by one unit, the MAPE will be increased by $\frac{100\%}{n\,|a_i|}$. Because this per-unit rate is inversely proportional to $|a_i|$, errors against small targets are penalized more heavily than equally sized errors against large targets, which is what pushes the minimizing predictions toward the low end.
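A finite-difference check of this derivative, under the definitions above, might look as follows:

```python
import numpy as np

def mape(a, f):
    return 100.0 * np.mean(np.abs((a - f) / a))

a = np.array([100.0, 200.0, 50.0])
f = np.array([150.0, 150.0, 60.0])
n, i, eps = len(a), 0, 1e-6

# Central finite difference with respect to f[i].
f_hi, f_lo = f.copy(), f.copy()
f_hi[i] += eps
f_lo[i] -= eps
numeric = (mape(a, f_hi) - mape(a, f_lo)) / (2 * eps)

# Analytic form: (100% / n) * sgn(f_i - a_i) / |a_i|.
analytic = (100.0 / n) * np.sign(f[i] - a[i]) / abs(a[i])
print(numeric, analytic)  # both ~0.3333, since f[0] overshoots a[0]
```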
sMAPE
Now, let's take a look at a metric that addresses some of the issues with MAPE — the symmetric mean absolute percentage error. Keeping the $A$ and $F$ definitions from above, sMAPE was originally introduced as follows:

$$\mathrm{sMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{2\,|f_i - a_i|}{a_i + f_i}$$
However, a widespread sMAPE modification is used in practice, with absolute values taken in the denominator so as to avoid cancellation and negative scores. Also, the multiplication by 2 is dropped (so that the sMAPE lies in the 0%-100% interval instead of 0%-200%):

$$\mathrm{sMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|f_i - a_i|}{|a_i| + |f_i|}$$
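A minimal NumPy sketch of this modified variant (again, the function name is our own) could look like this:

```python
import numpy as np

def smape(a, f):
    """Modified sMAPE in the 0%-100% range; undefined if a_i = f_i = 0."""
    a, f = np.asarray(a, dtype=float), np.asarray(f, dtype=float)
    return 100.0 * np.mean(np.abs(f - a) / (np.abs(a) + np.abs(f)))
```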
sMAPE fixes MAPE's infinite upper bound, but it still treats higher and lower predictions differently.
Suppose the real target is $a = 100$ and the prediction is $f = 110$. Then,

$$\mathrm{sMAPE} = \frac{|110 - 100|}{100 + 110} \cdot 100\% \approx 4.8\%.$$

Alternatively, let's now say that the real target is $a = 100$ and the prediction is $f = 90$:

$$\mathrm{sMAPE} = \frac{|90 - 100|}{100 + 90} \cdot 100\% \approx 5.3\%.$$
In this case, higher predictions actually result in a lower error score.
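Using the smape sketch from above, the asymmetry is easy to reproduce:

```python
print(smape([100], [110]))  # ~4.76 -- overshooting the target by 10
print(smape([100], [90]))   # ~5.26 -- undershooting it by the same 10
```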
One of the sMAPE limitations is that if either the prediction or the target is equal to zero, the score will automatically hit its maximum. In turn, this means that the model's quality can't be assessed when dealing with predictions and targets close to or equal to zero — we simply won't see the difference between good and bad performance when everything is maxed out, regardless of the actual prediction quality. Beyond that, sMAPE suffers from drawbacks similar to MAPE's, except for the upper bound issue.
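The same sketch shows the zero-target behavior: any nonzero prediction for a zero target scores the maximum, no matter how close it is:

```python
print(smape([0.0], [0.5]))   # 100.0
print(smape([0.0], [1e-6]))  # 100.0 -- nearly perfect, yet still maxed out
```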
Calculation example
Let's calculate the MAPE and the sMAPE for the following synthetic dataset, with the model being described as $f(x) = \frac{x}{2}$:
| $x_i$ | $a_i$ (target) | $f_i = f(x_i)$ (prediction) |
|---|---|---|
| 0 | 0.06 | 0.0 |
| 1 | 0.21 | 0.5 |
| 2 | 0.80 | 1.0 |
| 3 | 0.16 | 1.5 |
| 4 | 0.24 | 2.0 |
| 5 | 0.18 | 2.5 |
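Reading the columns above as $x$, the target $a$, and the prediction $f(x) = x/2$, a short script to check the arithmetic might be:

```python
import numpy as np

x = np.arange(6)
a = np.array([0.06, 0.21, 0.80, 0.16, 0.24, 0.18])  # ground truth
f = x / 2                                            # model predictions

mape_score = 100.0 * np.mean(np.abs((a - f) / a))
smape_score = 100.0 * np.mean(np.abs(f - a) / (np.abs(a) + np.abs(f)))
print(round(mape_score, 1), round(smape_score, 1))  # roughly 520.5 and 66.3
```

Note how MAPE blows far past 100% because every target is below 1, while sMAPE stays within its bound.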
The more general setting of the $f(x) = \frac{x}{2}$ model could be written as $f_w(x) = w \cdot x$, where $w$ represents the model parameter, and a single prediction is $f_i = w \cdot x_i$ (the table above corresponds to $w = 0.5$).
Let's plot MAPE and sMAPE with respect to $w$:
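The plot can be reproduced with a sketch along these lines (the grid range for $w$ is chosen arbitrarily):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(6)
a = np.array([0.06, 0.21, 0.80, 0.16, 0.24, 0.18])

ws = np.linspace(-0.5, 1.5, 2001)
mape_w = [100.0 * np.mean(np.abs((a - w * x) / a)) for w in ws]
smape_w = [100.0 * np.mean(np.abs(w * x - a) / (np.abs(a) + np.abs(w * x))) for w in ws]

plt.plot(ws, mape_w, label="MAPE")
plt.plot(ws, smape_w, label="sMAPE")
plt.xlabel("w")
plt.ylabel("error, %")
plt.legend()
plt.show()
```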
In the resulting graph, we can see that MAPE is piecewise linear in $w$ and lacks an upper bound, while sMAPE stays bounded but shows a sharp spike around $w = 0$, where all predictions collapse to zero and the score jumps to its maximum.
Conclusion
- Both MAPE and sMAPE are scale-independent metrics, which means that they can be used to compare different datasets and different models.
- MAPE might reach arbitrarily large values, while sMAPE has an upper bound (either 200% or 100%, depending on the implementation).
- Both metrics are known to assign unequal weights to overshooting and undershooting.
- Neither MAPE nor sMAPE has a continuous derivative, and both metrics have issues with values close to 0.
- Neither MAPE nor sMAPE would make much sense when the zero point of the scale is arbitrary, for example, with temperatures on the Fahrenheit or Celsius scales.