Imagine that you've trained a couple of classification models to solve a particular task. How do you determine which model is the best one? It turns out that the question of the model's quality is not as trivial as it might seem at first glance. In this topic, we will start looking into different aspects of evaluating the performance of classification models.
For example, suppose that your task is to train a model that determines whether a picture contains a dog or a fox. You have built two models, and here are their predictions on a test set:
Can you immediately decide which one is better in distinguishing dogs from foxes?
Accuracy
Arguably the easiest way to evaluate the performance of a classification model is to compute the ratio of correct answers to the total number of predictions it gives on a given dataset. Such a metric is called the model's accuracy:

accuracy = (number of correct predictions) / (total number of predictions)
So, accuracy, or accuracy score, is a number between 0 and 1. The higher it is, the better a model performs on a given dataset.
Let's compute the accuracies of our dog vs. fox models from above. It's easy to see that both of them make three mistakes on the six test cases, meaning that the accuracy of both models is 3/6 = 0.5, which, in fact, means that they are no better than a random guess.
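Here is a minimal sketch of this computation in Python. The labels below are hypothetical (the actual test set is shown only in the picture above); they are simply arranged so that each model gets exactly three of the six cases right:

```python
# Hypothetical true labels and predictions for the six test cases,
# arranged so that each model classifies exactly three of them correctly.
y_true  = ["dog", "dog", "dog", "fox", "fox", "fox"]
model_1 = ["dog", "fox", "dog", "fox", "dog", "dog"]  # 3 correct
model_2 = ["dog", "dog", "fox", "dog", "dog", "fox"]  # 3 correct


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


print(accuracy(y_true, model_1))  # 0.5
print(accuracy(y_true, model_2))  # 0.5
```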
Accuracy isn't a perfect metric
Let's consider a different example. Imagine that you are building a spam filtering model. You've tested your model on 1,000 emails and got a 0.99 accuracy score. It looks like your model classified 99% of all the emails correctly. Brilliant, isn't it? Not so fast.
Remember that there are generally not so many spam emails out there. Suppose that out of 1,000 emails in your dataset, only 10 are marked as spam. Then a very simple model that marks any email as non-spam will have an accuracy score of 0.99 on this dataset!
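To see this concretely, here is a tiny sketch (with made-up labels) of such a "classify everything as non-spam" model:

```python
# A dataset of 1,000 hypothetical emails, only 10 of which are spam.
y_true = ["spam"] * 10 + ["not spam"] * 990

# A trivial "model" that labels every email as non-spam.
y_pred = ["not spam"] * 1000

# The share of correct answers is still very high.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99
```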
Such problems where one class is way less represented than the other(s) are called unbalanced problems, and the spam filtering example shows that accuracy isn't a very useful metric there. But even when the problems are well-balanced, looking at accuracy alone might not be enough to evaluate the model's performance.
Imagine that the accuracies of two models are the same. Does that mean that the models are equally good (or bad)? Not necessarily. In our dog vs. fox example, the two models indeed made the same number of mistakes, but you might have noticed that the mistakes they've made are different. The accuracy score simply doesn't reflect that, but this aspect might be crucial in some applications.
For instance, suppose that your model should predict if a newly arrived hospital patient should be placed in the intensive care unit (ICU). The accuracy of such a model reflects how often it makes the right decision, that is, (1) patients in critical condition are sent to the ICU and (2) patients whose condition is not life-threatening are not sent to the ICU. But the two types of mistakes are not the same: sending someone to the ICU when they don't really need it is not a big deal, provided that there are enough ICU beds, but mistakenly not sending a patient there is much worse because it potentially results in the patient's death. So, in this case, if you had two models with the same accuracy, you would probably want to choose the one that makes fewer mistakes of the second kind.
An easy way to examine the types of mistakes a classification model makes is by constructing a so-called confusion matrix.
Confusion matrix
Unlike a single accuracy score, a model's confusion matrix highlights where exactly our model makes the mistakes, giving us a better insight into the model's behavior.
Let's start with a binary classification case, where we have two classes that can be called positive and negative. When we test our model on some dataset, each example falls into one of the following four categories:
True Positives (TP) — the real positive examples to which the model correctly assigned the positive class;
True Negatives (TN) — the real negative examples to which the model correctly assigned the negative class;
False Positives (FP) — the real negative examples to which the model incorrectly assigned the positive class;
False Negatives (FN) — the real positive examples to which the model incorrectly assigned the negative class.
We can count the number of examples that fall into each of these categories and compactly represent the counts in a matrix. Let's once again go back to our dog vs. fox prediction task and construct the confusion matrix for the second model. We'll call dogs the positive class and foxes the negative class. Looking at the model's predictions, we can notice that
two dogs were correctly identified as dogs by our model, those are true positives;
only one fox was correctly identified as a fox, so there is only one true negative example;
one dog was misclassified as a fox, which means one false negative example;
finally, two foxes were misclassified as dogs, making the corresponding examples false positives.
Putting those numbers in a matrix, you get the one shown in the picture above.
Notice that it's easy to compute a model's accuracy from its confusion matrix. Indeed, accuracy is the ratio of all the examples classified correctly to the total number of examples. In other words,

accuracy = (TP + TN) / (TP + TN + FP + FN)
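As an illustration, here is a sketch using scikit-learn's confusion_matrix on the same hypothetical dog vs. fox labels as before, with "dog" as the positive class. In scikit-learn's convention, rows correspond to true classes and columns to predicted classes:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels from the earlier sketch; "dog" is the positive class.
y_true  = ["dog", "dog", "dog", "fox", "fox", "fox"]
model_2 = ["dog", "dog", "fox", "dog", "dog", "fox"]

# With labels=["dog", "fox"], the matrix is laid out as [[TP, FN], [FP, TN]].
cm = confusion_matrix(y_true, model_2, labels=["dog", "fox"])
print(cm)  # [[2 1]
           #  [2 1]]

tp, fn, fp, tn = cm.ravel()
print((tp + tn) / (tp + tn + fp + fn))  # 0.5
print(accuracy_score(y_true, model_2))  # 0.5, the same value
```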
Note that accuracy isn't the only metric that we can calculate using the confusion matrix. You will learn about the others in the following topics.
Confusion matrix for multiclass classification
A confusion matrix is a powerful tool to analyze the performance of any classification model, not just a binary classification one.
Let's imagine that we have a model that recognizes the handwritten letters A, B, and C. This is therefore a classification problem with three classes. By examining the model's predictions on some dataset, we can construct its confusion matrix:
It seems that of the 44 examples, only 15 were classified correctly. Not good at all... How do we improve? Unlike a single accuracy score, the confusion matrix tells us where the problem is!
It is apparent that our model often confuses the letter A with C (9 examples) and the letter B with A (8 examples). We can then look back at the corresponding examples and try to understand what made them difficult for our model. Hopefully, this will help us come up with some improvements.
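A multiclass confusion matrix can be built in exactly the same way. Here is a small sketch with made-up labels for the three letters (the actual 44-example matrix from the picture is not reproduced here):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for the letters A, B, and C.
y_true = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
y_pred = ["A", "C", "A", "A", "B", "C", "C", "A", "B"]

# Rows correspond to the true classes A, B, C; columns to the predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["A", "B", "C"]))
# [[2 0 1]
#  [1 1 0]
#  [1 1 2]]
```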
Conclusion
Accuracy is the share of examples correctly classified by a model.
When working with unbalanced problems, a high accuracy score doesn't necessarily mean a good model: one should always keep in mind how frequent the majority class is.
The confusion matrix allows one to examine what kinds of mistakes a model makes.