You already know how to compute the accuracy of a classification model, which tells you how often the model predicts the correct class. To improve a model, though, we need to know exactly what types of mistakes it makes. To that end, a confusion matrix can be helpful.
Both accuracy and confusion matrices can be used to evaluate any classification model, binary or multiclass. However, since binary classification problems are more widespread, there are several performance metrics designed specifically for classification problems with two classes.
In this topic, you will get familiar with three such metrics, namely precision, recall, and F-score.
Precision
Let us recall that in a binary classification task there are two classes: a positive one and a negative one.
Precision is the share of true positive predictions among all the examples that were predicted to be positive by our model. That is, if our model were predicting whether a patient has a certain medical condition, its precision would tell us the share of patients who actually have the condition among all the patients that the model predicts to have it.
Precision is easy to compute from the model's confusion matrix. Indeed, the positive examples that the model predicted as positive are the True Positives (TP). Apart from them, there are also the False Positives (FP), the negative examples that the model labeled as positive incorrectly. So,

Precision = TP / (TP + FP)
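As a quick illustration, here is how this calculation might look in Python; the TP and FP counts below are made-up values, not taken from any particular model:

# Hypothetical counts read off a confusion matrix (illustrative values only)
tp = 40  # positive examples correctly predicted as positive
fp = 10  # negative examples incorrectly predicted as positive

precision = tp / (tp + fp)
print(precision)  # 0.8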
Recall
Recall, in turn, measures the share of all the positive examples in the dataset that the model actually labeled as positives.
Once again, let's come up with a way to compute this from the confusion matrix. How do we get the number of all the positive examples in the dataset? Well, it's those that were correctly predicted as positive (True Positives, TP) plus those that our model misclassified as negative (False Negatives, FN). So,

Recall = TP / (TP + FN)
Recall shows how good the model is at identifying the positive class. In a medical context, you can reformulate the definition of recall as follows: out of all the people who have a certain disease, what is the share of those who were actually diagnosed with it?
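The calculation is analogous to the one for precision; as before, the counts below are made-up values used only for illustration:

# Hypothetical counts read off a confusion matrix (illustrative values only)
tp = 40  # positive examples correctly predicted as positive
fn = 20  # positive examples the model misclassified as negative

recall = tp / (tp + fn)
print(recall)  # ≈ 0.67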
Precision and recall: an example
Let us recall the dog vs. fox classification task, where the goal is to determine which of the two animals is present in a picture. Let dogs be the positive class and foxes the negative one. The confusion matrix looks as follows:
First, let us compute the model's precision. It will be the share of actual dogs among all the animals that our model labeled as such:
Recall, in turn, will show how good we are at identifying dogs. What is the share of all dogs that our model actually managed to identify correctly?
F-score
Precision and recall are useful and insightful metrics, but it's not obvious how to optimize them both at the same time. F-score, or F-measure, combines them into a single metric, the harmonic mean of the two:

F-score = 2 × Precision × Recall / (Precision + Recall)
F-score is a number between 0 and 1, with higher values corresponding to models with higher precision and recall scores. Note that the F-score reaches its maximum of 1 when precision and recall are both equal to one, and drops to zero if at least one of the two is zero.
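As a sketch, this formula can be wrapped into a small helper (the function name f_score is our own choice), which also makes the two boundary cases mentioned above easy to check:

def f_score(precision, recall):
    # Harmonic mean of precision and recall;
    # defined as 0 when both inputs are 0 to avoid division by zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_score(1.0, 1.0))  # 1.0: the maximum, both metrics are perfect
print(f_score(0.8, 0.0))  # 0.0: zero as soon as either metric is zero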
Let's compute the F-score in our dogs vs. foxes example using the definition above. It turns out to be approximately 0.57.
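In practice, you would rarely compute these metrics by hand: scikit-learn provides them out of the box. Below is a minimal sketch; the label arrays are made-up values for illustration (1 stands for a dog, 0 for a fox) rather than the data from the example above, though they happen to yield an F-score of about 0.57 as well:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true and predicted labels: 1 = dog (positive), 0 = fox (negative)
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0]

print(precision_score(y_true, y_pred))  # 0.5
print(recall_score(y_true, y_pred))     # ≈ 0.67
print(f1_score(y_true, y_pred))         # ≈ 0.57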
Conclusion
Precision and recall are performance metrics for binary classification models.
Precision is the share of the true positive predictions among all the examples that were predicted to be positive by our model.
Recall measures the share of all the positive examples in the dataset that the model labeled as positives.
F-score, or F-measure, combines precision and recall via the harmonic mean into a single score that can be used to optimize both metrics at the same time.