In previous topics, you've learned several metrics for evaluating the performance of a classifier. This topic introduces a visual one. Strictly speaking, the previous metrics apply to the situation where the model predicts a class directly: zero or one (or an integer label, if there are multiple classes). However, many models don't output the class itself but the probabilities of the classes, which are real numbers between zero and one. In this case, a data scientist sets a threshold: all results above it are treated as ones, and all results below it as zeros. In practice, however, choosing a threshold is often not easy. There is a way to evaluate the model without fixing a specific threshold: by drawing the Receiver Operating Characteristic curve (ROC curve) and estimating the Area Under the Curve (AUC).
The ROC curve
The ROC (Receiver Operating Characteristic) curve shows the classifier's performance at various decision threshold settings. There are plenty of possibilities to evaluate classifier performance: accuracy, precision, recall, F-score, and others. The ROC curve is a 2D graph, so we need to pick just two metrics — one for the x-axis and the other for the y-axis. For the ROC curve, we take the so-called True Positive Rate and the False Positive Rate.
The True Positive Rate (TPR), also known as Sensitivity, shows what percentage of class 1 objects we classified correctly. The False Positive Rate (FPR), also known as 1 - Specificity, shows the proportion of class 0 objects that were mistakenly classified as class 1. If you look closely, you will notice that Sensitivity (TPR) and recall are the same thing; they are simply referred to by different terms depending on the task. The formulas are defined as follows: TPR = TP / (TP + FN) and FPR = FP / (FP + TN), where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively.
We place the TPR on the y-axis and the FPR on the x-axis.
The ROC curve shows the performance for all decision threshold values. What does this mean? Almost all classification algorithms output the probability of a data point being of class 1. So, let's say our decision threshold is at 0.5. This means that all points with probabilities below 0.5 will be classified as the negative class, and class 1 will be assigned to all other data points. For the ROC curve, we should take different values of the threshold and see what happens with the True and False Positive Rates.
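To make the thresholding step concrete, here is a tiny sketch; the probability values and the 0.5 threshold are made up for illustration:

# A toy illustration of applying a single decision threshold.
# The probabilities below are invented for this example.
probs = [0.2, 0.45, 0.55, 0.7, 0.8, 0.9]   # predicted probabilities of class 1
threshold = 0.5

# Probabilities below the threshold become class 0, everything else class 1
y_pred = [0 if p < threshold else 1 for p in probs]
print(y_pred)  # [0, 0, 1, 1, 1, 1]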
Our ideal curve describes the case in which the TPR is maximum and the FPR is minimum, which means that the curve should tend towards the point (0, 1). Suppose we have a classifier that randomly assigns labels to data points. Then the ROC curve would look something like this, where the TPR is approximately equal to the FPR. In the graph below, you can see how to interpret the ROC curve:
There is something interesting about the purple curve, which falls below the random classifier. A curve falling below the random diagonal generally indicates that a) the model performs very poorly, or b) the labels are flipped (i.e., 0 is predicted as 1, and 1 is predicted as 0). Simply flipping the labels in the second case is not recommended without further investigation: flipped labels might not actually be the cause, and doing so could lead to unexpected results down the line. Instead, the model and the dataset should be closely inspected to determine why the curve falls below the random classifier.
How to build the ROC curve
Before we start building the ROC curve, we need to define a list, K, with possible decision thresholds, for which we'll calculate the TPR & FPR. Then, we need to get the list, P, with probabilities for all data points in our dataset that our classification model outputted. Finally, we create an empty list, C, that will later be filled with positional coordinates (x, y).
Then, for all values from list K, repeat:
- Take the next value, k, from list K (containing all defined thresholds).
- Assign class 0 to all values from P that are smaller than k; all the other values are assigned class 1.
- Calculate TPR & FPR.
- The FPR value is the x-coordinate and the TPR value is the y-coordinate. Add the pair (FPR, TPR) to list C. (A minimal Python sketch of this procedure follows right after this list.)
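One possible way to express this procedure in Python is sketched below. The function and variable names (roc_points, y_true, probs) are ours, and we assume that both classes are present in y_true so that the TPR and FPR denominators are never zero:

def roc_points(y_true, probs, thresholds):
    """Return a list of (FPR, TPR) points, one per decision threshold."""
    points = []                      # this is the list C from the description
    for k in thresholds:             # iterate over the list K of thresholds
        # class 0 for probabilities below k, class 1 otherwise
        y_pred = [0 if p < k else 1 for p in probs]
        tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
        fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
        tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
        tpr = tp / (tp + fn)         # y-coordinate
        fpr = fp / (fp + tn)         # x-coordinate
        points.append((fpr, tpr))
    return points

In the example below, we will perform the same calculation by hand for a small dataset.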
Note that typically, the list K (all possible decision thresholds) is set to the list P (the predicted probabilities, sorted), but for illustration in the next section, we will define the list K manually. Defining K manually lets us test custom threshold values. For example, if a specific substance is considered dangerous above a certain concentration, various decision thresholds will produce different TP and FP rates. In turn, you might ask: how bad are the false positives in our case? If the substance is (falsely) marked as dangerous and gets discarded by mistake, we have to decide how undesirable it is to discard potentially non-dangerous substances (e.g., if we are dealing with medication, the loss might be critical, but if we are talking about the water supply, it's less crucial). There are many ways to select the thresholds, and we have just described a single scenario.
Example
Suppose we have a dataset that identifies whether a snake species is venomous or not. We've built a classification model and obtained the following list of predicted probabilities, P:
| Name | Venomous | Predicted probability |
|------|----------|-----------------------|
| Water Snake | 0 | 0.21 |
| Milk Snake | 0 | 0.32 |
| Garter Snake | 0 | 0.43 |
| Python | 0 | 0.44 |
| Tree Boa | 0 | 0.65 |
| King Cobra | 1 | 0.46 |
| Black Mamba | 1 | 0.67 |
| Coral Snake | 1 | 0.88 |
| Boomslang | 1 | 0.99 |
| Sea Krait | 1 | 0.91 |
Now let's create our K — the list with thresholds. Let K be the list of floats in the range from 0 to 1 with a step of 0.1, that is, K = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]. Now we create an empty list C.
The first value in K is 0, so let k be 0. All probabilities are greater than 0, so we assign class 1 to all data points. In this case, TPR and FPR are equal to: TPR = 5 / (5 + 0) = 1 and FPR = 5 / (5 + 0) = 1.
This is the first point of our graph, (1, 1); let's add it to C. Similarly, we calculate the other points for the remaining thresholds:

| k | FPR | TPR |
|---|-----|-----|
| 0 | 1 | 1 |
| 0.1 | 1 | 1 |
| 0.2 | 1 | 1 |
| 0.3 | 0.8 | 1 |
| 0.4 | 0.6 | 1 |
| 0.5 | 0.2 | 0.8 |
| 0.6 | 0.2 | 0.8 |
| 0.7 | 0 | 0.6 |
| 0.8 | 0 | 0.6 |
| 0.9 | 0 | 0.4 |
| 1.0 | 0 | 0 |
For clarity, let us calculate the TPR and FPR for k = 0.3 and k = 0.5 in more detail.
For k = 0.3, the first sample (Water Snake) will be assigned as not venomous (class 0), since the probability for Water Snake is 0.21 (less than the 0.3 threshold we are checking against), so there will be 1 true negative (TN). All other samples for k = 0.3 will be assigned as venomous (class 1). Thus, there will be 4 false positives (FP): samples 2 to 5. Also, there are 5 true positives (TP). There are no false negatives (FN) for k = 0.3. So, the true positive rate in this case will be TPR = 5 / (5 + 0) = 1. The false positive rate will be FPR = 4 / (4 + 1) = 0.8.
For k = 0.5, everything below the threshold will be assigned as not venomous (class 0). The King Cobra is venomous (class 1) but has a predicted probability of 0.46, so it will be assigned to class 0; there is 1 false negative (FN). Similarly, the Tree Boa (probability 0.65, not venomous) will be assigned class 1, so there is 1 false positive (FP). There will be 4 true negatives (TN) and 4 true positives (TP). Thus, TPR = 4 / (4 + 1) = 0.8 and FPR = 1 / (1 + 4) = 0.2.
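If you want to double-check these counts programmatically, here is a small sketch (our addition, not part of the original walkthrough) that uses scikit-learn's confusion_matrix for the two thresholds discussed above:

from sklearn.metrics import confusion_matrix

y_true = 5 * [0] + 5 * [1]   # the snake dataset: 5 non-venomous, 5 venomous
probs = [0.21, 0.32, 0.43, 0.44, 0.65, 0.46, 0.67, 0.88, 0.99, 0.91]

for k in (0.3, 0.5):
    y_pred = [0 if p < k else 1 for p in probs]
    # confusion_matrix returns [[TN, FP], [FN, TP]] for the labels (0, 1)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"k={k}: TP={tp}, FP={fp}, FN={fn}, TN={tn}, "
          f"TPR={tp / (tp + fn):.1f}, FPR={fp / (fp + tn):.1f}")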
The process of calculating the TP, FP, FN, and TN for the different values of the threshold can be illustrated as follows:
Now, if we delete all duplicates, our list C looks like this: C = [(1, 1), (0.8, 1), (0.6, 1), (0.2, 0.8), (0, 0.6), (0, 0.4), (0, 0)].
Let's plot it!
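If you would like to reproduce the plot yourself, here is a minimal matplotlib sketch that draws the manually computed (FPR, TPR) points; the styling choices are ours:

import matplotlib.pyplot as plt

# The deduplicated (FPR, TPR) points computed above,
# reordered from (0, 0) to (1, 1) for plotting
points = [(0, 0), (0, 0.4), (0, 0.6), (0.2, 0.8), (0.6, 1), (0.8, 1), (1, 1)]
fpr, tpr = zip(*points)

plt.plot(fpr, tpr, color='m', marker='o', label='manual ROC curve')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='random classifier')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC - snakes data (manual thresholds)')
plt.legend(loc=4)
plt.grid()
plt.show()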
In case you like to experiment, below you will find the code for plotting the ROC curve using scikit-learn and matplotlib (note that the scikit-learn plot will be different from the plot we have built since we selected the thresholds manually, which differs from the scikit-learn implementation):
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# True classes: 5 non-venomous (0) and 5 venomous (1) snakes
y_true = 5 * [0] + 5 * [1]
# Predicted probabilities of class 1, in the same order as the table above
y_predicted = [0.21, 0.32, 0.43, 0.44, 0.65, 0.46, 0.67, 0.88, 0.99, 0.91]

# Styling arguments passed through to matplotlib
params = {'color': 'm', 'marker': 'o'}

roc = RocCurveDisplay.from_predictions(
    y_true=y_true,
    y_pred=y_predicted,
    drop_intermediate=False,   # keep every threshold-induced point
    plot_chance_level=True,    # draw the random-classifier diagonal
    **params
)

plt.axis('square')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC - snakes data (scikit-learn ROC curve)')
plt.legend(loc=4)
plt.grid()
plt.show()
AUC
AUC stands for Area Under the Curve, which means the AUC score measures the entire area underneath the ROC curve. If you are familiar with basic calculus, we can define the AUC score as the integral of the TPR, taken as a function of the FPR, over the interval from 0 to 1. Let's calculate the AUC score for the ROC curve we have built with scikit-learn's roc_curve. In this simple case, we can think of the area under the curve as the area of two rectangles:
So, to calculate the AUC, we need to sum up the areas of the two rectangles: AUC = 0.2 × 0.8 + 0.8 × 1 = 0.16 + 0.8 = 0.96.
The ideal score is 1, and our result after classifying 10 samples is quite close to 1.
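As a sanity check (this snippet is our addition, not part of the original walkthrough), you can recompute the same area from the points returned by scikit-learn's roc_curve by summing the areas of the trapezoids between consecutive points:

from sklearn.metrics import roc_curve

y_true = 5 * [0] + 5 * [1]
y_predicted = [0.21, 0.32, 0.43, 0.44, 0.65, 0.46, 0.67, 0.88, 0.99, 0.91]

# fpr and tpr hold the x- and y-coordinates of the ROC curve points
fpr, tpr, thresholds = roc_curve(y_true, y_predicted, drop_intermediate=False)

# Sum the trapezoid areas between consecutive points; for a step-shaped
# curve like this one, the sum reduces to the two rectangles described above
area = sum((fpr[i + 1] - fpr[i]) * (tpr[i] + tpr[i + 1]) / 2
           for i in range(len(fpr) - 1))
print(area)  # approximately 0.96 for the snake data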
If the ROC curve becomes smoother, the trick with rectangles won't work. In such cases, the process of calculating the area under the ROC curve becomes much more complicated. But the good news is — as a data scientist, you can obtain the AUC score of each ROC curve in just one line of code. How exactly? You'll learn in the topics to come!
Conclusion
In this topic, we've covered:
- The definition of the ROC curve;
- How to build a ROC curve from scratch;
- The definition and meaning of the AUC score.