In this topic, we will focus on binary classification to gain a deeper understanding of the conceptual foundations of AdaBoost.
A reminder on the AdaBoost algorithm
Below, we provide a refresher on the general AdaBoost procedure:
1. Initialize the number of AdaBoost rounds, $M$.
Initialize the weights, $w_i = \frac{1}{N}$, $i = 1, \dots, N$.
2. For $m = 1$ to $M$:
(a) Fit a classifier $G_m(x)$ to the training data using the weights $w_i$;
(b) Compute the error, $\text{err}_m = \frac{\sum_{i=1}^{N} w_i \, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$;
(c) Compute the amount of say, $\alpha_m = \frac{1}{2} \ln \frac{1 - \text{err}_m}{\text{err}_m}$;
(d) Update the weights: $w_i \leftarrow w_i \cdot e^{\alpha_m}$ for the misclassified samples and $w_i \leftarrow w_i \cdot e^{-\alpha_m}$ for the correctly classified ones [Optional step: use buckets to draw a new dataset based on the sample weights];
(e) Normalize the weights so they sum to 1.
3. Return the final prediction, $G(x) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$.
The function $I(\cdot)$ in step 2b refers to the 0/1 loss: $I(y_i \neq G_m(x_i))$ equals 1 when the prediction is wrong and 0 when it is correct.
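To make this procedure concrete, below is a minimal Python sketch of steps 1 and 2. It assumes scikit-learn's DecisionTreeClassifier (with max_depth=1) as the weak learner and labels encoded as -1/+1; the function and variable names are illustrative, not a reference implementation. Step 3, the final prediction, is shown in code at the end of this topic.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=10):
    """Steps 1-2 of the procedure above, with labels y encoded as -1/+1
    and scikit-learn decision stumps as the weak learners."""
    N = len(y)
    w = np.full(N, 1 / N)                      # step 1: uniform initial weights
    stumps, amounts_of_say = [], []

    for m in range(M):
        # (a) fit a stump to the weighted training data
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)

        # (b) weighted 0/1 error
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # guard against log(0)

        # (c) amount of say
        alpha = 0.5 * np.log((1 - err) / err)

        # (d) raise the weights of misclassified samples, lower the rest
        #     (y * pred is -1 when wrong, +1 when correct)
        w = w * np.exp(-alpha * y * pred)

        # (e) normalize so the weights sum to 1
        w = w / np.sum(w)

        stumps.append(stump)
        amounts_of_say.append(alpha)

    return stumps, amounts_of_say
```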
Setup: initializing the weights
To understand the mechanics, let's run a single iteration of AdaBoost on the following synthetic loan-approval dataset with 10 samples:
+----+-------+----------+------------+
| | Age | Income | Approval |
|----+-------+----------+------------|
| 0 | 20 | 10000 | 0 |
| 1 | 25 | 30000 | 1 |
| 2 | 30 | 60000 | 1 |
| 3 | 35 | 25000 | 1 |
| 4 | 40 | 15000 | 0 |
| 5 | 45 | 55000 | 0 |
| 6 | 50 | 22000 | 0 |
| 7 | 55 | 11000 | 0 |
| 8 | 60 | 27000 | 0 |
| 9 | 65 | 65000 | 1 |
+----+-------+----------+------------+
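If you would like to follow the computations in code, the dataset from the table above can be reproduced as follows (a small setup sketch using pandas and NumPy; later snippets in this topic reuse the data, N, and weights variables defined here, and the weight initialization mirrors the step described next):
```python
import numpy as np
import pandas as pd

# The synthetic loan-approval dataset from the table above.
data = pd.DataFrame({
    "Age":      [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    "Income":   [10000, 30000, 60000, 25000, 15000, 55000,
                 22000, 11000, 27000, 65000],
    "Approval": [0, 1, 1, 1, 0, 0, 0, 0, 0, 1],
})

N = len(data)                 # 10 samples
weights = np.full(N, 1 / N)   # every sample starts with weight 1/N = 0.1
```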
We will use decision tree stumps as the weak learners in this case. The first step is to assign the initial weights to each training sample (since the number of samples $N$ is equal to 10, $w_i = \frac{1}{N} = 0.1$):
+----+-------+----------+------------+------------------+
| | Age | Income | Approval | Initial weight |
|----+-------+----------+------------+------------------|
| 0 | 20 | 10000 | 0 | 0.1 |
| 1 | 25 | 30000 | 1 | 0.1 |
| 2 | 30 | 60000 | 1 | 0.1 |
| 3 | 35 | 25000 | 1 | 0.1 |
| 4 | 40 | 15000 | 0 | 0.1 |
| 5 | 45 | 55000 | 0 | 0.1 |
| 6 | 50 | 22000 | 0 | 0.1 |
| 7 | 55 | 11000 | 0 | 0.1 |
| 8 | 60 | 27000 | 0 | 0.1 |
| 9 | 65 | 65000 | 1 | 0.1 |
+----+-------+----------+------------+------------------+
Choosing the first stump
Let's calculate the Gini impurity for the two possible splits (a stump on Age and a stump on Income). The Gini impurity is calculated as $G = 1 - \sum_{k=1}^{K} p_k^2$, where $K$ is the total number of classes (2 in the binary classification setting), and $p_k$ is the probability of picking a sample of the $k$-th class. The split with the lower Gini impurity becomes the first weak learner: AdaBoost starts with the stump of lowest impurity because it classifies the largest share of the (currently equally weighted) data points correctly.
The Income split has the lower impurity, so it will be used as the first weak learner.
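As a sketch of how this comparison could be carried out in code: the topic does not spell out the candidate thresholds, so the midpoint-scan convention below is an assumption, and data and weights come from the setup snippet above.
```python
import numpy as np

def gini_impurity(labels, weights):
    """Weighted Gini impurity of a node: 1 - sum over classes of p_k^2,
    where p_k is the weighted share of class k in the node."""
    total = weights.sum()
    impurity = 1.0
    for cls in np.unique(labels):
        impurity -= (weights[labels == cls].sum() / total) ** 2
    return impurity

def best_stump_split(values, labels, weights):
    """Scan the midpoints between consecutive sorted feature values and
    return the threshold with the lowest weighted-average Gini impurity."""
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]
    best_threshold, best_gini = None, np.inf
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        threshold = (v[i - 1] + v[i]) / 2
        left, right = slice(0, i), slice(i, None)
        split_gini = (w[left].sum() * gini_impurity(y[left], w[left]) +
                      w[right].sum() * gini_impurity(y[right], w[right])) / w.sum()
        if split_gini < best_gini:
            best_threshold, best_gini = threshold, split_gini
    return best_threshold, best_gini

y = data["Approval"].to_numpy()
for feature in ("Age", "Income"):
    threshold, impurity = best_stump_split(data[feature].to_numpy(), y, weights)
    print(f"{feature}: best threshold = {threshold}, weighted Gini ≈ {impurity:.3f}")
# The Income split comes out with the lower impurity, matching the choice above.
```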
The first weight update
Next, let's calculate the amount of say for the Income stump. The total error here is just the sum of the weights of the incorrectly classified samples. The Income split misclassifies 2 samples (samples 5 and 8 in the table), thus the total error is $0.1 + 0.1 = 0.2$ and the AOS will be $\alpha_1 = \frac{1}{2} \ln \frac{1 - 0.2}{0.2} = \frac{1}{2} \ln 4 \approx 0.693$.
After calculating the AOS, we can update the weights for the samples based on whether they were correctly classified or not. The new sample weight for an incorrectly classified sample is $w_{\text{new}} = w_{\text{old}} \cdot e^{\alpha_1} = 0.1 \cdot e^{0.693} = 0.2$. The new sample weight for a correctly classified sample is $w_{\text{new}} = w_{\text{old}} \cdot e^{-\alpha_1} = 0.1 \cdot e^{-0.693} = 0.05$.
The weights should be normalized so they add up to 1. The updated weights sum to $2 \cdot 0.2 + 8 \cdot 0.05 = 0.8$, so the new sample weight for an incorrectly classified sample will be $0.2 / 0.8 = 0.25$, and the new weight for a correctly classified sample is $0.05 / 0.8 = 0.0625$. Normalizing the weights makes the algorithm more stable and less affected by outliers (although it does not eliminate outlier effects entirely), and it keeps the weights from becoming too biased towards the misclassifications (i.e., it reduces the risk of overfitting).
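The same computation in code (a sketch continuing from the snippets above; the misclassified mask marks samples 5 and 8, which the Income stump gets wrong):
```python
import numpy as np

# Samples misclassified by the Income stump (indices 5 and 8 in the table).
misclassified = np.zeros(N, dtype=bool)
misclassified[[5, 8]] = True

total_error = weights[misclassified].sum()                     # 0.2
amount_of_say = 0.5 * np.log((1 - total_error) / total_error)  # ≈ 0.693

# Scale misclassified samples up, correctly classified samples down.
new_weights = np.where(misclassified,
                       weights * np.exp(amount_of_say),
                       weights * np.exp(-amount_of_say))

new_weights /= new_weights.sum()   # normalize so the weights sum to 1
print(new_weights)                 # 0.25 for samples 5 and 8, 0.0625 for the rest
```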
Thus, the first weight update looks like this:
+----+-------+----------+------------+------------------+-----------------+
| | Age | Income | Approval | Initial weight | Weight update |
|----+-------+----------+------------+------------------+-----------------|
| 0 | 20 | 10000 | 0 | 0.1 | 0.0625 |
| 1 | 25 | 30000 | 1 | 0.1 | 0.0625 |
| 2 | 30 | 60000 | 1 | 0.1 | 0.0625 |
| 3 | 35 | 25000 | 1 | 0.1 | 0.0625 |
| 4 | 40 | 15000 | 0 | 0.1 | 0.0625 |
| 5 | 45 | 55000 | 0 | 0.1 | 0.25 |
| 6 | 50 | 22000 | 0 | 0.1 | 0.0625 |
| 7 | 55 | 11000 | 0 | 0.1 | 0.0625 |
| 8 | 60 | 27000 | 0 | 0.1 | 0.25 |
| 9 | 65 | 65000 | 1 | 0.1 | 0.0625 |
+----+-------+----------+------------+------------------+-----------------+
Selecting the next training set
Now we can select the next training set, in which the misclassified samples will be more prominent (this is an optional step in the algorithm description above; alternatively, the sample weights can be passed directly when fitting the next stump). We generate random floats in the [0, 1] range and select the new set of samples by checking which bucket (weight interval) each float falls into:
+----+-------+----------+------------+-----------------+------------------+
| | Age | Income | Approval | Weight update | Ranges |
|----+-------+----------+------------+-----------------+------------------|
| 0 | 20 | 10000 | 0 | 0.0625 | [0, 0.0625] |
| 1 | 25 | 30000 | 1 | 0.0625 | [0.0625, 0.125] |
| 2 | 30 | 60000 | 1 | 0.0625 | [0.125, 0.1875] |
| 3 | 35 | 25000 | 1 | 0.0625 | [0.1875, 0.25] |
| 4 | 40 | 15000 | 0 | 0.0625 | [0.25, 0.3125] |
| 5 | 45 | 55000 | 0 | 0.25 | [0.3125, 0.5625] |
| 6 | 50 | 22000 | 0 | 0.0625 | [0.5625, 0.625] |
| 7 | 55 | 11000 | 0 | 0.0625 | [0.625, 0.6875] |
| 8 | 60 | 27000 | 0 | 0.25 | [0.6875, 0.9375] |
| 9 | 65 | 65000 | 1 | 0.0625 | [0.9375, 1] |
+----+-------+----------+------------+-----------------+------------------+
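This draw can be implemented with a cumulative sum of the weights and a binary search; the sketch below reuses new_weights from the update snippet and the same random floats as the walkthrough that follows:
```python
import numpy as np

# Upper edges of the buckets from the table above: 0.0625, 0.125, ..., 1.0.
bucket_edges = np.cumsum(new_weights)

# The random floats used in the walkthrough below.
random_floats = [0.82, 0.07, 0.38, 0.91, 0.08, 0.52, 0.1, 0.42, 0.63, 0.16]

# For each float, find the first bucket whose upper edge is above it.
drawn_indices = np.searchsorted(bucket_edges, random_floats)
print(drawn_indices)   # [8 1 5 8 1 5 1 5 7 2], matching the table below

new_training_set = data.iloc[drawn_indices].reset_index(drop=True)
```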
Suppose the random set is [0.82, 0.07, 0.38, 0.91, 0.08, 0.52, 0.1, 0.42, 0.63, 0.16]. First, check where 0.82 lies – it belongs to the interval of sample 8, so sample 8 is drawn into the new set. Next, 0.07 lies in sample 1's interval, so sample 1 is drawn, and so on. The misclassified samples will be more prominent in the new training set since their intervals are larger, so they have a higher probability of appearing in the new set. The new training set for the next iteration becomes the following (the weights of the drawn samples, 0.25 and 0.0625, are re-normalized over the new set's total of $5 \cdot 0.25 + 5 \cdot 0.0625 = 1.5625$, giving $0.25 / 1.5625 = 0.16$ and $0.0625 / 1.5625 = 0.04$ in the last column):
+------------+-------+----------+------------+-------------------------------+
| Sample # | Age | Income | Approval | Updated weight (normalized) |
|------------+-------+----------+------------+-------------------------------|
| 8 | 60 | 27000 | 0 | 0.16 |
| 1 | 25 | 30000 | 1 | 0.04 |
| 5 | 45 | 55000 | 0 | 0.16 |
| 8 | 60 | 27000 | 0 | 0.16 |
| 1 | 25 | 30000 | 1 | 0.04 |
| 5 | 45 | 55000 | 0 | 0.16 |
| 1 | 25 | 30000 | 1 | 0.04 |
| 5 | 45 | 55000 | 0 | 0.16 |
| 7 | 55 | 11000 | 0 | 0.04 |
| 2 | 30 | 60000 | 1 | 0.04 |
+------------+-------+----------+------------+-------------------------------+
The final prediction
The iterations continue until a specific stopping condition is met, for example, when the error rate for a specific weak learner exceeds a threshold, or when all $M$ boosting rounds have been completed. We then have the error rate for each weak learner in the ensemble (denoted $G_m(x)$ below; $G_m(x) \in \{-1, 1\}$ are the predictions in the binary classification setting), and we have calculated the amount of say each learner gets in the final prediction (denoted $\alpha_m$ below). The final output will be
$$G(x) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right),$$
where $M$ is the number of AdaBoost rounds. More accurate classifiers have a larger amount of say and thus a higher weight in the final vote.
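In code, the final prediction is just the sign of the alpha-weighted sum of the stump predictions; a minimal sketch, assuming the fitted stumps and their amounts of say are collected in lists as in the adaboost_fit sketch earlier:
```python
import numpy as np

def adaboost_predict(X, stumps, amounts_of_say):
    """Step 3: weighted majority vote. Each stump predicts labels in {-1, +1};
    the ensemble returns the sign of the alpha-weighted sum of those votes."""
    weighted_sum = sum(alpha * stump.predict(X)
                       for alpha, stump in zip(amounts_of_say, stumps))
    return np.sign(weighted_sum)
```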
Conclusion
You should now be familiar with the inner workings of the AdaBoost algorithm in the binary classification setting.