
AdaBoost: an illustrated walk-through


In this topic, we will focus on binary classification to build a deeper understanding of the conceptual foundations of AdaBoost.

A reminder on the AdaBoost algorithm

Below, we provide a refresher on the general AdaBoost procedure:

1. Initialize the number of AdaBoost rounds, $M$.

Initialize the weights, $w_i = 1/N$, $i = 1, 2, \dots, N$.

2. For $m = 1$ to $M$:

(a) Fit a classifier $G_m(x)$ to the training data using the weights $w_i$;

(b) Compute the error, $\text{err}_m = \sum \limits_{i=1}^N w_i \mathbb{1}(G_m(x_i) \neq y_i)$;

(c) Compute the amount of say, $\alpha_m = \frac{1}{2}\ln \frac{1-\text{err}_m}{\text{err}_m}$;

(d) Update the weights, $w_{m+1} = w_m \times \begin{cases} e^{-\alpha_m}, \ \text{if }\ G_m(x_i) = y_i \\ e^{\alpha_m}, \ \text{if }\ G_m(x_i) \neq y_i \end{cases}$

[Optional step: use buckets to draw a new dataset based on the sample weights]

(e) Normalize the weights

3. Return the final prediction, $G(x) = \text{sign}\big[\sum \limits_{m=1}^M \alpha_m G_m(x)\big]$

The $\mathbb{1}(G_m(x_i) \neq y_i)$ function in step 2b refers to the 0/1 loss,

$\mathbb{1}(G_m(x_i) \neq y_i) = \begin{cases} 0, \ \text{if }\ G_m(x_i) = y_i \\ 1, \ \text{if }\ G_m(x_i) \neq y_i \end{cases}$
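To make the procedure concrete, here is a minimal Python sketch of the loop above. It assumes the labels $y$ are encoded as $\{-1, +1\}$ and uses scikit-learn decision stumps as the weak learners; the function names are illustrative and not part of any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, M=10):
    """Fit M decision-tree stumps with AdaBoost weight updates (y in {-1, +1})."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # step 1: uniform initial weights
    stumps, alphas = [], []

    for _ in range(M):
        # step 2a: fit a weak learner on the weighted data
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)

        # step 2b: weighted 0/1 error (clipped to keep the log finite)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)

        # step 2c: amount of say
        alpha = 0.5 * np.log((1 - err) / err)

        # steps 2d-2e: shrink the weights of correct samples, grow the incorrect ones, normalize
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()

        stumps.append(stump)
        alphas.append(alpha)

    return stumps, alphas


def adaboost_predict(stumps, alphas, X):
    # step 3: sign of the alpha-weighted vote of all stumps
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)
```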

Setup: initializing the weights

To understand the mechanics, let's run a single iteration of AdaBoost on the following synthetic loan-approval dataset with 10 samples:

+----+-------+----------+------------+
|    |   Age |   Income |   Approval |
|----+-------+----------+------------|
|  0 |    20 |    10000 |          0 |
|  1 |    25 |    30000 |          1 |
|  2 |    30 |    60000 |          1 |
|  3 |    35 |    25000 |          1 |
|  4 |    40 |    15000 |          0 |
|  5 |    45 |    55000 |          0 |
|  6 |    50 |    22000 |          0 |
|  7 |    55 |    11000 |          0 |
|  8 |    60 |    27000 |          0 |
|  9 |    65 |    65000 |          1 |
+----+-------+----------+------------+

We will use decision tree stumps as the weak learners in this case. The first step is to assign the initial weight to each training sample (since the number of samples $N$ equals 10, $w_i = \frac{1}{N} = \frac{1}{10} = 0.1$):

+----+-------+----------+------------+------------------+
|    |   Age |   Income |   Approval |   Initial weight |
|----+-------+----------+------------+------------------|
|  0 |    20 |    10000 |          0 |              0.1 |
|  1 |    25 |    30000 |          1 |              0.1 |
|  2 |    30 |    60000 |          1 |              0.1 |
|  3 |    35 |    25000 |          1 |              0.1 |
|  4 |    40 |    15000 |          0 |              0.1 |
|  5 |    45 |    55000 |          0 |              0.1 |
|  6 |    50 |    22000 |          0 |              0.1 |
|  7 |    55 |    11000 |          0 |              0.1 |
|  8 |    60 |    27000 |          0 |              0.1 |
|  9 |    65 |    65000 |          1 |              0.1 |
+----+-------+----------+------------+------------------+
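If you want to follow along in code, the dataset and the initial weights can be reproduced with a few lines of pandas (the column names simply mirror the tables above):

```python
import pandas as pd

# The ten loan-approval samples from the table above.
data = pd.DataFrame({
    "Age":      [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    "Income":   [10000, 30000, 60000, 25000, 15000, 55000, 22000, 11000, 27000, 65000],
    "Approval": [0, 1, 1, 1, 0, 0, 0, 0, 0, 1],
})

# Step 1: every sample starts with the same weight, 1/N = 1/10 = 0.1.
data["Initial weight"] = 1 / len(data)
print(data)
```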

Choosing the first stump

Let's calculate the Gini index for the 2 possible splits. The Gini impurity is calculated as $G = \sum \limits_{i=1}^C p(i)\cdot (1-p(i))$, where $C$ is the total number of classes (2 in the binary classification setting), and $p(i)$ is the probability of picking a sample of the $i$-th class. The split with the lower Gini impurity will be the first weak learner (AdaBoost starts with the decision tree stump that has the lowest impurity because it classifies the largest share of the data points correctly).

Sample decision tree stumps

$\text{Gini Impurity}(\text{Age} \leq 37.5) \approx 0.32$

$\text{Gini Impurity}(\text{Income} \leq 23500) \approx 0.27$

Income has the lower impurity, so the Income $\leq 23500$ stump will be used as the first weak learner.
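As a quick check of these two values, here is a short sketch that computes the weighted Gini impurity of each candidate stump (the helper functions are illustrative):

```python
import numpy as np

age      = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65])
income   = np.array([10000, 30000, 60000, 25000, 15000, 55000, 22000, 11000, 27000, 65000])
approval = np.array([0, 1, 1, 1, 0, 0, 0, 0, 0, 1])


def gini(labels):
    """Gini impurity of one leaf: sum over classes of p * (1 - p)."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return float(np.sum(p * (1 - p)))


def split_impurity(feature, threshold, labels):
    """Weighted Gini impurity of the stump 'feature <= threshold'."""
    left, right = labels[feature <= threshold], labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


print(round(split_impurity(age, 37.5, approval), 2))       # 0.32
print(round(split_impurity(income, 23500, approval), 2))   # 0.27
```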

The first weight update

Next, let's calculate the amount of say (AOS) for the Income stump. The total error here is just the sum of the weights of the incorrectly classified samples. The Income split misclassifies 2 samples (samples 5 and 8, which it predicts as approved), so the AOS will be

$\text{AOS}(\text{Income}) = \frac{1}{2} \ln \Big(\frac{1 - \text{Total error}}{\text{Total error}} \Big) = \frac{1}{2} \ln \Big(\frac{1 - 0.2}{0.2} \Big) \approx 0.693$

After calculating the AOS, we can update the weight of each sample based on whether it was classified correctly or not. The new sample weight for an incorrectly classified sample:

$\text{new sample weight (incorrect)} = w_i \cdot e^{\text{AOS}} = 0.1 \cdot e^{0.693} \approx 0.2$

The new sample weight for a correctly classified sample:

$\text{new sample weight (correct)} = w_i \cdot e^{-\text{AOS}} = 0.1 \cdot e^{-0.693} \approx 0.05$

The weights should be normalized so they add up to 1. The updated weights sum to $8 \cdot 0.05 + 2 \cdot 0.2 = 0.8$, so the new weight of an incorrectly classified sample becomes $\frac{0.2}{0.8} = 0.25$, and the new weight of a correctly classified sample becomes $\frac{0.05}{0.8} = 0.0625$. Normalization is required to keep the algorithm stable and less affected by outliers (although it does not eliminate the effect of outliers entirely), and to avoid becoming too biased towards the misclassified samples (i.e., to reduce the risk of overfitting).

Thus, the first weight update looks like this:

+----+-------+----------+------------+------------------+-----------------+
|    |   Age |   Income |   Approval |   Initial weight |   Weight update |
|----+-------+----------+------------+------------------+-----------------|
|  0 |    20 |    10000 |          0 |              0.1 |          0.0625 |
|  1 |    25 |    30000 |          1 |              0.1 |          0.0625 |
|  2 |    30 |    60000 |          1 |              0.1 |          0.0625 |
|  3 |    35 |    25000 |          1 |              0.1 |          0.0625 |
|  4 |    40 |    15000 |          0 |              0.1 |          0.0625 |
|  5 |    45 |    55000 |          0 |              0.1 |          0.25   |
|  6 |    50 |    22000 |          0 |              0.1 |          0.0625 |
|  7 |    55 |    11000 |          0 |              0.1 |          0.0625 |
|  8 |    60 |    27000 |          0 |              0.1 |          0.25   |
|  9 |    65 |    65000 |          1 |              0.1 |          0.0625 |
+----+-------+----------+------------+------------------+-----------------+
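The same update is easy to verify numerically; the sketch below recomputes the total error, the AOS, and the normalized weights for the Income $\leq 23500$ stump:

```python
import numpy as np

income   = np.array([10000, 30000, 60000, 25000, 15000, 55000, 22000, 11000, 27000, 65000])
approval = np.array([0, 1, 1, 1, 0, 0, 0, 0, 0, 1])
w = np.full(10, 0.1)                                     # initial weights

# The first stump: approve whenever Income > 23500.
pred = (income > 23500).astype(int)
misclassified = pred != approval                         # samples 5 and 8

# Total error is the sum of the weights of the misclassified samples.
total_error = w[misclassified].sum()                     # 0.2
aos = 0.5 * np.log((1 - total_error) / total_error)      # ~0.693

# Grow the weights of the errors, shrink the rest, then normalize to sum to 1.
w_new = np.where(misclassified, w * np.exp(aos), w * np.exp(-aos))
w_new = w_new / w_new.sum()
print(np.round(w_new, 4))    # 0.0625 everywhere except 0.25 for samples 5 and 8
```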

Selecting the next training set

Now we can select the next training set, in which the misclassified samples will be more prominent (this resampling is an optional step in the original algorithm description and offers an alternative way of fitting the next stump). We generate $N$ random floats in the $[0, 1]$ range and select the new set of samples by checking which bucket (weight interval) each float falls into:

+----+-------+----------+------------+-----------------+------------------+
|    |   Age |   Income |   Approval |   Weight update | Ranges           |
|----+-------+----------+------------+-----------------+------------------|
|  0 |    20 |    10000 |          0 |          0.0625 | [0, 0.0625]      |
|  1 |    25 |    30000 |          1 |          0.0625 | [0.0625, 0.125]  |
|  2 |    30 |    60000 |          1 |          0.0625 | [0.125, 0.1875]  |
|  3 |    35 |    25000 |          1 |          0.0625 | [0.1875, 0.25]   |
|  4 |    40 |    15000 |          0 |          0.0625 | [0.25, 0.3125]   |
|  5 |    45 |    55000 |          0 |          0.25   | [0.3125, 0.5625] |
|  6 |    50 |    22000 |          0 |          0.0625 | [0.5625, 0.625]  |
|  7 |    55 |    11000 |          0 |          0.0625 | [0.625, 0.6875]  |
|  8 |    60 |    27000 |          0 |          0.25   | [0.6875, 0.9375] |
|  9 |    65 |    65000 |          1 |          0.0625 | [0.9375, 1]      |
+----+-------+----------+------------+-----------------+------------------+

Suppose the random set is [0.82, 0.07, 0.38, 0.91, 0.08, 0.52, 0.1, 0.42, 0.63, 0.16]. First, check where 0.82 lies: it falls into the interval for sample 8, so sample 8 is drawn into the new set. Next, 0.07 lies in the interval for sample 1, so sample 1 is drawn, and so on. The misclassified samples will be more prominent in the new training set since their intervals are larger, so they have a higher probability of appearing in the new set. The new training set for the next iteration (with the updated weights re-normalized so that the drawn samples' weights sum to 1) becomes:

+------------+-------+----------+------------+-------------------------------+
|   Sample # |   Age |   Income |   Approval |   Updated weight (normalized) |
|------------+-------+----------+------------+-------------------------------|
|          8 |    60 |    27000 |          0 |                        0.16   |
|          1 |    25 |    30000 |          1 |                        0.04   |
|          5 |    45 |    55000 |          0 |                        0.16   |
|          8 |    60 |    27000 |          0 |                        0.16   |
|          1 |    25 |    30000 |          1 |                        0.04   |
|          5 |    45 |    55000 |          0 |                        0.16   |
|          1 |    25 |    30000 |          1 |                        0.04   |
|          5 |    45 |    55000 |          0 |                        0.16   |
|          7 |    55 |    11000 |          0 |                        0.04   |
|          2 |    30 |    60000 |          1 |                        0.04   |
+------------+-------+----------+------------+-------------------------------+
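The bucket selection above is straightforward to reproduce with a cumulative sum and a binary search; the sketch below uses the same random floats as the walk-through:

```python
import numpy as np

# Normalized weights after the first update (samples 5 and 8 are heavier).
weights = np.array([0.0625, 0.0625, 0.0625, 0.0625, 0.0625,
                    0.25, 0.0625, 0.0625, 0.25, 0.0625])

# Right edges of the buckets: [0.0625, 0.125, ..., 0.9375, 1.0].
buckets = np.cumsum(weights)

# The random floats from the walk-through above.
draws = np.array([0.82, 0.07, 0.38, 0.91, 0.08, 0.52, 0.1, 0.42, 0.63, 0.16])

# Each draw selects the sample whose bucket it falls into.
selected = np.searchsorted(buckets, draws)
print(selected)    # [8 1 5 8 1 5 1 5 7 2]
```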

The final prediction

The iterations continue until a stopping condition is met, for example, when the error rate of a weak learner exceeds a threshold, or when all $M$ boosting rounds have been completed. At that point, we have the error rate of each weak learner $G_i$ in the ensemble (their predictions $G_i(x) \in \{-1, 1\}$ in the binary classification setting), and we have calculated the amount of say $\alpha_i$ that each learner contributes to the final prediction. The final output will be

$\text{prediction} = \text{sign} \Big(\sum_{i = 1}^M \alpha_i G_i(x) \Big)$, where $M$ is the number of AdaBoost rounds. More accurate classifiers will have a higher weight.
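For instance, with three hypothetical weak learners, the weighted vote for a single new sample might look like this (the numbers are made up purely for illustration):

```python
import numpy as np

# Hypothetical amounts of say for three weak learners and their votes
# (each vote is in {-1, +1}) on one new sample.
alphas = np.array([0.69, 0.20, 0.30])
votes  = np.array([+1, -1, -1])

# 0.69 - 0.20 - 0.30 = 0.19 > 0: the most accurate learner outweighs the other two.
prediction = np.sign(np.dot(alphas, votes))
print(prediction)    # 1.0
```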

Conclusion

You are now familiar with the inner workings of the AdaBoost algorithm in the binary classification setting.
