
AdaBoost


AdaBoost, short for Adaptive Boosting, was one of the first boosting algorithms for supervised learning that combines multiple weak learners in order to form a strong learner. AdaBoost adjusts the sample weights based on the performance of the previous weak learner. The weight update is based on the misclassifications by the previous weak learner, where misclassified samples are assigned a higher weight to increase their importance in future iterations.

AdaBoost was originally developed for classification, but it can also be used for regression tasks. We consider AdaBoost mainly from a historical perspective, since more recent ensembles, such as gradient boosting, typically outperform it.

An overview of the general boosting procedure

In boosting, a weak learner is a model that performs slightly better than random guessing. A strong learner is a model that performs significantly better than random guessing. The main idea of AdaBoost is to fit weak learners to repeatedly modified versions of the training data, producing a sequence of weak learners in which each subsequent learner tries to correct the errors of the previous ones. Once the set of weak learners is formed, weighted majority voting produces the final prediction.

The AdaBoost procedure can be summarized in the following steps (a short code sketch follows the list):

  1. Initialize the weight vector with uniform weights
  2. Loop:
    – Apply weak learners to weighted training examples
    – Increase the weight for misclassified examples
  3. (Weighted) majority voting on trained classifiers
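
To see this procedure in practice, here is a minimal sketch using scikit-learn's AdaBoostClassifier with decision tree stumps. The dataset and parameters are purely illustrative, and in older scikit-learn versions the estimator parameter is named base_estimator instead of estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Decision tree stumps (max_depth=1) as weak learners, 50 boosting rounds.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    random_state=42,
)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))
```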

Below is a visualization of three iterations of AdaBoost: each weak classifier (in AdaBoost, decision tree stumps are commonly used) builds a different decision boundary, and the boundaries are combined in the end:

Example of running AdaBoost on 3 iterations

The general AdaBoost schema is given below:

A schematic overview of AdaBoost

The AdaBoost algorithm

The AdaBoost procedure for classification can be formalized as follows:

1. Initialize the number of AdaBoost rounds, $M$ (usually, this refers to the number of decision tree stumps in the ensemble).

Initialize the weights, $w_i = 1/N$, $i = 1, 2, \dots, N$

2. For $m = 1$ to $M$:

(a) Fit a classifier $G_m(x)$ to the training data using the weights $w_i$;

(b) Compute the error,

$$\text{err}_m = \sum_{i=1}^N w_i \, \mathbb{1}(G_m(x_i) \neq y_i)$$

(c) Compute the amount of say, $\alpha_m = \frac{1}{2}\ln \frac{1-\text{err}_m}{\text{err}_m}$. The amount of say is small if the error is large, and vice versa.

(d) Update the weights,

$$w_i \leftarrow w_i \times \begin{cases} e^{-\alpha_m}, & \text{if } G_m(x_i) = y_i \\ e^{\alpha_m}, & \text{if } G_m(x_i) \neq y_i \end{cases}$$

[Optional step: use buckets to draw a new dataset based on the sample weights]

(e) Normalize the weights

3. Return the final prediction, $G(x) = \text{sign}\big[\sum_{m=1}^M \alpha_m G_m(x)\big]$

The $\mathbb{1}(G_m(x_i) \neq y_i)$ function in step 2b refers to the 0/1 loss,

$$\mathbb{1}(G_m(x_i) \neq y_i) = \begin{cases} 0, & \text{if } G_m(x_i) = y_i \\ 1, & \text{if } G_m(x_i) \neq y_i \end{cases}$$

Let's go through each step more carefully. In the first step, all samples are assigned the same weight, equal to $1/\text{no. of samples}$. We also determine the number of boosting rounds, denoted by $M$. Then, for each round $m = 1, \dots, M$, a weak learner $G_m(x)$ (in the binary classification setting, the predictions are $G_m(x) \in \{-1, 1\}$) is fit on the data, and we compute the weighted error rate, which in binary classification is $\text{err}_m = \sum_{i=1}^N w_i \, \mathbb{1}(G_m(x_i) \neq y_i)$. $\alpha_m$ is the amount of say of the weak learner in the ensemble, which indicates how much the weak learner will affect the final prediction: if the weak learner produces a high error, its amount of say will be smaller and it will have less influence on the prediction, and vice versa. We can visualize the relationship between the error and the amount of say:

A plot that shows the relationship between the error and the amount of say
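
To make this relationship concrete, the short snippet below (with arbitrary, illustrative error values) computes the amount of say for a few weighted error rates:

```python
import numpy as np

# Amount of say alpha_m = 0.5 * ln((1 - err_m) / err_m) for a few error values.
for err in [0.05, 0.2, 0.4, 0.49, 0.6]:
    alpha = 0.5 * np.log((1 - err) / err)
    print(f"err = {err:.2f} -> alpha = {alpha:+.3f}")

# Errors well below 0.5 give a large positive amount of say, an error of 0.5
# gives roughly zero, and errors above 0.5 give a negative amount of say
# (the weak learner's vote is effectively flipped).
```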

After obtaining the amount of say, the sample weights are recalculated and normalized (misclassified samples will have a higher weight). Then, we may draw bootstrap samples with weighted probability, so that in the next round, there will be more focus on the misclassified instances.

When the boosting rounds are over (or some other stopping condition is met, e.g., if the error rate of a weak learner starts to rise, all subsequent weak learners are discarded), weighted majority voting is applied to produce the final prediction.
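
Below is a minimal from-scratch sketch of the algorithm above in NumPy, using scikit-learn decision stumps as the weak learners and assuming labels in {-1, +1}. It is meant to mirror steps (a)-(e), not to be a production implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, M=50):
    """Fit AdaBoost with decision tree stumps; y is assumed to be in {-1, +1}."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                       # step 1: uniform initial weights
    learners, alphas = [], []

    for _ in range(M):                            # step 2: boosting rounds
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # (a) weighted fit
        pred = stump.predict(X)

        err = np.sum(w * (pred != y))             # (b) weighted error rate
        err = np.clip(err, 1e-10, 1 - 1e-10)      # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)     # (c) amount of say

        w = w * np.exp(-alpha * y * pred)         # (d) up-weight misclassified samples
        w = w / w.sum()                           # (e) normalize the weights

        learners.append(stump)
        alphas.append(alpha)

    return learners, alphas


def adaboost_predict(X, learners, alphas):
    """Step 3: sign of the weighted sum of the weak learners' predictions."""
    scores = sum(a * g.predict(X) for a, g in zip(alphas, learners))
    return np.sign(scores)


# Usage (labels must be converted to {-1, +1} first), e.g.:
# learners, alphas = adaboost_fit(X_train, 2 * y_train - 1)
# y_pred = adaboost_predict(X_test, learners, alphas)
```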

Some considerations for AdaBoost

AdaBoost is best suited for binary classification but can be used for multiclass classification and regression. AdaBoost can work with any weak learner, although it traditionally uses decision tree stumps. Other weak learners, such as support vector machines (SVMs), can also be used but might require tuning.
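
As an illustration of swapping the weak learner, the sketch below compares stumps with a linear SVM base estimator on a synthetic dataset. The parameters are assumptions for demonstration: probability=True is set so the SVM exposes predict_proba, which some scikit-learn versions require, and in older versions the parameter is named base_estimator rather than estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The traditional weak learner: a decision tree stump.
stump_ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)

# A linear SVM as the weak learner; probability=True exposes predict_proba,
# at the cost of noticeably slower training.
svm_ada = AdaBoostClassifier(
    estimator=SVC(kernel="linear", probability=True), n_estimators=20, random_state=0
)

for name, model in [("stumps", stump_ada), ("linear SVM", svm_ada)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```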

There are several limitations to using AdaBoost:

  1. AdaBoost is sensitive to noisy data and outliers: it assigns higher weights to the outliers and focuses on them too much, which leads to overfitting. Also, AdaBoost can have difficulty converging to a strong learner without overfitting when the data is not easily separated by a single decision boundary. In such cases, the trained weak learners may be very different from each other and overly specific to individual misclassifications.
  2. AdaBoost may not generalize well if there is not enough data available, or if the weak learners are either too weak or overly complex.
  3. AdaBoost assumes that the data is balanced, meaning that the number of samples in each class is roughly equal. In the case of imbalanced data, AdaBoost may produce a model that is biased towards the majority class, leading to poor prediction accuracy for the minority class.

Overfitting reduction techniques, such as early stopping, can be applied to AdaBoost.
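
For example, a simple form of early stopping can be sketched by monitoring the validation score after every boosting round via staged_score and keeping only the best number of rounds; the dataset and parameters below are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

ada = AdaBoostClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)

# staged_score yields the validation accuracy after each boosting round,
# so we can pick the round where the ensemble performed best.
val_scores = list(ada.staged_score(X_val, y_val))
best_round = int(np.argmax(val_scores)) + 1
print(f"Best validation accuracy {max(val_scores):.3f} at round {best_round}")

# Refit with only the selected number of rounds (a simple form of early stopping).
ada_small = AdaBoostClassifier(n_estimators=best_round, random_state=1).fit(X_train, y_train)
```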

AdaBoost was not originally designed to minimize a particular loss function, but it turns out to minimize the exponential loss:

$$\frac{1}{N} \sum_{i=1}^N \exp(-y_i \, G_m(x_i))$$

The choices of $\alpha_m$ and $G_m$ (where $m$ is the respective round of AdaBoost) on each round are exactly the choices that cause the greatest decrease in the exponential loss, but further details are outside the scope of this topic.
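
To make the loss concrete, the toy snippet below evaluates the exponential loss for a handful of hypothetical ensemble scores $F(x_i) = \sum_m \alpha_m G_m(x_i)$ (the values are made up for illustration):

```python
import numpy as np

# Exponential loss (1/N) * sum_i exp(-y_i * F(x_i)) for toy values, where
# F(x_i) is the ensemble score sum_m alpha_m * G_m(x_i) and y_i is in {-1, +1}.
y = np.array([1, 1, -1, -1, 1])
F = np.array([2.1, 0.3, -1.5, 0.4, -0.2])   # hypothetical ensemble scores

print(f"Exponential loss: {np.mean(np.exp(-y * F)):.3f}")
# Correctly classified samples with a large margin (y * F >> 0) contribute
# almost nothing, while misclassified samples (y * F < 0) are penalized
# exponentially, which is why outliers can dominate the training.
```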

A comparison between gradient boosting and AdaBoost

|                    | AdaBoost | Gradient boosting |
|--------------------|----------|-------------------|
| The loss           | Minimizes the exponential loss | Can use any differentiable loss |
| Base model         | Traditionally decision tree stumps, but other models could be used as well | Usually decision trees |
| The strategy       | Assigns weights to each training instance, and each iteration places extra emphasis on misclassified instances | Optimizes a differentiable loss function of a weak learner via consecutive rounds of boosting |
| The prediction     | Weighted majority voting of the weak learners | Additive model of multiple weak learners |
| Dealing with noise | Worse at handling outliers | Better at handling outliers |
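
As a quick, purely illustrative comparison (results depend heavily on the data and on tuning), the snippet below cross-validates scikit-learn's AdaBoostClassifier and GradientBoostingClassifier on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.05, random_state=7)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=7),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=7),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```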

Conclusion

To sum up, you are now familiar with the main concepts and the working scheme of AdaBoost:

  • AdaBoost combines a series of weak learners (most often, decision stumps) to produce a stronger model;
  • Some weak learners have a stronger influence on the final prediction based on how well they are able to predict the targets;
  • Each weak learner is trained by taking the errors of the previous models into account.