
Gradient boosting


Gradient boosting is an algorithm that combines weak learners to create a strong model. It works by iteratively adding new models, each of which corrects the errors of the previous ones. Gradient boosting can be applied to both regression and classification tasks.

In this topic, we will review the theoretical foundations and consider the main properties of gradient boosting.

Gradient boosting: the high-level overview

Gradient boosting essentially constructs a sequence of (most often, but not limited to) decision trees, where each tree is trained on the errors of the trees built before it. With every round of boosting, the ensemble improves: each new tree nudges the overall prediction a little further in the correct direction. These adjustments are derived from the gradient of a loss function, hence the term gradient boosting. Gradients are used because they carry information about both the direction and the magnitude of the error (a negative gradient points in the direction of the greatest error decrease), whereas plain differences only capture the magnitude. This information allows the model to be updated more efficiently.

The gradient boosting procedure could be summarized as:

1. Construct the base tree (a root node).

2. Fit the next tree to the errors of the previous tree.

3. Combine the trees from the first two steps. Repeat the second step.

The output of gradient boosting is the additive model of the weak learners.
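
In practice, one rarely implements this loop by hand; a library implementation is used instead. Below is a minimal sketch using scikit-learn's GradientBoostingRegressor and GradientBoostingClassifier (assuming scikit-learn is installed; the synthetic datasets and parameter values are purely illustrative):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Regression: every new tree is fit to the errors of the current ensemble.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
reg.fit(X_train, y_train)
print("Regression R^2 on the test set:", reg.score(X_test, y_test))

# Classification works the same way, only with a different loss function.
Xc, yc = make_classification(n_samples=1000, n_features=10, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, random_state=42)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(Xc_train, yc_train)
print("Classification accuracy on the test set:", clf.score(Xc_test, yc_test))
```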

The gradient boosting algorithm

In this section, we will formalize the gradient boosting procedure.

Step 0. Initialization:

  • set the number of trees (the number of boosting rounds), T;
  • set the differentiable loss function, L(y^{(i)}, h(\mathbf{x}^{(i)}));
  • set the learning rate (the contribution of each tree towards the final prediction), \alpha.

Note: we can also set specific restrictions on the tree structure to prevent overfitting, which will be discussed later.

Step 1. Initialize the model (create the root node; h_0(\mathbf{x}) is its prediction):

h_0(\mathbf{x}) = \argmin_{\hat{y}} \sum \limits_{i=1}^n L(y^{(i)}, \hat{y})

The base model is denoted as h_0(\mathbf{x}) and is a constant prediction, \hat{y}, that minimizes the loss over the training set. This initial model is the starting point for the subsequent boosting rounds.
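
For example, with the squared-error loss L(y^{(i)}, \hat{y}) = \frac{1}{2}(y^{(i)} - \hat{y})^2, setting the derivative of the total loss with respect to \hat{y} to zero shows that the optimal constant is simply the mean of the targets (for the absolute-error loss it would be the median):

\frac{\partial}{\partial \hat{y}} \sum \limits_{i=1}^n \frac{1}{2}\left(y^{(i)} - \hat{y}\right)^2 = -\sum \limits_{i=1}^n \left(y^{(i)} - \hat{y}\right) = 0 \quad \Rightarrow \quad h_0(\mathbf{x}) = \frac{1}{n} \sum \limits_{i=1}^n y^{(i)}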

Step 2. For t = 1 to t = T:

Firstly, compute the gradients (also known as the pseudo-residuals):

r_{i, t} = -\left[ \frac{\partial L(y^{(i)}, h(\mathbf{x}^{(i)}))}{\partial h(\mathbf{x}^{(i)})} \right]_{h(\mathbf{x}) = h_{t-1}(\mathbf{x})}, \quad \text{for } i = 1, \dots, n

Secondly, fit a tree to the r_{i, t} values, creating the terminal regions R_{j, t} for j = 1, \dots, J_t.

Third, for j = 1, \dots, J_t:

\hat{y}_{j, t} = \argmin_{\hat{y}} \sum \limits_{\mathbf{x}^{(i)} \in R_{j, t}} L(y^{(i)}, h_{t-1}(\mathbf{x}^{(i)}) + \hat{y})

Lastly, update the prediction,

h_t(\mathbf{x}) = h_{t-1}(\mathbf{x}) + \alpha \sum \limits_{j=1}^{J_t} \hat{y}_{j, t}\ \mathbb{I}(\mathbf{x} \in R_{j,t})

Step 3. Return the final prediction, h_T(\mathbf{x}).
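
To make the procedure concrete, here is a minimal from-scratch sketch for the squared-error loss, using scikit-learn's DecisionTreeRegressor as the weak learner (the function names are illustrative, not from any library). For squared error the pseudo-residuals are just ordinary residuals, and the leaf means of a regression tree already give the optimal \hat{y}_{j, t} values; other losses would require a separate per-leaf optimization.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gradient_boosting_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting for regression with the squared-error loss."""
    # Step 1: the constant prediction that minimizes squared error is the mean.
    h0 = np.mean(y)
    prediction = np.full(len(y), h0)
    trees = []
    for _ in range(n_rounds):
        # Compute the pseudo-residuals (negative gradients of the loss).
        residuals = y - prediction
        # Fit a small tree to the pseudo-residuals; its leaves play the
        # role of the terminal regions R_{j, t}.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Update the ensemble prediction, scaled by the learning rate.
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return h0, trees


def gradient_boosting_predict(X, h0, trees, learning_rate=0.1):
    # Step 3: the final prediction is the additive model of all weak learners.
    prediction = np.full(X.shape[0], h0)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction
```

Calling gradient_boosting_fit on training data and then gradient_boosting_predict on new data reproduces the behaviour of the formal algorithm above for this particular loss.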

Let's elaborate on the procedure described above. First, we set the number of trees (the number of boosting rounds); a standard initial value is 100. Setting it too high might result in overfitting, while setting it too low makes the ensemble underfit. Next, the loss function is chosen based on the task and on the properties of the particular losses: for example, MSE or Huber loss for regression (Huber loss is suitable for robust regression) and log loss or exponential loss for classification.

The learning rate, \alpha \in (0, 1], is the size of the update: it scales the contribution of each tree towards the final prediction. \alpha acts as a regularizer. Lower values, typically 0.01 < \alpha < 0.1, tend to perform better, and more shrinkage (a smaller \alpha) results in lower test errors, but T might need to be increased. In other words, there is a tradeoff between \alpha and T: a higher \alpha leads to faster convergence but might result in overfitting. Below, you can see an illustration of running gradient boosting on a synthetic dataset for 100 and 1000 iterations with different learning rates and their effect on the test set deviance (lower deviance indicates better performance). In the second picture, relatively higher rates lead to overfitting, while in the first one the learning rate of 0.1 didn't converge in just 100 iterations.

The effect of different learning rates (from 0.1 to 1) over 100 and 1000 rounds of boosting on the test set deviance
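
An experiment of this kind can be sketched with scikit-learn's staged_predict, which yields the ensemble prediction after every boosting round (here a synthetic Friedman dataset and squared error on a holdout set stand in for the deviance shown in the pictures; the exact numbers are illustrative):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for learning_rate in (1.0, 0.5, 0.1, 0.05, 0.01):
    model = GradientBoostingRegressor(
        n_estimators=1000, learning_rate=learning_rate, max_depth=4, random_state=0
    )
    model.fit(X_train, y_train)
    # Track the test error as trees are added, one value per boosting round.
    test_errors = [
        mean_squared_error(y_test, y_pred) for y_pred in model.staged_predict(X_test)
    ]
    best_round = int(np.argmin(test_errors))
    print(f"learning_rate={learning_rate}: best test MSE {min(test_errors):.3f} "
          f"at round {best_round + 1}")
```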

After that, the initial prediction is formed; it is equal to the \hat{y} that minimizes the chosen loss. Next, we iterate through the boosting rounds. The gradient is computed for every sample. In the gradient boosting context, the gradients are often referred to as residuals or pseudo-residuals; "pseudo" refers to the fact that a partial derivative is taken, so it is not a residual in the traditional sense. Then, a tree is fit on the gradients instead of the original target values, and the terminal regions are created. The value of each terminal node is set to the prediction that minimizes the loss. Then, the prediction h_t(\mathbf{x}) is updated with the new tree, scaled by the learning rate. When the iterations are over (due to exhausting the rounds, early stopping, etc.), the prediction from the last iteration is returned.

Some considerations for gradient boosting

Here, we will look at some properties of gradient boosting and discuss certain optimizations. We already went through the effects of the learning rate and the number of boosting rounds, but there are other parameters that help to combat overfitting.

One of these parameters is the maximum tree depth, which limits the number of levels the decision trees can have. A small depth may underfit the data and fail to capture feature interactions properly, while a large depth can overfit. Usually, a tree depth in the [4, 8] range works well.

The effects of various maximum tree depths on the test set deviance

Subsampling is another technique used in gradient boosting: a fraction \eta of the training observations is sampled (without replacement) at each iteration to grow the next tree in the sequence. The remaining steps of the algorithm are identical. This technique is known as stochastic gradient boosting. The value of \eta is usually 1/2, but for bigger datasets \eta may be lower (somewhere in [0.3, 0.6]). A smaller \eta reduces variance and increases bias, while a larger one increases variance and reduces bias (by changing the training sample at each iteration, stochastic gradient boosting helps to prevent overfitting). Subsampling on its own won't do much and should be combined with a lower learning rate.

The effects of different combinations of learning rates and subsampling on the test set deviance
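
A minimal sketch of this setup in scikit-learn, where the subsample parameter plays the role of \eta and is paired with a lower learning rate (the dataset and values are only illustrative):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)

# subsample < 1.0 turns this into stochastic gradient boosting: each tree
# is grown on a random half of the training set, sampled without replacement.
model = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.05,  # subsampling alone helps little, so pair it with shrinkage
    subsample=0.5,
    max_depth=4,
    random_state=0,
)
model.fit(X, y)
```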

Early stopping can be applied as well. There are various early stopping criteria: for example, the iterations terminate when the error on a holdout set starts to increase, or when it does not change significantly over some number of iterations.
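
As one concrete example, scikit-learn's implementation supports a holdout-based criterion of this kind through the n_iter_no_change, validation_fraction, and tol parameters (the values below are just a sketch):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on the number of boosting rounds
    learning_rate=0.05,
    validation_fraction=0.1,  # holdout set carved out of the training data
    n_iter_no_change=10,      # stop if 10 rounds bring no improvement...
    tol=1e-4,                 # ...larger than this tolerance
    random_state=0,
)
model.fit(X, y)
print("Rounds actually used:", model.n_estimators_)
```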

The minimum number of samples in a leaf node affects the depth and complexity of the tree. A small value can result in overfitting, since the model can grow complex trees whose leaves contain only a few samples, while a large value can result in underfitting. This parameter should be chosen alongside the maximum tree depth.

There are other parameters, such as limiting the maximum number of features used (reducing the features to a few of the most important ones can generally improve the quality of the prediction), which also help to reduce the variance of the model.
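
The tree-structure restrictions from the last few paragraphs map directly onto estimator parameters; for instance, in scikit-learn (the particular values are just reasonable starting points, not recommendations):

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    max_depth=4,          # limits how many feature interactions each tree captures
    min_samples_leaf=20,  # forbids leaves that fit only a handful of samples
    max_features="sqrt",  # each split considers only a random subset of features
)
```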

As we can see, gradient boosting has quite a few hyperparameters, but one can achieve pretty good initial accuracy by starting with a low learning rate and a high number of boosting rounds alone, and then fine-tuning the other parameters with cross-validation.
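
A sketch of that strategy with cross-validation (assuming scikit-learn; the grid and dataset are illustrative):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)

# Start from a low learning rate and a generous number of rounds,
# then fine-tune the tree-structure parameters with cross-validation.
base = GradientBoostingRegressor(learning_rate=0.05, n_estimators=500, random_state=0)
param_grid = {
    "max_depth": [3, 4, 6],
    "min_samples_leaf": [5, 20, 50],
    "subsample": [0.5, 0.8, 1.0],
}
search = GridSearchCV(base, param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```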

Gradient boosting tries to reduce both the bias (e.g., by increasing the number of boosting rounds) and the variance (e.g., by decreasing the learning rate).

Conclusion

As a result, you are now familiar with the core ideas of gradient boosting, such as:

  • Gradient boosting composes multiple weak learners (most often, decision trees) to form a strong prediction;
  • Each consecutive learner is fit on the errors of the previous learners;
  • The final prediction is the additive model of the weak learners, where each weak learner has the same scaling factor (\alpha).

Also, you now know some of the techniques for improving accuracy (e.g., tuning \alpha, adjusting the number of rounds, using subsampling, etc.).
