
Gradient boosting: the illustrated regression


Gradient boosting is a popular machine-learning technique used in both regression and classification problems. It works by iteratively adding weak learners, which are usually based on decision trees, to form a strong predictive model. In this topic, we will perform the calculations for gradient boosting in the regression setting.

The setup

To understand the stages of gradient boosting, let's run it on the following dataset:

HouseAge AveRooms Population MedHouseVal
25 4 1392 0.5
30 5 1565 0.5
52 4 1310 5
17 6 1705 2.2
34 5 1063 2.8
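If you want to follow the calculations along in code, here is a minimal sketch of this dataset as plain Python lists (the variable names X and y are our own choice, not part of the original data):

```python
# Features: HouseAge, AveRooms, Population (one row per sample)
X = [
    [25, 4, 1392],
    [30, 5, 1565],
    [52, 4, 1310],
    [17, 6, 1705],
    [34, 5, 1063],
]
# Target: MedHouseVal
y = [0.5, 0.5, 5.0, 2.2, 2.8]
```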

A clarification on the loss function is required. The squared error loss L = \frac{1}{2}(y^{(i)} - \hat{y})^2 with a scaling factor of 1/2 has an extremely convenient derivative, so it will be easy to demonstrate the steps of the algorithm:

\frac{\partial }{\partial h(\mathbf{x}^{(i)})} \frac{1}{2}(y^{(i)} - h(\mathbf{x}^{(i)}))^2 = 2 \times \frac{1}{2} (y^{(i)} - h(\mathbf{x}^{(i)})) \times (0-1) \\= -(y^{(i)} -h(\mathbf{x}^{(i)} )) = h(\mathbf{x}^{(i)}) - y^{(i)} \\ \rightarrow r_i = -(h(\mathbf{x}^{(i)}) - y^{(i)}) = y^{(i)} - h(\mathbf{x}^{(i)})

For the reasons explained above, we set the loss to squared error loss with a scaling factor of 1/2. For the sake of simplicity, only 2 iterations will be performed (T = 2), and the learning rate is set to \alpha = 0.1.
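The loss and its derivative can be sketched as two small functions; the names loss and negative_gradient are ours. The block also checks the analytic derivative against a finite-difference approximation:

```python
def loss(y_true, y_pred):
    # Squared error with the 1/2 scaling factor: L = 1/2 * (y - y_hat)^2
    return 0.5 * (y_true - y_pred) ** 2

def negative_gradient(y_true, y_pred):
    # dL/dy_hat = y_hat - y_true, so the negative gradient (the residual) is y_true - y_hat
    return y_true - y_pred

# Sanity check: compare the analytic dL/dy_hat with a central finite difference
eps = 1e-6
y_true, y_pred = 2.8, 2.2
numeric = (loss(y_true, y_pred + eps) - loss(y_true, y_pred - eps)) / (2 * eps)
assert abs(-negative_gradient(y_true, y_pred) - numeric) < 1e-6
```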

Initialize the model

To make a prediction at a given node of a decision tree, we can simply compute the average target value over all training examples that reach that node. For demonstration purposes, however, we will be more careful and stick to the formal algorithm.

h_0(\mathbf{x}) = \argmin_{\hat{y}} \sum \limits_{i=1}^n L(y^{(i)}, \hat{y}) = \argmin_{\hat{y}} \sum \limits_{i=1}^n \frac{1}{2} (y^{(i)} - \hat{y})^2

Now, we have to find the prediction \hat{y} that minimizes the sum \sum \limits_{i=1}^n \frac{1}{2} (y^{(i)} - \hat{y})^2. The derivative of the loss with respect to \hat{y} is \hat{y} - y^{(i)}, so we set the sum of derivatives to 0, expand, and solve for \hat{y}:

\sum \limits_{i=1}^5 \frac{\partial L}{\partial \hat{y}} =0 \rightarrow (0.5 - \hat{y}) + (0.5 - \hat{y})+ (5 - \hat{y})+ (2.2 - \hat{y}) + (2.8 - \hat{y}) =0 \\ \rightarrow 11 = 5 \hat{y} \rightarrow \hat{y} = 2.2

So, for this particular loss function, h_0(\mathbf{x}) is just the average of the target values:

h_0(\mathbf{x}) = \frac{1}{n} \sum\limits_{i=1}^n y^{(i)} = 2.2
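In code, the initialization therefore reduces to taking the mean of the targets (h0 is our variable name for h_0(\mathbf{x})):

```python
y = [0.5, 0.5, 5.0, 2.2, 2.8]
# The argmin of the summed squared error is the mean of the targets
h0 = sum(y) / len(y)
```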

Computing the gradients

Since we already know what the partial derivative looks like, the first-round residual for sample i simplifies to

r_i = y^{(i)} - h_0(\mathbf{x}^{(i)})

In this case, the gradient vector looks like

(0.5 - 2.2, 0.5 - 2.2, 5 - 2.2, 2.2 - 2.2, 2.8 - 2.2) = (-1.7, -1.7, 2.8, 0, 0.6)

Next, we fit a decision tree on the obtained gradients:
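Computing that vector is a one-liner; the rounding below only tidies floating-point noise (r1 is our variable name for the first-round residuals):

```python
y = [0.5, 0.5, 5.0, 2.2, 2.8]
h0 = 2.2  # the initial prediction from the previous step

# Negative gradient of the 1/2-scaled squared error: y - h0 for every sample
r1 = [round(yi - h0, 10) for yi in y]
# -> [-1.7, -1.7, 2.8, 0.0, 0.6]
```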

A decision tree fitted on the first round gradients

If you are wondering how the decision boundaries were obtained (i.e., why the HouseAge feature was split at <= 43), consider reading an article about How Decision Trees Split Nodes, from Loss Function Perspective. In this topic, we get two identical trees on both iterations (they share the decision boundaries, and the samples that end up in each leaf are the same). In practice, since each consecutive tree is fitted on the negative gradients (which are re-computed at every iteration), different splitting boundaries might be selected for the new tree.

Below, we go through the terminal node value calculation from the definition.

Let's say that the node [\text{samples} = 1, \text{value} = 0.6] is the first leaf (denoted by R_{j,t}, where j is the node index and t is the tree index in the procedure, so the first leaf is R_{1,1}), and the node [\text{samples} = 2, \text{value} = -1.7] is the second leaf (denoted by R_{2,1}). Sample #5 ends up in R_{1,1}, and samples #1 and #2 end up in R_{2,1}.

Then, calculating the update:

\hat{y}_{1, 1} = \argmin_{\hat{y}} \sum \limits_{\mathbf{x}^{(i)} \in R_{j, t}} L(y^{(i)}, h_{t-1}(\mathbf{x}^{(i)}) + \hat{y}) = \argmin_{\hat{y}} \frac{1}{2} (y^{(5)} - (h_0(\mathbf{x}^{(5)}) + \hat{y} ))^2 \\ = \argmin_{\hat{y}} \frac{1}{2} (2.8 - (2.2 + \hat{y}))^2 = \argmin_{\hat{y}} \frac{1}{2} (0.6 - \hat{y})^2 \rightarrow [\text{minimizing}] \\ \frac{1}{2}(0.6 - \hat{y})^2 = 0 \rightarrow \hat{y} = 0.6

The calculation for R_{2,1} will be similar, with the only difference being that we minimize the sum of the two losses:

\hat{y}_{2, 1} = \argmin_{\hat{y}} \big [ \frac{1}{2} (y^{(1)} - (h_0(\mathbf{x}^{(1)}) + \hat{y} ))^2 + \frac{1}{2} (y^{(2)} - (h_0(\mathbf{x}^{(2)}) + \hat{y} ))^2 \big]

Note: due to the choice of the loss function, each terminal region's value is, in this case, the average of the residuals of the samples that ended up in that leaf. Since we have two equal values in R_{2,1}, its predicted value is \frac{-1.7 - 1.7}{2} = -1.7; if the two (or any n) samples in the leaf had different values, with this loss function we would still take their average. Calculating the prediction with a different loss and n samples would be akin to the \hat{y}_{2,1} equation, with n terms in the sum and the selected loss.
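The terminal node values can be sketched as averages of residuals per leaf. The leaf groupings below (0-based sample indices) are taken from the fitted tree shown in the figure and are an assumption of this sketch:

```python
r1 = [-1.7, -1.7, 2.8, 0.0, 0.6]  # first-round residuals

# Leaf groupings from the figure: sample #5 alone, samples #1 and #2 together, etc.
leaves = [[4], [0, 1], [2], [3]]

# With the 1/2-scaled squared error, each leaf value is the mean residual in the leaf
leaf_values = [sum(r1[i] for i in leaf) / len(leaf) for leaf in leaves]
# R_{1,1} -> 0.6, R_{2,1} -> (-1.7 - 1.7) / 2 = -1.7, and so on
```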

Updating the predictions

We update the predictions, which yields:

h_1(\mathbf{x}) = h_0 (\mathbf{x}) + \alpha \sum\limits_{j=1}^4 \hat{y}_{j, 1} \, \mathbb{1}(\mathbf{x} \in R_{j, 1})

Represented graphically, the prediction after the first round is:

The updated prediction after the first round of boosting

Now, we can predict for a single sample. Let's consider the fifth sample, which ends up in the terminal node with the value 0.6. With the learning rate \alpha = 0.1, the prediction is h_1(\mathbf{x}^{(5)}) = 2.2 + 0.1 \cdot 0.6 = 2.26.

HouseAge AveRooms Population MedHouseVal h_0(x) r_1 \hat{y}_1
25 4 1392 0.5 2.2 -1.7 2.03
30 5 1565 0.5 2.2 -1.7 2.03
52 4 1310 5 2.2 2.8 2.48
17 6 1705 2.2 2.2 0 2.2
34 5 1063 2.8 2.2 0.6 2.26
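The \hat{y}_1 column can be reproduced in one pass. Because every sample's leaf value here equals its own residual (samples #1 and #2 share the value -1.7), the update simplifies to adding \alpha times the residual:

```python
h0 = 2.2          # initial prediction
alpha = 0.1       # learning rate
r1 = [-1.7, -1.7, 2.8, 0.0, 0.6]  # first-round residuals

# Each sample moves by alpha times the value of the leaf it falls into;
# in this example that leaf value equals the sample's own residual
h1 = [round(h0 + alpha * ri, 10) for ri in r1]
# -> [2.03, 2.03, 2.48, 2.2, 2.26]
```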

We can see that, compared to the initial prediction h_0(x), \hat{y}_1 moves closer to the true targets (MedHouseVal).

The second boosting round

Next, we can re-compute the gradients for the second round:

r_2 = (y^{(i)} - \hat{y}_1^{(i)}) = (0.5 - 2.03, 0.5 - 2.03, ..., 2.8 - 2.26) \\= (-1.53, -1.53, 2.52, 0, 0.54)

A decision tree fitted on the second round gradients

Now, the updated prediction after the two rounds is:

h_2(\mathbf{x}) = h_0 (\mathbf{x}) + \alpha \sum\limits_{j=1}^4 \hat{y}_{j, 1} \, \mathbb{1}(\mathbf{x} \in R_{j, 1}) + \alpha \sum\limits_{j=1}^4 \hat{y}_{j, 2} \, \mathbb{1}(\mathbf{x} \in R_{j, 2})

The updated set of predictions is:

\hat{y}_2 = (2.03 + 0.1 \cdot (-1.53), ..., 2.26 + 0.1 \cdot 0.54) = (1.877, 1.877, 2.732, 2.2, 2.314)

Again, we see that the predictions iteratively move closer to the real targets:

MedHouseVal h_0(x) \hat{y}_1 \hat{y}_2
0.5 2.2 2.03 1.877
0.5 2.2 2.03 1.877
5 2.2 2.48 2.732
2.2 2.2 2.2 2.2
2.8 2.2 2.26 2.314
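The whole two-round procedure can be condensed into a short loop. This is an illustrative sketch rather than a real tree learner: instead of fitting decision trees, it reuses the leaf groupings shown in the figures (identical on both rounds in this example) and computes each terminal node value as the average residual in the leaf, as derived above:

```python
y = [0.5, 0.5, 5.0, 2.2, 2.8]
alpha, T = 0.1, 2
# Leaf groupings of the fitted trees (0-based sample indices, from the figures)
leaves = [[0, 1], [2], [3], [4]]

preds = [sum(y) / len(y)] * len(y)  # h_0: the mean target, 2.2 for every sample
for t in range(T):
    # Negative gradients of the 1/2-scaled squared error
    residuals = [yi - pi for yi, pi in zip(y, preds)]
    for leaf in leaves:
        # Terminal node value: average residual of the samples in the leaf
        value = sum(residuals[i] for i in leaf) / len(leaf)
        for i in leaf:
            preds[i] += alpha * value
preds = [round(p, 10) for p in preds]
# -> [1.877, 1.877, 2.732, 2.2, 2.314]
```

Swapping the fixed leaves list for an actual tree fitted on the residuals at every iteration turns this sketch into the general algorithm.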

Conclusion

As a result, you are now familiar with:

  • The general workflow for gradient boosting in the regression setting;
  • How to calculate the gradients and use them to update the predictions;
  • How gradient boosting makes the predictions move closer to the target values over the consecutive boosting rounds.