Gradient boosting is a popular machine-learning technique used in both regression and classification problems. It works by iteratively adding weak learners, which are usually based on decision trees, to form a strong predictive model. In this topic, we will perform the calculations for gradient boosting in the regression setting.
The setup
To understand the stages of gradient boosting, let's run the algorithm on the following dataset:
| HouseAge | AveRooms | Population | MedHouseVal |
|----------|----------|------------|-------------|
| 25 | 4 | 1392 | 0.5 |
| 30 | 5 | 1565 | 0.5 |
| 52 | 4 | 1310 | 5 |
| 17 | 6 | 1705 | 2.2 |
| 34 | 5 | 1063 | 2.8 |
A clarification on the loss function is required. The squared error loss $L\bigl(y, F(x)\bigr) = \frac{1}{2}\bigl(y - F(x)\bigr)^2$ with a scaling factor of $\frac{1}{2}$ has an extremely convenient derivative, so it will be easy to demonstrate the steps of the algorithm:

$$\frac{\partial L\bigl(y, F(x)\bigr)}{\partial F(x)} = -\bigl(y - F(x)\bigr)$$
For the reasons explained above, we set the loss to the squared error loss with a scaling factor of $\frac{1}{2}$. For the sake of simplicity, only 2 iterations will be performed, so $M = 2$.
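To make the later steps easy to reproduce, here is a minimal NumPy sketch of this setup; the array names `X` and `y` and the two helper functions are our own illustration, not part of the formal algorithm:

```python
import numpy as np

# Features: HouseAge, AveRooms, Population; target: MedHouseVal
X = np.array([[25, 4, 1392],
              [30, 5, 1565],
              [52, 4, 1310],
              [17, 6, 1705],
              [34, 5, 1063]], dtype=float)
y = np.array([0.5, 0.5, 5.0, 2.2, 2.8])

def loss(y_true, y_pred):
    """Squared error loss with the 1/2 scaling factor."""
    return 0.5 * (y_true - y_pred) ** 2

def negative_gradient(y_true, y_pred):
    """-dL/dF = y - F, i.e. the plain residual."""
    return y_true - y_pred
```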
Initialize the model
To make a prediction based on all training examples (as, for instance, at a given node of a decision tree), we could simply compute the average target value of all examples, but for demonstration purposes we will be more careful and stick to the formal algorithm.
Now, we have to find the prediction, $\gamma$, that minimizes the sum:

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) = \arg\min_{\gamma} \sum_{i=1}^{n} \frac{1}{2}\,(y_i - \gamma)^2$$

The derivative of the loss is equal to $-(y_i - \gamma)$, so we set the sum of the derivatives to 0, expand, and solve for $\gamma$:

$$\sum_{i=1}^{n} -(y_i - \gamma) = 0 \;\Rightarrow\; n\gamma = \sum_{i=1}^{n} y_i \;\Rightarrow\; \gamma = \frac{1}{n}\sum_{i=1}^{n} y_i$$
So, for this particular loss function, the initial prediction is just the average of the target values:

$$F_0(x) = \frac{0.5 + 0.5 + 5 + 2.2 + 2.8}{5} = 2.2$$
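As a quick numeric check, the initializer under this loss is simply the sample mean of the targets (a small sketch; the variable names are ours):

```python
import numpy as np

y = np.array([0.5, 0.5, 5.0, 2.2, 2.8])  # MedHouseVal

F0 = y.mean()  # the gamma minimizing the sum of 0.5 * (y_i - gamma)^2
print(F0)      # 2.2
```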
Computing the gradients
Since we already know what the partial derivative looks like, the formula for the negative gradients (pseudo-residuals) simplifies to

$$r_{i,m} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} = y_i - F_{m-1}(x_i)$$
In this case, the gradient vector looks like

$$r_{\cdot,1} = (-1.7,\; -1.7,\; 2.8,\; 0,\; 0.6)$$

Next, we fit a decision tree on the obtained gradients:
If you are wondering how the decision boundaries were obtained (i.e., why the HouseAge feature was split at those particular values), consider reading an article about How Decision Trees Split Nodes, from Loss Function Perspective. In this topic, we will get two identical trees on both iterations (meaning they share the decision boundaries, and the samples that end up in each leaf are the same). In practice, since each consecutive tree is fitted on the negative gradients (which are re-computed and change from iteration to iteration), different splitting boundaries might be selected to build the new tree.
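As an illustration of this step (not necessarily the exact tree from the figure), one way to fit a small tree on the negative gradients is with scikit-learn's `DecisionTreeRegressor`; the `max_depth=2` and `random_state=0` settings are our assumptions, and the split thresholds the library selects may differ from the ones discussed above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[25, 4, 1392],
              [30, 5, 1565],
              [52, 4, 1310],
              [17, 6, 1705],
              [34, 5, 1063]], dtype=float)
y = np.array([0.5, 0.5, 5.0, 2.2, 2.8])

F0 = np.full_like(y, y.mean())   # initial prediction, 2.2 for every sample
residuals = y - F0               # negative gradients: [-1.7, -1.7, 2.8, 0.0, 0.6]

# Fit the first weak learner on the negative gradients
tree1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
```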
Below, we go through the terminal node value calculation from the definition:

$$\gamma_{j,m} = \arg\min_{\gamma} \sum_{x_i \in R_{j,m}} L\bigl(y_i, F_{m-1}(x_i) + \gamma\bigr)$$
Let's say that one leaf node of the tree is the first leaf (denoted by $R_{1,1}$; in general, $R_{j,m}$, where $j$ is the node index and $m$ is the tree index in the procedure, so the first leaf of the first tree is $R_{1,1}$), and another is the second leaf (denoted by $R_{2,1}$). Sample #5 ends up in $R_{1,1}$, and samples #1 and #2 end up in $R_{2,1}$.
Then, calculating the update for $R_{1,1}$:

$$\gamma_{1,1} = \arg\min_{\gamma}\; L\bigl(y_5, F_0(x_5) + \gamma\bigr) = \arg\min_{\gamma}\; \frac{1}{2}\bigl(2.8 - (2.2 + \gamma)\bigr)^2 = 0.6$$
The calculation for $\gamma_{2,1}$ will be similar, with the only difference being that we are minimizing the sum of the two losses:

$$\gamma_{2,1} = \arg\min_{\gamma}\; \left[\frac{1}{2}\bigl(0.5 - (2.2 + \gamma)\bigr)^2 + \frac{1}{2}\bigl(0.5 - (2.2 + \gamma)\bigr)^2\right] = -1.7$$
Note: due to the choice of the loss function, the terminal node value, in this case, will be the average of the residuals of the samples that ended up in that leaf. Since the two samples in $R_{2,1}$ have equal residuals, its predicted value is simply $-1.7$; however, if the two (or any number of) samples in the leaf had different values, with this loss function we would take their average. The calculation for a different loss function or a different set of samples would proceed in the same way, with the selected loss and the corresponding terms substituted into the sum.
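A small sketch of this observation (the function and variable names are ours): numerically minimizing the sum of $\frac{1}{2}$-scaled squared-error losses over a leaf gives the same value as averaging the residuals in that leaf.

```python
import numpy as np
from scipy.optimize import minimize_scalar

F0 = 2.2  # prediction of the initial model for every sample

def leaf_value(y_leaf, base_pred):
    """Numerically minimize sum of 0.5 * (y - (F + gamma))^2 over gamma."""
    objective = lambda g: np.sum(0.5 * (y_leaf - (base_pred + g)) ** 2)
    return minimize_scalar(objective).x

# R_{1,1} contains sample #5 only; R_{2,1} contains samples #1 and #2
print(leaf_value(np.array([2.8]), F0))        # ~0.6
print(leaf_value(np.array([0.5, 0.5]), F0))   # ~-1.7
print(np.mean(np.array([0.5, 0.5]) - F0))     # -1.7, the average residual in R_{2,1}
```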
Updating the predictions
We update the predictions with the learning rate $\nu = 0.1$:

$$F_1(x) = F_0(x) + \nu \sum_{j} \gamma_{j,1}\,\mathbf{1}\bigl(x \in R_{j,1}\bigr),$$

which yields the updated predictions shown in the table below.
Represented graphically, the prediction after the first round is:
Now, we can predict for a single sample. Let's consider the fifth sample, which ends up in the terminal node with the value $0.6$; its prediction will be $F_1(x_5) = 2.2 + 0.1 \cdot 0.6 = 2.26$.
| HouseAge | AveRooms | Population | MedHouseVal | $F_0$ | $r_{i,1}$ | $F_1$ |
|----------|----------|------------|-------------|-------|-----------|-------|
| 25 | 4 | 1392 | 0.5 | 2.2 | -1.7 | 2.03 |
| 30 | 5 | 1565 | 0.5 | 2.2 | -1.7 | 2.03 |
| 52 | 4 | 1310 | 5 | 2.2 | 2.8 | 2.48 |
| 17 | 6 | 1705 | 2.2 | 2.2 | 0 | 2.2 |
| 34 | 5 | 1063 | 2.8 | 2.2 | 0.6 | 2.26 |
We can see that, compared to the initial prediction ($F_0 = 2.2$), $F_1$ moves closer to the true targets (MedHouseVal).
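Putting the first round together (a sketch under the same assumptions as before: a depth-2 scikit-learn tree and the learning rate $\nu = 0.1$), the printed predictions should match the $F_1$ column of the table, provided the fitted tree groups the samples into the same leaves as the tree shown above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[25, 4, 1392],
              [30, 5, 1565],
              [52, 4, 1310],
              [17, 6, 1705],
              [34, 5, 1063]], dtype=float)
y = np.array([0.5, 0.5, 5.0, 2.2, 2.8])
nu = 0.1                                   # learning rate

F0 = np.full_like(y, y.mean())             # [2.2, 2.2, 2.2, 2.2, 2.2]
tree1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y - F0)
F1 = F0 + nu * tree1.predict(X)            # F_1 = F_0 + nu * (value of the matched leaf)

print(F1)  # expected: [2.03, 2.03, 2.48, 2.2, 2.26] with the leaf structure shown above
```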
The second boosting round
Next, we can re-compute the gradients for the second round:

$$r_{i,2} = y_i - F_1(x_i), \qquad r_{\cdot,2} = (-1.53,\; -1.53,\; 2.52,\; 0,\; 0.54)$$
Now, the updated prediction after the two rounds is:

$$F_2(x) = F_1(x) + \nu \sum_{j} \gamma_{j,2}\,\mathbf{1}\bigl(x \in R_{j,2}\bigr)$$
The updated set of predictions is $F_2 = (1.877,\; 1.877,\; 2.732,\; 2.2,\; 2.314)$. Again, we see that the predictions iteratively move closer to the real targets:
| MedHouseVal | $F_0$ | $F_1$ | $F_2$ |
|-------------|-------|-------|-------|
| 0.5 | 2.2 | 2.03 | 1.877 |
| 0.5 | 2.2 | 2.03 | 1.877 |
| 5 | 2.2 | 2.48 | 2.732 |
| 2.2 | 2.2 | 2.2 | 2.2 |
| 2.8 | 2.2 | 2.26 | 2.314 |
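Finally, the whole procedure for $M = 2$ rounds can be sketched as a short loop (same assumptions as in the earlier snippets: scikit-learn trees with `max_depth=2` and learning rate 0.1); if the fitted trees recover the same leaf structure as in the figures, the printed values reproduce the $r$ and $F$ columns above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[25, 4, 1392],
              [30, 5, 1565],
              [52, 4, 1310],
              [17, 6, 1705],
              [34, 5, 1063]], dtype=float)
y = np.array([0.5, 0.5, 5.0, 2.2, 2.8])

M, nu = 2, 0.1                        # number of boosting rounds and learning rate
F = np.full_like(y, y.mean())         # F_0: the mean target, 2.2
trees = []

for m in range(1, M + 1):
    residuals = y - F                 # negative gradients under the 1/2-scaled squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    F = F + nu * tree.predict(X)      # F_m = F_{m-1} + nu * h_m(x)
    trees.append(tree)
    print(f"round {m}: r = {np.round(residuals, 3)}, F = {np.round(F, 3)}")
```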
Conclusion
As a result, you are now familiar with:
- The general workflow for gradient boosting in the regression setting;
- How to calculate the gradients and use them to update the predictions;
- How gradient boosting makes the predictions move closer to the target values over the consecutive boosting rounds.