
As you have already learned, ensemble methods in machine learning improve the accuracy of predictions by combining several models in different ways.

In this topic, we will take a closer look at a popular ML algorithm based on an ensemble of decision trees: Random Forest.

Main idea

The random forest is based on bagging, so the main steps of these algorithms are similar.

  1. We create N subsets of objects from the training set with replacement. This is called bootstrapping. N is equal to the number of models.
  2. Next, each subset is used to create a decision tree.
    The main difference between a random forest and bagging is that to create the next node in the decision tree we use not all the features of the objects, but only a random set of them.
  3. After creating all the models, the final prediction is obtained by
    • Majority voting among the models in the case of a classification task
    • Averaging the value in the case of a regression task
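
Before diving into each step, here is a minimal sketch of the whole procedure for classification, assuming NumPy and scikit-learn are available; the function names and the choice of DecisionTreeClassifier as the base model are illustrative, not a reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_simple_forest(X, y, n_trees=10, random_state=0):
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample (drawn with replacement, same size as X)
        idx = rng.integers(0, len(X), size=len(X))
        # Step 2: fit a decision tree; max_features="sqrt" makes each split
        # consider only a random subset of features, which is what
        # distinguishes a random forest from plain bagging
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 30)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_simple_forest(trees, X):
    # Step 3: majority vote over the individual tree predictions
    # (assumes integer class labels such as 0 and 1)
    votes = np.stack([tree.predict(X) for tree in trees])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```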

Now, let's look at each step in more detail.

Bootstrapping

As already noted, bootstrapping is the process of creating subsets of objects from the training set with replacement. It means the following.

During the creation of each bootstrap sample, we randomly select k objects from the training set; the selected objects are not removed from the training set, so the same object can be drawn again, both within one sample and when generating other subsets.

As a result, some objects will occur in different data subsets several times, while others will not be selected. These unselected objects are called out-of-bag (OOB) samples. Let's look at the picture.

Deriving the bootstrap and out-of-bag samples from the training set

Note that in the bootstrap sample an object can be repeated several times. Consequently, out-of-bag sets can have different sizes and even be empty.
The size of bootstrap samples is fixed and it can be larger or smaller than the number of unique objects in the training sample.

In the end, we have two subsets for each decision tree. The first one is the bootstrap sample, which consists of the data a given tree will be trained on; the second one is the out-of-bag set, which remains unseen for this particular tree.
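
To make this concrete, here is a tiny sketch, assuming NumPy, of deriving one bootstrap sample and its out-of-bag set from a training set of 10 objects (the size and the seed are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(42)
n_objects = 10

# Sample indices with replacement: some indices repeat, others never appear
bootstrap_idx = rng.integers(0, n_objects, size=n_objects)

# Out-of-bag indices are the objects that were never drawn
oob_idx = np.setdiff1d(np.arange(n_objects), bootstrap_idx)

print("bootstrap sample:", np.sort(bootstrap_idx))
print("out-of-bag set:  ", oob_idx)
```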

Building the trees

You already know the algorithm for building a decision tree. In the random forest model, only a small change is made to this algorithm.

During the creation of the next split in a node of a decision tree, not all the features are used, but only a subset of them. This process is called feature subsampling.

This method makes base models (decision trees in this case) more diverse, which is good for generalization and therefore for the accuracy of the final model. Furthermore, selecting a subset of features reduces the training time of a model.
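
In scikit-learn's RandomForestClassifier this behaviour is controlled by the max_features parameter. Here is a small sketch on synthetic data; the dataset and parameter values are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features=None: every split considers all 20 features
# (this is closer to plain bagging)
bagging_like = RandomForestClassifier(n_estimators=50, max_features=None,
                                      random_state=0).fit(X, y)

# max_features="sqrt": every split considers a random subset of about
# sqrt(20) ≈ 4 features, i.e. the feature subsampling described above
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                random_state=0).fit(X, y)
```

Smaller feature subsets make the individual trees weaker but more diverse, which is exactly the trade-off that helps the ensemble generalize.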

Final prediction

Let's look at examples for classification and regression cases.

  • Imagine that we have 5 decision trees in our random forest model that classify objects into two classes (binary classification), and we want to determine the class of one sample. The trees gave us the following predictions: 0, 1, 1, 0, 1.
    As already mentioned, the final prediction is determined by the number of votes for each class. The class that receives the most votes from all the trees is the answer. In our example, the model's answer will be 1.
  • Let's consider the case of regression using a model with the same number of trees. Now each tree predicts a numeric value instead of a class label. After training, we obtained the following tree predictions: 1.95, 1.87, 2.06, 2.03, 2.11.
    To get the final answer, we take the mean of all the trees' predictions. So, in our example, the model's answer will be 2.004.
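
A quick sanity check of these two examples in plain Python/NumPy (the lists simply repeat the tree predictions above):

```python
import numpy as np
from collections import Counter

# Classification: majority vote among the five trees
class_votes = [0, 1, 1, 0, 1]
print(Counter(class_votes).most_common(1)[0][0])  # -> 1

# Regression: mean of the five tree predictions
reg_predictions = [1.95, 1.87, 2.06, 2.03, 2.11]
print(np.mean(reg_predictions))  # ≈ 2.004
```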

Here is an animated example of making a prediction by a random forest in the case of classification (the picture source).

Random forest makes a prediction by combining the predictions of several decision trees

Out-of-bag score

One of the advantages of a random forest is its ability to evaluate itself without any test subset. The resulting value is called the out-of-bag error or out-of-bag score. The calculation is as follows.

  1. For each tree, make predictions for data from its out-of-bag sample.
  2. Then calculate the selected metric. It can be accuracy in the case of classification and mean squared error (MSE) for regression.
  3. OOB score is obtained by averaging the scores (accuracy, MSE, or other) over all the trees.
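
In scikit-learn, a random forest can report this value directly via oob_score=True; note that scikit-learn aggregates the OOB predictions per object rather than averaging per-tree scores, but the idea is the same. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# oob_score=True tells the forest to evaluate itself on the out-of-bag objects
model = RandomForestClassifier(n_estimators=100, oob_score=True,
                               random_state=0).fit(X, y)

print(model.oob_score_)  # OOB accuracy, obtained without a separate test set
```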

Advantages and disadvantages

Advantages:

  • Training a number of individual decision trees on different subsets and then combining their predictions copes very well with overfitting.
  • The algorithm can evaluate itself without using additional test data (the OOB score).

Disadvantages:

  • The random forest algorithm is computationally expensive: training requires a lot of computational power and may take a long time, especially if the dataset is big.
  • A trained random forest model is more difficult to interpret than a single decision tree because it combines the predictions of many trees (often around 100, though the number can be smaller or larger).

Conclusion

  • The random forest solves both classification and regression problems.
  • It is an algorithm based on an ensemble of decision trees.
  • When building these trees, we use such methods as bootstrapping and feature subsampling.
  • The quality of predictions can be evaluated during the training process by OOB score.