As you have already learned, Random Forest (RF) is one of the most popular ML algorithms based on an ensemble of decision trees.
Its essence is to train several decision trees on random subsets of the data and combine their predictions by majority voting or averaging.
In this topic, we will consider the practical application of this algorithm using the sklearn library and one of its datasets.
Since Random Forest is suitable for both classification and regression tasks, there are two classes in sklearn: RandomForestClassifier and RandomForestRegressor.
As the main example in this topic, we use RandomForestClassifier, but the most important details are also explained for the regression case.
Data preparation
First of all, we have to load the iris flower dataset from the sklearn.datasets module and split the data into train and test sets.
Let's use the train_test_split function from the model_selection module of sklearn for this.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=27)
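As a quick optional sanity check, we can look at the shapes of the resulting splits; with test_size=0.33, roughly a third of the 150 iris samples should end up in the test set:
print(X_train.shape, X_test.shape)
# >>> (100, 4) (50, 4)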
Each object of the dataset has 4 features - the sepal and petal lengths and widths of an iris. The target variable, which we are going to predict, is as follows: 0 – iris setosa, 1 – iris versicolor, 2 – iris virginica.
Here are a few rows from the dataset:
sepal length (cm)   sepal width (cm)   petal length (cm)   petal width (cm)   target
5.7                 3.8                1.7                0.3                0
5.5                 2.4                3.8                1.1                1
7.4                 2.8                6.1                1.9                2
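If you want to produce such a preview yourself, one option is to reload the data as a pandas DataFrame. This is a minimal sketch, assuming pandas is installed; the sampled rows will differ from the table above:
from sklearn.datasets import load_iris

# load_iris(as_frame=True) returns the features and the target as a pandas DataFrame
df = load_iris(as_frame=True).frame
print(df.sample(3, random_state=0))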
Our goal is to classify the type of iris flower using known data about its petal and sepal sizes.
Creating and fitting the model
Now we are ready to initialize and fit the random forest model.
First, we have to import it from the sklearn.ensemble module and create an instance of it.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=25,
                               max_features=3,
                               oob_score=True,
                               random_state=42)
model.fit(X_train, y_train)
Let's consider the main parameters:
- n_estimators defines the number of decision trees in the RF. Note that this value affects the training time: the larger the parameter value, the more time it takes to fit the model.
- max_features determines the maximum number of features that will be considered when splitting at each node. The default value of the parameter is equal to the square root of the number of features in the training dataset. You can set a certain number, as in the example above, or set another dependence on the number of features in the train set. Learn more about it in the documentation.
- oob_score=True allows calculating the error using out-of-bag (OOB) samples and inspecting the model after fitting. By default, this parameter is set to False.
- You can disable bootstrapping by specifying bootstrap=False. Each tree will then be trained on the full training set. By default, this parameter is set to True, so each individual tree "sees" a specific subset of the data.
- We fix the construction of the model for reproducibility by using the random_state parameter.
Note that the RF model also has parameters that configure each individual decision tree, for instance, criterion, max_depth, and others.
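For illustration, here are a few alternative ways to set these parameters. The values below are arbitrary examples, and these models are not used further in the topic:
from sklearn.ensemble import RandomForestClassifier

# max_features accepts a string, a fraction, or an integer
m1 = RandomForestClassifier(max_features="sqrt")  # the classifier's default
m2 = RandomForestClassifier(max_features=0.5)     # consider half of the features at each split
# parameters of a single decision tree are also available
m3 = RandomForestClassifier(max_depth=5, criterion="entropy")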
Now that we have a fitted model, we can make predictions.
Predicting
Predictions are made with the predict() method of the trained model. It returns class labels for the objects in the input.
y_pred = model.predict(X_test)
The RandomForestClassifier model in sklearn, like almost all sklearn classification models, has a predict_proba() method. It estimates the probability that a given observation belongs to each class present in the dataset.
y_proba = model.predict_proba(X_test)
print(y_proba[7])
# >>> [0.   0.04 0.96]
print(y_pred[7])
# >>> 2
In the code above, we see that the object at index 7 has the highest probability of belonging to the third class – iris virginica (label 2), so the final result in y_pred is 2.
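To see how the two methods relate, we can recover the predicted labels from the probabilities ourselves: predict() picks the class with the highest probability. Here is a short sketch using numpy; the columns of y_proba follow the order of the model's classes_ attribute:
import numpy as np

# pick the most probable class for every object and map it back to a label
manual_pred = model.classes_[np.argmax(y_proba, axis=1)]
print((manual_pred == y_pred).all())
# >>> True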
Evaluating
As you already know, Random Forest can evaluate itself using OOB scoring. The oob_score_ attribute contains the score.
print(model.oob_score_)
# >>> 0.96
By default, the OOB score metric is accuracy in RandomForestClassifier and the R2 score (coefficient of determination) in RandomForestRegressor.
If you want to calculate other metrics, you can easily get OOB predictions from the model.
RandomForestClassifier has a special attribute, oob_decision_function_, which contains predicted probabilities for every object that appeared in at least one out-of-bag set.
print(model.oob_decision_function_[6])
# >>> [0. 0.1 0.9]
The above means that the sample at index 6 got a probability of 0.1 for iris versicolor (label 1) and 0.9 for iris virginica (label 2), so its OOB prediction is iris virginica.
In RandomForestRegressor, OOB predictions are stored in oob_prediction_ attribute.
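For completeness, here is a minimal regression sketch. It uses the diabetes dataset from sklearn.datasets purely as an illustrative example; it is not part of this topic's main pipeline:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X_reg, y_reg = load_diabetes(return_X_y=True)
reg = RandomForestRegressor(n_estimators=25, oob_score=True, random_state=42)
reg.fit(X_reg, y_reg)
print(reg.oob_score_)           # OOB R2 score
print(reg.oob_prediction_[:3])  # OOB predictions for the first three objects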
If there are too few objects in the training set, some of them may never occur in any out-of-bag set. As a result, the OOB predictions for such objects will be equal to nan.
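Putting this together, here is a sketch of computing another metric, say a macro-averaged F1 score, from the OOB predictions; the mask guards against the nan rows mentioned above:
import numpy as np
from sklearn.metrics import f1_score

proba = model.oob_decision_function_
# keep only the objects that appeared in at least one OOB set
mask = ~np.isnan(proba).any(axis=1)
oob_labels = model.classes_[np.argmax(proba[mask], axis=1)]
print(f1_score(y_train[mask], oob_labels, average="macro"))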
To sum up, there are two options to evaluate the random forest:
- Using out-of-bag samples: you can either take the ready-made OOB score or get the OOB predictions and calculate any convenient metric, as in the F1 sketch above
- Using the test set: you simply make predictions with the trained model and calculate a metric, as in the sketch below this list
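As a sketch of the second option, here is the test-set evaluation using accuracy from sklearn.metrics:
from sklearn.metrics import accuracy_score

# compare the predicted labels with the true test labels
print(accuracy_score(y_test, y_pred))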
Estimators
After fitting the RF model, we have access to all the decision trees. Use the estimators_ attribute to get them.
model.estimators_ contains a list of fitted decision trees that make up our random forest. The size of this list is equal to the value of RF's n_estimators parameter. We can make predictions using one of the trees.
print(len(model.estimators_))
# >>> 25 # there are 25 trees in our RF
# make a prediction using 0th tree
tree = model.estimators_[0]
print(tree.predict(X_test[:5]))
# >>> [2. 0. 2. 2. 1.]  # a single tree inside the forest returns class indices as floats
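As a quick check of the averaging idea from the beginning of the topic, the forest's probability estimates are the mean of the individual trees' estimates; here is a short sketch:
import numpy as np

# stack the per-tree probabilities and average them across the trees
tree_probas = np.stack([t.predict_proba(X_test) for t in model.estimators_])
print(np.allclose(tree_probas.mean(axis=0), model.predict_proba(X_test)))
# >>> True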
Conclusion
- The sklearn library has a RandomForestClassifier class for classification problems and a RandomForestRegressor for regression ones.
- Both of these classes also have all the parameters of a single decision tree in sklearn.
- You can use the fit(X_train, y_train) method to train the model with the X_train and y_train data and the predict(X_test) method to get predictions for the objects in X_test.
- The oob_score=True parameter allows you to estimate the OOB score of the fitted model and get OOB predictions via the oob_decision_function_ (classifier) or oob_prediction_ (regressor) attributes.
- The estimators_ attribute of the fitted model contains a list of all the decision trees that make up the random forest.