You've already learned what a decision tree is. Imagine that you'd like to go hiking but can't figure out which route to take this time. After giving it some thought, you came up with an idea to build a decision tree that can help you predict whether the route is worth it or not. So, you jotted down the most popular hiking areas and their features as a table.
But which feature should be the root of your tree? There are plenty of algorithms for building a decision tree. In this topic, we will discuss the Gini index.
The Gini index
The Gini index can take values from 0 to 1 and reflects the probability that a randomly selected record is classified incorrectly. Sounds complicated? Let's take a look at the formula for the Gini index. It's one of those rare cases where a formula can explain everything:

$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$

where $C$ is the number of classes of the target variable and $p_i$ is the share of records belonging to class $i$ within the group we are evaluating.
The next question is, how can you use this formula to build a decision tree? Here is the algorithm (a short Python sketch of these steps follows the list):
- Pick a feature.
- Calculate the Gini index of the target variable for each category of that feature (for example, area = wood and area = mountains).
- Combine these values into the weighted Gini index of the feature, weighting each category's Gini index by its share of records.
- Repeat steps 1-3 for all features in a dataset.
- Select the feature with the lowest weighted Gini index to make the best split.
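To make these steps concrete, here is a minimal Python sketch of the two calculations involved. The names `gini` and `weighted_gini` are made up for this illustration; they don't come from any particular library.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of target values: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, feature, target):
    """Weighted Gini index of a categorical feature.

    The Gini index of each category is weighted by the share of rows
    that fall into that category.
    """
    total = len(rows)
    result = 0.0
    for category in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == category]
        result += len(subset) / total * gini(subset)
    return result
```

Building the full tree is then a matter of scoring every feature with `weighted_gini`, splitting on the one with the lowest value, and repeating the procedure on each resulting subset.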
The first split
Let's take a look at the first draft of the dataset you've collected:
| # | area | difficulty | dangerous species | go? |
|---|------|------------|-------------------|-----|
| 1 | wood | easy | no | yes |
| 2 | mountains | medium | yes | no |
| 3 | wood | hard | no | no |
| 4 | wood | easy | no | yes |
| 5 | mountains | easy | no | yes |
| 6 | wood | easy | yes | no |
| 7 | wood | medium | no | yes |
| 8 | mountains | hard | yes | no |
Now let's build a decision tree based on the Gini index. We'll start with the area feature. Here we have two categories: wood (5 records) and mountains (3 records). Next, we must calculate the Gini index of the target variable for each category of area:
- area = wood: 3 of the 5 records have go? = yes and 2 have go? = no, so $Gini_{\text{wood}} = 1 - (3/5)^2 - (2/5)^2 = 0.48$
- area = mountains: 1 of the 3 records has go? = yes and 2 have go? = no, so $Gini_{\text{mountains}} = 1 - (1/3)^2 - (2/3)^2 \approx 0.444$

Then, let's calculate the weighted Gini index for the feature area:

$$Gini_{\text{area}} = \frac{5}{8} \cdot 0.48 + \frac{3}{8} \cdot 0.444 \approx 0.467$$
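As a quick sanity check, the same numbers come out of the `gini` helper sketched above (the go? values are read straight from the table):

```python
wood_labels = ["yes", "no", "yes", "no", "yes"]   # go? for the 5 wood records
mountain_labels = ["no", "yes", "no"]             # go? for the 3 mountain records

print(gini(wood_labels))                                       # 0.48
print(gini(mountain_labels))                                   # ~0.444
print(5/8 * gini(wood_labels) + 3/8 * gini(mountain_labels))   # ~0.467
```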
Now let's carry out the same set of calculations with the other two features.
Here are the Gini indexes for the Difficulty variable:
- difficulty = easy: 3 of the 4 records are yes, so $Gini_{\text{easy}} = 1 - (3/4)^2 - (1/4)^2 = 0.375$
- difficulty = medium: 1 yes and 1 no, so $Gini_{\text{medium}} = 0.5$
- difficulty = hard: both records are no, so $Gini_{\text{hard}} = 0$

The weighted Gini index for difficulty is $\frac{4}{8} \cdot 0.375 + \frac{2}{8} \cdot 0.5 + \frac{2}{8} \cdot 0 = 0.3125$.
And here are those for Dangerous species:
- dangerous species = yes: all 3 records are no, so $Gini_{\text{yes}} = 0$
- dangerous species = no: 4 of the 5 records are yes, so $Gini_{\text{no}} = 1 - (4/5)^2 - (1/5)^2 = 0.32$

The weighted Gini index for dangerous species is $\frac{3}{8} \cdot 0 + \frac{5}{8} \cdot 0.32 = 0.2$.
And the winner is dangerous species with a weighted Gini index of 0.2, the lowest of the three features. So, let's make the first split on it: if dangerous species = yes, the answer is no, and if dangerous species = no, we keep building the tree on the remaining records.
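To confirm the comparison programmatically, here is a sketch that types in the full table and scores every feature with the `weighted_gini` helper from above (again, an illustrative helper, not a library function):

```python
rows = [
    {"area": "wood",      "difficulty": "easy",   "dangerous species": "no",  "go?": "yes"},
    {"area": "mountains", "difficulty": "medium", "dangerous species": "yes", "go?": "no"},
    {"area": "wood",      "difficulty": "hard",   "dangerous species": "no",  "go?": "no"},
    {"area": "wood",      "difficulty": "easy",   "dangerous species": "no",  "go?": "yes"},
    {"area": "mountains", "difficulty": "easy",   "dangerous species": "no",  "go?": "yes"},
    {"area": "wood",      "difficulty": "easy",   "dangerous species": "yes", "go?": "no"},
    {"area": "wood",      "difficulty": "medium", "dangerous species": "no",  "go?": "yes"},
    {"area": "mountains", "difficulty": "hard",   "dangerous species": "yes", "go?": "no"},
]

scores = {f: weighted_gini(rows, f, "go?") for f in ["area", "difficulty", "dangerous species"]}
print(scores)                        # area ~0.47, difficulty ~0.31, dangerous species 0.2
print(min(scores, key=scores.get))   # dangerous species
```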
Finishing the decision tree
After the first split, only the records with dangerous species = no are left to classify, so our data now looks like this:
| # | area | difficulty | go? |
|---|------|------------|-----|
| 1 | wood | easy | yes |
| 2 | wood | hard | no |
| 3 | wood | easy | yes |
| 4 | mountains | easy | yes |
| 5 | wood | medium | yes |
We must repeat all the steps from above with the new version of the dataset:
1) Area:

- area = wood: 3 of the 4 records are yes, so $Gini_{\text{wood}} = 1 - (3/4)^2 - (1/4)^2 = 0.375$
- area = mountains: the single record is yes, so $Gini_{\text{mountains}} = 0$

The weighted Gini index for area is $\frac{4}{5} \cdot 0.375 + \frac{1}{5} \cdot 0 = 0.3$.

2) Difficulty:

- difficulty = easy: all 3 records are yes, so $Gini_{\text{easy}} = 0$
- difficulty = medium: the single record is yes, so $Gini_{\text{medium}} = 0$
- difficulty = hard: the single record is no, so $Gini_{\text{hard}} = 0$

The weighted Gini index for difficulty is $0$.
And here we have an absolute winner: difficulty has a weighted Gini index of 0 and splits the remaining records perfectly (easy and medium routes get a yes, hard routes get a no), which means that our decision tree is finished.
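The same check works for the second iteration. Assuming the helpers from the earlier sketch, the reduced table gives difficulty a weighted Gini index of exactly 0:

```python
remaining = [
    {"area": "wood",      "difficulty": "easy",   "go?": "yes"},
    {"area": "wood",      "difficulty": "hard",   "go?": "no"},
    {"area": "wood",      "difficulty": "easy",   "go?": "yes"},
    {"area": "mountains", "difficulty": "easy",   "go?": "yes"},
    {"area": "wood",      "difficulty": "medium", "go?": "yes"},
]

for feature in ["area", "difficulty"]:
    print(feature, weighted_gini(remaining, feature, "go?"))
# area 0.3
# difficulty 0.0  (a perfect split, so the tree is complete)
```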
The example we provided here is a simplified version of the tasks you'll face while learning and working. However, it's still helpful for seeing the Gini algorithm in action and imagining how it works on larger datasets.
Conclusion
In this topic we've covered:
- The Gini index and how to calculate it;
- The Gini algorithm;
- The application of the Gini algorithm in a specific case.