You've already learned what a decision tree is. Imagine that you'd like to go hiking but can't figure out which route to take this time. After giving it some thought, you came up with an idea to build a decision tree that can help you predict whether the route is worth it or not. So, you jotted down the most popular hiking areas and their features as a table.
But which feature should be the root of your tree? There are plenty of algorithms for building a decision tree. In this topic, we will discuss the Gini index.
The Gini index
The Gini index can take values from 0 to 1 and reflects the probability that a randomly selected record is classified incorrectly. Sounds complicated? Let's take a look at the formula for the Gini index. It's one of those rare cases where a formula can explain everything:

$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$

where $C$ is the number of classes of the target variable and $p_i$ is the share of records belonging to class $i$ within the group we are evaluating.
The next question is, how can you use this formula to build a decision tree? Here is the algorithm (a short Python sketch of these steps follows the list):
- Pick a feature.
- Calculate the Gini index of the target variable for each category of that feature (for example, area = wood and area = mountains).
- Combine these values into the weighted Gini index of the feature, weighting each category's Gini index by its share of records.
- Repeat steps 1-3 for all features in a dataset.
- Select the feature with the lowest weighted Gini index to make the best split.
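To make these steps concrete, here is a minimal Python sketch of the two calculations involved. The names `gini` and `weighted_gini` are made up for this illustration; they don't come from any particular library.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of target values: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, feature, target):
    """Weighted Gini index of a categorical feature.

    The Gini index of each category is weighted by the share of rows
    that fall into that category.
    """
    total = len(rows)
    result = 0.0
    for category in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == category]
        result += len(subset) / total * gini(subset)
    return result
```

Building the full tree is then a matter of scoring every feature with `weighted_gini`, splitting on the one with the lowest value, and repeating the procedure on each resulting subset.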
The first split
Let's take a look at the first draft of the dataset you've collected:
| # | area | difficulty | dangerous species | go? |
|---|------|------------|-------------------|-----|
| 1 | wood | easy | no | yes |
| 2 | mountains | medium | yes | no |
| 3 | wood | hard | no | no |
| 4 | wood | easy | no | yes |
| 5 | mountains | easy | no | yes |
| 6 | wood | easy | yes | no |
| 7 | wood | medium | no | yes |
| 8 | mountains | hard | yes | no |
Now let's build a decision tree based on the Gini index. We'll start with the area feature. Here we have two categories: wood (5 records) and mountains (3 records). Next, we must calculate the Gini index of the target variable for each category of area:
- area = wood: 3 of the 5 records have go? = yes and 2 have go? = no, so $Gini_{\text{wood}} = 1 - (3/5)^2 - (2/5)^2 = 0.48$
- area = mountains: 1 of the 3 records has go? = yes and 2 have go? = no, so $Gini_{\text{mountains}} = 1 - (1/3)^2 - (2/3)^2 \approx 0.444$

Then, let's calculate the weighted Gini index for the feature area:

$$Gini_{\text{area}} = \frac{5}{8} \cdot 0.48 + \frac{3}{8} \cdot 0.444 \approx 0.467$$
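As a quick sanity check, the same numbers come out of the `gini` helper sketched above (the go? values are read straight from the table):

```python
wood_labels = ["yes", "no", "yes", "no", "yes"]   # go? for the 5 wood records
mountain_labels = ["no", "yes", "no"]             # go? for the 3 mountain records

print(gini(wood_labels))                                       # 0.48
print(gini(mountain_labels))                                   # ~0.444
print(5/8 * gini(wood_labels) + 3/8 * gini(mountain_labels))   # ~0.467
```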
Now let's carry out the same set of calculations with the other two features.
Here are the Gini indexes for the Difficulty variable:
- difficulty = easy: 3 of the 4 records are yes, so $Gini_{\text{easy}} = 1 - (3/4)^2 - (1/4)^2 = 0.375$
- difficulty = medium: 1 yes and 1 no, so $Gini_{\text{medium}} = 0.5$
- difficulty = hard: both records are no, so $Gini_{\text{hard}} = 0$

The weighted Gini index for difficulty is $\frac{4}{8} \cdot 0.375 + \frac{2}{8} \cdot 0.5 + \frac{2}{8} \cdot 0 = 0.3125$.
And here are those for Dangerous species:
- dangerous species = yes: all 3 records are no, so $Gini_{\text{yes}} = 0$
- dangerous species = no: 4 of the 5 records are yes, so $Gini_{\text{no}} = 1 - (4/5)^2 - (1/5)^2 = 0.32$

The weighted Gini index for dangerous species is $\frac{3}{8} \cdot 0 + \frac{5}{8} \cdot 0.32 = 0.2$.
And the winner is dangerous species with a weighted Gini index of 0.2, the lowest of the three features. So, let's make the first split on it: if dangerous species = yes, the answer is no, and if dangerous species = no, we keep building the tree on the remaining records.
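To confirm the comparison programmatically, here is a sketch that types in the full table and scores every feature with the `weighted_gini` helper from above (again, an illustrative helper, not a library function):

```python
rows = [
    {"area": "wood",      "difficulty": "easy",   "dangerous species": "no",  "go?": "yes"},
    {"area": "mountains", "difficulty": "medium", "dangerous species": "yes", "go?": "no"},
    {"area": "wood",      "difficulty": "hard",   "dangerous species": "no",  "go?": "no"},
    {"area": "wood",      "difficulty": "easy",   "dangerous species": "no",  "go?": "yes"},
    {"area": "mountains", "difficulty": "easy",   "dangerous species": "no",  "go?": "yes"},
    {"area": "wood",      "difficulty": "easy",   "dangerous species": "yes", "go?": "no"},
    {"area": "wood",      "difficulty": "medium", "dangerous species": "no",  "go?": "yes"},
    {"area": "mountains", "difficulty": "hard",   "dangerous species": "yes", "go?": "no"},
]

scores = {f: weighted_gini(rows, f, "go?") for f in ["area", "difficulty", "dangerous species"]}
print(scores)                        # area ~0.47, difficulty ~0.31, dangerous species 0.2
print(min(scores, key=scores.get))   # dangerous species
```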
Finishing the decision tree
After the first split, only the records with dangerous species = no are left to classify, so our data now looks like this:
| # | area | difficulty | go? |
|---|------|------------|-----|
| 1 | wood | easy | yes |
| 2 | wood | hard | no |
| 3 | wood | easy | yes |
| 4 | mountains | easy | yes |
| 5 | wood | medium | yes |
We must repeat all the steps from above with the new version of the dataset:
1) Area:

- area = wood: 3 of the 4 records are yes, so $Gini_{\text{wood}} = 1 - (3/4)^2 - (1/4)^2 = 0.375$
- area = mountains: the single record is yes, so $Gini_{\text{mountains}} = 0$

The weighted Gini index for area is $\frac{4}{5} \cdot 0.375 + \frac{1}{5} \cdot 0 = 0.3$.

2) Difficulty:

- difficulty = easy: all 3 records are yes, so $Gini_{\text{easy}} = 0$
- difficulty = medium: the single record is yes, so $Gini_{\text{medium}} = 0$
- difficulty = hard: the single record is no, so $Gini_{\text{hard}} = 0$

The weighted Gini index for difficulty is $0$.
And here we have an absolute winner: difficulty has a weighted Gini index of 0 and splits the remaining records perfectly (easy and medium routes get a yes, hard routes get a no), which means that our decision tree is finished.
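The same check works for the second iteration. Assuming the helpers from the earlier sketch, the reduced table gives difficulty a weighted Gini index of exactly 0:

```python
remaining = [
    {"area": "wood",      "difficulty": "easy",   "go?": "yes"},
    {"area": "wood",      "difficulty": "hard",   "go?": "no"},
    {"area": "wood",      "difficulty": "easy",   "go?": "yes"},
    {"area": "mountains", "difficulty": "easy",   "go?": "yes"},
    {"area": "wood",      "difficulty": "medium", "go?": "yes"},
]

for feature in ["area", "difficulty"]:
    print(feature, weighted_gini(remaining, feature, "go?"))
# area 0.3
# difficulty 0.0  (a perfect split, so the tree is complete)
```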
The example we provided here is a simplified version of the tasks you'll face while learning and working. However, it's still helpful for seeing the Gini algorithm in action and imagining how it works on larger datasets.
Conclusion
In this topic we've covered:
- The Gini index and how to calculate it;
- The Gini algorithm;
- The application of the Gini algorithm in a specific case.