
Gini index


You've already learned what a decision tree is. Imagine that you'd like to go hiking but can't figure out which route to take this time. After giving it some thought, you came up with an idea to build a decision tree that can help you predict whether the route is worth it or not. So, you jotted down the most popular hiking areas and their features as a table.

But which feature should be the root of your tree? There are plenty of algorithms for building a decision tree. In this topic, we will discuss the Gini index.

The Gini index

The Gini index takes values from 0 to 1 and reflects the probability that a randomly selected record would be misclassified if it were labeled at random according to the class distribution. Sounds complicated? Let's take a look at the formula for the Gini index. It's one of those rare cases where a formula can explain everything:

\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2

where n is the number of classes and p_i is the share of records that belong to class i. For example, a perfectly pure subset (a single class, p_1 = 1) gives \text{Gini} = 1 - 1 = 0, while a 50/50 binary split gives 1 - (0.5^2 + 0.5^2) = 0.5.

The next question is, how can you use this formula to build a decision tree? Here is the algorithm for using the Gini index when building a decision tree (a short code sketch of these steps follows the list):

  1. Pick a feature.
  2. Calculate the Gini index of the target variable within each category of that feature.
  3. Combine these into the weighted Gini index of the feature.
  4. Repeat steps 1-3 for all features in the dataset.
  5. Select the feature with the lowest weighted Gini index to make the best split.
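To make these steps concrete, here is a minimal Python sketch of the two calculations involved. It assumes the dataset is given as plain Python lists; the names gini and weighted_gini are illustrative choices, not a standard API.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class shares."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(feature, labels):
    """Weighted Gini index of a feature: each category's Gini index,
    weighted by the share of records that fall into that category."""
    total = len(labels)
    score = 0.0
    for category in set(feature):
        # Labels of the records that belong to this category.
        subset = [label for value, label in zip(feature, labels) if value == category]
        score += len(subset) / total * gini(subset)
    return score
```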

The first split

Let's take a look at the first draft of the dataset you've collected:

 #  area        difficulty  dangerous species  go?
 1  wood        easy        no                 yes
 2  mountains   medium      yes                no
 3  wood        hard        no                 no
 4  wood        easy        no                 yes
 5  mountains   easy        no                 yes
 6  wood        easy        yes                no
 7  wood        medium      no                 yes
 8  mountains   hard        yes                no

Now let's build a decision tree based on the Gini index. We'll start with the area feature. Here we have two categories: wood (5 records) and mountains (3 records). Next, we calculate the Gini index of the target variable go? within each category of area:

  • area = wood:

P(\text{go?} = \text{yes} \mid \text{area} = \text{wood}) = \frac{3}{5}, \quad P(\text{go?} = \text{no} \mid \text{area} = \text{wood}) = \frac{2}{5}

\text{Gini Index} = 1 - \left(\left(\frac{3}{5}\right)^2 + \left(\frac{2}{5}\right)^2\right) = 0.48

  • area = mountains:

P(\text{go?} = \text{yes} \mid \text{area} = \text{mountains}) = \frac{1}{3}, \quad P(\text{go?} = \text{no} \mid \text{area} = \text{mountains}) = \frac{2}{3}

\text{Gini Index} = 1 - \left(\left(\frac{1}{3}\right)^2 + \left(\frac{2}{3}\right)^2\right) \approx 0.44

Then, let's calculate the weighted Gini index for the feature area:

\text{Weighted Gini} = \frac{5}{8} \cdot 0.48 + \frac{3}{8} \cdot 0.44 = 0.465
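We can check this with the sketch above. Note that the exact value is 5/8 * 0.48 + 3/8 * 4/9 ≈ 0.467; the hand calculation rounds 4/9 to 0.44.

```python
# The area column and the target column, read row by row from the table.
area   = ["wood", "mountains", "wood", "wood",
          "mountains", "wood", "wood", "mountains"]
target = ["yes", "no", "no", "yes", "yes", "no", "yes", "no"]

print(weighted_gini(area, target))  # ~0.4667 (0.465 with 4/9 rounded to 0.44)
```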

Now let's carry out the same set of calculations with the other two features.

Here are the Gini indexes for the Difficulty variable:

  • difficulty = easy: \text{Gini Index} = 1 - \left(\left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2\right) = 0.375
  • difficulty = medium: \text{Gini Index} = 1 - \left(\left(\frac{1}{2}\right)^2 + \left(\frac{1}{2}\right)^2\right) = 0.5
  • difficulty = hard: \text{Gini Index} = 1 - \left(\left(\frac{2}{2}\right)^2 + 0^2\right) = 0

\text{Weighted Gini} = \frac{4}{8} \cdot 0.375 + \frac{2}{8} \cdot 0.5 + \frac{2}{8} \cdot 0 = 0.3125

And here are those for Dangerous species:

  • dangerous species = yes: \text{Gini Index} = 1 - \left(\left(\frac{3}{3}\right)^2 + 0^2\right) = 0
  • dangerous species = no: \text{Gini Index} = 1 - \left(\left(\frac{4}{5}\right)^2 + \left(\frac{1}{5}\right)^2\right) = 0.32

\text{Weighted Gini} = \frac{3}{8} \cdot 0 + \frac{5}{8} \cdot 0.32 = 0.2
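Continuing the sketch, we can score all three features at once and take the smallest weighted Gini index; the two remaining columns are again read row by row from the table:

```python
difficulty = ["easy", "medium", "hard", "easy", "easy", "easy", "medium", "hard"]
dangerous  = ["no", "yes", "no", "no", "no", "yes", "no", "yes"]

features = {"area": area, "difficulty": difficulty, "dangerous species": dangerous}
scores = {name: weighted_gini(column, target) for name, column in features.items()}

print(scores)                       # area ~0.467, difficulty 0.3125, dangerous species 0.2
print(min(scores, key=scores.get))  # dangerous species
```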

And the winner is dangerous species, with the lowest weighted Gini index of 0.2. So, let's draw the first split:

[Figure: the decision tree after the first split on the dangerous species feature]

Finishing the decision tree

The dangerous species = yes branch is already pure (every such record has go? = no), so it becomes a leaf. After the first split, the remaining data (the records with dangerous species = no) looks like this:

 #  area        difficulty  go?
 1  wood        easy        yes
 2  wood        hard        no
 3  wood        easy        yes
 4  mountains   easy        yes
 5  wood        medium      yes

Now we repeat the same steps on this reduced dataset:

1) Area:

  • area = wood: \text{Gini Index} = 1 - \left(\left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2\right) = 0.375
  • area = mountains: \text{Gini Index} = 1 - \left(\left(\frac{1}{1}\right)^2 + 0^2\right) = 0

\text{Weighted Gini} = \frac{4}{5} \cdot 0.375 + \frac{1}{5} \cdot 0 = 0.3

2) Difficulty:

  • difficulty = easy: \text{Gini Index} = 1 - \left(\left(\frac{3}{3}\right)^2 + 0^2\right) = 0
  • difficulty = medium: \text{Gini Index} = 1 - \left(\left(\frac{1}{1}\right)^2 + 0^2\right) = 0
  • difficulty = hard: \text{Gini Index} = 1 - \left(0^2 + \left(\frac{1}{1}\right)^2\right) = 0

\text{Weighted Gini} = \frac{3}{5} \cdot 0 + \frac{1}{5} \cdot 0 + \frac{1}{5} \cdot 0 = 0
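The same helpers from the sketch confirm the second round on the reduced dataset:

```python
# Columns of the reduced table (the records with dangerous species = "no").
area2       = ["wood", "wood", "wood", "mountains", "wood"]
difficulty2 = ["easy", "hard", "easy", "easy", "medium"]
target2     = ["yes", "no", "yes", "yes", "yes"]

print(weighted_gini(area2, target2))        # 0.3
print(weighted_gini(difficulty2, target2))  # 0.0
```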

And here we have an absolute winner: a weighted Gini index of 0 means every difficulty category is pure, so our decision tree is finished:

[Figure: the finished decision tree after splitting on the difficulty feature]

The example provided here is a simplified version of the tasks you'll face in your studies and work. However, it's still helpful for understanding the Gini algorithm in action and imagining how it works on bigger datasets.

Conclusion

In this topic, we've covered:

  • The Gini index and how to calculate it;
  • The Gini algorithm;
  • The application of the Gini algorithm in a specific case.