
ID3 algorithm


Now you know what a decision tree is. Let's imagine a situation: you'd like to go hiking for a couple of days. But the fact is, the area you've picked for a hike is infested with snakes. Luckily, most of them aren't venomous. In your travel guide, you found a list of the most dangerous species. Since the list is too long to remember, you've decided to build a decision tree that will help you detect whether a snake is venomous or not.

A decision tree with a '?' at the root node and two leaves: a dark blue (non-venomous) and a bright blue (venomous) snake

Such a great idea! But which features should we trust when making a split? Suppose we built a decision tree by picking a random feature for each split. With such an approach, our decision tree would make inaccurate predictions on new data. To make our tree perform better, we must apply some criterion to choose the feature that produces the best split. As a result, if we encounter a venomous snake while hiking, we will be able to classify the snake correctly.

There are plenty of algorithms for building a decision tree. In this topic, we'll cover one of the most popular - the ID3 (Iterative Dichotomiser 3) algorithm.

Entropy

ID3 is based on two metrics: entropy and information gain. Let's start with entropy. It reflects the level of uncertainty that we have in our data and, for a binary variable like ours, takes values from 0 to 1. In other words, the more chaotic and non-homogeneous our data is, the closer the entropy gets to 1. Let's consider a random variable X:

X = \begin{cases} 1, & \text{if a snake is venomous}\\ 0, & \text{otherwise}\end{cases}

The formula to calculate the entropy of XX is as follows:

H(X)=\sum_{i=1}^n p(x_i)\log_2 \Big(\frac{1}{p(x_i)} \Big)

where x_i is a possible outcome of X. In our case, it can be either 0 or 1. Take a look at the graph:

Entropy plot for different sets of snakes as classes (the sets with homogenous snakes have 0 entropy, and diverse sets have higher entropy)

As you may have noticed, if H(X)=0, then we have completely homogeneous data. In this case, there is no chaos in our data.

The function reaches its maximum at the point (0.5; 1). Why? Because at this point, the probabilities of both outcomes are equal to 0.5, which means that both events are equally likely to occur. For our example, it means that the snake we'd be looking at is equally likely to be venomous or not. That's not good, right? Therefore, our goal is to reduce the uncertainty in our data. But how can we measure the reduction of uncertainty? This is where information gain comes into play.
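To make this concrete, here's a minimal Python sketch of the entropy formula (the function name entropy and the list-of-labels representation are our own choices for illustration, not part of any library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    # Each class contributes p * log2(1/p); classes with zero count simply don't appear.
    return sum((count / total) * log2(total / count) for count in Counter(labels).values())

print(entropy([0, 0, 0, 0]))  # 0.0 - perfectly homogeneous data
print(entropy([1, 1, 0, 0]))  # 1.0 - a 50/50 split, maximum uncertainty
```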

Information Gain

The formula to calculate information gain is simple:

\text{IG}(A,B)=H(A)-H(A \mid B)

Here, we subtract the entropy of A conditioned on B from the entropy of A. You can understand the result as the amount of information that B "knows" about A.

Let's say we've made a table from the travel guide that contains records about every snake species we may run into. It has the following features: species name, bright colors, spotted, hooded head, and venomous. To make the best first split, we must find the feature that gives us the maximum information gain.

We will calculate information gain for all possible pairs "venomous - ...":

  • \text{IG}(venomous, bright colors);

  • \text{IG}(venomous, spotted);

  • \text{IG}(venomous, hooded head),

and pick the largest value.
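Here is a hedged sketch of how information gain for one categorical feature could be computed, following the formula above (the helper names and the parallel-lists representation are assumptions for illustration, not a library API):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return sum((count / total) * log2(total / count) for count in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(A, B) = H(A) - H(A|B), where A is the label column and B a categorical feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    # H(A|B): entropy of each group, weighted by the group's share of the data.
    conditional = sum(len(group) / len(labels) * entropy(group) for group in groups.values())
    return entropy(labels) - conditional

# A feature that perfectly separates the labels recovers all of H(A):
print(information_gain([1, 1, 0, 0], ["hooded", "hooded", "plain", "plain"]))  # 1.0
```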

ID3

Now let's sum up everything above. To build a decision tree, start from the root and:

  1. Calculate the entropy of the target variable (venomous in our case).

  2. Calculate the conditional entropy for all possible pairs "the target variable - a categorical feature".

  3. Calculate the weighted conditional entropy of each categorical feature.
    Don't worry, we will consider this new concept below.

  4. Given the calculations above, find the information gain of each categorical feature.

  5. Select the feature that has the maximum \text{IG} to make the best split.

  6. Repeat for the next node.

This algorithm is called Iterative Dichotomiser 3. It seems it's time to put everything we've covered in this topic into practice!
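Before we turn to the worked example, here is one way the six steps could be wired together in Python. This is a minimal sketch, assuming rows are stored as dicts of categorical values and reusing the entropy and information-gain ideas from the previous snippets; it is not a reference implementation from any particular library:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return sum((count / total) * log2(total / count) for count in Counter(labels).values())

def information_gain(rows, feature, target):
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - conditional

def id3(rows, features, target):
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no features are left; predict the majority label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-5: pick the feature with the maximum information gain.
    best = max(features, key=lambda f: information_gain(rows, f, target))
    children = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        # Step 6: repeat for the next node on the matching subset.
        children[value] = id3(subset, [f for f in features if f != best], target)
    return {best: children}
```

Calling id3 on the toy dataset below, with the three categorical features and "venomous" as the target, should reproduce the tree we're about to build by hand.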

Toy dataset

Let's take a look at our snake dataset:

| name | bright colors | spotted | hooded head | venomous |
|---|---|---|---|---|
| Water Snake | No | No | No | No |
| Milk Snake | Yes | No | No | No |
| Garter Snake | No | Yes | No | No |
| Python | No | Yes | No | No |
| Tree Boa | Yes | No | No | No |
| King Cobra | No | Yes | Yes | Yes |
| Black Mamba | No | No | Yes | Yes |
| Coral Snake | Yes | Yes | No | Yes |
| Boomslang | Yes | Yes | No | Yes |
| Sea Krait | Yes | Yes | No | Yes |

Let's build a decision tree based on the ID3 algorithm. First, we must calculate the entropy of the label venomous (V). We have 10 samples in our dataset: half of them are venomous, the other half are not. Therefore,

H(V)=\sum_{i=1}^n p(x_i)\log_2 \Big(\frac{1}{p(x_i)}\Big)= 0.5 \cdot \log_2\Big(\frac{1}{0.5}\Big) + 0.5 \cdot \log_2\Big(\frac{1}{0.5}\Big)=1

Next, it's time to calculate the information gain for each feature. We'll start with bright colors (BC). In our data, we have 3 bright-colored venomous samples and 2 bright-colored non-venomous samples. So the probability that a snake is venomous given that it has bright colors is \frac{3}{5}; the probability of being non-venomous given the same condition is \frac{2}{5}. Therefore,

H(V \mid \text{BC = yes})=0.6 \cdot \log_2 \Big( \frac{1}{0.6} \Big)+0.4 \cdot \log_2 \Big( \frac{1}{0.4} \Big)=0.97

Now we need to calculate the entropy for the dark-colored snakes, of which there are 5 in total (2 venomous and 3 not).

H(V \mid \text{BC = no})=0.6 \cdot \log_2 \Big(\frac{1}{0.6} \Big)+0.4 \cdot \log_2 \Big( \frac{1}{0.4} \Big)=0.97

The next question is how to get the entropy of the whole feature. To calculate the conditional entropy for bright colors, we sum the weighted entropy values we've got above. "Weighted" means that we multiply each entropy by the probability of the corresponding value within the feature. In our case, half of the snakes are bright-colored and the other half are dark-colored, so

H(V \mid BC)=0.5 \cdot 0.97+0.5 \cdot 0.97=0.97

Hence, the information gain for the feature bright colors is as follows:

\text{IG}(V,BC)=1-0.97=0.03
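If you'd like to double-check the arithmetic, a few lines of Python reproduce these numbers (the helper h is just our shorthand for the entropy of a given set of class probabilities):

```python
from math import log2

def h(*probs):
    """Entropy from class probabilities; zero probabilities contribute nothing."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

h_v = h(0.5, 0.5)                      # H(V) = 1.0
h_bc_yes = h(3 / 5, 2 / 5)             # H(V | BC = yes), about 0.97
h_bc_no = h(2 / 5, 3 / 5)              # H(V | BC = no), about 0.97
h_bc = 0.5 * h_bc_yes + 0.5 * h_bc_no  # weighted conditional entropy, about 0.97
print(round(h_v - h_bc, 2))            # IG(V, BC) = 0.03
```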

That's not very impressive, so let's move on to the next feature.

We must perform the same set of calculations for the feature spotted (S):

| spotted | venomous | non-venomous | total |
|---|---|---|---|
| yes | 4 | 2 | 6 |
| no | 1 | 3 | 4 |

H(V \mid \text{S = yes})=\frac{4}{6} \cdot \log_2\Big(\frac{6}{4}\Big)+\frac{2}{6} \cdot \log_2\Big(\frac{6}{2}\Big)=0.92
H(V \mid \text{S = no})=\frac{1}{4} \cdot \log_2(4)+\frac{3}{4} \cdot \log_2\Big(\frac{4}{3}\Big)=0.81
H(V \mid S)=0.6 \cdot 0.92+0.4 \cdot 0.81=0.88
\text{IG}(V,S)=1-0.88=0.12

A bit better. But what about the last feature?

| hooded head | venomous | non-venomous | total |
|---|---|---|---|
| yes | 2 | 0 | 2 |
| no | 3 | 5 | 8 |

H(V \mid \text{HH = yes})=\frac{2}{2} \cdot \log_2(1)+0=0
H(V \mid \text{HH = no})=\frac{3}{8} \cdot \log_2\Big(\frac{8}{3}\Big)+\frac{5}{8} \cdot \log_2\Big(\frac{8}{5}\Big)=0.95
H(V \mid HH)=\frac{2}{10} \cdot 0 + \frac{8}{10} \cdot 0.95=0.76
\text{IG}(V,HH)=1-0.76=0.24
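The same kind of quick check confirms all three information gains at the root and shows which feature wins (again, h is just our shorthand entropy helper):

```python
from math import log2

def h(*probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

h_v = h(0.5, 0.5)  # entropy of the target variable

ig = {
    "bright colors": h_v - (5/10 * h(3/5, 2/5) + 5/10 * h(2/5, 3/5)),  # about 0.03
    "spotted":       h_v - (6/10 * h(4/6, 2/6) + 4/10 * h(1/4, 3/4)),  # about 0.12
    "hooded head":   h_v - (2/10 * h(2/2)      + 8/10 * h(3/8, 5/8)),  # about 0.24
}
print(max(ig, key=ig.get))  # hooded head
```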

Bingo! We've found the best split. All snakes with hooded heads are venomous, so the root (the first node) of our decision tree looks like this:

A decision tree with 'Hooded head?' at the root node and 'Yes' and 'No' leaves

We proceed to build the next node of our tree. We consider the remaining samples, i.e. those without a hooded head (8 samples). Again, we need to perform the same calculations, but for this smaller subset.

bright colors:

| BC | venomous | non-venomous | total |
|---|---|---|---|
| yes | 3 | 2 | 5 |
| no | 0 | 3 | 3 |

H(V \mid BC=\text{yes} \land HH=\text{no})=0.6 \cdot \log_2\Big(\frac{1}{0.6}\Big)+0.4 \cdot \log_2\Big(\frac{1}{0.4}\Big)=0.97
H(V \mid BC=\text{no} \land HH=\text{no})=0+1 \cdot \log_2(1)=0
H(V \mid BC \land HH=\text{no})=\frac{5}{8} \cdot 0.97+0=0.61
\text{IG}(V,BC)=1-0.61=0.39

spotted:

H(V \mid S=\text{yes} \land HH=\text{no})=0.6 \cdot \log_2\Big(\frac{1}{0.6}\Big)+0.4 \cdot \log_2\Big(\frac{1}{0.4}\Big)=0.97
H(V \mid S=\text{no} \land HH=\text{no})=0+1 \cdot \log_2(1)=0
H(V \mid S \land HH=\text{no})=\frac{5}{8} \cdot 0.97+0=0.61
\text{IG}(V,S)=1-0.61=0.39
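A quick sanity check in code. Note that a strict ID3 implementation would compare these conditional entropies against the entropy of this 8-sample node (about 0.95) rather than the root entropy of 1, which gives an information gain of about 0.35 for both features; the ranking, and therefore the chosen split, is the same either way:

```python
from math import log2

def h(*probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

# The 8 snakes without a hooded head: 3 venomous, 5 non-venomous.
h_node = h(3/8, 5/8)                       # about 0.95, entropy at this node

# Both features split these 8 samples into a (3 venomous, 2 non) group and a pure group of 3.
h_cond = 5/8 * h(3/5, 2/5) + 3/8 * h(3/3)  # about 0.61
print(round(h_node - h_cond, 2))           # 0.35 for both bright colors and spotted
```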

The \text{IG} values for both features are equal, so let's take bright colors for the split and draw the decision tree:

A decision tree with 'Hooded head?' and 'Bright colored?' as the splits

It seems only one feature (spotted) is left. Here is an excerpt from the dataset with the bright-colored samples without a hooded head:

| name | spotted | venomous |
|---|---|---|
| Milk Snake | No | No |
| Tree Boa | No | No |
| Coral Snake | Yes | Yes |
| Boomslang | Yes | Yes |
| Sea Krait | Yes | Yes |

It's quite obvious: we don't need to calculate anything here, so let's make the final split:

A decision tree with 'Hooded head?', 'Bright colored?' and 'Spotted?' as the splits

Our decision tree is done!
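The finished tree is small enough to be written directly as a plain Python function. This is just a hand-coded transcription of the picture above; the boolean dict keys are our own naming:

```python
def is_venomous(snake):
    """Classify a snake with the tree above. `snake` is a dict of booleans."""
    if snake["hooded_head"]:        # root split
        return True
    if not snake["bright_colors"]:  # second split on the 'No' branch
        return False
    return snake["spotted"]         # final split

# e.g. a Coral Snake: no hooded head, bright colors, spotted -> venomous
print(is_venomous({"hooded_head": False, "bright_colors": True, "spotted": True}))  # True
```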

P.S. There are about 3,000 snake species in the world, so our decision tree most likely won't make accurate predictions. So don't go hiking with a decision tree trained on 10 snake samples! However, with a much bigger dataset it's possible to build a decision tree that is quite precise in classifying species. In the following topics, you'll learn more about this.

Conclusion

In this topic we've covered the following concepts:

  • Entropy;

  • Information gain;

  • ID3 algorithm for building a decision tree.
