
Decision Trees


This topic focuses on a popular machine learning algorithm: decision trees. A decision tree resembles the human decision-making process, which makes the algorithm easy to understand. It is a versatile tool for both regression and classification problems.

Imagine a situation where you have to choose between going for a walk with your friends and staying at home in the evening.

Solve the Netflix dilemma with a binary decision tree

In terms of machine learning, we can say that this is a problem of binary classification of entertainment with the features "time" and "new series on Netflix".

In other words, these features can be positioned along the X and Y axes, giving a two-dimensional plane. If we divide the plane with lines, we will see that each subdomain of the plane represents some class of entertainment.

Cartesian plot of the solution to the Netflix dilemma problem
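To make this concrete, here is a minimal sketch of the same setup in Python, assuming scikit-learn is available; the toy data and feature values are invented for illustration.

```python
# A minimal sketch of the Netflix dilemma as a two-feature classification problem.
# scikit-learn is assumed to be installed; the toy data below is made up.
from sklearn.tree import DecisionTreeClassifier

# features: [free hours in the evening, unwatched episodes of the new series]
X = [[1, 0], [2, 5], [4, 1], [3, 8], [5, 0], [0, 9]]
y = ["walk", "netflix", "walk", "netflix", "walk", "netflix"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[3, 2]]))  # which subdomain of the plane does this evening fall into?
```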

A little more imagination is required to represent multi-feature spaces, but first things first.

The main idea

Before we look at how these trees are built, let's start with the terminology.

  • A Node. A tree node is a part of a tree containing a condition. It has zero or more child nodes.
  • A Root. The first tree node.
  • A Leaf or a terminal node. A node without child nodes (an outcome).
  • A Condition or an internal node. The condition in a node indicates a possible outcome or the next action for the sample.
  • A Stopping criterion. A particular condition that determines whether to split a given node or not. If splitting is stopped, the node becomes a leaf.

The components of a decision tree

In these terms, we can write the main algorithm as follows. At each node, we want to find a split into two parts, i.e., two leaves, that is best according to a predetermined quality loss function. We will clarify what the loss function is below; still, it is intuitively clear that at each step we want to ask a question whose answer makes our solution more accurate. Depending on the condition, a sample goes to the right node or to the left node.

It is important to note that splitting at a node does not always create two leaves. In the case of categorical features, the node can be split into N leaves.

Further, each leaf becomes the next node. We check this node against the predetermined stopping criterion. If it is fulfilled, the node becomes a leaf, the final point; otherwise, the action is repeated recursively. During prediction, the final leaf (the terminal node) determines the answer for every sample that falls into it. In other words, all samples from one leaf receive the same class or value: in regression, the leaf produces a real number (typically the average of its training values); in classification, the leaf produces a class or the probability of belonging to that class.
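The recursive procedure described above can be sketched in plain Python. This is a simplified illustration rather than a library implementation: the function names are invented, and the Gini impurity (explained in the next section) plays the role of the quality loss function.

```python
# A simplified sketch of recursive tree building for classification.
# All names are illustrative; real implementations are more involved.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(samples, labels):
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(samples[0])):
        for t in {s[f] for s in samples}:
            left = [y for s, y in zip(samples, labels) if s[f] <= t]
            right = [y for s, y in zip(samples, labels) if s[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_node(samples, labels, depth=0, max_depth=3):
    split = best_split(samples, labels)
    # Stopping criterion: depth limit, a pure node, or no useful split -> make a leaf.
    if depth >= max_depth or len(set(labels)) == 1 or split is None:
        return {"leaf": max(set(labels), key=labels.count)}  # majority class
    _, f, t = split
    left = [(s, y) for s, y in zip(samples, labels) if s[f] <= t]
    right = [(s, y) for s, y in zip(samples, labels) if s[f] > t]
    return {
        "condition": (f, t),
        "left": build_node([s for s, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_node([s for s, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

print(build_node([[1, 0], [2, 5], [4, 1], [3, 8]], ["walk", "netflix", "walk", "netflix"]))
```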

😊 This architecture allows you to build an interpretable tool that can capture nonlinear dependencies in the data. Another valuable advantage of trees is the ability to work with missing values as well as with non-normalized data.
🥲 On the other hand, the algorithm is discrete, which makes it difficult to optimize. An overly complex structure makes trees more prone to overfitting.

Quality loss function

Despite the seeming simplicity, the main problem with trees is how to choose the next split. Intuitively, it is clear that each time we split a node, we want different objects to go to different leaves and similar objects to stay together.

Good and bad splitting criteria for decision trees

Let's consider the classification problem first. The example above illustrates the mathematical description of the Shannon entropy of the state of a system. In simple terms, entropy is a measure of the chaos in a system: more chaos means greater entropy. The reduction in entropy after a split is called the Information Gain. If there are only cats in a leaf, then the entropy of this leaf equals zero. This is our goal.
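As a quick numeric illustration (the counts here are made up), entropy and information gain for a split that separates cats from dogs perfectly can be computed as follows.

```python
# Entropy of a node given class counts, and the information gain of a split.
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

parent = [5, 5]                # 5 cats and 5 dogs mixed together: maximal chaos
left, right = [5, 0], [0, 5]   # each child leaf is pure: zero entropy

gain = entropy(parent) - (sum(left) / 10 * entropy(left) + sum(right) / 10 * entropy(right))
print(entropy(parent), entropy(left), gain)  # 1.0 0.0 1.0
```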

Another criterion may be familiar to those who came to ML from economics: the Gini Impurity. It can be used to measure the informativeness of a split in classification problems. This criterion is based on the probability of a random sample from the data being assigned to the wrong class.
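A short sketch of the same idea with the Gini impurity; the class counts are again invented.

```python
# Gini impurity of a node given class counts: the chance that a randomly drawn
# sample would be mislabeled if labels were assigned according to class proportions.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5 -> maximally mixed node
print(gini([10, 0]))  # 0.0 -> pure node, nothing left to gain from splitting
```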

For the regression problem, the criterion is simpler: the reduction in variance. The values that fall into one leaf are averaged (for example, by calculating the mean or the median). The smaller the variance of the target in a leaf, the more accurate the prediction.
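A small sketch of the variance-reduction criterion; the numbers are made up.

```python
# Variance of the target values in a node, and the reduction achieved by a split.
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

parent = [1, 2, 9, 10]
left, right = [1, 2], [9, 10]   # values in each leaf will be averaged for prediction
reduction = variance(parent) - (len(left) * variance(left) + len(right) * variance(right)) / len(parent)
print(variance(parent), reduction)  # 16.25 16.0
```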

Task                       Criterion
Classification 🐶 | 🐱     Information Gain, Gini Impurity
Regression 📈              Reduction in variance

Stopping criteria and pruning

Sooner or later, the splitting of the tree must be stopped. Otherwise, only one sample will fall into each leaf, and the tree will be overfitted. This means that the tree adapts to the training set and generalizes poorly to new data.

There are two ways to find a balance between uninformative trees and overfitted trees: stopping criteria and pruning.

Pruning is the removal of nodes from a tree to reduce complexity and increase generalizability. A common approach is to grow the full tree and then remove the nodes that do not have a significant effect on the result.
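As an example, scikit-learn exposes post-pruning through cost-complexity pruning; the sketch below assumes scikit-learn is installed, and the value of ccp_alpha is arbitrary.

```python
# Post-pruning sketch: a larger ccp_alpha removes more nodes from the fully grown tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# The pruned tree should have no more leaves than the fully grown one.
print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
```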

A stopping criterion is a condition that we check at each node to decide whether to stop splitting or not. It can be very diverse: the maximum depth of the tree, the minimum number of samples in a leaf, an improvement in accuracy by a certain percentage, and many others.
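In scikit-learn, for instance, several common stopping criteria are available as hyperparameters; the concrete numbers below are arbitrary and only illustrate the idea.

```python
# Stopping criteria expressed as hyperparameters of a decision tree.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # maximum depth of the tree
    min_samples_leaf=10,         # minimum number of samples allowed in a leaf
    min_impurity_decrease=0.01,  # split only if it improves purity by at least this much
)
```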

Conclusion

  • A decision tree consists of a root, nodes, and leaves.
  • At each node, a decision is made based on a certain condition, with an eye to the stopping criterion.
  • The condition for each split is selected using a quality loss function.
  • Pruning and stopping criteria help to avoid overfitting.