Kullback-Leibler (KL) divergence is crucial in various fields, including statistics and machine learning. It measures the difference between two probability distributions and acts as a gauge of dissimilarity. This gives you the power to assess how well one distribution resembles another, making it highly valuable for model evaluation, hypothesis testing, and anomaly detection tasks.
Understanding and using KL divergence allows you to compare different models or quantify the information lost when a complex distribution is approximated by a simpler one. It helps you make informed decisions about the best models or algorithms for specific tasks.
In this topic, you will delve into the intricate details of KL divergence, studying its properties and grasping the logic behind its computations.
Entropy
Imagine a unique channel through which symbols are transmitted. These symbols could be a long sequence of independent random symbols such as letters, numbers, and special characters like @ or %. Your role is to encode each symbol using bits so you can clearly restore the original sequence.
We need to introduce some helpful notation. If there are $n$ possible symbols, you can denote them as $s_1, s_2, \dots, s_n$. Each new observation is random, meaning it can take each possible value $s_i$ with a certain probability $p_i$ for every $i = 1, \dots, n$. It's important to note that this implies $0 \le p_i \le 1$ and that $p_1 + p_2 + \dots + p_n = 1$. These probabilities define a distribution represented by $p$.
Your task is to encode each possible observation into a number of bits. Each potential encoding $C$ is a mapping that converts each symbol $s_i$ into a codeword of length $\ell_i$ bits. So, how can you decide on the best encoding $C$? Considering each new symbol is random, it can take any value $s_1, \dots, s_n$. Therefore, its number of bits (length) can be estimated in advance by considering all the possible lengths and weighting them by the probability of their occurrence:

$$L(C) = \sum_{i=1}^{n} p_i \, \ell_i$$
The most effective encoding is the one that minimizes this average length. The best part? This encoding turns out to be straightforward: $\ell_i = \log \frac{1}{p_i}$. Keep in mind that $\log$ indicates $\log_2$. To make sure each length is a positive number, just multiply $\log p_i$ by $-1$, since $\log \frac{1}{p_i} = -\log p_i$. Eventually, the average length of the encoding will be:

$$H(p) = -\sum_{i=1}^{n} p_i \log p_i$$
This number is referred to as the entropy $H(p)$ of a random symbol with distribution $p$. It's a measure of uncertainty about the upcoming new symbol. To grasp why this is so, let's examine it closely!
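Before going further, here is a minimal sketch of the idea in Python (the four-symbol distribution `p` and the `entropy` helper are made up for this example): compute the optimal code lengths $-\log_2 p_i$ and check that their probability-weighted average equals the entropy.

```python
import math

def entropy(p):
    """Entropy in bits: H(p) = -sum_i p_i * log2(p_i), skipping zero probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A hypothetical distribution over four symbols.
p = [0.5, 0.25, 0.125, 0.125]

# Optimal code lengths -log2(p_i) and their probability-weighted average.
lengths = [-math.log2(pi) for pi in p]
average_length = sum(pi * li for pi, li in zip(p, lengths))

print(lengths)         # [1.0, 2.0, 3.0, 3.0]
print(average_length)  # 1.75
print(entropy(p))      # 1.75 -- the average length of the optimal encoding
```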
Understanding entropy
If a certain symbol $s_i$ has a high probability $p_i \approx 1$, then it is quite certain to occur, so there is little uncertainty about its occurrence. Similarly, when its probability is really low, $p_i \approx 0$, there is also little uncertainty about its occurrence: it's quite certain that it won't happen. It makes a lot of sense, doesn't it?
But what happens when $p_i$ moves away from the extremes? Uncertainty starts to increase, because the symbol becomes roughly as likely to happen as not to happen. The expression $-p_i \log p_i$ is just this measure of uncertainty: it is close to zero when $p_i$ is near $0$ or $1$ and grows in between.
Then, the entropy is the sum of the uncertainties associated with the symbols through their respective probabilities: $H(p) = \sum_{i=1}^{n} \left(-p_i \log p_i\right)$. The greater the uncertainty in $p_i$, the greater its contribution to the entropy. In particular, when $n = 2$, the entropy becomes $H(p) = -p \log p - (1 - p) \log (1 - p)$ and it attains its maximum at $p = \frac{1}{2}$.
The worst scenario happens when all symbols are equally probable, $p_1 = p_2 = \dots = p_n = \frac{1}{n}$. In this case the entropy reaches its maximum:

$$H(p) = \log n$$
When some probability $p_i$ equals $1$, its symbol will occur with certainty, so the entropy attains its minimum, $H(p) = 0$.
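These properties are easy to check numerically. The sketch below reuses the hypothetical `entropy` helper from the earlier example; the distributions are made up for illustration.

```python
import math

def entropy(p):
    # Entropy in bits, skipping zero probabilities.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Maximum: uniform distribution over n symbols gives log2(n) bits.
n = 8
uniform = [1 / n] * n
print(entropy(uniform), math.log2(n))   # 3.0 3.0

# Minimum: a certain symbol gives zero entropy.
certain = [1.0, 0.0, 0.0, 0.0]
print(entropy(certain))                 # 0.0

# Binary case: -p*log2(p) - (1-p)*log2(1-p) peaks at p = 0.5.
for p in (0.1, 0.5, 0.9):
    print(p, entropy([p, 1 - p]))       # ~0.469, 1.0, ~0.469
```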
Cross-entropy
In reality, you may not know the probabilities of the various symbols. The best you can do is gather estimates of the actual probabilities. A simple method is to observe some symbols over time and then take each symbol's proportion among the observations as its estimated probability $q_i$. In doing so, the distribution $q$ becomes a model that approximates $p$. So, the average length using $q$ to approximate $p$ is:

$$H(p, q) = -\sum_{i=1}^{n} p_i \log q_i$$
This quantity is known as the cross-entropy from $p$ to $q$. Except when the two distributions are equal, it's always greater than the entropy. The higher the value, the less accurately the model reflects reality. Therefore, cross-entropy is a measure of the model's quality: the lower the cross-entropy, the more accurately the model represents reality.
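Here is a small sketch of that idea (the "true" probabilities, the observed sequence, and the helper names are all hypothetical): estimate $q$ from observed proportions, then average the code lengths $-\log_2 q_i$ under the true probabilities $p_i$.

```python
import math
from collections import Counter

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# True (usually unknown) distribution over symbols 'a', 'b', 'c'.
p = [0.5, 0.3, 0.2]

# Estimate q from an observed sequence by counting proportions.
observed = "aababcabcaabbacbaacb"
counts = Counter(observed)
q = [counts[s] / len(observed) for s in "abc"]

print(q)                      # observed proportions
print(cross_entropy(p, p))    # entropy H(p), about 1.49 bits
print(cross_entropy(p, q))    # slightly larger, since q only approximates p
```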
But how effective is the model at estimating $p$? That's where a significant drawback of cross-entropy shows up: it's rather difficult to interpret on its own. For instance, what does a given cross-entropy value mean? Is it good or bad? Since cross-entropy is never smaller than the entropy, it makes sense to compare the two values. This comparison brings us to our main topic, the KL-divergence.
KL-divergence
KL-divergence is simply the difference between cross-entropy and entropy:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}$$
It shows how many bits longer the encoding obtained using the model $q$ is compared to the optimal encoding. KL-divergence is always non-negative, reaching zero only if the model perfectly mirrors reality, that is, $q = p$.
Given that the distribution $p$ is fixed, KL-divergence differs from cross-entropy by the constant $H(p)$. Because of this, minimizing KL-divergence is equivalent to minimizing cross-entropy: if a distribution minimizes one of them, it minimizes the other as well. Therefore, the smaller the KL-divergence, the better the model. This offers you a method for choosing the best model: if $q_1$ and $q_2$ are estimates of $p$, then you should pick the one with the smallest KL-divergence.
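For instance (a minimal sketch with made-up distributions and hypothetical helper names), KL-divergence can be computed as the difference between cross-entropy and entropy and then used to pick the better of two candidate models:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p), in bits."""
    return cross_entropy(p, q) - entropy(p)

p  = [0.5, 0.3, 0.2]    # "true" distribution
q1 = [0.4, 0.4, 0.2]    # candidate model 1
q2 = [0.5, 0.25, 0.25]  # candidate model 2

d1, d2 = kl_divergence(p, q1), kl_divergence(p, q2)
print(d1, d2)                                   # both non-negative
print("best model:", "q1" if d1 < d2 else "q2") # the smaller divergence wins
```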
Because KL-divergence is not symmetric, $D_{KL}(p \,\|\, q) \neq D_{KL}(q \,\|\, p)$, it cannot be seen as a distance measure. It also doesn't satisfy the triangle inequality. You should see it as the loss of information incurred by predicting $p$ with $q$, rather than as a measure of discrepancy between the two.
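A quick, self-contained check of this asymmetry (the two distributions are arbitrary examples):

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) in bits, assuming strictly positive q where p > 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.8, 0.1, 0.1]

print(kl_divergence(p, q))  # ~0.337
print(kl_divergence(q, p))  # ~0.284 -- a different value, so not a true distance
```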
Continuous case
Up to now, we've only looked at discrete distributions. However, we can also apply these definitions to continuous distributions with density $p(x)$. Here's the entropy for such cases:

$$H(p) = -\int p(x) \log p(x) \, dx$$
The interpretation isn't as straightforward as in the discrete case; for instance, this quantity can be negative. Regardless, you can use it to gauge the uncertainty of a random symbol. If you're comfortable with expectations of functions of a random variable, consider this: if $X$ denotes a random variable with density function $p$, then the entropy equals the expected value of a transformation of $X$:

$$H(p) = \mathbb{E}\left[-\log p(X)\right]$$
This equation simplifies the process of interpreting and calculating the entropy. As expected, the cross-entropy has the same form:

$$H(p, q) = -\int p(x) \log q(x) \, dx = \mathbb{E}\left[-\log q(X)\right]$$
We define the KL-divergence in the same manner as before:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
Here's the good news: regardless of the circumstances, the KL-divergence stays non-negative and is zero only if $p = q$. Therefore, the smaller the KL-divergence, the better.
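As an illustration of the continuous case (a sketch only; the two normal densities and the crude numerical integration below are assumptions made for this example, not part of the original text), the integral can be approximated on a grid and compared with the known closed form for two Gaussians:

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def kl_continuous(p, q, lo=-20.0, hi=20.0, steps=100_000):
    """Approximate D_KL(p || q) = integral of p(x)*log(p(x)/q(x)) dx with a Riemann sum.

    Uses the natural log, so the result is in nats; divide by math.log(2) for bits.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        px, qx = p(x), q(x)
        if px > 0 and qx > 0:
            total += px * math.log(px / qx) * dx
    return total

def p(x):  # "true" density: standard normal
    return normal_pdf(x, mean=0.0, std=1.0)

def q(x):  # model density: normal with mean 1 and standard deviation 2
    return normal_pdf(x, mean=1.0, std=2.0)

# Closed form for two Gaussians: log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2*s2^2) - 1/2
exact = math.log(2.0 / 1.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
print(kl_continuous(p, q))  # ~0.443 nats
print(exact)                # ~0.443 nats
```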
Conclusion
Entropy measures the uncertainty of a distribution $p$: $H(p) = -\sum_{i=1}^{n} p_i \log p_i$. If the density function of $X$ is $p(x)$, then $H(p) = -\int p(x) \log p(x) \, dx$.
Cross-entropy measures the quality of a model $q$ estimating the distribution $p$. The better the model describes reality, the lower the cross-entropy: $H(p, q) = -\sum_{i=1}^{n} p_i \log q_i$ in the discrete case and $H(p, q) = -\int p(x) \log q(x) \, dx$ in the continuous case.
KL-divergence shows how well a distribution $q$ estimates a particular distribution $p$: $D_{KL}(p \,\|\, q) = H(p, q) - H(p)$.
If $q_1$ and $q_2$ are estimates of $p$, then the model with the smaller KL-divergence is the better one.