
KL-divergence


Kullback-Leibler (KL) divergence is crucial in various fields, including statistics and machine learning. It measures the difference between two probability distributions and acts as a gauge of dissimilarity. This gives you the power to assess how well one distribution resembles another, making it highly valuable for model evaluation, hypothesis testing, and anomaly detection tasks.

Understanding and using KL divergence allows you to compare different models or quantify the information loss when a complex distribution is made simpler. It helps you make informed decisions about the best models or algorithms for specific tasks.

In this topic, you will delve into the intricate details of KL divergence, studying its properties and grasping the logic behind its computations.

Entropy

Imagine a communication channel through which symbols are transmitted. These symbols could be a long sequence of independent random symbols such as letters, numbers, and special characters like @ or %. Your role is to encode each symbol using bits so that you can unambiguously restore the original sequence.

We need to introduce some helpful notation. If there are k possible symbols, you can denote them as x_1, \dots, x_k. Each new observation is random, meaning that it can take each possible value x_i with a certain probability p_i for every i \in \{ 1, \dots, k\}. It's important to note that this implies 0 < p_i < 1 and that \sum \limits_{i=1}^k p_i = 1. These probabilities define a distribution represented by p.

Your task is to encode each possible observation x_i into a number of bits. Each potential encoding is a mapping C that converts each symbol x_i into a number of bits C(x_i). So, how can you decide on the best encoding C? Considering each new symbol is random, it can take any value of x_1, \dots, x_k. Therefore, its number of bits (length) can be estimated in advance by considering all the possible lengths and weighting them by the probability of their occurrence:

\sum_{i=1}^k C(x_i) p_i

The most effective encoding C is the one that minimizes this average length. The best part? This encoding turns out to be straightforward: C(x_i) = - \log (p_i). Keep in mind that 0 < p_i < 1 implies \log (p_i) < 0. To make sure the length is a positive number, just multiply by -1. With this encoding, the average length will be:

H(p) = - \sum_{i=1}^k p_i \log{p_i}

This number H(p) is referred to as the entropy of the random symbol with the p distribution. It's a measure of uncertainty about the upcoming new symbol. To grasp why this is so, let's examine it closely!
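
Before digging in, here is a quick numerical check that the entropy really is the average length under the optimal encoding C(x_i) = -\log(p_i). This is a minimal Python sketch; the particular distribution and the use of base-2 logarithms (so that lengths are measured in bits) are assumptions made only for the example.

```python
import numpy as np

# Example distribution over k = 4 symbols (made up for illustration).
p = np.array([0.5, 0.25, 0.125, 0.125])

# Optimal code lengths in bits: C(x_i) = -log2(p_i).
lengths = -np.log2(p)

# Average code length, weighting each length by the probability of its symbol.
average_length = np.sum(p * lengths)

# Entropy computed directly from its definition.
entropy = -np.sum(p * np.log2(p))

print(lengths)         # [1. 2. 3. 3.]
print(average_length)  # 1.75
print(entropy)         # 1.75 -- identical to the average length
```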

Understanding entropy

If a certain symbol x_i has a high probability p_i, then it is quite certain to occur. Therefore there is little uncertainty about its occurrence. Similarly, when its probability is really low, then there is also little uncertainty about its occurrence: it's quite certain that it won't happen. It makes a lot of sense, doesn't it?

But what happens when p_i moves away from the extremes? Uncertainty starts to increase, because it becomes less and less clear whether the symbol will occur or not. The expression - p_i \log{p_i} is exactly this measure of uncertainty:

[Figure: the uncertainty term - p_i \log{p_i} as a function of p_i]
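
The shape of this curve is easy to probe numerically. In this minimal sketch (the probability values are arbitrary), the term is small near 0 and near 1 and noticeably larger in between:

```python
import numpy as np

# Per-symbol uncertainty -p * log2(p) for a few probabilities.
for p_i in [0.01, 0.1, 0.3, 0.5, 0.9, 0.99]:
    print(p_i, -p_i * np.log2(p_i))
# The term is close to 0 at both extremes and larger in the middle.
```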

Then, the entropy is the sum of the uncertainty terms associated with the individual symbols: H(p) = - \sum \limits_{i=1}^k p_i \log{p_i}. The greater the uncertainty in p_i, the greater its contribution to the entropy. In particular, when k = 2, the entropy becomes H(p) = - p \log{p} - (1-p) \log\left(1-p \right) and it attains its maximum at 1/2:

[Figure: entropy with just 2 values, H(p) = - p \log{p} - (1-p) \log(1-p), maximal at p = 1/2]
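
You can verify this maximum numerically. A minimal sketch (the sample probabilities are arbitrary) showing that the two-symbol entropy peaks at one bit when p = 1/2:

```python
import numpy as np

def binary_entropy(p):
    """Entropy of a distribution over two symbols with probabilities p and 1 - p."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in [0.05, 0.25, 0.5, 0.75, 0.95]:
    print(p, binary_entropy(p))
# The values rise towards 1 bit at p = 0.5 and fall off symmetrically on either side.
```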

The worst scenario happens when all symbols are equally probable, p_i = 1/k. In this case the entropy reaches its maximum:

\begin{align*}
H(p) &= - \sum_{i=1}^k p_i \log{p_i} \\
&= - \sum_{i=1}^k \frac{1}{k} \log{\frac{1}{k}} \\
&= \frac{1}{k} \sum_{i=1}^k \log{k} \\
&= \frac{1}{k} \log{k^k} \\
&= \frac{k}{k} \log{k} \\
&= \log{k}
\end{align*}

When some probability p_i is 1, its symbol will occur with certainty, so the entropy attains its minimum H(p) = 0.
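
Both extremes are easy to check numerically. In this minimal sketch, the value k = 8 and the nearly degenerate distribution are assumptions made only for the example:

```python
import numpy as np

k = 8

# Uniform distribution: every symbol has probability 1/k.
uniform = np.full(k, 1 / k)
print(-np.sum(uniform * np.log2(uniform)))  # 3.0, i.e. log2(8)

# Nearly degenerate distribution: one symbol is (almost) certain.
# A probability of exactly 1 would produce 0 * log(0) terms, which are
# treated as 0 by convention; tiny probabilities avoid that edge case here.
almost_certain = np.array([1 - 7e-9] + [1e-9] * (k - 1))
print(-np.sum(almost_certain * np.log2(almost_certain)))  # very close to 0
```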

Cross-entropy

In reality, you may not know the probabilities p of the various symbols. The best you can do is gather estimates q_1, \dots, q_k of the actual probabilities. A simple method is to observe some symbols over time and then calculate each symbol's proportion of occurrences. In doing so, the q distribution becomes a model that approximates p. So, the average length when using q to approximate p is:

H(p | q) = - \sum_{i=1}^k p_i \log{q_i}

This figure is known as the cross-entropy from q to p. Except when the distributions are equal, it's always greater than the entropy. The higher the value, the less accurately the model reflects reality. Therefore, cross-entropy is a measure of the model's quality: the lower the cross-entropy, the more accurately the model represents reality.
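
As an illustration, here is a minimal sketch; the true distribution p and the estimated model q are both made up for the example, and base-2 logarithms are again assumed:

```python
import numpy as np

# True distribution p and an estimated model q (both assumed for the example).
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.4, 0.3, 0.2, 0.1])

entropy = -np.sum(p * np.log2(p))
cross_entropy = -np.sum(p * np.log2(q))

print(entropy)        # 1.75
print(cross_entropy)  # roughly 1.80 -- larger than the entropy, since q != p
```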

But how effective is the model q in estimating p? That's a significant drawback of cross-entropy: it's rather difficult to interpret on its own. For instance, what does a cross-entropy of 5 mean? Is it good or bad? Since cross-entropy is never smaller than entropy, it makes sense to compare the two values. This comparison brings us to our main topic, the KL-divergence.

KL-divergence

KL-divergence is simply the difference between cross-entropy and entropy:

\begin{align*}
D(p|q) &= H(p|q) - H(p) \\
&= - \sum_{i=1}^k p_i \log(q_i) - \left( - \sum_{i=1}^k p_i \log(p_i) \right) \\
&= - \sum_{i=1}^k p_i \left[ \log(q_i) - \log(p_i) \right] \\
&= -\sum_{i=1}^k p_i \log\left(\frac{q_i}{p_i}\right)
\end{align*}

It shows how many bits longer the encoding obtained using the model q is than the optimal encoding. KL-divergence is always non-negative, reaching zero only if the model perfectly mirrors reality, that is, p = q.
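
Continuing the toy example from above, here is a minimal sketch of the computation. The distributions are made up; scipy.stats.entropy, which returns the KL-divergence when given two distributions, is used only as a cross-check:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes the KL-divergence

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.4, 0.3, 0.2, 0.1])

cross_entropy = -np.sum(p * np.log2(q))
h_p = -np.sum(p * np.log2(p))

kl = cross_entropy - h_p
print(kl)                     # a small positive number: the extra bits paid for using q
print(entropy(p, q, base=2))  # the same value, computed directly by scipy
```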

Given that the distribution p is fixed, KL-divergence differs from cross-entropy by the constant H(p). Because of this, minimizing KL-divergence is equivalent to minimizing cross-entropy. In simpler words, if the distribution q minimizes one of them, it brings down the other as well. Therefore, the smaller the KL-divergence is, the better the model proves to be. This offers you a method for choosing the best model: if q and q' are estimates of p, then you should pick the one with the smallest KL-divergence.
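
For instance, here is a hypothetical comparison of two candidate models (all three distributions below are invented for the example):

```python
import numpy as np

def kl_divergence(p, q):
    """KL-divergence D(p|q) in bits for discrete distributions."""
    return np.sum(p * np.log2(p / q))

p     = np.array([0.5, 0.25, 0.125, 0.125])   # true distribution
q_one = np.array([0.4, 0.3, 0.2, 0.1])        # first candidate model
q_two = np.array([0.45, 0.25, 0.15, 0.15])    # second candidate model

print(kl_divergence(p, q_one))
print(kl_divergence(p, q_two))
# The candidate with the smaller KL-divergence is the better estimate of p.
```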

Due to KL-divergence not being symmetric, D(p|q) \neq D(q|p), it cannot be seen as a distance measure. It also doesn't adhere to the triangle inequality. You should see it as the loss of information incurred by predicting p with q, rather than as a distance between the two distributions.
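
The asymmetry is easy to observe with a pair of made-up distributions:

```python
import numpy as np

def kl_divergence(p, q):
    return np.sum(p * np.log2(p / q))

# Two arbitrary distributions over three symbols (assumed for the example).
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.5, 0.25, 0.25])

print(kl_divergence(p, q))  # D(p|q), roughly 0.28 bits
print(kl_divergence(q, p))  # D(q|p), roughly 0.32 bits -- a different value
```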

Continuous case

Up to now, we've only looked at discrete distributions. However, we can also apply these definitions to continuous distributions with density p. Here's the entropy for such cases:

H(p) = - \int_{-\infty}^{\infty} p(x) \log p(x) \ dx

The interpretation isn't as straightforward as for the discrete case; for instance, it can yield a negative value. Regardless, you can use it to gauge the uncertainty of a random symbol. If you're comfortable with expectations of functions of a random variable, consider this: if X denotes a random variable with a density function of p, then the entropy equals the expected value of a transformation of X:

H(p) = -E \left[ \log p(X) \right]

This equation simplifies the process of interpreting and calculating entropy. As expected, the cross-entropy remains the same:

H(p | q) = - \int_{-\infty}^{\infty} p(x) \log q(x) \ dx

We define the KL-divergence in the same manner as before:

\begin{align*}
D(p|q) &= H(p|q) - H(p) \\
&= - \int_{-\infty}^{\infty} p(x) \log \frac{q(x)}{p(x)} \ dx
\end{align*}

Here's the good news: regardless of the circumstances, the KL-divergence stays non-negative and is zero only if p=qp=q. Therefore, the smaller the KL-divergence, the better.
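
To make the continuous case concrete, here is a sketch using two normal distributions; the particular means and standard deviations are assumptions made up for the example. It estimates the entropy through the expectation form H(p) = -E[\log p(X)] with Monte Carlo sampling and compares the sampled KL-divergence against the known closed form for two Gaussians (natural logarithms are used, so the results are in nats rather than bits):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two normal densities p and q (parameters chosen arbitrarily for illustration).
p = norm(loc=0.0, scale=1.0)
q = norm(loc=1.0, scale=2.0)

# Sample from p and use the expectation forms of the definitions.
x = rng.normal(loc=0.0, scale=1.0, size=200_000)

entropy_mc = -np.mean(p.logpdf(x))               # H(p) = -E[log p(X)]
kl_mc = -np.mean(q.logpdf(x) - p.logpdf(x))      # D(p|q) = -E[log(q(X)/p(X))]

# Closed-form values for normal distributions (in nats).
entropy_exact = 0.5 * np.log(2 * np.pi * np.e * 1.0 ** 2)
kl_exact = np.log(2.0 / 1.0) + (1.0 ** 2 + (0.0 - 1.0) ** 2) / (2 * 2.0 ** 2) - 0.5

print(entropy_mc, entropy_exact)  # both close to 1.419
print(kl_mc, kl_exact)            # both close to 0.443
```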

Conclusion

  • Entropy measures the uncertainty about a distribution: H(p) = \begin{cases} - \sum_{i=1}^k p_i \log {p_i} & \text{discrete case} \\ - \int_{-\infty}^{\infty} p(x) \log p(x) \ dx & \text{continuous case} \end{cases}. If the density function of X is p, then H(p) = -E \left[ \log p(X) \right].

  • Cross-entropy measures the quality of the model q estimating distribution p. The better the model describes reality, the lower the cross-entropy: H(p | q) = \begin{cases} - \sum_{i=1}^k p_i \log{q_i} & \text{discrete case} \\ - \int_{-\infty}^{\infty} p(x) \log q(x) \ dx & \text{continuous case} \end{cases}.

  • KL-divergence shows how well a distribution q estimates a particular distribution p: D(p | q) = H(p|q) - H(p) = \begin{cases} -\sum_{i=1}^k p_i \log \frac{q_i}{p_i} & \text{discrete case} \\ - \int_{-\infty}^{\infty} p(x) \log \frac{q(x)}{p(x)} \ dx & \text{continuous case} \end{cases}.

  • If q and q' are estimates of p, then the model with the smallest KL-divergence is the best one.
