
KL-divergence


Kullback-Leibler (KL) divergence is crucial in various fields, including statistics and machine learning. It measures the difference between two probability distributions and acts as a gauge of dissimilarity. This gives you the power to assess how well one distribution resembles another, making it highly valuable for model evaluation, hypothesis testing, and anomaly detection tasks.

Understanding and using KL divergence allows you to compare different models or quantify the information loss when a complex distribution is made simpler. It helps you make informed decisions about the best models or algorithms for specific tasks.

In this topic, you will delve into the intricate details of KL divergence, studying its properties and grasping the logic behind its computations.

Entropy

Imagine a communication channel through which symbols are transmitted. These symbols could be a long sequence of independent random symbols such as letters, numbers, and special characters like @ or %. Your role is to encode each symbol using bits so that you can unambiguously restore the original sequence.

We need to introduce some helpful notation. If there are k possible symbols, you can denote them as x_1, \dots, x_k. Each new observation is random, meaning that it can take each possible value x_i with a certain probability p_i for every i \in \{ 1, \dots, k\}. It's important to note that this implies 0 < p_i < 1 and that \sum \limits_{i=1}^k p_i = 1. These probabilities define a distribution represented by p.

Your task is to encode each possible observation x_i into a number of bits. Each potential encoding is a mapping C that converts each symbol x_i into a number of bits C(x_i). So, how can you decide on the best encoding C? Considering each new symbol is random, it can take any value of x_1, \dots, x_k. Therefore, its number of bits (length) can be estimated in advance by considering all the possible lengths and weighting them by the probability of their occurrence:

\sum_{i=1}^k C(x_i) p_i

The most effective encoding C is the one that minimizes this average length. The best part? This encoding turns out to be straightforward: C(x_i) = - \log (p_i). Keep in mind that 0 < p_i < 1 implies \log (p_i) < 0. To make sure the length is a positive number, just multiply by -1. With this encoding, the average length will be:

H(p) = - \sum_{i=1}^k p_i \log{p_i}

This number H(p) is referred to as the entropy of the random symbol with the p distribution. It's a measure of uncertainty about the upcoming new symbol. To grasp why this is so, let's examine it closely!
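
Before digging in, here is a quick numerical check that the entropy really is the average length under the optimal encoding C(x_i) = -\log(p_i). This is a minimal Python sketch; the particular distribution and the use of base-2 logarithms (so that lengths are measured in bits) are assumptions made only for the example.

```python
import numpy as np

# Example distribution over k = 4 symbols (made up for illustration).
p = np.array([0.5, 0.25, 0.125, 0.125])

# Optimal code lengths in bits: C(x_i) = -log2(p_i).
lengths = -np.log2(p)

# Average code length, weighting each length by the probability of its symbol.
average_length = np.sum(p * lengths)

# Entropy computed directly from its definition.
entropy = -np.sum(p * np.log2(p))

print(lengths)         # [1. 2. 3. 3.]
print(average_length)  # 1.75
print(entropy)         # 1.75 -- identical to the average length
```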

Understanding entropy

If a certain symbol x_i has a high probability p_i, then it is quite certain to occur. Therefore there is little uncertainty about its occurrence. Similarly, when its probability is really low, then there is also little uncertainty about its occurrence: it's quite certain that it won't happen. It makes a lot of sense, doesn't it?

But what happens when p_i moves away from the extremes? Uncertainty starts to increase, because it becomes less and less clear whether the symbol will occur or not. The expression - p_i \log{p_i} is exactly this measure of uncertainty:

[Figure: the uncertainty term - p_i \log{p_i} as a function of p_i]
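
The shape of this curve is easy to probe numerically. In this minimal sketch (the probability values are arbitrary), the term is small near 0 and near 1 and noticeably larger in between:

```python
import numpy as np

# Per-symbol uncertainty -p * log2(p) for a few probabilities.
for p_i in [0.01, 0.1, 0.3, 0.5, 0.9, 0.99]:
    print(p_i, -p_i * np.log2(p_i))
# The term is close to 0 at both extremes and larger in the middle.
```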

Then, the entropy is the sum of the uncertainty terms associated with the individual symbols: H(p) = - \sum \limits_{i=1}^k p_i \log{p_i}. The greater the uncertainty in p_i, the greater its contribution to the entropy. In particular, when k = 2, the entropy becomes H(p) = - p \log{p} - (1-p) \log\left(1-p \right) and it attains its maximum at 1/2:

[Figure: entropy with just 2 values, H(p) = - p \log{p} - (1-p) \log(1-p), maximal at p = 1/2]
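
You can verify this maximum numerically. A minimal sketch (the sample probabilities are arbitrary) showing that the two-symbol entropy peaks at one bit when p = 1/2:

```python
import numpy as np

def binary_entropy(p):
    """Entropy of a distribution over two symbols with probabilities p and 1 - p."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in [0.05, 0.25, 0.5, 0.75, 0.95]:
    print(p, binary_entropy(p))
# The values rise towards 1 bit at p = 0.5 and fall off symmetrically on either side.
```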

The worst scenario happens when all symbols are equally probable, p_i = 1/k. In this case the entropy reaches its maximum:

\begin{align*}
H(p) &= - \sum_{i=1}^k p_i \log{p_i} \\
&= - \sum_{i=1}^k \frac{1}{k} \log{\frac{1}{k}} \\
&= \frac{1}{k} \sum_{i=1}^k \log{k} \\
&= \frac{1}{k} \log{k^k} \\
&= \frac{k}{k} \log{k} \\
&= \log{k}
\end{align*}

When some probability p_i is 1, its symbol will occur with certainty, so the entropy attains its minimum H(p) = 0.
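
Both extremes are easy to check numerically. In this minimal sketch, the value k = 8 and the nearly degenerate distribution are assumptions made only for the example:

```python
import numpy as np

k = 8

# Uniform distribution: every symbol has probability 1/k.
uniform = np.full(k, 1 / k)
print(-np.sum(uniform * np.log2(uniform)))  # 3.0, i.e. log2(8)

# Nearly degenerate distribution: one symbol is (almost) certain.
# A probability of exactly 1 would produce 0 * log(0) terms, which are
# treated as 0 by convention; tiny probabilities avoid that edge case here.
almost_certain = np.array([1 - 7e-9] + [1e-9] * (k - 1))
print(-np.sum(almost_certain * np.log2(almost_certain)))  # very close to 0
```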

Cross-entropy

In reality, you may not know the probabilities p of the various symbols. The best you can do is gather estimates q_1, \dots, q_k of the actual probabilities. A simple method is to observe some symbols over time and then calculate each symbol's proportion of occurrences. In doing so, the q distribution becomes a model that approximates p. So, the average length when using q to approximate p is:

H(p | q) = - \sum_{i=1}^k p_i \log{q_i}

This figure is known as the cross-entropy from q to p. Except when the distributions are equal, it's always greater than the entropy. The higher the value, the less accurately the model reflects reality. Therefore, cross-entropy is a measure of the model's quality: the lower the cross-entropy, the more accurately the model represents reality.
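
As an illustration, here is a minimal sketch; the true distribution p and the estimated model q are both made up for the example, and base-2 logarithms are again assumed:

```python
import numpy as np

# True distribution p and an estimated model q (both assumed for the example).
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.4, 0.3, 0.2, 0.1])

entropy = -np.sum(p * np.log2(p))
cross_entropy = -np.sum(p * np.log2(q))

print(entropy)        # 1.75
print(cross_entropy)  # roughly 1.80 -- larger than the entropy, since q != p
```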

But how effective is the model q in estimating p? That's a significant drawback of cross-entropy: it's rather difficult to interpret on its own. For instance, what does a cross-entropy of 5 mean? Is it good or bad? Since cross-entropy is never smaller than entropy, it makes sense to compare the two values. This comparison brings us to our main topic, the KL-divergence.

KL-divergence

KL-divergence is simply the difference between cross-entropy and entropy:

\begin{align*}
D(p|q) &= H(p|q) - H(p) \\
&= - \sum_{i=1}^k p_i \log(q_i) - \left( - \sum_{i=1}^k p_i \log(p_i) \right) \\
&= - \sum_{i=1}^k p_i \left[ \log(q_i) - \log(p_i) \right] \\
&= -\sum_{i=1}^k p_i \log\left(\frac{q_i}{p_i}\right)
\end{align*}

It shows how many bits longer the encoding obtained using the model q is than the optimal encoding. KL-divergence is always non-negative, reaching zero only if the model perfectly mirrors reality, that is, p = q.
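
Continuing the toy example from above, here is a minimal sketch of the computation. The distributions are made up; scipy.stats.entropy, which returns the KL-divergence when given two distributions, is used only as a cross-check:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes the KL-divergence

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.4, 0.3, 0.2, 0.1])

cross_entropy = -np.sum(p * np.log2(q))
h_p = -np.sum(p * np.log2(p))

kl = cross_entropy - h_p
print(kl)                     # a small positive number: the extra bits paid for using q
print(entropy(p, q, base=2))  # the same value, computed directly by scipy
```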

Given that the distribution p is fixed, KL-divergence differs from cross-entropy by the constant H(p). Because of this, minimizing KL-divergence is equivalent to minimizing cross-entropy. In simpler words, if the distribution q minimizes one of them, it brings down the other as well. Therefore, the smaller the KL-divergence is, the better the model proves to be. This offers you a method for choosing the best model: if q and q' are estimates of p, then you should pick the one with the smallest KL-divergence.
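
For instance, here is a hypothetical comparison of two candidate models (all three distributions below are invented for the example):

```python
import numpy as np

def kl_divergence(p, q):
    """KL-divergence D(p|q) in bits for discrete distributions."""
    return np.sum(p * np.log2(p / q))

p     = np.array([0.5, 0.25, 0.125, 0.125])   # true distribution
q_one = np.array([0.4, 0.3, 0.2, 0.1])        # first candidate model
q_two = np.array([0.45, 0.25, 0.15, 0.15])    # second candidate model

print(kl_divergence(p, q_one))
print(kl_divergence(p, q_two))
# The candidate with the smaller KL-divergence is the better estimate of p.
```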

Due to KL-divergence not being symmetric, D(p|q) \neq D(q|p), it cannot be seen as a distance measure. It also doesn't adhere to the triangle inequality. You should see it as the loss of information incurred by predicting p with q, rather than as a distance between the two distributions.
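
The asymmetry is easy to observe with a pair of made-up distributions:

```python
import numpy as np

def kl_divergence(p, q):
    return np.sum(p * np.log2(p / q))

# Two arbitrary distributions over three symbols (assumed for the example).
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.5, 0.25, 0.25])

print(kl_divergence(p, q))  # D(p|q), roughly 0.28 bits
print(kl_divergence(q, p))  # D(q|p), roughly 0.32 bits -- a different value
```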

Continuous case

Up to now, we've only looked at discrete distributions. However, we can also apply these definitions to continuous distributions with density p. Here's the entropy for such cases:

H(p) = - \int_{-\infty}^{\infty} p(x) \log p(x) \ dx

The interpretation isn't as straightforward as for the discrete case; for instance, it can yield a negative value. Regardless, you can use it to gauge the uncertainty of a random symbol. If you're comfortable with expectations of functions of a random variable, consider this: if X denotes a random variable with a density function of p, then the entropy equals the expected value of a transformation of X:

H(p) = -E \left[ \log p(X) \right]

This equation simplifies the process of interpreting and calculating entropy. As expected, the cross-entropy remains the same:

H(p | q) = - \int_{-\infty}^{\infty} p(x) \log q(x) \ dx

We define the KL-divergence in the same manner as before:

\begin{align*}
D(p|q) &= H(p|q) - H(p) \\
&= - \int_{-\infty}^{\infty} p(x) \log \frac{q(x)}{p(x)} \ dx
\end{align*}

Here's the good news: regardless of the circumstances, the KL-divergence stays non-negative and is zero only if p=qp=q. Therefore, the smaller the KL-divergence, the better.
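
To make the continuous case concrete, here is a sketch using two normal distributions; the particular means and standard deviations are assumptions made up for the example. It estimates the entropy through the expectation form H(p) = -E[\log p(X)] with Monte Carlo sampling and compares the sampled KL-divergence against the known closed form for two Gaussians (natural logarithms are used, so the results are in nats rather than bits):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two normal densities p and q (parameters chosen arbitrarily for illustration).
p = norm(loc=0.0, scale=1.0)
q = norm(loc=1.0, scale=2.0)

# Sample from p and use the expectation forms of the definitions.
x = rng.normal(loc=0.0, scale=1.0, size=200_000)

entropy_mc = -np.mean(p.logpdf(x))               # H(p) = -E[log p(X)]
kl_mc = -np.mean(q.logpdf(x) - p.logpdf(x))      # D(p|q) = -E[log(q(X)/p(X))]

# Closed-form values for normal distributions (in nats).
entropy_exact = 0.5 * np.log(2 * np.pi * np.e * 1.0 ** 2)
kl_exact = np.log(2.0 / 1.0) + (1.0 ** 2 + (0.0 - 1.0) ** 2) / (2 * 2.0 ** 2) - 0.5

print(entropy_mc, entropy_exact)  # both close to 1.419
print(kl_mc, kl_exact)            # both close to 0.443
```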

Conclusion

  • Entropy measures the uncertainty about a distribution: H(p) = \begin{cases} - \sum_{i=1}^k p_i \log {p_i} & \text{discrete case} \\ - \int_{-\infty}^{\infty} p(x) \log p(x) \ dx & \text{continuous case} \end{cases}. If the density function of X is p, then H(p) = -E \left[ \log p(X) \right].

  • Cross-entropy measures the quality of the model q estimating distribution p. The better the model describes reality, the lower the cross-entropy: H(p | q) = \begin{cases} - \sum_{i=1}^k p_i \log{q_i} & \text{discrete case} \\ - \int_{-\infty}^{\infty} p(x) \log q(x) \ dx & \text{continuous case} \end{cases}.

  • KL-divergence shows how well a distribution q estimates a particular distribution p: D(p | q) = H(p|q) - H(p) = \begin{cases} -\sum_{i=1}^k p_i \log \frac{q_i}{p_i} & \text{discrete case} \\ - \int_{-\infty}^{\infty} p(x) \log \frac{q(x)}{p(x)} \ dx & \text{continuous case} \end{cases}.

  • If q and q' are estimates of p, then the model with the smallest KL-divergence is the best one.
