Information Theory
Kullback-Leibler Divergence
The extra bits you pay when you code samples from P using a code built for Q
D(P‖Q) = Σ P(x) log(P(x)/Q(x)) — the extra coding cost of using Q for P. Zero iff P = Q, infinite where Q vanishes. The objective behind VAE ELBOs.
- DefinitionD(P‖Q) = Σ P(x) log[P(x) / Q(x)]
- Sign≥ 0 always; = 0 iff P = Q (Gibbs' inequality)
- AsymmetricD(P‖Q) ≠ D(Q‖P); no triangle inequality
- PathologyD = ∞ whenever Q(x) = 0 and P(x) > 0
- DecompositionH(P, Q) = H(P) + D(P‖Q)
- Used inVAE, ELBO, policy gradients, PPO, MLE, regularisation
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The coding-cost picture
Take two probability distributions P and Q on the same set of outcomes. Imagine you build an optimal prefix code for Q — Huffman or arithmetic coding, whatever — that assigns a codeword of length about −log Q(x) bits to each outcome x. Now feed it samples drawn from P, not Q. On average, you spend −Σ P(x) log Q(x) bits per sample. The smallest possible average — using a code built for the true distribution P — is the entropy −Σ P(x) log P(x). The difference is what you over-pay:
D(P ‖ Q) = Σ_x P(x) · log[ P(x) / Q(x) ]
= − Σ_x P(x) · log Q(x) − (− Σ_x P(x) · log P(x))
= H(P, Q) − H(P)
That's it. The Kullback-Leibler divergence is the expected number of extra bits — the gap between cross-entropy and entropy. For continuous distributions, replace the sum with an integral and the probability mass functions with densities.
Kullback and Leibler introduced this quantity in their 1951 paper On Information and Sufficiency, where they called it the "discrimination information." Modern usage standardised on "relative entropy" in the information-theory community and "KL divergence" in machine learning. The two names point at the same object.
Three properties that make it useful
1. Non-negative, zero only at equality. D(P‖Q) ≥ 0 with equality iff P = Q (Gibbs' inequality). Apply Jensen's inequality to the concave logarithm:
−D(P‖Q) = E_P[ log(Q/P) ] ≤ log E_P[Q/P] = log Σ Q(x) = log 1 = 0.
Equality in Jensen's inequality requires the integrand log(Q/P) to be constant P-almost-everywhere, which (because both distributions sum to 1) forces P = Q.
2. Asymmetric. D(P‖Q) ≠ D(Q‖P) in general. Try P = (0.5, 0.5) and Q = (0.9, 0.1):
D(P‖Q) = 0.5 log(0.5/0.9) + 0.5 log(0.5/0.1) ≈ 0.510 nats
D(Q‖P) = 0.9 log(0.9/0.5) + 0.1 log(0.1/0.5) ≈ 0.368 nats
Switching arguments changes the number by 30 percent. The asymmetry has a name in optimisation: the "M-projection" minimises D(P‖Q) over Q (mean-seeking, zero-avoiding), while the "I-projection" minimises D(Q‖P) (mode-seeking, zero-forcing). They produce visibly different fits when the target distribution is multimodal.
3. Infinite where the support breaks. If Q(x₀) = 0 at any point where P(x₀) > 0, the integrand log(P/Q) is infinite at x₀ and so is D(P‖Q). The reverse — Q(x) > 0 at points where P(x) = 0 — is fine: the term 0 · log(0/Q) is conventionally zero. Practically this means an approximating distribution Q must cover every point of P's support; numerical implementations smooth Q (Laplace, Dirichlet prior, softmax floor) to keep it strictly positive.
Worked example — biased coin vs fair coin
Let P be the true distribution of a coin with bias 0.8 toward heads, and Q be a uniform fair coin (0.5, 0.5). The KL from truth to model:
D(P‖Q) = 0.8 · log(0.8 / 0.5) + 0.2 · log(0.2 / 0.5)
= 0.8 · log(1.6) + 0.2 · log(0.4)
= 0.8 · 0.470 + 0.2 · (−0.916)
= 0.376 − 0.183
= 0.193 nats
= 0.193 / ln 2 ≈ 0.278 bits.
So coding samples from the biased coin with the fair-coin code costs an extra 0.278 bits per flip. Over 1000 flips that is 278 wasted bits — significant if you're streaming. The reverse direction:
D(Q‖P) = 0.5 · log(0.5 / 0.8) + 0.5 · log(0.5 / 0.2)
= 0.5 · (−0.470) + 0.5 · 0.916
= 0.223 nats
= 0.322 bits.
Notice D(Q‖P) > D(P‖Q) here. The fair-coin code under-codes the rare-tail case (Q assigns 0.5 to tails but P assigns only 0.2), which is the more expensive mismatch.
Connection to cross-entropy and maximum likelihood
The cross-entropy H(P, Q) = −Σ P(x) log Q(x) decomposes as
H(P, Q) = H(P) + D(P‖Q).
When P is the empirical data distribution and Q is a model, H(P) is a constant of the data and minimising H(P, Q) over the model parameters is exactly minimising D(P‖Q). Equivalently — by writing −Σ P(x) log Q(x) ≈ −(1/n) Σᵢ log Q(xᵢ) for the empirical distribution — minimising cross-entropy is maximum likelihood estimation. Every classifier you've trained with cross-entropy loss is, under the hood, minimising the KL divergence between the data distribution and its model. The binary form −y log ŷ − (1 − y) log(1 − ŷ) is the same identity for a two-class Bernoulli model.
Variational inference and the ELBO
Bayesian inference wants the posterior p(z | x), which is usually intractable. Variational inference replaces it by a tractable family q(z; φ) and minimises D(q(z; φ) ‖ p(z | x)) over φ. Substituting p(z | x) = p(x, z) / p(x) and rearranging produces the identity that the entire field is built on:
log p(x) = E_q[ log p(x, z) − log q(z; φ) ] + D( q(z; φ) ‖ p(z | x) )
= ELBO(φ) + KL.
The log evidence log p(x) does not depend on φ. The KL term is non-negative. So minimising the KL is the same as maximising the ELBO — and the ELBO is computable from the joint p(x, z) without ever forming the posterior. This is what variational autoencoders, mean-field methods and neural-network Bayesian inference all do.
Mode-seeking vs mean-seeking
Set P to a mixture of two well-separated Gaussians and Q to a single Gaussian. Then:
- Forward KL, min D(P‖Q). Each integrand P(x) log(P/Q(x)) is heavily penalised when Q(x) is tiny but P(x) is not. The optimal Q stretches to cover both modes, even at the cost of putting density in the empty valley between them. This is mean-seeking — the fitted Q has mean ≈ midpoint of the two true modes and large variance. Expectation propagation, moment-matching and forward KL VI all behave this way.
- Reverse KL, min D(Q‖P). Each integrand Q(x) log(Q/P(x)) is heavily penalised when Q(x) is positive but P(x) is near zero. The optimal Q collapses onto one of the two modes and ignores the other. This is mode-seeking — the fitted Q is narrow and centred on a single peak. Variational autoencoders, REINFORCE, mean-field VI and PPO all behave this way.
Neither answer is "the right one". They optimise different losses. The choice between forward and reverse KL is one of the most consequential design decisions in modern probabilistic modelling — bring it up in any VI paper review and someone will write a comment.
JavaScript — computing KL divergence
// Discrete KL divergence (in nats; divide by Math.LN2 for bits)
function klDivergence(P, Q) {
if (P.length !== Q.length) throw new Error('P and Q must have same support');
let kl = 0;
for (let i = 0; i < P.length; i++) {
if (P[i] === 0) continue; // 0 · log 0 := 0
if (Q[i] === 0) return Infinity; // unavoidable blow-up
kl += P[i] * Math.log(P[i] / Q[i]);
}
return kl;
}
// Smoothing — add ε before normalising to keep Q strictly positive
function smooth(dist, eps = 1e-9) {
const total = dist.reduce((s, p) => s + p + eps, 0);
return dist.map(p => (p + eps) / total);
}
// Worked example — biased coin vs fair coin
const P = [0.8, 0.2]; // true biased coin
const Q = [0.5, 0.5]; // fair-coin model
console.log(klDivergence(P, Q)); // 0.193 nats
console.log(klDivergence(P, Q) / Math.LN2); // 0.278 bits
console.log(klDivergence(Q, P)); // 0.223 nats — different number
Where KL divergence shows up
- Cross-entropy classifier loss. Every softmax classifier you train minimises KL between the one-hot label distribution and the predicted distribution. ImageNet, transformer pre-training, masked language modelling — all KL minimisation in disguise.
- Variational autoencoders. The encoder produces q(z | x), the prior is p(z) = 𝒩(0, I), and the loss penalises D(q(z|x)‖p(z)) on top of the reconstruction term. The KL term keeps the latent space close to a unit Gaussian and is the difference between a VAE and a deterministic autoencoder.
- Policy gradient with trust region (TRPO, PPO). The trust region in TRPO is D(π_old‖π_new) ≤ δ. PPO replaces the hard constraint with a clipped surrogate but still uses KL to monitor the size of policy updates. RLHF for LLMs adds an explicit KL penalty against a reference policy.
- Bayesian model selection. The marginal log-likelihood log p(x) of competing models is computed by integrating out parameters; the variational lower bound provides a KL-aware approximation that is the basis of automatic relevance determination and Gaussian-process model selection.
- Anomaly detection and drift monitoring. D(P_train‖P_live) measures how much the input distribution to a deployed model has drifted from training. Library-level production monitoring (Evidently AI, Arize) reports KL between train and serving distributions over individual features.
- Mutual information. I(X; Y) = D(p(x, y)‖p(x)p(y)) — the KL between the joint distribution and the product of marginals. KL is the parent quantity from which most other information-theoretic measures derive.
Common pitfalls
- Treating KL as a distance. It is not. It fails symmetry and the triangle inequality. If you need a metric, use Jensen-Shannon, Hellinger or Wasserstein — all are symmetric and bounded, and Jensen-Shannon is the symmetric average of two KLs.
- Letting Q hit zero on samples. Whenever you fit Q to data, smooth Q (add a small ε, use a Dirichlet prior, replace argmax by softmax) so the divergence stays finite. Otherwise a single unseen training point sends the loss to infinity.
- Confusing forward and reverse KL. The order of arguments matters. D(P‖Q) is mean-seeking; D(Q‖P) is mode-seeking. Reversing them in a VI implementation silently changes the optimum from a broad cover-all to a single mode.
- Plugging in raw counts. Empirical KL between two histograms is biased downward; the bias is roughly (k − 1) / (2n) nats where k is the number of bins. For honest density-comparison, use the Wasserstein distance or kernel methods.
- Comparing KLs across different sample sizes. The numerical value of D̂ depends on how many bins you used and how much data you have. Two studies reporting "KL = 0.3 nats" may mean very different fits if one used 10 bins and the other 1000.
- Ignoring units. Switching base of the logarithm rescales every number. Conferences default to nats; engineering defaults to bits. Always state the base.
KL vs related divergences
| KL divergence | Jensen-Shannon | Total variation | Wasserstein-1 | |
|---|---|---|---|---|
| Symmetric | No | Yes | Yes | Yes |
| Triangle inequality | No | Square root is a metric | Yes | Yes |
| Bounded | Unbounded (∞ on support mismatch) | ≤ log 2 | ≤ 1 | Unbounded but finite when supports overlap |
| Handles disjoint supports | No (becomes ∞) | Yes (saturates at log 2) | Yes (saturates at 1) | Yes (uses ground metric) |
| Differentiable in distribution parameters | Yes (and cheap) | Yes | No (subgradients only) | Yes via dual |
| Best for | MLE, VI, ELBOs, RL trust regions | GAN-style symmetric comparison | Statistical testing | OT, sliced-Wasserstein, WGAN |
KL is the cheapest and most analytically tractable choice, which is why it dominates in machine learning. Wasserstein is the most geometrically faithful but requires solving an optimal-transport problem. Jensen-Shannon and total variation are sometimes preferred when bounded, symmetric outputs are needed.
Frequently asked questions
Why is KL divergence not a true distance?
It fails two of the three metric axioms. Symmetry fails — D(P‖Q) ≠ D(Q‖P) in general; flipping the arguments changes both magnitude and behaviour. The triangle inequality fails — there is no constant guarantee that D(P‖R) ≤ D(P‖Q) + D(Q‖R). Only non-negativity and identity-of-indiscernibles (D = 0 iff P = Q) survive. KL is properly a divergence, not a metric. If you need an honest metric, use the Jensen-Shannon divergence (a symmetric average of two KLs and provably bounded), the Hellinger distance, or the Wasserstein distance.
What does the asymmetry actually do?
It determines whether the approximating distribution covers the mass of the target or chases its modes. D(P‖Q) — called the forward or M-projection — heavily penalises Q being small where P is large. So minimising the forward KL forces Q to spread out and cover every region P touches; it is mean-seeking and zero-avoiding. D(Q‖P) — the reverse or I-projection — penalises Q being large where P is small. Minimising it concentrates Q on one mode of P and ignores the others; it is mode-seeking and zero-forcing. Variational autoencoders use reverse KL, so they collapse onto one mode by design; expectation propagation uses forward KL and produces broader posteriors.
Why does KL go to infinity when Q hits zero?
Look at the integrand P(x) log(P(x)/Q(x)). If Q(x) = 0 at a point where P(x) > 0, the ratio is infinity and the log is infinity, so the integrand blows up. Intuitively, Q claims this region is impossible, but P is putting mass on it — coding even one such sample under Q costs infinitely many bits. Whenever you fit a model Q to data P, you must give Q non-zero support over every region the data lives in. This is why Laplace smoothing, Bayesian priors and softmax outputs all add a small floor — they keep Q strictly positive everywhere, preventing the divergence from blowing up.
How is KL related to cross-entropy and entropy?
By the identity H(P, Q) = H(P) + D(P‖Q). The cross-entropy of P and Q is the entropy of P plus the KL divergence from P to Q. When you train a classifier with cross-entropy loss against the empirical distribution P, the entropy term H(P) is a constant (it depends only on the data) so minimising cross-entropy is equivalent to minimising KL divergence. The classifier is implicitly performing maximum likelihood — choosing Q to best match the data distribution P in the KL sense. Conversely, KL divergence is the extra bits the cross-entropy loss adds on top of the unavoidable entropy of the data itself.
What is the ELBO and why does KL show up?
The Evidence Lower BOund is the basis of variational inference. You want the intractable posterior p(z | x), so you pick a tractable family q(z; φ) and minimise D(q(z; φ)‖p(z | x)). Algebraically, log p(x) = ELBO(φ) + D(q‖p) where the ELBO is a quantity you can compute and the KL is positive, so log p(x) ≥ ELBO. Maximising the ELBO is equivalent to minimising the KL divergence between approximate and true posteriors. Variational autoencoders, neural network Bayesian inference and mean-field methods are all ELBO maximisers — and what they really do is shrink KL.
How does Gibbs' inequality prove KL ≥ 0?
Apply Jensen's inequality to the convex function -log. Specifically -D(P‖Q) = Σ P(x) log(Q(x)/P(x)) = E_P[log(Q/P)]. By Jensen (since log is concave) this is at most log E_P[Q/P] = log Σ P(x) · Q(x)/P(x) = log Σ Q(x) = log 1 = 0. So -D(P‖Q) ≤ 0, i.e. D(P‖Q) ≥ 0. Equality holds iff Q/P is constant P-almost-everywhere, which (given both sum to 1) forces P = Q. Gibbs' inequality is the workhorse non-negativity result used to prove the Cramér-Rao bound, the data-processing inequality, and the consistency of maximum likelihood.
What units is KL divergence measured in?
The same units as the logarithm. With log base 2 you get bits — the natural unit for information theory and coding. With natural log you get nats — the natural unit for calculus and most machine learning. With log base 10 you get dits or hartleys. Conversion: 1 nat ≈ 1.4427 bits, 1 bit = ln 2 nats. Numerical libraries (PyTorch, TensorFlow, scipy.stats.entropy) default to nats; information-theory textbooks default to bits.