Information Theory
Mutual Information
Bits of information one random variable carries about another — every kind of dependence, not just linear
I(X; Y) = H(X) − H(X|Y) = D_KL(p(x,y) ‖ p(x)p(y)) — bits about Y you learn from X. Zero iff independent; captures nonlinear structure correlation misses.
- DefinitionI(X; Y) = Σ p(x, y) log[p(x, y) / (p(x) p(y))]
- Equivalent formsH(X) − H(X|Y) = H(X) + H(Y) − H(X, Y)
- Sign≥ 0; equality iff X ⊥ Y
- SymmetryI(X; Y) = I(Y; X)
- Data-processingX → Y → Z ⇒ I(X; Z) ≤ I(X; Y)
- Used inInfoNCE, decision trees, feature selection, information bottleneck
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
Three equivalent definitions
Mutual information has multiple faces, each illuminating a different intuition. For discrete random variables X and Y with joint distribution p(x, y) and marginals p(x), p(y):
I(X; Y) = Σ_{x, y} p(x, y) · log [ p(x, y) / (p(x) · p(y)) ] (KL form)
= H(X) − H(X | Y) (uncertainty form)
= H(X) + H(Y) − H(X, Y) (Venn form)
= D_KL ( p(x, y) ‖ p(x) p(y) ) (divergence form).
The first form expresses MI as the KL divergence between the joint p(x, y) and the product of marginals p(x)p(y) — the distribution X and Y would have if they were independent. The KL is zero iff p(x, y) = p(x)p(y), so MI is zero iff X and Y are independent. The KL is non-negative, so MI is non-negative.
The second form (uncertainty reduction) says I(X; Y) = H(X) − H(X | Y): mutual information equals the marginal entropy of X minus the conditional entropy of X given Y. Knowing Y reduces your uncertainty about X by exactly I(X; Y) bits.
The third form is the Venn form: I(X; Y) = H(X) + H(Y) − H(X, Y). Joint entropy is at most the sum of marginal entropies; MI fills in the gap. Visualised as a Venn diagram, the intersection of the two entropy circles is exactly I(X; Y).
Worked example — two correlated coins
Let X be a fair coin (P = 0.5 heads). Let Y be a copy of X corrupted by noise — with probability 0.8, Y = X; with probability 0.2, Y is flipped. The joint distribution:
p(X=0, Y=0) = 0.5 · 0.8 = 0.40
p(X=0, Y=1) = 0.5 · 0.2 = 0.10
p(X=1, Y=0) = 0.5 · 0.2 = 0.10
p(X=1, Y=1) = 0.5 · 0.8 = 0.40.
The marginals are p(X = 0) = p(X = 1) = 0.5 and similarly for Y. So p(X) · p(Y) at every cell equals 0.25. Compute the KL term by term:
I(X; Y) = 0.40 · log₂(0.40/0.25) + 0.10 · log₂(0.10/0.25)
+ 0.10 · log₂(0.10/0.25) + 0.40 · log₂(0.40/0.25)
= 2 · 0.40 · log₂(1.6) + 2 · 0.10 · log₂(0.4)
= 0.80 · 0.678 − 0.20 · 1.322
= 0.543 − 0.264
= 0.278 bits.
Knowing Y reduces your uncertainty about X by 0.278 bits. Sanity check via the uncertainty form: H(X) = 1 bit (fair coin), H(X | Y) = 1 − 0.278 = 0.722 bits. Independently compute H(X | Y) from p(X | Y): given Y = 0, P(X = 0 | Y = 0) = 0.4 / 0.5 = 0.8; the binary entropy at p = 0.8 is 0.722 bits. Match. The mutual information is consistent with the uncertainty-reduction interpretation.
For a noiseless channel (Y = X exactly), I(X; Y) = 1 bit — the full bit of X. For pure noise (Y independent of X), I(X; Y) = 0 bits. The 0.278 bits captures how much signal survives through the 20%-error channel.
Five properties that make MI useful
Non-negative. I(X; Y) ≥ 0 by the non-negativity of KL. Equality iff X and Y are independent.
Symmetric. I(X; Y) = I(Y; X). The information X gives about Y equals the information Y gives about X. This makes MI a sound measure of dependence (unlike conditional entropy, which is asymmetric).
Bounded. I(X; Y) ≤ min(H(X), H(Y)). You can never learn more about Y than the total entropy of Y. Equality holds in the noiseless case where Y is a deterministic function of X (or X of Y).
Data-processing inequality. If X → Y → Z is a Markov chain (Z depends on X only through Y), then I(X; Z) ≤ I(X; Y). Information cannot be created by post-processing; only destroyed. This is why we cannot recover lost detail by applying a function — and why feature engineering can never add information, only reorganise it.
Chain rule. I(X; Y, Z) = I(X; Y) + I(X; Z | Y). The information X carries about (Y, Z) decomposes into "information about Y" plus "additional information about Z given Y." Iterating gives I(X; Y_1, …, Y_n) = Σ I(X; Y_k | Y_ Pearson correlation ρ(X, Y) measures the strength of a linear relationship. It is zero for dependent variables whose relationship is symmetric or periodic. Take Y = X² with X uniform on [−1, 1] — clearly dependent. The Pearson correlation between X and Y: So Pearson reports zero correlation. But Y is a deterministic function of X via |X| = √Y — knowing Y reduces the uncertainty about X dramatically (from full uniform [−1, 1] down to two points). Mutual information correctly reports a large positive value here, capturing the nonlinear dependence Pearson misses. Other classic correlation failures: Y = sin(2πX) with X uniform (high MI, zero linear correlation); Y = X · noise where noise has zero mean (zero correlation, positive MI); chaotic and recurrent systems where Pearson averages out but MI shows persistent memory. Whenever you suspect a nonlinear or symmetric dependence, MI is the right tool and Pearson is misleading. Pearson is fast and parametric but blind to nonlinear structure. Spearman/Kendall handle monotonic nonlinearity. Distance correlation (Székely 2007) and HSIC (Gretton 2005) are kernel-based and characterise independence — like MI — but with cheaper estimation. Mutual information is the canonical information-theoretic measure but is the hardest to estimate in high dimensions; the others are practical tools that approximate or restrict it.JavaScript — computing mutual information
// Discrete MI from a joint distribution matrix (in bits)
function mutualInformation(joint) {
const rows = joint.length;
const cols = joint[0].length;
// Marginals
const px = Array(rows).fill(0);
const py = Array(cols).fill(0);
for (let i = 0; i < rows; i++)
for (let j = 0; j < cols; j++) {
px[i] += joint[i][j];
py[j] += joint[i][j];
}
// Sum p(x,y) log[ p(x,y) / (p(x) p(y)) ]
let mi = 0;
for (let i = 0; i < rows; i++)
for (let j = 0; j < cols; j++) {
const p = joint[i][j];
if (p > 0) mi += p * Math.log2(p / (px[i] * py[j]));
}
return mi;
}
// Worked example: 80%-correlated binary channel
const joint = [
[0.40, 0.10], // X=0, Y=0 | X=0, Y=1
[0.10, 0.40], // X=1, Y=0 | X=1, Y=1
];
console.log(mutualInformation(joint)); // 0.278 bits
// Independent variables: MI = 0 by construction
const indep = [
[0.25, 0.25],
[0.25, 0.25],
];
console.log(mutualInformation(indep)); // 0.000 bits
// Noiseless copy: MI = H(X) = 1 bit
const exact = [
[0.50, 0.00],
[0.00, 0.50],
];
console.log(mutualInformation(exact)); // 1.000 bits
// MI from samples — bin into a 2D histogram then plug in
function miFromSamples(xs, ys, bins = 20) {
const xmin = Math.min(...xs), xmax = Math.max(...xs);
const ymin = Math.min(...ys), ymax = Math.max(...ys);
const hist = Array.from({length: bins}, () => Array(bins).fill(0));
for (let i = 0; i < xs.length; i++) {
const ix = Math.min(bins - 1, Math.floor((xs[i] - xmin) / (xmax - xmin) * bins));
const iy = Math.min(bins - 1, Math.floor((ys[i] - ymin) / (ymax - ymin) * bins));
hist[ix][iy]++;
}
// Normalise to probability
const total = xs.length;
for (let i = 0; i < bins; i++)
for (let j = 0; j < bins; j++)
hist[i][j] /= total;
return mutualInformation(hist);
}
Why MI is better than correlation for measuring dependence
Cov(X, Y) = E[X · X²] − E[X] · E[X²]
= E[X³] − 0 · 1/3
= 0 (since X³ is odd over symmetric support).Where mutual information shows up
mutual_info_classif and mutual_info_regression. Beats correlation-based filters when feature-target relationships are nonlinear.Common pitfalls
Mutual information vs alternative dependence measures
Mutual information Pearson correlation Spearman / Kendall Distance correlation HSIC Captures nonlinear dependence Yes — all kinds No (only linear) Monotonic only Yes (all) Yes (all) Zero iff independent Yes (provably) No No Yes (provably) Yes (with characteristic kernel) Bounded Bounded by min(H(X), H(Y)) [−1, 1] [−1, 1] [0, 1] after normalisation Unbounded Estimable from samples Hard (need densities) Trivially Trivially O(n² log n) O(n²) Closed-form for Gaussian Yes (−½ log det Σ form) Yes No Yes Yes Best for Quantifying total dependence in bits Quick linear screening Robust monotonic checking Independence testing Independence testing in high d
Frequently asked questions
How is mutual information different from correlation?
Pearson correlation measures linear dependence — the slope of the best straight-line fit. Mutual information measures every kind of dependence — linear, polynomial, periodic, chaotic, mediated by latent variables. Take Y = sin(X) for X uniform on [0, 2π]. Pearson correlation is exactly zero (the sine is symmetric); mutual information is large (Y is a deterministic function of X). Pearson is cheap and has a closed-form formula; mutual information requires either a parametric model or a density estimator. Both have their place — use Pearson for quick screening of linear relationships, mutual information when you suspect or care about nonlinear structure.
Why does I(X; Y) = 0 imply independence?
By definition I(X; Y) = D_KL(p(x, y)‖p(x)p(y)). The KL divergence between two distributions is zero if and only if those distributions are equal. So I(X; Y) = 0 iff p(x, y) = p(x)p(y) for all (x, y) — which is exactly the definition of independence. This is unlike correlation: correlation can be zero for dependent variables (e.g. Y = X² with X symmetric around zero). Mutual information is the strongest test of independence because it directly measures the divergence between joint and the independence assumption.
What is the chain rule for mutual information?
I(X; Y, Z) = I(X; Y) + I(X; Z | Y), where I(X; Z | Y) = H(X | Y) − H(X | Y, Z) is the conditional mutual information. Iterate this and you get I(X; Y₁, …, Y_n) = Σ I(X; Y_k | Y_<k) — exact decomposition into per-step contributions. This is the basis of greedy feature selection: pick the next feature whose conditional MI with the target is largest given features already selected. It also leads to the data-processing inequality: if X → Y → Z is a Markov chain, then I(X; Z) ≤ I(X; Y) — information cannot be created by post-processing.
Why is mutual information hard to estimate from samples?
Because it depends on densities. For continuous random variables you need to estimate p(x, y), p(x), and p(y) from finite samples — and the natural plug-in estimator has worst-case O(n^(−2/(d+4))) convergence rate, painfully slow in high dimensions. Modern neural estimators (MINE 2018, InfoNCE 2018, CLUB 2020) sidestep density estimation by optimising variational lower or upper bounds on MI using neural networks. They scale to high-dimensional X, Y but only provide bounds, not exact values. The standard practical rule: trust binned/k-NN estimators in low dimensions (d ≤ 5), use variational neural estimators when you only need a lower-bound objective for representation learning.
What is conditional mutual information?
I(X; Y | Z) is the mutual information between X and Y given Z — the reduction in uncertainty about X from knowing Y, after already knowing Z. Formally I(X; Y | Z) = H(X | Z) − H(X | Y, Z). If X ⊥ Y | Z (conditional independence), then I(X; Y | Z) = 0; this is how Bayesian networks and graphical models test independencies. CMI can be larger than unconditional MI (Z hides a relationship), smaller (Z explains away the dependence), or the same. It is the right tool when you want to measure whether a feature adds information beyond what is already known.
How is mutual information used in contrastive learning?
InfoNCE (van den Oord et al., 2018) is a lower bound on mutual information between two views of the data. With K-1 negative samples and one positive, the InfoNCE loss is −log[exp(f(x, y_+)) / Σ exp(f(x, y_k))] where f is a critic. Minimising this loss maximises a lower bound on I(X; Y_+). SimCLR, MoCo, BYOL, CLIP and most modern self-supervised methods are InfoNCE-based — they learn representations that maximise mutual information between augmentations of the same image, or between images and captions. The mutual-information bound is the theoretical underpinning of the entire "contrastive learning" field.
What is the information bottleneck?
Tishby's information bottleneck (1999) frames representation learning as: find a stochastic encoder p(T | X) that maximises I(T; Y) (T is informative about the label) while minimising I(T; X) (T discards everything else about X). The tradeoff is controlled by a Lagrange multiplier β: maximise I(T; Y) − β · I(T; X). The optimal T is a sufficient statistic of X for Y with minimal complexity. Recent work (Schwartz-Ziv & Tishby 2017) argues neural-net training itself implicitly performs information-bottleneck compression, though this interpretation remains contested.