Information Theory

Mutual Information

Bits of information one random variable carries about another — every kind of dependence, not just linear

I(X; Y) = H(X) − H(X|Y) = D_KL(p(x,y) ‖ p(x)p(y)) — bits about Y you learn from X. Zero iff independent; captures nonlinear structure correlation misses.

DefinitionI(X; Y) = Σ p(x, y) log[p(x, y) / (p(x) p(y))]
Equivalent formsH(X) − H(X|Y) = H(X) + H(Y) − H(X, Y)
Sign≥ 0; equality iff X ⊥ Y
SymmetryI(X; Y) = I(Y; X)
Data-processingX → Y → Z ⇒ I(X; Z) ≤ I(X; Y)
Used inInfoNCE, decision trees, feature selection, information bottleneck

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Three equivalent definitions

Mutual information has multiple faces, each illuminating a different intuition. For discrete random variables X and Y with joint distribution p(x, y) and marginals p(x), p(y):

I(X; Y) = Σ_{x, y} p(x, y) · log [ p(x, y) / (p(x) · p(y)) ]    (KL form)
        = H(X) − H(X | Y)                                       (uncertainty form)
        = H(X) + H(Y) − H(X, Y)                                 (Venn form)
        = D_KL ( p(x, y) ‖ p(x) p(y) )                          (divergence form).

The first form expresses MI as the KL divergence between the joint p(x, y) and the product of marginals p(x)p(y) — the distribution X and Y would have if they were independent. The KL is zero iff p(x, y) = p(x)p(y), so MI is zero iff X and Y are independent. The KL is non-negative, so MI is non-negative.

The second form (uncertainty reduction) says I(X; Y) = H(X) − H(X | Y): mutual information equals the marginal entropy of X minus the conditional entropy of X given Y. Knowing Y reduces your uncertainty about X by exactly I(X; Y) bits.

The third form is the Venn form: I(X; Y) = H(X) + H(Y) − H(X, Y). Joint entropy is at most the sum of marginal entropies; MI fills in the gap. Visualised as a Venn diagram, the intersection of the two entropy circles is exactly I(X; Y).

Worked example — two correlated coins

Let X be a fair coin (P = 0.5 heads). Let Y be a copy of X corrupted by noise — with probability 0.8, Y = X; with probability 0.2, Y is flipped. The joint distribution:

p(X=0, Y=0) = 0.5 · 0.8 = 0.40
p(X=0, Y=1) = 0.5 · 0.2 = 0.10
p(X=1, Y=0) = 0.5 · 0.2 = 0.10
p(X=1, Y=1) = 0.5 · 0.8 = 0.40.

The marginals are p(X = 0) = p(X = 1) = 0.5 and similarly for Y. So p(X) · p(Y) at every cell equals 0.25. Compute the KL term by term:

I(X; Y) = 0.40 · log₂(0.40/0.25) + 0.10 · log₂(0.10/0.25)
        + 0.10 · log₂(0.10/0.25) + 0.40 · log₂(0.40/0.25)
        = 2 · 0.40 · log₂(1.6)  + 2 · 0.10 · log₂(0.4)
        = 0.80 · 0.678          − 0.20 · 1.322
        = 0.543                  − 0.264
        = 0.278 bits.

Knowing Y reduces your uncertainty about X by 0.278 bits. Sanity check via the uncertainty form: H(X) = 1 bit (fair coin), H(X | Y) = 1 − 0.278 = 0.722 bits. Independently compute H(X | Y) from p(X | Y): given Y = 0, P(X = 0 | Y = 0) = 0.4 / 0.5 = 0.8; the binary entropy at p = 0.8 is 0.722 bits. Match. The mutual information is consistent with the uncertainty-reduction interpretation.

For a noiseless channel (Y = X exactly), I(X; Y) = 1 bit — the full bit of X. For pure noise (Y independent of X), I(X; Y) = 0 bits. The 0.278 bits captures how much signal survives through the 20%-error channel.

Five properties that make MI useful

Non-negative. I(X; Y) ≥ 0 by the non-negativity of KL. Equality iff X and Y are independent.

Symmetric. I(X; Y) = I(Y; X). The information X gives about Y equals the information Y gives about X. This makes MI a sound measure of dependence (unlike conditional entropy, which is asymmetric).

Bounded. I(X; Y) ≤ min(H(X), H(Y)). You can never learn more about Y than the total entropy of Y. Equality holds in the noiseless case where Y is a deterministic function of X (or X of Y).

Data-processing inequality. If X → Y → Z is a Markov chain (Z depends on X only through Y), then I(X; Z) ≤ I(X; Y). Information cannot be created by post-processing; only destroyed. This is why we cannot recover lost detail by applying a function — and why feature engineering can never add information, only reorganise it.

Chain rule. I(X; Y, Z) = I(X; Y) + I(X; Z | Y). The information X carries about (Y, Z) decomposes into "information about Y" plus "additional information about Z given Y." Iterating gives I(X; Y_1, …, Y_n) = Σ I(X; Y_k | Y_

JavaScript — computing mutual information

// Discrete MI from a joint distribution matrix (in bits)
function mutualInformation(joint) {
  const rows = joint.length;
  const cols = joint[0].length;
  // Marginals
  const px = Array(rows).fill(0);
  const py = Array(cols).fill(0);
  for (let i = 0; i < rows; i++)
    for (let j = 0; j < cols; j++) {
      px[i] += joint[i][j];
      py[j] += joint[i][j];
    }
  // Sum p(x,y) log[ p(x,y) / (p(x) p(y)) ]
  let mi = 0;
  for (let i = 0; i < rows; i++)
    for (let j = 0; j < cols; j++) {
      const p = joint[i][j];
      if (p > 0) mi += p * Math.log2(p / (px[i] * py[j]));
    }
  return mi;
}

// Worked example: 80%-correlated binary channel
const joint = [
  [0.40, 0.10],   // X=0, Y=0 | X=0, Y=1
  [0.10, 0.40],   // X=1, Y=0 | X=1, Y=1
];
console.log(mutualInformation(joint));   // 0.278 bits

// Independent variables: MI = 0 by construction
const indep = [
  [0.25, 0.25],
  [0.25, 0.25],
];
console.log(mutualInformation(indep));   // 0.000 bits

// Noiseless copy: MI = H(X) = 1 bit
const exact = [
  [0.50, 0.00],
  [0.00, 0.50],
];
console.log(mutualInformation(exact));   // 1.000 bits

// MI from samples — bin into a 2D histogram then plug in
function miFromSamples(xs, ys, bins = 20) {
  const xmin = Math.min(...xs), xmax = Math.max(...xs);
  const ymin = Math.min(...ys), ymax = Math.max(...ys);
  const hist = Array.from({length: bins}, () => Array(bins).fill(0));
  for (let i = 0; i < xs.length; i++) {
    const ix = Math.min(bins - 1, Math.floor((xs[i] - xmin) / (xmax - xmin) * bins));
    const iy = Math.min(bins - 1, Math.floor((ys[i] - ymin) / (ymax - ymin) * bins));
    hist[ix][iy]++;
  }
  // Normalise to probability
  const total = xs.length;
  for (let i = 0; i < bins; i++)
    for (let j = 0; j < bins; j++)
      hist[i][j] /= total;
  return mutualInformation(hist);
}

Why MI is better than correlation for measuring dependence

Pearson correlation ρ(X, Y) measures the strength of a linear relationship. It is zero for dependent variables whose relationship is symmetric or periodic. Take Y = X² with X uniform on [−1, 1] — clearly dependent. The Pearson correlation between X and Y:

Cov(X, Y) = E[X · X²] − E[X] · E[X²]
          = E[X³]    −  0   ·  1/3
          = 0          (since X³ is odd over symmetric support).

So Pearson reports zero correlation. But Y is a deterministic function of X via |X| = √Y — knowing Y reduces the uncertainty about X dramatically (from full uniform [−1, 1] down to two points). Mutual information correctly reports a large positive value here, capturing the nonlinear dependence Pearson misses.

Other classic correlation failures: Y = sin(2πX) with X uniform (high MI, zero linear correlation); Y = X · noise where noise has zero mean (zero correlation, positive MI); chaotic and recurrent systems where Pearson averages out but MI shows persistent memory. Whenever you suspect a nonlinear or symmetric dependence, MI is the right tool and Pearson is misleading.

Where mutual information shows up

Decision trees (C4.5, CART, XGBoost). The split criterion is "information gain" — the mutual information between the candidate split feature and the label distribution. The split with highest MI is selected, recursively, until a stopping criterion is met. ID3, C4.5, random forests, gradient-boosted trees all use this principle.
Feature selection. Choose features that maximise I(X_subset; Y) — the bits the features carry about the target. Implemented in scikit-learn as mutual_info_classif and mutual_info_regression. Beats correlation-based filters when feature-target relationships are nonlinear.
Contrastive self-supervised learning. InfoNCE (van den Oord et al., 2018) is the loss for SimCLR, MoCo, BYOL, CLIP and most modern representation learners. Minimising InfoNCE maximises a lower bound on the mutual information between two views of the same data point — the formal justification for "make augmentations of the same image embed close together."
Neural mutual-information estimators (MINE, CLUB, InfoNCE). When closed-form MI is intractable (high-dimensional X, Y), use a neural network as a variational approximator. MINE (Belghazi et al., 2018) gives a lower bound via Donsker-Varadhan; CLUB gives an upper bound. Used in disentanglement, information-bottleneck implementations and causality testing.
Information bottleneck. Tishby's framework treats representation learning as minimising I(T; X) subject to keeping I(T; Y) large — compress X while retaining everything about Y. The theoretical foundation for sufficient statistics, autoencoders and the deep-learning interpretation of representation hierarchies.
Causal inference and graphical models. Conditional mutual information I(X; Y | Z) = 0 tests conditional independence — the building block of causal-discovery algorithms (PC, FCI). DAG factorisations of joint distributions are determined by their CMI = 0 structure.
Communications channel capacity. Shannon's noisy-channel coding theorem says the channel capacity is C = sup_{p(x)} I(X; Y) — the maximum mutual information between input and output over all input distributions. Below capacity, error-correcting codes can achieve arbitrarily low error; above capacity, no code can. The principle behind every modern wireless and storage system.

Common pitfalls

Treating MI as a normalised similarity. MI is in bits or nats, not a unitless [0, 1] score. Use normalised mutual information (NMI = 2 · I(X;Y) / (H(X) + H(Y))) or adjusted mutual information (AMI) for clustering-quality comparisons.
Sample-size dependence. Empirical MI from histograms is biased upward by roughly (number of bins × number of bins − 1) / (2n) nats. With small samples and many bins, the noise dominates the signal — you'll see "spurious MI" even between independent variables. Use plug-in shrinkage estimators (Chao-Shen, Miller-Madow) or k-nearest-neighbour estimators (KSG, Kraskov et al.).
Binning artefacts in continuous variables. The choice of bin width strongly affects empirical MI for continuous data. Too few bins underestimates; too many overestimates. Cross-validate the bin width or use kernel density estimation instead.
Curse of dimensionality. Histogram-based MI converges as O(n^(−2/(d+4))) for d-dimensional X. In d = 20 you need astronomical sample sizes; neural variational estimators (MINE, InfoNCE) scale better but only give bounds, not exact values.
Conflating MI with causality. High mutual information does not imply causal influence — confounders Z can induce MI between X and Y with no direct relationship. Conditional mutual information I(X; Y | Z) and interventions are required to disentangle causality from correlation.
Using MI bounds as MI values. InfoNCE is a lower bound on MI, often loose. Reporting "InfoNCE = 5 bits" does not mean the true MI is 5 bits — it could be 10. Always cite whether you are reporting an estimator (typically a lower bound) versus a true value (rarely available outside toy examples).

Mutual information vs alternative dependence measures

	Mutual information	Pearson correlation	Spearman / Kendall	Distance correlation	HSIC
Captures nonlinear dependence	Yes — all kinds	No (only linear)	Monotonic only	Yes (all)	Yes (all)
Zero iff independent	Yes (provably)	No	No	Yes (provably)	Yes (with characteristic kernel)
Bounded	Bounded by min(H(X), H(Y))	[−1, 1]	[−1, 1]	[0, 1] after normalisation	Unbounded
Estimable from samples	Hard (need densities)	Trivially	Trivially	O(n² log n)	O(n²)
Closed-form for Gaussian	Yes (−½ log det Σ form)	Yes	No	Yes	Yes
Best for	Quantifying total dependence in bits	Quick linear screening	Robust monotonic checking	Independence testing	Independence testing in high d

Pearson is fast and parametric but blind to nonlinear structure. Spearman/Kendall handle monotonic nonlinearity. Distance correlation (Székely 2007) and HSIC (Gretton 2005) are kernel-based and characterise independence — like MI — but with cheaper estimation. Mutual information is the canonical information-theoretic measure but is the hardest to estimate in high dimensions; the others are practical tools that approximate or restrict it.

Frequently asked questions

How is mutual information different from correlation?

Pearson correlation measures linear dependence — the slope of the best straight-line fit. Mutual information measures every kind of dependence — linear, polynomial, periodic, chaotic, mediated by latent variables. Take Y = sin(X) for X uniform on [0, 2π]. Pearson correlation is exactly zero (the sine is symmetric); mutual information is large (Y is a deterministic function of X). Pearson is cheap and has a closed-form formula; mutual information requires either a parametric model or a density estimator. Both have their place — use Pearson for quick screening of linear relationships, mutual information when you suspect or care about nonlinear structure.

Why does I(X; Y) = 0 imply independence?

By definition I(X; Y) = D_KL(p(x, y)‖p(x)p(y)). The KL divergence between two distributions is zero if and only if those distributions are equal. So I(X; Y) = 0 iff p(x, y) = p(x)p(y) for all (x, y) — which is exactly the definition of independence. This is unlike correlation: correlation can be zero for dependent variables (e.g. Y = X² with X symmetric around zero). Mutual information is the strongest test of independence because it directly measures the divergence between joint and the independence assumption.

What is the chain rule for mutual information?

I(X; Y, Z) = I(X; Y) + I(X; Z | Y), where I(X; Z | Y) = H(X | Y) − H(X | Y, Z) is the conditional mutual information. Iterate this and you get I(X; Y₁, …, Y_n) = Σ I(X; Y_k | Y_<k) — exact decomposition into per-step contributions. This is the basis of greedy feature selection: pick the next feature whose conditional MI with the target is largest given features already selected. It also leads to the data-processing inequality: if X → Y → Z is a Markov chain, then I(X; Z) ≤ I(X; Y) — information cannot be created by post-processing.

Why is mutual information hard to estimate from samples?

Because it depends on densities. For continuous random variables you need to estimate p(x, y), p(x), and p(y) from finite samples — and the natural plug-in estimator has worst-case O(n^(−2/(d+4))) convergence rate, painfully slow in high dimensions. Modern neural estimators (MINE 2018, InfoNCE 2018, CLUB 2020) sidestep density estimation by optimising variational lower or upper bounds on MI using neural networks. They scale to high-dimensional X, Y but only provide bounds, not exact values. The standard practical rule: trust binned/k-NN estimators in low dimensions (d ≤ 5), use variational neural estimators when you only need a lower-bound objective for representation learning.

What is conditional mutual information?

I(X; Y | Z) is the mutual information between X and Y given Z — the reduction in uncertainty about X from knowing Y, after already knowing Z. Formally I(X; Y | Z) = H(X | Z) − H(X | Y, Z). If X ⊥ Y | Z (conditional independence), then I(X; Y | Z) = 0; this is how Bayesian networks and graphical models test independencies. CMI can be larger than unconditional MI (Z hides a relationship), smaller (Z explains away the dependence), or the same. It is the right tool when you want to measure whether a feature adds information beyond what is already known.

How is mutual information used in contrastive learning?

InfoNCE (van den Oord et al., 2018) is a lower bound on mutual information between two views of the data. With K-1 negative samples and one positive, the InfoNCE loss is −log[exp(f(x, y_+)) / Σ exp(f(x, y_k))] where f is a critic. Minimising this loss maximises a lower bound on I(X; Y_+). SimCLR, MoCo, BYOL, CLIP and most modern self-supervised methods are InfoNCE-based — they learn representations that maximise mutual information between augmentations of the same image, or between images and captions. The mutual-information bound is the theoretical underpinning of the entire "contrastive learning" field.

What is the information bottleneck?

Tishby's information bottleneck (1999) frames representation learning as: find a stochastic encoder p(T | X) that maximises I(T; Y) (T is informative about the label) while minimising I(T; X) (T discards everything else about X). The tradeoff is controlled by a Lagrange multiplier β: maximise I(T; Y) − β · I(T; X). The optimal T is a sufficient statistic of X for Y with minimal complexity. Recent work (Schwartz-Ziv & Tishby 2017) argues neural-net training itself implicitly performs information-bottleneck compression, though this interpretation remains contested.