Statistics
Wishart Distribution
Multivariate chi-squared — distribution of sample covariance matrices
W_p(n, Σ) is the distribution of A = XᵀX for n samples X ~ N_p(0, Σ). The multivariate analog of chi-squared. Powers MANOVA, portfolio risk, and Bayesian covariance.
- DefinitionA = XᵀX, X ∈ ℝ^(n×p), rows ~ N_p(0, Σ)
- MeanE[A] = nΣ
- Reduces toW_1(n, σ²) = σ²·χ²(n)
- Rank constraintn ≥ p for full-rank; else singular Wishart
- Conjugate priorInverse Wishart for Σ in multivariate normal
- Discovered byJohn Wishart, 1928
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The definition
Sample n independent random vectors x₁, x₂, …, xₙ from a multivariate normal N_p(0, Σ) — each is a p-dimensional Gaussian with mean 0 and p×p covariance Σ. Arrange them as rows of an n×p matrix X. The scatter matrix:
A = X^T X = Σᵢ xᵢ xᵢ^T
is a p×p positive semidefinite matrix. Its distribution is the Wishart with n degrees of freedom and scale matrix Σ:
A ~ W_p(n, Σ)
When n ≥ p, A is almost surely positive definite and has density:
f(A) = |A|^((n−p−1)/2) · exp(−tr(Σ⁻¹A)/2)
───────────────────────────────────────
2^(np/2) · |Σ|^(n/2) · Γ_p(n/2)
where Γ_p is the multivariate gamma function Γ_p(z) = π^(p(p−1)/4) ∏ᵢ₌₁ᵖ Γ(z − (i − 1)/2). The density lives on the cone of positive-definite p×p matrices.
When n < p, the matrix A has rank at most n < p, so it is singular and does not have a density on the full cone. This is the singular Wishart case, important when you have fewer samples than dimensions (genomics, finance, image processing).
The univariate case — recovering chi-squared
When p = 1, the random matrix is a single number. Set Σ = σ². The Wishart W_1(n, σ²) is the distribution of:
A = Σᵢ xᵢ² with each xᵢ ~ N(0, σ²)
Divide by σ² to get the scaled sum Σᵢ (xᵢ/σ)² = Σᵢ Zᵢ², which is exactly χ²(n). So:
W_1(n, σ²) = σ² · χ²(n)
The Wishart is the multivariate analog of chi-squared — keeping track of cross-products as well as squares.
Moments and worked example
The mean and (component-wise) variance:
E[A] = n · Σ
Var(Aᵢⱼ) = n · (Σᵢⱼ² + Σᵢᵢ Σⱼⱼ)
For diagonal entries Var(Aᵢᵢ) = 2n·Σᵢᵢ² — matching the univariate chi-squared.
Worked example. Take p = 2 with true covariance Σ = ((1, 0.5), (0.5, 1)). Draw n = 30 samples from N_2(0, Σ). The Wishart prediction:
- E[A] = 30·Σ = ((30, 15), (15, 30))
- Var(A₁₁) = 30·(1 + 1) = 60, so SD(A₁₁) = √60 ≈ 7.75. Typical A₁₁ ≈ 30 ± 7.75.
- Var(A₁₂) = 30·(0.25 + 1) = 37.5, SD(A₁₂) ≈ 6.12. Typical A₁₂ ≈ 15 ± 6.12.
- Sample covariance Ŝᵢⱼ = Aᵢⱼ/n with SD(Ŝ₁₂) ≈ 6.12/30 ≈ 0.20 — so the off-diagonal estimate jitters around the true 0.5 with standard deviation 0.2.
The Wishart density quantifies how much sample covariances vary around the true Σ for any n and p.
MANOVA — testing multivariate means
With p-dimensional response vectors yᵢⱼ in k groups, multivariate ANOVA decomposes the total scatter T = W + B into within-group scatter W and between-group scatter B. Under the null hypothesis of equal group means:
W ~ W_p(N − k, Σ) (independent of)
B ~ W_p(k − 1, Σ)
where Σ is the common within-group covariance. Test statistics:
| Statistic | Formula | Distribution under null | Strength |
|---|---|---|---|
| Wilks lambda | Λ = |W| / |W + B| | Wilks Λ_(p, k−1, N−k) | Likelihood ratio; standard |
| Pillai trace | tr(B(B + W)⁻¹) | Bartlett-corrected χ² | Most robust to violations |
| Hotelling-Lawley | tr(BW⁻¹) | Bartlett-corrected χ² | Powerful when effects similar |
| Roy's largest root | λ_max(BW⁻¹) | Distribution of largest eigenvalue | Best when one effect dominates |
All four reduce to a function of the eigenvalues of BW⁻¹ — themselves random because W is Wishart.
Inverse Wishart — conjugate prior for covariance
If A ~ W_p(n, Σ), then A⁻¹ ~ W_p⁻¹(n, Σ⁻¹) — the inverse Wishart distribution with density:
f(A) = |A|^(−(n+p+1)/2) · exp(−tr(Σ A⁻¹)/2)
─────────────────────────────────────
2^(np/2) · |Σ|^(−n/2) · Γ_p(n/2)
In Bayesian inference for a multivariate normal model with unknown covariance Σ:
Prior: Σ ~ W_p⁻¹(ν, Ψ)
Likelihood: x_i | Σ ~ N_p(μ, Σ)
Posterior: Σ | x_1,...,x_n ~ W_p⁻¹(ν + n, Ψ + S)
where S = Σᵢ (xᵢ − x̄)(xᵢ − x̄)^T
The hyperparameters ν (prior degrees of freedom) and Ψ (prior scale matrix) act as pseudo-counts and pseudo-scatter — just like α and β in a Beta prior, but matrix-valued. Common defaults: ν = p + 2 (weakly informative, ensures finite mean), Ψ = (ν − p − 1)·Σ_prior.
The full Normal-Inverse-Wishart family (NIW) is the standard conjugate prior for jointly modeling (μ, Σ) of a multivariate normal:
(μ, Σ) ~ NIW(μ_0, κ_0, ν_0, Ψ_0)
means: μ | Σ ~ N_p(μ_0, Σ/κ_0)
Σ ~ W_p⁻¹(ν_0, Ψ_0)
High dimensions — connection to Marchenko-Pastur
The Wishart eigenvalues are tied to the Marchenko-Pastur distribution. With Σ = I_p (white Wishart) and Σ as identity, the eigenvalues of A/n (the sample covariance) for large n, p with c = p/n ∈ (0, 1] have empirical density:
ρ(λ) = (1/(2πcλ)) √((λ₊ − λ)(λ − λ₋)) on [λ₋, λ₊]
λ± = (1 ± √c)²
When c = 0.1 (n = 10p), the bulk lives on [0.51, 1.62] — already a 3× spread despite Σ = I. When c = 0.5 (n = 2p), the bulk is [0.086, 2.91] — a 34× spread. When c approaches 1, λ₋ → 0 and the spectrum has a hard edge at zero.
This bias is huge in modern data. If you compute the inverse covariance from a sample with p = 100 and n = 200 (c = 0.5), the smallest eigenvalue is biased toward 0.086 instead of 1, so the inverse blows it up to ~12 instead of 1. Naive Σ⁻¹ multiplies noise by an order of magnitude.
Where Wishart appears
- MANOVA and discriminant analysis. Tests for differences in mean vectors across groups. Wishart underlies every multivariate F-like test.
- Bayesian covariance inference. The inverse Wishart is the conjugate prior for the covariance of a multivariate normal. NIW priors are used in Gaussian mixture models, Bayesian factor analysis, multivariate regression.
- Portfolio optimization. Mean-variance optimization needs Σ⁻¹. The Wishart noise in sample Σ explains why naive optimal portfolios perform poorly out-of-sample.
- Signal processing. Beamforming, MIMO channel estimation, radar — all involve sample covariance matrices from sensor arrays. Detection probabilities reduce to Wishart eigenvalue tail probabilities.
- Genomics. Co-expression networks, gene-gene correlations from RNA-seq data — sample covariance with p ≫ n (often p = 20,000 genes, n = 100 samples). Singular Wishart regime.
- Climate science. Principal component analysis of weather fields. Each station is a coordinate; covariance is the spatial correlation structure.
- Latent factor models. Factor analysis, probabilistic PCA, and Bayesian PCA all use the Wishart family to model covariance.
- Quantum mechanics. The reduced density matrix of a subsystem in a random pure state follows a Wishart-related distribution (Page's curve).
Python — sampling and using the Wishart
import numpy as np
from scipy.stats import wishart, invwishart
# Sample a Wishart matrix
p, n = 3, 20
Sigma = np.array([[1.0, 0.3, 0.1],
[0.3, 1.0, 0.5],
[0.1, 0.5, 1.0]])
# Via scipy
A = wishart.rvs(df=n, scale=Sigma)
print('A shape:', A.shape) # (3, 3)
print('A symmetric:', np.allclose(A, A.T))
print('A positive definite:', np.all(np.linalg.eigvalsh(A) > 0))
# Via direct construction — sample n vectors, form XᵀX
X = np.random.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=n)
A_direct = X.T @ X
print('Mean comparison: E[A] should be nΣ =', n * Sigma)
# Confirm via many draws
draws = [wishart.rvs(df=n, scale=Sigma) for _ in range(5000)]
mean_A = np.mean(draws, axis=0)
print('Empirical mean A:')
print(mean_A)
print('Theoretical nΣ:')
print(n * Sigma)
# Inverse Wishart for Bayesian covariance update
nu_0, Psi_0 = 5, np.eye(p) * 2
posterior_iw = invwishart(df=nu_0 + n, scale=Psi_0 + A_direct)
print('Posterior mean Σ (E[X] = Ψ/(ν - p - 1)):')
print((Psi_0 + A_direct) / (nu_0 + n - p - 1))
Common pitfalls
- Confusing scale matrix and covariance. The Wishart's scale Σ is not E[A]; rather E[A] = nΣ. If you fit a Wishart by moment matching, divide observed A by n.
- Singular Wishart when n < p. Sample covariance is rank-deficient and not invertible. Use shrinkage estimators (Ledoit-Wolf) or factor models.
- Inverting sample covariance directly. Even when n > p, the inverse amplifies the smallest sample eigenvalue, which is biased downward by Marchenko-Pastur. Always shrink before inverting in high dimensions.
- Wishart vs Wishart of A/n. The Wishart describes the scatter matrix A, not the sample covariance S = A/n. Wishart of n·S has parameters (n, Σ); S itself has a rescaled Wishart.
- Wrong df in inverse Wishart prior. Inverse Wishart W_p⁻¹(ν, Ψ) has finite mean only if ν > p + 1. Setting ν = p + 1 makes the mean undefined; ν = p + 2 is the standard weakly-informative default.
- Ignoring correlation in entries. Entries of a Wishart matrix are correlated — you cannot model A by independent chi-squareds on the diagonal. The full joint structure matters for goodness-of-fit and asymptotic tests.
History
John Wishart introduced the distribution in his 1928 paper "The generalised product moment distribution in samples from a normal multivariate population," published in Biometrika. He derived it as the joint distribution of sample variances and covariances. R. A. Fisher had been edging toward the same result, but Wishart's paper was the first complete treatment. The distribution became central to multivariate statistics through the work of S. S. Wilks (Wilks's lambda, 1932), Harold Hotelling (T² distribution, 1931), and C. R. Rao (multivariate tests, 1948). Marchenko and Pastur derived the high-dimensional limiting spectral distribution in 1967, opening the door to the modern statistics of small-n-large-p data. Olkin and Rubin formalized the inverse Wishart conjugacy for Bayesian inference in 1962, and the Normal-Inverse-Wishart became the workhorse prior for multivariate Bayesian models.
Frequently asked questions
What is the Wishart distribution intuitively?
Sample n independent vectors x₁, …, xₙ from a multivariate normal N_p(0, Σ) — so each xᵢ is a p-dimensional Gaussian vector with covariance Σ. Stack them as rows of an n×p matrix X. Form the scatter matrix A = XᵀX. The distribution of A — over the cone of p×p positive-semidefinite matrices — is W_p(n, Σ). Equivalently, if S = A/n is the sample covariance, then n·S ~ W_p(n, Σ). It's the natural distribution of sums of outer products of Gaussian vectors. When p = 1 (univariate), Wishart reduces to chi-squared: W_1(n, σ²) = σ²·χ²(n).
How is Wishart used in MANOVA?
In Multivariate ANOVA you have p-dimensional response vectors organized into k groups, and you ask whether the group means differ. Under the null (equal means), the within-group scatter matrix W follows W_p(N − k, Σ) and the between-group scatter B follows an independent Wishart. The Wilks lambda test statistic Λ = |W|/|W + B| has a distribution (Wilks lambda) derived from these two independent Wisharts. Pillai's trace is most robust under non-normality, Wilks is the likelihood ratio, Roy's largest eigenvalue has the highest power when the alternative is a single dominant difference.
What is the inverse Wishart and why is it useful?
If A ~ W_p(n, Σ) then A⁻¹ follows the inverse Wishart distribution W_p⁻¹(n, Σ⁻¹). The inverse Wishart is the conjugate prior for the covariance matrix Σ of a multivariate normal in Bayesian inference. Prior IW(ν, Ψ) on Σ, combined with n samples from N_p(μ, Σ), produces posterior IW(ν + n, Ψ + S) where S is the sample scatter. The update is simple addition. The Normal-Inverse-Wishart family is the standard conjugate prior for jointly modeling unknown mean and covariance.
Why does sample covariance need n > p to be invertible?
Sample covariance S = XᵀX/n is a rank-min(n, p) matrix. When n < p the rank is at most n < p, so S is singular and cannot be inverted. When n is only marginally larger than p, S is invertible but ill-conditioned — small eigenvalues are wildly biased downward by Marchenko-Pastur, and inverting amplifies the noise. The classical solution is shrinkage: instead of using S, use S_shrunk = (1 − α)S + α·(tr(S)/p)·I_p with α chosen by Ledoit-Wolf cross-validation.
What's the mean of a Wishart matrix?
If A ~ W_p(n, Σ), then E[A] = nΣ. So the sample scatter is an unbiased estimate of n times the true covariance, and sample covariance S = A/n is unbiased for Σ. Variance: Var(Aᵢⱼ) = n(Σᵢⱼ² + ΣᵢᵢΣⱼⱼ). For the diagonal entries Aᵢᵢ — themselves chi-squared distributed with n degrees of freedom and scale Σᵢᵢ — Var(Aᵢᵢ) = 2nΣᵢᵢ², agreeing with the univariate chi-squared.
How does Wishart relate to random matrix theory?
When n, p → ∞ with p/n → c ∈ (0, 1], the eigenvalues of the standardized Wishart matrix A/n (with Σ = I) follow the Marchenko-Pastur distribution: density (1/(2πcλ))√((λ₊ − λ)(λ − λ₋)) on [λ₋, λ₊] with λ± = (1 ± √c)². It explains why sample covariance eigenvalues are biased even when n is large but comparable to p. Marchenko-Pastur underpins modern high-dimensional statistics — Ledoit-Wolf shrinkage, sparse covariance estimation, and the detection of signal eigenvalues that lie outside the bulk.
How is Wishart used in portfolio theory?
Modern portfolio theory's mean-variance optimization solves max wᵀμ − (λ/2)·wᵀΣw subject to wᵀ1 = 1, where Σ is the covariance of asset returns. The solution involves Σ⁻¹, but in practice you only have the sample covariance S. The Wishart distribution describes the sampling error in S, and inverse Wishart describes how badly biased Σ⁻¹ estimates can be. Empirically, naive mean-variance portfolios on raw sample covariance perform terribly out-of-sample. Wishart-aware shrinkage and Bayesian portfolio methods are now standard in quantitative finance.