Statistical Inference
Fisher Information
The curvature of the log-likelihood at the truth — and the variance floor it sets on every estimator
I(θ) = E[(∂ log f / ∂θ)²] = −E[∂² log f / ∂θ²] — how sharply the log-likelihood peaks. Sets Var(θ̂) ≥ 1/(n·I(θ)): the Cramér-Rao bound.
- Definition (variance of score): I(θ) = E[(∂ log f / ∂θ)²]
- Definition (curvature): I(θ) = − E[∂² log f / ∂θ²]
- Cramér-Rao bound: Var(θ̂) ≥ 1 / (n · I(θ)) for unbiased θ̂
- Normal example: I(μ) = 1/σ² for 𝒩(μ, σ²)
- Jeffreys prior: p(θ) ∝ √det I(θ)
- Introduced: R. A. Fisher, 1922
Two equivalent formulas
Fix a parametric model f(x; θ) and let ℓ(θ; x) = log f(x; θ) be the log-likelihood at a single observation. The score is the derivative U(θ; x) = ∂ℓ/∂θ — it measures how sharply the log-likelihood responds to a perturbation of θ. Under regularity conditions, the score has expectation zero at the true parameter: E_θ[U(θ; X)] = 0. The Fisher information is its variance:
I(θ) = Var_θ[ U(θ; X) ] = E_θ[ (∂ log f / ∂θ)² ].
By a calculation that differentiates the normalisation ∫ f dx = 1 twice w.r.t. θ, this equals the negative expected second derivative:
I(θ) = − E_θ[ ∂² log f / ∂θ² ] = − E_θ[ ℓ''(θ; X) ].
Both forms agree under mild regularity. The first (squared score) is easier to estimate from samples by Monte Carlo. The second (Hessian) makes the geometric interpretation transparent: I(θ) is the average curvature of the log-likelihood at the truth. A sharply peaked log-likelihood (large second derivative in magnitude) corresponds to high Fisher information; a flat log-likelihood corresponds to low information.
For n independent, identically distributed samples, information adds: I_n(θ) = n · I(θ). This is the formal statement that more data carries more information about the parameter.
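As a quick numerical check of the two formulas, the sketch below evaluates both for a Bernoulli(p) model, where every expectation reduces to a two-term sum. The Bernoulli model, the value p = 0.3, and the helper names (`bernoulliPmf`, `expectBernoulli`) are illustrative assumptions, not part of the text above.

```javascript
// Bernoulli(p): f(x; p) = p^x (1 - p)^(1 - x), with x in {0, 1}
const bernoulliPmf = (x, p) => (x === 1 ? p : 1 - p);

// First and second derivatives of log f with respect to p
const score = (x, p) => x / p - (1 - x) / (1 - p);
const secondDerivative = (x, p) =>
  -x / (p * p) - (1 - x) / ((1 - p) * (1 - p));

// Exact expectation of g(X, p) over the two-point support
function expectBernoulli(g, p) {
  return [0, 1].reduce((acc, x) => acc + bernoulliPmf(x, p) * g(x, p), 0);
}

const p = 0.3;
const meanScore = expectBernoulli(score, p);                          // should be ~0
const infoFromScore = expectBernoulli((x, q) => score(x, q) ** 2, p); // E[score^2]
const infoFromCurvature = -expectBernoulli(secondDerivative, p);      // -E[second derivative]
const closedForm = 1 / (p * (1 - p));                                 // known result for Bernoulli

console.log({ meanScore, infoFromScore, infoFromCurvature, closedForm });
```

All three information values should agree up to floating-point rounding, and the mean score should be zero to the same precision.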
Worked example — Fisher information of the normal mean
Take X ∼ 𝒩(μ, σ²) with σ² known. The log-density:
log f(x; μ) = − ½ log(2πσ²) − (x − μ)² / (2σ²).
Differentiate once w.r.t. μ — the score:
U(μ; x) = ∂ log f / ∂μ = (x − μ) / σ².
The expectation of the score is E[X − μ] / σ² = 0 — confirming the regularity property. Differentiate again to get the second derivative:
∂² log f / ∂μ² = − 1 / σ² (constant, independent of x).
Take negative expectation:
I(μ) = − E[ −1/σ² ] = 1 / σ².
So the Fisher information for the mean of a normal distribution is the inverse variance. Cramér-Rao with n samples:
Var(μ̂) ≥ 1 / (n · I(μ)) = σ² / n.
The sample mean μ̂ = (1/n) Σ X_i has variance exactly σ²/n — it achieves the Cramér-Rao bound. So the sample mean is the minimum-variance unbiased estimator (MVUE) of μ; no unbiased estimator can do better. This is the textbook reason for using x̄ rather than the median (or a midhinge, or a trimmed mean) when estimating a normal mean.
Numerical concreteness: with σ = 5 and n = 100, the Cramér-Rao bound gives Var(μ̂) ≥ 25/100 = 0.25, so the standard error is at least 0.5. With σ = 1 and the same n, the bound on the standard error drops to 0.1, a fivefold gain in precision from less noisy data. Doubling n halves the variance bound and shrinks the standard error by a factor of √2.
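The comparison between the sample mean and the median can be illustrated by simulation. The sketch below is a rough check rather than a proof: the seeded generator (written in the style of the public-domain mulberry32 routine), the seed, and the settings σ = 2, n = 25, 20,000 replications are all assumptions chosen for illustration.

```javascript
// Small seeded PRNG (mulberry32 style) so the run is reproducible.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    let t = (a = (a + 0x6d2b79f5) | 0);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Box-Muller normal sampler driven by the seeded generator
function makeNormalSampler(rng) {
  return (mu, sigma) => {
    const u = 1 - rng(); // in (0, 1], avoids log(0)
    const v = rng();
    return mu + sigma * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
  };
}

function variance(values) {
  const m = values.reduce((a, b) => a + b, 0) / values.length;
  return values.reduce((a, b) => a + (b - m) ** 2, 0) / (values.length - 1);
}

function median(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

const rnorm = makeNormalSampler(mulberry32(12345));
const mu = 0, sigmaSim = 2, n = 25, reps = 20000;
const means = [], medians = [];
for (let r = 0; r < reps; r++) {
  const sample = Array.from({ length: n }, () => rnorm(mu, sigmaSim));
  means.push(sample.reduce((a, b) => a + b, 0) / n);
  medians.push(median(sample));
}

const bound = (sigmaSim * sigmaSim) / n; // Cramer-Rao bound sigma^2 / n
const varMean = variance(means);
const varMedian = variance(medians);
console.log({ bound, varMean, varMedian });
```

The simulated variance of the mean should sit close to the bound, while the median's variance should be noticeably larger (asymptotically by a factor of about π/2 for normal data).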
The geometric picture
Plot the log-likelihood ℓ(θ; x) as a function of θ at a fixed observation x. At the maximum (the MLE), the slope is zero and the second derivative is negative. The magnitude of that second derivative — the curvature at the peak — is the observed information J(θ̂). Its expectation under the model is the expected Fisher information I(θ).
A sharp peak: large curvature, large I, precise estimator. A small perturbation of θ away from the truth drops the log-likelihood significantly, so the MLE is well-localised. A flat peak: small curvature, small I, sloppy estimator. The log-likelihood barely changes as you wander away from the truth, so the MLE wanders too.
This connects directly to the standard error of an MLE. The asymptotic distribution of the MLE under regularity conditions is
√n · (θ̂_n − θ) ⟶ 𝒩(0, I(θ)⁻¹)
so an approximate 95% confidence interval is θ̂ ± 1.96 · √(1 / (n · I(θ̂))). On real data, software typically reports 1/√J(θ̂) as the "standard error", where J(θ̂) is the observed information of the full sample: the negative Hessian of the log-likelihood at the MLE (in the multi-parameter case, the square roots of the diagonal of its inverse).
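To make the curvature-to-standard-error step concrete, here is a minimal sketch for a Poisson rate. The data values, the Poisson model, and the finite-difference step are assumptions for illustration; real software usually obtains the Hessian analytically or by automatic differentiation.

```javascript
const counts = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]; // hypothetical count data

// Poisson log-likelihood in lambda, dropping the constant -sum(log x_i!)
function logLik(lambda) {
  return counts.reduce((acc, x) => acc + x * Math.log(lambda) - lambda, 0);
}

// Central finite difference for the second derivative of f at t
function secondDerivative(f, t, h = 1e-4) {
  return (f(t + h) - 2 * f(t) + f(t - h)) / (h * h);
}

// For the Poisson model the MLE is the sample mean
const lambdaHat = counts.reduce((a, b) => a + b, 0) / counts.length;

// Observed information of the whole sample: minus the curvature at the MLE
const observedInfo = -secondDerivative(logLik, lambdaHat);
const standardError = 1 / Math.sqrt(observedInfo);
const ci = [lambdaHat - 1.96 * standardError, lambdaHat + 1.96 * standardError];

// Closed form for comparison: J(lambdaHat) = n / lambdaHat
const closedFormInfo = counts.length / lambdaHat;
console.log({ lambdaHat, observedInfo, closedFormInfo, standardError, ci });
```

Because `observedInfo` is computed from the full-sample log-likelihood, it already includes the factor n, so no further division by n is needed.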
The Cramér-Rao lower bound
Cramér (1946) and Rao (1945) independently proved: for any unbiased estimator T(X) of θ based on n iid samples from f(·; θ),
Var(T) ≥ 1 / (n · I(θ)).
The proof is a Cauchy-Schwarz inequality applied to the score and the centred estimator. The bound is achieved (with equality) by the MLE asymptotically — Fisher's "asymptotic efficiency" property. For finite n, biased estimators can have smaller mean-squared error (by trading bias for reduced variance), but no unbiased estimator can have smaller variance than the Cramér-Rao floor.
Practical consequences. (1) Standard errors of regression coefficients come from inverting the observed Fisher information matrix. (2) Hypothesis tests use the asymptotic normality of (θ̂ − θ₀)/√(I(θ̂)⁻¹/n) ∼ 𝒩(0, 1) — the Wald statistic. (3) Power calculations require knowing I(θ) under the alternative, which determines the asymptotic detectable effect size.
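A small sketch of the Wald statistic from point (2), for a Bernoulli proportion. The data (62 successes in 100 trials) and the null value 0.5 are hypothetical, and the 1.96 cutoff assumes a two-sided test at the 5% level under the asymptotic normal approximation.

```javascript
// Hypothetical data: 62 successes in 100 Bernoulli trials, testing H0: p = 0.5
const successes = 62, trials = 100, pNull = 0.5;

const pHat = successes / trials;                      // MLE of p
const infoPerObs = 1 / (pHat * (1 - pHat));           // Bernoulli I(p), evaluated at pHat
const seWald = Math.sqrt(1 / (trials * infoPerObs));  // sqrt( I(pHat)^-1 / n )
const z = (pHat - pNull) / seWald;                    // Wald statistic
const rejectAt5Percent = Math.abs(z) > 1.96;          // two-sided asymptotic cutoff

console.log({ pHat, seWald, z, rejectAt5Percent });
```

Here z comes out near 2.47, so the null value p = 0.5 would be rejected at the 5% level under these assumptions.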
The Fisher information matrix
For multi-dimensional θ = (θ₁, …, θ_k), define the k × k Fisher information matrix
I(θ)_ij = E[ (∂ log f / ∂θ_i) · (∂ log f / ∂θ_j) ]
= − E[ ∂² log f / (∂θ_i ∂θ_j) ].
For 𝒩(μ, σ²) with both parameters unknown:
I(μ, σ²) = [ 1/σ², 0 ; 0, 1/(2σ⁴) ] = diag( 1/σ², 1/(2σ⁴) ).
The off-diagonal entry is zero, so the mean and variance estimates are asymptotically uncorrelated. Inverting the matrix and dividing by n gives the Cramér-Rao bounds Var(μ̂) ≥ σ²/n and Var(σ̂²) ≥ 2σ⁴/n for unbiased estimators, which are also the asymptotic variances of the MLEs. The matrix is the natural metric on the parameter manifold: natural-gradient descent (Amari 1998) pre-multiplies the gradient by I(θ)⁻¹ to align steps with the geometry of distributions rather than the arbitrary Cartesian parameter space.
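The 2 × 2 matrix for 𝒩(μ, σ²) can be checked numerically as the expected outer product of the score vector. The sketch below uses a trapezoid rule on μ ± 10σ instead of Monte Carlo so that the result is deterministic; the parameter values, grid size, and function names are illustrative assumptions.

```javascript
const normalPdf = (x, mu, s2) =>
  Math.exp(-((x - mu) ** 2) / (2 * s2)) / Math.sqrt(2 * Math.PI * s2);

// Score vector of log f with respect to (mu, sigma^2); s2 denotes sigma^2
const scoreVector = (x, mu, s2) => [
  (x - mu) / s2,
  -1 / (2 * s2) + (x - mu) ** 2 / (2 * s2 * s2),
];

// E[score score^T] by the trapezoid rule on [mu - 10 sd, mu + 10 sd]
function fisherMatrixNormal(mu, s2, steps = 4000) {
  const sd = Math.sqrt(s2);
  const lo = mu - 10 * sd;
  const h = (20 * sd) / steps;
  const info = [[0, 0], [0, 0]];
  for (let k = 0; k <= steps; k++) {
    const x = lo + k * h;
    const weight = (k === 0 || k === steps ? 0.5 : 1) * h * normalPdf(x, mu, s2);
    const u = scoreVector(x, mu, s2);
    for (let i = 0; i < 2; i++) {
      for (let j = 0; j < 2; j++) info[i][j] += weight * u[i] * u[j];
    }
  }
  return info;
}

function invert2x2([[a, b], [c, d]]) {
  const det = a * d - b * c;
  return [[d / det, -b / det], [-c / det, a / det]];
}

const s2 = 4, nObs = 50;
const infoMatrix = fisherMatrixNormal(1.5, s2);
// Cramer-Rao bounds for n observations: entries of I^-1 divided by n
const crBounds = invert2x2(infoMatrix).map((row) => row.map((v) => v / nObs));
console.log(infoMatrix, crBounds);
```

With σ² = 4 the diagonal should be close to 1/σ² = 0.25 and 1/(2σ⁴) = 0.03125, and with n = 50 the bounds should be close to σ²/n = 0.08 and 2σ⁴/n = 0.64.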
JavaScript — computing Fisher information
// Expected Fisher information by Monte Carlo
// I(theta) = E_theta[ (d log f / d theta)^2 ]
function fisherMonteCarlo(scoreFn, samplerFn, theta, nSamples = 100_000) {
  let sum = 0;
  for (let i = 0; i < nSamples; i++) {
    const x = samplerFn(theta);
    const s = scoreFn(x, theta);
    sum += s * s;
  }
  return sum / nSamples;
}

// Normal mean: score = (x - mu) / sigma^2; Fisher = 1 / sigma^2
function normalScore(x, mu, sigma = 1) {
  return (x - mu) / (sigma * sigma);
}

// Box-Muller draw from N(mu, sigma^2); 1 - Math.random() keeps u in (0, 1]
function normalSampler(mu, sigma = 1) {
  const u = 1 - Math.random(), v = Math.random();
  return mu + sigma * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

const sigma = 1;
const I_hat = fisherMonteCarlo(
  (x, mu) => normalScore(x, mu, sigma),
  (mu) => normalSampler(mu, sigma),
  0, // theta = mu (irrelevant by location invariance)
  200_000
);
console.log(I_hat); // ≈ 1.000 (true value: 1/sigma^2 = 1)

// Cramer-Rao bound for n samples
function cramerRao(I, n) {
  return 1 / (n * I);
}
console.log(cramerRao(1, 100)); // 0.01, so SE >= 0.1
console.log(cramerRao(1 / 25, 100)); // ≈ 0.25, the sigma = 5 case
Where Fisher information shows up
- Standard errors in regression. R's summary.glm(), Stata's reg, and Python's statsmodels.OLS() all report standard errors as the square roots of the diagonal of the inverse information matrix. The Fisher information matrix at the MLE is what makes confidence intervals possible.
- Cramér-Rao bound and minimum-variance estimators. Any time you need to know "is this estimator efficient?" you compute its variance and compare to 1/(n · I(θ)). The sample mean for a normal, the sample proportion for a binomial, and the MLE of the mean parameter in a one-parameter exponential family all hit the bound exactly; other MLEs generally attain it only asymptotically.
- Asymptotic theory of MLE. The CLT-style result √n(θ̂ − θ) → 𝒩(0, I⁻¹) underpins Wald tests, score tests, likelihood ratio tests, AIC and BIC, profile likelihoods, and the entire framework of asymptotic inference for parametric models.
- Optimal experimental design. Pick design points to maximise det I(θ) (D-optimality), minimise trace I(θ)⁻¹ (A-optimality), or maximise the smallest eigenvalue of I(θ) (E-optimality). The 'best' clinical trial allocates patients to dose levels that maximise the Fisher information about the dose-response parameter, minimising the sample size required for a given precision.
- Jeffreys prior in Bayesian inference. p(θ) ∝ √det I(θ) is the unique reparameterisation-invariant 'objective' prior. For a binomial proportion p, Jeffreys prior is Beta(½, ½); for a Poisson rate, it is Gamma(½, 0). Used in objective Bayesian analyses when no informative prior is available.
- Natural gradient descent (Amari). Pre-multiply the gradient by I(θ)⁻¹ before taking a step. Steps respect the geometry of distributions: a step of length 0.1 in natural-gradient space changes the distribution by approximately the same KL divergence regardless of where on the manifold you are. The basis of TRPO and modern policy-gradient algorithms; for neural nets, K-FAC approximates the Fisher matrix, and Shampoo builds a related preconditioner from gradient statistics.
- Information geometry. The Fisher matrix is the Riemannian metric on the manifold of probability distributions. Geodesics are shortest paths between distributions; α-divergences (Amari) interpolate between KL and other f-divergences. The foundation of dual coordinate systems, exponential families, and modern statistical geometry.
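The natural-gradient item above can be illustrated on the simplest possible case, the normal mean with known σ. This is a toy sketch under stated assumptions: the data are hypothetical, the objective is the average log-likelihood (so the update is written as ascent rather than descent), and the helper `ascend` is invented for the example.

```javascript
// Hypothetical sample; sigma is treated as known.
const data = [9.1, 10.4, 11.2, 9.8, 10.5];
const xBar = data.reduce((a, b) => a + b, 0) / data.length;

// Gradient of the average log-likelihood in mu: (xBar - mu) / sigma^2
const gradient = (mu, sigma) => (xBar - mu) / (sigma * sigma);
// Per-observation Fisher information for the normal mean: 1 / sigma^2
const fisher = (sigma) => 1 / (sigma * sigma);

// Plain or natural gradient ascent on mu
function ascend(mu0, sigma, learningRate, steps, natural) {
  let mu = mu0;
  for (let k = 0; k < steps; k++) {
    const g = gradient(mu, sigma);
    mu += learningRate * (natural ? g / fisher(sigma) : g);
  }
  return mu;
}

for (const sigma of [0.5, 5]) {
  const plain = ascend(0, sigma, 1, 1, false);  // step size depends strongly on sigma
  const natural = ascend(0, sigma, 1, 1, true); // lands on xBar for any sigma
  console.log({ sigma, plain, natural, xBar });
}
```

With learning rate 1, the plain step overshoots when σ is small and barely moves when σ is large, while the natural-gradient step reaches the MLE x̄ in one update for either σ, because dividing by I(μ) cancels the 1/σ² scaling.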
Common pitfalls
- Confusing observed and expected information. Observed J(θ̂) is the Hessian of the log-likelihood at the MLE on the actual data. Expected I(θ) is an integral over the model. They agree asymptotically. Software typically reports observed; theory typically discusses expected. Bayesian-frequentist matching usually prefers expected.
- Trusting Cramér-Rao for biased estimators. The bound applies to unbiased estimators. Biased estimators (shrinkage, James-Stein, MAP under informative priors) can beat the bound in MSE by trading bias for variance. Never report "Cramér-Rao says my variance can't be lower" without checking unbiasedness.
- Forgetting regularity conditions. Fisher's framework breaks when the support depends on θ (e.g. estimating the upper bound of a uniform distribution), or when the log-likelihood is non-differentiable at the truth. In such models the MLE need not be asymptotically normal, and the Cramér-Rao bound is only guaranteed for "regular" models: for Uniform(0, θ) the variance of the sample maximum shrinks like 1/n² rather than 1/n.
- Ignoring the parameterisation. Fisher information transforms under reparameterisation: I_φ = I_θ · (dθ/dφ)². So I(σ) ≠ I(σ²). Standard errors computed on σ scale and back-transformed to σ² scale differ from those computed directly. Choose the parameterisation where the asymptotic normality is most accurate.
- Singular information matrix. If two parameters are confounded (perfectly collinear features in regression), the Fisher matrix is singular and not invertible — no asymptotic variance exists. Identify and remove the redundancy before estimation.
- Plug-in from small samples. Replacing I(θ) by I(θ̂) is asymptotically valid but can be inaccurate for small n. The sandwich (Huber-White, "robust") standard errors offered by modern software address a different problem, model misspecification, and do not by themselves remove small-sample error.
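The reparameterisation rule from the list above, I_φ = I_θ · (dθ/dφ)², can be checked numerically for the normal scale with known mean, taking θ = σ² and φ = σ. The sketch below is illustrative only: the value σ = 1.5, the trapezoid grid, and the function names are assumptions.

```javascript
const normalDensity = (x, mu, sd) =>
  Math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * Math.sqrt(2 * Math.PI));

// Scores of log f for N(mu, sd^2) with mu known, in two parameterisations
const scoreWrtSd = (x, mu, sd) => -1 / sd + (x - mu) ** 2 / sd ** 3;
const scoreWrtVar = (x, mu, sd) =>
  -1 / (2 * sd * sd) + (x - mu) ** 2 / (2 * sd ** 4);

// E[score^2] by the trapezoid rule on [mu - 10 sd, mu + 10 sd]
function expectedSquaredScore(scoreFn, mu, sd, steps = 4000) {
  const lo = mu - 10 * sd;
  const h = (20 * sd) / steps;
  let total = 0;
  for (let k = 0; k <= steps; k++) {
    const x = lo + k * h;
    const weight = k === 0 || k === steps ? 0.5 : 1;
    total += weight * h * normalDensity(x, mu, sd) * scoreFn(x, mu, sd) ** 2;
  }
  return total;
}

const sd = 1.5;
const infoSd = expectedSquaredScore(scoreWrtSd, 0, sd);   // closed form: 2 / sd^2
const infoVar = expectedSquaredScore(scoreWrtVar, 0, sd); // closed form: 1 / (2 sd^4)
const jacobian = 2 * sd;                                  // d(sd^2) / d(sd)
const transformed = infoVar * jacobian ** 2;              // should match infoSd

console.log({ infoSd, infoVar, transformed });
```

The two informations differ by roughly a factor of nine at σ = 1.5, yet they agree once the squared Jacobian is applied.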
Fisher information vs related concepts
| | Fisher information | Shannon entropy | Mutual information | KL divergence | Observed information |
|---|---|---|---|---|---|
| What it measures | Curvature of log-likelihood at θ | Uncertainty of one variable | Dependence between two variables | Discrepancy between two distributions (asymmetric, not a metric) | Sample-specific curvature at MLE |
| Function of | Parameter θ | Distribution P | Joint of (X, Y) | Pair (P, Q) | Data and MLE θ̂ |
| Units | Inverse parameter² (variance⁻¹) | Bits or nats | Bits or nats | Bits or nats | Inverse parameter² |
| Sets a lower bound on | Estimator variance (Cramér-Rao) | Lossless code length (source coding) | Channel capacity | Approximation error (Gibbs) | Asymptotic MLE SE |
| Local quadratic approximates | KL divergence around θ | Itself | Itself | Itself globally | Expected Fisher in large n |
| Best when | Parametric estimation precision | Coding and communication | Measuring dependence | Comparing two specific distributions | Standard errors from real data |
The five quantities form a coherent family — Fisher information is the local quadratic of KL divergence around the true parameter; Shannon entropy is the floor of cross-entropy when the coding distribution matches the data; mutual information is the KL between joint and product of marginals; observed information is the sample-realisation of Fisher information. All five share the same information-theoretic spine introduced (in pieces) by Shannon, Fisher, Cramér and Rao between 1922 and 1948.
Frequently asked questions
Why are there two equivalent formulas for Fisher information?
I(θ) = E[(∂ log f / ∂θ)²] (variance of the score) and I(θ) = −E[∂² log f / ∂θ²] (negative expected second derivative). The two forms agree under the regularity conditions that the support of f does not depend on θ and the order of expectation and differentiation can be swapped. The trick: differentiate the identity ∫ f(x; θ) dx = 1 twice w.r.t. θ; the cross terms combine to give the equivalence. In practice the first form (squared score) is easier to evaluate by Monte Carlo on samples; the second form (Hessian) is easier in closed form. Automatic differentiation (for example PyTorch's autograd) makes both forms easy to compute numerically.
What is the Cramér-Rao lower bound?
For any unbiased estimator θ̂ of θ based on n iid samples, Var(θ̂) ≥ 1/(n · I(θ)). The variance of any honest estimator cannot fall below the inverse of n times the Fisher information per observation. The bound is tight asymptotically: the MLE achieves it as n → ∞, which is what Fisher's "asymptotic efficiency" result means in plain language. For finite n, some estimators (especially biased ones) can have smaller mean-squared error — but no unbiased estimator can have smaller variance than 1/(n · I(θ)). Cramér and Rao independently proved this in 1945–1946; the bound is now the universal precision floor of statistical estimation.
What is the Fisher information matrix?
The multivariate generalisation. For θ ∈ ℝ^k, I(θ) is a k×k matrix with entries I_ij(θ) = E[(∂ log f / ∂θ_i)(∂ log f / ∂θ_j)] — equivalently, the negative expected Hessian of log f. Its inverse I(θ)⁻¹ is the asymptotic covariance matrix of the MLE: √n (θ̂ − θ) → 𝒩(0, I(θ)⁻¹). Diagonal elements give per-parameter variances; off-diagonal elements give correlations between parameter estimates. In natural gradient descent (Amari), the Fisher matrix is used as the natural metric on the parameter manifold — gradient steps are pre-multiplied by I(θ)⁻¹ to align with the geometry of distributions.
What is the difference between expected and observed information?
Expected Fisher information I(θ) is the integral over the model: a property of the model evaluated at θ. Observed information J(θ̂) is −∂² log L / ∂θ² evaluated at the MLE on the actual data: a property of the sample. Asymptotically the two coincide (by the law of large numbers). For finite samples they differ: software packages usually derive standard errors from observed information, because it adapts to the realised data and needs only the Hessian at the MLE, whereas expected information requires evaluating an expectation under the model. Efron and Hinkley (1978) argued for observed information on conditional-inference grounds; expected information is more common in textbooks.
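A case where the two quantities visibly differ is the Cauchy location model, whose expected information is 1/2 per observation at unit scale while the observed information depends on the sample. The sketch below uses a hypothetical five-point sample and a crude grid search for the MLE; both are assumptions made to keep the example short.

```javascript
// Hypothetical sample, assumed drawn from a Cauchy(theta, 1) location model
const sample = [-1.2, 0.3, 0.8, 1.1, 4.5];

// Log-likelihood up to an additive constant
const logLik = (theta) =>
  sample.reduce((acc, x) => acc - Math.log(1 + (x - theta) ** 2), 0);

// Total score: derivative of the log-likelihood in theta
const scoreSum = (theta) =>
  sample.reduce((acc, x) => acc + (2 * (x - theta)) / (1 + (x - theta) ** 2), 0);

// Observed information: minus the second derivative of the log-likelihood
const observedInfo = (theta) =>
  sample.reduce((acc, x) => {
    const d2 = (x - theta) ** 2;
    return acc + (2 * (1 - d2)) / (1 + d2) ** 2;
  }, 0);

// Crude MLE: grid search between the sample minimum and maximum
const lower = Math.min(...sample), upper = Math.max(...sample);
let thetaHat = lower;
let bestLogLik = logLik(lower);
for (let t = lower; t <= upper; t += 1e-4) {
  const value = logLik(t);
  if (value > bestLogLik) {
    bestLogLik = value;
    thetaHat = t;
  }
}

const expectedInfo = sample.length * 0.5; // n * I(theta), with I = 1/2 at unit scale
console.log({ thetaHat, observed: observedInfo(thetaHat), expected: expectedInfo });
```

For this sample the observed information at the MLE comes out larger than the expected value of 2.5, so a standard error based on it would be smaller than the textbook one.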
What is Jeffreys prior and why does it use Fisher information?
Jeffreys' prior is p(θ) ∝ √det I(θ), proportional to the square root of the Fisher information determinant. It is the unique "objective" prior that is invariant under reparameterisation (the prior of φ = g(θ) is the same as transforming Jeffreys' prior of θ through the change-of-variables formula). For a binomial proportion this gives a Beta(½, ½) prior; for a Poisson rate, Gamma(½, 0); for a normal mean with known variance, a flat prior. Jeffreys (1946) proposed it as a default "uninformative" prior; subsequent work showed it has good frequentist matching properties in one-parameter models, but it is often improper (not integrable, as in the Poisson and normal-mean cases) and can behave poorly in multi-parameter models.
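The invariance property can be checked numerically for a Bernoulli proportion: build the unnormalised Jeffreys density in p, push it through the change of variables to the log-odds φ, and compare with the Jeffreys density built directly in φ. The choice of the log-odds parameterisation and the function names are assumptions for illustration.

```javascript
const logistic = (phi) => 1 / (1 + Math.exp(-phi));

// Bernoulli Fisher information in two parameterisations
const infoP = (p) => 1 / (p * (1 - p)); // in the proportion p
const infoPhi = (phi) => {              // in the log-odds phi = log(p / (1 - p))
  const p = logistic(phi);
  return p * (1 - p);
};

// Unnormalised Jeffreys densities: square root of the Fisher information
const jeffreysP = (p) => Math.sqrt(infoP(p)); // kernel of Beta(1/2, 1/2)
const jeffreysPhi = (phi) => Math.sqrt(infoPhi(phi));

// Change of variables: density in phi = density in p times |dp/dphi|
const transported = (phi) => {
  const p = logistic(phi);
  return jeffreysP(p) * p * (1 - p); // dp/dphi = p (1 - p)
};

for (const phi of [-2, -0.5, 0, 1, 3]) {
  console.log(phi, jeffreysPhi(phi), transported(phi));
}
```

The two columns should agree at every φ up to rounding, which is the invariance statement in concrete form. A flat prior on p would not pass the same check.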
How does Fisher information relate to KL divergence?
Fisher information is the local quadratic approximation of KL divergence around the true parameter. For nearby parameters θ and θ + dθ, D_KL(p(·; θ)‖p(·; θ + dθ)) ≈ ½ dθᵀ I(θ) dθ. The Fisher information matrix is the Hessian of the KL divergence at θ. This connection makes information geometry possible: the space of probability distributions is a Riemannian manifold with the Fisher matrix as its natural metric. Geodesics are shortest paths between distributions; natural gradient descent moves along the steepest direction in this geometry.
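The quadratic approximation can be checked directly for a Bernoulli model, where both the KL divergence and the Fisher information have closed forms. The value p = 0.3 and the perturbation sizes below are arbitrary choices for illustration.

```javascript
// Exact KL divergence between Bernoulli(p) and Bernoulli(q), in nats
const klBernoulli = (p, q) =>
  p * Math.log(p / q) + (1 - p) * Math.log((1 - p) / (1 - q));

// Fisher information of Bernoulli(p)
const fisherBernoulli = (p) => 1 / (p * (1 - p));

// Relative gap between the exact KL and the quadratic form (1/2) I(p) delta^2
function relativeError(p, delta) {
  const exact = klBernoulli(p, p + delta);
  const quadratic = 0.5 * fisherBernoulli(p) * delta * delta;
  return Math.abs(exact - quadratic) / exact;
}

for (const delta of [0.05, 0.01, 0.001]) {
  console.log(delta, relativeError(0.3, delta));
}
```

The relative error should shrink roughly in proportion to the perturbation, as expected when the leading neglected term is cubic in dθ.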
How is Fisher information computed for a normal distribution?
For 𝒩(μ, σ²) with known σ, log f = −½ log(2πσ²) − (x − μ)² / (2σ²). Differentiate twice w.r.t. μ: ∂² log f / ∂μ² = −1/σ². Take negative expectation: I(μ) = 1/σ². So the Fisher information for the mean is the inverse variance; sharper data (smaller σ) carry more information per sample. The Cramér-Rao bound gives Var(μ̂) ≥ σ²/n, exactly the variance of the sample mean — which therefore achieves the bound. For unknown σ, the Fisher information matrix has I(μ) = 1/σ², I(σ²) = 1/(2σ⁴), and zero off-diagonal entries — mean and variance estimates are asymptotically uncorrelated.