Maximum Likelihood Estimation
Pick the parameter that makes your observed data most probable — Fisher 1922's keystone of modern statistics
You have data and a parametric model. Maximum likelihood estimation picks the single parameter value that makes your observed dataset the most probable thing the model could have produced. Fisher framed it in 1922 and proved its three landmark properties — consistency, asymptotic normality, asymptotic efficiency — that turned statistics from craft into science.
- Formalised: R. A. Fisher, 1922
- Estimator: θ̂ = arg max ℓ(θ)
- Asymptotic variance: 1 / (n · I(θ))
- Lower bound achieved: Cramér-Rao
- Bayesian limit: MAP with flat prior
Likelihood, not probability
You observe data x = (x₁, …, x_n) and assume the data was generated by some distribution f(x; θ) with unknown parameter θ. The probability density of seeing this exact dataset, assuming independence, is
L(θ ; x) = ∏_{i=1..n} f(x_i ; θ)
The trick is in the perspective. As a function of x with θ fixed, this is the probability of the data. As a function of θ with the observed x fixed, the same expression is called the likelihood. The MLE picks the value of θ that maximises the likelihood:
θ̂ = arg max_θ L(θ ; x) = arg max_θ ℓ(θ ; x)
where ℓ(θ ; x) = Σ_{i=1..n} log f(x_i ; θ)
Working with the log-likelihood ℓ rather than the raw L is universal practice. The logarithm turns a product of n densities (which underflow to zero on any moderate sample) into a sum that differentiates cleanly term by term. Because log is monotonic, the maximiser of ℓ is the same as the maximiser of L.
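The underflow claim is easy to check numerically. The sketch below is illustrative only: the standard normal model and the evenly spaced "sample" of 2000 points are assumptions made for the demonstration, not data from this article.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical sample: 2000 points spread evenly over [-2, 2].
n = 2000
xs = [-2.0 + 4.0 * i / (n - 1) for i in range(n)]

# Raw likelihood: a product of 2000 densities, each below 0.4.
likelihood = 1.0
for x in xs:
    likelihood *= normal_pdf(x)

# Log-likelihood: a sum of 2000 moderately sized negative numbers.
log_likelihood = sum(math.log(normal_pdf(x)) for x in xs)

print(likelihood)      # the product underflows in double precision
print(log_likelihood)  # the sum stays finite
```

The product falls far below the smallest positive double and is stored as zero, so every θ would look equally bad; the sum of logs keeps the full ranking of parameter values.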
The mechanical recipe: write ℓ(θ), differentiate, set the gradient to zero, solve. The solution θ̂ is the maximum likelihood estimate, provided it is a maximum rather than a minimum or saddle point. For many textbook models (normal, Poisson, exponential, binomial) the gradient equation has a closed-form root; for everything else (logistic regression, mixture models, neural nets) you solve it numerically with Newton-Raphson, IRLS, or gradient descent.
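As a rough illustration of the recipe, the sketch below runs Newton-Raphson on a model that also has a closed-form answer, so the two can be compared. The exponential model f(x; λ) = λ·exp(−λx), the sample values, and the function name `newton_mle_exponential` are all assumptions introduced for this example.

```python
def newton_mle_exponential(xs, lam0, tol=1e-12, max_iter=100):
    """Newton-Raphson on the exponential log-likelihood
    l(lam) = n*log(lam) - lam*sum(xs)."""
    n = len(xs)
    total = sum(xs)
    lam = lam0
    for _ in range(max_iter):
        score = n / lam - total      # first derivative of l
        hessian = -n / (lam * lam)   # second derivative of l
        step = score / hessian
        lam -= step                  # Newton update
        if abs(step) < tol:
            break
    return lam

xs = [0.8, 1.9, 0.4, 2.7, 1.2]      # hypothetical waiting times
lam_newton = newton_mle_exponential(xs, lam0=0.5)
lam_closed = len(xs) / sum(xs)      # closed-form MLE: 1 / sample mean

print(lam_newton, lam_closed)
```

For this model Newton's method only converges from a positive starting value below 2/x̄, which is a small reminder that numerical MLE needs sensible initialisation even in one dimension.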
Worked example: MLE for the normal distribution
Suppose x₁, …, x_n are iid samples from a normal distribution with unknown mean μ and variance σ². The density is f(x; μ, σ²) = (2π σ²)^(−1/2) exp(−(x−μ)²/(2σ²)). The log-likelihood across the whole sample:
ℓ(μ, σ²) = -n/2 · log(2π) - n/2 · log(σ²) - 1/(2σ²) · Σ (x_i - μ)²
Differentiate with respect to μ first, holding σ² fixed:
∂ℓ/∂μ = (1/σ²) · Σ (x_i - μ) = 0
⟹ μ̂ = (1/n) Σ x_i = x̄ (sample mean)
Now differentiate with respect to σ² (holding μ at its MLE):
∂ℓ/∂σ² = -n/(2σ²) + 1/(2σ⁴) · Σ (x_i - x̄)² = 0
⟹ σ̂² = (1/n) · Σ (x_i - x̄)² (NOT 1/(n-1))
Two observations. First, the MLE of the mean is exactly the sample mean — Fisher's framework just recovers the everyday estimator. Second, the MLE of the variance divides by n, not by n − 1. The familiar n − 1 denominator (Bessel's correction) gives an unbiased estimator of σ², but the unbiased estimator is not the MLE. The MLE is biased downward by a factor of (n−1)/n, which vanishes as n → ∞ — illustrating that consistency does not imply unbiasedness for finite samples.
Numerical example: with n = 5 samples (4.1, 5.3, 4.9, 5.0, 4.7) the sample mean is x̄ = 4.80 and the squared deviations sum to (−0.7)² + 0.5² + 0.1² + 0.2² + (−0.1)² = 0.80. The MLE of σ² is 0.80/5 = 0.160; the unbiased estimator is 0.80/4 = 0.200. For decision-theoretic purposes (likelihood ratios, AIC, BIC) you usually want the MLE; for reporting "the variance" you usually want the unbiased version.
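The arithmetic in the numerical example can be reproduced in a few lines. This is a plain-Python sketch using the five sample values quoted above; the variable names are illustrative.

```python
xs = [4.1, 5.3, 4.9, 5.0, 4.7]
n = len(xs)

mean_hat = sum(xs) / n                        # MLE of mu: the sample mean
ss = sum((x - mean_hat) ** 2 for x in xs)     # sum of squared deviations

var_mle = ss / n                              # MLE of sigma^2 (divide by n)
var_unbiased = ss / (n - 1)                   # Bessel-corrected estimator

print(round(mean_hat, 3), round(ss, 3), round(var_mle, 3), round(var_unbiased, 3))
```

The ratio of the two variance estimates is exactly (n − 1)/n = 0.8, the bias factor of the MLE.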
Fisher's three results
Fisher's 1922 paper On the mathematical foundations of theoretical statistics established three properties that, together, made maximum likelihood the dominant inferential paradigm for the next century. Under mild regularity conditions on f (smoothness in θ, support not depending on θ, finite Fisher information):
- Consistency. θ̂_n → θ_true in probability as n → ∞. The estimator converges to the truth as the sample grows. This is the minimum thing you can ask of a procedure that calls itself an estimator.
- Asymptotic normality. √n (θ̂_n − θ_true) → 𝒩(0, I(θ)⁻¹) in distribution. The estimator's sampling distribution is approximately normal centred at the truth, with variance 1/(n · I(θ)) where I(θ) is the Fisher information. This gives a closed-form standard error and confidence interval without resampling.
- Asymptotic efficiency. The variance 1/(n · I(θ)) equals the Cramér-Rao lower bound, the smallest possible variance of any unbiased estimator. So no regular, asymptotically unbiased competitor can beat the MLE in large samples.
The third result is the spectacular one. Of all the statistically reasonable estimators you might invent, the MLE asymptotically achieves the theoretical minimum variance. You may do better in finite samples with a clever Bayesian prior or shrinkage estimator, but that advantage disappears as n grows. Fisher information I(θ) = −E[∂² log f(X; θ)/∂θ²] is the expected curvature of a single observation's log-likelihood at the true θ, so the full-sample log-likelihood ℓ has expected curvature n · I(θ); a sharper peak gives a more informative likelihood.
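A small simulation can make the variance formula 1/(n · I(θ)) concrete. The sketch below assumes a Bernoulli model with success probability p, for which the MLE is the sample proportion and the per-observation Fisher information is 1/(p(1 − p)); the values of p, n and the number of replications are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)
p_true = 0.3
n = 200        # observations per simulated dataset
reps = 2000    # number of simulated datasets

def bernoulli_mle(n, p):
    """Draw n Bernoulli(p) observations and return the MLE (sample proportion)."""
    return sum(1 for _ in range(n) if random.random() < p) / n

estimates = [bernoulli_mle(n, p_true) for _ in range(reps)]

empirical_var = statistics.pvariance(estimates)
fisher_info = 1.0 / (p_true * (1.0 - p_true))   # per-observation information
theoretical_var = 1.0 / (n * fisher_info)       # equals p(1 - p) / n

print(statistics.mean(estimates), empirical_var, theoretical_var)
```

For this particular model the variance formula happens to hold exactly at every n; what is asymptotic is the normal shape of the sampling distribution.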
Maximum likelihood vs alternatives
| | Maximum likelihood | Method of moments | Bayesian / MAP | Least squares |
|---|---|---|---|---|
| What you maximise | L(θ ; x) | (none: match sample moments to theoretical moments) | p(θ \| x) | − Σ (y_i − μ_i(θ))² |
| Asymptotic efficiency | Yes (Cramér-Rao) | Generally no | Same as MLE with flat prior | Equivalent to MLE for Gaussian noise |
| Small-sample bias | Possible | Possible | Depends on prior | Possible |
| Computational cost | Closed form for exponential families; iterative otherwise | Closed form usually | MCMC or VI; expensive | One linear-algebra solve in linear case |
| Reports uncertainty? | Yes (Fisher info inverse) | Bootstrap or sandwich estimator | Yes (full posterior) | Yes (residual variance) |
| Handles informative priors | No (or by penalisation) | No | Yes by design | Via regularisation (ridge, lasso) |
| Best when | Large samples, well-specified model | Quick rough estimate | Small samples or strong prior knowledge | Gaussian regression with many predictors |
The four columns are not always rivals — they often coincide. Least squares with Gaussian errors equals MLE. MAP with a flat prior equals MLE. Method-of-moments estimators sometimes coincide with MLEs (e.g. the sample mean for a normal). The differences become important in small samples, with informative priors, or when the model is misspecified.
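The claim that least squares with Gaussian errors equals MLE follows in three lines. The derivation below is a sketch that assumes independent errors with a common variance σ², using the same μ_i(θ) notation as the table.

```latex
\begin{aligned}
y_i &= \mu_i(\theta) + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2) \\
\ell(\theta, \sigma^2) &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - \mu_i(\theta)\bigr)^2 \\
\arg\max_{\theta}\, \ell(\theta, \sigma^2) &= \arg\min_{\theta} \sum_{i=1}^{n}\bigl(y_i - \mu_i(\theta)\bigr)^2
\end{aligned}
```

The first term of ℓ does not involve θ and the second is a negative multiple of the residual sum of squares, so for any fixed σ² the maximiser over θ is the least-squares solution.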
Where MLE shows up
- Logistic regression. The standard binary classifier in epidemiology, marketing, and credit scoring. Coefficients are estimated by maximising Σ [y_i log p_i + (1 − y_i) log(1 − p_i)] — the Bernoulli log-likelihood — via iteratively reweighted least squares (IRLS), a Newton-Raphson variant tailored to GLMs.
- Generalised linear models. Poisson regression for counts, gamma regression for waiting times, ordinal regression for ranked outcomes. R's glm(), Stata's glm, and Python's statsmodels all fit these by Newton-type iterations (IRLS or Newton-Raphson) on the log-likelihood.
- Neural network training. Cross-entropy loss against a softmax output is the negative log-likelihood of a categorical distribution. ImageNet classifiers and transformer language models are fit by MLE, and diffusion models are trained on a closely related variational bound on the log-likelihood; in each case, gradient descent on a negative log-likelihood (or a bound on it) is what training does.
- Hidden Markov model training (Baum-Welch). Speech recognition (pre-2012), part-of-speech tagging, and genomic sequence analysis (HMMER, GENSCAN) all use HMMs. When the states are unobserved, HMMs are trained by Baum-Welch, an instance of the EM algorithm, which is iterative MLE with latent variables. Each EM iteration leaves the log-likelihood no lower than before, until convergence.
- Survival analysis (Cox regression). Hazard ratios in clinical trials are estimated by maximising the partial likelihood — a Cox-specific likelihood that profiles out the baseline hazard. Used in essentially every published survival study from oncology trials to actuarial mortality models.
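To make the logistic-regression case from the list above concrete, here is a minimal sketch of Newton-Raphson on the Bernoulli log-likelihood for one predictor plus an intercept. The data and the function name `fit_logistic` are hypothetical, and a production fit would use a library routine with step control and convergence checks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, n_iter=25):
    """Newton-Raphson for the model logit(p_i) = b0 + b1 * x_i."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0             # gradient of the Bernoulli log-likelihood
        h00 = h01 = h11 = 0.0     # negative Hessian (observed information)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)     # IRLS weight for this observation
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step beta += H^{-1} g, with the 2x2 inverse written out.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical, non-separable data: larger x makes y = 1 more likely.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit_logistic(xs, ys)
print(b0_hat, b1_hat)
```

At the returned coefficients both components of the gradient are numerically zero, which is the "set the gradient to zero" condition solved iteratively. With perfectly separable data the MLE does not exist and this loop would diverge.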
Likelihood-based tests
Once you have an MLE, three classical tests examine whether a parameter equals a hypothesized value θ₀:
- Likelihood ratio test (Wilks). Compute Λ = 2(ℓ(θ̂) − ℓ(θ₀)). Under H₀, Λ asymptotically follows χ² with degrees of freedom equal to the number of constrained parameters. Used everywhere — model selection (deviance), regression nesting, mixed-effects modelling.
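- Wald test. Use the asymptotic normality of θ̂: (θ̂ − θ₀)² · n · I(θ̂) ~ χ²₁ under H₀, where I is the per-observation Fisher information, so n · I(θ̂) is the inverse of the estimated variance of θ̂. Cheaper than the LRT (no refitting under H₀) and what most regression-package output reports as "z" or "Wald" p-values.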
- Score test (Rao). Evaluate the score (gradient of ℓ) at θ₀ and use its variance. Useful when fitting under H₀ is cheap but the alternative is hard. Related to Lagrange-multiplier tests in econometrics.
Under regularity conditions all three tests are asymptotically equivalent: they have the same null distribution and the same power against local alternatives. They differ in finite-sample behaviour and computational cost. The LRT is generally the most accurate but requires fitting two models; the Wald test is the cheapest but the most fragile near boundaries, and it is not invariant to reparameterisation.
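The three statistics can be compared side by side on a simple case. The sketch below assumes a Bernoulli model with hypothetical data (62 successes in 100 trials) and tests H₀: p = 0.5; I(p) = 1/(p(1 − p)) is the per-observation Fisher information, and the χ²₁ tail probability is computed from the complementary error function.

```python
import math

k, n = 62, 100          # hypothetical data: successes, trials
p0 = 0.5                # hypothesised value under H0
p_hat = k / n           # MLE

def loglik(p):
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def fisher_info(p):
    """Per-observation Fisher information of a Bernoulli(p)."""
    return 1.0 / (p * (1.0 - p))

def chi2_sf_1df(x):
    """P(chi-square with 1 df > x), via P(|Z| > sqrt(x))."""
    return math.erfc(math.sqrt(x / 2.0))

# Likelihood ratio: needs the log-likelihood at both p_hat and p0.
lrt = 2.0 * (loglik(p_hat) - loglik(p0))

# Wald: information evaluated at the MLE.
wald = (p_hat - p0) ** 2 * n * fisher_info(p_hat)

# Score: gradient and information both evaluated at p0.
score = k / p0 - (n - k) / (1.0 - p0)
score_stat = score ** 2 / (n * fisher_info(p0))

for name, stat in (("LRT", lrt), ("Wald", wald), ("Score", score_stat)):
    print(name, round(stat, 3), round(chi2_sf_1df(stat), 4))
```

The three values are close but not identical, which is the finite-sample disagreement described above; they converge as n grows.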
Variants and extensions
- Maximum a posteriori (MAP). Add a log-prior to the log-likelihood and maximise the sum: θ̂_MAP = arg max [ℓ(θ) + log p(θ)]. With a flat prior, MAP collapses to MLE. With a Gaussian prior, MAP becomes ridge-regularised MLE — the connection to ridge regression and weight decay in deep learning.
- Penalised likelihood (Lasso, ridge, elastic net). Subtract a penalty λ · ‖θ‖₁ or ‖θ‖₂² from the log-likelihood. Equivalent to MAP with a Laplace or Gaussian prior. Standard in high-dimensional regression where pure MLE overfits.
- Profile likelihood. When θ has a parameter of interest θ₁ and a nuisance θ₂, profile out θ₂ by maximising over it: ℓ_profile(θ₁) = max_{θ₂} ℓ(θ₁, θ₂). Treat ℓ_profile as if it were a one-parameter likelihood for inference on θ₁.
- Quasi-likelihood. When the full distribution is unknown but the mean-variance relationship is, write down a "likelihood" that depends only on those moments. Wedderburn 1974. Standard in GLM theory for overdispersed counts.
- EM algorithm for missing-data MLE. Iterate between (E) expected sufficient statistics and (M) MLE on the completed data. Provably non-decreasing log-likelihood per iteration. The standard fitting algorithm for mixture models, HMMs, and factor-analysis-style latent-variable models.
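The EM entry in the list above can be sketched for the simplest latent-variable case, a two-component one-dimensional Gaussian mixture. The data, starting values, and the function name `em_two_gaussians` are hypothetical, and the code omits the safeguards (variance floors, multiple restarts) that a real fit would need.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_two_gaussians(xs, mu1, mu2, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture.
    Returns the fitted parameters and the log-likelihood per iteration."""
    n = len(xs)
    pi1, var1, var2 = 0.5, 1.0, 1.0
    history = []
    for _ in range(n_iter):
        # E-step: weighted densities and responsibilities of component 1.
        d1 = [pi1 * normal_pdf(x, mu1, var1) for x in xs]
        d2 = [(1.0 - pi1) * normal_pdf(x, mu2, var2) for x in xs]
        history.append(sum(math.log(a + b) for a, b in zip(d1, d2)))
        r1 = [a / (a + b) for a, b in zip(d1, d2)]
        # M-step: responsibility-weighted MLEs of each parameter.
        n1 = sum(r1)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(r1, xs)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(r1, xs)) / n2
        var1 = sum(r * (x - mu1) ** 2 for r, x in zip(r1, xs)) / n1
        var2 = sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(r1, xs)) / n2
        pi1 = n1 / n
    return (pi1, mu1, var1, mu2, var2), history

# Hypothetical data: two well-separated clusters.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 4.0, 4.2, 3.8, 4.1, 3.9, 4.3]
params, history = em_two_gaussians(xs, mu1=0.0, mu2=5.0)
pi1_hat, mu1_hat, var1_hat, mu2_hat, var2_hat = params

print(params)
print(history[0], history[-1])
```

The recorded log-likelihoods never decrease from one iteration to the next (up to rounding), which is the monotonicity property stated in the list.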
Common pitfalls
- Trusting the MLE on a misspecified model. Fisher's properties assume the data really do come from f(x; θ) for some θ. If the model is wrong, the MLE converges to the parameter that minimises Kullback-Leibler divergence to the truth — not to anything physically meaningful. White's sandwich estimator gives robust standard errors when you suspect misspecification.
- Local maxima and unbounded likelihoods. Mixture-model likelihoods can have an unbounded global maximum at degenerate solutions (one component sitting on a single observation with zero variance). EM finds local maxima depending on initialisation. Always run from multiple random starts.
- Boundary effects. When the true parameter is on the boundary of the parameter space (variance components, mixture proportions), the asymptotic χ² distribution of the LRT no longer holds. For a single parameter tested on its boundary, the usual reference distribution is a 50:50 mixture of χ²₀ (a point mass at zero) and χ²₁; in more complicated cases, simulate the null.
- Ignoring the sample-size assumption. Asymptotic results require "large enough" n. With 30 observations and 10 parameters you are not in the asymptotic regime, and Wald confidence intervals can be badly miscalibrated. Use likelihood-ratio intervals, the bootstrap, or a Bayesian credible interval instead.
- MLE on the wrong scale. The MLE is invariant under reparameterisation (the MLE of σ is the square root of the MLE of σ²), but Wald intervals are not: an interval that is symmetric on one scale becomes asymmetric after a nonlinear transform. Confidence intervals computed on one scale and back-transformed differ from intervals computed directly on the target scale; choose the scale where the normal approximation is most accurate.
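The last pitfall is easiest to see with numbers. The sketch below assumes an exponential model with rate λ and a small hypothetical sample, and builds a 95% Wald interval twice: directly on the λ scale, and on the log λ scale followed by back-transformation. The standard errors use I(λ) = 1/λ² and the delta method.

```python
import math

xs = [0.8, 1.9, 0.4, 2.7, 1.2]   # hypothetical waiting times
n = len(xs)
lam_hat = n / sum(xs)            # MLE of the rate
z = 1.96                         # normal quantile for a 95% interval

# Wald interval on the lambda scale: se(lam_hat) = lam_hat / sqrt(n).
se_lam = lam_hat / math.sqrt(n)
ci_raw = (lam_hat - z * se_lam, lam_hat + z * se_lam)

# Wald interval on the log scale: se(log lam_hat) = 1 / sqrt(n),
# then exponentiate the endpoints to return to the lambda scale.
se_log = 1.0 / math.sqrt(n)
ci_log = (lam_hat * math.exp(-z * se_log), lam_hat * math.exp(z * se_log))

print(lam_hat)
print(ci_raw)
print(ci_log)
```

Both intervals are built around the same point estimate, yet they disagree noticeably at n = 5. The raw-scale interval is symmetric and would include negative rates once n drops below 4, while the log-scale interval is always positive.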
Why MLE became the keystone
Fisher published On the mathematical foundations of theoretical statistics in 1922 in the Philosophical Transactions of the Royal Society A. In a single sixty-page paper he formalised the words parameter, statistic, likelihood, information, sufficiency, consistency and efficiency, and argued that maximum-likelihood estimators have all the asymptotic optimality properties one could ask for. Before Fisher, statistical estimation was a folk craft of method-of-moments matchings and ad-hoc weighted averages; after Fisher, it had a unified theory.
A century later that theory still holds. Nearly every applied statistical procedure you encounter — from generalised linear models to Cox regression to mixed-effects modelling to neural-network training — is maximum likelihood under the hood. Bayesian methods extend it; penalisation and regularisation modify it; bootstrap and sandwich estimators robustify it. But the core idea is unchanged: pick the parameter that makes your data most probable, and let Fisher's three theorems do the rest.
Frequently asked questions
What is the difference between probability and likelihood?
Probability and likelihood involve the same function f(x; θ) but treat different arguments as fixed. Probability fixes θ and sees f as a function of x — "given this coin is fair, how likely is each pattern of flips?" Likelihood fixes x (the observed data) and sees f as a function of θ — "given these flips, which value of the bias θ makes the data most plausible?" The MLE is the θ that maximises the likelihood interpreted that way. Likelihoods are not probabilities and need not integrate to one over θ.
Why do we maximise the log-likelihood instead of the likelihood?
Three reasons. First, log is monotonic — the maximiser of L is also the maximiser of log L, so the answer is unchanged. Second, the likelihood is a product L(θ) = ∏ f(x_i; θ); taking log turns the product into a sum, which differentiates cleanly term by term. Third, products of small numbers (each f(x_i; θ) < 1 typically) underflow to zero in floating-point on any sample of more than a few hundred points, while sums of logs stay well-conditioned.
Is the MLE always unbiased?
No. Maximum likelihood is consistent (it converges to the true parameter as sample size grows) but it is not generally unbiased for finite samples. The classic example is the MLE of the variance of a normal distribution: σ̂² = (1/n) Σ (x_i - x̄)². Its expectation is ((n-1)/n)·σ², so the MLE underestimates the true variance. Bessel's correction divides by n-1 instead of n to make the estimator unbiased — at the cost of slightly higher variance than the MLE.
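The bias factor (n − 1)/n can be checked by simulation. The sketch below draws many small normal samples with an assumed true variance of 4 and averages the two variance estimators; the seed, sample size and replication count are arbitrary.

```python
import random

random.seed(1)
n = 5            # observations per sample
reps = 20000     # number of simulated samples
sigma2 = 4.0     # true variance (assumed for the simulation)

sum_mle = 0.0
sum_unbiased = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, math_sigma) for math_sigma in [sigma2 ** 0.5] * n]
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    sum_mle += ss / n              # MLE: divide by n
    sum_unbiased += ss / (n - 1)   # Bessel's correction: divide by n - 1

avg_mle = sum_mle / reps
avg_unbiased = sum_unbiased / reps

print(avg_mle, (n - 1) / n * sigma2)   # both should be near 3.2
print(avg_unbiased, sigma2)            # both should be near 4.0
```

The average of the MLE settles near ((n − 1)/n)·σ² rather than σ², while the corrected estimator averages out to σ² itself.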
What does asymptotic efficiency mean?
An estimator is asymptotically efficient if its variance approaches the Cramér-Rao lower bound (the minimum variance achievable by any unbiased estimator) as the sample size grows. The MLE achieves this lower bound under regularity conditions. Concretely, no other regular estimator can have smaller asymptotic variance, so for large samples the MLE makes the best possible use of the data. For small samples, other estimators (Bayesian, James-Stein) can beat it, but the MLE is asymptotically optimal.
How is MLE connected to neural network training?
Training a neural network classifier with cross-entropy loss is exactly maximum likelihood estimation. The model defines a conditional distribution p(y | x; θ) over class labels parameterised by network weights θ. The cross-entropy of the predicted distribution against the one-hot label is -log p(y_true | x; θ). Summing over the training set and minimising is identical to maximising the log-likelihood Σ log p(y_i | x_i; θ). Every classification network trained this way, without added regularisation, is a maximum likelihood fit.
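The identity between cross-entropy and negative log-likelihood can be verified on a single example. The logits and label below are hypothetical stand-ins for a network's output on one training point.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs):
    """Cross-entropy of a predicted distribution against a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

logits = [2.0, 0.5, -1.0]    # hypothetical network outputs for three classes
label = 0                    # index of the true class

probs = softmax(logits)
one_hot = [1.0 if k == label else 0.0 for k in range(len(logits))]

ce = cross_entropy(one_hot, probs)   # what the training loss computes
nll = -math.log(probs[label])        # negative log-likelihood of the label

print(ce, nll)
```

Because the one-hot vector zeroes out every term except the true class, the two numbers agree, and summing either over a dataset gives the same objective.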
What is Fisher information?
Fisher information I(θ) measures how sharply the log-likelihood peaks at the true parameter — equivalently, how much information about θ a single observation carries. Formally I(θ) = -E[∂² log f / ∂θ²] = E[(∂ log f / ∂θ)²]. The asymptotic variance of the MLE is 1/(n · I(θ)), so a high Fisher information means a precise estimator. Fisher introduced the quantity in his 1922 paper and named it; it is the building block of the Cramér-Rao bound, Wald tests, and most asymptotic statistical theory.
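The two forms of the Fisher information can be checked exactly for a model with a finite sample space. The sketch below assumes a Bernoulli(p) observation, where log f(x; p) = x·log p + (1 − x)·log(1 − p) and the known answer is I(p) = 1/(p(1 − p)).

```python
def fisher_info_bernoulli(p):
    """Return (E[score^2], -E[second derivative]) for one Bernoulli(p) draw,
    computed by summing over the two possible outcomes."""
    expected_score_sq = 0.0
    expected_neg_hessian = 0.0
    for x, prob in ((1, p), (0, 1.0 - p)):
        score = x / p - (1 - x) / (1.0 - p)                 # d log f / dp
        hessian = -x / p ** 2 - (1 - x) / (1.0 - p) ** 2    # d^2 log f / dp^2
        expected_score_sq += prob * score ** 2
        expected_neg_hessian += prob * (-hessian)
    return expected_score_sq, expected_neg_hessian

p = 0.3
score_form, curvature_form = fisher_info_bernoulli(p)
closed_form = 1.0 / (p * (1.0 - p))

print(score_form, curvature_form, closed_form)
```

All three values coincide, and the information is smallest at p = 0.5, where a single coin flip tells you least about p.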
How does MLE relate to Bayesian estimation?
The MLE maximises the likelihood L(θ; x); the maximum a posteriori (MAP) estimator maximises the posterior p(θ | x) ∝ L(θ; x) · p(θ) where p(θ) is the prior. With a uniform (or improper flat) prior, MAP and MLE coincide. With an informative prior, MAP regularises the MLE — for instance ridge regression's penalty term is the negative log of a Gaussian prior on weights. Bayesian estimation reports the full posterior distribution, not just its mode; MLE reports a point estimate plus standard error.
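A coin-flip example shows the MAP and MLE relationship in closed form. The sketch assumes a Bernoulli likelihood with a Beta(a, b) prior, for which the posterior is Beta(k + a, n − k + b) and its mode is available directly; the data and prior settings are hypothetical.

```python
def bernoulli_mle(k, n):
    """MLE of the success probability: the sample proportion."""
    return k / n

def bernoulli_map(k, n, a, b):
    """Posterior mode under a Beta(a, b) prior.
    Valid when k + a > 1 and n - k + b > 1."""
    return (k + a - 1.0) / (n + a + b - 2.0)

k, n = 7, 10    # hypothetical data: 7 heads in 10 flips

mle = bernoulli_mle(k, n)
map_flat = bernoulli_map(k, n, a=1.0, b=1.0)          # Beta(1, 1) is uniform
map_informative = bernoulli_map(k, n, a=5.0, b=5.0)   # prior centred on 0.5

print(mle, map_flat, map_informative)
```

With the flat prior the MAP estimate equals the MLE; with the informative prior it is pulled from 7/10 toward the prior centre, landing at 11/18.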