Stochastic Processes

Metropolis-Hastings Algorithm

Sample from any distribution proportional to a known function — the MCMC workhorse

Metropolis-Hastings samples π(x) known only up to a constant by proposing moves x → x' and accepting with min(1, π(x')/π(x) · q(x|x')/q(x'|x)). Detailed balance makes π stationary. Foundation of Bayesian inference; cited 50,000+ times. Optimal acceptance ≈ 0.234 in high dimensions.

  • Acceptance ratioα = min(1, π(x') q(x|x') / (π(x) q(x'|x)))
  • Needs onlyπ(x) evaluable up to a constant
  • AuthorsMetropolis et al. 1953; Hastings 1970
  • Stationary becausedetailed balance with π
  • Optimal acceptance≈ 0.234 (high-d random walk)
  • VariantsGibbs, HMC, NUTS, parallel tempering, RJ-MCMC

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How Metropolis-Hastings works

Goal: generate samples from a target distribution π(x) on some state space (continuous or discrete), where you can evaluate π(x) only up to an unknown normalising constant Z. Construct a Markov chain whose stationary distribution is exactly π, then read its trajectory as a stream of (correlated) samples.

The algorithm has three ingredients:

  1. A current state x ∈ supp(π).
  2. A proposal density q(x' | x) — any distribution from which you can sample candidates x'.
  3. An acceptance test based on the ratio
α(x, x') = min(1, [π(x') · q(x | x')] / [π(x) · q(x' | x)]).

For each iteration:

  • Sample x' ~ q(· | x).
  • Sample u ~ Uniform(0, 1).
  • If u < α(x, x'), accept — set x ← x'. Otherwise reject — leave x unchanged.
  • Record x as the new sample.

The normalising constant Z cancels in the ratio π(x')/π(x), so only the unnormalised target is needed. For symmetric proposals (q(x'|x) = q(x|x'), as in random-walk Metropolis with a Gaussian step), the q ratio is 1 and α simplifies to min(1, π(x')/π(x)).

Worked example — sampling from a banana

Take Rosenbrock's banana density in 2D: π(x, y) ∝ exp(−(1 − x)²/2 − 10(y − x²)²/2). Hard to sample directly; trivial to evaluate. Random-walk Metropolis with isotropic Gaussian proposal q(x' | x) = N(x, σ² I):

// Start at (0, 0); step size σ = 0.5
x ← (0, 0)
for t = 1 to 10000:
    x' ← x + σ · randn(2)              // propose
    α ← min(1, π(x') / π(x))           // accept ratio (q symmetric)
    if rand() < α: x ← x'              // accept or reject
    sample[t] ← x

With σ = 0.5 you get an acceptance rate ≈ 0.30 and the chain explores the banana shape in a few thousand iterations. With σ = 5.0 most moves are rejected (acceptance < 0.05) and the chain stalls; with σ = 0.05 every move is accepted (acceptance ≈ 0.99) but each step is tiny and the chain crawls.

Why it works — detailed balance

The transition kernel of the M-H chain is

P(x → x') = q(x' | x) · α(x, x'),    x ≠ x',

with a holding probability at x equal to 1 − ∫ q(x' | x) α(x, x') dx'. The chain satisfies detailed balance with respect to π:

π(x) P(x → x') = π(x') P(x' → x).

Verification: substitute α and consider two cases based on which side of the min is active. Say π(x') q(x|x') ≤ π(x) q(x'|x). Then α(x, x') = π(x') q(x|x') / (π(x) q(x'|x)) and α(x', x) = 1, so

π(x) · q(x'|x) · α(x, x') = π(x') · q(x|x') = π(x') · q(x|x') · 1 = π(x') · P(x' → x).

Detailed balance ⇒ π is stationary (integrate both sides over x). Add irreducibility (chain can reach any positive-π state) and aperiodicity (no forced cycles), and the chain converges to π from any starting point — by the ergodic theorem for Markov chains.

Tuning — step size and acceptance rate

The dominant tuning parameter is the proposal scale σ (the variance of the Gaussian step in random-walk Metropolis). Optimal scaling theory (Roberts, Gelman, Gilks 1997) shows that for d-dimensional Gaussian targets in the limit d → ∞:

  • Optimal proposal scale: σ ≈ 2.38 / √d times the target's standard deviation.
  • Optimal acceptance rate: 0.234.
  • Optimal "efficiency" (ESS per step) scales as O(1/d).

In low dimensions (d = 1) the optimal acceptance is closer to 0.44. Practical advice: aim for 0.20-0.40 acceptance during burn-in by adjusting σ; once the chain mixes well, freeze σ and discard burn-in samples. Adaptive M-H methods (Haario, Saksman, Tamminen 2001) automate this.

Variants — Gibbs, HMC, and beyond

  • Random-walk Metropolis. Symmetric Gaussian proposal; α simplifies to min(1, π(x')/π(x)). The default for low- to moderate-dimensional targets.
  • Independence sampler. Proposal q(x' | x) = g(x') does not depend on x. Acceptance ratio becomes [π(x') g(x)] / [π(x) g(x')]. Efficient when g is a good approximation to π; can be terrible when g is far from π.
  • Gibbs sampling. Cycle through coordinates, sampling each x_i from its full conditional π(x_i | x_{−i}). Every move is accepted (α ≡ 1). Tractable when full conditionals have closed-form (conjugate hierarchical models).
  • Metropolis-within-Gibbs. Mix Gibbs (where conditionals are tractable) with M-H steps (where they aren't). The pattern in BUGS/JAGS.
  • Hamiltonian Monte Carlo (HMC). Augment with momentum, integrate Hamilton's equations using leapfrog, propose the endpoint of the trajectory. Uses gradients of log π. Much lower autocorrelation in high dimensions. Used by Stan, PyMC, NumPyro.
  • No-U-Turn Sampler (NUTS). HMC variant that auto-tunes trajectory length by detecting when the trajectory reverses direction. The default sampler in Stan since 2014.
  • Parallel tempering. Run multiple chains at different "temperatures" π(x)^(1/T_i); swap states between adjacent chains. Hot chains mix easily; cold chain stays on target. Solves multimodality.
  • Reversible jump MCMC. M-H across models of different dimensions (Green 1995). Used for Bayesian model selection.
  • Langevin / MALA. Proposal includes a drift term ∇ log π(x); acceptance compensates. Cheaper than HMC, still uses gradients.

JavaScript — random-walk Metropolis

function randn() {
  const u = 1 - Math.random(), v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Rosenbrock "banana" density (unnormalised)
function bananaLogPi([x, y]) {
  return -0.5 * (1 - x) ** 2 - 0.5 * 10 * (y - x * x) ** 2;
}

function metropolisHastings(logPi, dim, sigma, nIter, x0) {
  const samples = new Array(nIter);
  let x = x0.slice();
  let logPiX = logPi(x);
  let accepted = 0;
  for (let t = 0; t < nIter; t++) {
    const xProp = x.map(xi => xi + sigma * randn());
    const logPiProp = logPi(xProp);
    const logAlpha = logPiProp - logPiX;          // symmetric proposal
    if (Math.log(Math.random()) < logAlpha) {
      x = xProp;
      logPiX = logPiProp;
      accepted++;
    }
    samples[t] = x.slice();
  }
  return { samples, acceptance: accepted / nIter };
}

// Run
const result = metropolisHastings(bananaLogPi, 2, 0.5, 10000, [0, 0]);
console.log(result.acceptance);  // ≈ 0.30 — close to the 0.234 optimum
// Discard first 1000 as burn-in
const posterior = result.samples.slice(1000);

Why detailed balance gives stationarity

If π(x) P(x, x') = π(x') P(x', x) for all x, x', then

∫ π(x) P(x, x') dx = ∫ π(x') P(x', x) dx = π(x') · 1 = π(x'),

using the fact that P is a transition kernel (rows integrate to 1). The left side is by definition the distribution at time t + 1 if the distribution at time t is π. So π is invariant — fixed by the chain.

To get convergence, also need:

  • Irreducibility — any state with positive π is reachable from any starting state in some number of steps.
  • Aperiodicity — the chain doesn't have a forced cycle.

For continuous random-walk Metropolis with a positive proposal density (Gaussian), irreducibility and aperiodicity hold automatically. The ergodic theorem then guarantees that for any L¹ function f,

(1/N) Σ_{t=1}^N f(X_t) → ∫ f(x) π(x) dx    almost surely,

so empirical averages over the chain converge to expectations under π. This is the foundational justification for "run the chain, average the samples".

When Metropolis-Hastings is the right tool

  • Bayesian posterior inference. Compute posterior expectations, credible intervals, marginal distributions when the posterior is known only up to a normaliser.
  • Statistical physics. Sample configurations of Ising models, lattice gauge theories, polymer chains under Boltzmann weights exp(−βH(x))/Z.
  • Image denoising and inverse problems. Sample posterior over images given noisy measurements (Markov random field priors).
  • Phylogenetics. MrBayes uses M-H across tree topologies and branch lengths.
  • Computer graphics. Metropolis light transport (Veach & Guibas 1997) samples paths through scenes.
  • Combinatorial optimisation. Simulated annealing is M-H with temperature schedule.
  • Latent-variable models. Hidden Markov models, mixtures, topic models — M-H is a fallback when EM or Gibbs is not available.

Common pitfalls

  • No burn-in. Early samples depend on the (arbitrary) starting state; including them biases estimates. Discard the first 10-50% of iterations.
  • Treating samples as independent. M-H samples are autocorrelated. Effective sample size (ESS) is much smaller than total iterations — use ESS in confidence intervals, not raw N.
  • One short chain. Run multiple chains from different starts; if they disagree, the sampler is stuck in a mode. Gelman-Rubin R̂ > 1.01 is a red flag.
  • Wrong proposal scale. Acceptance < 0.10 means steps too large; > 0.60 means steps too small. Tune until acceptance is 0.20-0.40 (random walk) or ≈ 0.65 (HMC).
  • Multimodal target. Random-walk M-H gets stuck in one mode. Use parallel tempering, mode-jumping proposals, or HMC with multiple chains.
  • Wrong acceptance formula for asymmetric proposals. Forgetting the q(x|x')/q(x'|x) ratio biases the stationary distribution. Symmetric q (e.g. Gaussian random walk) is the only case where it cancels.
  • Discrete / mixed dimensions. Vanilla M-H doesn't handle moves between models of different dimension; use reversible-jump MCMC.

Applications

Bayesian statistics

Practically every Bayesian model with a non-conjugate likelihood is fit via MCMC. M-H (or its variants HMC/NUTS) underpins Stan, PyMC, NumPyro, BUGS, JAGS — the entire applied Bayesian ecosystem. From 1990s hierarchical models in epidemiology to modern Bayesian deep learning, MCMC is the inference engine.

Statistical physics — the original 1953 paper

Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller introduced the algorithm to compute equation-of-state properties of hard-sphere fluids on the MANIAC computer at Los Alamos. The same algorithm now powers lattice QCD, Ising model simulations of phase transitions, and Monte Carlo studies of every quantum spin system.

Phylogenetics — MrBayes

Bayesian inference of evolutionary trees: state space includes tree topology (a discrete combinatorial object), branch lengths, and substitution-model parameters. M-H proposes tree rearrangements (nearest-neighbour interchange, subtree pruning and regrafting) and parameter perturbations; acceptance under the phylogenetic likelihood gives posterior samples over trees.

Computer graphics — Metropolis light transport

Render an image by sampling light paths through a 3D scene under the rendering equation. Veach and Guibas (1997) showed that M-H over path space converges faster than naive path tracing for hard scenes (caustics, indirect illumination through small apertures).

Simulated annealing for optimisation

Use π(x) = exp(−f(x)/T) as the target and decrease T over iterations. As T → 0 the mass concentrates on the global minima of f. Each iteration is M-H. Used to solve TSP, VLSI placement, protein folding heuristics.

Frequently asked questions

Why does the Metropolis-Hastings acceptance ratio give π as stationary?

The chain satisfies detailed balance: π(x) P(x → x') = π(x') P(x' → x), where P(x → x') = q(x'|x) · α(x, x') is the transition kernel. With the M-H acceptance α(x, x') = min(1, [π(x') q(x|x')]/[π(x) q(x'|x)]), one can verify by case analysis (either α equals the ratio or its reciprocal) that the equation holds. Detailed balance implies π is stationary for the chain. Add irreducibility (the chain can reach any state) and aperiodicity (no forced cycles), and the chain converges to π from any starting point. The unnormalised density suffices: π(x')/π(x) cancels any normalising constant — only ratios matter.

What's the optimal acceptance rate?

For random-walk Metropolis with Gaussian proposals, the optimal acceptance rate in high dimensions is ≈ 0.234. Below it, the proposal step is too large — most moves are rejected, the chain barely moves. Above it, the step is too small — moves are accepted but each step explores little, autocorrelation is high. The 0.234 result (Roberts, Gelman, Gilks 1997) assumes the target factorises into i.i.d. components in the limit d → ∞; for low dimensions, optimal is closer to 0.44. Practical tuning: aim for 0.20-0.40 acceptance by adjusting the proposal step size during burn-in. For Hamiltonian Monte Carlo, the optimal is much higher (~0.65) because each step takes a long trajectory.

How does Metropolis differ from Gibbs sampling?

Gibbs sampling is a special case where the proposal updates one coordinate at a time, sampling exactly from its full conditional distribution π(x_i | x_{−i}). Every Gibbs proposal is accepted (the M-H ratio is identically 1). Gibbs is efficient when the full conditionals are tractable (conjugate models). When they aren't, you use generic M-H. Gibbs gets stuck in highly correlated targets — it can only move along axes. Random-walk Metropolis with a tuned proposal can outperform Gibbs on correlated targets. Practical Bayesian software (BUGS, JAGS) mixes Gibbs and Metropolis depending on which conditionals are conjugate.

What is burn-in and how long should it be?

Burn-in is the initial portion of the chain discarded so that samples reflect π rather than the (arbitrary) starting state. Length depends on geometry: well-mixing chains may need only 1,000 burn-in iterations; pathological multimodal targets may need millions. Diagnostics: trace plots (the chain should look like white noise around the posterior, not drift); Gelman-Rubin R̂ across multiple chains (should be < 1.01); effective sample size relative to total iterations. No universal rule — you tune by inspection. A rough heuristic: throw away the first 10-50% of iterations.

How does Hamiltonian Monte Carlo improve on M-H?

Hamiltonian Monte Carlo (HMC) uses gradient information of log π to propose long, coherent moves through phase space, augmenting the state with auxiliary momentum variables and integrating Hamilton's equations of motion (with leapfrog) for some number of steps. The proposal is then a candidate position; M-H acceptance compensates for numerical integration error. Result: much lower autocorrelation than random-walk Metropolis in high dimensions — effective sample size per gradient evaluation is far higher. HMC is what powers modern Bayesian software (Stan, PyMC, NumPyro). Requires differentiable log-target, which rules out discrete parameters.

When does Metropolis-Hastings fail?

Multimodal targets with separated modes — the chain can get stuck in one mode and never visit the others. Diagnosed by running multiple chains from different starts and noticing they disagree. Solutions: parallel tempering (run hot copies that mix more easily, exchange states), simulated annealing, or mode-jumping proposals. Also fails on very high-dimensional targets with strong correlations — random-walk Metropolis becomes infeasibly slow because the step size must shrink as O(d^{-1/2}). HMC or no-U-turn sampler (NUTS) addresses this. Pseudo-likelihoods, intractable normalising constants (doubly intractable problems), and reversible jump across model dimensions all need M-H variants (exchange algorithm, RJ-MCMC).

MCMC algorithms compared — M-H vs Gibbs vs HMC

Pick the right MCMC sampler for the target.

Algorithm Needs Per-iter cost Optimal acceptance Strengths Weaknesses
Random-walk M-H π(x) up to constant 1 evaluation ≈ 0.234 (high d) General, simple Slow in high d, multimodal
Independence sampler π(x) up to constant + good g 1 evaluation Up to 1.0 if g ≈ π Fast when g close to π Catastrophic if g far from π
Gibbs sampling Tractable full conditionals Sample from each conditional 1.0 (always accepts) Closed-form on conjugate models Slow on correlated targets
HMC ∇ log π(x) + π L gradient evals (leapfrog) ≈ 0.65 Excellent in high d Discrete parameters; tuning L, ε
NUTS ∇ log π(x) + π Auto-tuned trajectory ≈ 0.80 No L tuning Still slower than well-tuned HMC
Parallel tempering π(x) + temperature ladder k × M-H cost Per chain ≈ 0.234 Handles multimodality k× compute, swap tuning
SMC / particle filters Sequential factorisation of π N particles per step N/A Embarrassingly parallel Particle degeneracy without resampling