Stochastic Processes

Ergodic Theorem

For an ergodic system, time average equals space average — Birkhoff 1931

The Birkhoff ergodic theorem: for an ergodic measure-preserving system (X, T, μ) and any f ∈ L¹(μ), (1/N) Σ f(T^n x) → ∫ f dμ almost surely. Time average = space average. Foundation of statistical mechanics, justifies MCMC, generalises the strong law of large numbers to arbitrary measure-preserving dynamics.

Statement(1/N) Σ f(T^n x) → ∫ f dμ a.s.
AuthorG. D. Birkhoff 1931 (pointwise); von Neumann 1932 (L²)
ErgodicityNo non-trivial T-invariant sets: μ(A) ∈ {0, 1}
GeneralisesStrong LLN (i.i.d. sequences are ergodic)
Foundation ofStatistical mechanics, MCMC, information theory
Mixing hierarchyBernoulli ⊂ K ⊂ mixing ⊂ weakly mixing ⊂ ergodic

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How the ergodic theorem works

Setup. A measure-preserving dynamical system is a triple (X, T, μ) where X is a measurable space, μ is a probability measure on X, and T : X → X is a measurable map satisfying μ(T⁻¹ A) = μ(A) for all measurable A — the dynamics preserves μ. The orbit of x ∈ X is the sequence x, Tx, T²x, T³x, …

Two natural ways to "average" a measurable function f : X → ℝ:

Time average along the orbit of x: A_N f(x) = (1/N) Σ_{n=0}^{N−1} f(T^n x).
Space average: ⟨f⟩ = ∫_X f dμ — a single number, integrating f against the invariant measure.

Birkhoff's theorem (1931) says that for any f ∈ L¹(μ), the time average converges μ-almost-everywhere to a T-invariant function f̄. If the system is ergodic — meaning every T-invariant measurable set has μ-measure 0 or 1 — then every T-invariant function is constant μ-a.s., so

lim_{N→∞} (1/N) Σ_{n=0}^{N−1} f(T^n x) = ∫_X f dμ    for μ-a.e. x.

Time average = space average. A single typical orbit knows the integral of every L¹ function.

Worked example — the irrational rotation

Take X = [0, 1) (the unit circle), Lebesgue measure μ, and the rotation T_α(x) = (x + α) mod 1 for an irrational α. T_α preserves μ (translation is measure-preserving on the circle), and the system is ergodic — there are no proper T_α-invariant measurable subsets because the orbit of any x is equidistributed (Weyl's equidistribution theorem, 1916).

Apply the ergodic theorem to f(x) = 1_{[a, b]}(x), the indicator of a subinterval. The time average is the fraction of the orbit that falls in [a, b] in the first N steps. The space average is b − a (Lebesgue measure of [a, b]). So

lim_{N→∞} #{n < N : T_α^n x ∈ [a, b]} / N = b − a.

The orbit spends a fraction b − a of its time in [a, b] — exactly the interval's length. Take α = (√5 − 1)/2 (golden ratio fractional part). The first decimal digits of nα mod 1 are equidistributed: each digit 0-9 appears with frequency 1/10 along the orbit. This is the ergodic theorem made arithmetic.

Recovering the strong law of large numbers

The strong law of large numbers (SLLN) is a special case. Let (Y_n) be an i.i.d. sequence with mean ⟨Y⟩. Set X = ℝ^ℕ (sample paths), μ = product measure, and T = left shift (T(y_0, y_1, y_2, …) = (y_1, y_2, …)). T preserves μ; the system is ergodic by Kolmogorov's 0-1 law (i.i.d. ⇒ ergodic shift). Apply Birkhoff to f(y) = y_0:

(1/N) Σ_{n=0}^{N−1} Y_n → E[Y_0] = ⟨Y⟩    almost surely.

That is the SLLN. The ergodic theorem is the SLLN for dependent but stationary, ergodic sequences. It applies to autocorrelated data — exactly what MCMC produces.

Why this matters for MCMC

An MCMC sampler is a Markov chain X_0, X_1, X_2, … with stationary distribution π. The chain's transition kernel T preserves π by construction (detailed balance). Suppose the chain is also irreducible and aperiodic. Then the system is ergodic, and Birkhoff applies to any f ∈ L¹(π):

(1/N) Σ_{n=1}^N f(X_n) → ∫ f dπ    π-almost-surely.

So a single long MCMC trajectory yields consistent estimates of E_π[f] for every f. No independence required — autocorrelation between samples is acceptable. This is the theoretical foundation that makes "run the chain, average the samples" rigorous.

The CLT for ergodic Markov chains then gives Monte Carlo error bars: (1/√N) Σ (f(X_n) − E_π[f]) ⇒ Normal(0, σ²) where

σ² = Var_π[f] + 2 Σ_{k=1}^∞ Cov_π(f(X_0), f(X_k)).

The second term — autocovariance — inflates the variance compared with i.i.d. samples. The "effective sample size" is N · Var_π[f] / σ² — typically much less than N for slowly mixing chains.

The mixing hierarchy

Ergodicity is the weakest member of a hierarchy of "randomisation" properties. Stronger to weaker:

Bernoulli. Isomorphic to an i.i.d. shift. Maximum disorder.
K-system (Kolmogorov). Has no zero-entropy factors — knowing the recent past gives no useful prediction.
Mixing. μ(T⁻ⁿ A ∩ B) → μ(A) μ(B) for all measurable A, B — events n steps apart become asymptotically independent.
Weakly mixing. The same convergence holds for "most" n in a density-1 sense.
Ergodic. No non-trivial invariant sets.

Irrational rotations are ergodic but not weakly mixing (orbits are periodic-like, never forget initial conditions). The Bernoulli shift on {0, 1}^ℕ is K, mixing, weakly mixing, and ergodic — the strongest level. Most MCMC samplers we use in practice are ergodic; well-tuned ones are mixing or stronger.

Variants

Von Neumann's mean ergodic theorem (1932). Convergence of A_N f to ∫ f dμ in L²(μ) norm. Easier to prove than Birkhoff but weaker (no pointwise statement). Predates Birkhoff by a few months.
Kingman's subadditive ergodic theorem (1968). Generalises Birkhoff to subadditive sequences (where g_{m+n} ≤ g_m + g_n ∘ T^m). Used to compute Lyapunov exponents in random matrix products and to prove Furstenberg's theorem.
Shannon-McMillan-Breiman theorem. The ergodic theorem applied to log-likelihoods. For an ergodic stationary process, (1/N) log p(X_1, …, X_N) → −h almost surely, where h is the entropy rate. Foundation of information theory.
Hopf's ratio ergodic theorem. Extends Birkhoff to σ-finite (not necessarily finite) invariant measures.
Polynomial / non-conventional ergodic averages. Furstenberg's proof of Szemerédi's theorem uses (1/N) Σ f(T^n x) g(T^{n²} x) — polynomial time averages, with their own convergence theory.
Continuous-time version. For a measure-preserving flow (T_t)_{t ≥ 0}, the time average is (1/T) ∫_0^T f(T_t x) dt; converges to ∫ f dμ a.s. for ergodic flows.

JavaScript — verifying the ergodic theorem on the irrational rotation

// X = [0, 1), T(x) = (x + alpha) mod 1, alpha irrational.
// Indicator of [a, b]. Time average should converge to b - a.
function irrationalRotation(alpha, N, a, b, x0 = 0.0) {
  let x = x0;
  let count = 0;
  const trace = [];
  for (let n = 0; n < N; n++) {
    if (x >= a && x < b) count++;
    if ((n + 1) % 1000 === 0) trace.push({ n: n + 1, timeAvg: count / (n + 1) });
    x = (x + alpha) % 1;
  }
  return { finalAvg: count / N, expected: b - a, trace };
}

const alpha = (Math.sqrt(5) - 1) / 2;  // golden ratio fractional
console.log(irrationalRotation(alpha, 100000, 0.2, 0.7));
// finalAvg ≈ 0.500, expected = 0.500.  Convergence is O((log N)/N) — much
// faster than 1/√N i.i.d. — because rotations have low discrepancy.

// Monte Carlo estimate of E_pi[f] from MCMC samples — ergodic-theorem in
// practice. Pass the chain output and any function of state.
function ergodicAverage(samples, f) {
  let sum = 0;
  for (const x of samples) sum += f(x);
  return sum / samples.length;
}

Proof sketch — the maximal ergodic inequality

Birkhoff's proof reduces to the maximal ergodic theorem: for f ∈ L¹(μ) and any λ > 0,

μ({x : sup_N A_N f(x) > λ}) ≤ (1/λ) ∫_{sup A_N f > λ} f dμ.

The supremum is over all averages along the orbit. This is the analogue of the Hardy-Littlewood maximal inequality. Once you have it, Birkhoff follows from a covering argument: the limsup and liminf of A_N f are both T-invariant; the maximal inequality lets you sandwich them between ⟨f⟩ ± ε for any ε > 0; hence they coincide and equal ⟨f⟩.

Modern proofs use Garsia's "Sun rise" lemma or, in the L² case, the spectral theorem applied to the unitary Koopman operator U_T f = f ∘ T on L²(μ). Decomposing U_T into eigenspaces gives the L² convergence directly. The pointwise version requires more delicate work to control sets where convergence might fail.

When the ergodic theorem applies

Statistical mechanics. Replacing ensemble averages by time averages — every textbook thermodynamics calculation tacitly invokes ergodicity (or microcanonical equivalence).
MCMC justification. A single long Markov chain gives consistent estimates of any posterior expectation, no independence required.
Random matrix products. Lyapunov exponents — long-run growth rates of products of i.i.d. matrices — are computed via Kingman's subadditive ergodic theorem.
Information theory. Entropy rate is a time average of log-likelihood (Shannon-McMillan-Breiman). Used in source coding bounds and data-compression theory.
Number theory. Equidistribution of orbits (Weyl), normal numbers (Borel), Furstenberg's proof of Szemerédi's theorem — all use ergodic theorems.
Stochastic differential equations. Ergodic SDEs (Ornstein-Uhlenbeck, Langevin) admit ergodic theorems with respect to their stationary distribution. Foundation of stochastic optimisation theory.
Reinforcement learning. Stochastic approximation (Robbins-Monro) and Q-learning convergence proofs invoke ergodic theorems for the underlying Markov chain.

Common pitfalls

Assuming ergodicity without checking. Many natural dynamical systems are not ergodic — KAM tori in Hamiltonian mechanics, reducible Markov chains, periodic chains. Time averages then depend on starting point.
Confusing ergodicity with mixing. Irrational rotations are ergodic but not mixing — they have "memory" of initial conditions in a precise sense.
Ignoring convergence rate. The ergodic theorem gives convergence but no a priori rate. For Markov chains with slow mixing, the chain may take 10⁶ iterations to give one effective sample.
Forgetting the L¹ requirement. Birkhoff needs f ∈ L¹(μ). Without integrability the time averages can fail to converge (Cauchy-distributed observables, for example).
Treating ergodicity as observable. Whether a real system (a physical fluid, a Markov chain run for finite N) is ergodic is hypothesis, not measurement. Empirical mixing tests bound but do not certify.
Boltzmann's hypothesis being literally true. KAM theorem (1954) showed that integrable Hamiltonian systems are not ergodic and that nearby systems retain large non-ergodic regions. Statistical mechanics works in practice because of approximate, not exact, ergodicity for the relevant observables.

Applications

Statistical mechanics — the original motivation

Boltzmann (1870s) needed time-averaged quantities (pressure, temperature) along a trajectory of N gas molecules to equal ensemble averages (integrals over phase space with the Maxwell-Boltzmann distribution). He hypothesised — without proof — that the time average and ensemble average coincide. Birkhoff's theorem provides the rigorous version, conditional on ergodicity. Rigorous proofs of ergodicity for specific Hamiltonians (Sinai billiards 1970, hard-sphere gases) followed.

MCMC and Bayesian computation

Every Bayesian posterior expectation E_π[f] computed by MCMC is justified by the ergodic theorem. Without it, one would have to argue that successive samples are independent (they aren't) or that Monte Carlo with autocorrelated samples is somehow consistent (Birkhoff makes it so).

Information theory — entropy rate

For an ergodic stationary source (X_1, X_2, …) with finite alphabet, the entropy rate h = lim (1/N) H(X_1, …, X_N) governs achievable compression. Shannon-McMillan-Breiman strengthens this to almost-sure convergence: (−1/N) log p(X_1, …, X_N) → h for almost every realisation. Asymptotic equipartition — the basis of lossless compression bounds.

Random matrix products and Lyapunov exponents

For an i.i.d. sequence of d×d matrices M_n, the Lyapunov exponent is λ = lim (1/n) log ||M_n M_{n−1} ⋯ M_1||. Kingman's subadditive ergodic theorem proves the limit exists a.s. for any stationary ergodic matrix sequence — used in dynamical systems, products of random transfer matrices in disordered systems, and stability of stochastic SDEs.

Number theory — equidistribution

Weyl's equidistribution theorem (1916): if α is irrational, {nα mod 1} is equidistributed in [0, 1). The ergodic theorem for the irrational rotation generalises this to all integrable functions: time averages along the orbit equal Lebesgue averages. Furstenberg's ergodic-theoretic proof of Szemerédi's theorem (arithmetic progressions in dense sets of integers) uses much deeper variants.

Frequently asked questions

What does 'ergodic' actually mean?

A measure-preserving dynamical system (X, T, μ) is ergodic if every T-invariant measurable subset A ⊆ X has measure 0 or 1. Equivalently, every T-invariant L¹ function is constant μ-almost surely. The intuition: there are no non-trivial "islands" that the dynamics leaves untouched. The orbit of almost every starting point visits every region of positive measure infinitely often, with frequency proportional to that region's measure. For Markov chains, ergodicity = irreducibility + aperiodicity. For physical systems, ergodicity is a strong hypothesis — Boltzmann's H-theorem assumed it; rigorous proof for realistic systems is hard.

What's the difference between Birkhoff and von Neumann's theorems?

Both compute time averages of f, but they converge in different senses. Birkhoff's pointwise theorem (1931) says (1/N) Σ f(T^n x) converges μ-almost-surely. Von Neumann's mean ergodic theorem (1932) — proved first historically — says the same averages converge in L²(μ) norm. The L² convergence is weaker (easier to prove, doesn't pin down behaviour on null sets) but pointwise is what physics needs (an individual experiment trace converging). For ergodic systems, both limits equal ∫ f dμ. For non-ergodic systems, the limit is E[f | I] — conditional expectation given the σ-algebra I of invariant sets — under both theorems.

How does the ergodic theorem justify MCMC?

MCMC constructs a Markov chain whose stationary distribution is the target π. If the chain is ergodic (irreducible and aperiodic), the ergodic theorem says (1/N) Σ_{n=1}^N f(X_n) → E_π[f] = ∫ f dπ almost surely, for any L¹ function f. So we don't need independent samples — a single long trajectory yields consistent estimates of any posterior expectation, mean, variance, quantile, or marginal. This is the entire theoretical foundation for reading MCMC output as samples from π. The Central Limit Theorem for ergodic Markov chains then gives Monte Carlo error bars: the sample mean is asymptotically Normal with variance σ²/N, where σ² incorporates autocorrelation.

Is every Markov chain ergodic?

No. Reducible chains (some states unreachable from others) have multiple invariant measures — the time average depends on starting state. Periodic chains (forced cycles) don't converge to a stationary distribution at all; their long-run distribution oscillates. For finite-state Markov chains, ergodic = irreducible + aperiodic, and there is a unique stationary distribution. For continuous-state chains, you need additional regularity (Harris recurrence) to get ergodic theorems. Practical MCMC samplers (Metropolis-Hastings with positive proposal density, Gibbs with positive conditionals) are designed to be ergodic by construction.

How does ergodicity differ from mixing?

Mixing is strictly stronger than ergodicity. A system is mixing if μ(T^{-n} A ∩ B) → μ(A) μ(B) as n → ∞ for all measurable A, B — the dynamics "forgets" initial conditions in the limit. Mixing implies ergodicity but not vice versa (an irrational rotation of the circle is ergodic but not mixing). The hierarchy is: Bernoulli (i.i.d.) ⊂ K-systems ⊂ mixing ⊂ weakly mixing ⊂ ergodic. Stronger properties give faster convergence to equilibrium. In MCMC, mixing speed (often quantified by the spectral gap of the transition kernel) governs how many iterations are needed for the chain to forget its starting state.

What's the connection to statistical mechanics?

Boltzmann (1870s) hypothesised that the time average of a thermodynamic observable along a single trajectory of a large mechanical system equals its ensemble average (microcanonical, weighted by the invariant measure on energy shell). This is the "ergodic hypothesis" — the existence of a single number ⟨A⟩_time = ⟨A⟩_ensemble for all observables A. Birkhoff's theorem rigorously establishes this provided the system is ergodic with respect to the invariant measure. Whether a given Hamiltonian system is actually ergodic is a hard problem — proven for some examples (Sinai billiards), false for others (integrable systems). The KAM theorem shows many physical systems are non-ergodic on a positive-measure subset of phase space — but the ergodic hypothesis is robust enough that statistical mechanics works in practice.

Ergodic theorems — variants and predecessors

Time-average results across mathematics.

Theorem	Year	Convergence sense	Setting	What it gives
SLLN (Borel/Kolmogorov)	1909/1933	Almost sure	i.i.d. random variables	(1/N) Σ Y_n → E[Y]
Von Neumann mean ergodic	1932	L²(μ)	Measure-preserving T	A_N f → ∫ f dμ in L²
Birkhoff pointwise ergodic	1931	Almost sure	Measure-preserving T, f ∈ L¹	A_N f → ∫ f dμ a.s.
Kingman subadditive ergodic	1968	Almost sure	Subadditive g_n ≤ g_m + g_{n−m} ∘ T^m	(1/n) g_n → constant a.s.
Shannon-McMillan-Breiman	1948–58	Almost sure	Ergodic stationary process	−(1/n) log p(X_1, …, X_n) → h
Hopf ratio ergodic	1937	Almost sure	σ-finite invariant measure	Σ f(T^n x) / Σ g(T^n x) → ∫ f / ∫ g
Furstenberg multiple ergodic	1977	L²	Polynomial powers T^p(n)	Proves Szemerédi's theorem