Probability

Law of Large Numbers

Q: What's the difference between weak and strong LLN?

Weak LLN says X̄ₙ → μ in probability: for every ε > 0, P(|X̄ₙ − μ| > ε) → 0 as n → ∞. Strong LLN says X̄ₙ → μ almost surely: P(lim X̄ₙ = μ) = 1 — there is no exceptional event of positive probability where the limit fails. Almost sure convergence is strictly stronger; it implies in-probability convergence but not vice versa. The classic example: a sequence Y_n that is 0 except on a set of probability 1/n where it is 1 converges in probability to 0 but not a.s. (by Borel-Cantelli).

(X̄ₙ) → μ as n → ∞: weak (in probability), strong (almost surely)

The Law of Large Numbers (LLN) states that the sample average X̄ₙ = (X₁ + … + Xₙ)/n of i.i.d. random variables with finite mean μ converges to μ as n → ∞. Two strengths: Weak LLN (Bernoulli 1713) — X̄ₙ → μ in probability: P(|X̄ₙ − μ| > ε) → 0. Strong LLN (Kolmogorov 1933) — X̄ₙ → μ almost surely: P(lim X̄ₙ = μ) = 1. The strong LLN is genuinely stronger; e.g. for Cauchy distribution, neither holds because E[X] doesn't exist. Foundation of statistical estimation: empirical mean, frequentist probability interpretation, Monte Carlo integration. Glivenko–Cantelli theorem extends LLN to empirical CDFs uniformly. Counterexamples: Cauchy distribution, certain Markov chains without ergodicity.

Weak LLNin probability (Bernoulli 1713)
Strong LLNalmost surely (Kolmogorov 1933)
Requiresfinite E[|X|]
Cauchyfails (no mean)
Glivenko–Cantelliempirical CDF uniform
FoundationMonte Carlo, statistics

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Why LLN matters

Statistical inference. Sample mean estimates population mean; sample variance estimates population variance; sample proportion estimates population proportion. Without LLN there's no consistency.
Monte Carlo integration. Average n samples of f(X) to estimate E[f(X)]. LLN guarantees convergence; CLT gives the 1/√n error rate that beats grid methods in high dimensions.
Frequentist probability. "P(heads) = 1/2" is interpreted via long-run frequency: fraction of heads in n flips → 1/2. LLN makes this rigorous — without it the frequentist interpretation is hand-waving.
Empirical distributions. Glivenko–Cantelli says the empirical CDF converges uniformly to the true CDF — foundation of bootstrap, KS tests, and most nonparametric statistics.
Insurance and gambling. The casino's edge of 1.4% on a craps pass-line bet means a long-run profit of roughly 1.4% of total wagers — guaranteed by LLN, not luck.
Machine learning. Empirical risk minimization works because empirical loss converges to true loss by the LLN; uniform versions (like Vapnik–Chervonenkis) extend this to function classes.
Ergodic theorems. Birkhoff ergodic theorem extends LLN to time averages of ergodic dynamical systems — connecting probability with statistical mechanics.

Formal statements

Let X₁, X₂, ... be i.i.d. with E[|X|] < ∞ and E[X] = μ. Define the sample mean X̄ₙ = (X₁ + ... + Xₙ)/n.

Weak LLN: For every ε > 0,

P(|X̄ₙ − μ| > ε) → 0 as n → ∞

Often written X̄ₙ →ᴾ μ.

Strong LLN: There is a set A with P(A) = 1 such that for every ω ∈ A,

X̄ₙ(ω) → μ as n → ∞

Often written X̄ₙ →ᵃ·ˢ· μ. The strong LLN implies the weak LLN, but not conversely.

Historical thread

Cardano (1564, posthumously published). Informal observation that long runs of dice tend toward expected averages.
Jacob Bernoulli (1713, Ars Conjectandi, posthumous). First rigorous proof of weak LLN for Bernoulli trials. Bernoulli called it the "golden theorem"; took him 20 years.
Poisson (1837). Coined "law of large numbers" (la loi des grands nombres) and extended Bernoulli's result.
Chebyshev (1867). Strong tools (Chebyshev's inequality) to give simpler proofs.
Borel (1909). Strong LLN for Bernoulli trials — the first a.s. statement in probability.
Kolmogorov (1930, 1933). Modern strong LLN for general i.i.d. random variables; proved the converse — finite mean is necessary as well as sufficient.
Etemadi (1981). Strong LLN with only pairwise independence — a far weaker hypothesis than i.i.d.

Proof sketch — weak LLN via Chebyshev

Assume Var(X) = σ² < ∞. Then Var(X̄ₙ) = σ²/n by independence. Chebyshev's inequality gives:

P(|X̄ₙ − μ| > ε) ≤ Var(X̄ₙ) / ε² = σ² / (n ε²) → 0

Two-line proof. Chebyshev is loose — Cramér's tighter exponential bound (using MGFs) gives much faster convergence. The result extends to finite-mean variables with the truncation argument (replace X with X · 1{|X| ≤ a_n} and let a_n grow slowly).

Proof sketch — strong LLN via Kolmogorov

The classical proof uses Kolmogorov's three-series theorem and the Kolmogorov maximal inequality on partial sums. A modern alternative — Etemadi's proof — uses subsequence arguments and a Borel–Cantelli step:

Truncate X_n to Y_n = X_n · 1{|X_n| ≤ n}; show E[Y_n] → μ.
Bound P(X_n ≠ Y_n) using ∑ P(|X| > n) ≤ E[|X|] < ∞.
By Borel–Cantelli, X_n = Y_n eventually a.s.
Use Chebyshev on a geometric subsequence Ȳ_{kᵐ} → μ a.s.
Sandwich to extend a.s. convergence from the subsequence to all n.

Why Cauchy fails

The Cauchy distribution has density f(x) = 1/(π(1 + x²)). Its mean is undefined: ∫₀^∞ x f(x) dx and ∫_{-∞}^0 x f(x) dx are both infinite, with opposite signs. So E[|X|] = ∞ — the LLN hypothesis fails. Worse: if X₁, ..., Xₙ are i.i.d. Cauchy, the average X̄ₙ has the same Cauchy distribution as a single X. Averaging doesn't help; in fact, the standardized sum doesn't converge to anything classical (it converges to a stable distribution of index 1, which is just Cauchy again).

Glivenko–Cantelli theorem

The empirical CDF based on samples X₁, ..., Xₙ is:

F̂ₙ(x) = (1/n) ∑_{i=1}^n 1{Xᵢ ≤ x}

For each fixed x, the LLN gives F̂ₙ(x) → F(x) a.s. (since the indicator has mean F(x)). Glivenko–Cantelli strengthens to uniform convergence:

sup_{x ∈ ℝ} |F̂ₙ(x) − F(x)| → 0  a.s.

Foundation of empirical processes theory, the Kolmogorov–Smirnov test, and bootstrap. Vapnik–Chervonenkis (1971) generalize to indicator sets and arbitrary function classes.

Rate of convergence — meet the CLT

LLN says X̄ₙ → μ but doesn't quantify the rate. The central limit theorem refines it:

√n (X̄ₙ − μ) / σ  →  N(0, 1)  in distribution

So typical deviations |X̄ₙ − μ| are of order σ/√n. The Berry–Esseen theorem makes the rate explicit:

sup_x |P(√n(X̄ₙ − μ)/σ ≤ x) − Φ(x)| ≤ C · ρ / (σ³ √n)

where ρ = E[|X − μ|³] and C is a universal constant (best known bound C ≈ 0.4748). Without further hypotheses, 1/√n is the best possible rate.

Beyond i.i.d.

Etemadi LLN. Pairwise independence + identical distribution + E[|X|] < ∞ ⇒ strong LLN. Independence beyond pairwise isn't needed.
Markov chains. If a chain is irreducible, aperiodic, positive recurrent (i.e., ergodic), then time averages of f(Xₙ) converge a.s. to ∫ f dπ where π is the stationary distribution. This is the Birkhoff ergodic theorem.
Stationary sequences. Mixing conditions give LLN-type results — Rosenblatt mixing, ϕ-mixing, etc.
Martingale-difference sequences. If E[Xₙ | F_{n-1}] = 0 and ∑ E[X_n²]/n² < ∞, then S_n/n → 0 a.s.

Common misconceptions

"Gambler's fallacy." If a coin lands heads 10 times in a row, LLN does not say tails is now more likely. Each new flip is independent; the past doesn't push the average toward 1/2. Future flips average to 1/2; the 10 past heads don't get reversed.
"LLN always works." Requires finite E[|X|]. Cauchy, Pareto with shape ≤ 1, and other heavy-tailed distributions break it.
"Strong LLN means convergence everywhere." Almost surely — there's a measure-zero exception set. For specific ω in that set, X̄ₙ(ω) might never settle.
"Average gets closer to μ each step." Not monotone. X̄ₙ fluctuates around μ; it just gets closer in the limit on average — typical deviation σ/√n.
"Weak LLN is enough." Weak LLN allows occasional rare large deviations to keep happening; strong LLN bans them in the limit. The difference matters in sequential statistical procedures.
"More data always helps." Without finite mean, more data doesn't help (Cauchy). Without independence, more correlated data adds little new information.
"LLN proves the empirical mean is the best estimator." It only proves consistency. Optimality requires additional efficiency criteria — Cramér–Rao bound, MLE asymptotic theory.

Frequently asked questions

What's the difference between weak and strong LLN?

Weak LLN says X̄ₙ → μ in probability: for every ε > 0, P(|X̄ₙ − μ| > ε) → 0 as n → ∞. Strong LLN says X̄ₙ → μ almost surely: P(lim X̄ₙ = μ) = 1 — there is no exceptional event of positive probability where the limit fails. Almost sure convergence is strictly stronger; it implies in-probability convergence but not vice versa. The classic example: a sequence Y_n that is 0 except on a set of probability 1/n where it is 1 converges in probability to 0 but not a.s. (by Borel-Cantelli).

Why does the Cauchy distribution fail LLN?

The Cauchy distribution has density 1/(π(1 + x²)). Its mean ∫ x · 1/(π(1+x²)) dx is undefined (the integrals over (−∞, 0] and [0, ∞) are both infinite, with opposite signs). Without finite E[X], LLN doesn't apply. In fact, X̄ₙ for i.i.d. Cauchy variables has the same Cauchy distribution as a single observation — averaging doesn't help at all. This shows LLN's finite-mean hypothesis is essential.

How is LLN used in Monte Carlo integration?

To compute ∫ f(x) p(x) dx for a probability density p, sample X_1, ..., X_n ~ p i.i.d. and form (1/n) ∑ f(X_i). By the strong LLN, this average converges a.s. to E[f(X)] = ∫ f p dx. The error is O(σ_f / √n) by the central limit theorem — independent of the integrand's dimension. This is why Monte Carlo beats deterministic quadrature in high dimensions: 1/√n error vs 1/n^(1/d) for grid methods.

What is the Glivenko-Cantelli theorem?

The empirical CDF F̂_n(x) = (1/n) ∑ 1{X_i ≤ x} converges to the true CDF F(x) uniformly: sup_x |F̂_n(x) − F(x)| → 0 a.s. This strengthens the LLN — for each fixed x, F̂_n(x) → F(x) by the LLN, but Glivenko-Cantelli gives uniform convergence. Foundation of nonparametric statistics, Kolmogorov-Smirnov tests, and bootstrap methods. Sometimes called the fundamental theorem of statistics.

How fast does X̄ₙ converge to μ (CLT bound)?

The central limit theorem refines the LLN: √n(X̄ₙ − μ) → N(0, σ²) in distribution. So |X̄ₙ − μ| is on the order of σ/√n with Gaussian tails. The Berry-Esseen theorem makes the rate explicit: the distribution of √n(X̄ₙ − μ)/σ is within C E[|X|³]/(σ³ √n) of the standard normal in Kolmogorov distance. Improving from 1/√n is impossible without further assumptions — that's the optimal rate.

Does LLN apply to non-i.i.d. cases?

Yes, with weakened hypotheses. Markov chains satisfying ergodicity have an LLN (the Birkhoff ergodic theorem). Stationary processes with finite mean and certain mixing conditions also satisfy LLN. The Etemadi LLN drops independence and assumes only pairwise independence with identical distribution. Martingale-difference-sequence variants exist. The key is some form of asymptotic independence plus finite mean.