Central Limit Theorem

Why the bell curve appears everywhere: sums of many independent effects become approximately normal

The Central Limit Theorem says that sums (and averages) of many independent random variables with finite variance tend toward a normal distribution, regardless of the shape of the original distributions. This is why bell curves appear in heights, test scores, measurement errors, and physical processes: each is roughly a sum of many small random effects. It is arguably the most consequential theorem in statistics.

  • Statement: the sample mean of n i.i.d. variables is approximately N(μ, σ²/n) for large n
  • Required conditions: independent, identically distributed, finite variance
  • Rate of convergence: the approximation error shrinks like 1/√n (Berry-Esseen bound, given a finite third moment)
  • Rule of thumb: n ≥ 30 is often enough; skewed or heavy-tailed data need more
  • First proven: Laplace (1810), generalized in the early 1900s (Lyapunov, Lindeberg)
  • Generalization: stable distributions (Lévy) for heavy tails

The theorem

Let X₁, X₂, ..., Xₙ be independent and identically distributed (i.i.d.) random variables with mean μ and finite variance σ². Then the standardized sample mean:

(X̄ − μ) / (σ/√n)

converges in distribution to the standard normal N(0, 1) as n → ∞. Equivalently:

X̄ ~ approximately N(μ, σ²/n)  for large n

The sample mean is approximately normal, centered at the true mean, with variance shrinking as 1/n.
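The standardization above is a one-line computation. A minimal sketch, assuming μ and σ are known; `standardizedMean` is an illustrative helper name, not a library function:

```javascript
// Standardize a sample mean: z = (x̄ − μ) / (σ / √n).
// mu and sigma are the population mean and standard deviation, assumed known.
function standardizedMean(samples, mu, sigma) {
  const n = samples.length;
  const xbar = samples.reduce((s, x) => s + x, 0) / n;
  return (xbar - mu) / (sigma / Math.sqrt(n));
}

// Four made-up observations from a population with mean 50 and σ = 10.
const z = standardizedMean([48, 55, 61, 52], 50, 10);
console.log(z.toFixed(2)); // sample mean is 54, so z = 4 / (10 / 2) = 0.80
```

For large n, the theorem says this z value behaves like a draw from N(0, 1), whatever the shape of the source distribution.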

Why this is consequential

Without CLT, statistics would be very different. Consider:

  • You can compute confidence intervals for population means using normal-based math, even when the underlying data isn't normal.
  • Hypothesis tests (t-test, z-test) remain approximately valid on non-normal data, as long as samples are reasonably sized.
  • Quality control and Six Sigma work because process averages are approximately normal.
  • Polling: the sample proportion of yes/no responses (a scaled binomial) is approximately normal once the sample contains enough of both answers, which justifies margin-of-error calculations. For proportions near 0 or 1, n = 30 is not enough.
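The polling case can be sketched in a few lines. This is an illustration under the normal approximation; `marginOfError` and `normalApproxLooksReasonable` are hypothetical helper names, and the "at least 10 of each answer" check is a common textbook heuristic rather than part of the theorem:

```javascript
// 95% margin of error for a sample proportion, from the normal approximation:
// z * sqrt(p(1 − p) / n). The default z = 1.96 is the usual 95% quantile.
function marginOfError(pHat, n, z = 1.96) {
  return z * Math.sqrt((pHat * (1 - pHat)) / n);
}

// Heuristic (an assumption here): expect at least minCount "yes" and
// minCount "no" responses before trusting the normal approximation.
function normalApproxLooksReasonable(pHat, n, minCount = 10) {
  return n * pHat >= minCount && n * (1 - pHat) >= minCount;
}

console.log(marginOfError(0.5, 1000).toFixed(3));    // 0.031, about ±3.1 points
console.log(normalApproxLooksReasonable(0.5, 30));   // true: 15 expected of each
console.log(normalApproxLooksReasonable(0.02, 30));  // false: 0.6 expected "yes"
```

The second helper shows why a fixed sample-size threshold is not enough on its own: what matters is how many responses of each kind the sample is expected to contain.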

The phrase "the data is approximately normal" is often shorthand for "we'll average many of them, and CLT applies." Individual data points may be wildly non-normal; their averages are much closer to normal.

Worked examples

Example 1 — coin flips

X₁, ..., Xₙ are independent coin flips (Bernoulli). Each Xᵢ is 1 with prob 1/2, 0 with prob 1/2. Mean = 0.5, variance = 0.25.

For n = 100 flips, the sample mean (= proportion of heads) is approximately N(0.5, 0.25/100) = N(0.5, 0.0025). Standard deviation = 0.05.

So 95% of the time, the proportion of heads in 100 flips falls within 1.96 × 0.05 = 0.098 of 0.5 — i.e., in [0.402, 0.598]. The actual binomial bounds are very close to this CLT prediction.
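The CLT interval can be compared against the exact binomial probability. The sketch below builds the binomial probabilities with a simple recurrence; `binomialRangeProb` is an illustrative name, not a library function:

```javascript
// Exact P(kLo <= K <= kHi) for K ~ Binomial(n, p), built up from P(K = 0)
// with the recurrence P(K = k + 1) = P(K = k) * (n − k)/(k + 1) * p/(1 − p).
function binomialRangeProb(n, p, kLo, kHi) {
  let pmf = Math.pow(1 - p, n); // P(K = 0)
  let total = 0;
  for (let k = 0; k <= kHi; k++) {
    if (k >= kLo) total += pmf;
    pmf *= ((n - k) / (k + 1)) * (p / (1 - p));
  }
  return total;
}

const n = 100, p = 0.5;
const se = Math.sqrt((p * (1 - p)) / n);  // 0.05
const half = 1.96 * se;                   // 0.098
const kLo = Math.ceil((p - half) * n);    // 41 heads
const kHi = Math.floor((p + half) * n);   // 59 heads
const exact = binomialRangeProb(n, p, kLo, kHi);
console.log(`P(${kLo} <= heads <= ${kHi}) = ${exact.toFixed(3)}`);
```

The exact probability comes out at roughly 0.94, slightly under the nominal 95% because only whole head counts (41 through 59) fit inside the interval.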

Example 2 — sample mean from any distribution

Take 30 i.i.d. samples from any distribution with mean 50 and variance 100 (σ = 10). The sample mean is approximately N(50, 100/30) = N(50, 3.33). Standard error = √3.33 ≈ 1.83.

So about 95% of sample means (over many resamples) fall in 50 ± 2(1.83) ≈ [46.3, 53.7]. This holds approximately whether the original distribution was uniform, exponential, beta, or anything else with the same mean and variance, although strongly skewed sources are still slightly off at n = 30.
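The arithmetic in this example generalizes to any mean, standard deviation, and sample size. A small sketch, with `sampleMeanInterval` as a hypothetical helper name and z = 2 chosen to mirror the rounding used in the text:

```javascript
// Approximate central interval for the sample mean of n i.i.d. draws:
// mu ± z * sigma / sqrt(n).
function sampleMeanInterval(mu, sigma, n, z = 2) {
  const se = sigma / Math.sqrt(n);
  return { se, lower: mu - z * se, upper: mu + z * se };
}

const { se, lower, upper } = sampleMeanInterval(50, 10, 30);
console.log(se.toFixed(2), lower.toFixed(2), upper.toFixed(2)); // 1.83 46.35 53.65
```

The endpoints can differ from hand arithmetic in the second decimal place because the code does not round the standard error before multiplying.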

Example 3 — sum of dice

Roll 100 dice and sum. Each die has mean 3.5, variance 2.917. Sum has mean 350, variance 291.7, σ ≈ 17.

By CLT, the sum is approximately N(350, 291.7). 95% of sums fall in 350 ± 2(17) = [316, 384]. Try this experimentally — actual results match the CLT prediction very closely. Even with discrete (not normal) inputs, the sum of 100 is essentially normal.
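Instead of rolling dice, the claim can be checked against the exact distribution of the sum, which is cheap to compute by repeated convolution. A sketch, with `diceSumPmf` as an illustrative name:

```javascript
// Exact distribution of the sum of `count` fair six-sided dice, by repeated
// convolution. pmf[s] is P(sum = s); we start from the empty sum (always 0).
function diceSumPmf(count) {
  let pmf = [1];
  for (let d = 0; d < count; d++) {
    const next = new Array(pmf.length + 6).fill(0);
    for (let s = 0; s < pmf.length; s++) {
      for (let face = 1; face <= 6; face++) next[s + face] += pmf[s] / 6;
    }
    pmf = next;
  }
  return pmf;
}

const pmf = diceSumPmf(100);
let inside = 0;
for (let s = 316; s <= 384; s++) inside += pmf[s];
console.log(`P(316 <= sum <= 384) = ${inside.toFixed(3)}`);
```

The exact probability lands close to the 95% that the normal approximation predicts for a band of two standard deviations.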

Conditions for CLT

The classical CLT requires:

  1. Independence. The Xᵢ are independent. The classical CLT does not apply to correlated samples (time series, spatial data with structure); the usual σ/√n standard error is then wrong, typically too small under positive correlation.
  2. Identical distribution. All Xᵢ have the same distribution. Generalizations such as the Lyapunov and Lindeberg-Feller CLTs relax this for non-identically distributed variables, provided no single variable dominates the sum.
  3. Finite variance. Var(Xᵢ) = σ² < ∞. For infinite-variance distributions (such as Cauchy), the CLT fails: suitably rescaled sums converge to a non-normal stable distribution instead.
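The finite-variance condition is easy to see fail in simulation. The sketch below compares the spread of sample means for a uniform source and a standard Cauchy source; it uses the interquartile range because the Cauchy has no standard deviation. The helper name `iqrOfSampleMeans` and the trial count are illustrative choices:

```javascript
// Standard Cauchy draw via the inverse CDF: tan(π(U − 1/2)).
const cauchy = () => Math.tan(Math.PI * (Math.random() - 0.5));
// Uniform[0, 1) for comparison (finite variance).
const uniform = () => Math.random();

// Interquartile range of `trials` sample means, each computed over n draws.
function iqrOfSampleMeans(n, sampler, trials = 2000) {
  const means = [];
  for (let t = 0; t < trials; t++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += sampler();
    means.push(sum / n);
  }
  means.sort((a, b) => a - b);
  return means[Math.floor(0.75 * trials)] - means[Math.floor(0.25 * trials)];
}

for (const n of [10, 1000]) {
  console.log(`n=${n}: uniform IQR ${iqrOfSampleMeans(n, uniform).toFixed(3)}, ` +
              `Cauchy IQR ${iqrOfSampleMeans(n, cauchy).toFixed(3)}`);
}
```

The uniform IQR should shrink by about a factor of ten between n = 10 and n = 1000, while the Cauchy IQR stays near 2: the mean of n standard Cauchy draws has the same distribution as a single draw, so averaging buys nothing.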

For practical applications:

  • n ≥ 30 is a rule of thumb for the normal approximation to kick in.
  • Symmetric distributions converge faster (n = 10 may suffice).
  • Heavy-tailed or highly skewed distributions need larger n. Some require thousands of samples.
  • When in doubt, check the sampling distribution of the mean directly: simulate or bootstrap many sample means and inspect them with a Q-Q plot or a normality test (Shapiro-Wilk).

JavaScript — observing CLT in action

// Sample mean of n draws from an arbitrary sampler function
function sampleMean(n, sampler) {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += sampler();
  return sum / n;
}

// Compute many sample means and look at their distribution
function cltDemo(n, samplerName, sampler, expectedMean, expectedSE) {
  const numTrials = 10000;
  const means = Array.from({length: numTrials}, () => sampleMean(n, sampler));

  const observedMean = means.reduce((s, x) => s + x, 0) / numTrials;
  const variance = means.reduce((s, x) => s + (x - observedMean)**2, 0) / numTrials;
  const observedSE = Math.sqrt(variance);

  console.log(`${samplerName} (n=${n})`);
  console.log(`  expected mean: ${expectedMean}, observed: ${observedMean.toFixed(3)}`);
  console.log(`  expected SE:   ${expectedSE.toFixed(3)}, observed: ${observedSE.toFixed(3)}`);

  // Rough normality check: proportion of standardized means within 1 and 2 SEs
  const zScores = means.map(m => (m - observedMean) / observedSE);
  const within1 = zScores.filter(z => Math.abs(z) < 1).length / numTrials;
  const within2 = zScores.filter(z => Math.abs(z) < 2).length / numTrials;
  console.log(`  within 1 SE: ${(within1*100).toFixed(1)}% (expected 68.3%)`);
  console.log(`  within 2 SE: ${(within2*100).toFixed(1)}% (expected 95.4%)`);
  console.log();
}

// Uniform [0, 1] — mean 0.5, var 1/12
cltDemo(30, 'Uniform', () => Math.random(), 0.5, Math.sqrt(1/12) / Math.sqrt(30));

// Exponential — mean 1, var 1
const expSampler = () => -Math.log(1 - Math.random());
cltDemo(30, 'Exponential', expSampler, 1, 1 / Math.sqrt(30));

// Bernoulli with p = 0.1: strongly skewed, so it needs more samples
// (cltDemo already appends n to the label, so the label omits it)
const bernSampler = () => Math.random() < 0.1 ? 1 : 0;
cltDemo(30, 'Bernoulli (p=0.1)', bernSampler, 0.1, Math.sqrt(0.09)/Math.sqrt(30));
cltDemo(200, 'Bernoulli (p=0.1)', bernSampler, 0.1, Math.sqrt(0.09)/Math.sqrt(200));

Generalizations of CLT

| Theorem | What it relaxes | Result |
|---|---|---|
| Classical CLT | Nothing (baseline) | I.i.d. + finite variance → normal |
| Lyapunov CLT | Identical distribution | Non-identical Xᵢ satisfying a moment (Lyapunov) condition → normal |
| Lindeberg-Feller CLT | Identical distribution | Non-identical Xᵢ satisfying the Lindeberg condition → normal |
| Multivariate CLT | Single dimension | Vector sums → multivariate normal |
| Martingale CLT | Independence | Martingale difference sequences → normal |
| Generalized (stable) CLT | Finite variance | Heavy-tailed → stable Lévy distributions |

The classical CLT is the simplest case. Real-world data often has dependencies, non-identical distributions, or heavy tails. The generalizations cover those situations, usually still with a normal limit and, for heavy tails, with a stable one.

Where CLT is essential

  • Hypothesis testing. The t-test, z-test, ANOVA — all assume sample means are approximately normal, justified by CLT.
  • Confidence intervals. "Mean ± 1.96 × SE" gives a 95% CI for population means. Valid because sample means are approximately normal (by CLT) for sufficiently large samples.
  • Polling and surveys. Margin of error in polls is computed using CLT — sample proportions of binary responses are approximately normal.
  • Machine learning model evaluation. Cross-validation accuracy, A/B test results — all rely on CLT to interpret average performance metrics.
  • Quality control / Six Sigma. Process averages are approximately normal by CLT; deviations from spec then follow normal-based probabilities.
  • Physics (Brownian motion). A particle's displacement is the sum of many small kicks, so its displacement after a fixed time is normally distributed, with variance growing in proportion to elapsed time.
  • Modeling measurement errors. Each measurement error is a sum of many small independent factors — Gaussian by CLT. Justifies "errors are normally distributed" assumption in regression.
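The "mean ± 1.96 × SE" recipe from the list above can be sketched as follows. The function name `meanConfidenceInterval` and the data are made up for illustration; the interval is only as good as the normal approximation for the sample at hand:

```javascript
// Normal-approximation 95% confidence interval for a population mean:
// x̄ ± z * s / √n, with s the sample standard deviation (n − 1 denominator).
function meanConfidenceInterval(data, z = 1.96) {
  const n = data.length;
  const mean = data.reduce((s, x) => s + x, 0) / n;
  const sampleVar = data.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1);
  const se = Math.sqrt(sampleVar / n);
  return { mean, se, lower: mean - z * se, upper: mean + z * se };
}

// Made-up, right-skewed "response times" in ms (40 values).
const times = Array.from({ length: 40 }, (_, i) => 100 + (i % 8) ** 2 * 5);
const ci = meanConfidenceInterval(times);
console.log(`${ci.mean.toFixed(1)} ms, 95% CI [${ci.lower.toFixed(1)}, ${ci.upper.toFixed(1)}]`);
```

For small samples it is more usual to replace 1.96 with a t quantile; the fixed z value is kept here to match the formula in the text.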

Common mistakes

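  • Assuming individual values are normal. CLT applies to sample MEANS, not individual observations. Heights are roughly normal because each height is itself the sum of many small effects; daily stock returns are not, despite aggregating many trades, because those trades are neither independent nor light-tailed.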
  • Forgetting the independence requirement. Time series with strong autocorrelation, observations from the same group/cluster — these violate independence, and standard CLT doesn't apply.
  • Using CLT for heavy-tailed distributions. Cauchy and similar distributions don't have finite variance; their sample means don't converge to normal. Use stable-distribution alternatives.
  • Trusting CLT for small samples without checking. n ≥ 30 is a rough guide. For very skewed or heavy-tailed underlying distributions, larger n is needed. Always verify when sample sizes are borderline.
  • Confusing CLT with LLN. LLN — sample mean converges to true mean. CLT — distribution of sample mean approaches normal. Different statements; both are needed for inference.
  • Forgetting that CLT describes a distribution, not a single trial. The distribution of sample means (over many resamples) is approximately normal; the sample mean from one experiment is a single draw from that distribution.
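The independence mistake can be made concrete with a simulation. The sketch below generates autocorrelated AR(1) series and compares the actual spread of their sample means with the naive i.i.d. standard error s/√n. The names `ar1Series` and `seInflation`, the coefficient 0.9, and the series length are all illustrative assumptions:

```javascript
// AR(1) series x_t = phi * x_{t-1} + e_t with centered uniform noise.
// The first burnIn values are discarded so the series starts near stationarity.
function ar1Series(n, phi, burnIn = 100) {
  let x = 0;
  const out = [];
  for (let t = 0; t < n + burnIn; t++) {
    x = phi * x + (Math.random() - 0.5);
    if (t >= burnIn) out.push(x);
  }
  return out;
}

const mean = (a) => a.reduce((s, x) => s + x, 0) / a.length;
const sd = (a) => {
  const m = mean(a);
  return Math.sqrt(a.reduce((s, x) => s + (x - m) ** 2, 0) / (a.length - 1));
};

// Ratio of the actual spread of sample means to the naive i.i.d. standard error.
function seInflation(phi, n = 500, trials = 2000) {
  const means = [], naive = [];
  for (let t = 0; t < trials; t++) {
    const xs = ar1Series(n, phi);
    means.push(mean(xs));
    naive.push(sd(xs) / Math.sqrt(n));
  }
  return sd(means) / mean(naive);
}

console.log(`phi = 0:   ratio ${seInflation(0).toFixed(2)}`);
console.log(`phi = 0.9: ratio ${seInflation(0.9).toFixed(2)}`);
```

With no autocorrelation the ratio should sit near 1. With phi = 0.9, large-sample theory for AR(1) gives a ratio of about √((1 + phi)/(1 − phi)) ≈ 4.4, so a confidence interval built from s/√n would be several times too narrow.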

Frequently asked questions

What does the central limit theorem actually say?

If X₁, X₂, ..., Xₙ are independent random variables drawn from the SAME distribution with mean μ and finite variance σ², then their sample mean (X₁ + ... + Xₙ)/n is approximately N(μ, σ²/n) for large n. Precisely, the standardized mean (X̄ − μ)/(σ/√n) converges in distribution to N(0, 1) as n → ∞. The astonishing part is that this works regardless of the shape of the original distribution, as long as its variance is finite. Bernoulli, exponential, uniform: all give approximately normal sample means after enough samples.

How is CLT different from the Law of Large Numbers?

LLN says sample mean → true mean. CLT says HOW it approaches — the sampling distribution becomes normal centered at the true mean. LLN tells you about convergence; CLT tells you about the shape of the convergence. Both are needed for statistics — LLN justifies estimation, CLT justifies confidence intervals.
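The two statements can be seen side by side in a simulation. The sketch below draws from Uniform[0, 1) (mean 0.5, σ = √(1/12) ≈ 0.289) and summarizes the error x̄ − μ over many trials; `errorSummary` and the trial count are illustrative choices:

```javascript
// For n Uniform[0, 1) draws per trial, summarize the error x̄ − mu:
//   raw     : its standard deviation (LLN: shrinks toward 0 as n grows)
//   scaled  : raw * √n (stays near σ at every n)
//   within1 : share of trials with |x̄ − mu| < σ/√n (CLT: approaches 68.3%)
function errorSummary(n, trials = 2000) {
  const mu = 0.5, sigma = Math.sqrt(1 / 12);
  let sumSq = 0, within1 = 0;
  for (let t = 0; t < trials; t++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += Math.random();
    const err = sum / n - mu;
    sumSq += err ** 2;
    if (Math.abs(err) < sigma / Math.sqrt(n)) within1++;
  }
  const raw = Math.sqrt(sumSq / trials);
  return { raw, scaled: raw * Math.sqrt(n), within1: within1 / trials };
}

for (const n of [10, 100, 1000]) {
  const { raw, scaled, within1 } = errorSummary(n);
  console.log(`n=${n}: raw ${raw.toFixed(4)}, scaled ${scaled.toFixed(4)}, ` +
              `within 1 SE ${(within1 * 100).toFixed(1)}%`);
}
```

The raw column is the LLN at work. The stable scaled column shows that √n is the right magnification, and the within-1-SE share is the part that actually needs the CLT, since it depends on the shape of the distribution and not just its variance.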

When does CLT fail?

The classical CLT fails when variance is infinite (the Cauchy distribution has such heavy tails that neither its mean nor its variance is defined). It also does not apply when variables are dependent (autocorrelated time series) or not identically distributed (heteroscedastic data), although the Lyapunov and Lindeberg-Feller versions can rescue the non-identical case. For infinite-variance distributions, the limit is a stable distribution (Lévy), not normal.

How big does n need to be?

The rule of thumb is n ≥ 30 — but it depends on how non-normal the original distribution is. For symmetric distributions close to normal, n = 5-10 may suffice. For very skewed or heavy-tailed distributions, even n = 100 may not be enough. Always check with simulation or normality tests when applying CLT-based inference.
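One way to put numbers on "how non-normal" is skewness: the skewness of the mean of n i.i.d. draws equals the per-draw skewness divided by √n. The sketch below uses that fact; the helper names and the 0.1 target are arbitrary illustrative choices, and skewness is only one aspect of non-normality (tail weight matters too):

```javascript
// Skewness of the mean of n i.i.d. draws = per-draw skewness / √n.
const skewOfMean = (skew, n) => skew / Math.sqrt(n);

// Smallest n that brings the mean's skewness down to a target magnitude.
// The 0.1 default is an arbitrary illustrative threshold, not a standard.
const minSamplesForSkew = (skew, target = 0.1) =>
  Math.max(1, Math.ceil((Math.abs(skew) / target) ** 2));

// Skewness of a Bernoulli(p) variable: (1 − 2p) / sqrt(p(1 − p)).
const bernoulliSkew = (p) => (1 - 2 * p) / Math.sqrt(p * (1 - p));

console.log(skewOfMean(2, 30).toFixed(3));           // exponential (skewness 2) at n = 30
console.log(minSamplesForSkew(2));                   // exponential
console.log(minSamplesForSkew(bernoulliSkew(0.1)));  // Bernoulli with p = 0.1
console.log(minSamplesForSkew(0));                   // any symmetric source
```

Under this (assumed) threshold an exponential source needs n in the hundreds and a Bernoulli(0.1) source needs more still, while a symmetric source passes immediately, which matches the qualitative advice above.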

How does CLT affect machine learning?

Many ML methods assume Gaussian noise. Linear regression inference assumes normally distributed errors, which the CLT justifies when each error is the sum of many small independent sources. Hypothesis tests on model performance apply normal-based t-tests to averaged metrics. The Gaussian assumption is so pervasive partly because the CLT makes it approximately right for averages over large datasets.

Why is the CLT so important historically?

Before CLT, statisticians couldn't justify using normal-based methods on non-normal data. CLT showed that for sample means (the most common statistic), the normal approximation works regardless. This gave statistical methods their modern foundation — confidence intervals, hypothesis tests, ANOVA — all rely on CLT.

What's the connection between CLT and entropy?

The normal distribution is the maximum-entropy distribution given fixed mean and variance. An information-theoretic reading of the CLT is that when you sum many independent variables and standardize the result, you're effectively "mixing" their information, and the result tends toward maximum entropy subject to the fixed mean and variance. The bell curve is the equilibrium of randomness; CLT is the dynamic that gets us there.
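The maximum-entropy claim can be checked with closed-form differential entropies. The sketch below compares three distributions at the same standard deviation; the function names are illustrative, and entropies are in nats:

```javascript
// Differential entropy (in nats) of three distributions with the same variance.
// Normal: (1/2) ln(2πe σ²).
const normalEntropy = (sigma) => 0.5 * Math.log(2 * Math.PI * Math.E * sigma ** 2);
// A uniform with standard deviation sigma has width sigma * √12; entropy = ln(width).
const uniformEntropy = (sigma) => Math.log(sigma * Math.sqrt(12));
// An exponential with standard deviation sigma has mean sigma; entropy = 1 + ln(sigma).
const exponentialEntropy = (sigma) => 1 + Math.log(sigma);

const sigma = 1;
console.log(normalEntropy(sigma).toFixed(3));      // 1.419
console.log(uniformEntropy(sigma).toFixed(3));     // 1.242
console.log(exponentialEntropy(sigma).toFixed(3)); // 1.000
```

Differential entropy does not change when a distribution is shifted, so only the variance needs to match for the comparison to be fair. At equal variance the normal has the largest entropy of the three.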