Probability
Chebyshev's Inequality
At most 1/k² of any distribution sits more than k standard deviations from the mean
Chebyshev's inequality says P(|X − μ| ≥ kσ) ≤ 1/k² for any distribution with finite variance. At k = 2 that's 25%, at k = 3 only 11.1% — universal, distribution-free, no shape assumed.
- StatementP(|X − μ| ≥ kσ) ≤ 1/k²
- AuthorPafnuty Chebyshev, 1867
- Proof toolMarkov applied to (X − μ)²
- k=2 bound≤ 25% (vs Normal's 4.6%)
- k=3 bound≤ 11.1% (vs Normal's 0.27%)
- Provesweak law of large numbers
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The inequality
For any random variable X with finite mean μ = E[X] and finite variance σ² = Var(X), and any real number k > 0:
P(|X − μ| ≥ kσ) ≤ 1/k²
Equivalently, setting t = kσ:
P(|X − μ| ≥ t) ≤ σ²/t²
No assumptions about the distribution's shape — discrete, continuous, skewed, multimodal, fat-tailed — the bound holds. This universality is what makes Chebyshev's inequality the workhorse of probability theory.
Concrete numbers
| k | Chebyshev bound 1/k² | Normal P(|Z| ≥ k) | Looseness factor |
|---|---|---|---|
| 1 | ≤ 100% (vacuous) | 31.7% | — |
| 2 | ≤ 25% | 4.6% | 5.5× |
| 3 | ≤ 11.1% | 0.27% | 41× |
| 4 | ≤ 6.25% | 0.0063% | 990× |
| 5 | ≤ 4% | 0.000057% | 70,000× |
| 10 | ≤ 1% | 10⁻²³ | 10²¹× |
Chebyshev is loose for nice (light-tailed) distributions and tight for adversarial ones. The middle column is the "tight" bound for a Normal; Chebyshev pays for not knowing the shape.
Proof sketch
Apply Markov's inequality to the non-negative random variable Y = (X − μ)² with threshold a = (kσ)²:
P((X − μ)² ≥ (kσ)²) ≤ E[(X − μ)²] / (kσ)²
= σ² / (k²σ²)
= 1/k²
The event {(X − μ)² ≥ (kσ)²} is identical to {|X − μ| ≥ kσ}, giving Chebyshev. Two lines, one application of Markov. The same recipe — apply Markov to a transformed non-negative quantity — produces Chernoff bounds (apply to e^{tX}), Cantelli's one-sided bound (apply with a translation), and Bernstein's inequality.
Worked example: investment returns
Suppose a stock has monthly return X with μ = 1% and σ = 4%. What's the maximum probability of a monthly loss exceeding 7%?
A loss of 7% means X ≤ −7%.
|X − μ| = |X − 1%| ≥ 8% when X ≤ −7%.
8% / σ = 8% / 4% = 2 standard deviations away.
P(|X − μ| ≥ 2σ) ≤ 1/2² = 25%
So the probability of a ≥ 7% monthly loss is at most 25%, regardless of return distribution.
If returns are Normal the actual probability would be far lower (~2%). But for skewed, heavy-tailed financial data, Chebyshev's 25% is the only universally valid worst-case bound — and that conservatism is exactly why risk managers care.
Weak law of large numbers — in three lines
Let X₁, …, Xₙ be i.i.d. with mean μ and variance σ². Define the sample mean X̄ₙ = (1/n)·ΣXᵢ. Then:
E[X̄ₙ] = μ (linearity)
Var(X̄ₙ) = σ²/n (i.i.d. variances add)
Chebyshev: P(|X̄ₙ − μ| ≥ ε) ≤ Var(X̄ₙ)/ε² = σ²/(nε²) → 0 as n → ∞.
So X̄ₙ → μ in probability — the weak law of large numbers. Bernoulli's 1713 proof took pages of combinatorics; Chebyshev's 1867 inequality collapsed it to a paragraph and revealed why the result is essentially about variance shrinking under averaging.
Markov vs Chebyshev vs Chernoff vs Hoeffding
| Bound | Form | Requires | Tail decay | Use when |
|---|---|---|---|---|
| Markov | P(X ≥ a) ≤ E[X]/a | X ≥ 0, finite mean | 1/a (linear) | only mean known, X non-negative |
| Chebyshev | P(|X−μ| ≥ kσ) ≤ 1/k² | finite variance | 1/k² (polynomial) | variance known, distribution arbitrary |
| Cantelli (one-sided) | P(X−μ ≥ kσ) ≤ 1/(1+k²) | finite variance | 1/k² but ~2× tighter | one-sided question |
| Chernoff | P(X ≥ a) ≤ e^{−tx}·M(t) | MGF exists in a neighborhood | e^{−Ω(a)} (exponential) | MGF available; sums of indep RVs |
| Hoeffding | P(X̄−μ ≥ ε) ≤ e^{−2nε²/(b−a)²} | bounded support [a, b] | e^{−Ω(nε²)} (exponential) | i.i.d. bounded; e.g. Bernoulli |
| Bernstein | P(X̄−μ ≥ ε) ≤ e^{−nε²/(2σ²+2bε/3)} | bounded + variance | between Chebyshev and Hoeffding | small variance, large support |
The hierarchy: Markov uses one moment, Chebyshev uses two, Chernoff/Hoeffding use all of them via the MGF. More information → tighter bound. Chebyshev's polynomial decay is dramatically beaten by Chernoff's exponential decay when the MGF exists.
Multidimensional and operator versions
Chebyshev generalizes to vectors. For a random vector X ∈ ℝᵈ with mean μ and covariance Σ:
P(||X − μ||₂ ≥ k·tr(Σ)^{1/2}) ≤ 1/k²
P((X − μ)ᵀΣ⁻¹(X − μ) ≥ k²) ≤ d/k² (Mahalanobis form)
And to matrices: matrix Chebyshev bounds the spectral norm of a sum of random matrices in terms of operator-variance, with applications to randomized numerical linear algebra (random projections, sketching, sparse recovery).
Where Chebyshev shows up
- Weak law of large numbers. The three-line proof above. Strong law needs sharper tools (Borel-Cantelli) but the weak version is pure Chebyshev.
- Concentration of measure. Variance bounds suffice when you only have second moments; tighter inequalities (Talagrand, log-Sobolev) refine in specific cases.
- Confidence intervals without normality. For unknown distributions, Chebyshev gives X̄ ± kσ/√n contains μ with probability ≥ 1 − 1/k². At k = 4.47, ≥ 95% — wider than Normal's 1.96, but distribution-free.
- Random matrix theory. Bound on the deviation of empirical eigenvalues from the limiting Wigner distribution.
- Algorithm analysis. Randomized algorithms with high probability of success: bound the variance of the estimator and Chebyshev bounds the failure probability.
- Monte Carlo. A Monte Carlo estimator with variance σ²/n has confidence radius σ/(√n ε)·k from Chebyshev — guides sample-size choice when the distribution is unknown.
- Polynomial Chebyshev (different person, same name). Note: Pafnuty Chebyshev also gave us Chebyshev polynomials; they're unrelated to the inequality but share an inventor.
The one-sided sharpening: Cantelli's inequality
Chebyshev bounds both tails together. For just one tail, Cantelli does better:
P(X − μ ≥ kσ) ≤ 1/(1 + k²)
At k = 2 this is 1/5 = 20% versus Chebyshev's 25%; at k = 3 it's 1/10 = 10% versus 11.1%. The improvement is modest because Chebyshev was already nearly tight for symmetric two-tailed events. Use Cantelli whenever the question is one-sided ("what's the chance returns drop below x?") — same effort, tighter bound.
Common pitfalls
- Treating Chebyshev as tight. For well-behaved distributions (Normal, exponential, bounded) Chebyshev is comically loose. Use it when you genuinely don't know the shape.
- Forgetting finite variance is required. Cauchy distributions have no variance — Chebyshev says nothing about them. Use median-based bounds or Lévy distance instead.
- Confusing with Chebyshev polynomials. Different topic, same Pafnuty Lvovich Chebyshev. The polynomials are a basis for [−1, 1] used in approximation theory.
- Applying when k < 1. The bound 1/k² > 1 is vacuous for k ≤ 1. Only useful for k strictly greater than 1.
- Forgetting variance is for the right random variable. If you're bounding X̄ₙ, use Var(X̄ₙ) = σ²/n, not σ². This is the most common mistake when applying Chebyshev to sample means.
- Plug-in estimator confusion. Replacing σ² with the sample variance s² turns Chebyshev's exact bound into an approximate one with extra sampling variability — confidence intervals from sample variance need t-distribution corrections.
Historical note
Pafnuty Lvovich Chebyshev (1821–1894) proved the inequality in 1867 in the context of his work on the weak law of large numbers. His student Andrey Markov (yes, that Markov) generalized Chebyshev's proof to extract the simpler inequality P(X ≥ a) ≤ E[X]/a in 1884 — what we now call Markov's inequality. So historically, Chebyshev came first and Markov is the simpler special case, even though pedagogically we now teach Markov first as the "building block."
Irénée-Jules Bienaymé proved a special case earlier in 1853; some sources call it the Bienaymé-Chebyshev inequality. The Cantelli one-sided refinement is due to Francesco Cantelli, 1928.
Frequently asked questions
Why is the bound 1/k² and not something tighter?
1/k² is the tightest bound possible if you only know μ and σ. A discrete distribution that puts mass 1/(2k²) at μ − kσ, mass 1/(2k²) at μ + kσ, and mass 1 − 1/k² at μ achieves equality. So no universal bound using only the first two moments can beat Chebyshev. Specific distributions can be tighter — for a Normal, P(|Z| ≥ 2) ≈ 4.6%, far below Chebyshev's 25% — but only because you've used more information (the actual shape).
How does Chebyshev follow from Markov's inequality?
Markov: for non-negative Y, P(Y ≥ a) ≤ E[Y]/a. Apply Markov to Y = (X − μ)² with a = (kσ)²: P((X − μ)² ≥ k²σ²) ≤ E[(X − μ)²] / (k²σ²) = σ² / (k²σ²) = 1/k². The event {(X − μ)² ≥ k²σ²} is the same as {|X − μ| ≥ kσ}, giving Chebyshev's inequality. The trick — applying Markov to a squared deviation — is the universal pattern for turning Markov-style first-moment bounds into variance-style second-moment bounds.
How does Chebyshev prove the weak law of large numbers?
Let X̄ₙ be the average of n i.i.d. draws with mean μ and variance σ². Then E[X̄ₙ] = μ and Var(X̄ₙ) = σ²/n. By Chebyshev: P(|X̄ₙ − μ| ≥ ε) ≤ σ²/(nε²). As n → ∞ the right side goes to 0, so X̄ₙ → μ in probability. This is the weak law of large numbers — proven in two lines using Chebyshev. The classical proofs of Bernoulli (1713) needed pages of combinatorics; Chebyshev's bound (1867) collapsed it to a one-paragraph argument.
What does Chebyshev say about k less than 1?
For k < 1 the inequality becomes vacuous: 1/k² > 1, but every probability is at most 1 anyway. The bound is only informative for k > 1. At k = 1 it says P(|X − μ| ≥ σ) ≤ 1 — also trivial. At k = 2 it gives ≤ 25%; at k = 3 it gives ≤ 11.1%; at k = 5 it gives ≤ 4%; at k = 10 it gives ≤ 1%. The bound decays like 1/k² — polynomial, not exponential. That's slow compared to Chernoff bounds (exponential decay), but Chebyshev needs only variance, not the full moment-generating function.
What's the one-sided version (Cantelli's inequality)?
Cantelli's inequality is the one-sided Chebyshev: P(X − μ ≥ kσ) ≤ 1/(1 + k²). At k = 2, Chebyshev gives 25% for the two-tailed bound; Cantelli gives 1/5 = 20% for the one-tailed bound on the upper tail alone. Cantelli is tighter for one-sided questions because half the variance lives in the other tail and shouldn't be charged. Derivation: minimize E[((X − μ) + λ)²] / (kσ + λ)² over λ > 0 and substitute the optimal λ = σ/k.
When is Chebyshev too loose to be useful?
Whenever the distribution has light tails (sub-Gaussian, sub-exponential). For a standard Normal, P(|Z| ≥ 3) ≈ 0.27% while Chebyshev allows 11.1% — 40× looser. For sums of bounded i.i.d. variables, Hoeffding's inequality gives exp(−2nε²/(b−a)²) — exponential decay versus Chebyshev's 1/(nε²) polynomial decay. Use Chebyshev when you can compute variance and nothing more; use Chernoff/Hoeffding when the MGF or bounded support is available.
What random variables can saturate Chebyshev's bound?
A three-point distribution does it. Place probability 1/(2k²) at μ − kσ, probability 1/(2k²) at μ + kσ, and probability 1 − 1/k² at μ. The mean is μ, the variance is exactly σ² (verify by computing E[(X − μ)²] = (1/(2k²))(kσ)² × 2 = σ²), and P(|X − μ| ≥ kσ) = 1/k² — Chebyshev with equality. The lesson: distributions with mass concentrated heavily at the mean plus occasional symmetric far-out point masses are what 'tight' looks like.