Probability
Markov's Inequality
For non-negative X, P(X ≥ a) is never more than E[X]/a
Markov's inequality says that for a non-negative random variable X, P(X ≥ a) ≤ E[X]/a. One mean, one threshold, one universal bound — the building block of every concentration inequality.
- StatementP(X ≥ a) ≤ E[X]/a, X ≥ 0, a > 0
- AuthorAndrey Markov, 1884
- Proofindicator trick — 3 lines
- Requiresnon-negativity, finite mean
- GeneratesChebyshev, Chernoff, Hoeffding
- a = 10·E[X] bound≤ 10%, no shape assumed
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The inequality
For any non-negative random variable X with E[X] < ∞, and any a > 0:
P(X ≥ a) ≤ E[X] / a
That's it. No assumption about the shape of X's distribution. No moments above the first. Just non-negativity and a finite mean. The bound is universal but typically loose — the price of asking so little.
Three-line proof
The standard proof uses the indicator-function trick:
For X ≥ 0 and a > 0:
a · 1[X ≥ a] ≤ X (case split: 1[X≥a] is 0 or 1; both cases ≤ X)
Take expectations:
a · P(X ≥ a) ≤ E[X]
Divide by a:
P(X ≥ a) ≤ E[X] / a ∎
The key observation: the indicator function a · 1[X ≥ a] is a step function that equals a when X ≥ a and 0 otherwise. Since X ≥ 0 always, and X ≥ a in the "step up" region, the indicator never exceeds X itself. Expectations preserve inequalities, and division by a positive a finishes it.
Worked example — Markov is loose
Suppose X is exponential with rate 1, so E[X] = 1. Ask for P(X ≥ 10):
Markov: P(X ≥ 10) ≤ E[X]/10 = 1/10 = 10%
True value: P(X ≥ 10) = e^(-10) ≈ 0.0045%
Looseness factor: 10% / 0.0045% ≈ 2200×
Markov gives 10% but the truth is 0.0045%. That's how loose Markov can be. Why use it? Because it's the only bound that needs nothing more than the mean — distribution-free, three-line proof, every other concentration bound is built from it.
When is Markov tight?
Markov is tight when X is binary — either 0 or exactly a. Let X = a with probability p and X = 0 with probability 1 − p:
E[X] = pa
P(X ≥ a) = p
Markov says P(X ≥ a) ≤ E[X]/a = pa/a = p ✓ (equality)
So Markov's worst case is a Bernoulli scaled to {0, a}. Any spread of probability mass between 0 and a, or any mass above a, makes Markov strictly loose. The lesson: Markov only "knows" about mean, so its tightness adversary uses mass entirely at 0 or at the threshold.
Deriving Chebyshev from Markov
This is the trick that makes Markov so powerful — apply it to a clever transformation.
Want: P(|X − μ| ≥ kσ) ≤ 1/k²
Let Y = (X − μ)², which is non-negative. Apply Markov at a = k²σ²:
P(Y ≥ k²σ²) ≤ E[Y] / (k²σ²) = σ² / (k²σ²) = 1/k²
The event {Y ≥ k²σ²} is exactly {|X − μ| ≥ kσ}. Done.
Same recipe gives Chernoff bounds: apply Markov to e^{tX} for a parameter t > 0 to be optimized. The MGF M(t) = E[e^{tX}] becomes the numerator and you get exponential decay in a.
Markov vs Chebyshev vs Chernoff vs Hoeffding
| Bound | Form | Requires | Tail decay | Use when |
|---|---|---|---|---|
| Markov | P(X ≥ a) ≤ E[X]/a | X ≥ 0, finite mean | 1/a (linear) | only mean known, X non-negative |
| Reverse Markov | P(X ≤ a) ≤ (b − E[X])/(b − a) | X ≤ b, a < E[X] | linear in (b − a) | bounded above, lower tail |
| Chebyshev | P(|X−μ| ≥ kσ) ≤ 1/k² | finite variance | 1/k² (polynomial) | variance known |
| Cantelli | P(X−μ ≥ kσ) ≤ 1/(1+k²) | finite variance | 1/k² (one-sided) | one-tail question |
| Chernoff | P(X ≥ a) ≤ e^{−ta}M(t) | MGF exists | e^{−Ω(a)} (exponential) | sums of indep RVs |
| Hoeffding | P(X̄−μ ≥ ε) ≤ e^{−2nε²/(b−a)²} | bounded i.i.d. | e^{−Ω(nε²)} (exponential) | bounded sums |
| Bernstein | P(X̄−μ ≥ ε) ≤ e^{−nε²/(2σ²+2bε/3)} | bounded + variance | between Chebyshev & Hoeffding | small variance, large support |
The progression: each row uses one more moment or property than the row above. Markov is the foundation; everything else is Markov applied to a cleverly chosen non-negative function of X.
Reverse Markov
If X ≤ b almost surely, then for a < E[X]:
P(X ≤ a) ≤ (b − E[X]) / (b − a)
Proof: apply Markov to b − X (non-negative, mean b − E[X]) at threshold b − a. This is rarely seen in textbooks but useful when bounding lower tails of bounded variables — e.g., guaranteeing a value isn't too small.
Where Markov shows up
- Randomized algorithm analysis. If E[runtime] = T, then P(runtime > cT) ≤ 1/c. Repeat the algorithm log₂(1/δ) times and take the fastest run — failure probability ≤ δ. This is how Las Vegas algorithms get high-probability guarantees.
- Probabilistic method. To prove an object with property P exists, define X = number of objects with property P, show E[X] > 0. Markov implies P(X > 0) > 0 (in the contrapositive direction). Erdős used this to revolutionize combinatorics.
- Streaming algorithms. Median-of-means estimators bound the failure probability via Markov on individual estimates, then amplify with the median trick to exponentially small failure.
- Concentration of measure. All higher-moment bounds (Chebyshev, Chernoff, Bernstein) factor through Markov applied to (X − μ)^k, e^{tX}, or other clever transforms.
- PageRank stability. Bounding the deviation of empirical PageRank from the limit measure uses Markov on the mixing time of the Markov chain.
- Queueing theory. Little's law involves expected queue length; Markov bounds the probability of long queues even without distributional assumptions.
- Markov vs Markov chains. Markov's inequality and Markov chains are due to the same Andrey Markov, but they're separate concepts. Same man, two famous contributions.
The polynomial generalization
For any non-decreasing, non-negative function φ and any a in its range:
P(X ≥ a) ≤ E[φ(X)] / φ(a)
Set φ(x) = x → standard Markov. Set φ(x) = x² → Chebyshev for non-negative X. Set φ(x) = e^{tx} → Chernoff. Set φ(x) = (1 + x)^k → kth-moment bound. Every concentration inequality is this template with a chosen φ; the art is picking the right φ to optimize the bound.
Common pitfalls
- Forgetting non-negativity. Markov fails for variables that can be negative. Always check or shift the variable: apply Markov to X − min(X) if X is bounded below.
- Using Markov when you have more information. If you know variance, use Chebyshev; if you know the MGF, use Chernoff. Markov ignores everything past the mean.
- Tightness expectations. Markov is often 100×–10000× looser than the truth. Use only when you genuinely have no other information.
- Confusing with Markov chains. Markov's inequality is a tail bound for a single random variable. Markov chains are stochastic processes. Same name, different topic.
- Applying to a-values below the mean. Markov is vacuous when a ≤ E[X] (the bound says ≤ 1, which is always true). Useful only when a > E[X].
- Not optimizing parameters. When using Markov inside a derivation (e.g., Chernoff's e^{tX} trick), the parameter t is yours to choose — optimize it.
Historical note
Andrey Markov (1856–1922) published the inequality in 1884 as part of his probability lectures at St. Petersburg. He was a student of Pafnuty Chebyshev, who had proved the related variance inequality in 1867. So pedagogically Markov is the "simpler" foundational step, but historically Chebyshev came first; Markov extracted the simpler statement underneath. Markov went on to invent Markov chains in 1906 by analyzing letter sequences in Pushkin's "Eugene Onegin" — same name, separate fame.
Frequently asked questions
What's the simplest way to prove Markov's inequality?
Indicator function trick. Note that a · 1[X ≥ a] ≤ X for any non-negative X — when X ≥ a both sides equal a or X (and X ≥ a so X ≥ a); when X < a the indicator is 0 and the LHS is 0 ≤ X. Take expectations: a · P(X ≥ a) ≤ E[X]. Divide by a > 0 to get P(X ≥ a) ≤ E[X]/a. Three lines, no calculus, no measure theory beyond basic integration. The non-negativity assumption is essential — otherwise a · 1[X ≥ a] can exceed X.
Why does Markov require non-negativity?
Without non-negativity the bound fails outright. Take X uniform on [−1, 1]: P(X ≥ 0.5) = 25%, but E[X]/0.5 = 0/0.5 = 0. The inequality says 25% ≤ 0, which is false. The indicator-function proof breaks because a · 1[X ≥ a] can exceed X when X is negative. Workaround: apply Markov to a transformed quantity that's non-negative — like (X − μ)², |X|, or e^{tX}. Every concentration inequality is Markov applied to a clever non-negative function of X.
How loose is Markov's inequality in practice?
Very loose, generally. Example: X = exponential with mean 1, ask for P(X ≥ 10). Markov gives ≤ 1/10 = 10%. Actual value: e^{−10} ≈ 0.0045%. Markov is 2200× looser. Markov is tight when X is essentially zero with a small chance of taking a large value — e.g., X = a · 1[Bernoulli] hits equality. As soon as the distribution has any 'spread' (variance, exponential decay), tighter bounds win. The point is universality, not tightness.
How does Markov derive Chebyshev?
Apply Markov to Y = (X − μ)² (non-negative) at a = (kσ)²: P((X−μ)² ≥ k²σ²) ≤ E[(X−μ)²] / (k²σ²) = σ²/(k²σ²) = 1/k². The event {(X−μ)² ≥ k²σ²} is identical to {|X−μ| ≥ kσ}. So Chebyshev: P(|X−μ| ≥ kσ) ≤ 1/k² — Markov applied to a squared deviation. The same recipe gives Chernoff (apply to e^{tX}) and Cantelli (translate and minimize). Markov is the engine; the others are choices of which non-negative quantity to bound.
When is Markov's inequality tight?
Two-point distributions concentrated at 0 and a saturate Markov. Let X = a with probability p and X = 0 with probability 1 − p. Then E[X] = pa, and P(X ≥ a) = p = E[X]/a — exact equality. This is the worst case: a distribution that's either zero or exactly at the threshold. Any 'spread' beyond this binary choice — values strictly between 0 and a, or values above a — makes Markov strictly loose. So tightness requires extreme mass concentration.
How is Markov used in algorithm analysis?
Randomized algorithms with expected running time T(n) — apply Markov to bound the probability the actual runtime exceeds 2T(n) at ≤ 50%, exceeds 10T(n) at ≤ 10%. Then run k independent copies and take the fastest: failure probability drops to 1/2^k or 1/10^k. This 'Las Vegas' technique converts an expected-time algorithm into one with high-probability worst-case time — the only ingredient needed is Markov's inequality. Median-of-means estimators in streaming algorithms work the same way.
Why is Markov so much weaker than Chernoff for sums?
For a sum S = X₁ + ... + Xₙ of i.i.d. bounded variables, Markov on S gives P(S ≥ na) ≤ E[X]/a — a constant independent of n. But the true tail decays exponentially: e^{−Ω(n)}. Chernoff captures this by applying Markov not to S but to e^{tS}, which factors into a product of MGFs — the n-dependence enters through the product. So Chernoff = Markov + MGF trick. The independence of the summands turns Markov's polynomial-loose bound into Chernoff's exponentially-tight bound.