Binomial Distribution

n independent yes/no trials with success probability p — the foundation of statistics, from coin flips to clinical trials

The Binomial distribution counts the number of successes in n independent trials, each with the same probability p of success. Coin flips, election polls, A/B tests, clinical trial outcomes, defect rates on a production line — all sit on the same two-parameter curve, and almost all of classical statistics begins by assuming Binomial data.

  • PMF: C(n, k) p^k (1−p)^(n−k)
  • Mean: np
  • Variance: np(1−p)
  • Sum of: n Bernoulli(p) trials
  • Normal limit: de Moivre, 1733

From a single coin to a stack

Flip a coin once. Encode heads as 1 and tails as 0. The outcome is a Bernoulli random variable: it takes the value 1 with probability p and 0 with probability 1 − p. Now flip the coin n times independently and count the heads. The total is a Binomial(n, p) random variable, and its probability mass function is

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)     for k = 0, 1, ..., n

The factor C(n, k) = n! / (k! (n − k)!) counts the number of ways to choose which k of the n trials produced the successes. The factor p^k (1 − p)^(n − k) is the probability of any one specific sequence with k heads. Multiply the two, and you get the probability of seeing exactly k heads in any order.

The shape is governed by p. For p = 0.5 the PMF is symmetric and bell-like; the Binomial(10, 0.5) is the classic chess-piece histogram peaking at k = 5 with probability 252/1024 ≈ 0.246. For p < 0.5 the distribution is right-skewed (long tail towards k = n); for p > 0.5 it is left-skewed. The symmetry P(X = k | p) = P(X = n − k | 1 − p) lets you derive results for one regime from the other.
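The PMF above translates almost symbol for symbol into code. The following is a minimal sketch using only the Python standard library; the function name `binomial_pmf` is an illustrative choice, not something the text defines.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), computed directly from the formula."""
    if not 0 <= k <= n:
        return 0.0  # outside the support {0, 1, ..., n}
    # C(n, k) ways to place the successes, times the probability of one such sequence
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Binomial(10, 0.5): symmetric, peaking at k = 5
pmf_fair = [binomial_pmf(k, 10, 0.5) for k in range(11)]
print(round(pmf_fair[5], 3))  # 0.246

# Reflection symmetry: P(X = k | p) equals P(X = n - k | 1 - p)
print(abs(binomial_pmf(3, 10, 0.3) - binomial_pmf(7, 10, 0.7)) < 1e-12)  # True
```

Direct evaluation like this is fine for small n; for large n a log-space version is usually safer.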

Mean, variance, and the Bernoulli decomposition

The cleanest way to see the moments is to write the Binomial as a sum of indicator variables:

X = I_1 + I_2 + ... + I_n     where each I_k ~ Bernoulli(p), independent

Each indicator has mean p and variance p(1 − p). Linearity of expectation gives E[X] = np immediately — no independence assumption needed. Independence then gives Var[X] = Σ Var[I_k] = np(1 − p). So:

  • Mean: np.
  • Variance: np(1 − p), maximised at p = 0.5.
  • Standard deviation: √(np(1 − p)), proportional to √n.
  • Mode: floor((n + 1)p) for 0 < p < 1; if (n + 1)p is itself an integer, both (n + 1)p − 1 and (n + 1)p are modes.
  • Skewness: (1 − 2p)/√(np(1 − p)), zero only at p = 0.5.
  • MGF: M(t) = (1 − p + p e^t)^n.

The MGF expression makes additivity transparent: if X ~ Binomial(n₁, p) and Y ~ Binomial(n₂, p) are independent and have the same p, then X + Y ~ Binomial(n₁ + n₂, p). Crucially the two p's must agree — Binomial is not closed under addition over different success rates.
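As a numerical sanity check on the moments and on additivity, one can compute the mean and variance directly from the PMF and convolve two Binomial PMFs that share the same p. This is a sketch with illustrative helper names (`pmf`, `mean_var`, `convolve`), not a proof.

```python
from math import comb


def pmf(n, p):
    """Full PMF of Binomial(n, p) as a list indexed by k."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def mean_var(dist):
    """Mean and variance of a distribution given as a list of probabilities over 0..len-1."""
    mean = sum(k * q for k, q in enumerate(dist))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(dist))
    return mean, var


def convolve(a, b):
    """PMF of X + Y for independent X and Y with PMFs a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


n, p = 10, 0.3
m, v = mean_var(pmf(n, p))
print(round(m, 6), round(v, 6))  # 3.0 2.1, i.e. np and np(1 - p)

# Binomial(4, p) + Binomial(6, p) should match Binomial(10, p) term by term
summed = convolve(pmf(4, p), pmf(6, p))
print(max(abs(s - t) for s, t in zip(summed, pmf(10, p))) < 1e-12)  # True
```

Repeating the convolution with two different success rates gives a distribution that is not Binomial for any single p, which is the closure failure described above.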

Worked example: ten coin flips

Flip a fair coin ten times. What is the probability of exactly 7 heads?

n = 10, p = 0.5, k = 7
P(X = 7) = C(10, 7) · (0.5)^7 · (0.5)^3
        = 120 · (0.5)^10
        = 120 / 1024
        ≈ 0.1172

So roughly 12% — counterintuitive to most people, who assume a 7/10 outcome is much rarer than 5/10. The full PMF of Binomial(10, 0.5) is symmetric around 5 and goes:

k:     0      1      2      3      4      5      6      7      8      9     10
C:     1     10     45    120    210    252    210    120     45     10      1
P: 0.001  0.010  0.044  0.117  0.205  0.246  0.205  0.117  0.044  0.010  0.001

The probability of 7 or more heads is 0.117 + 0.044 + 0.010 + 0.001 ≈ 0.172. By symmetry this equals P(X ≤ 3), so a "really lopsided" outcome of 7-or-more or 3-or-fewer happens with probability about 0.34. That is why it takes long sequences of flips before a fair coin starts to look fair.

Now bias the coin to p = 0.7 (about the proportion of penalty kicks scored at the elite level) and ask the same question, k = 7:

P(X = 7 | p = 0.7) = C(10, 7) · (0.7)^7 · (0.3)^3
                  = 120 · 0.082354 · 0.027
                  ≈ 0.2668

The expected number of kicks scored is np = 7 exactly, and the most likely individual outcome is 7 — but the probability of that outcome is still only 27%, because the distribution is wide.
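The arithmetic in this worked example can be reproduced exactly with rational numbers. The sketch below uses `fractions.Fraction` so no rounding enters until the final conversion; the helper names `pmf` and `tail_at_least` are illustrative.

```python
from fractions import Fraction
from math import comb


def pmf(k, n, p):
    """Exact P(X = k) when p is a Fraction."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def tail_at_least(k, n, p):
    """Exact P(X >= k) by summing the PMF from k to n."""
    return sum(pmf(j, n, p) for j in range(k, n + 1))


half = Fraction(1, 2)
print(pmf(7, 10, half))            # 15/128, about 0.1172
print(tail_at_least(7, 10, half))  # 11/64, about 0.172

# Two-sided "lopsided" outcome: 7 or more heads, or 3 or fewer
print(float(2 * tail_at_least(7, 10, half)))  # 0.34375

# Biased case from the text, p = 0.7
print(round(float(pmf(7, 10, Fraction(7, 10))), 4))  # 0.2668
```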

When to switch to a Normal or Poisson approximation

For large n the Binomial PMF is awkward to compute exactly — C(1000, 500) overflows naive integer arithmetic. Two limits replace it:

  1. Normal approximation (de Moivre 1733). When np > 5 and n(1 − p) > 5, X is approximately Normal with mean np and variance np(1 − p). Apply a continuity correction when computing CDF values: P(X ≤ k) ≈ Φ((k + 0.5 − np) / √(np(1 − p))). The 0.5 nudge accounts for the difference between a discrete bar at k and the continuous unit-wide slab between k − 0.5 and k + 0.5 that represents it.
  2. Poisson approximation. When n is large and p is small with np = λ moderate (a useful rule: n ≥ 50, p ≤ 0.05), X is approximately Poisson(np). The PMF simplifies from C(n, k) p^k (1 − p)^(n − k) to λ^k e^(−λ)/k!. This is the right approximation for rare-event counts: typos in a 50-page manuscript, defects in a wafer, mutations per kilobase.

The Normal approximation gets better as p gets closer to 0.5 and n grows. The Poisson approximation gets better as p shrinks toward 0 with np fixed. Both can be poor at the same time: for Binomial(20, 0.3), p is far too large for the Poisson limit and np = 6 only just clears the Normal rule of thumb, so the most accurate computation is simply the exact PMF.
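To see how the two approximations behave against the exact CDF, the following sketch compares them using only the standard library (`statistics.NormalDist` for Φ). The function names and the test cases chosen are illustrative assumptions.

```python
from math import comb, exp, factorial, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p):
    """Exact P(X <= k) by summing the PMF."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def normal_cdf_approx(k, n, p, continuity=True):
    """Normal approximation to P(X <= k), optionally with the +0.5 continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    shift = 0.5 if continuity else 0.0
    return NormalDist(mu, sigma).cdf(k + shift)


def poisson_cdf_approx(k, n, p):
    """Poisson(np) approximation to P(X <= k)."""
    lam = n * p
    return sum(lam**j * exp(-lam) / factorial(j) for j in range(k + 1))


# Normal regime: n = 100, p = 0.5
exact = binom_cdf(45, 100, 0.5)
print(exact, normal_cdf_approx(45, 100, 0.5), normal_cdf_approx(45, 100, 0.5, continuity=False))

# Rare-event regime: n = 1000, p = 0.003, so lambda = 3
print(binom_cdf(2, 1000, 0.003), poisson_cdf_approx(2, 1000, 0.003))

# Moderate regime from the text: Binomial(20, 0.3)
print(binom_cdf(4, 20, 0.3), normal_cdf_approx(4, 20, 0.3), poisson_cdf_approx(4, 20, 0.3))
```

In the first case the continuity-corrected value lands much closer to the exact CDF than the uncorrected one; in the last case neither shortcut is especially tight, which is the point of the remark about Binomial(20, 0.3).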

Binomial alongside its neighbours

           | Binomial(n, p)        | Bernoulli(p)         | Hypergeometric(N, K, n)                  | Negative Binomial(r, p)
Counts     | Successes in n trials | Single trial outcome | Successes in n draws without replacement | Trials until r successes
Sampling   | With replacement      | One trial            | Without replacement                      | Until target
Mean       | np                    | p                    | nK/N                                     | r/p
Variance   | np(1−p)               | p(1−p)               | n(K/N)(1−K/N)(N−n)/(N−1)                 | r(1−p)/p²
Support    | 0 to n                | {0, 1}               | max(0, n−N+K) to min(n, K)               | r, r+1, ...
Reduces to | -                     | n=1 case             | Binomial when N→∞                        | Geometric when r=1
Common use | Polling, A/B tests    | Building block       | Card draws, election audits              | Reliability, returns

The Binomial sits at the centre of this family. Bernoulli is its single-trial atom; Hypergeometric is its without-replacement cousin; Negative Binomial flips the question (count trials until r successes) and is the natural model for overdispersed count data when used as a Poisson mixture.

Where the Binomial distribution shows up

  • Galton's bean machine (the quincunx). A bead falling through a triangular pegboard with n levels lands in bin k with probability C(n, k)/2^n — a Binomial(n, 0.5). Galton built one in the 1870s to demonstrate the Normal limit experimentally; modern museum versions still demonstrate the convergence in real time.
  • Election polling sample size. The standard ±3 percentage-point margin on a national poll comes from Binomial variance: with n = 1000 and p ≈ 0.5, the standard error of the proportion is √(0.25/1000) ≈ 0.0158, giving a 95% CI of ±1.96 × 0.0158 ≈ ±3.1%. Doubling the precision to ±1.5% requires quadrupling the sample.
  • A/B testing. Conversion rates are modelled as Binomial. The minimum detectable effect size for a two-proportion test is δ ≈ 2.8 √(2 p̄ (1 − p̄) / n), where p̄ is the pooled proportion, n is the sample size per arm, and the factor 2.8 ≈ 1.96 + 0.84 corresponds to a two-sided 5% test with 80% power. This is why low-conversion experiments need tens of thousands of users to detect 1% lifts.
  • Clinical trials. A two-arm Phase III trial with binary endpoint (responder vs non-responder) is a comparison of two Binomial proportions. Power calculations use the difference of binomial means and pooled variance; sample size formulas from Fleiss or Lachin descend directly from the Binomial PMF.
  • Quality control / defect inspection. Acceptance sampling plans (MIL-STD-105) compute the probability of accepting a lot with defect rate p as a Binomial CDF: a sample of n is accepted if the number of defectives is at most c. Operating characteristic curves are direct plots of the Binomial CDF.
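Several of the calculations in this list are short enough to sketch directly. The functions below are illustrative, with assumed names; the A/B formula takes n to be the per-arm sample size, and the acceptance plan (n = 50, c = 1) is a hypothetical example, not a plan taken from MIL-STD-105.

```python
from math import ceil, comb, sqrt


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the Normal-approximation CI for a proportion."""
    return z * sqrt(p * (1 - p) / n)


def sample_size_for_margin(margin, p=0.5, z=1.96):
    """Smallest n whose margin of error is at most `margin`."""
    return ceil(z**2 * p * (1 - p) / margin**2)


def min_detectable_effect(n_per_arm, p_bar):
    """delta ~ 2.8 * sqrt(2 * p_bar * (1 - p_bar) / n), with n assumed to be per arm."""
    return 2.8 * sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)


def acceptance_probability(n, c, p):
    """P(accept lot) = P(at most c defectives in a sample of n), a Binomial CDF."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


print(round(margin_of_error(1000), 4))  # 0.031
print(sample_size_for_margin(0.03))     # 1068
print(round(min_detectable_effect(10_000, 0.05), 5))

# One operating-characteristic curve: acceptance probability falls as the defect rate rises
for defect_rate in (0.01, 0.02, 0.05, 0.10):
    print(defect_rate, round(acceptance_probability(50, 1, defect_rate), 3))
```

Halving the margin of error quadruples the required sample because the margin scales as 1/√n.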

Beyond two outcomes: the Multinomial generalisation

If each trial has m possible outcomes (not just success/failure) with probabilities p₁, ..., p_m that sum to 1, and we run n independent trials, the joint distribution of the counts (k₁, ..., k_m) is Multinomial:

P(K_1 = k_1, ..., K_m = k_m) = [n! / (k_1! · ... · k_m!)] · p_1^(k_1) · ... · p_m^(k_m)

where k₁ + ... + k_m = n. Each marginal K_i is Binomial(n, p_i), but the joint distribution is the natural extension when you care about more than one category at once. Six-sided die rolls, multi-arm bandits, election results across multiple parties, document term-frequency vectors all fit into the Multinomial framework.
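A minimal sketch of the Multinomial PMF, assuming counts and probabilities are passed as equal-length sequences; the name `multinomial_pmf` is illustrative. It works with floats or with `Fraction` values for exact results.

```python
from fractions import Fraction
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """P(K_1 = k_1, ..., K_m = k_m) for n = sum(counts) independent trials."""
    n = sum(counts)
    # n! / (k_1! ... k_m!), dividing one factorial at a time (each step stays an integer)
    coeff = factorial(n)
    for k in counts:
        coeff //= factorial(k)
    return coeff * prod(p**k for p, k in zip(probs, counts))


# Six rolls of a fair die showing every face exactly once
sixth = Fraction(1, 6)
print(multinomial_pmf([1] * 6, [sixth] * 6))  # 5/324

# With m = 2 categories the formula reduces to the Binomial PMF
print(multinomial_pmf([7, 3], [0.5, 0.5]))  # 0.1171875, same as C(10, 7) / 2^10
```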

Variants and extensions

  • Beta-Binomial. p itself is Beta(α, β) distributed across observations; the Binomial is then a mixture. Allows overdispersion. Used in baseball batting averages (Brown's empirical Bayes shrinkage), CTR modelling, and Bayesian A/B testing where conjugate Beta priors update to a Beta-Binomial posterior predictive.
  • Negative Binomial (Pascal). Count of trials until r successes occur. Generalises the Geometric distribution (r = 1). Used in modelling the number of attempts until r orders, in Poisson-Gamma overdispersed counts, and in genome read-count models.
  • Hypergeometric. Sampling without replacement from a finite population of K successes among N. Reduces to Binomial in the large-population limit. Used in Fisher's exact test, in lottery and bridge-hand probabilities, in election audits.
  • Multinomial. Generalisation to m outcome categories per trial. Marginals are Binomial; covariance between categories is −n p_i p_j. Underwrites chi-squared goodness-of-fit and contingency-table analysis.
  • Binomial with uniform p. If p is uniform on [0, 1] and X ~ Binomial(n, p), then marginally X is uniform on {0, 1, ..., n}. This is the Beta(1, 1) special case of the Beta-Binomial: averaging over every possible p flattens the PMF completely. Used in Bayesian rule-of-succession derivations.
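The Beta-Binomial PMF, C(n, k) · B(k + α, n − k + β) / B(α, β), can be sketched with `lgamma` for the Beta function. The function names are illustrative. Setting α = β = 1 gives a quick check of the uniform-p result, where every k has probability 1/(n + 1).

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """log B(a, b) via log-gamma, for a, b > 0."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(X = k) when p ~ Beta(alpha, beta) and X | p ~ Binomial(n, p)."""
    return comb(n, k) * exp(log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


# Uniform prior on p: the marginal of X is flat on {0, ..., 10}
flat = [beta_binomial_pmf(k, 10, 1.0, 1.0) for k in range(11)]
print(all(abs(q - 1 / 11) < 1e-12 for q in flat))  # True

# A non-uniform prior still yields a proper distribution, with extra spread
skewed = [beta_binomial_pmf(k, 10, 2.0, 5.0) for k in range(11)]
print(abs(sum(skewed) - 1.0) < 1e-12)  # True
```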

Common pitfalls

  • Treating dependent trials as independent. Drawing five cards from a 52-card deck is not Binomial — the second draw depends on the first. Use the hypergeometric. The error is small if the sample is much smaller than the population (the so-called "10% rule" — sample < 10% of population).
  • Wrong continuity correction. When approximating P(X ≤ k) by a Normal, use k + 0.5; for P(X < k) use k − 0.5; for P(X = k) use the slab [k − 0.5, k + 0.5]. Forgetting the half-unit shift can produce errors of several percentage points at moderate n.
  • Wald CI failure near the boundaries. The Wald p̂ ± z·√(p̂(1−p̂)/n) interval can produce negative lower limits or coverage probabilities far below the nominal 95% when p̂ is near 0 or 1. Use Wilson, Agresti-Coull, or Clopper-Pearson instead.
  • Treating the binomial coefficient as small. C(100, 50) ≈ 1.0 × 10²⁹. Direct factorial computation overflows; use log-space or the lgamma function. Most language libraries have stable implementations — call them rather than rolling your own.
  • Forgetting that Binomial assumes a fixed n. "Flip until I see 5 heads" is Negative Binomial, not Binomial. The distinction matters because the data-generating process determines the likelihood, and the wrong likelihood gives wrong p-values even with the same observed counts.
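For the overflow pitfall, one common workaround is to evaluate the PMF in log space with `math.lgamma`. The sketch below assumes 0 < p < 1 and uses an illustrative function name; a production implementation would more likely call an established statistics library.

```python
from math import comb, exp, lgamma, log


def log_binomial_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), assuming 0 <= k <= n and 0 < p < 1."""
    # log C(n, k) = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_coeff + k * log(p) + (n - k) * log(1 - p)


# Small case agrees with the direct formula
print(abs(exp(log_binomial_pmf(7, 10, 0.5)) - comb(10, 7) / 2**10) < 1e-12)  # True

# Large case: float(C(1000, 500)) * 0.5**1000 mixes a ~1e299 factor with a ~1e-301 one,
# and for somewhat larger n such products overflow or underflow. Log space avoids that.
approx = exp(log_binomial_pmf(500, 1000, 0.5))
exact = comb(1000, 500) / 2**1000  # Python big integers make an exact reference possible
print(abs(approx - exact) < 1e-9)  # True
```

Python's arbitrary-precision integers hide the problem for `comb` itself, so the log-space form matters most in languages with fixed-width integers or when the result feeds into floating-point likelihood sums.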

Frequently asked questions

When does the Binomial distribution apply?

When you have a fixed number n of trials, each trial has exactly two outcomes (success/failure), the probability of success p is the same on every trial, and the trials are independent of one another. These four conditions are sometimes called BINS: Binary outcomes, Independent trials, fixed Number of trials, Same probability. Sampling without replacement breaks the independence assumption — for that, the hypergeometric distribution is the correct tool.

Why is the mean of a Binomial np and the variance np(1-p)?

Write X as a sum of n independent Bernoulli(p) indicators X = I_1 + ... + I_n. Each I_k has mean p and variance p(1-p). Linearity of expectation gives E[X] = np for any p; independence gives Var[X] = Σ Var[I_k] = np(1-p). The variance is maximised at p = 0.5 — fair coins are the most uncertain — and tends to zero as p approaches 0 or 1, where the outcome becomes deterministic.

When can I approximate the Binomial with a Normal distribution?

When both np and n(1-p) are larger than about 5 (some texts use 10). The Normal approximation centres at μ = np with standard deviation σ = √(np(1-p)). Apply a continuity correction when computing P(X ≤ k): use Φ((k + 0.5 - μ)/σ) rather than Φ((k - μ)/σ), because the Binomial puts its probability on integers, so the bar at k corresponds to the interval from k - 0.5 to k + 0.5 under the Normal curve. This Normal approximation, due to de Moivre in 1733, is the earliest special case of the central limit theorem.

When does the Binomial collapse to a Poisson distribution?

When n is large and p is small, with the product np = λ moderate. Then Binomial(n, p) ≈ Poisson(λ). A useful rule of thumb is n ≥ 50 and p ≤ 0.05. The intuition is that the Poisson PMF is the n→∞, p→0 limit of the Binomial PMF when np is held constant, so the rare-event regime — many trials, low success rate — is exactly the regime in which the simpler Poisson formula takes over.

What is the difference between the Binomial and the Hypergeometric distribution?

Binomial assumes sampling with replacement (or equivalently a population so large that draws are effectively independent). Hypergeometric is for sampling without replacement from a finite population. If a deck of 52 cards has 13 hearts and you draw 5 without replacement, you want the hypergeometric — the probability of a heart on the second draw depends on whether the first was a heart. The two distributions agree when the sample size is small relative to the population.
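The card example can be checked numerically. This sketch compares the Hypergeometric probability of exactly 2 hearts in 5 draws with the Binomial(5, 0.25) value that would apply with replacement; the function names are illustrative, and the scaled-up "deck" of 52,000 cards is a hypothetical used only to show the two converging.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(k successes in n draws without replacement; K successes among N items)."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0  # outside the support
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def binomial_pmf(k, n, p):
    """P(k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Exactly 2 hearts in 5 cards from a standard deck
print(round(hypergeom_pmf(2, 52, 13, 5), 4))  # 0.2743
print(round(binomial_pmf(2, 5, 0.25), 4))     # 0.2637

# Same proportion of hearts in a population 1000 times larger: the gap nearly vanishes
print(abs(hypergeom_pmf(2, 52_000, 13_000, 5) - binomial_pmf(2, 5, 0.25)) < 1e-3)  # True
```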

How do I compute a confidence interval for a Binomial proportion?

The simplest is the Wald interval: p̂ ± z·√(p̂(1-p̂)/n). It is fast but performs poorly when p is near 0 or 1, or when n is small. The Wilson score interval is the standard improvement and has better coverage near the boundaries. For very small n or extreme p, the Clopper-Pearson "exact" interval, derived from inverting the Binomial CDF, gives guaranteed coverage at the cost of being conservative.
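The Wald and Wilson formulas are short enough to compare side by side. The following is a sketch with illustrative function names and a fixed z = 1.96 default; Clopper-Pearson is omitted because it needs Beta quantiles, which the standard library does not provide.

```python
from math import sqrt


def wald_interval(successes, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a Binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Near the boundary (1 success in 20) the Wald lower limit goes negative
print(wald_interval(1, 20))
print(wilson_interval(1, 20))

# Away from the boundary with larger n, the two nearly coincide
print(wald_interval(500, 1000))
print(wilson_interval(500, 1000))
```

The Wilson interval always stays inside [0, 1] up to floating-point rounding, which is one reason it is the usual default replacement for Wald.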