Poisson Distribution
P(X=k) = λ^k e^(-λ)/k! — count of rare events in fixed time, from radioactive decay to Premier League goals
The Poisson distribution gives the probability of seeing exactly k events when events arrive independently at average rate λ in a fixed window. It is the universal counting law: clicks on a Geiger counter, goals in a football match, customers in a queue, photons hitting a CCD pixel — all sparse, independent, and rate-stable phenomena fall into the same one-parameter family.
- PMF: λ^k e^(-λ) / k!
- Mean: λ
- Variance: λ
- Discovered: Poisson, 1837
- Validated by: Bortkiewicz, 1898
Where the formula comes from
Take a stretch of time of length T. Suppose events fall on it independently at a constant average rate r per unit time, so that the expected count in the window is λ = rT, and suppose the chance of two events landing on the same instant is zero. We want the probability that exactly k events fall inside our window. The Poisson answer is
P(X = k) = λ^k · e^(-λ) / k! for k = 0, 1, 2, ...
(Here λ is taken to be the average count over the window — if the rate is r per second and the window is 5 seconds, then λ = 5r.) The shape of the PMF is governed by a single positive number. For λ < 1 the most likely count is 0 and the distribution is sharply right-skewed. For λ around 1 the curve has a flat hump near 0 and 1. For larger λ it rounds out into something that looks more and more like a bell curve, and in fact Poisson(λ) is well-approximated by Normal(λ, λ) once λ exceeds about 20.
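The formula is short enough to evaluate directly. The sketch below is one minimal way to do it in Python; the function name `poisson_pmf` and the choice to work in log space (so that λ^k and k! cannot overflow for large k) are illustrative assumptions rather than part of the definition.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), assuming lam > 0."""
    if k < 0:
        return 0.0
    # exp(k*log(lam) - lam - log(k!)) is the same formula in log space
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

# Crude text histogram for a small, a unit and a large lam
for lam in (0.5, 1.0, 20.0):
    print(f"lam = {lam}")
    for k in range(41):
        p = poisson_pmf(k, lam)
        if p >= 0.01:
            print(f"  k={k:>2}  {p:.4f}  {'#' * round(100 * p)}")
```

Running it shows the three regimes described above: mass piled on 0 for λ = 0.5, a flat top across 0 and 1 for λ = 1, and a roughly symmetric hump centred near 20 for λ = 20.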
The cleanest derivation runs through the Binomial. Slice the time window into n equal sub-intervals of length T/n. Make n large enough that the probability of two events in a single sub-interval is negligible. Let p be the chance an event lands in any given sub-interval. Then the number of sub-intervals containing an event is Binomial(n, p). To keep the average count λ = np constant as we refine the slicing, we send n → ∞ and p → 0 with np = λ fixed. The Binomial PMF
C(n, k) p^k (1-p)^(n-k)
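tends, for each fixed k, to λ^k e^(-λ)/k!. With p = λ/n, the factor C(n, k) p^k equals [n(n−1)···(n−k+1)/n^k] · λ^k/k!, whose bracket tends to 1, while (1 − λ/n)^(n−k) → e^(-λ). The Poisson is what the Binomial relaxes to when the trials become infinitely numerous and each success becomes rare.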
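The limit can be checked numerically. The following sketch holds np = λ fixed and lets n grow; the values λ = 3.5 and k = 5 are arbitrary choices for illustration, and the helper names are not from the text.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """C(n, k) p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """lam^k e^(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam, k = 3.5, 5
target = poisson_pmf(k, lam)
for n in (10, 100, 1_000, 10_000):
    b = binomial_pmf(k, n, lam / n)  # p shrinks so that n*p stays equal to lam
    print(f"n={n:>6}  Binomial={b:.6f}  Poisson={target:.6f}  gap={abs(b - target):.2e}")
```

The gap shrinks steadily as n increases, which is the convergence the derivation describes.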
Mean equals variance, and other one-parameter consequences
The Poisson is a one-parameter family — λ controls everything. Two consequences are unusually useful in practice:
- Mean = Variance = λ. A direct sum gives E[X] = λ, and a similar manipulation on E[X(X−1)] yields Var[X] = λ. So if you fit a Poisson to data, you only need one moment to estimate the parameter, and you can sanity-check the fit by comparing the sample mean to the sample variance — they should be close.
- Additivity. X ~ Poisson(λ₁), Y ~ Poisson(λ₂), independent ⇒ X + Y ~ Poisson(λ₁ + λ₂). This is why aggregated streams of independent Poisson events stay Poisson: a thousand users, each generating Poisson arrivals at their own rate, sum into a Poisson process at the merged rate.
The moment generating function is M(t) = exp(λ(e^t − 1)). The skewness is 1/√λ: heavily right-skewed for small λ, nearly symmetric for λ in the hundreds. The excess kurtosis is 1/λ: heavier-tailed than the Normal at small λ, Normal-like at large λ.
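Both properties can be verified by brute force on the PMF. The sketch below sums the series up to a cutoff `k_max` (an assumption: 200 terms is far more than needed for the small λ used here) and checks additivity by convolving two PMFs directly. All names are illustrative.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def moments(lam: float, k_max: int = 200) -> tuple[float, float]:
    """Mean and variance from truncated sums over the PMF."""
    mean = sum(k * poisson_pmf(k, lam) for k in range(k_max + 1))
    second = sum(k * k * poisson_pmf(k, lam) for k in range(k_max + 1))
    return mean, second - mean**2

def convolve_at(k: int, lam1: float, lam2: float) -> float:
    """P(X + Y = k) for independent X ~ Poisson(lam1), Y ~ Poisson(lam2)."""
    return sum(poisson_pmf(j, lam1) * poisson_pmf(k - j, lam2) for j in range(k + 1))

mean, var = moments(3.5)
print(f"mean={mean:.6f}  variance={var:.6f}")
print(f"convolution: {convolve_at(4, 1.2, 2.3):.6f}  Poisson(3.5): {poisson_pmf(4, 3.5):.6f}")
```

The two printed moments agree with each other and with λ, and the convolution of Poisson(1.2) with Poisson(2.3) matches Poisson(3.5) at the chosen point.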
From distribution to process
The Poisson distribution describes counts in a single window. The Poisson process glues these together into a continuous-time arrival model. Three equivalent definitions all pick out the same object:
- The number of events in any window of length t is Poisson(λt), and counts in disjoint windows are independent.
- The probability of exactly one event in a small window of length h is λh + o(h); the probability of two or more is o(h); and the count is independent of past arrivals.
- The waiting times between consecutive events are i.i.d. Exponential(λ).
The third characterisation is the one that makes simulation easy. To generate a Poisson process at rate λ, draw exponential gaps and accumulate them — the kth event time is the sum of k Exp(λ) variables, which has a Gamma(k, λ) distribution. The Poisson is the marginal count law of this clock.
Memorylessness is the philosophical point. If you've been waiting 30 seconds for the next bus and the gaps are Exp(λ), the distribution of your remaining wait is still Exp(λ): the past wait carries no information about the future. Real bus schedules don't have this property, but radioactive decay does. Packet arrivals on networks are often modelled this way for convenience, although measured traffic tends to be burstier than a Poisson process.
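The exponential-gap construction translates almost line for line into code. This is a minimal simulation sketch using only the standard library; the function name, the rate of 3.5 and the seed are arbitrary choices for illustration.

```python
import random
import statistics

def poisson_process_times(rate: float, horizon: float, rng: random.Random) -> list[float]:
    """Event times in [0, horizon), built by accumulating Exp(rate) gaps."""
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times

rng = random.Random(0)  # fixed seed so the run is repeatable
# Count events in many independent unit-length windows
counts = [len(poisson_process_times(3.5, 1.0, rng)) for _ in range(20_000)]
mean = statistics.fmean(counts)
var = statistics.variance(counts)
print(f"mean count = {mean:.3f}, variance = {var:.3f}")
```

If the construction is right, the simulated window counts should have mean and variance both close to the rate times the window length.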
Worked example: ER admissions
An emergency department records on average 3.5 admissions per hour. Assume admissions arrive as a Poisson process. What is the probability that exactly 5 patients arrive in a given hour?
λ = 3.5
k = 5
P(X = 5) = (3.5)^5 · e^(-3.5) / 5!
= 525.219 · 0.030197 / 120
≈ 0.1322
So about a 13% chance. The full PMF rounds out as follows for the same hour:
P(X = 0) ≈ 0.0302
P(X = 1) ≈ 0.1057
P(X = 2) ≈ 0.1850
P(X = 3) ≈ 0.2158 ← mode
P(X = 4) ≈ 0.1888
P(X = 5) ≈ 0.1322
P(X = 6) ≈ 0.0771
P(X ≥ 7) ≈ 0.0653
From this you can answer staffing questions. The 95th percentile is 7 admissions; if you want to be 95% sure of having capacity, plan for 7. The probability of more than 10 admissions is about 0.1%, a tail event but not negligible across a year of hours: at roughly 8,760 hours a year it corresponds to about nine such hours.
Now suppose the night shift sees the same rate over an 8-hour period. The total count is Poisson(28), and by the additivity property you can reach this either by summing eight independent hourly Poisson(3.5) or by viewing the 8-hour window directly. Both perspectives are mathematically the same, which is the entire point of the Poisson process.
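The numbers in this example can be reproduced with a few lines of Python. The helpers below (`poisson_cdf`, `poisson_quantile`) are illustrative names, and the quantile is defined here as the smallest k whose cumulative probability reaches q.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    return lam**k * math.exp(-lam) / math.factorial(k)

def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k)."""
    return sum(poisson_pmf(j, lam) for j in range(k + 1))

def poisson_quantile(q: float, lam: float) -> int:
    """Smallest k with P(X <= k) >= q."""
    k, total = 0, poisson_pmf(0, lam)
    while total < q:
        k += 1
        total += poisson_pmf(k, lam)
    return k

lam = 3.5
for k in range(7):
    print(f"P(X = {k}) = {poisson_pmf(k, lam):.4f}")
print(f"P(X >= 7) = {1 - poisson_cdf(6, lam):.4f}")
print("95th percentile (one hour):", poisson_quantile(0.95, lam))
print(f"P(X > 10) = {1 - poisson_cdf(10, lam):.4f}")

# Night shift: eight hours at the same rate is a single Poisson(8 * 3.5) window
print("95th percentile (eight hours):", poisson_quantile(0.95, 8 * lam))
```

The same three functions answer the eight-hour question by swapping in λ = 28, which is the practical payoff of treating the shift as one Poisson window.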
Poisson, Binomial, and Normal at a glance
| | Binomial(n, p) | Poisson(λ) | Normal(μ, σ²) |
|---|---|---|---|
| Support | {0, 1, ..., n} | {0, 1, 2, ...} | (−∞, ∞) |
| Mean | np | λ | μ |
| Variance | np(1−p) | λ | σ² |
| Parameters | 2 (n, p) | 1 (λ) | 2 (μ, σ²) |
| Discrete? | Yes | Yes | No |
| Use when | Fixed n trials | Rare events, fixed window | Continuous, large samples |
| Limiting case of | — | Binomial as n→∞, p→0 | Both, by CLT |
| Sum of independents | Same family iff equal p | Always Poisson | Always Normal |
The three distributions sit on a single ladder. Binomial(n, p) describes a fixed batch of yes/no trials. Push n high and p low so that np = λ is fixed and you get Poisson(λ). Push λ high and you get Normal(λ, λ) by the central limit theorem. The right tool depends on which limits dominate your data.
Where the Poisson distribution shows up
- Radioactive decay. A gram of radium emits about 3.7×10¹⁰ alpha particles per second. The count in any second-long window is Poisson with that mean. Geiger-counter electronics are calibrated against the predicted Poisson PMF — anomalies in the variance indicate detector dead-time or pile-up.
- Premier League goals. Match-by-match goal counts fit Poisson with λ ≈ 1.35 per team per match. Bookmakers price over/under markets using the implied Poisson — and the over-2.5-goals line of about 50% probability falls naturally out of two independent Poisson(1.35) team rates summed.
- Bortkiewicz's Prussian cavalry data. Horse-kick deaths in 14 corps, 1875–1894. Total 196 deaths over 280 corps-years, λ = 0.7. Counts of corps-years with 0/1/2/3+ deaths: about 139/97/34/10 expected under Poisson(0.7), 144/91/32/13 observed. The fit was close enough to become the Poisson's founding case study.
- Web server requests. Aggregate request streams to a busy web server are typically modelled as Poisson once you condition on time-of-day. Capacity planning uses Erlang-C (a queueing formula) which assumes Poisson arrivals into a server pool, and the SLA tail is computed from the Poisson PMF.
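- Read depth in genome sequencing. Under idealised random sampling of reads, the number of reads covering a given base is approximately Poisson with λ equal to the mean coverage. Count models of this kind are a common starting point for per-site error and variant models, although production callers such as GATK use more elaborate per-read likelihoods and real coverage is usually overdispersed.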
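As a concrete check on the football example, the over-2.5-goals probability can be computed two ways: by summing a grid of independent scorelines, and by applying additivity to get a single Poisson(2.7) total. This is a sketch under the stated independence assumption; `prob_over` and the 15-goal cutoff per team are illustrative choices.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    return lam**k * math.exp(-lam) / math.factorial(k)

def prob_over(line: float, lam_home: float, lam_away: float, max_goals: int = 15) -> float:
    """P(home + away > line), summing independent scoreline probabilities."""
    total = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            if h + a > line:
                total += poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
    return total

over = prob_over(2.5, 1.35, 1.35)
# Additivity shortcut: total goals ~ Poisson(1.35 + 1.35)
shortcut = 1 - sum(poisson_pmf(k, 2.7) for k in range(3))
print(f"scoreline grid: {over:.4f}   Poisson(2.7) tail: {shortcut:.4f}")
```

Both routes give a probability just above one half, consistent with the roughly 50% line quoted for the market.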
Estimating λ from data
The maximum likelihood estimator of λ from a sample x₁, ..., x_n is just the sample mean:
λ̂_MLE = (x₁ + x₂ + ... + x_n) / n
The MLE is unbiased, efficient, and equal to the method-of-moments estimator. The (asymptotic) standard error is √(λ̂/n), so a 95% confidence interval is approximately λ̂ ± 1.96 √(λ̂/n). For small counts, exact confidence intervals based on the chi-squared distribution are preferred:
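Lower: ½ χ²(α/2; 2k) / n where k = x₁ + ... + x_n is the total count
Upper: ½ χ²(1 − α/2; 2k+2) / n
(The χ² quantiles bound the expected total count nλ, hence the division by n; for k = 0 the lower limit is 0.)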
Goodness-of-fit is checked with a chi-squared test bucketing observed counts vs Poisson-predicted counts, or with the dispersion test that compares sample variance to sample mean. A dispersion ratio (variance/mean) much larger than 1 is a red flag — switch to Negative Binomial.
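The point estimate, the approximate interval and the dispersion check fit in a few lines. The sketch below uses the horse-kick frequencies quoted earlier; splitting the 13 corps-years with three or more deaths into 11 with three and 2 with four is an assumption, chosen because it is the split that reproduces the stated total of 196 deaths if no corps-year had five or more. The exact chi-squared interval is left out because the standard library has no chi-squared quantile function.

```python
import math
import statistics

# deaths per corps-year -> number of corps-years (3+ bucket split as 11 and 2, see above)
freq = {0: 144, 1: 91, 2: 32, 3: 11, 4: 2}
sample = [k for k, f in freq.items() for _ in range(f)]

n = len(sample)
lam_hat = statistics.fmean(sample)            # MLE = sample mean
se = math.sqrt(lam_hat / n)                   # asymptotic standard error
ci = (lam_hat - 1.96 * se, lam_hat + 1.96 * se)

# Dispersion ratio: sample variance over sample mean, close to 1 under Poisson
dispersion = statistics.variance(sample) / lam_hat
# Index-of-dispersion statistic, approximately chi-squared with n - 1 degrees of freedom
d_stat = (n - 1) * dispersion

print(f"n={n}  lambda_hat={lam_hat:.3f}  95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
print(f"dispersion ratio={dispersion:.3f}  D={d_stat:.1f} on {n - 1} df")
```

A ratio near 1, with D close to its degrees of freedom, gives no reason to abandon the Poisson model for these data.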
Variants and extensions
- Compound Poisson. Sum of N i.i.d. random jumps where N is Poisson(λ). Used in insurance for total claims (number of claims is Poisson, claim sizes are heavy-tailed) and in finance for jump-diffusion models like Merton 1976.
- Inhomogeneous Poisson process. The rate λ(t) varies in time. Counts in any window are Poisson with mean ∫λ(t) dt over the window. Used in seismology, where aftershock rates follow Omori's law λ(t) ∝ 1/(t + c).
- Spatial Poisson process. The same idea in 2D or 3D — count of points in any region A is Poisson(∫_A λ(x) dx). Used in cosmology for galaxy distributions, in forestry for tree placement, in cellular biology for receptor positions on a membrane.
- Zero-inflated Poisson. A mixture: with probability p the count is forced to 0, otherwise it is Poisson(λ). Models data with an excess of zeros — visits to a clinic, accidents at safe intersections.
- Negative Binomial. Poisson-Gamma mixture: λ itself is drawn from a Gamma distribution per observation. Allows variance > mean (overdispersion) and is the standard fix when raw Poisson regression fails its dispersion test.
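The Poisson-Gamma mixture is easy to simulate, and doing so makes the overdispersion visible. In this sketch the Poisson sampler counts unit-rate exponential gaps (a simple method, not the fastest), and the mean of 3.5 and Gamma shape of 2 are arbitrary illustrative parameters; with them the theoretical variance is mean + mean²/shape.

```python
import random
import statistics

def sample_poisson(lam: float, rng: random.Random) -> int:
    """Poisson(lam) draw: number of Exp(1) arrivals that land in [0, lam)."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def sample_gamma_poisson(mean: float, shape: float, rng: random.Random) -> int:
    """Draw lam ~ Gamma(shape, scale=mean/shape), then a Poisson(lam) count."""
    lam = rng.gammavariate(shape, mean / shape)
    return sample_poisson(lam, rng)

rng = random.Random(0)  # fixed seed so the run is repeatable
plain = [sample_poisson(3.5, rng) for _ in range(20_000)]
mixed = [sample_gamma_poisson(3.5, 2.0, rng) for _ in range(20_000)]

for name, xs in (("Poisson", plain), ("Gamma-Poisson", mixed)):
    m, v = statistics.fmean(xs), statistics.variance(xs)
    print(f"{name:>14}: mean={m:.2f}  variance={v:.2f}  ratio={v / m:.2f}")
```

Both samples share roughly the same mean, but only the plain Poisson keeps variance close to the mean; the mixture's ratio sits well above 1.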
Common pitfalls
- Treating dependent events as independent. Goal scoring after a red card is not independent of the red card. Fitting a single λ across heterogeneous match contexts produces biased predictions even if the marginal mean looks right.
- Ignoring overdispersion. If the sample variance is twice the sample mean, Poisson confidence intervals are too narrow by a factor of √2. The model fits the mean but lies about the spread, and downstream inferences (p-values, prediction intervals) are correspondingly wrong.
- Forgetting the rate vs count distinction. λ is not an instantaneous probability — it is an expected count in the window you have specified. Doubling the window doubles λ. Saying "λ = 0.7 per year" is meaningful; saying "λ = 0.7" with no window is not.
- Using Poisson when n is fixed and known. If you are doing 100 coin flips, use Binomial(100, p). The Poisson approximation is for the regime where n is enormous and p is small enough that n itself becomes irrelevant. With small n the Poisson assigns positive probability to counts above n, which are impossible, whereas the Binomial cuts off cleanly at n.
- Reading the PMF as a probability density. The Poisson is discrete. P(X = 1.5) is not 0.07 read off some interpolated curve; it is exactly 0, because X only takes integer values. Plot bars, not lines, and compare to integer bins of the data.
Frequently asked questions
When should I model count data with a Poisson distribution?
When events arrive independently of one another at a roughly constant average rate, no two events occur at exactly the same instant, and you are counting how many fall inside a fixed window of time, space, area or volume. The classic checklist is: count, not measurement; fixed window; events independent; rate stable; and observed mean approximately equal to observed variance. If the variance is much larger than the mean (overdispersion) the Poisson assumption is broken — usually in favour of the Negative Binomial.
Why is the mean of a Poisson distribution equal to its variance?
Both come out to λ from a direct calculation on the PMF. E[X] = Σ k · λ^k e^(-λ)/k! = λ after factoring λ out and re-indexing. E[X(X-1)] = λ², so Var[X] = E[X²] - (E[X])² = (λ² + λ) - λ² = λ. The deeper reason is structural: the Poisson is the limit of Binomial(n, λ/n) where np = λ and np(1-p) → λ as p → 0, so the binomial's mean and variance collide. Empirically this means a fitted Poisson model has zero free parameters once you have one estimate of the rate.
What is the connection between the Poisson distribution and the Binomial?
Poisson is the rare-event limit of the Binomial. If you split a window into n tiny sub-intervals and the probability of an event in each sub-interval is p = λ/n, then Binomial(n, λ/n) converges to Poisson(λ) as n → ∞. The textbook rule of thumb is that if n ≥ 50 and p ≤ 0.05, the Poisson approximation is accurate to within a few per cent, and it lets you replace a heavy combinatorial PMF with a single exponential-and-factorial.
What is a Poisson process?
A Poisson process is the continuous-time companion to the Poisson distribution. Events arrive on a timeline so that (a) the count in any window of length t is Poisson(λt), (b) counts in disjoint windows are independent, and (c) the time between consecutive events is exponentially distributed with rate λ. It is the canonical model for telephone call arrivals, neutron emissions, insurance claims, server requests, and any other "memoryless" arrival stream.
How do I add two independent Poisson random variables?
If X ~ Poisson(λ₁) and Y ~ Poisson(λ₂) are independent, then X + Y ~ Poisson(λ₁ + λ₂). This additive closure makes the Poisson a natural model for aggregating independent streams: web requests from a thousand users, each Poisson at an individual rate, sum to a Poisson with the summed rate. A separate limit result (the Palm-Khintchine theorem) says that the superposition of many sparse independent streams is approximately Poisson even when no single stream is, which is why total-traffic models in operations research are typically Poisson.
What goes wrong when count data is overdispersed?
Overdispersion means the observed variance exceeds the mean — the Poisson model's signature constraint is violated. Confidence intervals built from Poisson assumptions become anti-conservative: standard errors are underestimated, p-values are too small, and you over-reject the null. The standard fix is to switch to a Negative Binomial regression, which adds a dispersion parameter, or to use quasi-Poisson estimation that inflates standard errors by the dispersion factor.
Who invented the Poisson distribution?
Siméon Denis Poisson published the formula in his 1837 treatise Recherches sur la probabilité des jugements en matière criminelle as a tool for analysing wrongful conviction rates. The distribution sat almost unused for decades. The famous 1898 study by Ladislaus Bortkiewicz fitted a Poisson with λ = 0.70 to deaths from horse-kicks in 14 Prussian cavalry corps over 20 years (the often-quoted λ = 0.61 comes from a 10-corps subset of the same data) and showed a close match, turning the Poisson into a mainstream statistical model.