Probability
Gamma Distribution
The waiting time for the α-th event in a Poisson process
The Gamma distribution Gamma(α, β) is a continuous distribution on (0, ∞) with shape α and rate β. Mean α/β, variance α/β². Contains exponential, Erlang and chi-squared as special cases.
- Densityf(x) = β^α x^(α−1) e^(−βx) / Γ(α)
- Supportx > 0
- Mean / Varianceα/β · α/β²
- α = 1exponential(β)
- α = k/2, β = 1/2chi-squared(k)
- Conjugate toPoisson rate
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The density
Gamma(α, β) has probability density function:
f(x; α, β) = (β^α / Γ(α)) · x^(α − 1) · e^(−βx) for x > 0
= 0 for x ≤ 0
Here α > 0 is the shape parameter and β > 0 is the rate parameter. Γ(α) is the gamma function — the normalizing constant that makes the density integrate to 1. For positive integers Γ(n) = (n − 1)!.
Some textbooks use scale θ = 1/β instead of rate β. Same distribution; just a different parametrization. The choice usually depends on the application: Bayesian statisticians prefer rate (it's the natural parameter), while engineers often prefer scale (it's the mean time per event).
Interpretation — waiting time
If events arrive according to a Poisson process with rate β (so the expected wait between events is 1/β), then the time T until the α-th event satisfies:
T = T₁ + T₂ + ... + T_α where each Tᵢ ~ Exp(β) independently
T ~ Gamma(α, β)
This works for integer α (Erlang case). For non-integer α the distribution still makes sense as a smooth extension via the gamma function. Concrete example: in a process with rate β = 2 events/minute, the wait for the third event is Gamma(3, 2) with mean 3/2 = 1.5 minutes.
Moments
| Quantity | Formula | Numerical (α = 3, β = 2) |
|---|---|---|
| Mean | α / β | 3/2 = 1.5 |
| Variance | α / β² | 3/4 = 0.75 |
| Std deviation | √α / β | √3/2 ≈ 0.866 |
| Mode (α > 1) | (α − 1) / β | 2/2 = 1.0 |
| Skewness | 2/√α | 2/√3 ≈ 1.155 |
| Excess kurtosis | 6/α | 2 |
| MGF | (1 − t/β)^(−α), t < β | — |
Skewness and kurtosis both decrease as α grows — large-shape Gammas approach a Normal distribution (consequence of the central limit theorem on the sum of α exponentials).
Special cases — what Gamma contains
| Special case | Parameters | Domain it dominates |
|---|---|---|
| Exponential(λ) | α = 1, β = λ | Memoryless waiting time; first-event timing |
| Erlang(k, λ) | α = k (integer), β = λ | k-th event in Poisson process; queueing |
| Chi-squared(k) | α = k/2, β = 1/2 | Sum of k squared standard Normals; hypothesis tests |
| Wishart(scalar) | 1D specialization | Covariance matrix priors in Bayesian stats |
| Beta(α, β) (related) | X/(X+Y) where X,Y ~ Gamma | Proportions, fractions, ratios |
Five named distributions under one roof. Whenever you see exponential, Erlang, or chi-squared, you're looking at a Gamma.
Gamma vs Chi-squared vs Exponential vs Erlang
| Distribution | Pdf form | Mean | Var | Use case |
|---|---|---|---|---|
| Gamma(α, β) | x^(α−1) e^(−βx) · β^α / Γ(α) | α/β | α/β² | Generic waiting time, scale-shape |
| Exponential(λ) | λ e^(−λx) | 1/λ | 1/λ² | Memoryless waiting; lifetime |
| Erlang(k, λ) | λ^k x^(k−1) e^(−λx)/(k−1)! | k/λ | k/λ² | k-th event in Poisson; M/M/c queues |
| Chi-squared(k) | x^(k/2−1) e^(−x/2)/(2^(k/2)Γ(k/2)) | k | 2k | Sum of k squared Normals; tests |
| Weibull(k, λ) | (k/λ)(x/λ)^(k−1) e^(−(x/λ)^k) | λΓ(1+1/k) | — | Failure analysis; heavier tails |
| Lognormal(μ, σ) | (1/(xσ√2π))exp(−(ln x − μ)²/2σ²) | e^(μ+σ²/2) | — | Multiplicative processes; income |
Exponential, Erlang, and chi-squared are nested inside Gamma. Weibull and Lognormal are alternative right-skewed distributions with different tail behavior.
Conjugacy with Poisson — Bayesian update
Place a Gamma(α, β) prior on the rate λ of a Poisson process. Observe n events in time T (data X ~ Poisson(λT)):
Prior: λ ~ Gamma(α, β)
Likelihood: X ~ Poisson(λT)
Posterior: λ | X = n ~ Gamma(α + n, β + T)
Posterior mean: (α + n) / (β + T)
The shape gains the count of observed events; the rate gains the total observation time. Closed-form Bayesian inference — no MCMC. The posterior mean interpolates between prior mean α/β and the MLE n/T, with weights determined by α relative to n and β relative to T.
Sum of independent Gammas
If X₁ ~ Gamma(α₁, β) and X₂ ~ Gamma(α₂, β) are independent (with the same rate β), then:
X₁ + X₂ ~ Gamma(α₁ + α₂, β)
The shapes add; the rate stays. This is "infinite divisibility" and makes Gamma natural for additive processes. It also gives the cleanest derivation that Gamma(n, β) is the sum of n independent Exp(β) variables for integer n — known as the Erlang special case.
Where the Gamma distribution shows up
- Queueing theory. Total service time for k customers, each served exponentially: Erlang. M/E_k/1 queues model multi-stage processing pipelines.
- Reliability engineering. Time to k-th failure in redundant systems; component lifetime distributions when failures are Poisson.
- Bayesian rate inference. Conjugate prior for Poisson rates — count data in clinical trials, web analytics, server load.
- Hypothesis testing. The chi-squared special case underlies goodness-of-fit, contingency tables, variance ratio tests.
- Hierarchical models. Gamma distributions on variance parameters (precision = 1/variance) appear throughout Bayesian hierarchical regression.
- Insurance. Claim size distributions are often Gamma-fitted; claim aggregation in compound Poisson models.
- Meteorology. Rainfall amounts at fixed time intervals are often right-skewed positive — Gamma fits well empirically.
- Telecommunications. Inter-arrival times in network packet flows; Gamma-mixed exponentials capture bursty traffic.
Parameter estimation
Method of moments — fast and simple:
α̂ = (sample mean)² / (sample variance)
β̂ = sample mean / sample variance
Maximum likelihood — more efficient but requires solving for α̂:
ψ(α̂) − ln α̂ = ln(geometric mean) − ln(sample mean)
β̂ = α̂ / sample mean
(ψ is the digamma function; solve for α̂ numerically — Newton-Raphson converges fast.)
In Python, scipy.stats.gamma.fit(data) handles MLE automatically. In R, MASS::fitdistr(data, "gamma") does the same.
Common pitfalls
- Rate vs scale confusion. R's dgamma uses rate β by default but accepts scale = 1/β. Python's scipy uses scale by default. Always check the parametrization before passing parameters across tools.
- Trying to fit Gamma to data including zeros. Gamma is supported on (0, ∞), strict. Zero-inflated or hurdle models handle "zero plus continuous positive" data more honestly.
- Forgetting α > 1 mode. The mode is at (α − 1)/β only if α > 1. For α ≤ 1 the mode is at 0; the density is monotonically decreasing.
- Sum of Gammas with different rates. The sum has no closed form unless rates match. Mixed-rate sums require generalized hypergeometric functions or numerical convolution.
- Confusing Gamma distribution with Gamma function. The function Γ(α) is the normalizing constant in the distribution's pdf, but the function itself is a separate object — Γ(α) = ∫x^{α−1}e^{−x}dx.
- Heavy-tailed alternative confusion. Gamma has exponential tail decay. Real-world "heavy-tailed" data (income, file sizes) often need lognormal or power-law fits instead — Gamma is too light-tailed for those.
Frequently asked questions
Why are there two parameters (shape and rate)?
α (shape) controls the form of the curve; β (rate) controls the horizontal scale. For α = 1 the density is purely decreasing (exponential); for α > 1 it's unimodal with a peak at (α − 1)/β; for α between 0 and 1 it's infinite at zero and decreasing. Increasing β squeezes the distribution toward zero; decreasing β stretches it out. The mean α/β shows the trade-off: bigger α extends the peak right, bigger β pulls it left. Some textbooks parametrize by scale θ = 1/β instead — same distribution, different parameter convention.
How does Gamma relate to the exponential distribution?
Gamma(1, β) = Exponential(β). At α = 1 the density β^1 x^0 e^{−βx}/Γ(1) reduces to βe^{−βx}, the exponential pdf. Interpretation: the time until the first event in a Poisson(β) process is exponential; the time until the α-th event is Gamma(α, β). A sum of α independent Exp(β) variables is exactly Gamma(α, β) when α is a positive integer — this is the Erlang special case, used in queueing theory and reliability engineering.
How does Gamma reduce to chi-squared?
Chi-squared with k degrees of freedom is exactly Gamma(α = k/2, β = 1/2). Verify: the chi-squared pdf is (1/2)^{k/2} x^{k/2 − 1} e^{−x/2} / Γ(k/2), which is Gamma(k/2, 1/2). For k = 2 this is Exp(1/2) — mean 2. For k = 4 it's Gamma(2, 1/2), mean 4. The connection runs deep: a chi-squared is a sum of squared standard normals, and Gamma(α, β) is the conjugate prior for the rate parameter of a Poisson — both perspectives point at sums of squared/transformed quantities.
What does "conjugate prior" mean for Gamma and Poisson?
If you place a Gamma(α, β) prior on the rate λ of a Poisson process and observe n events in time T (which is Poisson(λT) data), the posterior for λ is Gamma(α + n, β + T). Same family, updated parameters — conjugate. This makes Bayesian updating closed-form for Poisson rate inference: no MCMC, no numerical integration. The shape gains the count, the rate gains the observation time. Posterior mean (α + n)/(β + T) interpolates between prior mean α/β and MLE n/T.
What is the Gamma function Γ(α)?
The normalizing constant in the Gamma pdf. Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx. For positive integers Γ(n) = (n − 1)!, generalizing factorial to real (and complex) arguments. Γ(α + 1) = α · Γ(α) is the recursion. Important values: Γ(1) = 1, Γ(1/2) = √π, Γ(2) = 1, Γ(3) = 2. The Gamma function is the natural continuous extension of factorial — proven by Bohr-Mollerup characterization to be the unique log-convex extension. Computing Γ for non-integer α requires Stirling-like approximations or numerical libraries.
How do you fit a Gamma distribution to data?
Method of moments: α̂/β̂ = sample mean, α̂/β̂² = sample variance, so α̂ = (mean)²/var, β̂ = mean/var. Quick but biased for small samples. Maximum likelihood: no closed form for α̂; solve ψ(α) − ln α = ln(geometric mean) − ln(sample mean) numerically (ψ is the digamma function), then β̂ = α̂/sample mean. ML is more efficient than method of moments. The likelihood is unimodal in α, so Newton-Raphson converges quickly. Statistical libraries (scipy.stats.gamma.fit, R fitdistr) implement this automatically.
What real-world phenomena follow a Gamma distribution?
Service times in queueing (M/G/1 queues). Time-to-failure in reliability engineering when failures follow a Poisson process. Rainfall amounts in meteorology (positive, right-skewed). Insurance claim sizes. Latency distributions in network engineering (often Gamma-mixed with other parametric forms). Gene expression levels in single-cell sequencing. Anywhere you have a positive continuous quantity that's right-skewed with a clear scale and shape, Gamma is the first parametric family to try.