Dirichlet Distribution
Multivariate Beta on the simplex — distribution over probability vectors
Dir(α₁, …, α_k) is a distribution over probability vectors summing to 1 — the conjugate prior for the multinomial. Powers topic models (LDA), Bayesian categorical inference.
- Density: f(p) = ∏ᵢ pᵢ^(αᵢ−1) / B(α)
- Support: (k−1)-simplex, pᵢ ≥ 0, Σpᵢ = 1
- Mean: αᵢ / α₀ where α₀ = Σαⱼ
- Conjugacy: Dir(α) + counts n → Dir(α + n)
- LDA defaults: α = 50/K (documents), η ≈ 0.01 (topics)
- First studied: Dirichlet 1839; LDA by Blei, Ng, and Jordan 2003
The simplex and the density
The (k − 1)-simplex Δ^(k−1) is the set of probability vectors (p₁, …, p_k) satisfying pᵢ ≥ 0 and Σ pᵢ = 1. It's the natural home for categorical probabilities: with k = 3 it's a triangle in the plane (sometimes drawn as a barycentric coordinate diagram); with k = 4 it's a tetrahedron; for general k it's a (k − 1)-dimensional surface in ℝᵏ.
The Dirichlet distribution Dir(α₁, …, α_k) puts a probability density on the simplex:
f(p₁, …, p_k; α₁, …, α_k) = (1 / B(α)) · ∏ᵢ pᵢ^(αᵢ − 1)
B(α) = ∏ᵢ Γ(αᵢ) / Γ(Σᵢ αᵢ) (multivariate beta function)
The parameters αᵢ > 0 are called concentration parameters. Their sum α₀ = Σᵢ αᵢ is the total concentration. Properties:
- Mean: E[pᵢ] = αᵢ / α₀
- Variance: Var(pᵢ) = αᵢ(α₀ − αᵢ) / (α₀² (α₀ + 1))
- Covariance: Cov(pᵢ, pⱼ) = −αᵢ αⱼ / (α₀² (α₀ + 1)) — always negative
- Mode (when all αᵢ > 1): (αᵢ − 1) / (α₀ − k)
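The negative covariance is structural: if one pᵢ increases, the others must decrease to keep the sum at 1. The sum-to-one constraint is hard, and the Dirichlet has no free covariance parameters: α fixes the means and the overall concentration, and every covariance follows from those.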
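The moment formulas above can be checked numerically. The following is a minimal sketch: the helper name `dirichlet_moments` and the example α = (2, 3, 5) are illustrative choices, not part of the original text.

```python
import numpy as np

def dirichlet_moments(alpha):
    """Closed-form mean, covariance matrix, and interior mode (if any) of Dir(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    k = alpha.size
    mean = alpha / a0
    # Diagonal: alpha_i (a0 - alpha_i) / (a0^2 (a0 + 1)); off-diagonal: -alpha_i alpha_j / (a0^2 (a0 + 1))
    cov = (np.diag(alpha) * a0 - np.outer(alpha, alpha)) / (a0**2 * (a0 + 1.0))
    # The interior-mode formula only applies when every alpha_i > 1
    mode = (alpha - 1.0) / (a0 - k) if np.all(alpha > 1.0) else None
    return mean, cov, mode

alpha = np.array([2.0, 3.0, 5.0])  # illustrative parameters
mean, cov, mode = dirichlet_moments(alpha)

# Compare against a Monte Carlo estimate
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=200_000)
emp_cov = np.cov(samples, rowvar=False)

print("mean:", mean)
print("mode:", mode)
print("closed-form covariance:\n", cov)
print("empirical covariance:\n", emp_cov)
```

Every row of the covariance matrix sums to zero, which is the sum-to-one constraint showing up in the second moments.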
Three regimes — concentration shapes
The behavior of Dir(α) on a 3-category simplex depends on α₀ and the relative αᵢ:
| α | α₀ | Behavior | Use case |
|---|---|---|---|
| (1, 1, 1) | 3 | Uniform on triangle — every probability vector equally likely | Non-informative prior |
| (0.5, 0.5, 0.5) | 1.5 | Mass piles at corners — sparse draws, one category wins | Sparse topic models |
| (5, 5, 5) | 15 | Bell around centroid (1/3, 1/3, 1/3) | Concentrated near uniform |
| (50, 50, 50) | 150 | Sharp Gaussian-like peak at centroid | Strong prior near uniform |
| (5, 1, 1) | 7 | Asymmetric: mean (5/7, 1/7, 1/7); density ∝ p₁⁴ peaks at the corner (1, 0, 0), so the first category dominates | Prior favoring category 1 |
| (0.1, 0.1, 0.1) | 0.3 | Extreme corner-piling — draws nearly always at one corner | Heavily sparse priors |
Small α₀ → sparse → corners. Large α₀ → dense → centroid. Asymmetric α → mode shifts toward the heaviest component.
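One way to see these regimes without plotting is to measure how often a single category takes almost all the mass. This is a sketch under assumptions: the helper `dominance_rate` and the 0.9 threshold are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def dominance_rate(alpha, n_draws=50_000, threshold=0.9):
    """Fraction of draws in which one category holds more than `threshold` of the mass."""
    draws = rng.dirichlet(alpha, size=n_draws)
    return float((draws.max(axis=1) > threshold).mean())

# Symmetric Dir(a, a, a) for a range of per-component concentrations
rates = {a: dominance_rate([a, a, a]) for a in (0.1, 0.5, 1.0, 5.0, 50.0)}

for a, rate in rates.items():
    print(f"alpha = ({a}, {a}, {a}): P(max p_i > 0.9) is about {rate:.3f}")
```

For the uniform case the exact value is 3 · (0.1)² = 0.03, because each pᵢ is marginally Beta(1, 2). The rate climbs as α drops below 1 and is essentially zero once α reaches 5.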
Conjugacy with multinomial
The categorical/multinomial likelihood for observed counts (n₁, …, n_k) with total n = Σnᵢ given probabilities p is:
L(p | n) = (n! / ∏ nᵢ!) · ∏ pᵢ^(nᵢ)
Multiply by the Dirichlet prior:
f(p | α) · L(p | n) ∝ ∏ pᵢ^(αᵢ − 1) · ∏ pᵢ^(nᵢ)
= ∏ pᵢ^(αᵢ + nᵢ − 1)
posterior: p | n ~ Dir(α₁ + n₁, …, α_k + n_k)
The Bayes update is pure addition: prior + observed counts. This is what makes Dirichlet/multinomial inference computationally tractable, especially in topic models and language models with millions of words.
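The conjugacy claim can be verified directly: if the posterior really is Dir(α + n), then log prior + log likelihood − log posterior must be the same constant at every point p (that constant is the log marginal likelihood of the counts). The sketch below uses only the standard library; the prior, counts, and test points are made up for illustration.

```python
import math

def log_dirichlet_pdf(p, alpha):
    """Log density of Dir(alpha) at an interior point p of the simplex."""
    log_B = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p)) - log_B

def log_multinomial_pmf(counts, p):
    """Log probability of the count vector under a multinomial with probabilities p."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coef + sum(c * math.log(x) for c, x in zip(counts, p))

alpha = (2.0, 1.0, 1.5)   # illustrative prior
counts = (4, 0, 3)        # illustrative observed counts
posterior = tuple(a + c for a, c in zip(alpha, counts))

points = [(0.2, 0.3, 0.5), (0.6, 0.3, 0.1), (1 / 3, 1 / 3, 1 / 3)]
gaps = [
    log_dirichlet_pdf(p, alpha) + log_multinomial_pmf(counts, p) - log_dirichlet_pdf(p, posterior)
    for p in points
]
print("log prior + log likelihood - log posterior at each point:", gaps)
```

The three printed values agree up to floating-point error, which is what Bayes' rule requires when the posterior family is correct.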
Worked example — word frequencies
Suppose you have a small corpus and want to model the relative frequency of three words: "the," "and," "of." Begin with a non-informative prior Dir(1, 1, 1).
Observe 1000 word tokens with counts (450, 320, 230). Posterior is Dir(451, 321, 231). Posterior mean is (451/1003, 321/1003, 231/1003) = (0.450, 0.320, 0.230) — essentially the empirical frequencies. With α₀ = 3 (weak prior), the posterior is dominated by data after 1000 observations.
Now compare with a strong prior Dir(100, 100, 100), which enters the posterior mean like 100 "pre-observed" tokens of each word. The posterior becomes Dir(550, 420, 330), with mean (550/1300, 420/1300, 330/1300) = (0.423, 0.323, 0.254). The strong prior pulls the posterior partway toward uniform (1/3 each): even after 1000 fresh observations, the prior still has measurable influence.
The "rule of α₀": prior worth = α₀ pseudo-observations. To dominate it, collect ≫ α₀ real data points.
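The pseudo-observation reading follows from writing the posterior mean as a weighted average of the prior mean and the empirical frequencies, with weight α₀ / (α₀ + n) on the prior. A short sketch using the counts from the worked example (variable names are illustrative):

```python
import numpy as np

counts = np.array([450, 320, 230])
n = counts.sum()
empirical = counts / n

results = {}
for name, prior in {"weak": [1.0, 1.0, 1.0], "strong": [100.0, 100.0, 100.0]}.items():
    prior = np.asarray(prior)
    a0 = prior.sum()
    w = a0 / (a0 + n)                                   # weight on the prior mean
    blended = w * (prior / a0) + (1.0 - w) * empirical  # weighted-average form
    direct = (prior + counts) / (a0 + n)                # conjugate-update form
    results[name] = (w, blended, direct)
    print(f"{name}: prior weight = {w:.3f}, posterior mean = {np.round(direct, 3)}")
```

The weak prior carries a weight of 3/1003, the strong prior 300/1300 (about 23%), which is why the second posterior still sits visibly closer to uniform.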
Sampling — Gamma trick
The standard way to sample from Dir(α₁, …, α_k):
1. Sample y_i ~ Gamma(α_i, 1) independently for i = 1, ..., k.
2. Return p_i = y_i / Σ_j y_j.
The resulting p is exactly Dir(α). This works because: if Y₁ ~ Gamma(α₁, 1) and Y₂ ~ Gamma(α₂, 1) are independent, then Y₁ / (Y₁ + Y₂) ~ Beta(α₁, α₂) — and the multivariate generalization gives Dirichlet.
An alternative: stick-breaking. Sample β₁ ~ Beta(α₁, α₂ + … + α_k), then p₁ = β₁. For the next, β₂ ~ Beta(α₂, α₃ + … + α_k), and p₂ = (1 − β₁)·β₂. Continue. This is the construction underlying the Dirichlet process (the infinite-dimensional generalization).
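The stick-breaking recipe can be sketched as follows. The function name `dirichlet_stick_breaking` is hypothetical, and the sample-mean comparison is only a sanity check, not a full distributional test.

```python
import numpy as np

def dirichlet_stick_breaking(alpha, rng):
    """Draw one Dir(alpha) sample by breaking a unit stick with Beta fractions."""
    alpha = np.asarray(alpha, dtype=float)
    k = alpha.size
    p = np.empty(k)
    remaining = 1.0                            # length of stick not yet assigned
    for i in range(k - 1):
        tail = alpha[i + 1:].sum()             # alpha_{i+1} + ... + alpha_k
        frac = rng.beta(alpha[i], tail)        # fraction of the remaining stick
        p[i] = remaining * frac
        remaining *= 1.0 - frac
    p[k - 1] = remaining                       # last category takes what is left
    return p

rng = np.random.default_rng(2)
alpha = np.array([2.0, 3.0, 5.0])              # illustrative parameters
draws = np.array([dirichlet_stick_breaking(alpha, rng) for _ in range(20_000)])

print("stick-breaking mean:", draws.mean(axis=0))
print("theoretical mean:  ", alpha / alpha.sum())
```

Each draw sums to 1 by construction, since the final component is whatever remains of the stick.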
Latent Dirichlet Allocation (LDA)
LDA is the most influential application of Dirichlet in machine learning. It models each document as a mixture of K topics, where each topic is itself a distribution over the vocabulary:
For each topic k = 1, ..., K:
    Draw β_k ~ Dir(η)   (η ≈ 0.01, word-topic distribution)
For each document d:
    Draw θ_d ~ Dir(α)   (α = 50/K, document-topic mixture)
    For each word position:
        Draw a topic z ~ Categorical(θ_d)
        Draw a word w ~ Categorical(β_z)
The α = 50/K default (Griffiths and Steyvers 2004) keeps the total concentration at α₀ = 50 regardless of K. Once K exceeds 50, each component falls below 1, which favors sparse document-topic mixtures: the aim is a typical document that is "about" 2–5 topics, not all K. The η ≈ 0.01 default favors sparse topic-word distributions (each topic uses a focused vocabulary).
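The generative process translates almost line for line into NumPy. This is a toy sketch: the sizes K, V, D and the document length are arbitrary illustrative values, and real corpora have variable document lengths.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen only for illustration (not from the text)
K, V, D, doc_len = 20, 50, 3, 30
alpha = 50.0 / K     # symmetric document-topic concentration
eta = 0.01           # symmetric topic-word concentration

# One word distribution per topic, one topic mixture per document
beta = rng.dirichlet(np.full(V, eta), size=K)      # shape (K, V)
theta = rng.dirichlet(np.full(K, alpha), size=D)   # shape (D, K)

docs = []
for d in range(D):
    z = rng.choice(K, size=doc_len, p=theta[d])               # topic for each word position
    words = np.array([rng.choice(V, p=beta[t]) for t in z])   # word drawn from that topic
    docs.append(words)

for d, words in enumerate(docs):
    print(f"doc {d}: first word ids {words[:10]}")
```

Inference runs this process in reverse: given only `docs`, it estimates `theta` and `beta`.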
Inference recovers θ_d (the topic mixture for each document) and β_k (the word distribution for each topic) from observed word counts. The Dirichlet conjugacy makes the conditionals closed-form: in Gibbs sampling, you sample each word's topic z given everything else by a closed-form Dirichlet-multinomial probability. This is what makes LDA inference scale to millions of documents.
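For symmetric priors, the closed-form conditional used in collapsed Gibbs sampling is P(z = k | rest) ∝ (n_dk + α) · (n_kw + η) / (n_k + Vη), where the counts exclude the token being resampled. The sketch below computes it for one token; the function name, argument names, and toy counts are hypothetical, and a full sampler would also need the bookkeeping that removes and re-adds each token.

```python
import numpy as np

def topic_conditional(doc_topic_counts, topic_word_counts, topic_totals, word_id, alpha, eta):
    """Collapsed-Gibbs probabilities over topics for one token.

    All counts are assumed to already exclude the token being resampled.
    """
    V = topic_word_counts.shape[1]
    doc_part = doc_topic_counts + alpha                                       # n_dk + alpha
    word_part = (topic_word_counts[:, word_id] + eta) / (topic_totals + V * eta)
    weights = doc_part * word_part
    return weights / weights.sum()

# Toy state with K = 2 topics and V = 3 words (made-up numbers)
doc_topic_counts = np.array([3.0, 1.0])            # tokens in this document per topic
topic_word_counts = np.array([[5.0, 1.0, 0.0],
                              [0.0, 2.0, 4.0]])    # corpus-wide word counts per topic
topic_totals = topic_word_counts.sum(axis=1)

probs = topic_conditional(doc_topic_counts, topic_word_counts, topic_totals,
                          word_id=0, alpha=0.5, eta=0.1)
print("P(topic | everything else):", probs)
```

Word 0 has only ever been assigned to topic 0 in this toy state, so the conditional puts almost all its mass there; the small η keeps topic 1 possible.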
Where Dirichlet shows up
- Topic models (LDA, HDP, supervised LDA, dynamic topic models). Document-topic and topic-word distributions are Dirichlet.
- Bayesian categorical inference. Standard prior for unknown class probabilities — replacing the maximum-likelihood estimate with regularized posterior means.
- Bayesian language modeling. Dirichlet-smoothed n-grams, Pitman-Yor extensions for power-law vocabularies, hierarchical Pitman-Yor for character-level models.
- Population genetics. Modeling allele frequencies across populations; the Wright-Fisher model has Dirichlet stationary distribution under certain assumptions.
- Decision trees and random forests with categorical splits. Bayesian leaf probabilities use Dirichlet smoothing.
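- Reinforcement learning. Posterior sampling methods place Dirichlet posteriors on categorical quantities such as transition probabilities or categorical reward outcomes; with two outcomes this reduces to the Beta posteriors used in Thompson sampling for Bernoulli bandits.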
- Computational biology. Motif discovery, transcription factor binding site modeling, hidden Markov models for protein structure all use Dirichlet priors on emission and transition probabilities.
- Pólya urn schemes. The classical Pólya urn (drawing balls and replacing them with extras of the same color) is exactly the predictive distribution of a Dirichlet-multinomial model.
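The urn connection in the last item can be simulated directly. In this sketch the initial "ball weights" are α (they need not be integers), each draw picks a color with probability (αᵢ + nᵢ) / (α₀ + n), and the drawn color gains one extra ball. The function name and the sizes are illustrative assumptions.

```python
import numpy as np

def polya_urn_counts(alpha, n_draws, rng):
    """Run one Pólya urn started from weights alpha; return how often each color was drawn."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.zeros_like(alpha)
    for _ in range(n_draws):
        weights = alpha + counts                      # initial weights plus balls added so far
        color = rng.choice(alpha.size, p=weights / weights.sum())
        counts[color] += 1.0                          # put the ball back with one extra
    return counts

rng = np.random.default_rng(5)
alpha = np.array([1.0, 2.0, 3.0])                     # illustrative starting weights
n_draws, n_urns = 40, 1500

fractions = np.array([polya_urn_counts(alpha, n_draws, rng) / n_draws
                      for _ in range(n_urns)])

print("average color fraction across urns:", fractions.mean(axis=0))
print("Dirichlet mean alpha / alpha_0:    ", alpha / alpha.sum())
```

Individual urns end up with very different color fractions, but their average matches α / α₀, as the Dirichlet-multinomial predictive requires.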
Python — sampling and Bayesian update
import numpy as np
# Sample from Dir(α)
alpha = np.array([5.0, 5.0, 5.0])
sample = np.random.dirichlet(alpha)
print('one sample (sums to 1):', sample, sample.sum())
# Many samples — empirical mean should be α / α₀
samples = np.random.dirichlet(alpha, size=10000)
print('empirical mean:', samples.mean(axis=0))
print('theoretical mean:', alpha / alpha.sum())
# Bayesian update: prior Dir(α) + observed counts → posterior Dir(α + n)
prior = np.array([1.0, 1.0, 1.0])
counts = np.array([450, 320, 230])
posterior = prior + counts
print('posterior parameters:', posterior)
print('posterior mean:', posterior / posterior.sum())
# Manual sampling via Gamma trick
def dirichlet_sample(alpha):
    # Independent Gamma(alpha_i, 1) draws, normalized by their sum
    ys = np.random.gamma(alpha, 1.0)
    return ys / ys.sum()
# Verify
manual = np.array([dirichlet_sample(alpha) for _ in range(10000)])
print('manual sampler mean:', manual.mean(axis=0))
Common pitfalls
- Confusing α with mean. The mean is αᵢ/α₀, not αᵢ. Different α with same proportions but different sums give wildly different concentration. Dir(1, 1, 1) and Dir(100, 100, 100) have the same mean but vastly different variance.
- Negative covariance trap. Dirichlet enforces Cov(pᵢ, pⱼ) < 0. If your real probabilities have positive correlations (e.g., voter preferences clustering by ideology), Dirichlet is the wrong family — use logistic normal.
- Wrong concentration in LDA. Setting α too large gives uniform document-topic mixtures (every document touches every topic — uninformative). Setting α too small causes Gibbs sampler instability. Defaults α = 50/K, η = 0.01 are well-tested baselines.
- Ignoring the multivariate beta function. Forgetting the 1/B(α) normalization gives a non-density. Many implementations expose log-density to avoid underflow in high k.
- Adding α and counts of different scales. If your counts are millions and α is order 1, the prior is irrelevant. To get prior influence proportional to data, scale α up to match expected sample size.
- Using uniform Dir(1, ..., 1) for k = 2. This is Beta(1, 1), the uniform on [0, 1] — fine, but Jeffreys prior Beta(1/2, 1/2) often gives better calibration. The analog for k > 2 is Dir(1/2, ..., 1/2), the symmetric Jeffreys prior.
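The underflow pitfall is easy to reproduce. The sketch below evaluates a Dirichlet with k = 2000 categories at the centroid, once as a naive product and once in log space with `math.lgamma`; the values of k and α are arbitrary illustrative choices.

```python
import math

def log_dirichlet_pdf(p, alpha):
    """Log density of Dir(alpha) at an interior point p, computed entirely in log space."""
    log_B = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p)) - log_B

k = 2000
alpha = [2.0] * k
p = [1.0 / k] * k          # centroid of the simplex

# Naive kernel: a product of 2000 tiny factors underflows to exactly 0.0
naive_kernel = math.prod(x ** (a - 1.0) for x, a in zip(p, alpha))
log_density = log_dirichlet_pdf(p, alpha)

print("naive kernel:", naive_kernel)
print("log density: ", log_density)
```

The normalizer is no better behaved on the raw scale: B(α) here involves Γ(4000), which overflows a double, so working with log-densities is the practical route for large k.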
History
Peter Gustav Lejeune Dirichlet studied the integral that bears his name in 1839; it later became the Dirichlet distribution's normalizing constant. The distribution as a probability law was developed by Sir Ronald Fisher and others in the 1930s for population genetics (allele-frequency models). Norman L. Johnson's textbook on continuous distributions canonized the modern parameterization. The Dirichlet's central role in machine learning began with Pearl's belief networks in the late 1980s and Gelfand and Smith's 1990 work on Gibbs sampling, then exploded with Blei, Ng, and Jordan's 2003 paper introducing Latent Dirichlet Allocation. The Dirichlet process (an infinite-dimensional generalization developed by Ferguson 1973 and popularized by Antoniak 1974) is the foundation of Bayesian nonparametrics, where the number of components K can grow with the data.
Frequently asked questions
What does a Dirichlet distribution look like?
Dir(α₁, …, α_k) lives on the (k − 1)-simplex — the set of probability vectors (p₁, …, p_k) with each pᵢ ≥ 0 and Σ pᵢ = 1. For k = 3, the simplex is a triangle and the Dirichlet density colors the triangle. Behavior depends on the concentration parameter α₀ = Σ αᵢ: α = (1, 1, 1) gives the uniform; α = (5, 5, 5) concentrates near the centroid; α = (0.1, 0.1, 0.1) puts almost all mass near corners. Asymmetric α like (3, 1, 1) shifts the mode toward the first category.
Why is Dirichlet the conjugate prior for the multinomial?
Multinomial likelihood for counts (n₁, …, n_k) given probabilities (p₁, …, p_k) is L(p) ∝ ∏ᵢ pᵢ^(nᵢ). Multiply by the Dirichlet prior density ∏ᵢ pᵢ^(αᵢ − 1): the product is ∏ᵢ pᵢ^(αᵢ + nᵢ − 1), which is the kernel of a Dir(α + n) density. So prior and posterior share the same algebraic form, with parameters updated by simple addition. The posterior mean is (αᵢ + nᵢ)/(α₀ + n), so a Dir(α) prior carries the weight of α₀ pseudo-observations: that is how many fresh observations the prior is worth.
How is Dirichlet used in Latent Dirichlet Allocation (LDA)?
LDA models each document as a mixture of K topics. The prior on document-topic proportions θ_d is Dir(α) with α typically set to 50/K — small enough to encourage sparsity, large enough to avoid degenerate solutions. The prior on topic-word distributions β_k is Dir(η) with η ≈ 0.01 — favoring sparse vocabulary distributions where each topic uses a focused set of words. Inference (Gibbs sampling or variational Bayes) recovers θ and β from observed word counts. The Dirichlet conjugacy makes the conditionals closed-form, enabling efficient MCMC.
What's the difference between Beta and Dirichlet?
Dirichlet is the multivariate generalization. Beta(α, β) is a distribution over a single proportion p ∈ [0, 1] — equivalent to Dirichlet on k = 2 categories. Dirichlet(α₁, …, α_k) extends to k probabilities summing to 1. Both are conjugate priors for their respective likelihoods (Beta for Bernoulli/binomial, Dirichlet for categorical/multinomial). Both have the same pseudo-count intuition. The marginal of any single Dirichlet component pᵢ is a Beta(αᵢ, α₀ − αᵢ), so questions about one category at a time reduce to Beta inference.
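The marginal claim can be checked by comparing quantiles of one Dirichlet component against direct Beta(αᵢ, α₀ − αᵢ) draws. This is a Monte Carlo sanity check with illustrative parameters, not a proof.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = np.array([2.0, 3.0, 5.0])   # illustrative parameters
a0 = alpha.sum()
i = 0                               # component to inspect

dir_component = rng.dirichlet(alpha, size=100_000)[:, i]
beta_draws = rng.beta(alpha[i], a0 - alpha[i], size=100_000)

qs = [0.05, 0.25, 0.5, 0.75, 0.95]
dir_q = np.quantile(dir_component, qs)
beta_q = np.quantile(beta_draws, qs)

print("quantiles of p_1 under Dir(2, 3, 5):", np.round(dir_q, 3))
print("quantiles of Beta(2, 8):            ", np.round(beta_q, 3))
```

The two rows agree to within sampling noise, so any question about a single category can be answered with Beta tools alone.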
What does the concentration parameter α₀ control?
α₀ = Σ αᵢ is the total concentration. It controls how tightly the Dirichlet clusters around its mean. When α₀ is small, and in particular when every αᵢ < 1, draws are sparse: most pᵢ sit near 0 and one or a few carry nearly all the mass. Large α₀ makes draws concentrate around the mean αᵢ/α₀. The variance of pᵢ is αᵢ(α₀ − αᵢ)/(α₀²(α₀ + 1)), so with the proportions αᵢ/α₀ held fixed it shrinks like 1/(α₀ + 1) and draws approach the mean as α₀ → ∞. In LDA, tuning the Dirichlet hyperparameters mostly comes down to choosing this concentration.
How do you sample from a Dirichlet?
The standard method uses independent Gamma draws. For each i, sample yᵢ ~ Gamma(αᵢ, 1). Then set pᵢ = yᵢ / (Σⱼ yⱼ). The resulting (p₁, …, p_k) is exactly Dir(α₁, …, α_k). This works because independent Gamma variables that share a rate parameter, once divided by their sum, are jointly Dirichlet; the sum itself is Gamma(α₀, 1) and independent of the normalized vector.
When does Dirichlet break down?
Dirichlet imposes a negative correlation between components — if pᵢ goes up, the others go down to maintain the sum-to-one constraint. This forces Cov(pᵢ, pⱼ) < 0 for i ≠ j. If your real-world probabilities have positive correlations, Dirichlet is wrong. Use the logistic normal (multivariate Gaussian on logit-transformed proportions) or Pólya tree distributions. Dirichlet also has only k parameters — it can't capture richer covariance structures.
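To make the contrast concrete, the sketch below draws from a logistic normal whose two free logits are strongly positively correlated and compares the resulting covariance with a Dirichlet's. The function name, the covariance matrix, and the choice of the last category as the reference (logit fixed at 0) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def logistic_normal_sample(mu, cov, size, rng):
    """Gaussian logits for the first k-1 categories, reference logit 0 for the last, then softmax."""
    z = rng.multivariate_normal(mu, cov, size=size)       # shape (size, k-1)
    z = np.hstack([z, np.zeros((size, 1))])               # append the reference category
    z = z - z.max(axis=1, keepdims=True)                  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

mu = np.zeros(2)
cov = np.array([[2.0, 1.9],
                [1.9, 2.0]])                              # logits 1 and 2 move together

ln_draws = logistic_normal_sample(mu, cov, 100_000, rng)
dir_draws = rng.dirichlet([2.0, 2.0, 2.0], size=100_000)

cov_ln = np.cov(ln_draws[:, 0], ln_draws[:, 1])[0, 1]
cov_dir = np.cov(dir_draws[:, 0], dir_draws[:, 1])[0, 1]

print("logistic normal Cov(p1, p2):", cov_ln)
print("Dirichlet Cov(p1, p2):      ", cov_dir)
```

The logistic normal produces a positive covariance between the first two categories, which no choice of α can reproduce in a Dirichlet.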