Dirichlet Distribution

Multivariate Beta on the simplex — distribution over probability vectors

Dir(α₁, …, α_k) is a distribution over probability vectors summing to 1, and it is the conjugate prior for the multinomial. It underpins topic models such as LDA and Bayesian inference for categorical data.

Density: f(p) = ∏ᵢ pᵢ^(αᵢ−1) / B(α)
Support: the (k−1)-simplex, pᵢ ≥ 0, Σpᵢ = 1
Mean: αᵢ / α₀ where α₀ = Σαⱼ
Conjugacy: Dir(α) + counts n → Dir(α + n)
LDA defaults: α = 50/K (documents), η ≈ 0.01 (topics)
First studied: Dirichlet 1839; Blei-Ng-Jordan LDA 2003

The simplex and the density

The (k − 1)-simplex Δ^(k−1) is the set of probability vectors (p₁, …, p_k) satisfying pᵢ ≥ 0 and Σ pᵢ = 1. It's the natural home for categorical probabilities: with k = 3 it's a triangle in the plane (sometimes drawn as a barycentric coordinate diagram); with k = 4 it's a tetrahedron; for general k it's a (k − 1)-dimensional surface in ℝᵏ.

The Dirichlet distribution Dir(α₁, …, α_k) puts a probability density on the simplex:

f(p₁, …, p_k; α₁, …, α_k) = (1 / B(α)) · ∏ᵢ pᵢ^(αᵢ − 1)

B(α) = ∏ Γ(αᵢ) / Γ(Σᵢ αᵢ)        (multivariate beta function)
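The density above can be evaluated directly with log-gamma functions, which avoids overflow in B(α). The following is a minimal sketch using only the standard library; the function name `dirichlet_log_pdf` and its input checks are illustrative choices, not part of any particular library.

```python
import math

def dirichlet_log_pdf(p, alpha):
    """Log-density of Dir(alpha) at a point p strictly inside the simplex."""
    if len(p) != len(alpha):
        raise ValueError("p and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("every alpha_i must be positive")
    if any(x <= 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must have positive entries that sum to 1")
    # log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(alpha_0)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    # log f(p) = sum_i (alpha_i - 1) log p_i - log B(alpha)
    return sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p)) - log_beta

# Dir(1, 1, 1) is uniform on the triangle, so the density is the same everywhere.
uniform_density = math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
# Dir(2, 2, 2) evaluated at the centroid.
centroid_density = math.exp(dirichlet_log_pdf([1 / 3, 1 / 3, 1 / 3], [2.0, 2.0, 2.0]))
print(uniform_density, centroid_density)
```

For Dir(1, 1, 1), B(α) = Γ(1)³ / Γ(3) = 1/2, so the constant density is 2 with respect to the coordinates (p₁, p₂).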

The parameters αᵢ > 0 are called concentration parameters. Their sum α₀ = Σᵢ αᵢ is the total concentration. Properties:

Mean: E[pᵢ] = αᵢ / α₀
Variance: Var(pᵢ) = αᵢ(α₀ − αᵢ) / (α₀² (α₀ + 1))
Covariance: Cov(pᵢ, pⱼ) = −αᵢ αⱼ / (α₀² (α₀ + 1)) — always negative
Mode (when all αᵢ > 1): (αᵢ − 1) / (α₀ − k)

The negative covariance is structural: if one pᵢ increases, the others must decrease to keep the sum at 1. The Dirichlet has no free covariance parameters; its whole covariance structure is fixed by α and the sum-to-one constraint.
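The moment formulas listed above translate into a few lines of code. This sketch (the helper name `dirichlet_moments` is an illustrative choice) also shows the structural constraint numerically: each row of the covariance matrix sums to zero, because the components always add up to 1.

```python
def dirichlet_moments(alpha):
    """Return (mean, covariance matrix, mode or None) for Dir(alpha)."""
    k = len(alpha)
    a0 = sum(alpha)
    denom = a0 * a0 * (a0 + 1.0)
    mean = [a / a0 for a in alpha]
    cov = [
        [
            alpha[i] * (a0 - alpha[i]) / denom if i == j
            else -alpha[i] * alpha[j] / denom
            for j in range(k)
        ]
        for i in range(k)
    ]
    # The interior-mode formula only applies when every alpha_i > 1.
    mode = [(a - 1.0) / (a0 - k) for a in alpha] if all(a > 1 for a in alpha) else None
    return mean, cov, mode

mean, cov, mode = dirichlet_moments([5.0, 5.0, 5.0])
print("mean:", mean)
print("variance of p1:", cov[0][0])
print("covariance of p1 and p2:", cov[0][1])
print("mode:", mode)
print("row sums of covariance:", [sum(row) for row in cov])
```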

Three regimes — concentration shapes

The behavior of Dir(α) on a 3-category simplex depends on α₀ and the relative αᵢ:

α	α₀	Behavior	Use case
(1, 1, 1)	3	Uniform on triangle — every probability vector equally likely	Non-informative prior
(0.5, 0.5, 0.5)	1.5	Mass piles at corners — sparse draws, one category wins	Sparse topic models
(5, 5, 5)	15	Bell around centroid (1/3, 1/3, 1/3)	Concentrated near uniform
(50, 50, 50)	150	Sharp Gaussian-like peak at centroid	Strong prior near uniform
(5, 1, 1)	7	Asymmetric: density peaks at the corner (1, 0, 0), so the first category dominates	Strong prior on category 1
(0.1, 0.1, 0.1)	0.3	Extreme corner-piling — draws nearly always at one corner	Heavily sparse priors

Small α₀ → sparse → corners. Large α₀ → dense → centroid. Asymmetric α → mode shifts toward the heaviest component.
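One rough way to see these regimes without plotting is to track the average size of the largest component in each draw: values near 1 mean draws sit near a corner, values near 1/3 mean draws sit near the centroid. The sketch below uses the standard library's Gamma sampler; the helper names and the choice of "mean largest component" as a summary are illustrative assumptions, not a standard diagnostic.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dir(alpha) vector by normalizing independent Gamma draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(ys)
    return [y / total for y in ys]

def mean_largest_component(alpha, n_draws=5000, seed=0):
    """Average of max(p) over n_draws samples from Dir(alpha)."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(n_draws)) / n_draws

for alpha in [(0.1, 0.1, 0.1), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0), (50.0, 50.0, 50.0)]:
    print(alpha, round(mean_largest_component(alpha), 3))
```

The printed values should fall as the concentration grows, from close to 1 for (0.1, 0.1, 0.1) toward 1/3 for (50, 50, 50).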

Conjugacy with multinomial

The categorical/multinomial likelihood for observed counts (n₁, …, n_k) with total n = Σnᵢ given probabilities p is:

L(p | n) = (n! / ∏ nᵢ!) · ∏ pᵢ^(nᵢ)

Multiply by the Dirichlet prior:

f(p | α) · L(p | n) ∝ ∏ pᵢ^(αᵢ − 1) · ∏ pᵢ^(nᵢ)
                    = ∏ pᵢ^(αᵢ + nᵢ − 1)

posterior: p | n ~ Dir(α₁ + n₁, …, α_k + n_k)

The Bayes update is pure addition: prior + observed counts. This is what makes Dirichlet/multinomial inference computationally tractable, especially in topic models and language models with millions of words.
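Because the update is just addition, processing observations one at a time gives the same posterior as processing all counts at once. A minimal sketch, with a hypothetical helper `posterior_alpha` and made-up category indices:

```python
def posterior_alpha(prior_alpha, counts):
    """Dirichlet-multinomial update: add observed counts to the prior parameters."""
    if len(prior_alpha) != len(counts):
        raise ValueError("prior and counts must have the same length")
    return [a + n for a, n in zip(prior_alpha, counts)]

prior = [1.0, 1.0, 1.0]
observations = [0, 2, 0, 1, 0]  # hypothetical category indices, one per observation

# Sequential updating: one observation at a time.
sequential = list(prior)
for category in observations:
    one_hot = [1 if i == category else 0 for i in range(len(sequential))]
    sequential = posterior_alpha(sequential, one_hot)

# Batch updating: all counts in one step.
batch = posterior_alpha(prior, [3, 1, 1])

print(sequential, batch)
```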

Worked example — word frequencies

Suppose you have a small corpus and want to model the relative frequency of three words: "the," "and," "of." Begin with a non-informative prior Dir(1, 1, 1).

Observe 1000 word tokens with counts (450, 320, 230). Posterior is Dir(451, 321, 231). Posterior mean is (451/1003, 321/1003, 231/1003) = (0.450, 0.320, 0.230) — essentially the empirical frequencies. With α₀ = 3 (weak prior), the posterior is dominated by data after 1000 observations.

Now compare with a strong prior Dir(100, 100, 100), which acts like 100 pseudo-observations of each word (α₀ = 300 in total). Posterior becomes Dir(550, 420, 330). Posterior mean (550/1300, 420/1300, 330/1300) = (0.423, 0.323, 0.254). The strong prior pulls the posterior partway toward uniform (1/3 each). After 1000 fresh observations, the prior still has measurable influence.

The "rule of α₀": prior worth = α₀ pseudo-observations. To dominate it, collect ≫ α₀ real data points.
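The rule can be checked numerically: the posterior mean is a weighted average of the prior mean and the empirical frequencies, with weight α₀ / (α₀ + n) on the prior. The sketch below reuses the counts from the worked example; the function names are illustrative.

```python
def posterior_mean(prior_alpha, counts):
    """Mean of Dir(prior_alpha + counts)."""
    total = sum(prior_alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(prior_alpha, counts)]

def shrinkage_form(prior_alpha, counts):
    """Same posterior mean, written as a prior/data weighted average."""
    a0, n = sum(prior_alpha), sum(counts)
    w = a0 / (a0 + n)  # weight on the prior mean
    return [w * (a / a0) + (1.0 - w) * (c / n) for a, c in zip(prior_alpha, counts)]

counts = [450, 320, 230]
weak = posterior_mean([1.0, 1.0, 1.0], counts)
strong = posterior_mean([100.0, 100.0, 100.0], counts)

print("weak prior:  ", [round(x, 3) for x in weak], "prior weight", round(3 / 1003, 3))
print("strong prior:", [round(x, 3) for x in strong], "prior weight", round(300 / 1300, 3))
```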

Sampling — Gamma trick

The standard way to sample from Dir(α₁, …, α_k):

1. Sample y_i ~ Gamma(α_i, 1) independently for i = 1, ..., k.
2. Return p_i = y_i / Σ_j y_j.

The resulting p is exactly Dir(α). This works because: if Y₁ ~ Gamma(α₁, 1) and Y₂ ~ Gamma(α₂, 1) are independent, then Y₁ / (Y₁ + Y₂) ~ Beta(α₁, α₂) — and the multivariate generalization gives Dirichlet.

An alternative: stick-breaking. Sample β₁ ~ Beta(α₁, α₂ + … + α_k), then p₁ = β₁. For the next, β₂ ~ Beta(α₂, α₃ + … + α_k), and p₂ = (1 − β₁)·β₂. Continue through β_(k−1); the last component p_k takes whatever mass remains. This is the construction underlying the Dirichlet process (the infinite-dimensional generalization).
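The stick-breaking recipe can be sketched with the standard library's Beta sampler. The function name `stick_breaking_dirichlet` is an illustrative choice; the check at the end only compares the empirical mean with αᵢ / α₀, so it is a sanity check rather than a proof that the draws are Dirichlet.

```python
import random

def stick_breaking_dirichlet(alpha, rng):
    """Draw one Dir(alpha) vector by breaking a unit stick k - 1 times."""
    k = len(alpha)
    remaining = 1.0  # length of stick not yet assigned
    p = []
    for i in range(k - 1):
        tail = sum(alpha[i + 1:])  # alpha_{i+1} + ... + alpha_k
        fraction = rng.betavariate(alpha[i], tail)
        p.append(remaining * fraction)
        remaining *= 1.0 - fraction
    p.append(remaining)  # the last component takes what is left
    return p

alpha = [2.0, 3.0, 5.0]
rng = random.Random(0)
draws = [stick_breaking_dirichlet(alpha, rng) for _ in range(5000)]
empirical_mean = [sum(d[i] for d in draws) / len(draws) for i in range(len(alpha))]
print("empirical mean:  ", [round(m, 3) for m in empirical_mean])
print("theoretical mean:", [a / sum(alpha) for a in alpha])
```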

Latent Dirichlet Allocation (LDA)

LDA is the most influential application of Dirichlet in machine learning. It models each document as a mixture of K topics, where each topic is itself a distribution over the vocabulary:

For each topic k = 1, ..., K:
  Draw β_k ~ Dir(η)          (η ≈ 0.01, topic-word distribution)

For each document d:
  Draw θ_d ~ Dir(α)          (α = 50/K, document-topic mixture)
  For each word position:
    Draw a topic z ~ Categorical(θ_d)
    Draw a word w ~ Categorical(β_z)

The α = 50/K default (Griffiths and Steyvers 2004) drops below 1 once K exceeds 50, which encourages sparse document-topic mixtures (a typical document is "about" 2–5 topics, not all K). The η ≈ 0.01 default favors sparse topic-word distributions (each topic uses a focused vocabulary).
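The generative process can be written out as a small simulator. This is a toy sketch under stated assumptions: symmetric priors (a single scalar α and η), fixed document length, integer word ids, and hypothetical function names. It generates synthetic data only and does no inference.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dir(alpha) vector by normalizing independent Gamma draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(ys)
    return [y / total for y in ys]

def sample_categorical(probs, rng):
    """Draw one index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(n_docs, doc_length, n_topics, vocab_size, alpha, eta, seed=0):
    """Simulate the LDA generative process with symmetric Dirichlet priors."""
    rng = random.Random(seed)
    # One word distribution per topic: beta_k ~ Dir(eta, ..., eta)
    topics = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(n_topics)]
    docs, mixtures = [], []
    for _ in range(n_docs):
        # One topic mixture per document: theta_d ~ Dir(alpha, ..., alpha)
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)                # topic for this position
            words.append(sample_categorical(topics[z], rng))  # word from that topic
        docs.append(words)
        mixtures.append(theta)
    return docs, mixtures, topics

K = 100
docs, mixtures, topics = generate_corpus(
    n_docs=5, doc_length=50, n_topics=K, vocab_size=200, alpha=50 / K, eta=0.01
)
print("topics carrying more than 5% of document 0:", sum(t > 0.05 for t in mixtures[0]))
```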

Inference recovers θ_d (the topic mixture for each document) and β_k (the word distribution for each topic) from observed word counts. The Dirichlet conjugacy makes the conditionals closed-form: in Gibbs sampling, you sample each word's topic z given everything else by a closed-form Dirichlet-multinomial probability. This is what makes LDA inference scale to millions of documents.
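For the collapsed Gibbs sampler with symmetric priors, the closed-form conditional for one token's topic is proportional to (n_dk + α) · (n_kw + η) / (n_k + V·η), where the counts exclude the token being resampled. The sketch below computes just that one step; the function name, argument layout, and the toy counts are hypothetical, and a full sampler would also need the bookkeeping that decrements and increments the counts.

```python
def topic_conditional(doc_topic_counts, topic_word_counts, topic_totals,
                      word, alpha, eta, vocab_size):
    """Normalized P(z = k | all other assignments) for one token.

    All counts are assumed to already exclude the token being resampled.
    """
    n_topics = len(doc_topic_counts)
    weights = [
        (doc_topic_counts[k] + alpha)
        * (topic_word_counts[k][word] + eta)
        / (topic_totals[k] + vocab_size * eta)
        for k in range(n_topics)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Toy state: 2 topics, 3 vocabulary words (made-up counts).
probs = topic_conditional(
    doc_topic_counts=[3, 1],                    # tokens in this document per topic
    topic_word_counts=[[5, 1, 0], [0, 2, 4]],   # corpus-wide word counts per topic
    topic_totals=[6, 6],                        # corpus-wide tokens per topic
    word=0,
    alpha=0.5,
    eta=0.01,
    vocab_size=3,
)
print(probs)
```

Word 0 has only been assigned to topic 0 so far, and this document leans toward topic 0, so nearly all of the probability goes there.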

Where Dirichlet shows up

Topic models (LDA, HDP, supervised LDA, dynamic topic models). Document-topic and topic-word distributions are Dirichlet.
Bayesian categorical inference. Standard prior for unknown class probabilities — replacing the maximum-likelihood estimate with regularized posterior means.
Bayesian language modeling. Dirichlet-smoothed n-grams, Pitman-Yor extensions for power-law vocabularies, hierarchical Pitman-Yor for character-level models.
Population genetics. Modeling allele frequencies across populations; the Wright-Fisher model has Dirichlet stationary distribution under certain assumptions.
Decision trees and random forests with categorical splits. Bayesian leaf probabilities use Dirichlet smoothing.
Reinforcement learning. Posterior-sampling methods (Thompson sampling for bandits with categorical outcomes, posterior sampling for tabular MDPs) keep Dirichlet posteriors over outcome or transition probabilities.
Computational biology. Motif discovery, transcription factor binding site modeling, hidden Markov models for protein structure all use Dirichlet priors on emission and transition probabilities.
Polya urn schemes. The classical Pólya urn (drawing balls and replacing them with extras of the same color) is exactly the predictive distribution of a Dirichlet-multinomial model.
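The Pólya urn connection from the list above can be simulated directly. At each step the urn draws color i with probability (αᵢ + nᵢ) / (α₀ + n), which is the Dirichlet-multinomial predictive, and then adds one ball of that color. This is a sketch with hypothetical names; averaging the color fractions over many independent urns should recover the prior mean αᵢ / α₀.

```python
import random

def polya_urn_counts(alpha, n_draws, rng):
    """Run one Polya urn started with weights alpha; return draws per color."""
    weights = list(alpha)
    counts = [0] * len(alpha)
    for _ in range(n_draws):
        # P(color i) = weights[i] / sum(weights) = (alpha_i + n_i) / (alpha_0 + n)
        color = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        counts[color] += 1
        weights[color] += 1.0  # return the ball plus one extra of the same color
    return counts

alpha = [1.0, 1.0, 1.0]
rng = random.Random(0)
n_urns, n_draws = 1000, 100
fractions = [
    [c / n_draws for c in polya_urn_counts(alpha, n_draws, rng)]
    for _ in range(n_urns)
]
mean_fraction = [sum(f[i] for f in fractions) / n_urns for i in range(len(alpha))]
print("mean fraction per color:", [round(m, 3) for m in mean_fraction])
```

Individual urns end up with very different color fractions, which is the urn-level counterpart of drawing a different p from the Dirichlet each time.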

Python — sampling and Bayesian update

import numpy as np

# Sample from Dir(α)
alpha = np.array([5.0, 5.0, 5.0])
sample = np.random.dirichlet(alpha)
print('one sample (sums to 1):', sample, sample.sum())

# Many samples — empirical mean should be α / α₀
samples = np.random.dirichlet(alpha, size=10000)
print('empirical mean:', samples.mean(axis=0))
print('theoretical mean:', alpha / alpha.sum())

# Bayesian update: prior Dir(α) + observed counts → posterior Dir(α + n)
prior = np.array([1.0, 1.0, 1.0])
counts = np.array([450, 320, 230])
posterior = prior + counts
print('posterior parameters:', posterior)
print('posterior mean:', posterior / posterior.sum())

# Manual sampling via Gamma trick
def dirichlet_sample(alpha):
    ys = np.random.gamma(alpha, 1.0)
    return ys / ys.sum()

# Verify
manual = np.array([dirichlet_sample(alpha) for _ in range(10000)])
print('manual sampler mean:', manual.mean(axis=0))

Common pitfalls

Confusing α with mean. The mean is αᵢ/α₀, not αᵢ. Different α with same proportions but different sums give wildly different concentration. Dir(1, 1, 1) and Dir(100, 100, 100) have the same mean but vastly different variance.
Negative covariance trap. Dirichlet enforces Cov(pᵢ, pⱼ) < 0. If your real probabilities have positive correlations (e.g., voter preferences clustering by ideology), Dirichlet is the wrong family — use logistic normal.
Wrong concentration in LDA. Setting α too large gives uniform document-topic mixtures (every document touches every topic — uninformative). Setting α too small causes Gibbs sampler instability. Defaults α = 50/K, η = 0.01 are well-tested baselines.
Ignoring the multivariate beta function. Forgetting the 1/B(α) normalization gives a non-density. Many implementations expose log-density to avoid underflow in high k.
Adding α and counts of different scales. If your counts are millions and α is order 1, the prior is irrelevant. To get prior influence proportional to data, scale α up to match expected sample size.
Using uniform Dir(1, ..., 1) for k = 2. This is Beta(1, 1), the uniform on [0, 1] — fine, but Jeffreys prior Beta(1/2, 1/2) often gives better calibration. The analog for k > 2 is Dir(1/2, ..., 1/2), the symmetric Jeffreys prior.

History

Peter Gustav Lejeune Dirichlet studied the integral that bears his name in 1839; it later became the Dirichlet distribution's normalizing constant. The distribution as a probability law was developed by Sir Ronald Fisher and others in the 1930s for population genetics (allele-frequency models). Norman L. Johnson's textbook on continuous distributions canonized the modern parameterization. The Dirichlet's central role in machine learning began with Pearl's belief networks in the late 1980s and Gelfand and Smith's 1990 paper on Gibbs sampling, then exploded with Blei, Ng, and Jordan's 2003 paper introducing Latent Dirichlet Allocation. The Dirichlet process (an infinite-dimensional generalization developed by Ferguson 1973 and popularized by Antoniak 1974) is the foundation of Bayesian nonparametrics, where K is allowed to grow with the data.

Frequently asked questions

What does a Dirichlet distribution look like?

Dir(α₁, …, α_k) lives on the (k − 1)-simplex — the set of probability vectors (p₁, …, p_k) with each pᵢ ≥ 0 and Σ pᵢ = 1. For k = 3, the simplex is a triangle and the Dirichlet density colors the triangle. Behavior depends on the concentration parameter α₀ = Σ αᵢ: α = (1, 1, 1) gives the uniform; α = (5, 5, 5) concentrates near the centroid; α = (0.1, 0.1, 0.1) puts almost all mass near corners. Asymmetric α like (3, 1, 1) shifts the mode toward the first category.

Why is Dirichlet the conjugate prior for the multinomial?

Multinomial likelihood for counts (n₁, …, n_k) given probabilities (p₁, …, p_k) is L(p) ∝ ∏ᵢ pᵢ^(nᵢ). Multiply by the Dirichlet prior density ∏ᵢ pᵢ^(αᵢ − 1): the product is ∏ᵢ pᵢ^(αᵢ + nᵢ − 1), which is the kernel of a Dir(α + n) density. So prior and posterior share the same algebraic form, with parameters updated by simple addition. The 'effective sample size' of a Dir(α) prior is α₀: the posterior mean weights the prior mean by α₀ / (α₀ + n), so the prior counts as α₀ pseudo-observations.

How is Dirichlet used in Latent Dirichlet Allocation (LDA)?

LDA models each document as a mixture of K topics. The prior on document-topic proportions θ_d is Dir(α) with α typically set to 50/K — small enough to encourage sparsity, large enough to avoid degenerate solutions. The prior on topic-word distributions β_k is Dir(η) with η ≈ 0.01 — favoring sparse vocabulary distributions where each topic uses a focused set of words. Inference (Gibbs sampling or variational Bayes) recovers θ and β from observed word counts. The Dirichlet conjugacy makes the conditionals closed-form, enabling efficient MCMC.

What's the difference between Beta and Dirichlet?

Dirichlet is the multivariate generalization. Beta(α, β) is a distribution over a single proportion p ∈ [0, 1] — equivalent to Dirichlet on k = 2 categories. Dirichlet(α₁, …, α_k) extends to k probabilities summing to 1. Both are conjugate priors for their respective likelihoods (Beta for Bernoulli/binomial, Dirichlet for categorical/multinomial). Both have the same pseudo-count intuition. The marginal of any single Dirichlet component pᵢ is a Beta(αᵢ, α₀ − αᵢ), so questions about one category at a time reduce to Beta inference.
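The Beta-marginal claim can be sanity-checked by simulation: sample from a Dirichlet, keep one component, and compare its mean and variance with those of Beta(αᵢ, α₀ − αᵢ). This is a sketch with illustrative helper names, and matching two moments is only a partial check, not a full distributional test.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dir(alpha) vector by normalizing independent Gamma draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(ys)
    return [y / total for y in ys]

def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

alpha = [2.0, 3.0, 5.0]
rng = random.Random(0)
first_component = [sample_dirichlet(alpha, rng)[0] for _ in range(20000)]

# Moments of Beta(a, b) with a = alpha_1 and b = alpha_0 - alpha_1
a, b = alpha[0], sum(alpha) - alpha[0]
beta_mean = a / (a + b)
beta_var = a * b / ((a + b) ** 2 * (a + b + 1.0))

sample_mean, sample_var = mean_and_variance(first_component)
print("sample:", round(sample_mean, 4), round(sample_var, 4))
print("Beta:  ", round(beta_mean, 4), round(beta_var, 4))
```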

What does the concentration parameter α₀ control?

α₀ = Σ αᵢ is the total concentration. It controls how spread out the Dirichlet is around its mean. Small concentrations (every αᵢ < 1) make draws sparse: most pᵢ sit near 0, with one or a few near 1. Large α₀ makes draws concentrated around the mean αᵢ/α₀. The variance of pᵢ is αᵢ(α₀ − αᵢ)/(α₀²(α₀ + 1)), so as α₀ → ∞ with the proportions αᵢ/α₀ held fixed, the variance vanishes and draws approach the mean deterministically. Hyperparameter tuning for LDA largely comes down to choosing these concentrations.

How do you sample from a Dirichlet?

The standard method uses independent Gamma draws. For each i, sample yᵢ ~ Gamma(αᵢ, 1). Then set pᵢ = yᵢ / (Σⱼ yⱼ). The resulting (p₁, …, p_k) is exactly Dir(α₁, …, α_k). This works because independent Gamma variables with a common rate, once divided by their sum, are Dirichlet distributed, and the normalized vector is independent of the sum.

When does Dirichlet break down?

Dirichlet imposes a negative correlation between components — if pᵢ goes up, the others go down to maintain the sum-to-one constraint. This forces Cov(pᵢ, pⱼ) < 0 for i ≠ j. If your real-world probabilities have positive correlations, Dirichlet is wrong. Use the logistic normal (multivariate Gaussian on logit-transformed proportions) or Pólya tree distributions. Dirichlet also has only k parameters — it can't capture richer covariance structures.
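To make the contrast concrete, the sketch below draws from a toy logistic normal in which the first two logits share a common Gaussian factor, then compares the sample covariance of p₁ and p₂ with that of Dirichlet draws. The construction (one shared factor, third logit fixed at 0) and the standard deviations are illustrative assumptions, not a recommended model.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dir(alpha) vector by normalizing independent Gamma draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(ys)
    return [y / total for y in ys]

def sample_logistic_normal(rng, shared_sd=1.5, noise_sd=0.3):
    """Softmax of Gaussian logits; the first two logits share a common factor."""
    shared = rng.gauss(0.0, shared_sd)
    logits = [shared + rng.gauss(0.0, noise_sd),
              shared + rng.gauss(0.0, noise_sd),
              0.0]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def covariance(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
ln_draws = [sample_logistic_normal(rng) for _ in range(5000)]
dir_draws = [sample_dirichlet([5.0, 5.0, 5.0], rng) for _ in range(5000)]

ln_cov = covariance([d[0] for d in ln_draws], [d[1] for d in ln_draws])
dir_cov = covariance([d[0] for d in dir_draws], [d[1] for d in dir_draws])
print("logistic normal Cov(p1, p2):", round(ln_cov, 4))
print("Dirichlet Cov(p1, p2):      ", round(dir_cov, 4))
```

The shared factor makes p₁ and p₂ rise and fall together at the expense of p₃, which no choice of α can reproduce in a Dirichlet.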