Dirichlet Distribution

Multivariate Beta on the simplex — distribution over probability vectors

Dir(α₁, …, α_k) is a distribution over probability vectors summing to 1, and it is the conjugate prior for the multinomial. It powers topic models (LDA) and Bayesian categorical inference.

  • Density: f(p) = ∏ᵢ pᵢ^(αᵢ−1) / B(α)
  • Support: the (k−1)-simplex, pᵢ ≥ 0, Σpᵢ = 1
  • Mean: αᵢ / α₀ where α₀ = Σαⱼ
  • Conjugacy: Dir(α) + counts n → Dir(α + n)
  • LDA defaults: α = 50/K (documents), η ≈ 0.01 (topics)
  • First studied: Dirichlet 1839; Blei-Ng-Jordan LDA 2003

The simplex and the density

The (k − 1)-simplex Δ^(k−1) is the set of probability vectors (p₁, …, p_k) satisfying pᵢ ≥ 0 and Σ pᵢ = 1. It's the natural home for categorical probabilities: with k = 3 it's a triangle in the plane (sometimes drawn as a barycentric coordinate diagram); with k = 4 it's a tetrahedron; for general k it's a (k − 1)-dimensional surface in ℝᵏ.

The Dirichlet distribution Dir(α₁, …, α_k) puts a probability density on the simplex:

f(p₁, …, p_k; α₁, …, α_k) = (1 / B(α)) · ∏ᵢ pᵢ^(αᵢ − 1)

B(α) = ∏ Γ(αᵢ) / Γ(Σᵢ αᵢ)        (multivariate beta function)
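
As a minimal sketch of evaluating this density numerically, the function below computes log f(p; α) with the standard library's `math.lgamma`. It works in log space because the Gamma terms overflow quickly for large concentrations. The name `dirichlet_log_pdf` is illustrative and not part of any library.

```python
import math

def dirichlet_log_pdf(p, alpha):
    """Log-density of Dir(alpha) at a point p on the simplex.

    Assumes every alpha_i > 0 and every p_i > 0 with sum(p) == 1.
    """
    if len(p) != len(alpha):
        raise ValueError("p and alpha must have the same length")
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must sum to 1")
    # log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    # log of the kernel prod_i p_i^(alpha_i - 1)
    log_kernel = sum((a - 1.0) * math.log(pi) for a, pi in zip(alpha, p))
    return log_kernel - log_beta

# Dir(1, 1, 1) is uniform on the triangle, so its density is the constant Gamma(3) = 2
print(math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])))
```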

The parameters αᵢ > 0 are called concentration parameters. Their sum α₀ = Σᵢ αᵢ is the total concentration. Properties:

  • Mean: E[pᵢ] = αᵢ / α₀
  • Variance: Var(pᵢ) = αᵢ(α₀ − αᵢ) / (α₀² (α₀ + 1))
  • Covariance: Cov(pᵢ, pⱼ) = −αᵢ αⱼ / (α₀² (α₀ + 1)) for i ≠ j, always negative
  • Mode (when all αᵢ > 1): pᵢ = (αᵢ − 1) / (α₀ − k)

The negative covariance is structural: if one pᵢ increases, the others must decrease to keep the sum at 1. The simplex constraint is hard, and once α fixes the means and the overall concentration there are no parameters left to shape the covariance freely.
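
The closed forms above can be collected into one helper, sketched here under the assumption that α is passed as a plain list of positive floats. The function name `dirichlet_moments` is hypothetical.

```python
def dirichlet_moments(alpha):
    """Return (mean, covariance matrix, mode) of Dir(alpha) from the closed forms.

    The mode is None unless every alpha_i > 1, where the interior formula applies.
    """
    k = len(alpha)
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    denom = a0 ** 2 * (a0 + 1.0)
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i == j:
                cov[i][j] = alpha[i] * (a0 - alpha[i]) / denom  # Var(p_i)
            else:
                cov[i][j] = -alpha[i] * alpha[j] / denom        # Cov(p_i, p_j) < 0
    if all(a > 1.0 for a in alpha):
        mode = [(a - 1.0) / (a0 - k) for a in alpha]
    else:
        mode = None
    return mean, cov, mode

mean, cov, mode = dirichlet_moments([5.0, 5.0, 5.0])
print("mean:", mean)
print("row sums of covariance:", [sum(row) for row in cov])
```

Each row of the covariance matrix sums to zero up to rounding, because Σ pᵢ is the constant 1 and so has zero covariance with every component.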

Three regimes — concentration shapes

The behavior of Dir(α) on a 3-category simplex depends on α₀ and the relative αᵢ:

α                 | α₀   | Behavior                                                      | Use case
(1, 1, 1)         | 3    | Uniform on triangle: every probability vector equally likely  | Non-informative prior
(0.5, 0.5, 0.5)   | 1.5  | Mass piles at corners: sparse draws, one category wins        | Sparse topic models
(5, 5, 5)         | 15   | Bell around centroid (1/3, 1/3, 1/3)                          | Concentrated near uniform
(50, 50, 50)      | 150  | Sharp Gaussian-like peak at centroid                          | Strong prior near uniform
(5, 1, 1)         | 7    | Asymmetric: mode at (1, 0, 0), first category dominates       | Strong prior on category 1
(0.1, 0.1, 0.1)   | 0.3  | Extreme corner-piling: draws nearly always at one corner      | Heavily sparse priors

Small α₀ → sparse → corners. Large α₀ → dense → centroid. Asymmetric α → mode shifts toward the heaviest component.
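
One rough way to see these regimes numerically is to count how often a single category takes almost all of the mass. The sketch below uses only the standard library; the 0.9 threshold, the helper names, and the fixed seed are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """One draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    while True:
        ys = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(ys)
        if total > 0.0:  # redraw in the rare case every Gamma draw underflows to 0
            return [y / total for y in ys]

def corner_fraction(alpha, n_draws=5000, threshold=0.9, seed=0):
    """Fraction of draws in which one category holds more than `threshold` of the mass."""
    rng = random.Random(seed)
    hits = sum(max(sample_dirichlet(alpha, rng)) > threshold for _ in range(n_draws))
    return hits / n_draws

for alpha in [(0.1, 0.1, 0.1), (0.5, 0.5, 0.5), (1, 1, 1), (5, 5, 5), (50, 50, 50)]:
    print(alpha, corner_fraction(alpha))
```

The printed fraction should fall steadily as the αᵢ grow: large for the sparse settings and essentially zero for the concentrated ones.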

Conjugacy with multinomial

The categorical/multinomial likelihood for observed counts (n₁, …, n_k) with total n = Σnᵢ given probabilities p is:

L(p | n) = (n! / ∏ nᵢ!) · ∏ pᵢ^(nᵢ)

Multiply by the Dirichlet prior:

f(p | α) · L(p | n) ∝ ∏ pᵢ^(αᵢ − 1) · ∏ pᵢ^(nᵢ)
                    = ∏ pᵢ^(αᵢ + nᵢ − 1)

posterior: p | n ~ Dir(α₁ + n₁, …, α_k + n_k)

The Bayes update is pure addition: prior + observed counts. This is what makes Dirichlet/multinomial inference computationally tractable, especially in topic models and language models with millions of words.

Worked example — word frequencies

Suppose you have a small corpus and want to model the relative frequency of three words: "the," "and," "of." Begin with a non-informative prior Dir(1, 1, 1).

Observe 1000 word tokens with counts (450, 320, 230). Posterior is Dir(451, 321, 231). Posterior mean is (451/1003, 321/1003, 231/1003) = (0.450, 0.320, 0.230) — essentially the empirical frequencies. With α₀ = 3 (weak prior), the posterior is dominated by data after 1000 observations.

Now compare with a strong prior Dir(100, 100, 100), equivalent to having "pre-observed" 100 of each word (300 pseudo-observations in total). Posterior becomes Dir(550, 420, 330). Posterior mean (550/1300, 420/1300, 330/1300) = (0.423, 0.323, 0.254). The strong prior pulls the posterior partway toward uniform (1/3 each). After 1000 fresh observations, the prior still has measurable influence.

The "rule of α₀": prior worth = α₀ pseudo-observations. To dominate it, collect ≫ α₀ real data points.
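
The pseudo-observation reading can be made explicit: the posterior mean (αᵢ + nᵢ)/(α₀ + n) is a weighted average of the prior mean and the empirical frequencies, with weight α₀/(α₀ + n) on the prior. The sketch below recomputes the two cases from the worked example; `posterior_mean` is an illustrative helper name.

```python
def posterior_mean(alpha, counts):
    """Posterior mean of Dir(alpha + counts), plus the weight the prior mean receives."""
    a0 = sum(alpha)
    n = sum(counts)
    prior_weight = a0 / (a0 + n)  # share of the posterior mean contributed by the prior
    mean = [(a + c) / (a0 + n) for a, c in zip(alpha, counts)]
    return mean, prior_weight

counts = [450, 320, 230]
for alpha in ([1, 1, 1], [100, 100, 100]):
    mean, w = posterior_mean(alpha, counts)
    print(alpha, [round(m, 3) for m in mean], "prior weight:", round(w, 3))
```

The prior weight is 3/1003 for the weak prior and 300/1300 for the strong one, which is why the second posterior is still visibly pulled toward 1/3.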

Sampling — Gamma trick

The standard way to sample from Dir(α₁, …, α_k):

1. Sample y_i ~ Gamma(α_i, 1) independently for i = 1, ..., k.
2. Return p_i = y_i / Σ_j y_j.

The resulting p is exactly Dir(α). This works because: if Y₁ ~ Gamma(α₁, 1) and Y₂ ~ Gamma(α₂, 1) are independent, then Y₁ / (Y₁ + Y₂) ~ Beta(α₁, α₂) — and the multivariate generalization gives Dirichlet.

An alternative: stick-breaking. Sample β₁ ~ Beta(α₁, α₂ + … + α_k), then p₁ = β₁. For the next, β₂ ~ Beta(α₂, α₃ + … + α_k), and p₂ = (1 − β₁)·β₂. Continue in the same way through p_(k−1); the last component p_k takes whatever mass remains. A closely related stick-breaking construction underlies the Dirichlet process (the infinite-dimensional generalization).
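
The stick-breaking recipe can be sketched with the standard library's `random.betavariate`. This is an illustrative implementation with a hypothetical function name, checked only by comparing the empirical mean against α/α₀.

```python
import random

def stick_breaking_dirichlet(alpha, rng):
    """One draw from Dir(alpha) via sequential Beta draws (finite stick-breaking)."""
    remaining_mass = 1.0          # length of the stick not yet assigned
    remaining_alpha = sum(alpha)  # total concentration of components not yet drawn
    p = []
    for a in alpha[:-1]:
        remaining_alpha -= a
        beta = rng.betavariate(a, remaining_alpha)  # share of the remaining stick
        p.append(remaining_mass * beta)
        remaining_mass *= 1.0 - beta
    p.append(remaining_mass)      # the last component takes whatever is left
    return p

rng = random.Random(0)
alpha = [5.0, 1.0, 1.0]
draws = [stick_breaking_dirichlet(alpha, rng) for _ in range(20000)]
print("empirical mean:", [sum(d[i] for d in draws) / len(draws) for i in range(3)])
print("alpha / alpha_0:", [a / sum(alpha) for a in alpha])
```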

Latent Dirichlet Allocation (LDA)

LDA is the most influential application of Dirichlet in machine learning. It models each document as a mixture of K topics, where each topic is itself a distribution over the vocabulary:

For each topic k = 1, ..., K:
  Draw β_k ~ Dir(η)          (η ≈ 0.01, topic-word distribution)

For each document d:
  Draw θ_d ~ Dir(α)          (α = 50/K, document-topic mixture)
  For each word position:
    Draw a topic z ~ Categorical(θ_d)
    Draw a word w ~ Categorical(β_z)

The α = 50/K default (Griffiths and Steyvers 2004) falls below 1 once K exceeds 50, which is the regime that encourages sparse document-topic mixtures (a typical document is "about" 2–5 topics, not all K). The η ≈ 0.01 default favors sparse topic-word distributions (each topic uses a focused vocabulary).
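
The generative process can be run directly on a toy problem. The sketch below uses symmetric priors, integer word ids in place of a real vocabulary, and a fixed document length. All of these, along with the function names, are simplifying assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """One draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    while True:
        ys = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(ys)
        if total > 0.0:  # redraw if every Gamma draw underflows to 0 (possible for tiny alpha)
            return [y / total for y in ys]

def generate_corpus(n_docs, doc_length, n_topics, vocab_size, alpha, eta, seed=0):
    """Run the LDA generative process; returns (topics, doc_mixtures, docs)."""
    rng = random.Random(seed)
    # One word distribution per topic: beta_k ~ Dir(eta, ..., eta)
    topics = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(n_topics)]
    doc_mixtures, docs = [], []
    for _ in range(n_docs):
        # Topic mixture for this document: theta_d ~ Dir(alpha, ..., alpha)
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_length):
            z = rng.choices(range(n_topics), weights=theta)[0]        # topic for this position
            w = rng.choices(range(vocab_size), weights=topics[z])[0]  # word id from that topic
            words.append(w)
        doc_mixtures.append(theta)
        docs.append(words)
    return topics, doc_mixtures, docs

K = 100  # toy setting chosen so that alpha = 50/K = 0.5 is below 1
topics, doc_mixtures, docs = generate_corpus(
    n_docs=3, doc_length=20, n_topics=K, vocab_size=50, alpha=50 / K, eta=0.01)
print(docs[0])
```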

Inference recovers θ_d (the topic mixture for each document) and β_k (the word distribution for each topic) from observed word counts. The Dirichlet conjugacy makes the conditionals closed-form: in Gibbs sampling, you sample each word's topic z given everything else by a closed-form Dirichlet-multinomial probability. This is what makes LDA inference scale to millions of documents.
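
In the collapsed Gibbs sampler commonly used for LDA, θ and β are integrated out and the conditional for one token's topic is proportional to (n_dk + α)(n_kw + η)/(n_k + Vη), where the counts exclude the token being resampled. The sketch below computes that conditional for a single token, assuming symmetric priors. The function and argument names are illustrative, and a full sampler would also need the bookkeeping that decrements and increments the counts around each draw.

```python
def topic_conditional(doc_topic, topic_word, topic_total, word, alpha, eta):
    """P(z = k | all other assignments) for one token in collapsed Gibbs sampling.

    doc_topic[k]      tokens in this document currently assigned to topic k
    topic_word[k][w]  times word w is assigned to topic k across the corpus
    topic_total[k]    total tokens assigned to topic k
    All counts must exclude the token being resampled.
    """
    vocab_size = len(topic_word[0])
    weights = []
    for k in range(len(doc_topic)):
        doc_term = doc_topic[k] + alpha  # how strongly the document favors topic k
        word_term = (topic_word[k][word] + eta) / (topic_total[k] + vocab_size * eta)
        weights.append(doc_term * word_term)
    norm = sum(weights)
    return [w / norm for w in weights]

# Two topics, vocabulary of three words; the token being resampled is word 0
probs = topic_conditional(
    doc_topic=[4, 1],
    topic_word=[[10, 2, 1], [1, 6, 8]],
    topic_total=[13, 15],
    word=0, alpha=0.5, eta=0.01)
print(probs)
```

Topic 0 should receive most of the probability here, since both the document counts and the word counts favor it.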

Where Dirichlet shows up

  • Topic models (LDA, HDP, supervised LDA, dynamic topic models). Document-topic and topic-word distributions are Dirichlet.
  • Bayesian categorical inference. Standard prior for unknown class probabilities — replacing the maximum-likelihood estimate with regularized posterior means.
  • Bayesian language modeling. Dirichlet-smoothed n-grams, Pitman-Yor extensions for power-law vocabularies, hierarchical Pitman-Yor for character-level models.
  • Population genetics. Modeling allele frequencies across populations; the Wright-Fisher model has Dirichlet stationary distribution under certain assumptions.
  • Decision trees and random forests with categorical splits. Bayesian leaf probabilities use Dirichlet smoothing.
  • Reinforcement learning. Posterior-sampling methods place Dirichlet posteriors on unknown transition probabilities, and Thompson sampling for bandits with categorical rewards uses Dirichlet posteriors over outcome probabilities.
  • Computational biology. Motif discovery, transcription factor binding site modeling, hidden Markov models for protein structure all use Dirichlet priors on emission and transition probabilities.
  • Pólya urn schemes. The classical Pólya urn (drawing balls and replacing them with extras of the same color) is exactly the predictive distribution of a Dirichlet-multinomial model.
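
The Pólya urn connection in the last item can be simulated in a few lines. In this sketch the urn starts with weight αᵢ on color i and each draw adds one unit of weight to the drawn color, so the chance of drawing color i after n draws is (αᵢ + nᵢ)/(α₀ + n). The function name and parameters are chosen for illustration.

```python
import random

def polya_urn(alpha, n_draws, rng):
    """Simulate a Polya urn that starts with weight alpha_i on color i.

    Each draw picks a color with probability proportional to its current weight,
    then adds one unit of weight to that color.
    """
    weights = list(alpha)
    counts = [0] * len(alpha)
    for _ in range(n_draws):
        color = rng.choices(range(len(weights)), weights=weights)[0]
        weights[color] += 1.0
        counts[color] += 1
    return counts

rng = random.Random(0)
alpha = [2.0, 1.0, 1.0]
fractions = [polya_urn(alpha, 50, rng)[0] / 50 for _ in range(2000)]
print("mean fraction of color 0:", sum(fractions) / len(fractions))
print("alpha share of color 0:", alpha[0] / sum(alpha))
```

Averaged over many urns, the fraction of draws of each color should be close to αᵢ/α₀, while individual urns vary widely, mirroring the spread of a Dirichlet draw.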

Python — sampling and Bayesian update

import numpy as np

# Seeded generator so the printed numbers are reproducible
rng = np.random.default_rng(0)

# Sample from Dir(α)
alpha = np.array([5.0, 5.0, 5.0])
sample = rng.dirichlet(alpha)
print('one sample (sums to 1):', sample, sample.sum())

# Many samples: the empirical mean should be close to α / α₀
samples = rng.dirichlet(alpha, size=10000)
print('empirical mean:', samples.mean(axis=0))
print('theoretical mean:', alpha / alpha.sum())

# Bayesian update: prior Dir(α) + observed counts → posterior Dir(α + n)
prior = np.array([1.0, 1.0, 1.0])
counts = np.array([450, 320, 230])
posterior = prior + counts
print('posterior parameters:', posterior)
print('posterior mean:', posterior / posterior.sum())

# Manual sampling via Gamma trick, vectorized over `size` draws
def dirichlet_sample(alpha, size):
    ys = rng.gamma(alpha, 1.0, size=(size, len(alpha)))  # independent Gamma(α_i, 1) draws
    return ys / ys.sum(axis=1, keepdims=True)            # normalize each row onto the simplex

# Verify: the manual sampler's mean should also be close to α / α₀
manual = dirichlet_sample(alpha, 10000)
print('manual sampler mean:', manual.mean(axis=0))

Common pitfalls

  • Confusing α with mean. The mean is αᵢ/α₀, not αᵢ. Different α with same proportions but different sums give wildly different concentration. Dir(1, 1, 1) and Dir(100, 100, 100) have the same mean but vastly different variance.
  • Negative covariance trap. Dirichlet enforces Cov(pᵢ, pⱼ) < 0. If your real probabilities have positive correlations (e.g., voter preferences clustering by ideology), Dirichlet is the wrong family — use logistic normal.
  • Wrong concentration in LDA. Setting α too large gives uniform document-topic mixtures (every document touches every topic — uninformative). Setting α too small causes Gibbs sampler instability. Defaults α = 50/K, η = 0.01 are well-tested baselines.
  • Ignoring the multivariate beta function. Forgetting the 1/B(α) normalization gives a non-density. Many implementations expose log-density to avoid underflow in high k.
  • Adding α and counts of different scales. If your counts are millions and α is order 1, the prior is irrelevant. To get prior influence proportional to data, scale α up to match expected sample size.
  • Using uniform Dir(1, ..., 1) for k = 2. This is Beta(1, 1), the uniform on [0, 1] — fine, but Jeffreys prior Beta(1/2, 1/2) often gives better calibration. The analog for k > 2 is Dir(1/2, ..., 1/2), the symmetric Jeffreys prior.

History

Peter Gustav Lejeune Dirichlet studied the integral that bears his name in 1839 — what later became the Dirichlet distribution's normalizing constant. The distribution as a probability law was developed by Sir Ronald Fisher and others in the 1930s for population genetics (allele-frequency models). Norman L. Johnson's textbook on continuous distributions canonized the modern parameterization. The Dirichlet's central role in machine learning began with Pearl's belief networks and Gelfand-Smith's Gibbs sampling in the late 1980s, then exploded with Blei, Ng, and Jordan's 2003 paper introducing Latent Dirichlet Allocation. The Dirichlet process (an infinite-dimensional generalization developed by Ferguson 1973 and popularized by Antoniak 1974) is the foundation of Bayesian nonparametrics — letting K grow with the data.

Frequently asked questions

What does a Dirichlet distribution look like?

Dir(α₁, …, α_k) lives on the (k − 1)-simplex — the set of probability vectors (p₁, …, p_k) with each pᵢ ≥ 0 and Σ pᵢ = 1. For k = 3, the simplex is a triangle and the Dirichlet density colors the triangle. Behavior depends on the concentration parameter α₀ = Σ αᵢ: α = (1, 1, 1) gives the uniform; α = (5, 5, 5) concentrates near the centroid; α = (0.1, 0.1, 0.1) puts almost all mass near corners. Asymmetric α like (3, 1, 1) shifts the mode toward the first category.

Why is Dirichlet the conjugate prior for the multinomial?

Multinomial likelihood for counts (n₁, …, n_k) given probabilities (p₁, …, p_k) is L(p) ∝ ∏ᵢ pᵢ^(nᵢ). Multiply by the Dirichlet prior density ∏ᵢ pᵢ^(αᵢ − 1): the product is ∏ᵢ pᵢ^(αᵢ + nᵢ − 1), which is the kernel of a Dir(α + n) density. So prior and posterior share the same algebraic form, with parameters updated by simple addition. The 'effective sample size' of a Dir(α) prior is α₀: in the posterior mean (αᵢ + nᵢ)/(α₀ + n), the prior carries the same weight as α₀ observations.

How is Dirichlet used in Latent Dirichlet Allocation (LDA)?

LDA models each document as a mixture of K topics. The prior on document-topic proportions θ_d is Dir(α) with α typically set to 50/K — small enough to encourage sparsity, large enough to avoid degenerate solutions. The prior on topic-word distributions β_k is Dir(η) with η ≈ 0.01 — favoring sparse vocabulary distributions where each topic uses a focused set of words. Inference (Gibbs sampling or variational Bayes) recovers θ and β from observed word counts. The Dirichlet conjugacy makes the conditionals closed-form, enabling efficient MCMC.

What's the difference between Beta and Dirichlet?

Dirichlet is the multivariate generalization. Beta(α, β) is a distribution over a single proportion p ∈ [0, 1] — equivalent to Dirichlet on k = 2 categories. Dirichlet(α₁, …, α_k) extends to k probabilities summing to 1. Both are conjugate priors for their respective likelihoods (Beta for Bernoulli/binomial, Dirichlet for categorical/multinomial). Both have the same pseudo-count intuition. The marginal of any single Dirichlet component pᵢ is a Beta(αᵢ, α₀ − αᵢ), so questions about one category at a time reduce to Beta inference.
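
As a quick consistency check on the marginal claim, substituting a = αᵢ and b = α₀ − αᵢ into the standard Beta mean and variance formulas recovers the Dirichlet mean and variance given earlier:

```latex
\begin{align*}
p_i &\sim \mathrm{Beta}(a, b), \qquad a = \alpha_i, \quad b = \alpha_0 - \alpha_i, \quad a + b = \alpha_0 \\
\mathbb{E}[p_i] &= \frac{a}{a + b} = \frac{\alpha_i}{\alpha_0} \\
\mathrm{Var}(p_i) &= \frac{ab}{(a + b)^2 (a + b + 1)}
  = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)}
\end{align*}
```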

What does the concentration parameter α₀ control?

α₀ = Σ αᵢ is the total concentration. It controls how spread out the Dirichlet is around its mean. When every αᵢ is below 1, draws are sparse: most pᵢ sit near 0, with one or a few near 1. Large α₀ makes draws concentrated around the mean αᵢ/α₀. The variance of pᵢ is αᵢ(α₀ − αᵢ)/(α₀²(α₀ + 1)), so as α₀ → ∞ with the proportions αᵢ/α₀ held fixed, the variance vanishes and draws approach the mean deterministically. Hyperparameter tuning for LDA centers on α₀.

How do you sample from a Dirichlet?

The standard method uses independent Gamma draws. For each i, sample yᵢ ~ Gamma(αᵢ, 1). Then set pᵢ = yᵢ / (Σⱼ yⱼ). The resulting (p₁, …, p_k) is exactly Dir(α₁, …, α_k). This works because independent Gamma variables with a common rate parameter, divided by their sum, are jointly Dirichlet, and the normalized vector is independent of that sum.

When does Dirichlet break down?

Dirichlet imposes a negative correlation between components — if pᵢ goes up, the others go down to maintain the sum-to-one constraint. This forces Cov(pᵢ, pⱼ) < 0 for i ≠ j. If your real-world probabilities have positive correlations, Dirichlet is wrong. Use the logistic normal (multivariate Gaussian on logit-transformed proportions) or Pólya tree distributions. Dirichlet also has only k parameters — it can't capture richer covariance structures.
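
To illustrate the logistic normal alternative, the sketch below draws correlated Gaussian logits for two categories, fixes the third category's logit at 0 as a reference, and applies a softmax. The correlation 0.95, the scale 1.0, and the function names are arbitrary choices for the example, not recommended settings.

```python
import math
import random

def sample_logistic_normal(rho, sigma, rng):
    """One draw of (p1, p2, p3) from a 3-category logistic normal.

    z1 and z2 are Gaussian with standard deviation sigma and correlation rho;
    the third category is the reference with its logit fixed at 0.
    """
    x = rng.gauss(0.0, 1.0)
    y = rng.gauss(0.0, 1.0)
    z1 = sigma * x
    z2 = sigma * (rho * x + math.sqrt(1.0 - rho ** 2) * y)
    weights = [math.exp(z1), math.exp(z2), 1.0]
    total = sum(weights)
    return [w / total for w in weights]

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(0)
draws = [sample_logistic_normal(rho=0.95, sigma=1.0, rng=rng) for _ in range(20000)]
print("Cov(p1, p2):", covariance([d[0] for d in draws], [d[1] for d in draws]))
```

With strongly correlated logits the first two proportions rise and fall together at the expense of the third, giving a positive covariance that no Dirichlet can produce.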