Question 1

What does a Dirichlet distribution look like?

Accepted Answer

Dir(α₁, …, α_k) lives on the (k − 1)-simplex — the set of probability vectors (p₁, …, p_k) with each pᵢ ≥ 0 and Σ pᵢ = 1. For k = 3, the simplex is a triangle (the corners at (1,0,0), (0,1,0), (0,0,1)) and the Dirichlet density colors the triangle. Behavior depends on the concentration parameter α₀ = Σ αᵢ: α = (1, 1, 1) gives the uniform distribution on the triangle; α = (5, 5, 5) concentrates mass near the centroid (1/3, 1/3, 1/3) like a Gaussian bump; α = (0.1, 0.1, 0.1) puts almost all the mass near the three corners — draws are sparse, picking one category strongly. Asymmetric α like (3, 1, 1) shifts the mode toward the first category. The Dirichlet is the natural distribution for problems where the unknown is a probability vector.
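To make the three regimes concrete, the sketch below evaluates the Dirichlet density at the centroid and at a point near a corner. It uses only the standard library; the function name `dirichlet_log_pdf` and the two test points are illustrative choices, not part of the original answer.

```python
import math

def dirichlet_log_pdf(p, alpha):
    """Log-density of Dir(alpha) at a point p strictly inside the simplex."""
    if len(p) != len(alpha):
        raise ValueError("p and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha_i must be positive")
    if any(x <= 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must lie strictly inside the simplex")
    # Log normalizing constant: log Gamma(alpha_0) - sum_i log Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p))

centroid = (1 / 3, 1 / 3, 1 / 3)
near_corner = (0.98, 0.01, 0.01)  # illustrative point close to (1, 0, 0)

for alpha in [(1, 1, 1), (5, 5, 5), (0.1, 0.1, 0.1)]:
    at_centroid = math.exp(dirichlet_log_pdf(centroid, alpha))
    at_corner = math.exp(dirichlet_log_pdf(near_corner, alpha))
    print(alpha, "centroid:", at_centroid, "near corner:", at_corner)
```

For α = (1, 1, 1) the two values coincide (the density is flat); for (5, 5, 5) the centroid wins, and for (0.1, 0.1, 0.1) the near-corner point wins.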

Question 2

Why is Dirichlet the conjugate prior for the multinomial?

Accepted Answer

Multinomial likelihood for counts (n₁, …, n_k) given probabilities (p₁, …, p_k) is L(p) ∝ ∏ᵢ pᵢ^(nᵢ). Multiply by the Dirichlet prior density, which is proportional to ∏ᵢ pᵢ^(αᵢ − 1): the product is ∏ᵢ pᵢ^(αᵢ + nᵢ − 1), which is the kernel of a Dir(α + n) density. So prior and posterior share the same algebraic form, with parameters updated by simple addition. The Bayesian inference algorithm for a categorical/multinomial model is: start with Dir(α); for each observed outcome i, increment αᵢ by 1. The 'effective sample size' of a Dir(α) prior depends on the baseline you count from: measured against the uniform Dir(1, …, 1) it is α₀ − k, while in the posterior mean (αᵢ + nᵢ)/(α₀ + n) the prior carries the weight of α₀ observations, which is the more common convention. This conjugacy is what makes Dirichlet ubiquitous in Bayesian text models, topic models, and language models with categorical mixture components.
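The update rule is short enough to write out directly. This is a minimal sketch: the helper names and the counts (7, 2, 1) are made up for illustration.

```python
def dirichlet_update(alpha, counts):
    """Posterior parameters of Dir(alpha) after observing multinomial counts."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of Dir(alpha): alpha_i / alpha_0 for each component."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [1.0, 1.0, 1.0]   # uniform prior on three categories
counts = [7, 2, 1]        # hypothetical observed counts
posterior = dirichlet_update(prior, counts)
posterior_mean = dirichlet_mean(posterior)

# Updating one observation at a time gives the same posterior as the batch update.
sequential = list(prior)
for category, n in enumerate(counts):
    for _ in range(n):
        sequential[category] += 1.0

print(posterior, posterior_mean, sequential)
```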

Question 3

How is Dirichlet used in Latent Dirichlet Allocation (LDA)?

Accepted Answer

LDA models each document as a mixture of K topics, where each topic is a distribution over the vocabulary. The prior on document-topic proportions θ_d is Dir(α) with α typically set to 50/K — small enough to encourage sparsity (each document about a few topics), large enough to avoid degenerate solutions. The prior on topic-word distributions β_k is Dir(η) with η ≈ 0.01 — heavily favoring sparse vocabulary distributions where each topic uses a focused set of words. The generative model: for each word, sample a topic from θ_d, then sample a word from β_topic. Inference (Gibbs sampling or variational Bayes) recovers θ and β from observed word counts. The Dirichlet conjugacy makes the conditionals closed-form, enabling efficient MCMC. LDA's success in topic modeling (and its extensions like HDP, supervised LDA, dynamic topic models) put Dirichlet into the standard ML toolkit.
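The generative story can be sketched in a few lines of standard-library Python. The sizes (K = 3 topics, V = 8 word ids) and the hyperparameters (α = 0.5, η = 0.1) are toy values chosen for illustration rather than the defaults quoted above, and all function names are hypothetical. This covers only the forward (generative) direction, not inference.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(theta, topics, n_words, rng):
    """For each word: pick a topic from theta, then a word id from that topic."""
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)       # topic assignment
        w = sample_categorical(topics[z], rng)   # word id drawn from topic z
        words.append(w)
    return words

rng = random.Random(0)
K, V = 3, 8             # toy sizes (assumption)
alpha, eta = 0.5, 0.1   # toy hyperparameters (assumption)

topics = [sample_dirichlet([eta] * V, rng) for _ in range(K)]  # topic-word distributions
theta = sample_dirichlet([alpha] * K, rng)                     # document-topic proportions
doc = generate_document(theta, topics, n_words=20, rng=rng)
print(theta, doc)
```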

Question 4

What's the difference between Beta and Dirichlet?

Accepted Answer

Dirichlet is the multivariate generalization. Beta(α, β) is a distribution over a single proportion p ∈ [0, 1], equivalent to a Dirichlet on k = 2 categories (with p₂ = 1 − p₁ implicit). Dirichlet(α₁, …, α_k) extends to k probabilities summing to 1. Both are conjugate priors for their respective likelihoods (Beta for Bernoulli/binomial, Dirichlet for categorical/multinomial). Both have the same pseudo-count intuition: measured against a uniform prior, αᵢ − 1 represents the prior count for category i. The marginal of any single Dirichlet component pᵢ is a Beta(αᵢ, α₀ − αᵢ), so questions about one category at a time reduce to Beta inference. The aggregation property holds: if (p₁, p₂, p₃) ~ Dir(α₁, α₂, α₃), then (p₁ + p₂, p₃) ~ Dir(α₁ + α₂, α₃).
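The marginal and aggregation properties can be checked by simulation. The sketch below assumes α = (2, 3, 5) as an arbitrary example, so p₁ should look like Beta(2, 8) and p₁ + p₂ like Beta(5, 5). It compares sample moments with the Beta formulas, which is a sanity check rather than a proof.

```python
import random
from statistics import fmean, pvariance

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

rng = random.Random(42)
alpha = (2.0, 3.0, 5.0)  # arbitrary example parameters
draws = [sample_dirichlet(alpha, rng) for _ in range(20000)]

p1 = [d[0] for d in draws]            # marginal: should match Beta(2, 8)
merged = [d[0] + d[1] for d in draws] # aggregated: should match Beta(5, 5)

print("p1     sample:", fmean(p1), pvariance(p1), "theory:", beta_mean_var(2.0, 8.0))
print("p1+p2  sample:", fmean(merged), pvariance(merged), "theory:", beta_mean_var(5.0, 5.0))
```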

Question 5

What does the concentration parameter α₀ control?

Accepted Answer

α₀ = Σ αᵢ is the total concentration. It controls how tightly the Dirichlet is spread around its mean. When every component αᵢ is below 1 (so α₀ < k), draws are sparse: most pᵢ near 0, with a few near 1. Large α₀ makes draws concentrated around the mean αᵢ/α₀. The variance of pᵢ is αᵢ(α₀ − αᵢ)/(α₀²(α₀ + 1)), so as α₀ → ∞ with the mean held fixed the variance vanishes and draws approach the mean deterministically. This makes α₀ behave like a 'temperature' parameter for sparsity. Hyperparameter tuning for LDA and topic models centers on α₀: a too-small α₀ gives degenerate documents; too-large gives mixed soup. The common LDA default sets each component to 50/K (so α₀ = 50), the value used by Griffiths and Steyvers (2004).
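The mean and variance formulas are easy to tabulate. The sketch below scales a symmetric α over three categories to show the variance shrinking as α₀ grows while the mean stays at 1/3; the scale values are arbitrary.

```python
def dirichlet_mean_var(alpha):
    """Per-component mean and variance of Dir(alpha)."""
    a0 = sum(alpha)
    means = [a / a0 for a in alpha]
    variances = [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in alpha]
    return means, variances

# Same mean (1/3, 1/3, 1/3) throughout; only the total concentration changes.
first_component_variance = {}
for scale in (0.1, 1.0, 10.0, 100.0):
    means, variances = dirichlet_mean_var([scale] * 3)
    first_component_variance[scale] = variances[0]
    print("alpha_0 =", 3 * scale, "mean =", means[0], "variance =", variances[0])
```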

Question 6

How do you sample from a Dirichlet?

Accepted Answer

The standard method uses independent Gamma draws. For each i, sample yᵢ ~ Gamma(αᵢ, 1) (Gamma with shape αᵢ, rate 1). Then set pᵢ = yᵢ / (Σⱼ yⱼ). The resulting (p₁, …, p_k) is exactly Dir(α₁, …, α_k). This works because independent Gamma variables with a common rate, once normalized by their sum, are Dirichlet distributed, and the normalized vector is independent of the sum. For any positive α (not only integer values), you can also use the stick-breaking construction: sample β₁ ~ Beta(α₁, α₂ + ... + α_k) and set p₁ = β₁, then sample β₂ ~ Beta(α₂, α₃ + ... + α_k) and set p₂ = (1 − p₁)β₂, and so on, with the last component taking whatever stick remains. Both are O(k). The Gamma draws are independent and so parallelize trivially; stick-breaking is inherently sequential.
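Both constructions fit in a few lines using the standard library's `random.gammavariate` (shape, scale) and `random.betavariate`. The function names below are illustrative, and the example α = (2, 3, 5) is arbitrary.

```python
import random

def dirichlet_via_gammas(alpha, rng):
    """Normalize independent Gamma(alpha_i, 1) draws (scale 1 equals rate 1)."""
    ys = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(ys)
    return [y / total for y in ys]

def dirichlet_via_stick_breaking(alpha, rng):
    """Break off a Beta-distributed fraction of the remaining stick per component."""
    remaining_stick = 1.0
    remaining_alpha = sum(alpha)
    p = []
    for a in alpha[:-1]:
        remaining_alpha -= a                      # sum of the alphas still to come
        frac = rng.betavariate(a, remaining_alpha)
        p.append(remaining_stick * frac)
        remaining_stick *= 1.0 - frac
    p.append(remaining_stick)                     # last component takes what is left
    return p

rng = random.Random(0)
alpha = [2.0, 3.0, 5.0]
print(dirichlet_via_gammas(alpha, rng))
print(dirichlet_via_stick_breaking(alpha, rng))
```

For very small αᵢ the Gamma draws can underflow toward zero in floating point, so production code typically works in log space or relies on a library sampler.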

Question 7

When does Dirichlet break down?

Accepted Answer

Dirichlet imposes a negative correlation between components: if pᵢ goes up, the others go down to maintain the sum-to-one constraint. Specifically, Cov(pᵢ, pⱼ) = −αᵢαⱼ/(α₀²(α₀ + 1)) < 0 for i ≠ j. If your real-world probabilities have positive correlations (e.g., political-party preferences clustering by ideology), Dirichlet is a poor fit. Use the logistic normal (a multivariate Gaussian on log-ratio-transformed proportions) or Pólya tree distributions. Dirichlet also has only k parameters, so it can't capture richer covariance structures. For high-dimensional problems where you need covariance between thousands of categories (image patches, gene expression), use generalized Dirichlet, nested Dirichlet, or the Dirichlet process (which allows infinite categories but assumes exchangeability).
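The forced negative covariance can be seen by writing out the covariance matrix in closed form. The sketch below uses the standard formulas Var(pᵢ) = αᵢ(α₀ − αᵢ)/(α₀²(α₀ + 1)) and Cov(pᵢ, pⱼ) = −αᵢαⱼ/(α₀²(α₀ + 1)); the function name and the example α are illustrative.

```python
def dirichlet_covariance(alpha):
    """Closed-form covariance matrix of Dir(alpha) as a nested list."""
    a0 = sum(alpha)
    denom = a0 ** 2 * (a0 + 1)
    k = len(alpha)
    return [
        [
            (alpha[i] * (a0 - alpha[i]) if i == j else -alpha[i] * alpha[j]) / denom
            for j in range(k)
        ]
        for i in range(k)
    ]

cov = dirichlet_covariance([3.0, 1.0, 1.0])
for row in cov:
    print(row)
```

Every off-diagonal entry is negative whatever α is chosen, and each row sums to zero because the components are constrained to add up to 1.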

α	α₀	Behavior	Use case
(1, 1, 1)	3	Uniform on triangle — every probability vector equally likely	Non-informative prior
(0.5, 0.5, 0.5)	1.5	Mass piles at corners — sparse draws, one category wins	Sparse topic models
(5, 5, 5)	15	Bell around centroid (1/3, 1/3, 1/3)	Concentrated near uniform
(50, 50, 50)	150	Sharp Gaussian-like peak at centroid	Strong prior near uniform
(5, 1, 1)	7	Asymmetric: mode at (4/4, 0/4, 0/4) = (1, 0, 0), so the first category dominates	Strong prior on category 1
(0.1, 0.1, 0.1)	0.3	Extreme corner-piling — draws nearly always at one corner	Heavily sparse priors
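A quick simulation gives a feel for the rows of the table. The sketch below estimates, for each α, how often a single category takes more than 90% of the mass. The 0.9 threshold, the seed, and the draw count are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def fraction_dominated(alpha, n_draws, rng, threshold=0.9):
    """Share of draws in which one component exceeds the threshold."""
    hits = sum(
        1 for _ in range(n_draws) if max(sample_dirichlet(alpha, rng)) > threshold
    )
    return hits / n_draws

rows = [(1, 1, 1), (0.5, 0.5, 0.5), (5, 5, 5), (50, 50, 50), (5, 1, 1), (0.1, 0.1, 0.1)]
rng = random.Random(7)
results = {alpha: fraction_dominated(alpha, 5000, rng) for alpha in rows}

for alpha, frac in results.items():
    print(alpha, "fraction with one category above 0.9:", frac)
```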

Dirichlet Distribution

The simplex and the density

Three regimes — concentration shapes

Conjugacy with multinomial

Worked example — word frequencies

Sampling — Gamma trick

Latent Dirichlet Allocation (LDA)

Where Dirichlet shows up

Python — sampling and Bayesian update

Common pitfalls

History

Frequently asked questions