Probability
Multinomial Distribution
Binomial generalized to k categories — joint count distribution
Multinomial counts outcomes in n trials with k categories. PMF n!/∏nᵢ! · ∏pᵢ^nᵢ. Generalizes the binomial. Powers NLP, polling, dice, language models.
- PMFP(n₁,…,n_k) = n!/∏nᵢ! · ∏pᵢ^nᵢ
- MeanE[nᵢ] = npᵢ
- VarianceVar(nᵢ) = npᵢ(1 − pᵢ)
- CovarianceCov(nᵢ, nⱼ) = −npᵢpⱼ for i ≠ j
- Marginalnᵢ ~ Binomial(n, pᵢ)
- Conjugate priorDirichlet on the simplex
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
From binomial to multinomial
The binomial distribution counts successes in n trials with two outcomes (success, failure) with probabilities (p, 1 − p). The multinomial generalizes this to k outcomes per trial.
You have n independent trials. Each trial has k possible categories with probabilities p₁, …, p_k summing to 1. Let nᵢ be the count of times category i occurs. Then (n₁, …, n_k) is distributed as Multinomial(n, p₁, …, p_k):
P(n_1, n_2, ..., n_k) = (n! / (n_1! · n_2! · ... · n_k!)) · p_1^n_1 · p_2^n_2 · ... · p_k^n_k
The constraint is Σ nᵢ = n. When k = 2 this is exactly the binomial.
The factor n!/(n₁! n₂! … n_k!) is the multinomial coefficient — the number of distinct sequences of n trials producing exactly the count tally (n₁, …, n_k). The factor ∏ pᵢ^nᵢ is the probability of any single specific such sequence.
Worked example — rolling a die 20 times
Roll a fair die n = 20 times with k = 6 categories, each with probability 1/6. What is the probability of seeing the outcome (3, 4, 5, 2, 3, 3)?
The counts sum to 20 — valid. The multinomial coefficient:
20! / (3! · 4! · 5! · 2! · 3! · 3!) = 2,432,902,008,176,640,000 / (6 · 24 · 120 · 2 · 6 · 6)
= 2,432,902,008,176,640,000 / 1,244,160
= 1,955,457,504,000
The probability per sequence: (1/6)^20 = 1/3.66 × 10¹⁵. Multiply:
P(3, 4, 5, 2, 3, 3) = 1.96 × 10¹² / 3.66 × 10¹⁵ ≈ 5.35 × 10⁻⁴
About 0.0535% — a specific count tally is quite unlikely. The most probable outcome under a fair die is balanced counts near (3.33, 3.33, …); the actual mode is hard to write because n = 20 doesn't divide evenly by 6.
Moments — mean, variance, covariance
Each count nᵢ marginally follows a Binomial(n, pᵢ), so:
E[n_i] = n · p_i
Var(n_i) = n · p_i · (1 - p_i)
The covariance between distinct counts is negative:
Cov(n_i, n_j) = -n · p_i · p_j for i ≠ j
Negative because counts must sum to n: if one category over-represents, the others must under-represent. The correlation coefficient is:
Corr(n_i, n_j) = -sqrt(p_i p_j / ((1 - p_i)(1 - p_j)))
This negative correlation is what distinguishes multinomial from k independent binomials: if you assumed independence, you'd overestimate joint variability and miscalibrate joint confidence intervals.
Multinomial vs related distributions
| Distribution | k | n | Use case | Notes |
|---|---|---|---|---|
| Bernoulli(p) | 2 | 1 | Single coin flip | Marginal of binomial |
| Binomial(n, p) | 2 | n trials | Coin flips | Sum of n Bernoulli |
| Categorical(p₁,…,p_k) | k | 1 | Single die roll, single word | Multinomial with n = 1 |
| Multinomial(n, p) | k | n trials | Die rolls, bag-of-words | Generalization of binomial |
| Hypergeometric | 2 | n without replacement | Card hands | Like binomial but no replacement |
| Negative multinomial | k | Random | Trials until r-th success | Generalizes negative binomial |
| Dirichlet(α) | k | — | Conjugate prior on (p₁,…,p_k) | Distribution over p, not n |
The multinomial sits at the center: binomial is the k = 2 case; categorical is the n = 1 case; Dirichlet is the conjugate prior.
Dirichlet-multinomial conjugacy
The multinomial likelihood for observed counts (n₁, …, n_k) given probabilities p is ∝ ∏ pᵢ^nᵢ. Multiply by a Dirichlet(α) prior with density ∝ ∏ pᵢ^(αᵢ − 1):
posterior ∝ ∏ p_i^(α_i - 1) · ∏ p_i^n_i = ∏ p_i^(α_i + n_i - 1)
→ Dir(α₁ + n₁, ..., α_k + n_k)
The posterior is a Dirichlet with parameters incremented by the observed counts. This is the Bayesian update rule for categorical probabilities — used in topic models (LDA), Naive Bayes text classification, and language model smoothing.
The posterior predictive distribution (integrating out p) is the Dirichlet-multinomial distribution:
P(new counts m | observed n, α) = (m! / ∏ m_i!) · B(α + n + m) / B(α + n)
where B is the multivariate beta function. This handles the uncertainty over p analytically — no Monte Carlo required.
Where multinomial appears
- Dice and gambling. Outcomes of k-sided dice, casino games with k payouts, lottery analysis.
- Polling and surveys. Counts of respondents per option in k-way categorical questions; standard errors and confidence ellipsoids from the multinomial covariance.
- NLP — bag-of-words. Each document is a multinomial draw over the vocabulary. Foundation of Naive Bayes text classification, topic models, and ngram language models.
- Language model sampling. The softmax probability vector at each token position is a categorical (n = 1) over the vocabulary; sampling is a multinomial draw.
- Genetics. Allele frequencies in a population, especially under Hardy-Weinberg equilibrium with k = 3 genotypes.
- Contingency tables. The cell counts of a k × m contingency table follow a multinomial; Pearson's chi-squared test is the goodness-of-fit test for the multinomial.
- Color quantization. Image histograms across k color bins are multinomial; clustering algorithms exploit this.
- Physics — photon counting. Photons arriving in k spectral or angular bins follow multinomial (or Poisson, in the unrestricted-total case).
Python — sampling and PMF
import numpy as np
from scipy.stats import multinomial
import math
# PMF: probability of a specific count vector
n = 20
p = [1/6] * 6
counts = [3, 4, 5, 2, 3, 3]
prob = multinomial.pmf(counts, n=n, p=p)
print(f'P({counts}) = {prob:.6f}') # ≈ 0.000535
# Verify via formula
def multinomial_pmf(counts, p):
n = sum(counts)
coef = math.factorial(n)
for c in counts:
coef //= math.factorial(c)
prod = 1.0
for c, pi in zip(counts, p):
prod *= pi ** c
return coef * prod
print('manual:', multinomial_pmf(counts, p))
# Sampling
samples = multinomial.rvs(n=n, p=p, size=10000)
print('shape:', samples.shape) # (10000, 6)
print('mean of n_1:', samples[:, 0].mean(), ' theoretical:', n / 6)
print('var of n_1:', samples[:, 0].var(), ' theoretical:', n * (1/6) * (5/6))
print('cov(n_1, n_2):', np.cov(samples[:, 0], samples[:, 1])[0, 1],
' theoretical:', -n * (1/6) * (1/6))
# Bayesian update: Dirichlet(1, ..., 1) + observed counts → Dirichlet(1 + n_i)
alpha_prior = np.ones(6)
alpha_post = alpha_prior + np.array(counts)
print('Posterior mean for each face:', alpha_post / alpha_post.sum())
Common pitfalls
- Treating counts as independent. Multinomial counts are negatively correlated. If you compute Var(nᵢ + nⱼ) as Var(nᵢ) + Var(nⱼ), you ignore the covariance and overestimate variability.
- Forgetting the sum-to-n constraint. Multinomial outcomes (n₁, …, n_k) live on the discrete simplex Σ nᵢ = n. Treating any single nᵢ as varying freely (e.g., in regression) requires acknowledging this constraint.
- Wrong PMF normalization. Forgetting the multinomial coefficient gives a non-probability. Compute log-PMF using log-gamma to avoid factorial overflow for large n.
- Zero counts and zero probabilities. If pᵢ = 0 and nᵢ > 0, the probability is 0 — a strict impossibility. Use Dirichlet smoothing (additive α) to avoid this in language models.
- Confusing categorical with multinomial. Categorical(p) = Multinomial(n = 1, p). When people say "multinomial logistic regression" they often mean "categorical regression" — predicting one outcome per trial, not the joint count of many trials.
- Multinomial sampling vs Bernoulli sampling per category. If you sample k independent Bernoulli trials with success probabilities (p₁, …, p_k) summing to less than 1, that's not a multinomial — it's a separate construction. Multinomial samples a single category per trial with probabilities summing to 1.
History
The multinomial distribution was implicitly used by Jacob Bernoulli (Ars Conjectandi, 1713) for problems with multiple categorical outcomes; Abraham de Moivre and Pierre-Simon Laplace formalized the multinomial coefficient in the 18th century. The connection to the binomial was made explicit in early probability textbooks (Todhunter 1865, von Mises 1928). The Dirichlet-multinomial Bayesian conjugacy emerged from work on categorical inference by I. J. Good (Good-Turing smoothing, 1953) and Lindley (Bayesian estimation, 1964). The multinomial's centrality in language modeling traces to Markov's analysis of Pushkin's "Eugene Onegin" (1913) for vowel-consonant transitions, then exploded with Shannon's information theory (1948) and modern NLP (Manning and Schütze, 1999). The Dirichlet-multinomial smoothing approach is now baseline in additive smoothing for n-gram models.
Frequently asked questions
How is multinomial different from binomial?
The binomial counts successes in n Bernoulli trials with two outcomes. The multinomial generalizes to k outcomes per trial. For k = 2, multinomial reduces exactly to binomial: P(n₁, n − n₁) = (n!/(n₁!(n − n₁)!)) · p₁^n₁(1 − p₁)^(n − n₁). For k > 2, you track all category counts jointly. Example: rolling a die 20 times produces 6 counts (n₁, …, n₆) that sum to 20 — that's Multinomial(20, (1/6, …, 1/6)).
What's the multinomial coefficient?
The multinomial coefficient is n! / (n₁! n₂! … n_k!), often written as (n choose n₁, n₂, …, n_k). It counts the number of distinct ways to arrange n objects where there are n₁ of type 1, n₂ of type 2, and so on. For example, the word 'MISSISSIPPI' has 11! / (1! 4! 4! 2!) = 34,650 distinct arrangements (1 M, 4 I, 4 S, 2 P). When k = 2 this reduces to the binomial coefficient C(n, n₁) = n! / (n₁! (n − n₁)!).
Why is the covariance between counts negative?
Multinomial counts must sum to n, so they cannot all grow independently. If category 1 has unusually many trials, the others must have correspondingly fewer. Specifically Cov(nᵢ, nⱼ) = −npᵢpⱼ for i ≠ j — always negative, scaling with the product of probabilities. The marginal variance Var(nᵢ) = npᵢ(1 − pᵢ) matches the binomial. The negative correlation is what distinguishes multinomial from independent binomials.
What's the marginal distribution of one count?
Each individual count nᵢ marginally follows a binomial: nᵢ ~ Binomial(n, pᵢ). This is because category i either occurs (with probability pᵢ) or doesn't (with probability 1 − pᵢ) on each trial — so the count in category i is a sum of n independent Bernoulli(pᵢ). The joint distribution is not independent binomials (cov is negative), but each marginal is binomial.
How is multinomial used in NLP?
In bag-of-words language models, a document of n tokens is modeled as a multinomial over the vocabulary V: each word position independently draws from a categorical distribution with probabilities (p_v). The joint count vector is Multinomial(n, p) over the vocabulary. The Dirichlet-multinomial likelihood, after integrating out p, gives the predictive distribution used in Bayesian topic models (LDA), language models, and smoothing techniques. In modern transformer language models, the softmax over the vocabulary at each token position is exactly the multinomial probability.
How is multinomial used in polling?
A political poll with k = 4 candidate options and n respondents produces counts (n₁, …, n_k) modeled as Multinomial(n, p) where p is the true population proportion. The margin of error on the proportion estimate p̂ᵢ = nᵢ/n is √(p̂ᵢ(1 − p̂ᵢ)/n) — exactly the binomial standard error for that marginal. For n = 1000, the margin of error at p = 0.5 is about 0.016 (the well-known ±1.6 percentage points). The negative correlation between candidate proportions explains why a third candidate's surge mechanically depresses the others.
How do you sample from a multinomial?
Three common methods. (1) Direct: for each of n trials, draw a uniform U ~ Unif(0, 1) and find the category i where p₁ + … + pᵢ ≥ U via binary search — O(n log k). (2) Sequential binomial: draw n₁ ~ Binomial(n, p₁), then n₂ ~ Binomial(n − n₁, p₂/(1 − p₁)), and so on — O(k) binomial draws. (3) Poisson trick: draw m independent Poisson counts mᵢ ~ Poisson(λᵢ) with λᵢ ∝ pᵢ, then condition on the total.