Question 1

How is multinomial different from binomial?

Accepted Answer

The binomial counts successes in n independent Bernoulli trials, each with two outcomes (success or failure). The multinomial generalizes this to k outcomes per trial, with PMF P(n₁, …, n_k) = (n!/(n₁! ⋯ n_k!)) · p₁^n₁ ⋯ p_k^n_k. For k = 2 it reduces exactly to the binomial: P(n₁, n − n₁) = (n!/(n₁!(n − n₁)!)) · p₁^n₁(1 − p₁)^(n − n₁), which is the binomial PMF. For k > 2, you track all category counts jointly. Example: rolling a fair die 20 times produces 6 counts (n₁, …, n₆) that sum to 20, which is Multinomial(20, (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)). The probability of a specific outcome like (3, 4, 5, 2, 3, 3) is the multinomial coefficient times the product of the category probabilities raised to their counts.
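As a rough illustration, the PMF can be evaluated directly from its definition. The function name `multinomial_pmf` is just a label chosen for this sketch, not a library API, and the code assumes the counts and probabilities are given in the same category order.

```python
from math import comb, factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing exactly `counts` in sum(counts) trials."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)  # stays an exact integer at every step
    return coeff * prod(p ** c for p, c in zip(probs, counts))

# The die example: 20 rolls of a fair die, tally (3, 4, 5, 2, 3, 3).
die_probability = multinomial_pmf([3, 4, 5, 2, 3, 3], [1 / 6] * 6)
print(die_probability)

# k = 2 sanity check: the multinomial PMF agrees with the binomial PMF.
binomial_value = comb(10, 3) * 0.3 ** 3 * 0.7 ** 7
print(multinomial_pmf([3, 7], [0.3, 0.7]), binomial_value)
```

For large n, working with logarithms (for example via `math.lgamma`) would avoid very large integers and underflow; the direct form is kept here because it mirrors the formula.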

Question 2

What's the multinomial coefficient?

Accepted Answer

The multinomial coefficient is n! / (n₁! n₂! … n_k!), written (n choose n₁, n₂, …, n_k) by analogy with the binomial coefficient it generalizes. It counts the number of distinct ways to arrange n objects where there are n₁ of type 1, n₂ of type 2, and so on. For example, the word 'MISSISSIPPI' has 11! / (1! 4! 4! 2!) = 34,650 distinct arrangements (1 M, 4 I, 4 S, 2 P). When k = 2 this reduces to the binomial coefficient C(n, n₁) = n! / (n₁! (n − n₁)!). In the multinomial PMF, this coefficient counts the number of sequences of n trials producing exactly the count tally (n₁, …, n_k). Each such sequence has probability ∏ pᵢ^nᵢ, so multiplying the two gives the probability of the tally.
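The MISSISSIPPI count can be reproduced with a few lines of standard-library Python. This is a minimal sketch; `multinomial_coefficient` is a name made up for the example.

```python
from collections import Counter
from math import comb, factorial

def multinomial_coefficient(counts):
    """n! / (n_1! n_2! ... n_k!) with n = sum(counts), in exact integers."""
    counts = list(counts)
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result

letter_counts = Counter("MISSISSIPPI")  # M: 1, I: 4, S: 4, P: 2
print(multinomial_coefficient(letter_counts.values()))  # 34650

# With two categories it equals the binomial coefficient.
print(multinomial_coefficient([3, 7]) == comb(10, 3))  # True
```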

Question 3

Why is the covariance between counts negative?

Accepted Answer

Multinomial counts must sum to n, so they cannot all grow independently. If category 1 has unusually many trials, the others must have correspondingly fewer. Specifically, Cov(nᵢ, nⱼ) = −npᵢpⱼ for i ≠ j, which is negative whenever both probabilities are positive and scales with their product. The marginal variance Var(nᵢ) = npᵢ(1 − pᵢ) matches the binomial, since each nᵢ alone is Binomial(n, pᵢ). The negative correlation is what distinguishes the multinomial from independent binomials: if you treat the counts as independent, you get the wrong variance for any combination of counts. You overestimate the variance of sums (the total n₁ + … + n_k has variance zero) and underestimate the variance of differences, since Var(nᵢ − nⱼ) = Var(nᵢ) + Var(nⱼ) + 2npᵢpⱼ. This matters for confidence intervals on multinomial proportions, especially on differences between them, and for likelihood ratio tests in contingency tables.
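One way to check the covariance formula is to enumerate a small multinomial exactly. The sketch below uses illustrative values n = 5 and p = (0.2, 0.3, 0.5), which are not from the answer above, and a helper `expect` defined only for this example.

```python
from math import factorial

def multinomial_pmf(counts, probs):
    coeff = factorial(sum(counts))
    prob = 1.0
    for c, p in zip(counts, probs):
        coeff //= factorial(c)
        prob *= p ** c
    return coeff * prob

n, probs = 5, (0.2, 0.3, 0.5)

# Every count vector (a, b, c) with a + b + c = n.
support = [(a, b, n - a - b) for a in range(n + 1) for b in range(n + 1 - a)]

def expect(f):
    """Exact expectation of f(counts) under Multinomial(n, probs)."""
    return sum(f(c) * multinomial_pmf(c, probs) for c in support)

mean1 = expect(lambda c: c[0])
mean2 = expect(lambda c: c[1])
cov12 = expect(lambda c: c[0] * c[1]) - mean1 * mean2
var1 = expect(lambda c: c[0] ** 2) - mean1 ** 2
var_diff = expect(lambda c: (c[0] - c[1]) ** 2) - (mean1 - mean2) ** 2

print(cov12, -n * probs[0] * probs[1])       # both about -0.3
print(var1, n * probs[0] * (1 - probs[0]))   # both about 0.8
print(var_diff)                              # about 2.45, not 0.8 + 1.05
```

The last line shows the practical consequence: the variance of the difference is larger than the sum of the two marginal variances, because the negative covariance enters with a minus sign.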

Question 4

What's the marginal distribution of one count?

Accepted Answer

Each individual count nᵢ marginally follows a binomial: nᵢ ~ Binomial(n, pᵢ). This is because category i either occurs (with probability pᵢ) or doesn't (with probability 1 − pᵢ) on each trial, so the count in category i is a sum of n independent Bernoulli(pᵢ) variables. The counts are not jointly independent binomials (their covariances are negative), but each marginal is binomial. This is useful in practice: if you only care about one specific outcome (e.g., the probability of more than 30 sixes in 100 rolls of a die), you can ignore the rest of the multinomial and just use the binomial.
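Because the marginal is binomial, a single-category question needs only a binomial tail sum. The sketch below uses a hypothetical example (the number of sixes in 100 rolls of a fair die) and a helper name, `binomial_tail`, chosen for illustration.

```python
from math import comb

def binomial_tail(n, p, threshold):
    """P(X > threshold) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        comb(n, x) * p ** x * (1 - p) ** (n - x)
        for x in range(threshold + 1, n + 1)
    )

# Probability of more than 30 sixes in 100 rolls of a fair die.
# The other five faces never enter the calculation.
print(binomial_tail(100, 1 / 6, 30))
```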

Question 5

How is multinomial used in NLP?

Accepted Answer

In bag-of-words language models, a document of n tokens is modeled as a multinomial over the vocabulary V: each word position independently draws from a categorical distribution with probabilities (p_v)_{v ∈ V}, so the count vector is Multinomial(n, p). Integrating p out against a Dirichlet prior gives the Dirichlet-multinomial, the predictive distribution used in Bayesian topic models (LDA), Pólya urn language models, and smoothing: additive smoothing with pseudo-count α is the posterior mean under a symmetric Dirichlet(α) prior (equivalently, the MAP estimate under Dirichlet(α + 1)). In modern transformer language models, the softmax over the vocabulary at each token position defines a categorical distribution, and sampling the next token is a multinomial draw with n = 1. Top-p sampling draws from a truncated, renormalized version of this distribution, while beam search is a deterministic heuristic that searches for high-probability sequences rather than sampling.
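A toy sketch of two of these ideas, using only the standard library: additive smoothing computed as (count + α) / (n + α|V|), and next-token sampling as a single categorical draw from softmax probabilities. The vocabulary, tokens, and logits are made-up illustration data, and the function names are not from any NLP library.

```python
import math
import random
from collections import Counter

def additive_smoothing(tokens, vocab, alpha=1.0):
    """Smoothed word probabilities; assumes every token is in vocab."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_token(vocab, logits, rng):
    """One categorical draw, i.e. a multinomial with n = 1."""
    return rng.choices(vocab, weights=softmax(logits), k=1)[0]

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = "the cat sat on the mat".split()
smoothed = additive_smoothing(tokens, vocab)
print(smoothed["the"], smoothed["dog"])  # unseen "dog" still gets mass

rng = random.Random(0)
print(sample_token(vocab, [2.0, 1.0, 0.5, 0.1, 0.0, -1.0], rng))
```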

Question 6

How is multinomial used in polling?

Accepted Answer

A political poll with k response options (say, three candidates plus 'undecided', so k = 4) and n respondents produces counts (n₁, …, n_k) modeled as Multinomial(n, p), where p is the vector of true population proportions. The standard error of the estimate p̂ᵢ = nᵢ/n is √(p̂ᵢ(1 − p̂ᵢ)/n), exactly the binomial standard error for that marginal, and the 95% margin of error is about 1.96 times that. For n = 1000 and p = 0.5, the standard error is about 0.016, so the margin of error is about 0.031 (the well-known ±3.1 percentage points). The negative correlation between candidate proportions explains why a third candidate's surge mechanically depresses the others; campaigns model these dynamics with multinomial likelihood functions. Bayesian polling aggregation uses Dirichlet priors on p, updating to Dirichlet posteriors as poll counts accumulate.
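The arithmetic can be sketched as follows, assuming the usual normal approximation with z = 1.96 for a 95% interval. The function names and the 45% vs 41% example are illustrative; the formula for the lead between two candidates follows from the multinomial covariance, Var(p̂ᵢ − p̂ⱼ) = (pᵢ + pⱼ − (pᵢ − pⱼ)²)/n.

```python
from math import sqrt

def standard_error(p_hat, n):
    """Binomial standard error of one estimated proportion."""
    return sqrt(p_hat * (1 - p_hat) / n)

def margin_of_error(p_hat, n, z=1.96):
    """Normal-approximation margin of error (z = 1.96 for about 95%)."""
    return z * standard_error(p_hat, n)

def lead_standard_error(p_i, p_j, n):
    """Standard error of p_i - p_j for two options in the same poll."""
    return sqrt((p_i + p_j - (p_i - p_j) ** 2) / n)

print(round(standard_error(0.5, 1000), 4))   # 0.0158
print(round(margin_of_error(0.5, 1000), 4))  # 0.031

# Hypothetical poll: 45% vs 41% among 1000 respondents.
print(1.96 * lead_standard_error(0.45, 0.41, 1000))
```

The margin on the lead is noticeably wider than the margin on either share alone, which is the negative covariance at work.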

Question 7

How do you sample from a multinomial?

Accepted Answer

Three common methods. (1) Direct: for each of the n trials, draw U ~ Unif(0, 1) and find the smallest category i with p₁ + … + pᵢ ≥ U via binary search on the cumulative sums, for O(n log k) total. (2) Sequential binomial: draw n₁ ~ Binomial(n, p₁), then n₂ ~ Binomial(n − n₁, p₂/(1 − p₁)), then n₃ ~ Binomial(n − n₁ − n₂, p₃/(1 − p₁ − p₂)), and so on, for k − 1 binomial draws. (3) Poisson trick: draw k independent Poisson counts mᵢ ~ Poisson(λpᵢ); conditional on their total being n, the vector (m₁, …, m_k) is Multinomial(n, p), which is useful when n is itself random. NumPy's multinomial sampler uses the second method, and SciPy's multinomial delegates to NumPy; other libraries may use different strategies. The first method is the simplest to implement and is reasonable when n is small. When every npᵢ is large, you can also use the Gaussian approximation: nᵢ ≈ npᵢ + √(npᵢ(1 − pᵢ)) · Zᵢ with correlated Z's, where Corr(Zᵢ, Zⱼ) = −√(pᵢpⱼ/((1 − pᵢ)(1 − pⱼ))).
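The first two methods can be sketched with the standard library alone. This is an illustration under stated assumptions, not how any particular library implements sampling: `draw_binomial` is a deliberately naive stand-in for a real binomial sampler, and all function names are made up for the example.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def sample_direct(n, probs, rng):
    """Method 1: one uniform per trial, located in the CDF by binary search."""
    cdf = list(accumulate(probs))
    counts = [0] * len(probs)
    for _ in range(n):
        u = rng.random()
        # Smallest i with cdf[i] >= u; the clamp covers cdf[-1] < 1 from rounding.
        i = min(bisect_left(cdf, u), len(probs) - 1)
        counts[i] += 1
    return counts

def draw_binomial(n, p, rng):
    """Naive placeholder binomial sampler: a sum of n Bernoulli draws."""
    return sum(rng.random() < p for _ in range(n))

def sample_sequential(n, probs, rng):
    """Method 2: peel off one category at a time with conditional binomials."""
    counts, remaining_n, remaining_p = [], n, 1.0
    for p in probs[:-1]:
        cond = min(1.0, p / remaining_p) if remaining_p > 0 else 0.0
        c = draw_binomial(remaining_n, cond, rng)
        counts.append(c)
        remaining_n -= c
        remaining_p -= p
    counts.append(remaining_n)  # the last category takes whatever is left
    return counts

rng = random.Random(0)
die = [1 / 6] * 6
print(sample_direct(20, die, rng))
print(sample_sequential(20, die, rng))
```

With the naive binomial helper, the sequential version does no less work than the direct one; its advantage appears only when each binomial draw costs far less than O(n), as with a library-grade sampler.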

Distribution	k	n	Use case	Notes
Bernoulli(p)	2	1	Single coin flip	Binomial with n = 1
Binomial(n, p)	2	n trials	Coin flips	Sum of n Bernoulli
Categorical(p₁,…,p_k)	k	1	Single die roll, single word	Multinomial with n = 1
Multinomial(n, p)	k	n trials	Die rolls, bag-of-words	Generalization of binomial
Hypergeometric	2	n without replacement	Card hands	Like binomial but no replacement
Negative multinomial	k	Random	Trials until r-th success	Generalizes negative binomial
Dirichlet(α)	k	—	Conjugate prior on (p₁,…,p_k)	Distribution over p, not n

Multinomial Distribution
