Beta Distribution

The conjugate prior for proportions — Bayesian inference's two-parameter workhorse

Beta(α, β) is a distribution on [0, 1] with density proportional to x^(α−1)(1−x)^(β−1). It is the conjugate prior for Bernoulli/binomial proportions: prior Beta(α, β) plus k successes in n trials updates to posterior Beta(α + k, β + n − k). Pseudo-counts α and β encode prior beliefs. Mean α/(α + β); uniform when α = β = 1. Bayesian A/B testing's standard tool.

Density: f(x) = x^(α−1)(1−x)^(β−1) / B(α, β) on [0, 1]
Mean / mode: mean α/(α+β); mode (α−1)/(α+β−2) for α, β > 1
Conjugate to: Bernoulli, binomial, geometric, negative binomial
Update rule: Beta(α, β) + (k, n−k) → Beta(α + k, β + n − k)
Special cases: Beta(1, 1) = uniform; Beta(½, ½) = Jeffreys arc-sine
Famous use: Bayesian A/B testing, Thompson sampling, CTR estimation

How the Beta distribution works

The Beta distribution lives on the unit interval [0, 1] and has two positive shape parameters, α and β. Its probability density function is

f(x; α, β) = x^(α−1) · (1 − x)^(β−1) / B(α, β),     0 ≤ x ≤ 1,

where B(α, β) = Γ(α)Γ(β)/Γ(α + β) is the beta function, ensuring the density integrates to 1. The first moment is α/(α + β); for α, β > 1 the mode sits at (α − 1)/(α + β − 2). The variance is αβ/((α + β)² (α + β + 1)) — note the (α + β + 1) in the denominator, so larger α + β (more pseudo-counts) tightens the distribution.
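
The moment formulas above are easy to check numerically. The following is a minimal sketch; the helper name `betaSummary` is an assumption made for illustration and does not come from any library.

```javascript
// Summary statistics of Beta(alpha, beta) from the closed-form expressions above.
// `betaSummary` is a hypothetical helper name used only for this illustration.
function betaSummary(alpha, beta) {
  if (!(alpha > 0 && beta > 0)) {
    throw new RangeError("alpha and beta must be positive");
  }
  const total = alpha + beta;
  const mean = alpha / total;
  const variance = (alpha * beta) / (total * total * (total + 1));
  // The interior mode formula only applies when both shapes exceed 1.
  const mode = alpha > 1 && beta > 1 ? (alpha - 1) / (total - 2) : null;
  return { mean, mode, variance, sd: Math.sqrt(variance) };
}

console.log(betaSummary(2, 2));   // mean 0.5, mode 0.5, sd ≈ 0.224
console.log(betaSummary(50, 50)); // mean 0.5, mode 0.5, sd ≈ 0.050
```

Returning `null` for the mode when either shape is at most 1 is a design choice of this sketch: in that regime the density has no interior peak given by the formula.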

The intuition that makes the Beta useful: α − 1 is a count of prior successes; β − 1 is a count of prior failures. The shape parameters do not have to be integers, but treating them as pseudo-counts makes priors easy to interpret and elicit from domain experts.

Worked example — A/B testing a button colour

You run an A/B test on a sign-up button. Variant A (blue) has converted 12 of 100 visitors so far; Variant B (green) has converted 18 of 100. With a uniform Beta(1, 1) prior on each conversion rate, the posteriors after these observations are

p_A | data ~ Beta(1 + 12, 1 + 88) = Beta(13, 89)
p_B | data ~ Beta(1 + 18, 1 + 82) = Beta(19, 83)

Posterior means are 13/102 ≈ 0.127 and 19/102 ≈ 0.186. Posterior standard deviations (square root of αβ/((α+β)²(α+β+1))) are about 0.033 and 0.038. To answer "is B better than A?" draw, say, 10,000 samples from each posterior and count what fraction satisfy p_B > p_A: a little under 90% in this case. If you wanted a 95% decision threshold you would keep running the test.

Now observe 200 more visitors in each arm with 24 and 36 conversions respectively. The posteriors update by simple addition:

p_A | data ~ Beta(13 + 24, 89 + 176) = Beta(37, 265)
p_B | data ~ Beta(19 + 36, 83 + 164) = Beta(55, 247)

The means barely shift, but the standard deviations shrink to about 0.019 and 0.022. P(p_B > p_A) is now about 0.98, past a 95% threshold: call the test, ship green.
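
For a deterministic cross-check of the Monte Carlo figures in this example, P(p_B > p_A) can also be approximated on a grid. This is only a sketch under stated assumptions: the function names `gridPmf` and `probBBeatsA` and the grid size of 20,000 cells are choices made here, not part of the original walkthrough. It should agree with a sampling estimate up to sampling noise.

```javascript
// Deterministic estimate of P(p_B > p_A) for independent Beta posteriors.
// Both densities are evaluated on a midpoint grid and normalised numerically,
// so no gamma or beta function is needed. Names here are hypothetical.
function gridPmf(alpha, beta, cells) {
  const logDensity = new Array(cells);
  let maxLog = -Infinity;
  for (let i = 0; i < cells; i++) {
    const x = (i + 0.5) / cells; // midpoint of cell i, strictly inside (0, 1)
    logDensity[i] = (alpha - 1) * Math.log(x) + (beta - 1) * Math.log(1 - x);
    if (logDensity[i] > maxLog) maxLog = logDensity[i];
  }
  // Subtract the maximum before exponentiating to avoid underflow.
  const weights = logDensity.map((v) => Math.exp(v - maxLog));
  const total = weights.reduce((sum, w) => sum + w, 0);
  return weights.map((w) => w / total);
}

function probBBeatsA(aA, bA, aB, bB, cells = 20000) {
  const pmfA = gridPmf(aA, bA, cells);
  const pmfB = gridPmf(aB, bB, cells);
  let massBelow = 0; // running P(p_A falls in an earlier cell)
  let prob = 0;
  for (let i = 0; i < cells; i++) {
    // B in cell i beats A in any earlier cell; ties in the same cell count half.
    prob += pmfB[i] * (massBelow + 0.5 * pmfA[i]);
    massBelow += pmfA[i];
  }
  return prob;
}

console.log(probBBeatsA(13, 89, 19, 83).toFixed(3));   // after the first 100 visitors per arm
console.log(probBBeatsA(37, 265, 55, 247).toFixed(3)); // after 200 more visitors per arm
```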

Shape menagerie

The two parameters control the shape:

Beta(1, 1) — the uniform distribution on [0, 1]. Density is constant at 1. It is the maximum-entropy distribution on [0, 1] when nothing else is constrained.
Beta(½, ½) — the arc-sine (Jeffreys) distribution. Density is U-shaped, diverging at 0 and 1. Used as the objective prior for binomial inference because it is invariant under reparameterization.
Beta(2, 2) — a gentle bell with mean 0.5 and standard deviation about 0.224. Equivalent to having seen one prior success and one prior failure.
Beta(5, 1) — left-skewed (long tail toward 0); mass concentrated near 1. Mean is 5/6 ≈ 0.833.
Beta(1, 5) — right-skewed mirror image (long tail toward 1); mass concentrated near 0. Mean is 1/6 ≈ 0.167.
Beta(50, 50) — sharp peak at 0.5, standard deviation about 0.05. Strong prior centred on a fair coin.
Beta(α, β) with α or β < 1 — density spikes at one or both endpoints (mass piles up at extreme proportions).

Why conjugacy works — one line of algebra

The Bernoulli/binomial likelihood for observing k successes in n trials given probability p is, dropping a combinatorial constant,

L(p | k, n) ∝ p^k · (1 − p)^(n − k).

This has exactly the same algebraic form as the Beta density (with the shape exponents shifted by 1). Multiplying the Beta(α, β) prior by the likelihood and dropping normalisation,

prior × likelihood ∝ p^(α − 1) (1 − p)^(β − 1) · p^k (1 − p)^(n − k)
                   = p^(α + k − 1) (1 − p)^(β + n − k − 1).

This is the kernel of a Beta(α + k, β + n − k). Normalisation by 1/B(α + k, β + n − k) follows from the fact that the result is a density on [0, 1]. So Bayes' theorem produces another member of the Beta family with two-line bookkeeping — that is the entire conjugacy story.
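
The bookkeeping can be sketched in a few lines of JavaScript. The helper name `updateBeta` and the object shape `{ alpha, beta }` are assumptions made for this illustration. The example also shows that updating batch by batch gives the same posterior as updating once on the pooled counts.

```javascript
// Conjugate Beta-binomial update: add successes to alpha, failures to beta.
// `updateBeta` is a hypothetical helper name used only for this illustration.
function updateBeta(prior, successes, trials) {
  if (successes < 0 || successes > trials) {
    throw new RangeError("need 0 <= successes <= trials");
  }
  return {
    alpha: prior.alpha + successes,
    beta: prior.beta + (trials - successes),
  };
}

const uniformPrior = { alpha: 1, beta: 1 };

// Updating batch by batch...
const afterFirst = updateBeta(uniformPrior, 12, 100);
const afterSecond = updateBeta(afterFirst, 24, 200);

// ...matches updating once on the pooled data.
const pooled = updateBeta(uniformPrior, 36, 300);

console.log(afterFirst);  // { alpha: 13, beta: 89 }
console.log(afterSecond); // { alpha: 37, beta: 265 }
console.log(pooled);      // { alpha: 37, beta: 265 }
```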

Variants and relatives

Dirichlet distribution. The multivariate generalisation. Dirichlet(α₁, …, αₖ) is a distribution over k proportions summing to 1; conjugate to the multinomial/categorical. Beta = Dirichlet with k = 2.
Beta-binomial. The marginal of the binomial under a Beta(α, β) prior on p. Overdispersed relative to the binomial (larger variance for the same mean) because p is uncertain. Used when the success probability varies across groups or populations.
Beta prime (Beta of the second kind). If X ~ Beta(α, β), then X/(1 − X) lives on (0, ∞); used in income distribution modelling.
Logit-Normal. Distribution of σ(z) where z ~ Normal — not conjugate to the binomial but often easier to combine with regression structure (logistic regression posteriors).
Kumaraswamy distribution. A simpler closed-form alternative to Beta with similar shape flexibility; used when you need explicit CDFs and quantiles (Beta's CDF is the regularised incomplete beta function, not elementary).
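
To make the Kumaraswamy point concrete, here is a minimal sketch of its closed-form CDF and quantile function, using the standard form F(x) = 1 − (1 − x^a)^b. The function names are assumptions made for illustration.

```javascript
// Kumaraswamy(a, b) on [0, 1]: CDF F(x) = 1 - (1 - x^a)^b, so the quantile
// function inverts in closed form. Function names here are hypothetical.
function kumaraswamyCdf(x, a, b) {
  return 1 - Math.pow(1 - Math.pow(x, a), b);
}

function kumaraswamyQuantile(u, a, b) {
  return Math.pow(1 - Math.pow(1 - u, 1 / b), 1 / a);
}

// Inverse-CDF sampling: push a uniform draw through the quantile function.
function kumaraswamySample(a, b, rng = Math.random) {
  return kumaraswamyQuantile(rng(), a, b);
}

const median = kumaraswamyQuantile(0.5, 2, 5);
console.log(median, kumaraswamyCdf(median, 2, 5)); // the CDF at the median is 0.5 up to rounding
```

No such elementary inverse exists for the Beta CDF, which is why Beta sampling usually goes through Gamma draws or other indirect methods.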

JavaScript — Beta posterior and Bayesian A/B test

// Sample from Beta(α, β) via two Gammas: Beta = X / (X + Y),
// where X ~ Gamma(α), Y ~ Gamma(β). Use Marsaglia-Tsang for Gamma.
function gammaSample(alpha) {
  if (alpha < 1) return gammaSample(alpha + 1) * Math.random() ** (1 / alpha);
  const d = alpha - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  while (true) {
    let x, v;
    do { x = randn(); v = 1 + c * x; } while (v <= 0);
    v = v ** 3;
    const u = Math.random();
    if (u < 1 - 0.0331 * x ** 4) return d * v;
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v;
  }
}

function randn() {
  const u = 1 - Math.random(), v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

function betaSample(alpha, beta) {
  const x = gammaSample(alpha);
  return x / (x + gammaSample(beta));
}

// Posterior P(p_B > p_A) for two-arm Bayesian A/B test
function bayesianABTest(successA, totalA, successB, totalB, draws = 10000) {
  const aA = 1 + successA, bA = 1 + totalA - successA;
  const aB = 1 + successB, bB = 1 + totalB - successB;
  let bBeatsA = 0;
  for (let i = 0; i < draws; i++) {
    if (betaSample(aB, bB) > betaSample(aA, bA)) bBeatsA++;
  }
  return bBeatsA / draws;
}

// Example: A=12/100, B=18/100
console.log(bayesianABTest(12, 100, 18, 100));  // ≈ 0.88 (varies slightly between runs)

Why the mean is α/(α + β)

One useful proof uses the recurrence Γ(α + 1) = α Γ(α):

E[X] = ∫₀¹ x · x^(α−1)(1−x)^(β−1) / B(α, β) dx
     = (1 / B(α, β)) · ∫₀¹ x^α (1−x)^(β−1) dx
     = B(α + 1, β) / B(α, β)
     = [Γ(α + 1) Γ(β) / Γ(α + β + 1)] · [Γ(α + β) / (Γ(α) Γ(β))]
     = α / (α + β).

The same trick gives E[X²] = α(α + 1)/((α + β)(α + β + 1)) and hence the variance formula. The "pseudo-count" interpretation is now mechanical: mean = (prior successes + 1)/(total pseudo-counts + 2).

When the Beta is the right tool

Posterior over a single proportion. Click-through rate, conversion rate, defect rate, click-fraud rate — anywhere the unknown lives in [0, 1].
Bayesian A/B and multi-armed bandit testing. Each arm gets a Beta posterior; Thompson sampling routes traffic by drawing from the posteriors.
Reliability and survival of yes/no systems. Probability a component works for a given mission — pseudo-count priors encode engineering history.
Hierarchical models. Beta with hyperprior on (α, β) lets you partial-pool across many groups (multiple variants, multiple landing pages, multiple players).
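Smoothing rare events. Laplace (add-one) smoothing of a binary event rate is the posterior mean under a Beta(1, 1) prior; for n-gram language models over a vocabulary of V words it is the posterior mean under the multivariate analogue, a symmetric Dirichlet(1, …, 1) prior.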
Order statistics. The k-th order statistic of n uniform samples is exactly Beta(k, n − k + 1) — useful for empirical-quantile uncertainty.
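
The order-statistic fact in the last item also gives a very simple (if slow) way to draw from a Beta with integer parameters. The sketch below assumes integer k and n; the function name `betaViaOrderStatistic` is hypothetical.

```javascript
// Draw from Beta(k, n - k + 1) as the k-th smallest of n uniform draws.
// Only valid for integer k with 1 <= k <= n; simple, but costs O(n log n) per draw.
function betaViaOrderStatistic(k, n, rng = Math.random) {
  if (!Number.isInteger(k) || !Number.isInteger(n) || k < 1 || k > n) {
    throw new RangeError("need integers with 1 <= k <= n");
  }
  const draws = Array.from({ length: n }, () => rng());
  draws.sort((x, y) => x - y);
  return draws[k - 1];
}

// Empirical check against the Beta(k, n - k + 1) mean, k / (n + 1).
const k = 3, n = 10, reps = 20000;
let sum = 0;
for (let i = 0; i < reps; i++) sum += betaViaOrderStatistic(k, n);
console.log(sum / reps, k / (n + 1)); // the two numbers should be close
```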

Common pitfalls

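Treating Beta(0, 0) as "no prior". The shape parameters must be positive; Beta(0, 0) (the Haldane prior) is improper, with its mass piling up at 0 and 1, and the posterior stays improper until at least one success and one failure have been observed. Use Beta(1, 1) for uniform or Beta(½, ½) for Jeffreys instead.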
Reporting the mean as the "answer". The Beta gives a full posterior — report a credible interval or P(p > threshold), not just the mean.
Confusing α with the success count. The posterior after k successes is Beta(α + k, …), not Beta(k, …) — the prior α matters at small n.
Equating posterior probability with a frequentist p-value. P(p_B > p_A) = 0.95 is not a 5% Type-I error rate. The two numbers can look similar on the same data, but they answer different questions: one is a probability statement about the parameters given the data, the other about the data given a null hypothesis.
Misjudging peeking under Bayes. The posterior is valid whenever you look at it, so sequential Bayesian A/B tests do not need the multiple-testing corrections that frequentist sequential tests do. You still need a decision rule fixed in advance (e.g. expected loss < threshold): stopping the first time P(p_B > p_A) happens to cross a bar produces more false wins than a single planned look.
Over-confident strong priors. A Beta(100, 100) prior is hard to budge: 100 fresh observations move the posterior mean only about a third of the way from 0.5 toward the observed rate. If the underlying rate has changed (concept drift), strong priors lag.

Applications across statistics and ML

Bayesian A/B testing and bandits

Modern A/B platforms (Google Optimize, Optimizely, Statsig) expose Bayesian posteriors as Beta distributions. Decision rules include "stop when P(B > A) > 0.95", "expected loss < ε", or full multi-armed bandit routing via Thompson sampling. The Beta makes all of these cheap: the posterior is available in closed form, and the decision quantities need only a few thousand posterior draws.
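
Thompson sampling can be sketched in a few dozen lines. Everything below is illustrative: the true conversion rates are made up, the names (`thompson`, `betaInt`, `gammaInt`) are hypothetical, and the Beta sampler is a deliberately simple one that works only for integer parameters, which holds here because each arm starts at Beta(1, 1) and only whole counts are added.

```javascript
// Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors.
// The true rates and all names below are made up for this illustration.

// Gamma(shape, 1) for a positive integer shape, as a sum of Exp(1) draws.
function gammaInt(shape, rng) {
  let total = 0;
  for (let i = 0; i < shape; i++) total -= Math.log(1 - rng()); // 1 - rng() is in (0, 1]
  return total;
}

// Beta(a, b) for positive integers a, b via the ratio of two Gammas.
function betaInt(a, b, rng) {
  const x = gammaInt(a, rng);
  return x / (x + gammaInt(b, rng));
}

function thompson(trueRates, rounds, rng = Math.random) {
  // Each arm starts at Beta(1, 1); parameters stay integers as counts are added.
  const arms = trueRates.map(() => ({ alpha: 1, beta: 1, pulls: 0 }));
  for (let t = 0; t < rounds; t++) {
    // Sample a plausible rate for every arm and play the arm with the largest sample.
    let best = 0, bestDraw = -1;
    arms.forEach((arm, i) => {
      const draw = betaInt(arm.alpha, arm.beta, rng);
      if (draw > bestDraw) { bestDraw = draw; best = i; }
    });
    const success = rng() < trueRates[best]; // simulated user response
    if (success) arms[best].alpha += 1; else arms[best].beta += 1;
    arms[best].pulls += 1;
  }
  return arms;
}

const result = thompson([0.10, 0.30], 2000);
console.log(result.map((arm) => arm.pulls)); // most pulls should go to the second arm
```

Because traffic follows the posterior draws, the weaker arm keeps receiving occasional exploratory traffic early on and less and less as its posterior separates from the stronger arm's.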

Click-through rate estimation in ads

Each ad's CTR posterior is a Beta updated as impressions and clicks accumulate. The "Bayesian average" used by rating sites such as IMDb (a global prior smooths small-sample averages) has the same shrinkage form as a Beta posterior mean, (clicks + α)/(impressions + α + β), with α + β acting as the pseudo-count weight given to the prior.

Reliability and quality control

For pass/fail testing of components, a Beta prior on the failure rate gives credible bounds even with small samples. Beta(1, 1) plus zero failures in n tests yields a Beta(1, n + 1) posterior with mean 1/(n + 2), far more useful than the maximum-likelihood estimate of 0.
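
For the zero-failure case the upper bound is available in closed form, because the CDF of Beta(1, n + 1) is 1 − (1 − x)^(n + 1). The sketch below assumes a uniform prior and a one-sided bound; the function name `zeroFailureSummary` and the test counts are hypothetical.

```javascript
// Zero failures in n pass/fail tests with a uniform Beta(1, 1) prior on the
// failure rate gives a Beta(1, n + 1) posterior. Its CDF is 1 - (1 - x)^(n + 1),
// so the one-sided upper credible bound has a closed form.
function zeroFailureSummary(n, level = 0.95) {
  const mean = 1 / (n + 2);
  const upperBound = 1 - Math.pow(1 - level, 1 / (n + 1));
  return { mean, upperBound };
}

for (const n of [10, 59, 300]) {
  const { mean, upperBound } = zeroFailureSummary(n);
  console.log(n, mean.toFixed(4), upperBound.toFixed(4));
}
```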

Smoothing in language models

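Add-one (Laplace) smoothing for unigram probabilities is the posterior mean under a symmetric Dirichlet(1, …, 1) prior over the vocabulary, which reduces to Beta(1, 1) when there are only two outcomes. Add-k smoothing corresponds to Dirichlet(k, …, k), or Beta(k, k) in the binary case. Label smoothing in neural models is a loosely related idea: it also pulls probability mass toward a uniform distribution.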
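
As a small worked sketch, add-k smoothing over a fixed vocabulary of V words is (count + k)/(total + kV), which is the posterior mean under a symmetric Dirichlet(k, …, k) prior and reduces to the Beta(k, k) case when V = 2. The function name `addKSmoothing` and the toy counts are made up for illustration.

```javascript
// Add-k smoothing over a fixed vocabulary: (count + k) / (total + k * V).
// The toy counts below are invented for this example.
function addKSmoothing(counts, k = 1) {
  const words = Object.keys(counts);
  const total = words.reduce((sum, w) => sum + counts[w], 0);
  const denom = total + k * words.length;
  const probs = {};
  for (const w of words) probs[w] = (counts[w] + k) / denom;
  return probs;
}

const toyCounts = { the: 3, cat: 1, sat: 0 };
console.log(addKSmoothing(toyCounts, 1));   // the: 4/7, cat: 2/7, sat: 1/7 (printed as decimals)
console.log(addKSmoothing(toyCounts, 0.5)); // smaller k stays closer to the raw frequencies
```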

Beta regression

For regression with a response in (0, 1) — e.g. fractions, proportions, rates — Beta regression replaces the Normal likelihood with a Beta and links covariates to the mean via a logit. Better calibrated than Normal regression on rates near 0 or 1.

Frequently asked questions

Why is Beta the conjugate prior for the Bernoulli/binomial?

Conjugacy means the posterior stays in the same family as the prior; only the parameters change. The binomial likelihood for k successes in n trials given probability p is, up to a constant, p^k (1−p)^(n−k), which has the same algebraic shape as the Beta density p^(α−1)(1−p)^(β−1). Multiply them: p^(α+k−1)(1−p)^(β+n−k−1), a Beta(α + k, β + n − k) kernel. The normalizing constant simply becomes 1/B(α + k, β + n − k). So priors and posteriors live in the same two-parameter Beta family, and Bayes updates are simple addition of observed counts to α and β. This makes streaming Bayesian inference cheap: two numbers carry the entire posterior.

What do α and β mean intuitively?

Treat them as pseudo-counts. α − 1 counts prior successes; β − 1 counts prior failures. Beta(1, 1) is uniform: zero pseudo-counts, no information. Beta(2, 2) means "I've seen one prior success and one prior failure", so the mean is 1/2 but with considerable uncertainty. Beta(50, 50) is sharp around 0.5, like having seen 49 successes and 49 failures already, so 100 new observations move the posterior mean only about halfway toward the observed rate. Measured relative to the uniform prior, the "effective sample size" of a Beta prior is α + β − 2, which is how many fresh data points the prior is worth; many texts quote α + β instead, counting from the improper Beta(0, 0). Choosing α + β small (e.g. Beta(2, 2)) gives a weakly informative prior; choosing α + β large bakes in strong domain knowledge.

What's the difference between Beta and Dirichlet?

Dirichlet is the multivariate generalization. Beta(α, β) is a distribution over a single proportion p ∈ [0, 1] (with 1 − p as the implicit second category). Dirichlet(α₁, …, αₖ) is a distribution over k proportions (p₁, …, pₖ) summing to 1 — the conjugate prior for categorical/multinomial likelihoods. Beta is Dirichlet with k = 2. Same conjugacy story: prior Dirichlet(α) + observed counts n becomes posterior Dirichlet(α + n). Both are members of the exponential family and share the same pseudo-count intuition.

What's the Jeffreys prior for a proportion, and why use it?

The Jeffreys prior for a Bernoulli proportion is Beta(1/2, 1/2), the arc-sine distribution: U-shaped, with density spiking at 0 and 1. It is derived from the Fisher information and is invariant under reparameterization (a property the uniform Beta(1, 1) lacks). For inference, credible intervals from the Jeffreys posterior typically have better frequentist coverage than those from the uniform prior when data are sparse. It is the standard "objective" prior for binomial inference when no domain knowledge is available. For most engineering uses, Beta(1, 1) (uniform) is fine and easier to explain; Jeffreys matters when you need calibrated tail behavior.

How is Beta used in Bayesian A/B testing?

Each variant gets its own Beta posterior over its true conversion rate. Start with Beta(1, 1) (or any weak prior) for each arm. As impressions accrue, update each posterior by adding successes to α and failures to β. To answer "is variant B better than A?", compute P(p_B > p_A) by Monte Carlo: draw N samples from each posterior, count the fraction with p_B > p_A. Decision rules include "stop when P(p_B > p_A) > 0.95", "stop when expected loss < threshold", or Thompson sampling for adaptive allocation (sample p_A, p_B from posteriors; route the next user to whichever sample is larger). Beta priors give closed-form posteriors and cheap Monte Carlo — that's why Beta dominates A/B platforms.

When is the Beta a bad choice of prior?

When your unknown is not a proportion in [0, 1]: Beta is a distribution on a bounded interval, so it cannot model unbounded quantities (use Gamma for rates, Normal for means). When you have multimodal beliefs about a proportion (e.g. you think the conversion rate is either 1% or 30% but not in between): a single Beta cannot place two separate peaks inside the interval, so use a mixture of Betas. When you want to model correlated proportions (e.g. several treatments where success on one affects belief about another): use a hierarchical model with a shared hyperprior on (α, β), not independent Betas. And when α, β < 1 the density diverges at the endpoints, which is fine as a prior but unintuitive when reporting credible intervals.

Beta vs Dirichlet vs Normal (and friends)

Picking the right conjugate prior — at a glance.

Distribution	Support	Conjugate to	Parameters	Posterior update	When to reach for it
Beta(α, β)	[0, 1]	Bernoulli, binomial	α, β > 0 (pseudo-counts)	(α, β) → (α + k, β + n − k)	One proportion: CTR, conversion rate
Dirichlet(α)	Simplex Δᵏ⁻¹	Categorical, multinomial	α₁, …, αₖ > 0	α_j → α_j + n_j	k proportions summing to 1
Normal-Gamma	ℝ × (0, ∞)	Normal with unknown μ, σ²	μ₀, λ, α, β	Standard formulas	Mean + precision of a continuous quantity
Gamma(α, β)	(0, ∞)	Poisson, exponential	α (shape), β (rate)	Poisson: (α, β) → (α + Σx_i, β + n); exponential: (α, β) → (α + n, β + Σx_i)	Counts/waiting-time rates
Beta-binomial	{0, 1, …, n}	Not a prior (marginal of binomial under a Beta prior)	n, α, β	None (fit α, β by maximum likelihood or MCMC)	Overdispersed binomial counts
Kumaraswamy(a, b)	[0, 1]	Not conjugate	a, b > 0	No closed form (MCMC or numerical integration)	Need closed-form CDF/quantiles
Logit-Normal	(0, 1)	Not conjugate	μ, σ on logit scale	(MCMC / variational)	Combine with regression structure