Bayes' Theorem
Update beliefs from new evidence — P(H|E) = P(E|H)·P(H) / P(E)
Bayes' theorem tells you how to update the probability of a hypothesis given new evidence. P(H|E) = P(E|H) × P(H) / P(E). It's the foundation of Bayesian statistics, medical testing accuracy, spam filters, machine learning, and rational reasoning under uncertainty. The famous result — a positive medical test for a rare disease usually doesn't mean you have it.
- Formula: P(H|E) = P(E|H) × P(H) / P(E)
- P(H): Prior, the initial belief
- P(E|H): Likelihood, how well the hypothesis explains the evidence
- P(H|E): Posterior, the updated belief after seeing evidence
- Discovered by: Thomas Bayes (1701-1761; published posthumously 1763)
- Famous application: Medical testing, where positive results for rare conditions are mostly false positives
The formula
Bayes' theorem relates two conditional probabilities:
P(H | E) = P(E | H) × P(H) / P(E)
where:
- P(H) — the prior probability of hypothesis H before seeing evidence E.
- P(E | H) — the likelihood — probability of seeing evidence E if H is true.
- P(H | E) — the posterior probability of H after seeing E.
- P(E) — the marginal probability of seeing E (averaged over all hypotheses).
The formula tells you how to update your belief P(H) into P(H | E) when you observe evidence.
Derivation
From the definition of conditional probability:
P(H | E) = P(H ∩ E) / P(E)
P(E | H) = P(H ∩ E) / P(H)
Solve the second for P(H ∩ E):
P(H ∩ E) = P(E | H) · P(H)
Substitute into the first:
P(H | E) = P(E | H) · P(H) / P(E)
That's it — two lines. The theorem isn't deep; the implications are.
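The derivation can be checked numerically. The sketch below assumes a hypothetical joint distribution over H and E (the four numbers are made up for illustration) and compares the posterior computed directly from the definition with the one rebuilt through Bayes' theorem.

```javascript
// Hypothetical joint probabilities for (H, E). The values are illustrative
// assumptions and must sum to 1.
const joint = {
  hAndE: 0.12,       // P(H ∩ E)
  hAndNotE: 0.08,    // P(H ∩ not E)
  notHAndE: 0.04,    // P(not H ∩ E)
  notHAndNotE: 0.76, // P(not H ∩ not E)
};

// Marginals obtained by summing the joint.
const pH = joint.hAndE + joint.hAndNotE; // P(H)
const pE = joint.hAndE + joint.notHAndE; // P(E)

// Definition of conditional probability.
const pHGivenE = joint.hAndE / pE; // P(H | E), computed directly
const pEGivenH = joint.hAndE / pH; // P(E | H)

// Bayes' theorem: rebuild P(H | E) from P(E | H), P(H), and P(E).
const viaBayes = (pEGivenH * pH) / pE;

console.log(pHGivenE.toFixed(4), viaBayes.toFixed(4)); // prints 0.7500 0.7500
```

Both routes give the same number because they are the same identity rearranged; the only requirement is that P(E) and P(H) are nonzero.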
The classic medical-testing example
A disease affects 0.1% of the population. A test for it has 99% sensitivity (true positive rate) and 99% specificity (true negative rate). Someone tests positive. What's the probability they have the disease?
Set up:
- P(disease) = 0.001 (prior)
- P(positive | disease) = 0.99 (likelihood)
- P(positive | no disease) = 0.01 (false positive rate)
P(positive) — by total probability:
P(positive) = P(positive | disease) · P(disease) + P(positive | no disease) · P(no disease)
= 0.99 · 0.001 + 0.01 · 0.999
= 0.00099 + 0.00999
= 0.01098
Apply Bayes:
P(disease | positive) = P(positive | disease) · P(disease) / P(positive)
= 0.99 · 0.001 / 0.01098
≈ 0.0902 ≈ 9%
A positive test for this rare disease means only ~9% chance of actually having it. Most positive tests are false positives because the disease is rare. This is wildly counterintuitive — the test is "99% accurate" but the result is misleading. Doctors and patients alike systematically overestimate the meaning of positive tests for rare conditions.
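To see how strongly the base rate drives this result, here is a minimal sketch that reruns the same calculation at several prevalences. The function name `positivePredictiveValue` and the extra prevalence values are assumptions chosen for illustration; only the 0.1% case comes from the example above.

```javascript
// Posterior probability of disease given a positive test
// (often called the positive predictive value).
function positivePredictiveValue(prevalence, sensitivity, specificity) {
  const truePositive = sensitivity * prevalence;               // P(positive ∩ disease)
  const falsePositive = (1 - specificity) * (1 - prevalence);  // P(positive ∩ no disease)
  return truePositive / (truePositive + falsePositive);
}

// The same 99% / 99% test applied at different (hypothetical) prevalences.
for (const prevalence of [0.001, 0.01, 0.1, 0.5]) {
  const ppv = positivePredictiveValue(prevalence, 0.99, 0.99);
  console.log(`prevalence ${prevalence}: P(disease | positive) = ${(ppv * 100).toFixed(1)}%`);
}
```

With these inputs the posterior climbs from roughly 9% at a prevalence of 0.1% to 50% at 1%, about 92% at 10%, and 99% at 50%. The test is unchanged throughout; only the prior moves.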
More worked examples
Example 1 — drug testing
A drug test detects 99% of users. False positive rate is 5%. Among the population, 0.5% are users. Someone tests positive. P(user | positive)?
P(positive) = 0.99 · 0.005 + 0.05 · 0.995 = 0.00495 + 0.04975 = 0.0547
P(user | positive) = (0.99 · 0.005) / 0.0547 ≈ 0.0905 ≈ 9%
Less than 10% of positive tests are actual users. Without knowing the false-positive rate and the base rate, the "99% accuracy" sounds reassuring; with Bayes, it's clearly inadequate.
Example 2 — email spam
20% of emails are spam. The word "free" appears in 60% of spam, 5% of ham. An email contains "free." P(spam | "free")?
P("free") = 0.60 · 0.20 + 0.05 · 0.80 = 0.12 + 0.04 = 0.16
P(spam | "free") = (0.60 · 0.20) / 0.16 = 0.12 / 0.16 = 0.75
One spam-keyword bumps the posterior to 75%. With several spam keywords (combining via independent likelihood — naive Bayes), the posterior approaches 1. This is the basis of Bayesian spam filtering.
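The multi-keyword case can be sketched in log-odds form, where each word contributes its log likelihood ratio. The figures for "free" are the ones used above; the second word, "winner", and its probabilities are hypothetical, and the words are assumed independent given the class.

```javascript
// Combine several words under the naive Bayes independence assumption,
// working in log-odds to avoid multiplying many small numbers.
function spamPosterior(priorSpam, wordLikelihoods) {
  // Start from the prior log-odds log(P(spam) / P(ham)).
  let logOdds = Math.log(priorSpam / (1 - priorSpam));
  for (const { pGivenSpam, pGivenHam } of wordLikelihoods) {
    // Each observed word adds its log likelihood ratio.
    logOdds += Math.log(pGivenSpam / pGivenHam);
  }
  // Convert log-odds back to a probability.
  return 1 / (1 + Math.exp(-logOdds));
}

const free = { pGivenSpam: 0.60, pGivenHam: 0.05 };   // from the example above
const winner = { pGivenSpam: 0.30, pGivenHam: 0.01 }; // hypothetical second word

console.log(spamPosterior(0.20, [free]).toFixed(4));         // one word
console.log(spamPosterior(0.20, [free, winner]).toFixed(4)); // two words
```

One word reproduces the 0.75 computed by hand; adding the hypothetical second word raises the posterior to about 0.989.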
Example 3 — sequential updates
Update from prior to posterior with one piece of evidence. The posterior becomes the new prior for the next piece. This is "Bayesian updating" — beliefs evolve with each new datum:
Prior_0 → after E_1 → Posterior_1 = Prior_1
→ after E_2 → Posterior_2 = Prior_2
→ ...
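The order doesn't matter, provided the pieces of evidence are conditionally independent given the hypothesis: the final posterior is the same either way. This is the core of "Bayesian rationality": once the prior and likelihoods are fixed, the rules of probability determine how far each piece of evidence should move your belief.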
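A small check of the order-independence claim, as a sketch: two hypothetical pieces of evidence (their likelihoods are made-up values) are applied in both orders, assuming they are conditionally independent given H and given not-H.

```javascript
// One Bayesian update for a binary hypothesis.
function update(prior, pEvidenceGivenH, pEvidenceGivenNotH) {
  const numerator = pEvidenceGivenH * prior;
  return numerator / (numerator + pEvidenceGivenNotH * (1 - prior));
}

// Two hypothetical pieces of evidence, each given as
// [P(E | H), P(E | not H)].
const e1 = [0.9, 0.2];
const e2 = [0.7, 0.4];
const prior0 = 0.1;

// Yesterday's posterior is today's prior, in either order.
const e1ThenE2 = update(update(prior0, ...e1), ...e2);
const e2ThenE1 = update(update(prior0, ...e2), ...e1);

console.log(e1ThenE2.toFixed(6), e2ThenE1.toFixed(6)); // the two values agree
```

In odds form this is just multiplication: the prior odds are multiplied by each likelihood ratio, and multiplication commutes.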
JavaScript — Bayesian updating
```javascript
// Returns the posterior P(H | E) for a binary hypothesis.
//   prior          = P(H)
//   likelihoodH    = P(E | H)
//   likelihoodNotH = P(E | not H)
function bayes(prior, likelihoodH, likelihoodNotH) {
  // P(E) by the law of total probability
  const evidence = likelihoodH * prior + likelihoodNotH * (1 - prior);
  return (likelihoodH * prior) / evidence;
}

// Medical test example
const posterior = bayes(0.001, 0.99, 0.01);
console.log(posterior); // ≈ 0.0902, about 9%

// Sequential updating: each posterior becomes the next prior
let p = 0.001;
for (let i = 0; i < 3; i++) {
  p = bayes(p, 0.99, 0.01);
  console.log(`After test ${i + 1}: ${(p * 100).toFixed(2)}%`);
}
// After test 1: 9.02%
// After test 2: 90.75%
// After test 3: 99.90%
// Three positive tests in a row push the posterior close to 100% despite the
// rare prior, provided the tests are independent given the true disease status.
```
Bayesian vs frequentist worldviews
| | Bayesian | Frequentist |
|---|---|---|
| What is probability? | Degree of belief | Long-run frequency of repeated events |
| Parameters are | Random variables with distributions | Fixed unknown constants |
| Prior | Required (encode prior beliefs) | Doesn't exist (no prior beliefs in the framework) |
| Output | Posterior distribution over parameters | Point estimates + confidence intervals |
| Used in | ML probabilistic models, decision theory, philosophy | Hypothesis testing, regression, classical statistics |
| Computational cost | High (often MCMC) | Lower (closed-form or asymptotic) |
| Subjectivity | Explicit via the prior | Hidden in test choices and significance levels |
Most modern statisticians use both, depending on the problem. Bayesian methods are common in probabilistic ML; frequentist methods dominate classical hypothesis testing. The philosophical debate has cooled, and in practice the use case decides which to apply.
Where Bayes' theorem appears
- Medical diagnosis. Combining test results, symptoms, and disease prevalence to estimate the probability of a condition. Doctors who reason in Bayesian terms avoid the "false positive trap" for rare diseases.
- Machine learning — Bayesian networks, Bayesian inference. Probabilistic graphical models, Bayesian neural networks, Gaussian processes. Bayesian uncertainty quantification — knowing what the model doesn't know.
- Spam filtering. Naive Bayes was the dominant spam filter for a decade. Still effective for many text classification tasks.
- Search and recommendation. Probabilistic ranking, A/B testing with Thompson sampling (a Bayesian bandit method), personalization.
- Forensics. DNA matching, fingerprint matching — interpreting evidence requires P(innocent | match) which uses Bayes.
- Cryptography and signal detection. Bayesian decision rules optimize detection thresholds when there's a known prior over signal types.
- Rational decision-making. "What should I believe given this evidence?" Bayes is the answer when you can quantify priors and likelihoods.
Common mistakes
- Confusing P(A|B) with P(B|A). The prosecutor's fallacy. P(positive test | healthy) is the false positive rate of the test. P(healthy | positive test) is what you actually care about. They're different; Bayes connects them.
- Ignoring the base rate. A "99% accurate" test sounds great until you account for prevalence. For rare conditions (low P(H)), even very accurate tests produce mostly false positives.
- Poor prior selection. A flat (uniform) prior is rarely "uninformative" in the way it claims: a prior that is flat in one parameterization is not flat in another. Logarithmic, Jeffreys, and reference priors all encode different no-information assumptions and can give different posteriors.
- Naive Bayes assuming independence too strongly. Spam filters assume words are independent given class — false. Empirically they still work; theoretically they're suboptimal. Modern alternatives relax this.
- Forgetting that P(E) is just normalization. Computing the posterior up to proportionality (P(H|E) ∝ P(E|H)·P(H)) and normalizing at the end is often easier than computing P(E) directly.
- Sequential updating with non-independent evidence. Two correlated tests (for example, both based on the same biomarker) do not count as two independent updates. Treating them as independent overcounts the evidence and produces an overconfident posterior.
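The point about P(E) being a normalizing constant is easiest to see with more than two hypotheses. The following is a minimal sketch: the three-coin setup, its priors, and its likelihoods are hypothetical values chosen for illustration.

```javascript
// Posterior over several competing hypotheses: multiply prior by likelihood,
// then divide by the sum so the results add up to 1. That sum is P(E).
function posteriorOver(hypotheses) {
  const unnormalized = hypotheses.map((h) => h.prior * h.likelihood);
  const evidence = unnormalized.reduce((sum, x) => sum + x, 0);
  return hypotheses.map((h, i) => ({
    name: h.name,
    posterior: unnormalized[i] / evidence,
  }));
}

// Hypothetical example: which of three coins produced an observed head?
// `likelihood` is P(heads | coin).
const coins = [
  { name: "fair", prior: 0.5, likelihood: 0.5 },
  { name: "biased-heads", prior: 0.25, likelihood: 0.9 },
  { name: "biased-tails", prior: 0.25, likelihood: 0.1 },
];

for (const { name, posterior } of posteriorOver(coins)) {
  console.log(`${name}: ${posterior.toFixed(3)}`);
}
```

P(E) never has to be supplied separately here; it falls out as the sum of the unnormalized terms.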
Frequently asked questions
Why does a positive medical test often not mean you have the disease?
Base rate fallacy. If a disease affects 1 in 1000, even a 99% accurate test produces mostly false positives. Among 1000 people, about 1 has the disease and almost certainly tests positive (0.99 expected true positives). Among the 999 without it, 1% test positive: 9.99 false positives. So the test flags about 11 people, and only about 1 of them has the disease. P(disease | positive) ≈ 1/11 ≈ 9%, not 99%. Bayes' theorem makes this explicit; intuition routinely fails.
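The same counting argument, sketched for a larger cohort so the counts come out as whole numbers. The cohort size of 100,000 and the function name `expectedCounts` are assumptions for illustration; the rates are the ones in the answer above.

```javascript
// Expected counts in a cohort for a given prevalence and test accuracy.
function expectedCounts(cohortSize, prevalence, sensitivity, specificity) {
  const sick = cohortSize * prevalence;
  const healthy = cohortSize - sick;
  const truePositives = sick * sensitivity;
  const falsePositives = healthy * (1 - specificity);
  return {
    truePositives,
    falsePositives,
    // Fraction of all positive results that belong to people with the disease.
    shareOfPositivesWithDisease: truePositives / (truePositives + falsePositives),
  };
}

const counts = expectedCounts(100000, 0.001, 0.99, 0.99);
console.log(counts);
```

Out of 100,000 people this gives about 99 true positives against about 999 false positives, so roughly 9% of the positive results are real.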
What's the difference between P(A|B) and P(B|A)?
They're conditional probabilities with the roles of the two events swapped. P(A|B) is the probability of A given that B happened; P(B|A) is the probability of B given that A happened. Confusing them is the "prosecutor's fallacy": what matters in court is P(innocent | evidence), not P(evidence | innocent). Bayes' theorem connects the two via the prior.
What's a prior and how do I choose it?
The prior P(H) encodes your belief before seeing evidence. Choosing it is part art, part rigor. Options — uniform (no preference), informative (based on past data or expert knowledge), conjugate (mathematically convenient — produces posteriors in the same family). For most ML applications, the choice matters less as data grows; for small data, it dominates the posterior.
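As an illustration of a conjugate prior and of the prior's influence fading with more data, here is a sketch using the Beta prior for a success probability. The two priors and the data counts are hypothetical; the update rule (a Beta(a, b) prior plus binomial data gives a Beta(a + successes, b + failures) posterior) is the standard conjugate result.

```javascript
// Posterior mean of a success probability under a Beta(a, b) prior
// after observing the given successes and failures.
function betaPosteriorMean(a, b, successes, failures) {
  return (a + successes) / (a + b + successes + failures);
}

// Two hypothetical priors: uniform Beta(1, 1) and a skeptical Beta(2, 18)
// whose prior mean is 0.10.
const priors = { uniform: [1, 1], skeptical: [2, 18] };

// Hypothetical data with a 30% observed success rate at two sample sizes.
for (const [successes, failures] of [[3, 7], [300, 700]]) {
  for (const [name, [a, b]] of Object.entries(priors)) {
    const mean = betaPosteriorMean(a, b, successes, failures);
    console.log(`${name}, n=${successes + failures}: ${mean.toFixed(3)}`);
  }
}
```

With 10 observations the two priors disagree noticeably (about 0.333 versus 0.167); with 1000 observations they nearly coincide (about 0.300 versus 0.296).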
What's the difference between Bayesian and frequentist statistics?
Frequentist treats probabilities as long-run frequencies of repeated events. Bayesian treats probabilities as degrees of belief that update with evidence. Hypothesis testing is mostly frequentist (p-values). ML methods like Bayesian networks and Bayesian inference are explicitly Bayesian. Different philosophies, often complementary; both useful.
How do spam filters use Bayes?
Naive Bayes — for each word in an email, compute P(word | spam) and P(word | not-spam) from training data. Multiply (assuming word independence) to get P(email | spam) and P(email | not-spam). Apply Bayes to combine with priors P(spam) and get the posterior P(spam | email). If posterior > threshold, mark as spam. Dominant spam-filtering technique 2002-2010, partly displaced by deep learning since.
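The steps above can be sketched as a toy classifier. Everything specific here is an assumption for illustration: the four-message training set is made up, tokenization is a plain whitespace split, words are counted by occurrence (the multinomial variant), and add-one (Laplace) smoothing is added so an unseen word does not force a probability of zero. A production filter would differ in all of these.

```javascript
// Count words per class from a tiny, made-up training set.
function train(examples) {
  const classes = {
    spam: { docs: 0, totalWords: 0, wordCounts: new Map() },
    ham: { docs: 0, totalWords: 0, wordCounts: new Map() },
  };
  const vocabulary = new Set();
  for (const { text, label } of examples) {
    const cls = classes[label];
    cls.docs += 1;
    for (const word of text.toLowerCase().split(/\s+/)) {
      cls.wordCounts.set(word, (cls.wordCounts.get(word) || 0) + 1);
      cls.totalWords += 1;
      vocabulary.add(word);
    }
  }
  return { classes, vocabularySize: vocabulary.size, totalDocs: examples.length };
}

// P(spam | text) under naive Bayes, computed in log space.
function spamProbability(model, text) {
  const logScores = {};
  for (const label of ["spam", "ham"]) {
    const cls = model.classes[label];
    let logScore = Math.log(cls.docs / model.totalDocs); // log prior
    for (const word of text.toLowerCase().split(/\s+/)) {
      const count = cls.wordCounts.get(word) || 0;
      // Add-one smoothed estimate of P(word | class); smoothing is an assumption.
      logScore += Math.log((count + 1) / (cls.totalWords + model.vocabularySize));
    }
    logScores[label] = logScore;
  }
  // Normalize the two log scores into a posterior for spam.
  return 1 / (1 + Math.exp(logScores.ham - logScores.spam));
}

const trained = train([
  { text: "free money now", label: "spam" },
  { text: "win free prize", label: "spam" },
  { text: "meeting at noon", label: "ham" },
  { text: "lunch money tomorrow", label: "ham" },
]);

console.log(spamProbability(trained, "free prize") > 0.5);       // true
console.log(spamProbability(trained, "meeting tomorrow") > 0.5); // false
```

Summing logs instead of multiplying probabilities keeps long emails from underflowing to zero; the final line of `spamProbability` is Bayes' theorem with P(email) cancelled out as the normalizer.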
What's the prosecutor's fallacy?
Confusing P(evidence | innocent) with P(innocent | evidence). A DNA match might have P(match | innocent) = 1 in a million, which sounds damning. But that is not P(innocent | match). If millions of people are screened, a few innocent matches are expected by chance, so the posterior depends on how the suspect was identified and on the other evidence. In the famous Sally Clark case (UK, 1999), a mother was wrongfully convicted of murdering her babies after a statistic was misread in this way.
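A rough numerical sketch of the gap between the two probabilities. It assumes, hypothetically, that exactly one person in a pool of equally likely candidates is the true source, that the true source always matches, and that there is no other evidence; real cases are more complicated.

```javascript
// Posterior probability that a matching person is the true source, assuming
// one true source among `population` equally likely candidates.
function pSourceGivenMatch(population, pMatchGivenInnocent) {
  const prior = 1 / population; // P(source) before the match
  const pMatchGivenSource = 1;  // assumed: the true source always matches
  const pMatch = pMatchGivenSource * prior + pMatchGivenInnocent * (1 - prior);
  return (pMatchGivenSource * prior) / pMatch;
}

const pMatchGivenInnocent = 1e-6; // the "1 in a million" random-match probability
for (const population of [1000, 1000000, 10000000]) {
  const pSource = pSourceGivenMatch(population, pMatchGivenInnocent);
  console.log(`population ${population}: P(innocent | match) = ${(1 - pSource).toFixed(3)}`);
}
```

With the same one-in-a-million match probability, P(innocent | match) is about 0.001 for a pool of a thousand, about 0.5 for a pool of a million, and about 0.91 for a pool of ten million. The match statistic alone does not settle the question; the prior does much of the work.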
Can Bayes' theorem be derived?
Yes, in two lines from the definition of conditional probability. P(A|B) = P(A∩B)/P(B) and P(B|A) = P(A∩B)/P(A). Solve both for P(A∩B), set them equal — P(A|B)·P(B) = P(B|A)·P(A) — divide by P(B) — Bayes' theorem. The whole machinery follows from one definition.