Probability

Bayes' Theorem

Name: Bayes' Theorem — 60-second explainer
Uploaded: 2026-05-13T09:57:36Z
Duration: 1 min
Description: Bayes' theorem tells you how to update the probability of a hypothesis given new evidence. P(H|E) = P(E|H) × P(H) / P(E). It's the foundation of Bayesian statistics, medical testing accuracy, spam filters, machine learning, and rational reasoning under uncertainty. The famous result — a positive medical test for a rare disease usually doesn't mean you have it.

Update beliefs from new evidence — P(H|E) = P(E|H)·P(H) / P(E)

Formula: P(H|E) = P(E|H) × P(H) / P(E)
P(H): Prior, the initial belief
P(E|H): Likelihood, how well the hypothesis explains the evidence
P(H|E): Posterior, the updated belief after seeing evidence
Discovered by: Thomas Bayes (1701-1761; published posthumously 1763)
Famous application: Medical testing, where most positive results for a rare condition are false positives

The formula

Bayes' theorem relates two conditional probabilities:

P(H | E) = P(E | H) × P(H) / P(E)

where:

P(H) — the prior probability of hypothesis H before seeing evidence E.
P(E | H) — the likelihood — probability of seeing evidence E if H is true.
P(H | E) — the posterior probability of H after seeing E.
P(E) — the marginal probability of seeing E (averaged over all hypotheses).

The formula tells you how to update your belief P(H) into P(H | E) when you observe evidence.

Derivation

From the definition of conditional probability:

P(H | E) = P(H ∩ E) / P(E)
P(E | H) = P(H ∩ E) / P(H)

Solve the second for P(H ∩ E):

P(H ∩ E) = P(E | H) · P(H)

Substitute into the first:

P(H | E) = P(E | H) · P(H) / P(E)

That's it — two lines. The theorem isn't deep; the implications are.
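As a quick numeric check of the derivation, the sketch below builds a small joint distribution over H and E and computes P(H | E) two ways: directly from the definition and via Bayes' theorem. The four joint probabilities are illustrative values chosen for this example, not data from the text.

```javascript
// Hypothetical joint probabilities over (H, E); they sum to 1.
const joint = {
  hAndE: 0.12,
  hAndNotE: 0.08,
  notHAndE: 0.04,
  notHAndNotE: 0.76,
};

// Marginals obtained by summing the joint table.
const pH = joint.hAndE + joint.hAndNotE;
const pE = joint.hAndE + joint.notHAndE;

// Route 1: definition of conditional probability, P(H | E) = P(H ∩ E) / P(E).
const direct = joint.hAndE / pE;

// Route 2: Bayes' theorem, P(H | E) = P(E | H) · P(H) / P(E).
const pEgivenH = joint.hAndE / pH;
const viaBayes = (pEgivenH * pH) / pE;

console.log(direct, viaBayes); // both approximately 0.75
```

Both routes agree up to floating-point rounding, which is all the derivation claims.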

The classic medical-testing example

A disease affects 0.1% of the population. A test for it has 99% sensitivity (true positive rate) and 99% specificity (true negative rate). Someone tests positive. What's the probability they have the disease?

Set up:

P(disease) = 0.001 (prior)
P(positive | disease) = 0.99 (likelihood)
P(positive | no disease) = 0.01 (false positive rate)

P(positive) — by total probability:

P(positive) = P(positive | disease) · P(disease) + P(positive | no disease) · P(no disease)
            = 0.99 · 0.001 + 0.01 · 0.999
            = 0.00099 + 0.00999
            = 0.01098

Apply Bayes:

P(disease | positive) = P(positive | disease) · P(disease) / P(positive)
                      = 0.99 · 0.001 / 0.01098
                      ≈ 0.0902 ≈ 9%

A positive result on this test means only about a 9% chance of actually having the disease. Because the disease is rare, the healthy people who test positive (about 10 per 1,000 tested) far outnumber the true positives (about 1 per 1,000 tested). This is counterintuitive: the test is "99% accurate", yet most of its positive results are wrong. Doctors and patients alike tend to overestimate what a positive test for a rare condition means.
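To see how strongly the answer depends on the base rate, here is a minimal sketch that keeps the same 99% sensitivity and 99% specificity and varies only the prevalence. The function name and the prevalences other than 0.001 are assumptions made for illustration.

```javascript
// P(disease | positive test), sometimes called the positive predictive value.
function positivePredictiveValue(prevalence, sensitivity, specificity) {
  const truePositive = sensitivity * prevalence;
  const falsePositive = (1 - specificity) * (1 - prevalence);
  return truePositive / (truePositive + falsePositive);
}

// Same test, different (hypothetical) prevalences.
const prevalences = [0.001, 0.01, 0.1, 0.5];
const ppvByPrevalence = prevalences.map((p) =>
  positivePredictiveValue(p, 0.99, 0.99)
);

prevalences.forEach((p, i) => {
  const percent = (ppvByPrevalence[i] * 100).toFixed(1);
  console.log(`prevalence ${p}: P(disease | positive) = ${percent}%`);
});
```

With these inputs the posterior climbs from roughly 9% at a prevalence of 0.001 to about 50% at 0.01 and about 99% at 0.5, so the same test is weak or strong evidence depending on who is being tested.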

More worked examples

Example 1 — drug testing

A drug test detects 99% of users. False positive rate is 5%. Among the population, 0.5% are users. Someone tests positive. P(user | positive)?

P(positive) = 0.99 · 0.005 + 0.05 · 0.995 = 0.00495 + 0.04975 = 0.0547
P(user | positive) = (0.99 · 0.005) / 0.0547 ≈ 0.0905 ≈ 9%

Fewer than 10% of the people who test positive are actual users. Quoted alone, the 99% detection rate sounds reassuring; once the 5% false-positive rate and the 0.5% base rate are factored in, a single positive result is clearly weak evidence.

Example 2 — email spam

20% of emails are spam. The word "free" appears in 60% of spam, 5% of ham. An email contains "free." P(spam | "free")?

P("free") = 0.60 · 0.20 + 0.05 · 0.80 = 0.12 + 0.04 = 0.16
P(spam | "free") = (0.60 · 0.20) / 0.16 = 0.12 / 0.16 = 0.75

A single spam keyword raises the probability of spam from the 20% prior to a 75% posterior. With several spam keywords, combined by treating the words as conditionally independent given the class (naive Bayes), the posterior approaches 1. This is the basis of Bayesian spam filtering.
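The multi-keyword case can be sketched by adding log-likelihood ratios to the prior log-odds. Only the numbers for "free" come from the example above; the other two words and their likelihoods are hypothetical. The sketch scores only the words that are present, which is a simplification.

```javascript
// Word likelihoods as [P(word | spam), P(word | ham)].
// "free" matches the worked example; "winner" and "meeting" are made up.
const wordLikelihoods = {
  free: [0.60, 0.05],
  winner: [0.30, 0.01],
  meeting: [0.02, 0.20],
};

function spamPosterior(words, priorSpam = 0.20) {
  // Log-odds keep long products of small probabilities numerically stable.
  let logOdds = Math.log(priorSpam / (1 - priorSpam));
  for (const word of words) {
    const pair = wordLikelihoods[word];
    if (!pair) continue; // unknown words carry no evidence in this sketch
    const [pGivenSpam, pGivenHam] = pair;
    logOdds += Math.log(pGivenSpam / pGivenHam);
  }
  // Convert log-odds back to a probability.
  return 1 / (1 + Math.exp(-logOdds));
}

console.log(spamPosterior(["free"]));            // about 0.75
console.log(spamPosterior(["free", "winner"]));  // about 0.989
console.log(spamPosterior(["free", "meeting"])); // about 0.231
```

A word that is more common in ham, such as the hypothetical "meeting", pulls the posterior back down, so the same mechanism handles evidence in both directions.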

Example 3 — sequential updates

Update from prior to posterior with one piece of evidence. The posterior becomes the new prior for the next piece. This is "Bayesian updating" — beliefs evolve with each new datum:

Prior_0 → after E_1 → Posterior_1 = Prior_1
                  → after E_2 → Posterior_2 = Prior_2
                              → ...

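The order doesn't matter when the pieces of evidence are conditionally independent given the hypothesis: the final posterior is the same, and it equals the result of a single update on all the evidence at once. This is what is usually meant by "Bayesian rationality": given a prior and the likelihoods, the rules of probability fix the posterior.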
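The order-independence claim can be checked numerically. In this sketch the prior and the two likelihood pairs are hypothetical values, and the two pieces of evidence are assumed to be conditionally independent given H.

```javascript
// One Bayesian update for a binary hypothesis.
// likelihoodH = P(E | H), likelihoodNotH = P(E | not H).
function update(prior, likelihoodH, likelihoodNotH) {
  const evidence = likelihoodH * prior + likelihoodNotH * (1 - prior);
  return (likelihoodH * prior) / evidence;
}

// Hypothetical evidence, assumed conditionally independent given H.
const e1 = { ifH: 0.9, ifNotH: 0.2 };
const e2 = { ifH: 0.6, ifNotH: 0.3 };
const prior0 = 0.1;

// Apply the evidence in both orders.
const e1ThenE2 = update(update(prior0, e1.ifH, e1.ifNotH), e2.ifH, e2.ifNotH);
const e2ThenE1 = update(update(prior0, e2.ifH, e2.ifNotH), e1.ifH, e1.ifNotH);

// Single batch update using the product of the likelihoods.
const batch = update(prior0, e1.ifH * e2.ifH, e1.ifNotH * e2.ifNotH);

console.log(e1ThenE2, e2ThenE1, batch); // all approximately 0.5
```

All three routes agree up to rounding. If the evidence were correlated, the product of likelihoods in the batch update would no longer be the right joint likelihood, and the agreement would be meaningless.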

JavaScript — Bayesian updating

function bayes(prior, likelihoodH, likelihoodNotH) {
  // Returns posterior P(H | E)
  const evidence = likelihoodH * prior + likelihoodNotH * (1 - prior);
  return (likelihoodH * prior) / evidence;
}

// Medical test example
const posterior = bayes(0.001, 0.99, 0.01);
console.log(posterior);  // ≈ 0.0902, about 9%

// Sequential updating — multiple positive tests
let p = 0.001;
for (let i = 0; i < 3; i++) {
  p = bayes(p, 0.99, 0.01);
  console.log(`After test ${i+1}: ${(p*100).toFixed(2)}%`);
}
// After test 1: 9.02%
// After test 2: 90.75%
// After test 3: 99.90%

// Three positive tests in a row push the posterior near 100% despite the
// rare prior, provided the tests are conditionally independent given the
// true disease status.

Bayesian vs frequentist worldviews

	Bayesian	Frequentist
What is probability?	Degree of belief	Long-run frequency of repeated events
Parameters are	Random variables with distributions	Fixed unknown constants
Prior	Required (encode prior beliefs)	Doesn't exist (no prior beliefs in the framework)
Output	Posterior distribution over parameters	Point estimates + confidence intervals
Used in	ML probabilistic models, decision theory, philosophy	Hypothesis testing, regression, classical statistics
Computational cost	High (often MCMC)	Lower (closed-form or asymptotic)
Subjectivity	Explicit via the prior	Hidden in test choices and significance levels

Most modern statisticians use both, depending on the problem. Bayesian methods are common in probabilistic machine learning; frequentist methods dominate classical hypothesis testing. The philosophical debate has cooled, and in practice the use case decides which to apply.

Where Bayes' theorem appears

Medical diagnosis. Combining test results, symptoms, and disease prevalence to estimate the probability of a condition. Doctors who reason in Bayesian terms avoid the "false positive trap" for rare diseases.
Machine learning — Bayesian networks, Bayesian inference. Probabilistic graphical models, Bayesian neural networks, Gaussian processes. Bayesian uncertainty quantification — knowing what the model doesn't know.
Spam filtering. Naive Bayes was the dominant spam filter for a decade. Still effective for many text classification tasks.
Search and recommendation. Probabilistic ranking, A/B testing with Thompson sampling (a Bayesian bandit method), personalization.
Forensics. DNA matching, fingerprint matching — interpreting evidence requires P(innocent | match) which uses Bayes.
Cryptography and signal detection. Bayesian decision rules optimize detection thresholds when there's a known prior over signal types.
Rational decision-making. "What should I believe given this evidence?" Bayes is the answer when you can quantify priors and likelihoods.

Common mistakes

Confusing P(A|B) with P(B|A). The prosecutor's fallacy. P(positive test | healthy) is the false positive rate of the test. P(healthy | positive test) is what you actually care about. They're different; Bayes connects them.
Ignoring the base rate. A "99% accurate" test sounds great until you account for prevalence. For rare conditions (low P(H)), even very accurate tests produce mostly false positives.
Improper prior selection. A flat (uniform) prior is rarely as "uninformative" as it sounds. Logarithmic, Jeffreys, and reference priors each encode a different no-information assumption and can lead to different posteriors.
Naive Bayes assuming independence too strongly. Spam filters assume words are independent given class — false. Empirically they still work; theoretically they're suboptimal. Modern alternatives relax this.
Forgetting that P(E) is just normalization. Computing the posterior up to proportionality (P(H|E) ∝ P(E|H)·P(H)) and normalizing at the end is often easier than computing P(E) directly.
Sequential updating with non-independent evidence. Two correlated tests (for example, both based on the same biomarker) do not provide independent updates. Treating them as independent double-counts the evidence and yields an overconfident posterior.
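The point in the list above about P(E) being just normalization can be sketched with more than two hypotheses. The setup here is hypothetical: three candidate coin biases with equal priors, and 8 heads observed in 10 flips.

```javascript
// Posterior over several hypotheses: multiply prior by likelihood,
// then normalize once at the end. The normalizing total is P(E).
function normalizedPosterior(priors, likelihoods) {
  const unnormalized = priors.map((prior, i) => prior * likelihoods[i]);
  const total = unnormalized.reduce((sum, x) => sum + x, 0);
  return unnormalized.map((x) => x / total);
}

// Hypothetical hypotheses: the coin lands heads with probability 0.3, 0.5, or 0.7.
const biases = [0.3, 0.5, 0.7];
const priors = [1 / 3, 1 / 3, 1 / 3];

// Likelihood of 8 heads and 2 tails. The binomial coefficient is identical
// for every hypothesis, so it cancels in the normalization and is omitted.
const likelihoods = biases.map((b) => b ** 8 * (1 - b) ** 2);

const coinPosterior = normalizedPosterior(priors, likelihoods);
console.log(coinPosterior.map((p) => p.toFixed(3)));
```

Most of the posterior mass lands on the 0.7 hypothesis. Any factor shared by all hypotheses can be dropped from the likelihoods this way, which is why P(E) rarely needs to be computed on its own.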

Frequently asked questions

Why does a positive medical test often not mean you have the disease?

This is the base rate fallacy. If a disease affects 1 in 1,000 people, even a 99% accurate test produces mostly false positives. Of 1,000 people tested, 1 has the disease and almost certainly tests positive (0.99 expected true positives). Of the 999 without it, 1% test positive (9.99 expected false positives). So the test flags about 11 people, and only 1 of them has the disease. P(disease | positive) ≈ 1/11 ≈ 9%, not 99%. Bayes' theorem makes this explicit; intuition routinely fails.
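The same counting argument can be written as a small sketch that tallies people instead of multiplying probabilities. The cohort size of 100,000 is an assumption, chosen so the expected counts come out as whole numbers.

```javascript
// Expected counts in a cohort, given prevalence and test characteristics.
function naturalFrequencies(population, prevalence, sensitivity, specificity) {
  const sick = population * prevalence;
  const healthy = population - sick;
  const truePositives = sick * sensitivity;
  const falsePositives = healthy * (1 - specificity);
  return {
    truePositives,
    falsePositives,
    shareOfPositivesSick: truePositives / (truePositives + falsePositives),
  };
}

// Hypothetical cohort of 100,000 people, same test as in the answer above.
const cohort = naturalFrequencies(100000, 0.001, 0.99, 0.99);
console.log(cohort);
```

In this cohort about 99 sick people and about 999 healthy people test positive, so roughly 9% of the positives are real.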

What's the difference between P(A|B) and P(B|A)?

They are different probabilities with different conditions. P(A|B) is the probability of A given that B happened; P(B|A) is the probability of B given that A happened. Confusing them is the "prosecutor's fallacy": what matters in court is P(innocent | evidence), not P(evidence | innocent). Bayes' theorem connects the two via the prior.

What's a prior and how do I choose it?

The prior P(H) encodes your belief before seeing evidence. Choosing it is part art, part rigor. Options — uniform (no preference), informative (based on past data or expert knowledge), conjugate (mathematically convenient — produces posteriors in the same family). For most ML applications, the choice matters less as data grows; for small data, it dominates the posterior.
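The claim that the prior matters less as data grows can be illustrated with a conjugate Beta-Binomial update, where a Beta(a, b) prior on a success rate combined with k successes in n trials gives a Beta(a + k, b + n − k) posterior. The two priors and the data counts below are hypothetical.

```javascript
// Conjugate update: Beta(a, b) prior plus k successes in n trials
// gives a Beta(a + k, b + n - k) posterior with mean a' / (a' + b').
function betaPosterior(prior, successes, trials) {
  const a = prior.a + successes;
  const b = prior.b + (trials - successes);
  return { a, b, mean: a / (a + b) };
}

const uniformPrior = { a: 1, b: 1 };       // flat over [0, 1]
const informativePrior = { a: 20, b: 80 }; // hypothetical belief centered on 0.2

// Same observed success rate (60%) at two sample sizes.
for (const [successes, trials] of [[6, 10], [600, 1000]]) {
  const fromUniform = betaPosterior(uniformPrior, successes, trials).mean;
  const fromInformative = betaPosterior(informativePrior, successes, trials).mean;
  console.log(
    `n=${trials}: uniform ${fromUniform.toFixed(3)}, informative ${fromInformative.toFixed(3)}`
  );
}
```

With 10 observations the two posterior means are far apart (about 0.58 versus 0.24); with 1,000 observations they are close (about 0.60 versus 0.56), because the data outweighs either prior.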

What's the difference between Bayesian and frequentist statistics?

Frequentist statistics treats probabilities as long-run frequencies of repeated events. Bayesian statistics treats probabilities as degrees of belief that update with evidence. Classical hypothesis testing (p-values) is mostly frequentist; methods such as Bayesian networks are explicitly Bayesian. The philosophies differ but are often complementary, and both are useful.

How do spam filters use Bayes?

Through naive Bayes. For each word in an email, estimate P(word | spam) and P(word | not-spam) from training data. Multiply these (assuming the words are independent given the class) to get P(email | spam) and P(email | not-spam). Apply Bayes' theorem with the prior P(spam) to get the posterior P(spam | email). If the posterior exceeds a threshold, mark the email as spam. This was the dominant spam-filtering technique from roughly 2002 to 2010 and has since been partly displaced by deep learning.

What's the prosecutor's fallacy?

It means confusing P(evidence | innocent) with P(innocent | evidence). A DNA match might have P(match | innocent) = 1 in a million, which sounds damning, but that is not P(innocent | match). If the suspect was found by screening millions of people, one or more innocent matches are expected by chance, so the posterior depends on how the suspect was identified. In the Sally Clark case (UK, 1999), a mother was wrongfully convicted of murdering her two babies on the strength of misinterpreted statistical evidence of this kind.
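A minimal sketch of the DNA scenario shows how far apart the two probabilities can be. It rests on simplifying assumptions that are not from the text: exactly one person in the pool is the true source, everyone in the pool is equally likely a priori, the true source always matches, and the pool sizes are hypothetical.

```javascript
// P(guilty | match) under the simplifying assumptions stated above.
function probabilityGuiltyGivenMatch(poolSize, randomMatchProbability) {
  const priorGuilty = 1 / poolSize;
  const matchAndGuilty = 1.0 * priorGuilty; // the true source always matches
  const matchAndInnocent = randomMatchProbability * (1 - priorGuilty);
  return matchAndGuilty / (matchAndGuilty + matchAndInnocent);
}

const pMatchGivenInnocent = 1e-6; // the "1 in a million" figure

// Hypothetical pool sizes: a small suspect list versus a database trawl.
for (const poolSize of [100, 1e6, 1e7]) {
  const pGuilty = probabilityGuiltyGivenMatch(poolSize, pMatchGivenInnocent);
  console.log(`pool ${poolSize}: P(innocent | match) = ${(1 - pGuilty).toFixed(4)}`);
}
```

With a pool of 100 the match is overwhelming evidence, but with a pool of ten million the matching person is innocent about 91% of the time, even though P(match | innocent) never changed.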

Can Bayes' theorem be derived?

Yes, in two lines from the definition of conditional probability. P(A|B) = P(A∩B)/P(B) and P(B|A) = P(A∩B)/P(A). Solve both for P(A∩B), set them equal — P(A|B)·P(B) = P(B|A)·P(A) — divide by P(B) — Bayes' theorem. The whole machinery follows from one definition.