Hypothesis Testing

Evidence vs null hypothesis — p-values, t-tests, and the 5% threshold

Hypothesis testing is the framework for deciding whether observed data provide enough evidence to reject a "null hypothesis", a default assumption like "no difference between groups." You compute a p-value (the probability of seeing data at least this extreme if the null is true); if p < α (typically 0.05), you reject the null. It is used in clinical trials, A/B testing, scientific experiments, and quality control. P-values are often misinterpreted, and Bayesian methods offer an alternative formulation.

Null hypothesis (H₀)	Default: typically "no effect" or "no difference"
Alternative (H₁)	What you're trying to demonstrate
p-value	P(observed data or more extreme | H₀ true)
Significance level α	Threshold for rejecting (typically 0.05)
Type I error	Reject H₀ when true (false positive); rate = α
Type II error	Fail to reject H₀ when false (false negative); rate = β

The framework

Hypothesis testing follows a structured procedure:

State the hypotheses. Null H₀ — the default ("no effect"). Alternative H₁ — what you're testing for.
Choose a significance level α. Typically 0.05 — accept up to 5% false-positive rate.
Compute a test statistic from your data. A number quantifying how far from the null your observation is.
Find the p-value. P(this extreme a test statistic | H₀). Look up in tables or compute analytically.
Decide. If p < α, reject H₀ (in favor of H₁). If p ≥ α, fail to reject.

Critical — you never "accept" H₀. You either reject it or fail to reject. Failure to reject doesn't mean H₀ is true; just that the evidence isn't strong enough to overturn it.

Worked example — testing if a coin is fair

Hypotheses:

H₀ — coin is fair (p = 0.5).
H₁ — coin is biased (p ≠ 0.5).

Flip the coin 100 times; get 60 heads. Is this enough evidence?

Compute test statistic. Under H₀, number of heads ~ Binomial(100, 0.5), approximately N(50, 25). Standardize:

Z = (60 - 50) / 5 = 2.0

Two-tailed p-value — P(|Z| ≥ 2) = 2 · P(Z ≥ 2) ≈ 2 · 0.0228 = 0.0456.

p ≈ 0.046 < 0.05, so reject H₀ at the 5% level. The result only just clears the threshold, so treat it as modest rather than strong evidence that the coin is biased.

If we'd gotten 55 heads — Z = 1, p ≈ 0.317. Don't reject H₀ — within typical chance variation.
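The calculation above relies on the normal approximation to the binomial. As a cross-check, here is a minimal sketch of an exact two-sided binomial test, which sums the probabilities of every outcome that is no more likely than the observed one. The helper names `binomialPmf` and `exactBinomialTest` are illustrative, not part of any library, and the sketch assumes 0 < p0 < 1.

```javascript
// Probabilities P(X = k) for k = 0..n under Binomial(n, p), built iteratively
// so that no large factorials are needed. Assumes 0 < p < 1.
function binomialPmf(n, p) {
  const pmf = new Array(n + 1);
  pmf[0] = Math.pow(1 - p, n);
  for (let k = 0; k < n; k++) {
    pmf[k + 1] = pmf[k] * ((n - k) / (k + 1)) * (p / (1 - p));
  }
  return pmf;
}

// Exact two-sided p-value: total probability of outcomes that are
// no more likely under H0 than the observed count.
function exactBinomialTest(successes, n, p0) {
  const pmf = binomialPmf(n, p0);
  // Small relative tolerance so ties are not lost to rounding.
  const cutoff = pmf[successes] * (1 + 1e-7);
  let pValue = 0;
  for (let k = 0; k <= n; k++) {
    if (pmf[k] <= cutoff) pValue += pmf[k];
  }
  return Math.min(1, pValue);
}

const exactP = exactBinomialTest(60, 100, 0.5);
console.log(exactP.toFixed(4)); // 0.0569
```

For 60 heads in 100 flips the exact p-value comes out slightly above 0.05, while the normal approximation gives 0.0456. Results this close to the threshold can flip depending on the method, which is one reason not to read too much into a borderline p-value.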

Common hypothesis tests

Test	Tests for	Assumption	Test statistic
Z-test	Mean, known σ	Normal data or large n	(x̄ − μ₀) / (σ/√n)
One-sample t-test	Mean, unknown σ	Normal data	(x̄ − μ₀) / (s/√n)
Two-sample t-test	Difference of two means	Independent normal samples	(x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
Paired t-test	Mean difference of paired data	Differences are normal	d̄ / (s_d/√n)
Chi-square goodness-of-fit	Distribution shape	Expected counts > 5	∑(observed − expected)² / expected
Chi-square independence	Independence of two categorical variables	Expected counts > 5	Same formula on contingency table
ANOVA (F-test)	Multiple group means	Normal, equal variances	Between-group var / within-group var
Mann-Whitney U	Shift between two distributions, often read as a median difference (non-parametric)	Independent samples	Rank-based

The test depends on the data type, sample size, and assumed distribution. Most introductory stats courses focus on t-tests and chi-square; advanced applications use specialized tests for specific problems.
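As an illustration of one row of the table, the sketch below computes the two-sample t statistic in the unpooled form shown there (often called Welch's t-test), together with the Welch-Satterthwaite degrees of freedom. The function names and the sample data are made up for the example. Turning t into a p-value needs the t-distribution CDF, which is left out here.

```javascript
// Arithmetic mean of an array of numbers.
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Two-sample t statistic with unpooled variances, plus the
// Welch-Satterthwaite approximation for the degrees of freedom.
function welchT(a, b) {
  const va = sampleVariance(a) / a.length; // s1^2 / n1
  const vb = sampleVariance(b) / b.length; // s2^2 / n2
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}

// Hypothetical data for illustration only.
const groupA = [1, 2, 3, 4, 5];
const groupB = [2, 4, 6, 8, 10];
const welch = welchT(groupA, groupB);
console.log(welch.t.toFixed(3), welch.df.toFixed(2)); // -1.897 5.88
```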

Type I and Type II errors

	H₀ True	H₀ False
Reject H₀	Type I error (false positive) — probability α	Correct rejection (true positive) — probability 1 − β = power
Fail to reject H₀	Correct (true negative) — probability 1 − α	Type II error (false negative) — probability β

The trade-off is real: for a fixed sample size, lowering α (a stricter threshold) reduces Type I errors but increases Type II errors. Increasing the sample size reduces β at a given α, so a large enough study can keep both error rates low. Power calculations for clinical trials estimate the n needed to achieve power ≥ 0.8 for an effect size of interest.
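A power calculation can be sketched for the simplest case, a two-sided one-sample z-test with known σ. The code below assumes α = 0.05 and a power target of 0.8, with the corresponding normal quantiles (1.959964 and 0.841621) hard-coded as defaults. The effect size is standardized, d = (μ₁ − μ₀)/σ, and the function names are illustrative.

```javascript
// Standard normal CDF via a polynomial approximation (absolute error ~1e-7).
function normalCDF(z) {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp(-z * z / 2);
  const p = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z > 0 ? 1 - p : p;
}

// Power of a two-sided z-test: probability that |Z| exceeds the critical
// value when the true standardized effect is effectSize.
function zTestPower(effectSize, n, zCritical = 1.959964) {
  const shift = effectSize * Math.sqrt(n);
  return normalCDF(shift - zCritical) + normalCDF(-shift - zCritical);
}

// Usual closed-form sample size; it ignores the negligible chance of
// rejecting in the wrong direction.
function requiredSampleSize(effectSize, zCritical = 1.959964, zPower = 0.841621) {
  return Math.ceil(((zCritical + zPower) / effectSize) ** 2);
}

const n = requiredSampleSize(0.5);
console.log(n);                             // 32
console.log(zTestPower(0.5, n).toFixed(2)); // 0.81
console.log(requiredSampleSize(0.2));       // 197
```

Halving the effect size roughly quadruples the required n, which is why small effects are expensive to detect.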

JavaScript — basic hypothesis test

// One-sample z-test for mean
function zTest(sampleMean, populationMean, populationSD, n) {
  const z = (sampleMean - populationMean) / (populationSD / Math.sqrt(n));
  const pValueTwoTailed = 2 * (1 - normalCDF(Math.abs(z)));
  return { z, pValue: pValueTwoTailed };
}

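// Standard normal CDF (polynomial approximation, Abramowitz & Stegun 26.2.17;
// absolute error on the order of 1e-7). It does not call an error function.
function normalCDF(z) {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp(-z*z/2);  // standard normal density at z
  // Upper-tail probability P(Z > |z|)
  const p = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z > 0 ? 1 - p : p;
}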
}

// Test if a coin is fair: 100 flips, 60 heads.
// Sample proportion 0.6; under H₀ the per-flip SD is √(0.5 · 0.5) = 0.5.
const result = zTest(0.6, 0.5, 0.5, 100);
console.log(result);  // approximately { z: 2, pValue: 0.0455 }

if (result.pValue < 0.05) console.log('Reject H₀ — evidence of bias');
else console.log('Fail to reject — no significant evidence of bias');

The multiple comparisons problem

Running many tests inflates false positive rates. With α = 0.05 per test, the probability of at least one false positive across n independent tests is 1 − (0.95)ⁿ.

Number of tests	P(at least one false positive)
1	5%
5	23%
10	40%
20	64%
50	92%
100	99.4%
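The table can be reproduced directly from the formula. A minimal sketch, assuming independent tests and a per-test α of 0.05 (the function name is illustrative):

```javascript
// Probability of at least one false positive across numTests independent
// tests, each run at significance level alpha, when every null is true.
function familyWiseErrorRate(numTests, alpha = 0.05) {
  return 1 - Math.pow(1 - alpha, numTests);
}

for (const numTests of [1, 5, 10, 20, 50, 100]) {
  const percent = (100 * familyWiseErrorRate(numTests)).toFixed(1);
  console.log(numTests, percent + '%');
}
// 1 5.0%
// 5 22.6%
// 10 40.1%
// 20 64.2%
// 50 92.3%
// 100 99.4%
```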

Standard corrections:

Bonferroni. Divide α by number of tests. Conservative; controls family-wise error rate.
Benjamini-Hochberg. Controls false discovery rate (proportion of false positives among rejections). Less conservative; more powerful for many tests.

The "genome-wide significance threshold" of 5 × 10⁻⁸ is α = 0.05 / 1,000,000 — Bonferroni for the ~1M tests in genome-wide association studies.
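Both corrections are short to implement. The sketch below returns, for each p-value, whether its null hypothesis is rejected. The function names and the p-values are made up for illustration, and the Benjamini-Hochberg version assumes independent (or positively dependent) tests.

```javascript
// Bonferroni: compare every p-value against alpha / m.
function bonferroni(pValues, alpha = 0.05) {
  const threshold = alpha / pValues.length;
  return pValues.map((p) => p <= threshold);
}

// Benjamini-Hochberg step-up procedure at false discovery rate q.
function benjaminiHochberg(pValues, q = 0.05) {
  const m = pValues.length;
  // Indices sorted by ascending p-value, so results map back to input order.
  const order = pValues.map((_, i) => i).sort((i, j) => pValues[i] - pValues[j]);
  // Largest 1-based rank k with p_(k) <= (k / m) * q.
  let maxRank = 0;
  order.forEach((idx, r) => {
    if (pValues[idx] <= ((r + 1) / m) * q) maxRank = r + 1;
  });
  // Reject every hypothesis ranked at or below maxRank.
  const reject = new Array(m).fill(false);
  for (let r = 0; r < maxRank; r++) reject[order[r]] = true;
  return reject;
}

// Hypothetical p-values from five tests.
const pValues = [0.2, 0.001, 0.039, 0.012, 0.014];
console.log(bonferroni(pValues));        // [ false, true, false, false, false ]
console.log(benjaminiHochberg(pValues)); // [ false, true, true, true, true ]
```

On these numbers Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects four, which shows the difference in conservativeness described above.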

Where hypothesis testing matters

Clinical trials. Drug efficacy testing with pre-registered analysis plans. By convention, regulators such as the FDA look for p < 0.05 on the primary endpoint; the exact standard of evidence depends on the trial design and indication.
A/B testing. Comparing two website variants. T-test or z-test on conversion rates. Minimum-detectable-effect calculations determine sample sizes.
Scientific publications. Most experimental papers report p-values for their main results. The "statistical significance" threshold (p < 0.05) is the default acceptance criterion.
Quality control. Six Sigma testing — comparing process metrics to specifications, looking for shifts. Statistical process control charts.
Marketing analytics. Customer segmentation, campaign effectiveness, churn prediction — comparing groups to test whether differences are meaningful.
Particle physics. 5σ requirement for discovery announcements (p < 3 × 10⁻⁷). Higgs boson discovery (2012) cited 5σ.

Common mistakes

Misinterpreting p-value as "probability of null being true." P-value is P(data | null), not P(null | data). The latter is what most people want; computing it requires Bayes' theorem and a prior.
Equating "not significant" with "no effect." Failing to reject H₀ doesn't prove it. The effect might exist but be too small to detect with your sample size. Lack of evidence isn't evidence of absence.
Over-relying on p < 0.05. The threshold is arbitrary. Effect size matters more than significance — a tiny effect with huge n can be "significant" but practically meaningless.
p-hacking. Running many tests, picking the significant ones, reporting only those. This destroys validity. Pre-registration of hypotheses prevents this; corrections handle multiple testing properly.
Inadequate sample size for power. Underpowered studies miss real effects. Compute the required sample size BEFORE running the study; don't keep adding data until the result turns "significant."
Ignoring assumptions. Tests assume specific data distributions and independence. Skewed data, autocorrelation, outliers — all violate t-test assumptions and produce wrong p-values. Use non-parametric or specialized tests when appropriate.
Confusing one-tailed with two-tailed. Two-tailed tests for "different from null" (either direction). One-tailed tests for "specifically larger or smaller." One-tailed gives smaller p-value; only use when you have strong directional prior, not as a way to make non-significant results "significant."
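The point about effect size versus significance is easy to see numerically. In this sketch the numbers are hypothetical: a shift of 0.01 standard deviations measured on 250,000 observations. The helpers repeat the z-test from the JavaScript section so the block runs on its own.

```javascript
// Standard normal CDF via a polynomial approximation (absolute error ~1e-7).
function normalCDF(z) {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp(-z * z / 2);
  const p = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z > 0 ? 1 - p : p;
}

// One-sample, two-tailed z-test for a mean.
function zTest(sampleMean, populationMean, populationSD, n) {
  const z = (sampleMean - populationMean) / (populationSD / Math.sqrt(n));
  return { z, pValue: 2 * (1 - normalCDF(Math.abs(z))) };
}

// Hypothetical study: tiny shift, very large sample.
const observedMean = 0.01;
const nullMean = 0;
const sd = 1;
const sampleSize = 250000;

const tiny = zTest(observedMean, nullMean, sd, sampleSize);
const standardizedEffect = (observedMean - nullMean) / sd; // Cohen's d

console.log(tiny.z.toFixed(1));     // 5.0
console.log(tiny.pValue < 0.001);   // true
console.log(standardizedEffect);    // 0.01
```

The test is overwhelmingly "significant," yet a standardized effect of 0.01 is far below what is usually considered even a small effect. Report the effect size and a confidence interval alongside the p-value.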

Frequently asked questions

What does a p-value actually mean?

P(seeing this data or more extreme | null hypothesis is true). NOT P(null hypothesis is true). NOT the probability the result is due to chance. NOT the strength of the effect. Simply — assuming the null is true, how surprising is this data? Small p — surprising; we lean toward rejecting null. The misinterpretations have caused enormous confusion in scientific reporting.

Why is p < 0.05 the magic threshold?

Convention chosen by R.A. Fisher in the 1920s — "convenient round number." There's nothing special about 0.05; particle physics uses 5σ (~3 × 10⁻⁷) for discovery announcements. The threshold should match the cost of false positives in your domain. Fisher himself warned against rigid thresholds; modern statistics increasingly questions the binary "significant or not" framing.

What's a Type I vs Type II error?

Type I: reject H₀ when it's actually true (false positive). Probability = α (significance level). Type II: fail to reject H₀ when it's actually false (false negative). Probability = β. Power = 1 − β = probability of correctly rejecting a false null. Trade-off: at a fixed sample size, lowering α increases β. Bigger studies have lower β for a fixed α.

What's the difference between t-test and z-test?

Both test means. Z-test assumes you know the population standard deviation σ. T-test estimates σ from the sample (using sample std dev s); the test statistic follows a t-distribution with n-1 degrees of freedom (heavier tails than normal for small n). For n > ~30, t-distribution ≈ normal, so t-test ≈ z-test. In practice, you almost always use t-test because population σ is unknown.

What's statistical power?

1 − β = probability of detecting an effect when it really exists. Depends on effect size, sample size, and α. A study with low power (say, 0.3) misses real effects 70% of the time. Standard target — power ≥ 0.8 (80% chance of detection). Underpowered studies are wasteful; the result is often ambiguous.

How is hypothesis testing different from Bayesian inference?

Hypothesis testing — frequentist. Compute p-value; reject if below threshold. Doesn't tell you P(hypothesis true). Bayesian — compute posterior P(hypothesis | data) directly. Updates beliefs from prior to posterior. Both reach similar conclusions on simple problems, but Bayesian gives more direct interpretations. For multiple comparisons or complex models, Bayesian methods often handle uncertainty more cleanly.
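To make the contrast concrete, here is a minimal Bayesian sketch for the coin example (60 heads, 40 tails). It assumes a uniform Beta(1, 1) prior on the heads probability θ, which is a modeling choice and not something the frequentist test requires, and it approximates the posterior on a grid. The function name is illustrative.

```javascript
// Posterior probability that theta > 0.5 given the flips, assuming a
// uniform prior. The posterior is proportional to theta^heads * (1 - theta)^tails.
function posteriorProbHeadsBiased(heads, tails, gridSize = 20000) {
  // Work with logs inside exp() to keep the tiny density values stable.
  const density = (theta) =>
    Math.exp(heads * Math.log(theta) + tails * Math.log(1 - theta));
  let total = 0;
  let above = 0;
  for (let i = 0; i < gridSize; i++) {
    const theta = (i + 0.5) / gridSize; // midpoint of each grid cell
    const w = density(theta);
    total += w;
    if (theta > 0.5) above += w;
  }
  return above / total; // normalizing constant cancels in the ratio
}

const posterior = posteriorProbHeadsBiased(60, 40);
console.log(posterior.toFixed(2)); // 0.98
```

Under this prior the posterior probability that the coin favors heads is about 0.98. That is a direct statement about the hypothesis, which a p-value is not, but it depends on the prior. It also answers a directional question (θ > 0.5), not the point null θ = 0.5.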

What's the multiple comparisons problem?

Running many hypothesis tests inflates the false-positive rate. With α = 0.05 per test and 20 independent tests, P(at least one false positive) = 1 − 0.95²⁰ ≈ 64%. Corrections — Bonferroni (divide α by number of tests, conservative), Benjamini-Hochberg (controls false discovery rate, less conservative). Failure to correct is a major source of "p-hacking" in science.