Student's t-Distribution
Small samples + unknown variance → heavier tails than Normal
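Student's t-distribution is the distribution of (X̄ − μ)/(s/√n) for Normal data when σ² is unknown and estimated by s². It is heavy-tailed at low df and close to Normal beyond df = 30, where the 97.5% critical values differ by about 4%. Published by Gosset, a Guinness brewer, under the pseudonym "Student" in 1908.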
- Definition: t(ν) = Z / √(χ²(ν)/ν)
- Inventor: William Gosset ("Student"), 1908
- Mean (ν > 1): 0; Variance (ν > 2): ν/(ν − 2)
- t(1): Cauchy, mean undefined
- t(30) vs Normal: 97.5% critical values differ by about 4% (2.04 vs 1.96)
- Used for: t-tests, small-sample CIs
The setup — and why we need t at all
Suppose X₁, X₂, …, Xₙ are i.i.d. N(μ, σ²). The standardized sample mean is:
Z = (X̄ − μ) / (σ/√n) ~ N(0, 1)
Beautiful — but only if you know σ. You almost never do. Replace σ with the sample standard deviation s, and the standardized statistic becomes:
T = (X̄ − μ) / (s/√n)
This is no longer Normal. The denominator s is itself a random variable. Its uncertainty thickens the tails of T. Gosset (1908) worked out exactly what distribution T follows: t(n − 1).
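A quick simulation makes the thicker tails visible. The sketch below is illustrative only: the function name `exceed_rate`, the seed, and the trial count are assumptions, not part of the original text. It draws Normal samples with μ = 0 and σ = 1, forms T with the sample standard deviation s, and counts how often |T| exceeds the Normal cutoff 1.96.

```python
import math
import random

def exceed_rate(n, trials=20000, cutoff=1.96, seed=0):
    """Fraction of simulated samples with |T| > cutoff, where T uses s in place of sigma."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]      # true mu = 0, sigma = 1
        xbar = sum(xs) / n
        # sample standard deviation with the n - 1 denominator
        s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
        t_stat = xbar / (s / math.sqrt(n))
        if abs(t_stat) > cutoff:
            hits += 1
    return hits / trials

rate_n5 = exceed_rate(5)
rate_n50 = exceed_rate(50)
print(f"n = 5:  P(|T| > 1.96) is about {rate_n5:.3f}")
print(f"n = 50: P(|T| > 1.96) is about {rate_n50:.3f}")
```

If T were Normal, both rates would sit near 0.05. For n = 5 the theoretical value under t(4) is about 0.12, so a Normal cutoff rejects more than twice as often as intended; by n = 50 the rate is close to 0.05 again.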
Construction from Normal and chi-squared
Let Z ~ N(0, 1) and V ~ χ²(ν) be independent. Then:
T = Z / √(V/ν) ~ t(ν)
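Why? In the sample-mean setup, divide the numerator and denominator of T by σ/√n:
Numerator: Z = (X̄ − μ)/(σ/√n) ~ N(0, 1)
Denominator: s/σ = √(V/(n−1)), where V = (n−1)s²/σ² ~ χ²(n−1). For Normal data, X̄ and s² are independent, so Z and V are independent.
T = (X̄ − μ)/(s/√n) = Z / (s/σ) = Z / √(V/(n−1)) ~ t(n−1).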
So the t-distribution literally captures the joint behavior of "true Normal numerator" and "wobbly sample-variance denominator."
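The construction above can be checked numerically. This is a minimal sketch, assuming ν = 10 and a hypothetical helper `sample_t`; it builds V as a sum of ν squared standard Normals (one way to obtain a χ²(ν) draw) and compares the sample variance of Z / √(V/ν) with the theoretical ν/(ν − 2).

```python
import math
import random

def sample_t(nu, rng):
    """One draw of Z / sqrt(V / nu), with V a sum of nu squared standard Normals."""
    z = rng.gauss(0.0, 1.0)
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))   # chi-squared(nu), independent of z
    return z / math.sqrt(v / nu)

rng = random.Random(42)
nu = 10
draws = [sample_t(nu, rng) for _ in range(50000)]

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(f"sample mean:     {mean:.3f}   theory: 0")
print(f"sample variance: {var:.3f}   theory nu/(nu-2): {nu / (nu - 2):.3f}")
```

The sample variance should land near 1.25, noticeably above the Normal's 1, which is the "wobbly denominator" showing up in the second moment.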
Density and moments
The pdf of t(ν):
f(t; ν) = Γ((ν+1)/2) / [√(νπ) · Γ(ν/2)] · (1 + t²/ν)^(-(ν+1)/2)
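The density translates directly into code. The sketch below uses only the standard library; `t_pdf` and `integrate` are illustrative names, and the log-gamma form is a choice made here for numerical stability at large ν. It checks two facts: t(1) at 0 equals the Cauchy value 1/π, and the density integrates to 1.

```python
import math

def t_pdf(t, nu):
    """Density of Student's t with nu degrees of freedom, computed on the log scale."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(t * t / nu))

def integrate(f, a, b, steps=20000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

cauchy_peak = t_pdf(0.0, 1)
area = integrate(lambda t: t_pdf(t, 5), -200.0, 200.0)
print(f"t(1) density at 0: {cauchy_peak:.5f}   1/pi: {1 / math.pi:.5f}")
print(f"area under t(5) on [-200, 200]: {area:.6f}")
```

The integration range is truncated at ±200, which is an approximation; for ν = 5 the mass outside that range is negligible, but for ν = 1 or 2 a much wider range would be needed.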
| Quantity | Formula | ν=3 | ν=10 | ν=30 |
|---|---|---|---|---|
| Mean | 0 (for ν > 1) | 0 | 0 | 0 |
| Variance | ν/(ν − 2) (for ν > 2) | 3 | 1.25 | 1.07 |
| Skewness | 0 (symmetric) | 0 | 0 | 0 |
| Excess kurtosis | 6/(ν − 4) (for ν > 4) | ∞ | 1.0 | 0.23 |
| 97.5% quantile (two-tailed 5% critical value) | no simple closed form | 3.18 | 2.23 | 2.04 |
The 97.5% quantiles, which serve as critical values for two-tailed tests at α = 0.05, show how t differs from Normal (1.96). At df = 30, the interval is only about 4% wider than one built on the Normal assumption.
t → Normal as ν → ∞
Plotting t(ν) for ν = 1, 2, 5, 10, 30, ∞ shows the curves stack from heavy-tailed Cauchy (ν=1) to standard Normal (ν=∞). Specific comparisons:
| ν | Variance | 97.5% critical value | Interval width vs Normal (1.96) |
|---|---|---|---|
| 1 (Cauchy) | undefined | 12.71 | 6.5× wider |
| 2 | ∞ | 4.30 | 2.2× wider |
| 5 | 1.67 | 2.57 | 31% wider |
| 10 | 1.25 | 2.23 | 14% wider |
| 30 | 1.07 | 2.04 | 4% wider |
| 100 | 1.02 | 1.98 | 1% wider |
| ∞ (Normal) | 1.00 | 1.96 | — |
For df ≤ 5 the difference is substantial; from df = 30 onward you might as well use Normal.
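The critical values in the table can be recomputed from the density alone. The following is a sketch under stated assumptions: it integrates the pdf with Simpson's rule and inverts the CDF by bisection, with helper names (`t_cdf`, `t_quantile`) and a search bound of 50 chosen for illustration. In practice a library routine such as `scipy.stats.t.ppf` would normally do this job.

```python
import math

def t_pdf(t, nu):
    """Density of Student's t with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(t * t / nu))

def t_cdf(x, nu, steps=1000):
    """P(T <= x): 0.5 plus Simpson's rule on [0, x]; symmetry covers x < 0."""
    if x < 0:
        return 1.0 - t_cdf(-x, nu, steps)
    h = x / steps
    total = t_pdf(0.0, nu) + t_pdf(x, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + total * h / 3

def t_quantile(p, nu, lo=0.0, hi=50.0):
    """Invert the CDF by bisection; assumes 0.5 <= p < 1 and a quantile below hi."""
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for nu in (1, 2, 5, 10, 30, 100):
    q = t_quantile(0.975, nu)
    print(f"nu = {nu:>3}   t_0.975 = {q:.3f}   ratio to 1.96 = {q / 1.96:.2f}")
```

The printed quantiles should agree with the table to rounding, and the ratio column is the interval-width penalty relative to the Normal.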
The t-test — by example
Suppose a battery manufacturer claims a mean lifetime of μ₀ = 1000 hours. You test n = 10 batteries and get sample mean X̄ = 950 hours, sample SD s = 50 hours. Is the claim plausible?
Null: μ = 1000
Test statistic: t = (X̄ − μ₀)/(s/√n) = (950 − 1000)/(50/√10) = −50/15.81 ≈ −3.16
Under null, t ~ t(9). Critical value at α = 0.05 two-tailed: ±2.26.
|t| = 3.16 > 2.26 → reject the null.
Conclusion: at α = 0.05, the data are inconsistent with mean lifetime 1000 hours.
Note: had we (mistakenly) used Normal's 1.96 critical value, we'd still reject — but with the wrong α level. The t-test gets the size of the test right.
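The battery calculation can be reproduced in a few lines. This is a minimal sketch working from summary statistics; the function name `one_sample_t` is an assumption, and the critical value 2.262 is hard-coded from a t-table for df = 9 rather than computed. With raw data, a library routine such as `scipy.stats.ttest_1samp` would be the usual route.

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """Return (t statistic, degrees of freedom) for H0: mu = mu0."""
    se = s / math.sqrt(n)              # estimated standard error of the mean
    return (xbar - mu0) / se, n - 1

t_stat, df = one_sample_t(xbar=950.0, mu0=1000.0, s=50.0, n=10)
T_CRIT = 2.262                         # t_0.975(9), looked up in a t-table
reject = abs(t_stat) > T_CRIT
print(f"t = {t_stat:.2f}, df = {df}, reject H0 at alpha = 0.05: {reject}")
```

The decision rule compares |t| with the two-tailed critical value, matching the steps in the example.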
t vs Normal vs Cauchy
| Distribution | Tails | Mean | Variance | Use when |
|---|---|---|---|---|
| Normal N(0,1) | e^{−x²/2} (very light) | 0 | 1 | Large samples, known σ |
| t(30) | ~Normal (slightly heavier) | 0 | 30/28 | n > 30, σ unknown |
| t(5) | polynomial decay, density ~ x⁻⁶ | 0 | 5/3 | n = 6 (df = n − 1), σ unknown |
| t(2) | polynomial decay, density ~ x⁻³ | 0 | ∞ | Heavy-tailed test stat, robust inference |
| t(1) = Cauchy | 1/(π(1+x²)), density ~ x⁻² | undefined | undefined | Pathological / sub-CLT models |
| Laplace | e^{−|x|} (medium-heavy) | 0 | 2 | L1 regression, robust estimators |
t(ν) is a continuous interpolation between Cauchy (ν=1, all-heavy) and Normal (ν=∞, all-light), with degrees of freedom literally tuning the tail weight.
Confidence intervals — wider for small n
95% confidence interval for μ with unknown σ:
X̄ ± t_{0.975}(n−1) · s/√n
For n = 10, the multiplier is 2.26 against the Normal's 1.96, so the interval is 15% wider. For n = 30, it's 2.05 (about 4% wider). For n = 100, it's 1.98 (1% wider). Small samples pay a real width penalty for not knowing σ.
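As a worked instance, the sketch below applies the interval formula to the battery figures from the t-test example (X̄ = 950, s = 50, n = 10). The helper name `mean_ci` is an assumption, and both multipliers are table values passed in by hand, not computed.

```python
import math

def mean_ci(xbar, s, n, crit):
    """Two-sided interval xbar +/- crit * s / sqrt(n); crit must match df = n - 1."""
    half_width = crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# 2.262 is t_0.975(9); 1.960 is the Normal 97.5% quantile.
lo_t, hi_t = mean_ci(950.0, 50.0, 10, crit=2.262)
lo_z, hi_z = mean_ci(950.0, 50.0, 10, crit=1.960)
ratio = (hi_t - lo_t) / (hi_z - lo_z)
print(f"t interval:      ({lo_t:.1f}, {hi_t:.1f})")
print(f"Normal interval: ({lo_z:.1f}, {hi_z:.1f})")
print(f"width ratio:     {ratio:.3f}")
```

The width ratio is just the ratio of the two multipliers, which is where the 15% figure for n = 10 comes from.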
Where the t-distribution shows up
- One-sample t-test. Test whether a sample mean differs from a hypothesized value when σ is unknown.
- Paired t-test. Test mean of paired differences (before/after, treatment/control with matched subjects).
- Two-sample t-test. Test difference of means in two groups; pooled (equal variances) or Welch (unequal).
- Regression coefficient inference. Under Normal errors, each estimated coefficient β̂ⱼ divided by its standard error follows a t-distribution under the null βⱼ = 0; the standard p-values in linear regression output come from these t-tests.
- Confidence intervals for means. Whenever the population variance is unknown — which is almost always in practice.
- Bayesian inference with vague priors. Posterior of a Normal mean under improper σ-prior is t-distributed — appears throughout objective Bayesian regression.
- Robust statistics. Heavy-tailed t-distributions (low ν) model outlier-prone data; t-distributed errors in regression resist outliers better than Gaussian errors.
- A/B testing. Welch's t-test is the standard significance test for mean differences between two groups.
Common pitfalls
- Using Z when you should use t. If σ is unknown (almost always), use t. The Normal approximation gets the type-I error rate wrong for small samples.
- Forgetting df = n − 1. One degree of freedom is lost estimating the mean. For paired t-tests, df = n − 1 where n is the number of pairs; for two-sample pooled t-tests, df = n₁ + n₂ − 2.
- Assuming normality of the underlying data. The t-test is robust to mild deviations from normality, especially for n ≥ 30. For severe non-normality, use Wilcoxon signed-rank or other non-parametric alternatives.
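- Welch vs pooled t-test confusion. Use Welch's t-test (separate variances) when sample variances differ noticeably. The pooled t-test assumes equal population variances; when they differ, especially with unequal group sizes, its standard error is wrong and the type-I error rate drifts away from the nominal α.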
- One-tailed vs two-tailed. Two-tailed is the default; use one-tailed only when you have a strong directional hypothesis pre-registered.
- Calling the result "significant" without effect size. A statistically significant t doesn't tell you whether the effect matters. Report the mean difference, confidence interval, and Cohen's d alongside the p-value.
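Several of these pitfalls can be seen side by side in a small calculation. The sketch below works from summary statistics and implements the Welch statistic with the Welch-Satterthwaite degrees of freedom, the pooled statistic with df = n₁ + n₂ − 2, and Cohen's d with the pooled SD. The function names and the input numbers are hypothetical, chosen so that the two groups differ sharply in both variance and size.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2            # variance of each group mean
    t_stat = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df

def pooled_variance(s1, n1, s2, n2):
    """Pooled variance estimate, valid under equal population variances."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Student's pooled-variance t statistic with df = n1 + n2 - 2."""
    sp2 = pooled_variance(s1, n1, s2, n2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2)), n1 + n2 - 2

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled SD."""
    return (m1 - m2) / math.sqrt(pooled_variance(s1, n1, s2, n2))

# Hypothetical data: a small noisy group against a large tight one.
group = (105.0, 20.0, 8, 100.0, 5.0, 40)
w_t, w_df = welch_t(*group)
p_t, p_df = pooled_t(*group)
print(f"Welch:  t = {w_t:.2f}, df = {w_df:.1f}")
print(f"pooled: t = {p_t:.2f}, df = {p_df}")
print(f"Cohen's d = {cohens_d(*group):.2f}")
```

With these inputs the pooled test reports a t statistic about twice the Welch value and 46 degrees of freedom against roughly 7, because it lets the large low-variance group dominate the variance estimate. When variances and group sizes are equal the two versions coincide.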
The Guinness story
William Sealy Gosset (1876–1937) was head experimental brewer at Guinness in Dublin. Quality control on small batches of barley and hops required statistical inference with very few samples: n = 5 or 10, not n = 1000. Existing Normal-based methods overstated precision for samples that small. After spending 1906–07 in Karl Pearson's biometric laboratory at University College London, he worked out the distribution of (X̄ − μ)/(s/√n) for small n. Guinness, fearing competitors would learn its statistical methods, forbade publishing under his real name. Gosset published as "Student" in Biometrika, 1908. The pseudonym stuck. The Guinness brewery's prohibition on Gosset's name is the reason an entire branch of inference is named after a nameless student.
Frequently asked questions
Why use t instead of Normal for small samples?
Because you're estimating two things, not one. The standardized mean (X̄ − μ)/(σ/√n) is Normal, but σ is unknown, so you replace it with the sample standard deviation s. The ratio (X̄ − μ)/(s/√n) now has two sources of randomness: the numerator and the denominator. The denominator s is itself a random variable; small n makes s wobbly, and that wobble shows up as heavier tails in the resulting distribution. That's the t-distribution. For ν = 5, the two-tailed 5% critical value is 2.57 versus the Normal's 1.96, so intervals are about 31% wider, reflecting the genuine extra uncertainty.
When does t become essentially Normal?
By df = 30, the 97.5% critical value of t is 2.04 against the Normal's 1.96, a gap of about 4%. By df = 100 it is 1.98, a gap of about 1%. Rule of thumb: use t for n ≤ 30, switch to Normal (Z-tables) for n > 30. The transition is gradual: at df = 10, the two-tailed 5% critical value is 2.23 vs the Normal's 1.96 (14% wider). At df = 30, it's 2.04 (4% wider). At df = 60, it's 2.00 (2% wider). The convergence is monotone: t tails are always thicker than Normal tails, and the gap shrinks as df grows.
Who was "Student" and why the pseudonym?
William Sealy Gosset (1876–1937), a chemist at the Guinness brewery in Dublin. He developed t-tests to handle small-batch quality control on beer ingredients — Guinness's trade secret. Guinness forbade publication of trade-related research, so Gosset published under the pseudonym "Student" in Biometrika (1908). Karl Pearson edited the journal and oversaw publication. Fisher generalized Gosset's work in the 1920s and gave us the modern t-test framework. The brewery's prohibition on publication is the reason an entire branch of statistics is named after an anonymous student rather than a real person.
What is the variance of the t-distribution?
Var(t(ν)) = ν/(ν − 2) for ν > 2; it is infinite for 1 < ν ≤ 2 and undefined for ν ≤ 1. The variance blows up as ν → 2 from above, reflecting the heavier tails. For ν = 3, the variance is 3; for ν = 5, it is 5/3 ≈ 1.67; for ν = 30, it is 30/28 ≈ 1.07 (close to the Normal's 1). The mean is 0 only for ν > 1; for ν = 1 (the Cauchy distribution) the mean doesn't exist because the tails are too heavy. So you need ν > 2 for both the mean and the variance to be finite.
What's the t-test, exactly?
One-sample t-test for mean μ₀: compute t = (X̄ − μ₀) / (s/√n). Under the null (true mean is μ₀), t follows t(n − 1). Reject the null if |t| exceeds the critical value (e.g., for n = 10 and α = 0.05 two-tailed, critical value is 2.26 from t(9) tables). Variants: paired t-test (differences for matched pairs), independent two-sample t-test (Welch's or Student's pooled-variance versions). The t-test is the workhorse of small-sample mean comparison — biology, psychology, A/B testing, clinical trials.
How is t(ν) constructed from Normal and chi-squared?
Let Z be standard Normal and V be an independent chi-squared variable with ν degrees of freedom. Then T = Z / √(V/ν) has the t(ν) distribution. Verify: in (X̄ − μ)/(s/√n), dividing numerator and denominator by σ/√n gives Z = (X̄ − μ)/(σ/√n) ~ N(0,1) on top and s/σ = √(V/(n − 1)) below, where V = (n − 1)s²/σ² ~ χ²(n − 1) is independent of Z for Normal data. So (X̄ − μ)/(s/√n) = Z / √(V/(n − 1)), which is exactly t(n − 1). The chi-squared in the denominator captures the wobbliness of the sample standard deviation; the larger ν, the more concentrated the denominator, and the closer T is to Normal.
What's the difference between t and Cauchy?
t(1) is the Cauchy distribution. At ν = 1 the tails are so heavy that even the mean doesn't exist: the integral ∫ t · f(t) dt diverges. As ν increases the tails progressively lighten: ν = 2 has mean 0 but infinite variance; ν > 2 has finite variance; ν → ∞ converges to Normal. The t family is a continuous interpolation from Cauchy (ν = 1, fully heavy-tailed) to Normal (ν = ∞, fully light-tailed). The degrees of freedom parameter tunes the tail weight.