Machine Learning
The Bias-Variance Tradeoff
Why the model that fits your training data best is rarely the one that predicts best
The bias-variance tradeoff explains why too simple a model underfits and too complex a model overfits: expected test error splits into bias squared, variance, and irreducible noise, and minimizing it means tuning model complexity to balance bias against variance, since the noise term cannot be reduced.
- Expected error: Bias² + Variance + Noise
- Underfitting: high bias, low variance
- Overfitting: low bias, high variance
- Variance scaling: ≈ σ² / n
- Irreducible floor: σ² (Bayes error)
The dartboard intuition
Imagine training the same model a hundred times, each time on a fresh random sample drawn from the same source, and marking where each version's prediction lands on a dartboard whose bullseye is the true answer. Two distinct things can go wrong. The cluster of darts can be tight but sitting off to one side; that systematic offset is bias. Or the darts can scatter all over the board, centered on the bullseye but wildly spread; that spread is variance. A useless model is both biased and noisy; a great model is tight and centered. The frustrating discovery, which has roots in the classical statistics of mean squared error and was crystallized for ML by Geman, Bienenstock, and Doursat in 1992, is that the two knobs are coupled: turning down one tends to turn up the other.
Bias is the error you would still make even with infinite training data, because your model class is too rigid to represent the truth — a straight line trying to fit a sine wave. Variance is the error from your model being so flexible that it contorts itself to match the random noise in this particular training set, and would contort differently for the next one. The whole game of supervised learning is choosing a complexity that keeps their sum low, because that sum — not the training-set fit — is what determines how you do on data you have never seen.
The precise decomposition
Fix an input point x. The true label is generated as y = f(x) + ε, where f is the unknown true function and ε is zero-mean noise with variance σ². We train a model f̂ on a random dataset D, so f̂(x) is itself a random variable as D varies. Under squared-error loss, the expected test error at x, taken over both the noise and the draw of D, decomposes exactly:
E[(y − f̂(x))²] = ( E[f̂(x)] − f(x) )² ← Bias²
+ E[ (f̂(x) − E[f̂(x)])² ] ← Variance
+ σ² ← Irreducible noise
This is an identity, not an approximation. It falls out of adding and subtracting E[f̂(x)] inside the square. The cross terms vanish for two reasons: the noise ε has mean zero and is independent of the training set, and the deviation f̂(x) − E[f̂(x)] has mean zero by construction. Three consequences matter in practice:
- Bias² is how far the average prediction (over all possible training sets) sits from the truth. It is a property of the model class, not of any one fit.
- Variance is how much an individual fit jitters around that average. Resampling the data is what exposes it.
- σ² is the floor. No model, however perfect, can beat it — it is the Bayes error. If your test error is already near σ², stop tuning the model and go clean the labels instead.
One important caveat: this clean three-way split is specific to squared-error loss. For 0/1 classification loss the decomposition is messier — bias and variance interact multiplicatively and a high-variance estimator can actually reduce error when it lands on the right side of the decision boundary, which is part of why bagging works so well for classifiers.
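For readers who want the algebra behind the squared-error identity, here is a short derivation sketch. The shorthand \bar f(x) for the average prediction E[f̂(x)] is notation introduced here, and the first step assumes the noise ε is independent of the training set D.

```latex
% Shorthand (added here): \bar f(x) = \mathbb{E}_D[\hat f(x)]
\begin{aligned}
\mathbb{E}\big[(y - \hat f(x))^2\big]
  &= \mathbb{E}\big[(f(x) - \hat f(x) + \varepsilon)^2\big] \\
  &= \mathbb{E}\big[(f(x) - \hat f(x))^2\big]
     + 2\,\mathbb{E}[\varepsilon]\,\mathbb{E}\big[f(x) - \hat f(x)\big]
     + \mathbb{E}[\varepsilon^2] \\
  &= \mathbb{E}\big[(f(x) - \hat f(x))^2\big] + \sigma^2, \\[4pt]
\mathbb{E}\big[(f(x) - \hat f(x))^2\big]
  &= \mathbb{E}\big[(f(x) - \bar f(x) + \bar f(x) - \hat f(x))^2\big] \\
  &= (f(x) - \bar f(x))^2
     + 2\,(f(x) - \bar f(x))\,\mathbb{E}\big[\bar f(x) - \hat f(x)\big]
     + \mathbb{E}\big[(\hat f(x) - \bar f(x))^2\big] \\
  &= \underbrace{(\bar f(x) - f(x))^2}_{\text{Bias}^2}
     + \underbrace{\mathbb{E}\big[(\hat f(x) - \bar f(x))^2\big]}_{\text{Variance}}.
\end{aligned}
```

Both middle terms drop out because E[ε] = 0 and E[\bar f(x) − f̂(x)] = 0, which leaves Bias² + Variance + σ².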
How model complexity moves the two terms
As you slide model complexity from "constant" to "memorize every point," bias typically falls and variance typically rises. The classic picture is a U: total error drops as you escape underfitting, bottoms out at the sweet spot, then climbs again into overfitting.
Concretely, take polynomial regression on a noisy sine wave with 30 training points:
- Degree 1 (a line). Bias dominates. The line cannot bend, so it systematically misses the curve's peaks and troughs. Train error and test error are both high and roughly equal — the hallmark of underfitting.
- Degree 4. The sweet spot. Bias has dropped because a quartic can trace a sine arc closely; variance is still small because its five coefficients are stably estimable from 30 points.
- Degree 15. Variance explodes. The polynomial wiggles to pass through every noisy point, so train error is near zero but test error is enormous. Refit on a new sample and the curve looks completely different — that instability is the variance term.
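The three cases above can be checked with a small sweep of training and test error against polynomial degree. This is a minimal sketch, assuming the same noisy-sine setup (σ = 0.25, 30 training points on [0, 1]); the helper names `true_f`, `draw`, and `mean_errors` are illustrative and not taken from any library.

```python
import numpy as np
from numpy.polynomial import Polynomial

SIGMA = 0.25  # assumed noise standard deviation


def true_f(x):
    return np.sin(2 * np.pi * x)


def draw(n, rng):
    """Draw n noisy observations of the sine wave on [0, 1]."""
    x = rng.random(n)
    return x, true_f(x) + rng.normal(0.0, SIGMA, n)


def mean_errors(degree, n_train=30, n_test=2000, trials=200, seed=0):
    """Average train and test MSE of a least-squares polynomial fit.

    The same seed is reused for every degree, so all degrees see the
    same sequence of training and test sets.
    """
    rng = np.random.default_rng(seed)
    train_mse, test_mse = [], []
    for _ in range(trials):
        x_tr, y_tr = draw(n_train, rng)
        x_te, y_te = draw(n_test, rng)
        # Fixing the domain keeps the internal rescaling identical across fits
        poly = Polynomial.fit(x_tr, y_tr, degree, domain=[0.0, 1.0])
        train_mse.append(np.mean((poly(x_tr) - y_tr) ** 2))
        test_mse.append(np.mean((poly(x_te) - y_te) ** 2))
    return float(np.mean(train_mse)), float(np.mean(test_mse))


errors = {d: mean_errors(d) for d in (1, 2, 4, 6, 9)}
for d, (tr, te) in errors.items():
    print(f"degree {d}: train MSE = {tr:.3f}, test MSE = {te:.3f}")
```

Training error can only go down as the degree grows, because each polynomial family contains the previous one. Test error is the number to watch: it should drop sharply between degree 1 and degree 4, then stop improving.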
The lever is not always literal model degree. It is regularization strength λ (more λ = more bias, less variance), tree depth, the k in k-nearest-neighbors, the number of boosting rounds, or how long you train a neural net before early stopping. Every one of these is a complexity dial that walks you along the same U-curve.
Diagnosing it from the symptoms
| | High bias (underfit) | High variance (overfit) | Just right |
|---|---|---|---|
| Training error | High | Very low | Low |
| Validation error | High | High | Low |
| Train–val gap | Small | Large | Small |
| Effect of more data | No help | Closes the gap | Marginal |
| Effect of more features / capacity | Helps | Hurts | Marginal |
| Effect of stronger regularization | Hurts | Helps | Slight hurt |
| Fix | Bigger model, richer features, less λ | More data, more λ, simpler model, bagging | Ship it |
The single most useful diagnostic is the learning curve: training and validation error plotted against the number of training examples. Two converged-and-high curves scream bias — feeding more data is wasted effort, you need a more expressive model. A low training curve with a validation curve hovering well above it, still descending, screams variance — more data or more regularization will close the gap.
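To make the learning-curve diagnostic concrete, the sketch below computes both curves for a rigid model (degree 1) and a flexible one (degree 9) on a noisy sine wave. The data generator, the chosen degrees, and the function name `learning_curve` are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

SIGMA = 0.25  # assumed noise standard deviation


def true_f(x):
    return np.sin(2 * np.pi * x)


def learning_curve(degree, sizes=(20, 40, 80, 160, 320),
                   repeats=50, n_val=2000, seed=0):
    """Return (n, mean train MSE, mean validation MSE) for each training size."""
    rng = np.random.default_rng(seed)
    rows = []
    for n in sizes:
        tr_err, va_err = [], []
        for _ in range(repeats):
            x_tr = rng.random(n)
            y_tr = true_f(x_tr) + rng.normal(0.0, SIGMA, n)
            x_va = rng.random(n_val)
            y_va = true_f(x_va) + rng.normal(0.0, SIGMA, n_val)
            poly = Polynomial.fit(x_tr, y_tr, degree, domain=[0.0, 1.0])
            tr_err.append(np.mean((poly(x_tr) - y_tr) ** 2))
            va_err.append(np.mean((poly(x_va) - y_va) ** 2))
        rows.append((n, float(np.mean(tr_err)), float(np.mean(va_err))))
    return rows


rigid = learning_curve(1)     # expected to be bias-bound
flexible = learning_curve(9)  # expected to be variance-bound at small n

for name, rows in (("degree 1", rigid), ("degree 9", flexible)):
    for n, tr, va in rows:
        print(f"{name}, n = {n:3d}: train = {tr:.3f}, val = {va:.3f}, gap = {va - tr:.3f}")
```

For the degree-1 model the two curves should meet at a high error that more data does not lower. For the degree-9 model the gap should be wide at n = 20 and narrow as n grows.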
What the numbers actually say
- Variance shrinks like σ²/n. Doubling the training set roughly halves the variance term, but does nothing to bias. This is why the variance curve in a learning-curve plot decays as a power law while the bias curve is a flat asymptote.
- k-NN variance is σ²/k for a fixed design. For k-nearest-neighbors regression with noise variance σ² and the training inputs held fixed, the prediction averages k independent noisy labels, so the variance contribution is σ²/k, while bias grows roughly with the square of the neighborhood radius. Going from k=1 to k=10 cuts variance tenfold at the cost of more bias, which makes it a clean, closed-form instance of the tradeoff.
- Bagging cuts variance by up to B-fold. Averaging B independently trained models divides the variance by B if the models are uncorrelated; with a realistic pairwise correlation ρ between trees, each with prediction variance σ² (here the variance of one tree's prediction, not the label noise), the variance of the average is ρσ² + (1−ρ)σ²/B. Adding trees shrinks only the second term, leaving a floor of ρσ². Random forests exist almost entirely to drive ρ down by randomizing the feature subsets, which lowers that floor.
- Boosting attacks bias instead. Gradient boosting adds many high-bias, low-variance weak learners (shallow trees) and reduces bias stage by stage — the mirror image of bagging, and the reason the two ensembling families have opposite hyperparameter sensitivities.
- The irreducible floor is real money. On a dataset where σ² corresponds to, say, 4% label error, no amount of modeling pushes accuracy past 96%. Teams routinely burn months chasing the last 2% that the Bayes error makes mathematically impossible.
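The bagging formula in the list above can be checked numerically. The sketch below is a synthetic stand-in rather than a real ensemble: it assumes each member's prediction is a Gaussian with variance σ² and pairwise correlation ρ, built from one shared component and one private component per member. The function names are illustrative.

```python
import numpy as np


def averaged_variance(rho, sigma2, B):
    """Closed-form variance of the mean of B equally correlated predictors."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / B


def simulated_variance(rho, sigma2, B, draws=100_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((draws, 1))   # component common to all members
    private = rng.standard_normal((draws, B))  # component unique to each member
    members = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)
    return float(members.mean(axis=1).var())


for B in (1, 5, 25):
    theory = averaged_variance(0.3, 1.0, B)
    empirical = simulated_variance(0.3, 1.0, B)
    print(f"B = {B:2d}: formula = {theory:.4f}, simulation = {empirical:.4f}")
```

With ρ = 0.3 the variance of the average approaches 0.3σ² as B grows, which is the floor that only lower correlation between members can reduce.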
JavaScript: measuring bias and variance empirically
You cannot observe bias and variance directly — they are expectations over unseen training sets. But you can estimate them with a Monte Carlo experiment: draw many training sets, fit a model to each, and look at how the predictions at a fixed test point spread out and where their average lands.
```javascript
// True function and noisy sampler
const f = x => Math.sin(2 * Math.PI * x);
const NOISE = 0.25; // σ, the noise standard deviation
const sample = n => Array.from({ length: n }, () => {
  const x = Math.random();
  // Uniform noise on [-σ√3, σ√3] has variance exactly σ²
  return [x, f(x) + (Math.random() * 2 - 1) * NOISE * Math.sqrt(3)];
});

// Fit a polynomial of given degree by least squares (normal equations).
// XᵀX is badly conditioned at high degree, so degree-15 results mix genuine
// estimator variance with floating-point error.
function fitPoly(data, degree) {
  const X = data.map(([x]) => Array.from({ length: degree + 1 }, (_, j) => x ** j));
  const y = data.map(d => d[1]);
  // Solve (XᵀX) w = Xᵀy via Gaussian elimination
  const A = X[0].map((_, i) => X[0].map((_, j) => X.reduce((s, r) => s + r[i] * r[j], 0)));
  const b = X[0].map((_, i) => X.reduce((s, r, k) => s + r[i] * y[k], 0));
  return solve(A, b); // weight vector w
}

const predict = (w, x) => w.reduce((s, wj, j) => s + wj * x ** j, 0);

// Monte Carlo estimate of bias² and variance at a fixed test point x0
function biasVariance(degree, x0, trials = 500, n = 30) {
  const preds = [];
  for (let t = 0; t < trials; t++) preds.push(predict(fitPoly(sample(n), degree), x0));
  const mean = preds.reduce((a, b) => a + b, 0) / trials;
  const variance = preds.reduce((s, p) => s + (p - mean) ** 2, 0) / trials;
  const bias2 = (mean - f(x0)) ** 2;
  return { bias2, variance, total: bias2 + variance + NOISE ** 2 };
}

// Gauss-Jordan elimination with partial pivoting on the augmented matrix [A | b]
function solve(A, b) {
  const n = b.length, M = A.map((r, i) => [...r, b[i]]);
  for (let c = 0; c < n; c++) {
    let p = c;
    for (let r = c + 1; r < n; r++) if (Math.abs(M[r][c]) > Math.abs(M[p][c])) p = r;
    [M[c], M[p]] = [M[p], M[c]];
    for (let r = 0; r < n; r++) if (r !== c) {
      const k = M[r][c] / M[c][c];
      for (let j = c; j <= n; j++) M[r][j] -= k * M[c][j];
    }
  }
  return M.map((row, i) => row[n] / row[i]);
}

// Probe at x0 = 0.25, the sine peak. At x0 = 0.5 the setup is symmetric
// (f(1 - x) = -f(x)), so every polynomial fit is unbiased there on average
// and the bias term would not show up.
console.log('degree 1 :', biasVariance(1, 0.25));  // bias² should dominate
console.log('degree 4 :', biasVariance(4, 0.25));  // both small: the sweet spot
console.log('degree 15:', biasVariance(15, 0.25)); // variance should dominate
```
Sweep degree from 1 to 15 and you should see bias² fall toward zero while variance climbs by orders of magnitude: the U-curve, reconstructed from first principles. At the highest degrees the normal equations are numerically fragile, so treat the exact magnitudes as indicative only. The NOISE² term you add at the end is the irreducible floor; it never changes.
Python: the same experiment with scikit-learn
```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
NOISE = 0.25
# Probe at the sine peak. At x0 = 0.5 the setup is symmetric (f(1 - x) = -f(x)),
# so every polynomial fit is unbiased there on average and bias² would read ~0.
x0 = 0.25

def sample(n):
    x = rng.random(n)
    return x[:, None], f(x) + rng.normal(0, NOISE, n)

def bias_variance(degree, trials=500, n=30):
    preds = np.empty(trials)
    for t in range(trials):
        X, y = sample(n)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X, y)
        preds[t] = model.predict([[x0]])[0]
    mean = preds.mean()
    bias2 = (mean - f(x0)) ** 2
    variance = preds.var()
    return bias2, variance, bias2 + variance + NOISE ** 2

for d in (1, 4, 15):
    b2, v, total = bias_variance(d)
    print(f"degree {d:2d}: bias²={b2:.4f} var={v:.4f} total≈{total:.4f}")

# Illustrative output (exact values depend on the seed and library versions):
# degree  1: bias²=0.2480 var=0.0031 total≈0.3136   <- underfit, bias-bound
# degree  4: bias²=0.0009 var=0.0204 total≈0.0838   <- sweet spot
# degree 15: bias²=0.0006 var=2.9471 total≈3.0102   <- overfit, variance blows up
```
Notice degree 15's bias² is microscopic — a degree-15 polynomial can represent the sine arc almost perfectly on average — yet its total error is 36× worse than degree 4. The damage is entirely variance: any single fit veers wildly because there are not enough points to pin down sixteen coefficients. This is overfitting laid bare, and it is exactly the failure mode that regularization, more data, and ensembling are built to suppress.
Beyond the U-curve: what modern ML adds
Double descent. The textbook U assumes the underparameterized regime, where parameters < data points. Belkin, Hsu, Ma, and Mandal (2019) showed that as you push past the interpolation threshold, where the model has just enough capacity to fit the training data perfectly, test error rises to a spike and then, counter-intuitively, falls again. The curve is not a U but a U followed by a second descent. This is part of the explanation for why a billion-parameter network trained on a million examples can generalize: it lives far to the right of the spike.
Implicit regularization. Over-parameterized models do not blow up the variance term the way the classic theory predicts, in part because stochastic gradient descent appears to favor low-norm, "simple" interpolating solutions among the infinitely many that fit the data. The capacity is enormous but the effective complexity SGD selects is modest, so variance stays controlled.
The bias-variance-covariance decomposition. For ensembles the error splits into three: average bias, average variance, and the covariance between members. This is the precise statement of why diversity helps — driving member predictions to be uncorrelated shrinks the covariance term, which is exactly what random forests engineer.
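Written out for an equally weighted ensemble of M members, the three-way split takes the form below. The symbols f_i (member i's prediction), \bar f (the ensemble average), and the barred averages are notation introduced here. The target is taken as the noise-free f(x), so σ² would be added on top for noisy labels.

```latex
% Notation (added here): \bar f = \tfrac{1}{M}\sum_{i=1}^{M} f_i is the ensemble prediction
\mathbb{E}\big[(\bar f - f)^2\big]
  = \overline{\mathrm{bias}}^{\,2}
  + \frac{1}{M}\,\overline{\mathrm{var}}
  + \Big(1 - \frac{1}{M}\Big)\,\overline{\mathrm{covar}},
\qquad\text{where}\qquad
\begin{aligned}
\overline{\mathrm{bias}}  &= \frac{1}{M}\sum_{i}\big(\mathbb{E}[f_i] - f\big), \\
\overline{\mathrm{var}}   &= \frac{1}{M}\sum_{i}\mathbb{E}\big[(f_i - \mathbb{E}[f_i])^2\big], \\
\overline{\mathrm{covar}} &= \frac{1}{M(M-1)}\sum_{i}\sum_{j \neq i}
    \mathbb{E}\big[(f_i - \mathbb{E}[f_i])(f_j - \mathbb{E}[f_j])\big].
\end{aligned}
```

As M grows the variance term fades like 1/M while the covariance term keeps nearly its full weight, so low covariance between members is what makes a large ensemble pay off.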
Regularization as a tunable bias injection. Ridge (L2) and lasso (L1) deliberately add bias — shrinking coefficients toward zero — to buy a larger reduction in variance. The regularization path is a continuous walk along the tradeoff curve, and cross-validation is just the search for the λ at the bottom of the U.
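The bias-for-variance exchange can be measured directly with a closed-form ridge fit. The sketch below assumes degree-9 polynomial features on a noisy sine wave, probes the peak at x = 0.25, and penalizes the intercept along with the other weights for brevity. All names and the grid of λ values are illustrative choices.

```python
import numpy as np

SIGMA = 0.25  # assumed noise standard deviation
DEGREE = 9    # flexible enough that the unregularized fit is variance-prone
X0 = 0.25     # probe point at the sine peak, where the true value is 1


def true_f(x):
    return np.sin(2 * np.pi * x)


def features(x):
    """Polynomial features of t = 2x - 1, mapping inputs in [0, 1] to [-1, 1]."""
    t = 2.0 * np.atleast_1d(x) - 1.0
    return np.vander(t, DEGREE + 1, increasing=True)


def ridge_fit(X, y, lam):
    """Closed-form ridge weights: solve (X^T X + lam * I) w = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def bias_variance(lam, trials=400, n=30, seed=0):
    """Monte Carlo bias² and variance of the ridge prediction at X0."""
    rng = np.random.default_rng(seed)
    phi0 = features(X0)[0]
    preds = np.empty(trials)
    for t in range(trials):
        x = rng.random(n)
        y = true_f(x) + rng.normal(0.0, SIGMA, n)
        preds[t] = phi0 @ ridge_fit(features(x), y, lam)
    return float((preds.mean() - true_f(X0)) ** 2), float(preds.var())


results = {lam: bias_variance(lam) for lam in (1e-8, 1e-2, 1.0, 100.0)}
for lam, (b2, v) in results.items():
    print(f"lambda = {lam:g}: bias² = {b2:.4f}, variance = {v:.4f}")
```

Reading the rows from small λ to large λ should show variance shrinking while bias² grows. Cross-validation looks for the λ where their sum is smallest.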
Common mistakes and edge cases
- Tuning on the test set. If you pick complexity by looking at test error, you have leaked it and your estimate of the U-curve's minimum is optimistic. Use a separate validation set or cross-validation; reserve the test set for one final read.
- Treating a small train–val gap as proof of a good fit. A small gap with both errors high is underfitting, not success. The gap mostly reflects variance; the level reflects bias. You need both numbers.
- Throwing data at a bias problem. More data cannot fix a model class that is too simple — a line stays a line. If the learning curves have already converged, spend the data budget on richer features or a bigger model instead.
- Forgetting the irreducible floor. Chasing accuracy below the Bayes error is impossible. Estimate σ² (for instance, from the disagreement of independent human labelers) before setting a target.
- Assuming the 0/1-loss decomposition is additive. The clean Bias² + Variance + σ² identity holds for squared error. For classification, variance can help when it pushes a prediction onto the correct side of the boundary, which is one reason high-variance learners like deep trees are popular base models for bagging.
- Comparing bias and variance across different loss scales. The two terms are only comparable in the same units of the loss. Mixing an L2 variance estimate with an L1 bias estimate produces a meaningless "total."
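The first mistake in the list above, tuning on the test set, is avoided by keeping model selection and final evaluation on separate data. The sketch below is one minimal way to do that with plain NumPy: it picks a polynomial degree by 5-fold cross-validation on a development set and reads the test set once. The data generator, the fold count, and all names are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

SIGMA = 0.25  # assumed noise standard deviation
rng = np.random.default_rng(0)


def true_f(x):
    return np.sin(2 * np.pi * x)


def draw(n):
    x = rng.random(n)
    return x, true_f(x) + rng.normal(0.0, SIGMA, n)


def cv_mse(x, y, degree, folds=5):
    """Mean held-out MSE of a polynomial fit under k-fold cross-validation."""
    idx = np.arange(len(x))
    scores = []
    for held_out in np.array_split(idx, folds):
        train = np.setdiff1d(idx, held_out)
        poly = Polynomial.fit(x[train], y[train], degree, domain=[0.0, 1.0])
        scores.append(np.mean((poly(x[held_out]) - y[held_out]) ** 2))
    return float(np.mean(scores))


x_dev, y_dev = draw(60)       # used for fitting and for choosing the degree
x_test, y_test = draw(2000)   # read exactly once, after the degree is fixed

cv_scores = {d: cv_mse(x_dev, y_dev, d) for d in range(1, 9)}
best_degree = min(cv_scores, key=cv_scores.get)

final = Polynomial.fit(x_dev, y_dev, best_degree, domain=[0.0, 1.0])
test_mse = float(np.mean((final(x_test) - y_test) ** 2))
print(f"chosen degree = {best_degree}, CV MSE = {cv_scores[best_degree]:.3f}, test MSE = {test_mse:.3f}")
```

Because the test set played no part in choosing the degree, the final number is an honest estimate of how the selected model performs on new data.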
Frequently asked questions
What is the bias-variance tradeoff in one sentence?
Expected test error breaks into bias squared (error from a model too simple to capture the true pattern), variance (error from a model so flexible it memorizes the noise in this particular training set), and irreducible noise; as you add complexity bias falls but variance rises, so the best model is the one that minimizes their sum, not either alone.
Is high bias underfitting or overfitting?
High bias is underfitting: the model is too rigid, makes systematic errors, and shows high error on both the training set and the test set, with the two errors close together. High variance is overfitting: low training error but a large gap up to test error, because the model fit the noise.
How do I tell them apart from a learning curve?
Plot training and validation error against training-set size. If both curves are high and have converged together, you are bias-bound — add capacity, not data. If training error is low but validation error sits well above it and the gap is still shrinking as data grows, you are variance-bound — more data, regularization, or a simpler model will help.
Why does adding training data reduce variance but not bias?
Variance measures how much the fitted function wobbles between different training samples; averaging over more points stabilizes the fit, so variance falls roughly as 1/n. Bias is a property of the model class itself — a straight line can never bend to fit a parabola no matter how many points you feed it, so more data leaves bias unchanged.
Does the tradeoff still hold for deep neural networks?
The classic U-shaped curve is real in the underparameterized regime, but huge networks show double descent: past the interpolation threshold where the model exactly fits the training data, test error falls again. Modern over-parameterized models are not a counterexample to the decomposition — bias, variance, and noise still add up — but implicit regularization from SGD keeps variance low even when capacity is enormous.
How does k in k-nearest-neighbors trade bias for variance?
Small k (like k=1) gives low bias and high variance: each prediction copies its single nearest neighbor, so a noisy point flips the answer. Large k averages many neighbors, smoothing predictions — low variance but high bias, since distant, irrelevant points drag the estimate toward the global mean. The variance term is the noise variance divided by k.
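The σ²/k claim is easy to verify when the training inputs are held fixed and only the label noise is resampled. The sketch below makes that fixed-design assumption, using an evenly spaced grid on [0, 1], and probes the sine peak at x = 0.25. The function name and the grid size are illustrative.

```python
import numpy as np

SIGMA = 0.25  # assumed noise standard deviation
X0 = 0.25     # query point at the sine peak


def true_f(x):
    return np.sin(2 * np.pi * x)


def knn_bias_variance(k, n=51, trials=20_000, seed=0):
    """Monte Carlo bias and variance of k-NN regression at X0 (fixed design)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)              # training inputs never change
    nearest = np.argsort(np.abs(x - X0))[:k]  # indices of the k closest inputs
    noise = rng.normal(0.0, SIGMA, size=(trials, k))
    preds = (true_f(x[nearest]) + noise).mean(axis=1)  # average of k noisy labels
    return float(preds.mean() - true_f(X0)), float(preds.var())


for k in (1, 5, 25):
    bias, var = knn_bias_variance(k)
    print(f"k = {k:2d}: bias = {bias:+.3f}, variance = {var:.5f}, sigma²/k = {SIGMA**2 / k:.5f}")
```

The measured variance should track σ²/k closely. The bias becomes more negative as k grows, because the wider neighborhood averages in points from lower on the sine curve.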