Machine Learning

Cross-Validation

Stop trusting one lucky split — rotate the test set and average

Cross-validation rotates which slice of data is held out for testing, so your accuracy estimate isn't a lucky split. K-fold averages k train-test cycles into one low-variance score that uses every row for both training and evaluation.

  • Common fold countk = 5 or 10
  • Model fits per runk
  • Train data per fold(k−1)/k
  • Each row testedexactly once
  • Leave-one-out costn fits

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The problem with one split

You train a classifier, set aside 20% of the rows as a test set, score it at 91% accuracy, and ship it. But that 91% is one draw from a lottery. Re-shuffle the data, take a different 20%, and you might get 86% — or 94%. On small or noisy datasets the spread between "lucky split" and "unlucky split" can easily be five percentage points, which is more than enough to make you pick the wrong model in a bake-off.

Cross-validation removes the luck. Instead of one held-out slice, you rotate the held-out slice through the whole dataset. Cut the data into k equal parts (called folds). Train on k−1 of them, test on the one you left out, record the score. Then put that fold back, pull out a different one, and repeat — until every fold has served exactly once as the test set. Average the k scores. Now every row has been used for training in some rounds and for testing in exactly one round, and your estimate no longer hinges on a single arbitrary partition.

The technique dates to the 1970s — Mervyn Stone and Seymour Geisser independently formalized it in 1974–1975 — and it remains the default way to estimate how a model will generalize before you have new data to try it on.

How k-fold works, step by step

The mechanism is mechanical and worth pinning down precisely:

  1. Shuffle the n rows (unless the data is ordered, e.g. time series — see the pitfalls section).
  2. Partition into k folds of size ≈ n/k. With n = 1000 and k = 5, each fold holds 200 rows.
  3. For i = 1 … k: train a fresh model on the other k−1 folds (800 rows), score it on fold i (200 rows), store sᵢ.
  4. Report the mean CV = (1/k) Σ sᵢ and, just as importantly, its standard deviation across folds.

The reported standard deviation is the under-appreciated half of the output. A model that scores 0.88 ± 0.01 is meaningfully different from one at 0.88 ± 0.06: the first is stable, the second is one bad fold away from disappointing you in production. When you compare two models, compare the distributions, not just the means.

A subtle but critical point: each fold trains a model that you then throw away. Cross-validation does not produce a final model — it produces an estimate of how well your training procedure generalizes. Once you trust that estimate, you retrain one final model on all n rows.

The bias-variance arithmetic

Let μ be the true generalization accuracy you wish you could measure. A single test fold gives an unbiased but noisy estimate with some variance σ². Averaging k fold scores reduces the variance of the mean — though not by the full factor of k, because the folds overlap heavily in their training sets and are therefore correlated. The standard error of a k-fold estimate is

SE(CV) = sqrt( (1/k) · [ σ² + (k − 1) · ρσ² ] )

where ρ is the correlation between fold scores. Because ρ > 0, the variance never drops to σ²/k the way independent samples would — this is exactly why a famous 2004 result (Bengio & Grandvalet) proved there is no unbiased estimator of the variance of k-fold CV. The bias side moves the other way: the larger k is, the more data each model trains on (closer to the full n), so the bias of the estimate shrinks while the variance grows. That tension is the whole reason fold count is a tuning choice and not a constant.

Complexity. If a single model fit costs T(m) on m rows, then k-fold costs k · T(n·(k−1)/k) ≈ k · T(n) — linear in the number of folds. Leave-one-out is the extreme k = n, costing n fits. For ordinary least squares there is a closed-form shortcut: LOO error equals Σ [ residualᵢ / (1 − hᵢ) ]² / n, where hᵢ is the i-th leverage (hat-matrix diagonal) — so LOO for linear regression costs one fit, not n.

Comparison of validation schemes

Holdoutk-foldStratified k-foldLOORepeated k-foldTimeSeriesSplit
Model fits1kknk · rk
Train fraction~0.8(k−1)/k(k−1)/k(n−1)/n(k−1)/kgrows each fold
Bias of estimatehighlowlowlowestlowlow
Variance of estimatehighestmoderatemoderatehighlowestmoderate
Preserves class balancenonoyesn/avia stratifyno
Respects time orderif manualnonononoyes
Best forhuge data, quick checkgeneral defaultclassificationtiny datanoisy small dataforecasting

The headline trade-off is fits versus stability. Holdout is one fit and noisy; LOO is n fits and nearly unbiased but high-variance; ordinary k-fold sits in the sweet spot. Repeated k-fold (run the whole k-fold several times with different shuffles and average) is the cheap way to squeeze out shuffle-luck on small datasets.

What the numbers actually say

  • 10-fold costs 10× a single fit. If your model takes 3 minutes to train, one 10-fold pass is 30 minutes. Tuning 50 hyperparameter settings with 10-fold is 500 fits — 25 hours. This is why people reach for k=5, or for a fast surrogate during search and a full k=10 only for the final estimate.
  • Variance reduction is roughly √k, not k. Because folds are correlated, going from holdout to 5-fold typically cuts the standard error of the accuracy estimate by about half, not by five — still a large, cheap win.
  • Leave-one-out on n = 10,000 means 10,000 fits. For anything but the cheapest models this is prohibitive, which is the whole reason k-fold exists: it caps the cost at k regardless of dataset size.
  • Leakage is worth percentage points. Fitting a StandardScaler or imputer on the full dataset before splitting routinely inflates the CV score by 1–3 points on tabular data, and far more when feature selection is done outside the loop — sometimes turning random noise features into apparent 80%+ "accuracy."

JavaScript implementation

// Generic k-fold cross-validation.
// fit(trainX, trainY) -> model; score(model, testX, testY) -> number.

function shuffle(arr, seed = 42) {
  // Mulberry32 PRNG for reproducible shuffles
  let s = seed >>> 0;
  const rand = () => (s = (s * 1664525 + 1013904223) >>> 0) / 4294967296;
  const a = arr.slice();
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

function kFold(X, y, k, fit, score, seed = 42) {
  const n = X.length;
  const idx = shuffle([...X.keys()], seed);   // shuffle row indices once
  const scores = [];

  for (let f = 0; f < k; f++) {
    const testIdx = idx.filter((_, i) => i % k === f);      // fold f is the test set
    const trainIdx = idx.filter((_, i) => i % k !== f);     // everything else trains
    const trainX = trainIdx.map(i => X[i]);
    const trainY = trainIdx.map(i => y[i]);
    const testX  = testIdx.map(i => X[i]);
    const testY  = testIdx.map(i => y[i]);

    const model = fit(trainX, trainY);   // a FRESH model per fold — never reuse
    scores.push(score(model, testX, testY));
  }

  const mean = scores.reduce((a, b) => a + b, 0) / k;
  const variance = scores.reduce((a, s) => a + (s - mean) ** 2, 0) / k;
  return { mean, std: Math.sqrt(variance), scores };
}

Two details carry the correctness. First, the row indices are shuffled once and the fold membership is then deterministic — every row lands in exactly one test fold. Second, fit is called inside the loop so each fold builds a brand-new model; reusing a model across folds would let earlier test rows leak into later training and quietly invalidate the whole exercise.

Python implementation

In practice you almost never hand-roll the loop — scikit-learn ships battle-tested splitters. The non-negotiable rule is to put every learned transform inside a Pipeline so it is re-fit on each training split:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

# Scaler is INSIDE the pipeline -> re-fit on the training folds only.
# This is the single line that prevents the most common CV leakage bug.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")

print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
print("per fold:", np.round(scores, 3))

And the manual loop, when you need the predictions or per-fold artifacts:

from sklearn.base import clone

def k_fold(estimator, X, y, cv):
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        model = clone(estimator)                 # fresh, unfitted copy each fold
        model.fit(X[train_idx], y[train_idx])    # transforms re-fit here only
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.array(scores)

The clone matters: it copies the hyperparameters but discards any fitted state, so fold 2 cannot inherit weights learned on fold 1's data. For nested tuning, wrap a GridSearchCV as the estimator and pass it to an outer cross_val_score — the inner CV picks hyperparameters, the outer CV reports an honest score on folds the search never touched.

Variants worth knowing

Stratified k-fold. Keeps each fold's class proportions equal to the whole dataset's. The default for classification — without it, an imbalanced problem can produce a fold with zero positive examples, making metrics like recall undefined or wildly noisy.

Group k-fold. When rows are clustered — multiple readings per patient, multiple frames per video — ordinary k-fold can put one patient's rows in both train and test, leaking identity. Group k-fold guarantees an entire group stays on one side of every split.

Leave-one-out (LOO) and leave-p-out. The extreme k = n. Nearly unbiased, but expensive and high-variance because the n training sets are almost identical, so their errors are strongly correlated. Reserved for tiny datasets.

Repeated k-fold. Run k-fold r times with different shuffles and average all k·r scores. The cheapest way to beat down shuffle-luck on small, noisy data.

Nested cross-validation. An inner loop for hyperparameter selection wrapped in an outer loop for performance estimation. The gold standard when you both tune and report, because it never lets the tuning peek at the rows used to score.

TimeSeriesSplit (forward chaining). Each fold trains on a prefix of the timeline and tests on the next block, never on earlier data. The correct scheme whenever rows are ordered in time.

Common bugs and edge cases

  • Preprocessing outside the loop. Fitting a scaler, imputer, PCA, or feature selector on the full dataset before splitting leaks test statistics into training. Always fit transforms on the training fold only — use a Pipeline.
  • Tuning on the same folds you report. Choosing the best hyperparameters by CV score and then quoting that score is double-dipping. The reported number is optimistic; use nested CV.
  • Shuffling time-series or grouped data. Random folds leak the future into the past, or one entity into both sides. Use TimeSeriesSplit or GroupKFold respectively.
  • Forgetting stratification on imbalanced classes. A 2% positive rate plus plain 10-fold can yield folds with no positives. Stratify.
  • Treating the CV mean as the model. CV estimates the procedure, not a model. After CV, retrain once on all the data to ship.
  • Ignoring the standard deviation. Two models with the same mean but very different fold spreads are not equivalent — the high-variance one is riskier in production.
  • Leaking duplicate or near-duplicate rows. If the same record appears twice (or augmented copies do), it can sit in train and test simultaneously, inflating the score. De-duplicate before splitting, or group.

Frequently asked questions

Why not just use a single train-test split?

A single split gives you one accuracy number whose value depends entirely on which rows happened to land in the test set. On a small dataset that luck-of-the-draw variance can be several percentage points, so two researchers running the same model with different random seeds can disagree about which model is better. Cross-validation averages over k different splits, which shrinks the variance of the estimate — not by the full factor of k, because the folds share most of their training data and so their scores are correlated, but typically enough to roughly halve the standard error while still using every row for testing exactly once.

How many folds should I use?

Five or ten are the standard choices. k=10 has lower bias (each model trains on 90% of the data) but higher variance and ten times the compute; k=5 is cheaper and often good enough. Leave-one-out (k=n) is nearly unbiased but has high variance and costs n model fits, so it is reserved for very small datasets. There is no fold count that is correct in general — it is a bias-variance-compute trade-off.

What is the difference between k-fold and stratified k-fold?

Plain k-fold shuffles rows and cuts them into k contiguous blocks. Stratified k-fold instead ensures each fold has roughly the same class proportions as the full dataset. On imbalanced classification — say 5% positives — plain k-fold can hand you a fold with zero positives, making that fold's metric meaningless. Stratification is the default for classification in scikit-learn for exactly this reason.

Should I scale or impute inside or outside the cross-validation loop?

Inside. Fit the scaler, imputer, feature selector, or any other learned transform on the training folds only, then apply it to the held-out fold. If you fit them on the whole dataset first, statistics from the test fold leak into training and your CV score is optimistically biased. Wrapping everything in a Pipeline guarantees the transforms are re-fit on each training split automatically.

Why do I need nested cross-validation for tuning?

If you pick hyperparameters by choosing whichever gives the best CV score, that best score is now an optimistic estimate — you selected it precisely because it was lucky. Nested CV puts hyperparameter search in an inner loop and performance estimation in an outer loop, so the number you report comes from folds the tuning never saw. The cost is multiplicative: an outer 5 and inner 5 with 20 candidate settings is 500 model fits.

Can I use ordinary k-fold on time-series data?

No. Random folds let the model train on future rows and test on past ones, which leaks information that would not exist in production and inflates the score. Use a forward-chaining scheme (scikit-learn's TimeSeriesSplit) where each fold trains on a prefix of the timeline and tests on the next block, never on earlier data.