Machine Learning

Gaussian Process Regression

Don't predict a line — predict every line that fits, then measure your own doubt

Gaussian process regression puts a distribution over functions: instead of one prediction it returns a mean plus a calibrated uncertainty band, exact in closed form via the kernel, at O(n³) training cost.

TrainingO(n³)
Predict meanO(n)
Predict varianceO(n²)
MemoryO(n²)
Practical exact limit≈ 10k–50k points

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The idea: a distribution over functions

Ordinary regression fits one function and reports one number per query. A Gaussian process does something stranger and more useful: it treats the unknown function itself as a random variable and reasons about every function that could explain your data at once. Before seeing anything, infinitely many smooth curves are plausible. Each observation rules out the curves that miss it. What survives is a tighter cloud of candidate functions — and the spread of that cloud at any input is your honest uncertainty there.

The defining property is this: for any finite set of input points, the function values at those points are jointly Gaussian (a multivariate normal). That's the entire definition of a Gaussian process — "a collection of random variables, any finite number of which have a joint Gaussian distribution," in the words of Rasmussen and Williams' canonical 2006 text. A Gaussian is fully specified by a mean and a covariance, so a GP is fully specified by a mean function m(x) (almost always taken as 0 after centering) and a covariance function, the kernel k(x, x′).

The kernel encodes a single assumption: nearby inputs produce correlated outputs. Tell the model how correlation decays with distance and you have implicitly described every function it considers possible — without ever writing down a basis, a polynomial degree, or a layer count.

The mechanism: conditioning a joint Gaussian

Stack your n training inputs into X with targets y, and the test inputs you want predictions for into X*. Build three kernel matrices: K = k(X, X) (n×n), K* = k(X, X*), and K** = k(X*, X*). Assuming i.i.d. observation noise with variance σ², training and test values are jointly Gaussian, and the standard formula for conditioning one block of a multivariate normal on another gives the posterior:

mean(X*)  = K*ᵀ (K + σ²I)⁻¹ y
cov(X*)   = K** − K*ᵀ (K + σ²I)⁻¹ K*

That's the whole model. No iterative training loop, no SGD — the posterior is available in closed form. Two facts make it remarkable. First, the predictive variance does not depend on y at all — only on where you collected data. Sample at a point and uncertainty there drops regardless of what value you measured. Second, near a training point the mean is pulled through it and the variance collapses toward the noise floor σ²; far from any data the second term vanishes and you fall back to the prior variance k(x, x). That is the "pinch at the data, balloon in the gaps" shape you see in the visualization.

You never literally invert the matrix. The numerically stable recipe is a Cholesky factorization K + σ²I = LLᵀ, then two triangular solves. This is what dominates the O(n³) training cost; everything downstream is cheaper.

When to reach for a GP

Small data, expensive labels. Dozens to a few thousand points where every label costs a lab experiment, a simulation hour, or a human rating. GPs squeeze maximum information out of scarce data.
You need the error bars, not just the answer. Risk-sensitive forecasting, sensor fusion, and any setting where "I don't know" must be expressible.
Bayesian optimization. Tuning a black-box function (hyperparameters, drug dosages, chip layouts) by repeatedly asking "where should I sample next to learn the most?" The GP's variance is the exploration signal.
Spatial interpolation (kriging). Ore grades, pollution maps, temperature fields — geostatistics has used GPs under the name kriging since Danie Krige's 1951 mining work and Georges Matheron's 1960s formalization.
Smooth, low-dimensional inputs. GPs shine in 1–20 dimensions. Above that, distance-based kernels weaken and you need structure or deep features.

Avoid GPs when you have hundreds of thousands of raw high-dimensional inputs (images, text) — the cubic cost and the curse of dimensionality both bite. Reach for a neural network or gradient-boosted trees there.

GP regression vs other regressors

	Gaussian process	Linear / ridge	Random forest	Gradient boosting	Neural network	k-NN regression
Training cost	O(n³)	O(n·d²)	O(n·d·log n·T)	O(n·d·T)	O(n·epochs)	O(1) (lazy)
Predict cost	O(n) mean, O(n²) var	O(d)	O(T·depth)	O(T·depth)	O(network size)	O(n·d)
Calibrated uncertainty	Yes (exact posterior)	Only under Gaussian-noise assumption	Crude (tree variance)	Quantile add-on	No (needs Bayesian/ensemble bolt-on)	No
Scales to n = 10⁶	No (needs approximation)	Yes	Yes	Yes	Yes	Slow at predict
Handles raw images/text	No	No	Weak	Weak	Yes	No
Hyperparameter fit	Marginal likelihood (no CV)	One λ via CV	CV / OOB	CV (many knobs)	CV + tuning	CV on k
Smoothness control	Explicit (kernel choice)	Linear only	Piecewise-constant	Piecewise-constant	Implicit	Local average
Extrapolation	Reverts to prior (honest)	Linear trend	Flat (clamped)	Flat (clamped)	Unpredictable	Flat (clamped)

The one column no other row-2 model fills cheaply is calibrated uncertainty. That, plus learning hyperparameters without a validation split, is why GPs dominate small-data, label-expensive niches even though they can't touch a neural net on ImageNet.

What the numbers actually say

The O(n³) wall is real and steep. Doubling your data multiplies training work by 8×. Going from 1,000 points (~10⁹ flops, milliseconds) to 50,000 points (~1.25×10¹⁴ flops) is a ~125,000× jump — minutes-to-hours on a CPU and ~20 GB just to store the 50,000×50,000 kernel matrix in double precision. That memory ceiling, not time, is usually what stops you first.
Sparse approximations move the wall. Inducing-point methods (FITC, and Titsias' 2009 variational SGPR) pick m ≪ n pseudo-inputs and drop the cost to O(n·m²). With m ≈ 500 that's a ~10,000-fold speedup at n = 50,000 (the cost ratio is (n/m)²), putting million-point GPs within reach.
Calibration is measurable. On a well-specified GP, the empirical coverage of a 95% interval lands near 95% — you can verify it by counting how many held-out points fall inside the band. A model claiming 95% that only covers 70% is miscalibrated, and the marginal-likelihood fit exists precisely to prevent that.
Jitter matters numerically. Kernel matrices are notoriously ill-conditioned; practitioners add a tiny diagonal "jitter" of 10⁻⁶ to 10⁻⁴ before Cholesky to keep the factorization from failing on near-duplicate inputs.

JavaScript implementation

A complete exact GP with an RBF kernel — Cholesky factorization, predictive mean, and predictive variance. No dependencies.

// RBF (squared-exponential) kernel
const rbf = (a, b, sf2, len) => sf2 * Math.exp(-((a - b) ** 2) / (2 * len * len));

// Cholesky: A = L Lᵀ  (A symmetric positive-definite)
function cholesky(A) {
  const n = A.length;
  const L = Array.from({ length: n }, () => new Array(n).fill(0));
  for (let i = 0; i < n; i++)
    for (let j = 0; j <= i; j++) {
      let s = 0;
      for (let k = 0; k < j; k++) s += L[i][k] * L[j][k];
      L[i][j] = i === j ? Math.sqrt(A[i][i] - s) : (A[i][j] - s) / L[j][j];
    }
  return L;
}
const fwdSub = (L, b) => {           // solve L y = b
  const n = L.length, y = new Array(n).fill(0);
  for (let i = 0; i < n; i++) {
    let s = b[i];
    for (let k = 0; k < i; k++) s -= L[i][k] * y[k];
    y[i] = s / L[i][i];
  }
  return y;
};
const bwdSub = (L, b) => {           // solve Lᵀ x = b
  const n = L.length, x = new Array(n).fill(0);
  for (let i = n - 1; i >= 0; i--) {
    let s = b[i];
    for (let k = i + 1; k < n; k++) s -= L[k][i] * x[k];
    x[i] = s / L[i][i];
  }
  return x;
};

function fitGP(X, y, { sf2 = 1, len = 1, noise = 1e-2 } = {}) {
  const n = X.length;
  const K = X.map((xi, i) =>
    X.map((xj, j) => rbf(xi, xj, sf2, len) + (i === j ? noise + 1e-8 : 0)));
  const L = cholesky(K);
  const alpha = bwdSub(L, fwdSub(L, y));   // alpha = (K+σ²I)⁻¹ y
  return { X, L, alpha, sf2, len };
}

function predict(gp, xStar) {
  const { X, L, alpha, sf2, len } = gp;
  const ks = X.map(xi => rbf(xStar, xi, sf2, len));
  const mean = ks.reduce((s, k, i) => s + k * alpha[i], 0);
  const v = fwdSub(L, ks);                  // v = L⁻¹ k*
  const variance = sf2 - v.reduce((s, vi) => s + vi * vi, 0);
  return { mean, sd: Math.sqrt(Math.max(variance, 0)) };
}

// demo
const X = [-2, -1, 0, 1.5, 3], y = X.map(x => Math.sin(x));
const gp = fitGP(X, y, { sf2: 1.2, len: 1.0, noise: 0.01 });
console.log(predict(gp, 0.5));   // → { mean: ~0.48, sd: small (near data) }
console.log(predict(gp, 6.0));   // → { mean: ~0,    sd: large (far from data) }

Python implementation

The same exact GP in NumPy, plus the log marginal likelihood you would maximize to learn the hyperparameters.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf(A, B, sf2, length):
    # squared-exponential kernel between two stacks of inputs
    d2 = (A[:, None] - B[None, :]) ** 2
    return sf2 * np.exp(-d2 / (2 * length ** 2))

def fit_gp(X, y, sf2=1.0, length=1.0, noise=1e-2):
    K = rbf(X, X, sf2, length) + (noise + 1e-8) * np.eye(len(X))
    L = cho_factor(K, lower=True)            # Cholesky, O(n³)
    alpha = cho_solve(L, y)                   # (K+σ²I)⁻¹ y
    return dict(X=X, L=L, alpha=alpha, sf2=sf2, length=length)

def predict(gp, Xs):
    Ks = rbf(gp["X"], Xs, gp["sf2"], gp["length"])      # n × m
    mean = Ks.T @ gp["alpha"]                            # O(n·m)
    v = cho_solve(gp["L"], Ks)                           # L solve
    var = gp["sf2"] - np.sum(Ks * v, axis=0)             # diag of cov
    return mean, np.sqrt(np.maximum(var, 0))

def log_marginal_likelihood(X, y, sf2, length, noise):
    # the objective you maximize to learn sf2, length, noise — no CV needed
    K = rbf(X, X, sf2, length) + (noise + 1e-8) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    fit       = -0.5 * y @ alpha                         # data fit
    complexity = -np.sum(np.log(np.diag(L)))             # Occam term (½log|K|)
    const      = -0.5 * len(X) * np.log(2 * np.pi)
    return fit + complexity + const

X = np.array([-2., -1., 0., 1.5, 3.])
y = np.sin(X)
gp = fit_gp(X, y, sf2=1.2, length=1.0, noise=0.01)
m, s = predict(gp, np.array([0.5, 6.0]))
print(m, s)   # near-data point: tight s; far point: large s

The log_marginal_likelihood is the heart of GP model selection. Its complexity term — the log-determinant of the kernel — automatically penalizes overly flexible kernels, so maximizing it trades off fit against simplicity without a held-out validation set. In production you would hand this (and its gradient) to L-BFGS, exactly as scikit-learn's GaussianProcessRegressor and GPyTorch do.

Kernels and variants worth knowing

RBF / squared-exponential. The default. Infinitely differentiable, so its sample functions are extremely smooth. One length-scale per dimension (automatic relevance determination, ARD) lets the model learn which inputs matter and prune the rest.

Matérn kernels. A family parameterized by ν that controls roughness; ν = 3/2 and ν = 5/2 are the workhorses. Far more realistic than RBF for physical processes, which are rarely infinitely smooth — Stein's 1999 book argued the RBF is "unrealistically smooth" for most spatial data and Matérn should be the default.

Periodic and linear kernels. Encode known structure directly. A periodic kernel makes the GP extrapolate a repeating pattern; a linear kernel recovers Bayesian linear regression as a special case. Kernels add and multiply, so periodic × RBF gives locally-periodic-with-drift — the basis of the Mauna Loa CO₂ model in Rasmussen & Williams.

Sparse / variational GPs. SGPR and SVGP (Hensman et al., 2013) use inducing points and stochastic variational inference to scale to millions of points and to non-Gaussian likelihoods via mini-batches.

Deep kernel learning & deep GPs. Feed inputs through a neural network before the kernel (Wilson et al., 2016), or stack GPs in layers (Damianou & Lawrence, 2013), buying representation power at the price of the GP's clean closed form.

Common bugs and edge cases

Forgetting to add jitter. Two near-identical inputs make K nearly singular and Cholesky throws "not positive definite." Add 10⁻⁶·I before factorizing.
Not standardizing inputs and targets. A single length-scale assumes inputs share a scale. Z-score every feature; center the targets so the zero-mean prior is sensible. Skip this and the optimizer fights the geometry.
Confusing noise with signal variance. The noise term σ² sets how far the mean is allowed to miss each point; the signal variance sf² sets vertical spread. Setting noise to zero forces interpolation and makes the matrix unstable; setting it too high gives a flat, lazy fit.
Returning negative variance. Floating-point error in k** − vᵀv can dip slightly below zero near training points. Clamp at zero before taking the square root, as both implementations above do.
Expecting RBF to extrapolate a trend. Past a length-scale beyond your data the mean reverts to zero and the band re-widens. If you need a trend, add a linear or polynomial mean function, or a structured kernel — don't blame the GP.
Letting n creep past memory. The O(n²) kernel matrix, not runtime, is the usual killer. At n = 50,000 you are already at ~20 GB; switch to a sparse/inducing-point GP rather than buying RAM.

Frequently asked questions

Why is Gaussian process regression O(n³)?

Training inverts (or Cholesky-factorizes) the n×n kernel matrix over the n training points, which costs O(n³) time and O(n²) memory. That factorization is the whole model; once you have it, each prediction is O(n) for the mean and O(n²) for the variance. The cubic cost is why exact GPs stall around 10,000–50,000 points without approximation.

What does the kernel actually control in a GP?

The kernel defines how correlated two outputs are as a function of their inputs, and that single choice fixes the entire prior over functions. The RBF kernel's length-scale ℓ sets how quickly correlation decays — small ℓ gives wiggly functions and tight bands only very near data, large ℓ gives smooth functions and confidence that reaches farther. The signal variance scales the vertical spread, and the noise term sets how closely the mean must pass through each point.

What makes a GP's uncertainty 'calibrated'?

If the kernel and noise are well-chosen, a 95% predictive interval contains the true value about 95% of the time across many test points. The variance comes from the exact Gaussian posterior, not a bootstrap or an ensemble heuristic, so the bands shrink to near-zero at observations and grow back toward the prior far from data — exactly where a calibrated model should be unsure.

Gaussian process vs neural network — when do I pick which?

Use a GP when data is small (dozens to a few thousand points), you need honest uncertainty, and each label is expensive — hyperparameter tuning, physical experiments, Bayesian optimization. Use a neural network when you have hundreds of thousands of examples, high-dimensional raw inputs like images, and point accuracy matters more than calibrated error bars. GPs scale as O(n³); deep nets scale to billions of examples.

How do you fit a GP's hyperparameters?

By maximizing the log marginal likelihood — a closed-form expression in the kernel parameters that already balances data fit against model complexity (its log-determinant term is an automatic Occam's razor). You differentiate it and run gradient ascent (L-BFGS is standard). No separate validation set or cross-validation loop is required, which is one of the GP's quiet advantages on small data.

Does a GP extrapolate beyond the training data?

Not in the way you might hope. With a stationary kernel like the RBF, the posterior mean decays back toward the prior mean (usually zero) once you move more than a length-scale past the data, and the uncertainty band balloons back to the prior width. That's honest, not broken — but if you need structured extrapolation, encode it in the kernel (e.g. periodic or linear kernels) rather than expecting the RBF to invent a trend.