Machine Learning
Gradient Boosting
Stack a thousand mediocre trees and you get a champion
Gradient boosting builds a strong predictor by adding shallow trees one at a time, each fit to the negative gradient (the residual errors) of the loss so far, shrinking the mistakes of the ensemble before it.
- Ensemble typeSequential boosting
- Base learnerShallow tree (depth 3–8)
- Training timeO(M · n · features · depth)
- ReducesBias (not variance)
- Key knobLearning rate η ≈ 0.01–0.1
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The intuition: learn from your last mistake
Suppose you are predicting house prices and your first guess for every house is the dataset's average, say $300,000. That's a terrible model, but it's a start. Now look at the errors: house A actually sold for $450,000, so you were $150,000 too low; house B sold for $260,000, so you were $40,000 too high. Those leftover errors — the residuals — carry all the signal your first model missed.
Gradient boosting's idea is to fit a small tree to the residuals, not to the prices. That second tree won't predict price; it predicts "how wrong was the first model here, and in which direction?" Add a scaled-down slice of its predictions back to the running total, recompute the residuals, and fit a third tree to those. Each tree nudges the ensemble toward the truth by attacking only what's left over. After a few hundred rounds, a stack of individually weak trees becomes one of the most accurate predictors in machine learning.
The method was introduced by Jerome Friedman in 1999 (published as "Greedy Function Approximation: A Gradient Boosting Machine" in 2001), generalizing the AdaBoost idea of Freund and Schapire into a framework that works for any differentiable loss. Friedman's key insight: "fit the residuals" is a special case of "fit the negative gradient of the loss," and gradient descent in function space is what makes it general.
The precise mechanism
You want a function F(x) minimizing the average loss L(y, F(x)) over the data. Gradient boosting does this by adding one weak learner per round in an additive model:
F_M(x) = F_0(x) + η · h_1(x) + η · h_2(x) + … + η · h_M(x)
The algorithm proper:
- Initialize with the constant that minimizes the loss:
F_0(x) = argmin_c Σ L(y_i, c). For squared error that's the mean ofy; for log-loss it's the log-odds of the positive class. - For each round m = 1…M:
- Compute the pseudo-residuals — the negative gradient of the loss at the current predictions:
r_i = −∂L(y_i, F(x_i)) / ∂F(x_i), evaluated atF = F_{m−1}. - Fit a shallow regression tree
h_mto the pairs(x_i, r_i). - (Optionally) solve a one-dimensional line search for the optimal leaf values
γthat minimize the true loss inside each leaf. - Update:
F_m(x) = F_{m−1}(x) + η · h_m(x), whereηis the learning rate.
- Compute the pseudo-residuals — the negative gradient of the loss at the current predictions:
- Predict with
F_M(x)(apply the inverse link, e.g. a sigmoid, for classification).
For squared-error loss L = ½(y − F)², the negative gradient is −∂L/∂F = (y − F) — literally the residual. That's the special case that gives the algorithm its "fit the errors" intuition. For binary log-loss the pseudo-residual is y − p where p = σ(F) is the predicted probability, and modern implementations (XGBoost) also use the second derivative (the Hessian) for a Newton step rather than plain gradient descent.
Complexity. Training is M trees, each costing roughly O(n · features · depth) for an exact split search, or O(n · features) per level with histogram binning. Total training is O(M · n · features · depth). Crucially the trees are built sequentially — round m needs the residuals produced by round m−1 — so unlike a random forest you cannot train the trees in parallel (you parallelize within each tree's split search instead). Prediction is O(M · depth): walk every tree from root to leaf and sum.
When to reach for boosting
- Tabular, structured data — the regime where boosted trees still beat deep learning on most Kaggle competitions and production tables.
- Heterogeneous features — mixed numeric and categorical columns of wildly different scales; trees don't need normalization.
- You want top accuracy and can afford tuning — boosting has more knobs than a random forest, but rewards them with lower error.
- Ranking and custom objectives — because it only needs a differentiable loss, it powers learning-to-rank (LambdaMART) and quantile regression out of the box.
Skip it when you need a model you can train in one parallel pass with almost no tuning (use a random forest), when you have raw images/audio/text (use a neural network), or when you have so little data that any high-variance learner will overfit. Boosting's sequential dependence also makes it slower to train than bagging on huge datasets unless you use a histogram-based library.
Gradient boosting vs other ensembles
| Gradient boosting | Random forest | AdaBoost | Bagging | Single tree | |
|---|---|---|---|---|---|
| Tree construction | Sequential | Parallel | Sequential | Parallel | One tree |
| Mainly reduces | Bias | Variance | Bias | Variance | Neither |
| Base learner | Shallow tree (depth 3–8) | Deep tree (full) | Stump (depth 1) | Deep tree | One tree |
| Fits each learner to | Negative gradient / residual | Bootstrap sample | Re-weighted samples | Bootstrap sample | Raw labels |
| Overfits with more trees? | Yes — needs early stopping | No — plateaus safely | Yes | No | N/A |
| Training parallelism | Within-tree only | Fully parallel | Within-tree only | Fully parallel | Within-tree |
| Typical accuracy on tabular | Highest | High | Medium-high | Medium | Low |
| Tuning sensitivity | High (η, M, depth, subsample) | Low | Medium | Low | Low |
The headline contrast is bias vs variance. Random forest builds many independent deep trees and averages them, which cancels variance but can't fix a systematic bias. Boosting builds dependent shallow trees, each correcting the bias the previous ones left behind — which is why it usually wins, and also why it will happily keep fitting noise if you don't stop it.
What the numbers actually say
- Learning-rate / tree-count trade-off. Friedman's shrinkage result: dropping the learning rate from 0.1 to 0.01 typically needs roughly 10× more trees to reach the same training loss — so 100 trees at η=0.1 ≈ 1,000 trees at η=0.01 — but the slower path generalizes measurably better. Most practitioners settle around η = 0.05 with 300–1,000 rounds plus early stopping.
- Speed from histograms. LightGBM bins each feature into 255 buckets, so a split search costs
O(255 · features)per node instead ofO(n · features). On a 10-million-row dataset that collapses the per-node split candidates from millions to a few hundred, and LightGBM's paper reports up to a ~20× end-to-end training speedup over conventional GBDT — turning hours into minutes. - Subsampling helps and speeds up. Stochastic gradient boosting (Friedman 2002) uses a random ~50% of rows per tree; it both regularizes and halves the per-tree cost, often improving test accuracy.
- Diminishing returns. On most tabular problems validation loss flattens after a few hundred trees; rounds beyond that buy fractions of a percent while steadily increasing prediction latency (each tree adds O(depth) work at inference).
JavaScript implementation
A minimal least-squares gradient booster using decision stumps (depth-1 trees) as the weak learner. The residual-fitting loop is the whole idea; the stump finder is the standard "best single threshold split" search.
// Fit a depth-1 regression stump to targets `t` over feature matrix X.
function bestStump(X, t) {
const n = X.length, F = X[0].length;
let best = { sse: Infinity, feat: 0, thr: 0, left: 0, right: 0 };
for (let f = 0; f < F; f++) {
const vals = [...new Set(X.map(r => r[f]))].sort((a, b) => a - b);
for (let k = 0; k < vals.length - 1; k++) {
const thr = (vals[k] + vals[k + 1]) / 2;
let sl = 0, sr = 0, cl = 0, cr = 0;
for (let i = 0; i < n; i++) {
if (X[i][f] <= thr) { sl += t[i]; cl++; } else { sr += t[i]; cr++; }
}
if (!cl || !cr) continue;
const ml = sl / cl, mr = sr / cr; // leaf = mean residual
let sse = 0;
for (let i = 0; i < n; i++) {
const p = X[i][f] <= thr ? ml : mr;
sse += (t[i] - p) ** 2;
}
if (sse < best.sse) best = { sse, feat: f, thr, left: ml, right: mr };
}
}
return best;
}
function fitGradientBoosting(X, y, { rounds = 200, lr = 0.1 } = {}) {
const base = y.reduce((a, b) => a + b, 0) / y.length; // F_0 = mean(y)
const trees = [];
const F = new Array(X.length).fill(base);
for (let m = 0; m < rounds; m++) {
// squared-error pseudo-residual = y - F (the negative gradient)
const r = y.map((yi, i) => yi - F[i]);
const stump = bestStump(X, r);
for (let i = 0; i < X.length; i++) {
const pred = X[i][stump.feat] <= stump.thr ? stump.left : stump.right;
F[i] += lr * pred; // shrink, then add
}
trees.push(stump);
}
return { base, lr, trees };
}
function predict(model, row) {
let f = model.base;
for (const t of model.trees) {
f += model.lr * (row[t.feat] <= t.thr ? t.left : t.right);
}
return f;
}
Two details worth flagging. First, F is the running prediction; the residual is recomputed against it every round, which is what makes tree m depend on tree m−1. Second, the learning rate multiplies the tree's output before it's added — never bake shrinkage into the leaf values, or prediction time will double-apply it.
Python implementation
The same algorithm with scikit-learn stumps as the base learner, plus a binary-classification version using the log-loss gradient.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
class GradientBoostingRegressor:
def __init__(self, rounds=200, lr=0.1, max_depth=3):
self.rounds, self.lr, self.max_depth = rounds, lr, max_depth
self.trees, self.base = [], 0.0
def fit(self, X, y):
self.base = y.mean() # F_0 minimizes squared error
F = np.full(len(y), self.base)
for _ in range(self.rounds):
residual = y - F # negative gradient of ½(y-F)²
tree = DecisionTreeRegressor(max_depth=self.max_depth)
tree.fit(X, residual)
F += self.lr * tree.predict(X)
self.trees.append(tree)
return self
def predict(self, X):
F = np.full(X.shape[0], self.base)
for tree in self.trees:
F += self.lr * tree.predict(X)
return F
class GradientBoostingClassifier:
"""Binary classifier with log-loss; fits the gradient y - sigmoid(F)."""
def __init__(self, rounds=200, lr=0.1, max_depth=3):
self.rounds, self.lr, self.max_depth = rounds, lr, max_depth
self.trees, self.base = [], 0.0
@staticmethod
def _sigmoid(z):
return 1.0 / (1.0 + np.exp(-z))
def fit(self, X, y):
p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
self.base = np.log(p / (1 - p)) # log-odds init
F = np.full(len(y), self.base)
for _ in range(self.rounds):
residual = y - self._sigmoid(F) # log-loss pseudo-residual
tree = DecisionTreeRegressor(max_depth=self.max_depth)
tree.fit(X, residual)
F += self.lr * tree.predict(X)
self.trees.append(tree)
return self
def predict_proba(self, X):
F = np.full(X.shape[0], self.base)
for tree in self.trees:
F += self.lr * tree.predict(X)
return self._sigmoid(F)
Note that the classifier fits a regression tree to a continuous gradient (y − p), then maps the accumulated score through a sigmoid only at prediction time. That's the standard trick: boosting always regresses on gradients in the raw "score" space, and the link function lives outside the trees.
Variants worth knowing
XGBoost (2016, Chen & Guestrin). Adds a second-order Taylor expansion of the loss — it uses both the gradient and the Hessian to choose splits (a Newton step), plus explicit L1/L2 penalties on leaf weights and a sparsity-aware split finder. The library that won most of Kaggle in the late 2010s.
LightGBM (Microsoft, 2017). Grows trees leaf-wise (split the leaf with the largest loss reduction) instead of level-wise, uses histogram binning, and adds GOSS (keep large-gradient rows, subsample the rest) and EFB (bundle sparse features). Fastest option on millions of rows.
CatBoost (Yandex, 2017). Uses ordered boosting and ordered target statistics to avoid the target leakage that naive categorical encoding causes, and builds symmetric (oblivious) trees that are fast to evaluate. Strong default behavior on categorical-heavy data.
Stochastic gradient boosting. Friedman's 2002 variant: subsample rows (and, in modern libs, columns) per tree. Cheap, regularizing, and almost always on by default now.
LambdaMART. Boosting with a ranking objective — the gradients come from a pairwise ranking loss weighted by the change in NDCG. The backbone of many learning-to-rank search systems.
Common bugs and edge cases
- Confusing residual with gradient. "Fit the residuals" is only correct for squared-error loss. For log-loss, Huber, or quantile loss you must fit the negative gradient, which is not the raw error. Hard-coding
y − Ffor classification gives wrong, unbounded updates. - Applying the learning rate twice. Shrink at the update step or in prediction, never both. Double-applying makes a model that looks fine in training and predicts half-scale at inference.
- No early stopping. Because training loss decreases monotonically, "train until it stops improving on the training set" never stops the right way — it overfits. Hold out a validation set and stop when validation loss plateaus.
- Trees too deep. A depth-15 weak learner isn't weak; the ensemble overfits in a handful of rounds. Keep depth 3–8 (or cap leaves at 8–32) and let the number of rounds do the work.
- Forgetting the constant initializer. Skipping
F_0(the mean or log-odds) makes the first few trees waste capacity fitting the global offset instead of the structure. - Expecting parallel training. You can't fan trees across cores the way a random forest does — round
mneeds roundm−1's residuals. Parallelism lives inside each tree's split search. - Data leakage via target encoding. Encoding categoricals with the mean target before the split leaks the label; this is exactly what CatBoost's ordered statistics were designed to prevent.
Frequently asked questions
What is the difference between gradient boosting and random forest?
Random forest builds many deep trees in parallel on bootstrap samples and averages them — it reduces variance. Gradient boosting builds shallow trees sequentially, each correcting the residual errors of the ensemble so far — it reduces bias. Boosting usually scores higher but overfits if you add too many trees; a random forest is far harder to over-train.
Why is it called gradient boosting?
Each new tree is fit to the negative gradient of the loss function with respect to the current predictions — that gradient is the steepest-descent direction in function space. For squared-error loss the negative gradient is exactly the residual y − F(x), which is why the algorithm looks like 'fit the leftover errors.' For other losses (log-loss, Huber) the target is the gradient, not the literal residual.
What does the learning rate do in gradient boosting?
The learning rate (shrinkage, often 0.01–0.1) scales each tree's contribution before adding it. Smaller rates mean each tree corrects less, so you need more trees, but the ensemble generalizes better. It is the single most important regularization knob: a rate of 0.1 with 100 trees and a rate of 0.01 with 1000 trees can reach similar accuracy, with the slower one usually more robust.
Why are gradient boosting trees shallow?
Each tree only needs to capture a small slice of the remaining error, so depths of 3–8 (or 8–32 leaves) are typical. Shallow trees are weak learners — high bias, low variance — and boosting works precisely by stacking many weak learners. A forest of deep trees would each already overfit, leaving nothing useful to boost.
Does gradient boosting overfit?
Yes — unlike bagging, adding more boosting rounds keeps lowering training loss and will eventually memorize noise. You control it with a small learning rate, shallow trees, row and column subsampling (stochastic gradient boosting), L1/L2 leaf penalties, and early stopping on a validation set when the validation loss stops improving.
What is the difference between XGBoost, LightGBM, and CatBoost?
All three are gradient boosting libraries. XGBoost adds a second-order (Newton) loss approximation and strong regularization. LightGBM grows trees leaf-wise and uses histogram binning for large datasets, making it the fastest on millions of rows. CatBoost uses ordered boosting and native categorical encoding to avoid target leakage. They share the same core algorithm and differ mainly in tree-growth strategy and engineering.