Machine Learning
Random Forest
Average a crowd of overconfident trees — and the crowd is wise
A random forest is an ensemble that averages hundreds of decorrelated decision trees — each grown on a bootstrap sample with a random subset of features at every split — to slash the high variance of a single tree without raising its bias.
- Base learnerDecision tree
- TrainingEmbarrassingly parallel
- Bootstrap row leave-out≈ 36.8%
- Features per split√p (class.) · p/3 (reg.)
- InferenceO(T · depth)
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a random forest works
A single decision tree is a high-variance learner. Grow it to full depth and it carves the feature space into tiny regions that perfectly fit the training data — including its noise. Change a handful of training rows and the splits near the top can flip, cascading into a completely different tree and completely different predictions. Low bias, terrible variance.
Leo Breiman's 2001 random forest tames that variance with one idea applied twice: inject randomness, then average it out. If you average many independent estimators, the variance of the average drops by a factor equal to the number of estimators. The whole design is a campaign to make the trees as independent as possible so the average actually delivers that reduction.
Two sources of randomness do the work:
- Bagging (bootstrap aggregating). Each tree trains on a bootstrap sample: draw
nrows at random with replacement from then-row dataset. Some rows appear several times, others not at all. Every tree sees a different slice of reality, so they make different mistakes. - Random feature subsampling. At every split, the tree may only consider a random subset of
mof thepfeatures (the classic default ism = √pfor classification). This stops one dominant feature from being the first split in every tree.
At prediction time you push the input down all T trees. For classification, the forest takes the majority vote; for regression, the mean of the leaf values. That averaging is the entire payoff.
Why decorrelation, not just averaging
The variance math is what justifies the second random trick. If you average T identically-distributed predictors, each with variance σ² and pairwise correlation ρ, the variance of the average is:
Var(avg) = ρ·σ² + (1 − ρ)/T · σ²
The second term vanishes as you add trees — that's plain bagging. But the first term, ρ·σ², is a floor you can never average away. If your trees are 90% correlated, you're stuck near 90% of the original variance no matter how many you grow. Feature subsampling exists to push ρ down. That is the single insight that separates a random forest from "bagged trees," and it's why dropping the feature randomness measurably hurts accuracy on most datasets.
Out-of-bag error: free validation
Bagging hands you a validation set for free. A bootstrap sample of size n drawn with replacement omits each particular row with probability (1 − 1/n)ⁿ → 1/e ≈ 0.368. So roughly 36.8% of rows are out-of-bag (OOB) for any given tree — never used to train it.
To get an honest accuracy estimate, score each training row using only the trees that did not see it, and aggregate. This OOB estimate tracks k-fold cross-validation closely but costs nothing extra: the trees already exist. On a 1-million-row dataset, that's a free held-out estimate over ~368,000 rows per tree without ever splitting off a validation set.
When to choose a random forest
- Tabular data with mixed feature types — numeric and categorical together, no scaling required. Forests are the default strong baseline for structured data.
- You want a model that works out of the box. Few hyperparameters matter, and the defaults are usually fine. Hard to break.
- Nonlinear interactions you can't specify. Trees capture feature interactions automatically without you engineering them.
- You need feature importance or OOB estimates. Both come built in.
- Robustness to outliers and irrelevant features — a noisy feature simply rarely gets chosen as a split.
Reach for something else when you need a few percent more accuracy on a competition leaderboard (gradient boosting), when the data is images, audio, or text (deep nets), when you need a tiny low-latency model (forests of 500 deep trees are large), or when the relationship is genuinely linear and a logistic regression would be both faster and more interpretable.
Random forest vs other ensembles
| Random forest | Bagged trees | Gradient boosting | Extra-Trees | Single tree | |
|---|---|---|---|---|---|
| Tree growth | Parallel, deep | Parallel, deep | Sequential, shallow | Parallel, deep | One, any depth |
| Randomness | Bagging + feature subset | Bagging only | Row/feature subsample (optional) | + random split thresholds | None |
| Mainly reduces | Variance | Variance | Bias | Variance (more) | — |
| Overfits with more trees? | No | No | Yes (needs early stopping) | No | n/a |
| Tuning effort | Low | Low | High | Low | Low |
| Typical accuracy | Strong | Good | Often best | Strong, faster train | Weak |
| Real-world use | scikit-learn, ranger, H2O | Rare standalone | XGBoost, LightGBM, CatBoost | scikit-learn ExtraTrees | CART, interpretable models |
The headline trade-off is variance versus bias. A forest grows deep low-bias trees and averages away their variance; boosting grows shallow high-bias trees and adds them up to chase down the bias. That's why forests cannot overfit by adding trees but boosting can, and why boosting needs early stopping while forests just need "enough."
What the numbers actually say
- Bootstrap leaves out ≈ 36.8% of rows per tree. The limit
1/eis exact asn → ∞; even atn = 1000it's already 36.77%. - Variance floor is
ρ·σ². Drop tree correlation from 0.9 to 0.3 and you cut the irreducible-by-averaging variance roughly threefold — that's the entire return on feature subsampling. - Diminishing returns past ~200 trees. OOB error typically flattens between 100 and 300 trees; the 1000th tree adds <0.1% accuracy while doubling the inference cost of the 500th.
- Training is embarrassingly parallel.
Ttrees onkcores is ~T/kwall-clock — a 500-tree forest on 16 cores trains in roughly the time of 32 sequential trees. - Inference is
O(T · depth). 500 trees at depth 20 is ~10,000 comparisons per prediction — fast, but the model itself can be hundreds of megabytes, which is the real deployment cost. - Training cost is
O(T · n · √p · log n)in the typical case — the√pinstead ofpis feature subsampling earning its keep on wide data.
JavaScript implementation
A compact classification forest — bootstrap sampling, √p feature subsets, Gini-driven splits, and majority voting. Trimmed for clarity (no pruning, numeric features only).
const gini = (counts, n) => {
let imp = 1;
for (const c of counts.values()) imp -= (c / n) ** 2;
return imp;
};
function classCounts(rows) {
const m = new Map();
for (const r of rows) m.set(r.y, (m.get(r.y) || 0) + 1);
return m;
}
// Build one tree on a bootstrap sample, trying only mFeatures per split.
function buildTree(rows, nFeatures, mFeatures, maxDepth, depth = 0) {
const counts = classCounts(rows);
// pure node, depth cap, or too few rows → leaf
if (counts.size === 1 || depth >= maxDepth || rows.length < 2) {
let best, bestN = -1;
for (const [k, v] of counts) if (v > bestN) { best = k; bestN = v; }
return { leaf: true, label: best };
}
// pick a random subset of features to consider at this split
const feats = [...Array(nFeatures).keys()];
for (let i = feats.length - 1; i > 0; i--) { // Fisher–Yates
const j = (Math.random() * (i + 1)) | 0;
[feats[i], feats[j]] = [feats[j], feats[i]];
}
const tryFeats = feats.slice(0, mFeatures);
let bestGain = 0, bestFeat = null, bestThresh = 0, bestL = null, bestR = null;
const parentImp = gini(counts, rows.length);
for (const f of tryFeats) {
const vals = [...new Set(rows.map(r => r.x[f]))].sort((a, b) => a - b);
for (let i = 0; i < vals.length - 1; i++) {
const thresh = (vals[i] + vals[i + 1]) / 2;
const L = rows.filter(r => r.x[f] <= thresh);
const R = rows.filter(r => r.x[f] > thresh);
if (!L.length || !R.length) continue;
const wImp = (L.length * gini(classCounts(L), L.length)
+ R.length * gini(classCounts(R), R.length)) / rows.length;
const gain = parentImp - wImp;
if (gain > bestGain) { bestGain = gain; bestFeat = f; bestThresh = thresh; bestL = L; bestR = R; }
}
}
if (bestFeat === null) { // no useful split found
let best, bestN = -1;
for (const [k, v] of counts) if (v > bestN) { best = k; bestN = v; }
return { leaf: true, label: best };
}
return {
leaf: false, feat: bestFeat, thresh: bestThresh,
left: buildTree(bestL, nFeatures, mFeatures, maxDepth, depth + 1),
right: buildTree(bestR, nFeatures, mFeatures, maxDepth, depth + 1),
};
}
function bootstrap(rows) {
const out = [];
for (let i = 0; i < rows.length; i++) out.push(rows[(Math.random() * rows.length) | 0]);
return out;
}
function trainForest(rows, { nTrees = 200, maxDepth = 20 } = {}) {
const nFeatures = rows[0].x.length;
const mFeatures = Math.max(1, Math.round(Math.sqrt(nFeatures)));
const forest = [];
for (let t = 0; t < nTrees; t++)
forest.push(buildTree(bootstrap(rows), nFeatures, mFeatures, maxDepth));
return forest;
}
function predictTree(node, x) {
while (!node.leaf) node = x[node.feat] <= node.thresh ? node.left : node.right;
return node.label;
}
function predictForest(forest, x) {
const votes = new Map();
for (const tree of forest) {
const y = predictTree(tree, x);
votes.set(y, (votes.get(y) || 0) + 1); // majority vote
}
let best, bestN = -1;
for (const [k, v] of votes) if (v > bestN) { best = k; bestN = v; }
return best;
}
Two details worth flagging. First, bootstrap samples with replacement — using rows.length draws, not a shuffled slice — which is what produces the ~37% OOB rows. Second, the feature subset is re-drawn at every split, not once per tree; sampling once per tree is a common bug that quietly re-correlates the forest.
Python implementation
In practice you reach for scikit-learn, but it's worth seeing the full API including the parts that make forests pleasant — OOB scoring and feature importances come for free.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(
n_estimators=300, # number of trees; more never hurts accuracy
max_features="sqrt", # √p features tried per split — the decorrelation knob
max_depth=None, # grow trees fully; the average tames the variance
min_samples_leaf=1,
bootstrap=True,
oob_score=True, # free held-out estimate from out-of-bag rows
n_jobs=-1, # train all trees in parallel
random_state=42,
)
clf.fit(X, y)
print(f"OOB accuracy: {clf.oob_score_:.4f}") # ~0.96, no held-out set needed
print(f"5-fold CV: {cross_val_score(clf, X, y, cv=5).mean():.4f}")
# Mean-decrease-in-impurity importances (fast but biased to high-cardinality features)
order = np.argsort(clf.feature_importances_)[::-1][:5]
for i in order:
print(f"feature {i:2d}: {clf.feature_importances_[i]:.3f}")
Prefer sklearn.inspection.permutation_importance over the built-in feature_importances_ when importance actually matters: impurity-based importance inflates high-cardinality and continuous features, while permutation importance measures the real accuracy drop when a feature is shuffled.
Variants worth knowing
Extremely Randomized Trees (Extra-Trees). Goes one step further: instead of searching for the best split threshold, it picks split thresholds at random and keeps the best of those. Even more decorrelation, faster training (no threshold search), slightly higher bias. Often matches a random forest at a fraction of the training cost. Geurts, Ernst & Wehenkel, 2006.
Gradient-boosted trees. The other dominant tree ensemble — sequential rather than parallel. Each shallow tree fits the residual errors of the running model, attacking bias instead of variance. XGBoost, LightGBM, and CatBoost are the production implementations; they usually edge out forests on accuracy at the cost of careful tuning and early stopping.
Isolation Forest. A forest repurposed for anomaly detection. It builds trees with random splits and measures how quickly each point gets isolated — anomalies sit in sparse regions and fall out near the root after few splits, so a short average path length flags them.
Quantile Regression Forests. Instead of averaging the leaf means, keep the full distribution of training targets in each leaf, letting you read off prediction intervals and arbitrary quantiles — not just a point estimate.
Mondrian Forests. An online variant whose trees can be grown incrementally as data streams in, giving principled uncertainty estimates without retraining from scratch.
Common bugs and edge cases
- Sampling features once per tree instead of per split. This silently re-correlates the trees and erases most of the variance reduction. Re-draw the feature subset at every node.
- Forgetting replacement in the bootstrap. Sampling without replacement (or a shuffled split) breaks the OOB guarantee and reduces sample diversity. It must be
ndraws with replacement. - Trusting impurity-based feature importance. It is biased toward high-cardinality and continuous features; an ID column can look "important." Use permutation importance for decisions.
- Class imbalance. Bootstrap samples inherit the imbalance, so the majority class dominates every tree's vote. Use stratified or balanced bootstrap sampling, or class weights, when one class is rare.
- Expecting extrapolation in regression. A forest can only predict averages of values it has seen in leaves, so it cannot extrapolate beyond the training range — feed it a future trend and it flat-lines at the last seen value.
- Treating it as interpretable. A single tree is a flowchart you can read; 500 of them are not. Reach for SHAP values or partial-dependence plots rather than claiming the forest is "explainable" by inspection.
- Leaking data through preprocessing. Fit encoders and imputers on training folds only — fitting them on the full dataset before bagging leaks information and inflates the OOB estimate.
Frequently asked questions
Why does a random forest use both bagging and random feature selection?
Bagging alone gives each tree a different bootstrap sample, but if one feature is highly predictive every tree splits on it first and the trees stay correlated — averaging correlated predictors barely lowers variance. Random feature selection forces different trees to split on different features, decorrelating them so the average actually pays off. You need both: bagging for sample diversity, feature subsampling for split diversity.
What is out-of-bag (OOB) error and why is it free?
Each bootstrap sample leaves out about 36.8% of the rows — these are that tree's out-of-bag rows. To score a row, you average only the trees that never saw it during training, giving a built-in validation estimate without a held-out set or cross-validation. It is free because the trees already exist; you just route each training row through the subset of trees that excluded it.
How many trees should a random forest have?
More trees never hurt accuracy — they only cost time and memory — and the error curve flattens fast. 100 to 300 trees is typical; 500 is a common safe default. Past a few hundred the OOB error stops improving measurably, so adding the 1000th tree buys almost nothing while doubling inference latency from the 500th.
What is the difference between a random forest and gradient boosting?
A random forest grows deep, independent trees in parallel and averages them to cut variance — each tree is a full model. Gradient boosting grows shallow trees sequentially, where each tree corrects the residual errors of the ones before it, cutting bias. Forests are easier to tune and parallelize; boosting (XGBoost, LightGBM) usually wins accuracy contests but overfits more readily and must be trained in order.
Can a random forest overfit?
Adding more trees cannot overfit — the average converges, it does not chase noise. But individual trees grown to full depth can each memorize their bootstrap sample, and if the features are too noisy or the data too small, the ensemble can still overfit. The usual fixes are limiting tree depth, raising the minimum samples per leaf, or lowering the number of features tried per split.
How does a random forest measure feature importance?
The classic measure is mean decrease in impurity (Gini importance): sum the impurity reduction each feature produces across all splits in all trees. It is fast but biased toward high-cardinality features. Permutation importance is more trustworthy: shuffle one feature's values in the OOB rows and measure how much accuracy drops — a feature that matters causes a big drop.