Machine Learning

Decision Tree

Q: What's the difference between Gini impurity and entropy?

Both measure how mixed a node's labels are, and in practice they pick almost the same splits. Gini is 1 − Σpᵢ² and entropy is −Σpᵢ·log₂pᵢ. Gini is slightly cheaper because it skips the logarithm, which is why CART and scikit-learn default to it; the two disagree on the chosen split in well under 2% of cases.

Q: Why does a single decision tree overfit so easily?

A tree grown to purity memorizes the training set — it keeps splitting until every leaf is a single class, including the noise. With no depth limit a tree can reach 100% train accuracy and far worse test accuracy. Pruning, a max-depth cap, or a minimum-samples-per-leaf threshold trades a little training fit for much better generalization.

Q: Are decision trees greedy, and does that matter?

Yes. At each node the algorithm picks the single best split right now without looking ahead, so the tree is locally optimal but rarely globally optimal. Finding the smallest tree that fits the data is NP-complete, so the greedy heuristic is a deliberate trade: O(n·m·log n) training instead of an intractable search.

Q: How does a decision tree handle continuous features?

It sorts the values of a feature and tests thresholds at the midpoints between adjacent values, choosing the cut that most reduces impurity. A feature with k distinct values yields k−1 candidate thresholds, so the same feature can be split on repeatedly at different cutoffs deeper in the tree.

Q: Why is a random forest usually better than one tree?

A single tree has high variance — change a few rows and you get a very different tree. A random forest trains hundreds of trees on bootstrapped samples and random feature subsets, then averages them. Averaging cancels the per-tree noise and typically cuts test error by a third or more, at the cost of losing the single tree's readability.

Q: Do decision trees need feature scaling?

No. A split is a threshold test like x < 4.7, and the chosen threshold is invariant under any monotonic rescaling of that feature. Normalizing or standardizing inputs changes nothing about the tree, which is one reason trees are convenient on raw mixed-type tabular data where distance-based models like k-NN or SVMs need careful scaling.

A flowchart the machine writes for itself — every split chosen to cut confusion

A decision tree is an interpretable classifier that recursively splits data on the feature that most reduces impurity, building an if-then flowchart whose leaves are predictions — greedy to train, instant to read.

TrainingO(n·m·log n)
PredictionO(depth) ≈ O(log n)
Split criterionGini / entropy
Feature scalingNot needed
Optimal treeNP-complete

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a decision tree learns

Imagine a table of past loan applicants, each row tagged repaid or defaulted, with columns for income, age, and debt ratio. A decision tree turns that table into a flowchart of yes/no questions. At the top, it asks the single question that best separates the two classes — say, income < $42k? — then asks a follow-up question inside each branch, and keeps going until each leaf is (almost) all one class. To classify a new applicant you just walk the questions from the root to a leaf and read off the label.

The whole algorithm is one greedy loop. At each node it considers every feature and every candidate threshold, scores each split by how much it reduces impurity, and commits to the best one. Then it recurses on the two resulting child sets. This is the core of CART (Classification And Regression Trees, Breiman et al., 1984) and its cousins ID3 and C4.5.

The only real design choice is how to measure "best split." Two impurity measures dominate:

Gini    = 1 − Σ pᵢ²              (probability two random picks disagree)
Entropy = − Σ pᵢ · log₂ pᵢ      (bits needed to encode the label)

Here pᵢ is the fraction of class i in the node. A pure node (all one class) scores 0; a 50/50 split scores the maximum (0.5 for Gini, 1.0 for entropy). The split's quality is the impurity of the parent minus the size-weighted impurity of its children — for entropy this gap is called information gain. The tree picks the split with the largest gain.

A worked split

Suppose a node holds 10 applicants: 6 repaid, 4 defaulted. Its Gini is 1 − (0.6² + 0.4²) = 0.48. Now test the split income < $42k, which sends 4 rows left (1 repaid, 3 defaulted) and 6 right (5 repaid, 1 defaulted):

Gini(left)  = 1 − (0.25² + 0.75²) = 0.375
Gini(right) = 1 − (0.833² + 0.167²) = 0.278
weighted    = (4/10)·0.375 + (6/10)·0.278 = 0.317
gain        = 0.48 − 0.317 = 0.163

The tree computes that gain for every feature and every threshold, then keeps the largest. A gain of 0 means the split tells you nothing — both children are as mixed as the parent — so the branch stops and becomes a leaf.

When to reach for a tree

You need to explain the model. A tree is a literal list of if-then rules a regulator, doctor, or loan officer can read. This interpretability is the single biggest reason trees survive in a deep-learning world.
Mixed-type tabular data. Numeric, ordinal, and one-hot categorical columns coexist with no scaling, no normalization, and graceful handling of nonlinear interactions.
Fast, branchy inference. A prediction is a handful of comparisons — no matrix multiplies — so trees run in microseconds on a CPU and embed cleanly in low-power devices.
A baseline before the ensemble. One tree is the unit cell of random forests and gradient boosting, which win most tabular-data competitions; understanding the tree is prerequisite to understanding them.

Reach for something else when the signal is a smooth function of the inputs (linear or logistic regression fit fewer parameters), when the data is high-dimensional and sparse like text (linear SVMs dominate), or when you have raw images or audio (convolutional and transformer networks). A single tree's axis-aligned cuts also struggle with diagonal decision boundaries — it approximates a 45° line with a noisy staircase.

Decision tree vs other classifiers

	Decision tree	Random forest	Gradient boosting	Logistic regression	k-NN
Interpretable	Yes (rules)	Partly (importances)	Partly (importances)	Yes (weights)	No
Train cost	O(n·m·log n)	T trees × tree	T trees × tree	O(n·m·iters)	O(1) (lazy)
Predict cost	O(depth)	O(T·depth)	O(T·depth)	O(m)	O(n·m)
Needs scaling	No	No	No	Yes	Yes
Captures interactions	Yes	Yes	Yes	No (linear)	Yes (local)
Variance	High	Low	Low–medium	Low	Medium
Typical accuracy on tabular	Baseline	Strong	State of the art	Good if linear	Mediocre

The pattern is consistent: a lone tree is the most readable and the least accurate of the tree family, and ensembles trade that readability for a large accuracy gain by averaging away the tree's high variance.

What the numbers actually say

Training is O(n·m·log n). For each of m features you sort the n samples once per level (O(n·log n)) and the tree has O(log n) levels in the balanced case. On a 100k-row, 50-feature dataset that's seconds, not minutes.
Prediction is O(depth). A depth-10 tree answers in 10 comparisons — well under a microsecond. This is why trees power ad-ranking and fraud-scoring paths with single-digit-millisecond budgets.
Unpruned trees overfit hard. On a noisy dataset an unbounded tree routinely hits 100% train accuracy while test accuracy lags 15–25 points behind. Capping max_depth or setting min_samples_leaf closes most of that gap.
Finding the optimal tree is NP-complete. Hyafil and Rivest proved in 1976 that building the minimal-size tree consistent with the data is intractable, which is why every practical learner is greedy and accepts a locally optimal tree.
Gini and entropy agree ≈98% of the time. Empirically they pick the same split on the vast majority of nodes; Gini is the default because it avoids the logarithm.

JavaScript implementation

// rows: array of { x: [f0, f1, ...], y: classLabel }
const gini = rows => {
  const counts = {};
  for (const r of rows) counts[r.y] = (counts[r.y] || 0) + 1;
  let g = 1;
  for (const c in counts) { const p = counts[c] / rows.length; g -= p * p; }
  return g;
};

// Try every feature × threshold; return the split with the largest impurity drop.
function bestSplit(rows) {
  const parent = gini(rows);
  let best = { gain: 0, feat: -1, thresh: 0, left: null, right: null };
  const m = rows[0].x.length;
  for (let f = 0; f < m; f++) {
    const vals = [...new Set(rows.map(r => r.x[f]))].sort((a, b) => a - b);
    for (let i = 0; i < vals.length - 1; i++) {
      const thresh = (vals[i] + vals[i + 1]) / 2;          // midpoint candidate
      const left  = rows.filter(r => r.x[f] < thresh);
      const right = rows.filter(r => r.x[f] >= thresh);
      if (!left.length || !right.length) continue;
      const w = (left.length * gini(left) + right.length * gini(right)) / rows.length;
      const gain = parent - w;
      if (gain > best.gain) best = { gain, feat: f, thresh, left, right };
    }
  }
  return best;
}

function build(rows, depth = 0, maxDepth = 6, minLeaf = 1) {
  const majority = () => {                                 // most common label
    const c = {}; for (const r of rows) c[r.y] = (c[r.y] || 0) + 1;
    return Object.entries(c).sort((a, b) => b[1] - a[1])[0][0];
  };
  if (depth >= maxDepth || rows.length <= minLeaf || gini(rows) === 0)
    return { leaf: true, label: majority() };
  const s = bestSplit(rows);
  if (s.gain === 0) return { leaf: true, label: majority() }; // no useful split
  return {
    leaf: false, feat: s.feat, thresh: s.thresh,
    left:  build(s.left,  depth + 1, maxDepth, minLeaf),
    right: build(s.right, depth + 1, maxDepth, minLeaf),
  };
}

function predict(node, x) {
  while (!node.leaf) node = x[node.feat] < node.thresh ? node.left : node.right;
  return node.label;
}

Two details to notice. First, the stopping rules — maxDepth, minLeaf, and the pure-node check — are the only thing standing between you and a tree that memorizes the training set. Second, predict is a plain while loop down the tree; there is no arithmetic beyond a comparison per level, which is why inference is so cheap.

Python implementation

from collections import Counter
import math

def gini(rows):
    n = len(rows)
    counts = Counter(y for _, y in rows)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows):
    parent, m = gini(rows), len(rows[0][0])
    best = (0.0, None, None, None, None)          # gain, feat, thresh, left, right
    for f in range(m):
        vals = sorted({x[f] for x, _ in rows})
        for a, b in zip(vals, vals[1:]):
            thresh = (a + b) / 2
            left  = [r for r in rows if r[0][f] <  thresh]
            right = [r for r in rows if r[0][f] >= thresh]
            if not left or not right:
                continue
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            gain = parent - w
            if gain > best[0]:
                best = (gain, f, thresh, left, right)
    return best

def build(rows, depth=0, max_depth=6, min_leaf=1):
    majority = Counter(y for _, y in rows).most_common(1)[0][0]
    if depth >= max_depth or len(rows) <= min_leaf or gini(rows) == 0:
        return {"leaf": True, "label": majority}
    gain, f, thresh, left, right = best_split(rows)
    if gain == 0:
        return {"leaf": True, "label": majority}
    return {"leaf": False, "feat": f, "thresh": thresh,
            "left":  build(left,  depth + 1, max_depth, min_leaf),
            "right": build(right, depth + 1, max_depth, min_leaf)}

def predict(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["feat"]] < node["thresh"] else node["right"]
    return node["label"]

# In production, prefer the battle-tested CART implementation:
#   from sklearn.tree import DecisionTreeClassifier
#   clf = DecisionTreeClassifier(criterion="gini", max_depth=6, min_samples_leaf=5)
#   clf.fit(X, y)

The from-scratch version mirrors what scikit-learn does, but real implementations add surrogate splits for missing values, presorted feature arrays so each level is a linear scan instead of a re-sort, and cost-complexity pruning. Reach for DecisionTreeClassifier in anger; build it once by hand to understand it.

Variants worth knowing

CART. The standard binary tree using Gini for classification and squared-error reduction for regression. This is what scikit-learn, XGBoost, and LightGBM grow under the hood.

ID3 and C4.5. Quinlan's family. ID3 uses information gain and multi-way splits on categorical features; C4.5 adds gain-ratio (to stop high-cardinality features from winning by default), continuous-feature handling, and pruning.

Regression trees. Swap the impurity measure for variance or mean-squared-error and the leaf prediction for the mean of its samples. Same algorithm, continuous output — the basis of gradient-boosted regression.

Random forest. Hundreds of trees, each trained on a bootstrap sample with a random feature subset per split, then averaged. Drops variance dramatically; the canonical fix for the single tree's instability.

Gradient-boosted trees. Trees added sequentially, each fitting the residual errors of the ensemble so far. XGBoost, LightGBM, and CatBoost are the state of the art on tabular data — and every one of them is just decision trees plus arithmetic.

Oblique / optimal trees. Splits on linear combinations of features (oblique) capture diagonal boundaries the axis-aligned tree cannot; "optimal" trees (e.g. via mixed-integer programming) search for the globally best small tree, accepting much higher training cost for a more compact, accurate model.

Common bugs and edge cases

No depth limit. The default in some libraries is "grow until pure," which overfits. Always set max_depth or min_samples_leaf and validate on held-out data.
Letting an ID column in. A unique key or row index gives a "perfect" split with maximum gain at every node — the tree memorizes the training set through that column. Drop identifiers before training.
High-cardinality feature bias. Plain information gain favors features with many distinct values (more thresholds to win on). Use gain-ratio or Gini, and beware categorical columns with thousands of levels.
Threshold off-by-one. Use the midpoint between adjacent sorted values, not a raw value, so a sample equal to the threshold lands consistently. Pick one convention — < goes left — and apply it identically at train and predict time.
Class imbalance. On a 99/1 split a tree can score 99% accuracy by always predicting the majority and learning nothing. Use class weights, resampling, or evaluate with precision/recall rather than accuracy.
Comparing a tuned ensemble to a default tree. Benchmarks often pit a heavily-tuned forest against a vanilla single tree. Tune the tree's depth and leaf size first — a well-pruned tree is far more competitive than the unbounded default suggests.

Frequently asked questions

What's the difference between Gini impurity and entropy?