Data Structures

Centroid Decomposition

Halve the tree at every step — log N depth, distance queries in seconds

Centroid decomposition recursively splits a tree at its centroid — the node whose removal leaves balanced subtrees. The result is a recursion tree of depth log N, the basis for tree distance queries and divide-and-conquer on trees.

  • BuildO(N log N)
  • Centroid-tree depth≤ log₂ N
  • Distance queryO(log N) typical
  • Distance updateO(log N) per ancestor
  • MemoryO(N log N)
  • N = 10⁵~17 levels deep

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How centroid decomposition works

Consider a tree of N nodes. A centroid is a node whose removal partitions the remaining tree into subtrees each of size ≤ ⌊N/2⌋. Every tree has at least one centroid (sometimes two — adjacent — when the tree is even in size). You find it by computing subtree sizes from an arbitrary root, then walking toward the heaviest child until every remaining subtree has size ≤ N/2.

Centroid decomposition runs as a recursion:

  1. Find the centroid c of the current component.
  2. Make c a node of the centroid tree.
  3. Remove c. For each remaining component, recurse — the result becomes a child of c in the centroid tree.

The centroid tree has depth ≤ log₂ N because each recursive call at least halves component size. Crucially, the centroid tree is independent of the original tree's shape: a 10⁵-node linear path has a centroid tree of depth ~17, even though the original has depth 10⁵.

The key property: every path in the original tree passes through the centroid that becomes the lowest common ancestor of its endpoints in the centroid tree. That makes distance and counting problems decomposable — for each centroid c, look only at paths whose midpoint or LCA in the centroid tree is c.

Worked example — distance queries

Take a binary-tree-shaped tree of 15 nodes rooted at 1. The centroid sits somewhere near the middle — say node 4 — whose removal splits the tree into three roughly-7-node components. Inside each component find the local centroid (a node like 2 or 11), recurse, until every component has size 1.

For each node u and each of its centroid-tree ancestors c, precompute dist(u, c) in the original tree — at most log N entries per node. Total storage: O(N log N).

To answer "what's the closest red node to u", walk u's centroid-tree ancestors. At each ancestor c, look up the minimum dist(red, c) over all red nodes recorded at c, then add dist(u, c). Take the minimum across log N ancestors. Each query: O(log N) time, sometimes O(log² N) when updates are interleaved.

Numbers for N = 10⁵: 17 ancestors per node, 17 lookups per query. A red-update touches 17 ancestor records. Compare to brute force — O(N) BFS per query, 10⁵ ops × 10⁵ queries = 10¹⁰ ops, far too slow.

When to use centroid decomposition

  • Tree distance queries with updates — closest-red, k-th nearest, weight-conditioned shortest path.
  • Counting paths in a tree with a property — paths of length k, paths whose XOR equals a value, paths with sum divisible by k.
  • Divide and conquer on trees where the problem splits at any balanced cut — find pairs (u, v) satisfying some predicate.
  • Offline batched queries that benefit from the centroid-tree ancestor chain caching mechanism.
  • Augmented data structures over trees where path-shaped structures (HLD) don't apply.

Avoid it when the operation needs path-aggregate queries with point updates — HLD is faster and simpler. Avoid it when updates change tree structure — fall back to link-cut trees.

Centroid decomposition vs HLD vs Euler tour

Centroid decompositionHeavy-light decompositionEuler tour + segment tree
Path queryLimited / awkwardO(log² N)No native support
Distance queryO(log N) typicalO(log² N) with extra workO(log N) for LCA + O(1) depth
Counting paths-with-propertyO(N log² N) totalAwkwardAwkward
Subtree queryAwkwardO(log N)O(log N)
BuildO(N log N)O(N)O(N)
MemoryO(N log N)O(N)O(N)
Code length~80 lines~80 lines~50 lines

The three are complementary. Path query: HLD. Subtree query: Euler tour. Distance and counting: centroid.

Pseudo-code

subtreeSize[v] = size of v's subtree (re-computed per component)
deleted[v] = whether v has been removed

function dfsSize(u, parent):
    subtreeSize[u] = 1
    for v in children(u):
        if v == parent or deleted[v]: continue
        subtreeSize[u] += dfsSize(v, u)
    return subtreeSize[u]

function findCentroid(u, parent, totalSize):
    for v in children(u):
        if v == parent or deleted[v]: continue
        if subtreeSize[v] > totalSize / 2:
            return findCentroid(v, u, totalSize)
    return u

function decompose(u, centroidParent):
    totalSize = dfsSize(u, -1)
    c = findCentroid(u, -1, totalSize)
    centroidTreeParent[c] = centroidParent
    deleted[c] = true
    for v in neighbours(c):
        if not deleted[v]:
            decompose(v, c)

JavaScript implementation

class CentroidDecomp {
  constructor(tree) {
    this.n = tree.length;
    this.tree = tree;
    this.size = new Int32Array(this.n);
    this.deleted = new Uint8Array(this.n);
    this.parent = new Int32Array(this.n).fill(-1);  // centroid-tree parent
    this.decompose(0, -1);
  }

  dfsSize(u, p) {
    this.size[u] = 1;
    for (const v of this.tree[u]) {
      if (v === p || this.deleted[v]) continue;
      this.dfsSize(v, u);
      this.size[u] += this.size[v];
    }
  }

  findCentroid(u, p, total) {
    for (const v of this.tree[u]) {
      if (v === p || this.deleted[v]) continue;
      if (this.size[v] > total / 2) {
        // re-orient size for the walk
        this.size[u] = total - this.size[v];
        return this.findCentroid(v, u, total);
      }
    }
    return u;
  }

  decompose(u, centroidParent) {
    this.dfsSize(u, -1);
    const total = this.size[u];
    const c = this.findCentroid(u, -1, total);
    this.parent[c] = centroidParent;
    this.deleted[c] = 1;
    for (const v of this.tree[c]) {
      if (!this.deleted[v]) this.decompose(v, c);
    }
  }

  ancestors(u) {
    const path = [];
    while (u !== -1) { path.push(u); u = this.parent[u]; }
    return path;  // root of centroid tree last
  }
}

// Example
const tree = [[1,2], [0,3,4], [0,5,6], [1], [1], [2], [2]];
const cd = new CentroidDecomp(tree);
console.log(cd.parent);            // centroid-tree parents
console.log(cd.ancestors(3));      // path 3 → ... → root

Python implementation — closest-red query

Combine the centroid tree with per-centroid sorted distance lists, and you can answer "closest red node to u" with point updates. Each centroid keeps a min-heap of distances to red nodes recorded under it; query and update walk u's O(log N) centroid ancestors.

import heapq
from collections import defaultdict, deque

class ClosestRed:
    def __init__(self, tree):
        self.n = len(tree)
        self.tree = tree
        self.size = [0] * self.n
        self.deleted = [False] * self.n
        self.parent = [-1] * self.n
        # For each node u, distance to each centroid ancestor
        self.dist_to = [dict() for _ in range(self.n)]
        # For each centroid c, heap of (dist, node) for red nodes
        self.heap = defaultdict(list)
        # For each centroid c, current "alive" status of red node counts
        self.is_red = [False] * self.n
        self._decompose(0, -1)

    def _dfs_size(self, root):
        order = []
        stack = [(root, -1)]
        while stack:
            u, p = stack.pop()
            self.size[u] = 1
            order.append((u, p))
            for v in self.tree[u]:
                if v == p or self.deleted[v]: continue
                stack.append((v, u))
        for u, p in reversed(order):
            if p != -1 and not self.deleted[p]:
                self.size[p] += self.size[u]

    def _find_centroid(self, root):
        self._dfs_size(root)
        total = self.size[root]
        u, p = root, -1
        while True:
            heavy = -1
            for v in self.tree[u]:
                if v == p or self.deleted[v]: continue
                if self.size[v] > total // 2: heavy = v; break
            if heavy == -1: return u
            self.size[u] = total - self.size[heavy]
            p, u = u, heavy

    def _bfs_dist(self, c):
        # Distances from centroid c to every alive node in its component
        d = {c: 0}
        q = deque([c])
        while q:
            u = q.popleft()
            for v in self.tree[u]:
                if v in d or self.deleted[v]: continue
                d[v] = d[u] + 1
                q.append(v)
        return d

    def _decompose(self, root, centroid_parent):
        c = self._find_centroid(root)
        self.parent[c] = centroid_parent
        for u, dist in self._bfs_dist(c).items():
            self.dist_to[u][c] = dist
        self.deleted[c] = True
        for v in self.tree[c]:
            if not self.deleted[v]:
                self._decompose(v, c)

    def toggle_red(self, u):
        self.is_red[u] = not self.is_red[u]
        if self.is_red[u]:
            cur = u
            while cur != -1:
                heapq.heappush(self.heap[cur], (self.dist_to[u][cur], u))
                cur = self.parent[cur]
        # (lazy deletion: heap entries are validated at query time)

    def closest_red(self, u):
        best = float('inf')
        cur = u
        while cur != -1:
            # Drop stale heap entries
            while self.heap[cur] and not self.is_red[self.heap[cur][0][1]]:
                heapq.heappop(self.heap[cur])
            if self.heap[cur]:
                d_cu = self.dist_to[u][cur]
                d_centroid = self.heap[cur][0][0]
                best = min(best, d_cu + d_centroid)
            cur = self.parent[cur]
        return best if best < float('inf') else -1

Each toggle walks log N centroid ancestors and pushes log N heap entries. Each query also walks log N ancestors and at most pops a few stale entries amortised. Total per operation: O(log² N) with the lazy-delete trick.

Variants

Path-counting

Count pairs (u, v) whose tree distance equals k. For each centroid c, BFS to get distances from c to all alive nodes. Pair distances d_u + d_v = k contribute one count — implementable with a sorted multiset or a frequency array. Total: O(N log N).

Path-XOR queries

Replace distance with XOR of edge weights on the path. Centroid c sees all paths through itself; count pairs (u, v) with xor[u] ^ xor[v] = target using a hashmap. Same O(N log N).

Dynamic centroid decomposition

Edge insertions/deletions need link-cut trees or top trees — pure centroid decomposition is static. Approximate dynamic variants exist (rebuild every √N updates) but are rarely worth the complexity.

Centroid + segment tree per centroid

For ordered queries (k-th smallest distance), maintain a segment tree of distances at each centroid. Updates and queries become O(log² N).

Common bugs and edge cases

  • Forgetting to mark deleted before recursing. The recursion must mark the centroid as deleted before diving into children, or DFS revisits it and the size DFS becomes nonsense.
  • Subtree-size reuse across components. Re-running dfsSize for each component is mandatory; sizes computed before the deletion are stale for the post-deletion subgraph.
  • Centroid not found because of stale sizes. When walking toward the heaviest child, the walk needs current sizes — flip size[u] = total − size[v] as you descend, or recompute from each subtree root.
  • Stack depth on chain-shaped trees. The size DFS can recurse 10⁵ deep on a linear input — convert to iterative BFS or raise the recursion limit.
  • Distance arrays sized wrong. Each node stores one distance per centroid ancestor — at most log N per node, but the worst case is sometimes N for poor centroid choices. Use a hashmap, not a fixed array.
  • Heap of stale entries growing unbounded. Lazy deletion on heaps requires validating the top entry against current state at every query; otherwise the heap leaks memory and slows queries.

Frequently asked questions

What is a tree centroid?

A centroid is a node whose removal leaves no subtree with more than ⌊N/2⌋ nodes. Every tree has one or two centroids — never more. You find one in O(N) by computing subtree sizes from any root and walking toward the heaviest child until the heaviest subtree drops to ≤ N/2.

Why does centroid decomposition give O(N log N) preprocessing?

Each recursion level processes the full N nodes (computing centroid, walking depths). Each split at least halves the component size, so the recursion is at most log N deep. N work × log N levels = O(N log N) total. The same shape as merge sort or median-of-medians.

How does centroid decomposition speed up tree distance queries?

Every pair of nodes (u, v) shares some centroid c in the centroid tree where the path u — v passes through c. Stored: for each node u, the list of (centroid, distance) pairs along its ancestor chain in the centroid tree — O(log N) entries. To query distance, find each centroid common to u and v, take min of d(u, c) + d(c, v). Total query work O(log N) — sometimes log² with extra updates.

What problems besides distance use centroid decomposition?

Counting paths in a tree whose total weight is divisible by k, finding the closest red node to each node with dynamic colour updates, the K-th smallest distance in a tree, and the path-with-specific-edge-count counting problem. Any problem where 'paths can be decomposed at a balanced midpoint' fits the pattern.

Is the centroid tree the same shape as the original tree?

No. The centroid tree is a balanced recursion structure: depth ≤ log N regardless of how skewed the original tree was. A 10⁵-node path graph (linear chain) has a centroid tree of depth ~17, even though the original tree has depth 10⁵. That depth flattening is the whole point.

How does centroid decomposition differ from heavy-light decomposition?

HLD partitions the tree into paths and uses segment trees for path queries. Centroid decomposition splits the tree at balanced cuts and answers distance/aggregate queries that don't fit a chain shape. HLD is great for path queries; centroid is great for counting pairs and distance metrics.