Data Structures

B+ Tree

The shape of every database index since 1972

A B+ tree is a balanced multi-way tree that stores all data in linked leaf nodes and uses internal nodes only as a search index. It's the data structure behind nearly every relational database index — fast point lookups and even faster range scans.

  • Search / Insert / DeleteO(logb n)
  • Range scan k itemsO(logb n + k)
  • Typical fanout (b)100–500
  • Tree height for 1B keys3–4 levels
  • Min leaf occupancy≈ 50%

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a B+ tree works

A B+ tree is a generalisation of the binary search tree where each node has many children instead of two. The shape is shallow and wide — a 4-level tree with fanout 250 indexes about 2504 ≈ 4 billion keys. Reading any key from disk costs at most 4 page reads.

Two structural rules separate B+ from B-tree:

  1. Internal nodes are pure index. They hold separator keys and child pointers. No row data lives in them.
  2. Leaves hold all the data, and they're linked. Each leaf has a pointer to its right sibling. A range scan descends once, then walks leaves left-to-right.
            [50 | 90]                ← internal: separators only
            /    |    \
   [10 30] [60 70 80] [95 99]        ← leaves: actual rows
   ↔----↔  ↔-------↔  ↔----↔         ← linked siblings

To find a key, walk down comparing against separators. To range-scan from key 25 to 95, descend to the leaf containing 25, then follow sibling pointers until you pass 95. No backtracking up the tree — sequential disk reads, prefetcher-friendly, the SQL BETWEEN operator's best friend.

Why this shape, on disk

The cost model on disk is dominated by page reads, not comparisons. A typical 4 KB or 8 KB page holds hundreds of (key, pointer) entries — call that the fanout b. The number of page reads to find a key is logb n, which is tiny when b is in the hundreds:

KeysTree levels (b = 250)
1,0002
62,5003
15,600,0004
4,000,000,0005

The root is almost always cached in RAM, so a 4-level tree often costs just 3 disk reads — and the bottom one is the only one that touches actual data. That's why "the index fits in memory but the data doesn't" is the normal operating regime.

When B+ trees shine

  • Range queries — between, less-than, greater-than, ORDER BY with LIMIT.
  • Mixed read/write loads — writes amplify only ~logb n pages, but the tree stays balanced.
  • Disk- and SSD-backed storage where page-aligned IO matters more than comparison count.
  • Prefix scans — composite indexes naturally answer "all rows where (a, b, c) starts with (5, 7)".

If your workload is purely random point inserts at write-heavy throughput (Cassandra, time-series ingest), an LSM tree often beats a B+ tree. If you need durability without locking, copy-on-write B+ trees (LMDB) skip the write-ahead log entirely.

B+ tree vs B-tree vs LSM tree

B+ treeB-treeLSM treeAVL / RB treeHash indexSkip list
Point lookupO(logb n) ~3-4 IOsO(logb n) similarO(L) levels checkedO(log n) in-memoryO(1) avgO(log n) probabilistic
Range scanExcellent (linked leaves)Slower (in-order traversal)Merge across L levelsIn-order traversalNot supportedExcellent
Insert throughputModerate (in-place writes)ModerateHigh (sequential writes only)O(log n) in-memoryO(1) avgO(log n) probabilistic
Write amplification1× to 3×1× to 3×10× to 30×N/A (no disk)
Read amplification1× (one leaf read)1× per readL× (check each level)N/AO(log n) hops
Space amplification~1.5× (50% min fill)~1.5×2×–10× (uncompacted)1× + pointers1× + load factor1× + ~2 pointers/node
Where usedInnoDB, Postgres, SQLite, LMDBFilesystems (NTFS, HFS+)RocksDB, Cassandra, LevelDBstd::map, Java TreeMapMemcached, Redis hashRedis sorted set, RocksDB memtable

The headline numbers aren't asymptotic — they're amplification ratios. A B+ tree's small read amplification is why DBAs reach for it first; an LSM's small write amplification per request (despite high background amplification) is why ingest-heavy systems pick it instead.

What the numbers actually look like

  • 4 levels indexes 256M keys at fanout 250, assuming nodes are 80% full: 250⁴ × 0.8⁴ ≈ 1.6B in theory, ~256M observed in practice once you account for partial pages.
  • Minimum occupancy ≈ 50%. A leaf with capacity 250 must hold at least 125 keys before it can release a sibling on delete. Worst-case space waste: 2× the data size.
  • Write amplification of 1× to 3×. A single row insert dirties one leaf page; a leaf split dirties two leaves and one parent. Compare with LSM trees at 10× to 30×.
  • Range scan reads exactly ⌈k / leaf_capacity⌉ + logb n pages for k results. With capacity 250 and 1M results, that's ~4000 page reads vs ~1M comparisons in a binary tree.

JavaScript implementation (in-memory B+ tree)

// In-memory B+ tree, fanout = ORDER. Real DBs page nodes; this version stays simple.
const ORDER = 4;

class Leaf {
  constructor() { this.keys = []; this.vals = []; this.next = null; this.isLeaf = true; }
}

class Internal {
  constructor() { this.keys = []; this.children = []; this.isLeaf = false; }
}

class BPlusTree {
  constructor() { this.root = new Leaf(); }

  // Returns leaf containing key (or where key would be inserted).
  _findLeaf(key) {
    let n = this.root;
    while (!n.isLeaf) {
      let i = 0;
      while (i < n.keys.length && key >= n.keys[i]) i++;
      n = n.children[i];
    }
    return n;
  }

  get(key) {
    const leaf = this._findLeaf(key);
    const idx  = leaf.keys.indexOf(key);
    return idx === -1 ? undefined : leaf.vals[idx];
  }

  insert(key, val) {
    const leaf = this._findLeaf(key);
    const idx  = leaf.keys.findIndex(k => k >= key);
    const at   = idx === -1 ? leaf.keys.length : idx;
    if (leaf.keys[at] === key) { leaf.vals[at] = val; return; }   // upsert
    leaf.keys.splice(at, 0, key);
    leaf.vals.splice(at, 0, val);
    if (leaf.keys.length < ORDER) return;
    this._splitLeaf(leaf);
  }

  _splitLeaf(leaf) {
    const mid     = Math.ceil(leaf.keys.length / 2);
    const sibling = new Leaf();
    sibling.keys  = leaf.keys.splice(mid);
    sibling.vals  = leaf.vals.splice(mid);
    sibling.next  = leaf.next;
    leaf.next     = sibling;
    this._insertIntoParent(leaf, sibling.keys[0], sibling);
  }

  _insertIntoParent(left, sepKey, right) {
    if (left === this.root) {
      const r = new Internal();
      r.keys = [sepKey];
      r.children = [left, right];
      this.root = r;
      return;
    }
    // (For brevity: production code keeps a parent pointer or stack
    // built during _findLeaf to walk the split up the tree.)
  }

  // Range scan from start (inclusive) to end (exclusive)
  *range(start, end) {
    let n = this._findLeaf(start);
    while (n) {
      for (let i = 0; i < n.keys.length; i++) {
        if (n.keys[i] >= end) return;
        if (n.keys[i] >= start) yield [n.keys[i], n.vals[i]];
      }
      n = n.next;
    }
  }
}

Two real-world implementation details get glossed over in textbooks. First, internal-node splits move the median key up (it doesn't appear in either child); leaf-node splits copy the median up so it still appears in the right child as a real key. Second, production B+ trees keep a parent pointer or build a path stack during search — it's how splits propagate without a second descent.

Python implementation (range scan + split)

ORDER = 4

class Leaf:
    __slots__ = ('keys', 'vals', 'next')
    def __init__(self):
        self.keys, self.vals, self.next = [], [], None
    is_leaf = True

class Internal:
    __slots__ = ('keys', 'children')
    def __init__(self):
        self.keys, self.children = [], []
    is_leaf = False

class BPlusTree:
    def __init__(self):
        self.root = Leaf()

    def _find_leaf_with_path(self, key):
        path, n = [], self.root
        while not n.is_leaf:
            i = 0
            while i < len(n.keys) and key >= n.keys[i]: i += 1
            path.append((n, i))
            n = n.children[i]
        return n, path

    def get(self, key):
        leaf, _ = self._find_leaf_with_path(key)
        for i, k in enumerate(leaf.keys):
            if k == key: return leaf.vals[i]
        return None

    def insert(self, key, val):
        leaf, path = self._find_leaf_with_path(key)
        i = 0
        while i < len(leaf.keys) and leaf.keys[i] < key: i += 1
        if i < len(leaf.keys) and leaf.keys[i] == key:
            leaf.vals[i] = val
            return
        leaf.keys.insert(i, key)
        leaf.vals.insert(i, val)
        if len(leaf.keys) >= ORDER:
            self._split_leaf(leaf, path)

    def _split_leaf(self, leaf, path):
        mid = len(leaf.keys) // 2
        sib = Leaf()
        sib.keys, leaf.keys = leaf.keys[mid:], leaf.keys[:mid]
        sib.vals, leaf.vals = leaf.vals[mid:], leaf.vals[:mid]
        sib.next, leaf.next = leaf.next, sib
        sep = sib.keys[0]
        # Walk up the path, splitting internal nodes as needed
        right = sib
        while path:
            parent, idx = path.pop()
            parent.keys.insert(idx, sep)
            parent.children.insert(idx + 1, right)
            if len(parent.keys) < ORDER: return
            mid = len(parent.keys) // 2
            sep = parent.keys[mid]
            new = Internal()
            new.keys = parent.keys[mid + 1:]
            new.children = parent.children[mid + 1:]
            parent.keys = parent.keys[:mid]
            parent.children = parent.children[:mid + 1]
            right = new
        # Bumped past the root: grow the tree taller
        new_root = Internal()
        new_root.keys = [sep]
        new_root.children = [self.root, right]
        self.root = new_root

    def range_scan(self, start, end):
        leaf, _ = self._find_leaf_with_path(start)
        while leaf is not None:
            for k, v in zip(leaf.keys, leaf.vals):
                if k >= end: return
                if k >= start: yield k, v
            leaf = leaf.next

The split has three distinct cases: leaf overflow (copy median up), internal overflow (move median up), and root overflow (allocate a new root). Most textbook B+ tree bugs come from forgetting that copy-vs-move distinction.

Variants used in production

Copy-on-write B+ tree (LMDB, BoltDB). Instead of in-place writes, every modified page is rewritten as a new page; the root is swapped atomically. No write-ahead log, no torn writes, snapshot reads for free. The trade-off is write amplification — a single change rewrites every page on the path to the root.

Bw-tree (Microsoft Hekaton, CMU Peloton). Lock-free B+ tree variant that appends "delta records" to nodes instead of mutating them. A periodic consolidation collapses deltas back into the base node. Optimised for many-core CPUs.

Adaptive Radix Tree (ART). Not strictly a B+ tree, but a common in-memory replacement. ART uses 4/16/48/256-way nodes that grow as keys arrive, plus path compression. Used by HyPer, DuckDB, and TigerBeetle as the in-memory index of choice.

Fractal tree (Tokutek, MyRocks's predecessor). Adds a per-node message buffer; writes are queued and flushed lazily on read. Trades read amplification for dramatically better write throughput.

Compressed B+ tree. Many real systems prefix-compress keys within a leaf and run-length-encode duplicates. PostgreSQL's btree_gin, MySQL's compressed pages, and LMDB's prefix compression are examples.

FB-tree / OLFIT-style concurrent B+ trees. Use optimistic concurrency control (versioning + retry) instead of latch coupling. The default in modern in-memory engines.

Common bugs and edge cases

  • Forgetting to update the sibling pointer. A leaf split that doesn't link the new sibling makes range scans skip half the data, silently.
  • Confusing copy-up (leaves) with move-up (internals). If you remove the median from the leaf during a split, point lookups for that key fail.
  • Underflow merging that forgets to fix the parent separator. Delete is harder than insert; missing the separator update creates phantom keys.
  • Treating equal keys ambiguously. Production B+ trees use unique tie-breakers (row IDs) to disambiguate duplicates and keep range scans deterministic.
  • Not handling root collapse. When the root has exactly one child after a delete, you must replace the root with that child or the tree grows artificially tall.
  • Concurrent splits without latch coupling. Two writers can split the same node simultaneously and lose one update — production engines use optimistic concurrency or hand-over-hand latching.
  • Page-aligned writes that aren't page-aligned. A node that crosses a 4 KB boundary defeats the cache benefit of the whole structure.

Frequently asked questions

What's the difference between a B-tree and a B+ tree?

In a B-tree, every node stores both keys and the associated values. In a B+ tree, internal nodes are pure search index — only the leaves hold values. The leaves are also linked left-to-right, which makes range scans dramatically faster.

Why do databases use B+ trees instead of hash tables?

Hash tables can't answer range queries (WHERE age BETWEEN 30 AND 40), and they don't keep data sorted on disk for sequential reads. B+ trees give O(log n) point lookups and O(k) range scans on linked leaves — both essential for SQL workloads.

What does fanout mean and why does it matter?

Fanout is how many children each internal node has. With 4 KB pages and 16-byte keys, fanout sits around 250. A 4-level tree fits 250⁴ ≈ 4 billion keys; in practice 256 million entries fit in 4 levels because nodes don't always fill completely.

How does insertion split a leaf?

When a leaf overflows past its capacity, it splits in half: the median key is copied (not moved) up to the parent as a separator, and the leaves are linked through the sibling pointer. If the parent overflows, it splits too — the median there is moved up, since internal nodes don't hold values.

Are B+ trees still used in modern systems?

Yes — InnoDB (MySQL), PostgreSQL's btree access method, SQLite, SQL Server, Oracle, LMDB, and BoltDB all use B+ trees. The challengers are LSM trees (RocksDB, Cassandra) for write-heavy workloads and log-structured indexes for SSDs.

Why are leaves linked into a list?

So a range scan can walk from one leaf to the next without climbing back up to the parent. After locating the start of the range with O(log n) descent, scanning k consecutive entries is O(k) — basically a linked-list traversal across leaves.