Data Structures
B+ Tree
The shape of every database index since 1972
A B+ tree is a balanced multi-way tree that stores all data in linked leaf nodes and uses internal nodes only as a search index. It's the data structure behind nearly every relational database index — fast point lookups and even faster range scans.
- Search / Insert / DeleteO(logb n)
- Range scan k itemsO(logb n + k)
- Typical fanout (b)100–500
- Tree height for 1B keys3–4 levels
- Min leaf occupancy≈ 50%
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a B+ tree works
A B+ tree is a generalisation of the binary search tree where each node has many children instead of two. The shape is shallow and wide — a 4-level tree with fanout 250 indexes about 2504 ≈ 4 billion keys. Reading any key from disk costs at most 4 page reads.
Two structural rules separate B+ from B-tree:
- Internal nodes are pure index. They hold separator keys and child pointers. No row data lives in them.
- Leaves hold all the data, and they're linked. Each leaf has a pointer to its right sibling. A range scan descends once, then walks leaves left-to-right.
[50 | 90] ← internal: separators only
/ | \
[10 30] [60 70 80] [95 99] ← leaves: actual rows
↔----↔ ↔-------↔ ↔----↔ ← linked siblings
To find a key, walk down comparing against separators. To range-scan from key 25 to 95, descend to the leaf containing 25, then follow sibling pointers until you pass 95. No backtracking up the tree — sequential disk reads, prefetcher-friendly, the SQL BETWEEN operator's best friend.
Why this shape, on disk
The cost model on disk is dominated by page reads, not comparisons. A typical 4 KB or 8 KB page holds hundreds of (key, pointer) entries — call that the fanout b. The number of page reads to find a key is logb n, which is tiny when b is in the hundreds:
| Keys | Tree levels (b = 250) |
|---|---|
| 1,000 | 2 |
| 62,500 | 3 |
| 15,600,000 | 4 |
| 4,000,000,000 | 5 |
The root is almost always cached in RAM, so a 4-level tree often costs just 3 disk reads — and the bottom one is the only one that touches actual data. That's why "the index fits in memory but the data doesn't" is the normal operating regime.
When B+ trees shine
- Range queries — between, less-than, greater-than, ORDER BY with LIMIT.
- Mixed read/write loads — writes amplify only ~logb n pages, but the tree stays balanced.
- Disk- and SSD-backed storage where page-aligned IO matters more than comparison count.
- Prefix scans — composite indexes naturally answer "all rows where (a, b, c) starts with (5, 7)".
If your workload is purely random point inserts at write-heavy throughput (Cassandra, time-series ingest), an LSM tree often beats a B+ tree. If you need durability without locking, copy-on-write B+ trees (LMDB) skip the write-ahead log entirely.
B+ tree vs B-tree vs LSM tree
| B+ tree | B-tree | LSM tree | AVL / RB tree | Hash index | Skip list | |
|---|---|---|---|---|---|---|
| Point lookup | O(logb n) ~3-4 IOs | O(logb n) similar | O(L) levels checked | O(log n) in-memory | O(1) avg | O(log n) probabilistic |
| Range scan | Excellent (linked leaves) | Slower (in-order traversal) | Merge across L levels | In-order traversal | Not supported | Excellent |
| Insert throughput | Moderate (in-place writes) | Moderate | High (sequential writes only) | O(log n) in-memory | O(1) avg | O(log n) probabilistic |
| Write amplification | 1× to 3× | 1× to 3× | 10× to 30× | N/A (no disk) | 1× | 1× |
| Read amplification | 1× (one leaf read) | 1× per read | L× (check each level) | N/A | 1× | O(log n) hops |
| Space amplification | ~1.5× (50% min fill) | ~1.5× | 2×–10× (uncompacted) | 1× + pointers | 1× + load factor | 1× + ~2 pointers/node |
| Where used | InnoDB, Postgres, SQLite, LMDB | Filesystems (NTFS, HFS+) | RocksDB, Cassandra, LevelDB | std::map, Java TreeMap | Memcached, Redis hash | Redis sorted set, RocksDB memtable |
The headline numbers aren't asymptotic — they're amplification ratios. A B+ tree's small read amplification is why DBAs reach for it first; an LSM's small write amplification per request (despite high background amplification) is why ingest-heavy systems pick it instead.
What the numbers actually look like
- 4 levels indexes 256M keys at fanout 250, assuming nodes are 80% full: 250⁴ × 0.8⁴ ≈ 1.6B in theory, ~256M observed in practice once you account for partial pages.
- Minimum occupancy ≈ 50%. A leaf with capacity 250 must hold at least 125 keys before it can release a sibling on delete. Worst-case space waste: 2× the data size.
- Write amplification of 1× to 3×. A single row insert dirties one leaf page; a leaf split dirties two leaves and one parent. Compare with LSM trees at 10× to 30×.
- Range scan reads exactly ⌈k / leaf_capacity⌉ + logb n pages for k results. With capacity 250 and 1M results, that's ~4000 page reads vs ~1M comparisons in a binary tree.
JavaScript implementation (in-memory B+ tree)
// In-memory B+ tree, fanout = ORDER. Real DBs page nodes; this version stays simple.
const ORDER = 4;
class Leaf {
constructor() { this.keys = []; this.vals = []; this.next = null; this.isLeaf = true; }
}
class Internal {
constructor() { this.keys = []; this.children = []; this.isLeaf = false; }
}
class BPlusTree {
constructor() { this.root = new Leaf(); }
// Returns leaf containing key (or where key would be inserted).
_findLeaf(key) {
let n = this.root;
while (!n.isLeaf) {
let i = 0;
while (i < n.keys.length && key >= n.keys[i]) i++;
n = n.children[i];
}
return n;
}
get(key) {
const leaf = this._findLeaf(key);
const idx = leaf.keys.indexOf(key);
return idx === -1 ? undefined : leaf.vals[idx];
}
insert(key, val) {
const leaf = this._findLeaf(key);
const idx = leaf.keys.findIndex(k => k >= key);
const at = idx === -1 ? leaf.keys.length : idx;
if (leaf.keys[at] === key) { leaf.vals[at] = val; return; } // upsert
leaf.keys.splice(at, 0, key);
leaf.vals.splice(at, 0, val);
if (leaf.keys.length < ORDER) return;
this._splitLeaf(leaf);
}
_splitLeaf(leaf) {
const mid = Math.ceil(leaf.keys.length / 2);
const sibling = new Leaf();
sibling.keys = leaf.keys.splice(mid);
sibling.vals = leaf.vals.splice(mid);
sibling.next = leaf.next;
leaf.next = sibling;
this._insertIntoParent(leaf, sibling.keys[0], sibling);
}
_insertIntoParent(left, sepKey, right) {
if (left === this.root) {
const r = new Internal();
r.keys = [sepKey];
r.children = [left, right];
this.root = r;
return;
}
// (For brevity: production code keeps a parent pointer or stack
// built during _findLeaf to walk the split up the tree.)
}
// Range scan from start (inclusive) to end (exclusive)
*range(start, end) {
let n = this._findLeaf(start);
while (n) {
for (let i = 0; i < n.keys.length; i++) {
if (n.keys[i] >= end) return;
if (n.keys[i] >= start) yield [n.keys[i], n.vals[i]];
}
n = n.next;
}
}
}
Two real-world implementation details get glossed over in textbooks. First, internal-node splits move the median key up (it doesn't appear in either child); leaf-node splits copy the median up so it still appears in the right child as a real key. Second, production B+ trees keep a parent pointer or build a path stack during search — it's how splits propagate without a second descent.
Python implementation (range scan + split)
ORDER = 4
class Leaf:
__slots__ = ('keys', 'vals', 'next')
def __init__(self):
self.keys, self.vals, self.next = [], [], None
is_leaf = True
class Internal:
__slots__ = ('keys', 'children')
def __init__(self):
self.keys, self.children = [], []
is_leaf = False
class BPlusTree:
def __init__(self):
self.root = Leaf()
def _find_leaf_with_path(self, key):
path, n = [], self.root
while not n.is_leaf:
i = 0
while i < len(n.keys) and key >= n.keys[i]: i += 1
path.append((n, i))
n = n.children[i]
return n, path
def get(self, key):
leaf, _ = self._find_leaf_with_path(key)
for i, k in enumerate(leaf.keys):
if k == key: return leaf.vals[i]
return None
def insert(self, key, val):
leaf, path = self._find_leaf_with_path(key)
i = 0
while i < len(leaf.keys) and leaf.keys[i] < key: i += 1
if i < len(leaf.keys) and leaf.keys[i] == key:
leaf.vals[i] = val
return
leaf.keys.insert(i, key)
leaf.vals.insert(i, val)
if len(leaf.keys) >= ORDER:
self._split_leaf(leaf, path)
def _split_leaf(self, leaf, path):
mid = len(leaf.keys) // 2
sib = Leaf()
sib.keys, leaf.keys = leaf.keys[mid:], leaf.keys[:mid]
sib.vals, leaf.vals = leaf.vals[mid:], leaf.vals[:mid]
sib.next, leaf.next = leaf.next, sib
sep = sib.keys[0]
# Walk up the path, splitting internal nodes as needed
right = sib
while path:
parent, idx = path.pop()
parent.keys.insert(idx, sep)
parent.children.insert(idx + 1, right)
if len(parent.keys) < ORDER: return
mid = len(parent.keys) // 2
sep = parent.keys[mid]
new = Internal()
new.keys = parent.keys[mid + 1:]
new.children = parent.children[mid + 1:]
parent.keys = parent.keys[:mid]
parent.children = parent.children[:mid + 1]
right = new
# Bumped past the root: grow the tree taller
new_root = Internal()
new_root.keys = [sep]
new_root.children = [self.root, right]
self.root = new_root
def range_scan(self, start, end):
leaf, _ = self._find_leaf_with_path(start)
while leaf is not None:
for k, v in zip(leaf.keys, leaf.vals):
if k >= end: return
if k >= start: yield k, v
leaf = leaf.next
The split has three distinct cases: leaf overflow (copy median up), internal overflow (move median up), and root overflow (allocate a new root). Most textbook B+ tree bugs come from forgetting that copy-vs-move distinction.
Variants used in production
Copy-on-write B+ tree (LMDB, BoltDB). Instead of in-place writes, every modified page is rewritten as a new page; the root is swapped atomically. No write-ahead log, no torn writes, snapshot reads for free. The trade-off is write amplification — a single change rewrites every page on the path to the root.
Bw-tree (Microsoft Hekaton, CMU Peloton). Lock-free B+ tree variant that appends "delta records" to nodes instead of mutating them. A periodic consolidation collapses deltas back into the base node. Optimised for many-core CPUs.
Adaptive Radix Tree (ART). Not strictly a B+ tree, but a common in-memory replacement. ART uses 4/16/48/256-way nodes that grow as keys arrive, plus path compression. Used by HyPer, DuckDB, and TigerBeetle as the in-memory index of choice.
Fractal tree (Tokutek, MyRocks's predecessor). Adds a per-node message buffer; writes are queued and flushed lazily on read. Trades read amplification for dramatically better write throughput.
Compressed B+ tree. Many real systems prefix-compress keys within a leaf and run-length-encode duplicates. PostgreSQL's btree_gin, MySQL's compressed pages, and LMDB's prefix compression are examples.
FB-tree / OLFIT-style concurrent B+ trees. Use optimistic concurrency control (versioning + retry) instead of latch coupling. The default in modern in-memory engines.
Common bugs and edge cases
- Forgetting to update the sibling pointer. A leaf split that doesn't link the new sibling makes range scans skip half the data, silently.
- Confusing copy-up (leaves) with move-up (internals). If you remove the median from the leaf during a split, point lookups for that key fail.
- Underflow merging that forgets to fix the parent separator. Delete is harder than insert; missing the separator update creates phantom keys.
- Treating equal keys ambiguously. Production B+ trees use unique tie-breakers (row IDs) to disambiguate duplicates and keep range scans deterministic.
- Not handling root collapse. When the root has exactly one child after a delete, you must replace the root with that child or the tree grows artificially tall.
- Concurrent splits without latch coupling. Two writers can split the same node simultaneously and lose one update — production engines use optimistic concurrency or hand-over-hand latching.
- Page-aligned writes that aren't page-aligned. A node that crosses a 4 KB boundary defeats the cache benefit of the whole structure.
Frequently asked questions
What's the difference between a B-tree and a B+ tree?
In a B-tree, every node stores both keys and the associated values. In a B+ tree, internal nodes are pure search index — only the leaves hold values. The leaves are also linked left-to-right, which makes range scans dramatically faster.
Why do databases use B+ trees instead of hash tables?
Hash tables can't answer range queries (WHERE age BETWEEN 30 AND 40), and they don't keep data sorted on disk for sequential reads. B+ trees give O(log n) point lookups and O(k) range scans on linked leaves — both essential for SQL workloads.
What does fanout mean and why does it matter?
Fanout is how many children each internal node has. With 4 KB pages and 16-byte keys, fanout sits around 250. A 4-level tree fits 250⁴ ≈ 4 billion keys; in practice 256 million entries fit in 4 levels because nodes don't always fill completely.
How does insertion split a leaf?
When a leaf overflows past its capacity, it splits in half: the median key is copied (not moved) up to the parent as a separator, and the leaves are linked through the sibling pointer. If the parent overflows, it splits too — the median there is moved up, since internal nodes don't hold values.
Are B+ trees still used in modern systems?
Yes — InnoDB (MySQL), PostgreSQL's btree access method, SQLite, SQL Server, Oracle, LMDB, and BoltDB all use B+ trees. The challengers are LSM trees (RocksDB, Cassandra) for write-heavy workloads and log-structured indexes for SSDs.
Why are leaves linked into a list?
So a range scan can walk from one leaf to the next without climbing back up to the parent. After locating the start of the range with O(log n) descent, scanning k consecutive entries is O(k) — basically a linked-list traversal across leaves.