Data Structures

B-Tree

A tree with hundreds of children per node — built for disk

A B-tree is a balanced search tree where each node holds many keys (often hundreds) and has a high branching factor. It's the data structure behind every relational database index, every NoSQL store, and most file systems. The high branching factor minimizes disk reads — finding a key in a billion records takes 3-4 disk seeks instead of 30 with a binary tree.

  • Branching factor (typical)100-1000 children per node
  • Height for n keyslog_b(n) — log base b where b is branching factor
  • Disk reads for 1B records (b=128)3-4 (vs 30 for binary tree)
  • VariantsB-tree, B+tree (most common), B*-tree
  • Used byPostgreSQL, MySQL InnoDB, SQLite, ext4, NTFS, MongoDB
  • Node sizeOne disk page (4 KB-16 KB typical)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a B-tree works

A B-tree of order m is a search tree with these properties:

  1. Every node holds between ⌈m/2⌉ and m keys (except the root, which can have fewer).
  2. A node with k keys has k+1 children.
  3. Within a node, keys are stored in sorted order. Children are arranged so the i-th child contains all keys between the (i-1)-th and i-th keys.
  4. All leaves are at the same depth.

That last property is the magic — every search takes exactly the same number of steps. The tree is balanced not by rotations (like RB or AVL) but by node splits and merges that maintain uniform depth.

B-tree vs B+tree

B-treeB+tree
Where values liveIn any node (internal or leaf)Only in leaves
Internal nodes holdKeys + valuesKeys only (for navigation)
Leaves linked togetherNoYes — sorted linked list
Range scansO(k log n) for k resultsO(log n + k) — much faster
Used inSome old databases, theoretical CSPostgreSQL, MySQL InnoDB, SQLite, MongoDB, ext4, NTFS
Memory per internal nodeHigher (values too)Lower (just keys)

Modern database B-trees are almost always B+trees. The leaf-linked-list optimization is too valuable for range queries to skip. When someone says "B-tree" in a database context, they usually mean B+tree.

Why B-trees crush binary trees on disk

The math comes down to disk-read cost. Every operation cost is dominated by disk seeks (still ~5-10 ms on spinning disk, ~0.1-1 ms on SSD), not CPU comparisons. So we measure tree performance in disk-reads-per-operation, not in comparisons.

RecordsBinary tree (depth)B-tree, b=128 (depth)B-tree, b=512 (depth)
10K~1422
1M~2033
100M~2743
1B~3044
1T~405-65

For a billion-row index on a hard disk: a binary-tree lookup costs 30 × 10 ms = 300 ms. A B-tree lookup costs 4 × 10 ms = 40 ms. The branching-factor optimization gives an order of magnitude speedup.

On SSDs, the absolute numbers shrink (4-40 ms becomes 0.4-4 ms) but the ratio holds. On in-memory data the equivalent comparison is "how many cache lines did we touch" — same shape, smaller scale.

When to use a B-tree

  • Database indexes. Every relational database (PostgreSQL, MySQL, SQLite, Oracle, SQL Server) and most NoSQL stores (MongoDB) use B+trees as the default index. Built-in, configured by default, optimized for the database's storage layer.
  • File systems. ext4 directories, NTFS, HFS+, btrfs, and ZFS all use B-tree variants for directory and metadata indexing. Looking up a file by name is a B-tree query.
  • Persistent ordered maps. Any data structure that needs to span memory and disk benefits from B-tree's branching factor.
  • In-memory ordered maps with cache awareness. Rust's BTreeMap, modern C++ libraries' B-tree implementations. Branching factor tuned to L1 cache line size (~16 keys per node).

Walking through an insert

Take a B-tree of order 4 (each node holds up to 3 keys). Insert key 30 into:

           [10, 20, 40]
           /   |    |   \
       [5]  [15] [25,30] [50]

Wait — there's already a 30 in the tree. Inserting another 30 would target the [25, 30] node. After insert it becomes [25, 30, 30] which is full but legal (3 keys in an order-4 node).

Now insert 35. It targets the same node, which is now full. Splitting:

Before split:  [25, 30, 30, 35]   (4 keys — over capacity)
After split:   [25, 30] | [30, 35]   (median 30 promoted)

The median (30) goes up to the parent. The parent becomes [10, 20, 30, 40] — still fits in an order-4 node:

           [10, 20, 30, 40]
           /  |   |   |   \
        [5] [15] [25,30] [30,35] [50]

If the parent had been full too, the split would propagate. In the worst case it propagates to the root, splitting the root and increasing the tree's height by 1. That's the only way the tree grows in depth.

Real-world numbers — PostgreSQL B+tree

  • Default page size: 8 KB.
  • Internal nodes hold ~370 entries; leaves hold similar (varies with tuple size).
  • 3 levels covers ~50M rows; 4 levels covers ~18B rows. Most databases never exceed 4 levels for indexes.
  • A point lookup costs 4 disk reads in the worst case (1 per level). With buffer cache, the upper levels are almost always in memory — only the leaf is a real disk read.
  • Range scans cost 4 disk reads to find the start, then sequential reads of leaf pages until the end of the range — much faster than B-tree without leaf links.

B-tree variants

VariantDistinguishing featureUsed in
B-treeOriginal. Values in any node.Theoretical CS, some old DBs
B+treeValues only in leaves; leaves linked.Almost all modern DBs and FSs
B*-treeNodes 2/3 full instead of 1/2 — denserSome specialized DBs
Bε-treeLeaves hold a write buffer; defers writes for batchingTokuDB, BetrFS
Bw-treeLock-free, log-structuredMicrosoft Hekaton
Copy-on-Write B-treeUpdates produce new nodes; old tree visible to existing readersbtrfs, ZFS

JavaScript: skeleton B-tree (educational, not production)

class BTreeNode {
  constructor(t) {  // t = minimum degree (each node holds t-1 to 2t-1 keys)
    this.t = t;
    this.keys = [];
    this.children = [];  // empty for leaves
    this.leaf = true;
  }

  isFull() { return this.keys.length === 2 * this.t - 1; }
}

class BTree {
  constructor(t = 32) { this.t = t; this.root = new BTreeNode(t); }

  search(key, node = this.root) {
    let i = 0;
    while (i < node.keys.length && key > node.keys[i]) i++;
    if (i < node.keys.length && key === node.keys[i]) return node;
    if (node.leaf) return null;
    return this.search(key, node.children[i]);
  }

  insert(key) {
    if (this.root.isFull()) {
      const newRoot = new BTreeNode(this.t);
      newRoot.leaf = false;
      newRoot.children.push(this.root);
      this._splitChild(newRoot, 0);
      this.root = newRoot;
    }
    this._insertNonFull(this.root, key);
  }

  _splitChild(parent, i) { /* ~20 lines — split child[i] around median */ }
  _insertNonFull(node, key) { /* ~15 lines — descend or insert in leaf */ }
}

Real B-tree implementations are several hundred lines because of disk-page management, write-ahead logging, concurrent access, and crash recovery. Use a database or storage engine instead of writing your own.

Common B-tree pitfalls

  • Forgetting to keep leaves at the same depth. The defining property. If your splits or merges break it, the tree silently degrades to an unbalanced search tree with O(n) worst case.
  • Splitting the wrong node. The standard algorithm splits a node BEFORE descending into it (proactively, on the way down), so the recursion never has to back up. Reactive splits (after the descent) require returning info up the recursion and are more error-prone.
  • Not handling deletion's complex cases. Delete in a B-tree has multiple cases (key in leaf vs internal, sibling can lend a key, must merge with sibling, etc.). Six cases is normal. Skipping cases produces tree corruption that's invisible until a later query fails.
  • Using too small a branching factor for disk. A "B-tree" with branching factor 4 is just an unbalanced quaternary tree on disk — every level still costs a seek. Pick branching factor based on disk-page-size and key-size; aim for branching ≥ 100 for spinning disk, ≥ 16 for SSD or RAM.
  • Not using a memory-mapped or buffered I/O layer. Naive disk-per-node reads are an order of magnitude slower than batched I/O. Real B-trees have a page cache, write-ahead log, and dirty-page flush policy.

Frequently asked questions

What's the difference between a B-tree and a B+tree?

In a B-tree, keys and values can be stored in any node. In a B+tree, internal nodes hold only keys (for navigation); all values live in the leaf nodes, which are themselves linked in a sorted linked list. B+trees dominate database use because the leaf-list lets you do range scans efficiently — start at the leaf containing key X, walk forward through linked leaves until key &gt; Y. PostgreSQL, MySQL, and most others use B+trees.

Why are B-trees better than binary trees for disk storage?

Each node read costs one disk seek (~10 ms) regardless of node size. A binary tree of a billion keys is ~30 levels deep — 30 seeks per lookup. A B-tree with branching factor 128 is ~4 levels deep — 4 seeks per lookup. Same algorithm, 7-8× faster on real disks. Modern SSDs have lower seek penalty (~0.1 ms) but the math still favors B-trees.

What's the branching factor in practice?

Driven by node size, which is typically one disk page (4 KB on Linux ext4, 16 KB on InnoDB). With 32-byte keys, a 16 KB page fits ~500 keys. So branching factor is ~500. Each level multiplies n by 500 — three levels covers 125M records, four levels covers 60B.

How does a B-tree split on insert?

When inserting into a full node, split it into two nodes around the median key. The median moves up to the parent (potentially causing further splits up the tree). The split keeps the tree balanced — every leaf stays at the same depth. In the worst case, the split propagates all the way to the root, increasing the tree height by 1.

B-tree vs hash index — which to use for a database?

B-tree for range queries (BETWEEN, ORDER BY, &gt;, &lt;) and prefix matches (LIKE 'foo%'). Hash for exact-match lookups only. B-trees are the default — most queries benefit from range support even when "exact match" is the explicit operation. Hash indexes save 10-30% on equality lookups in exchange for losing every other capability.

How do databases handle very-high-cardinality columns?

Same B-tree, just deeper. A column with 100M unique values across 1B rows still has ~5-6 levels with branching factor 128. The B-tree handles cardinality automatically — there's no "this index is too granular" failure mode. Performance degrades only when working set exceeds RAM and cold cache reads start dominating.

Why aren't B-trees used in pure in-memory data structures?

They are sometimes — Rust's BTreeMap, Java's older NavigableMap implementations. The branching factor that wins on disk (cache-line-sized nodes) also wins on modern CPUs (cache-line is 64 bytes, so branching ~16). In-memory B-tree implementations typically use ~64-byte nodes for L1 cache fit. Performance is competitive with red-black trees on modern hardware due to cache locality.