Data Structures

Log-Structured Merge Tree

Trade read effort for sequential writes — the engine of modern key-value stores

An LSM tree buffers writes in memory, flushes them as immutable sorted files on disk, and merges those files in the background. It's how RocksDB, Cassandra, and HBase ingest millions of writes per second on commodity SSDs.

  • WriteO(1) amortized to memtable
  • Read (worst)O(L) levels checked
  • Write amplification10× to 30×
  • Space amplification1.1× to 2×
  • Disk patternSequential only

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How an LSM tree works

Disks (and even SSDs) hate random small writes. A B+ tree update reads a page, modifies it, and writes it back — that's 8 KB of physical work for a 100-byte change. Multiply by millions of writes per second and you blow past disk bandwidth.

The LSM trick: never modify in place. Writes go to a small in-memory structure called the memtable, plus an append-only write-ahead log for durability. Reads check the memtable first, then fall back to disk. When the memtable fills up, it's flushed to disk as a single immutable sorted file — an SSTable (Sorted String Table). Background workers periodically merge SSTables to keep the count manageable.

Write path:
  put(k, v) → WAL append → memtable insert (skip list)
                                    │
                                    ▼ when full
                             flush as SSTable to L0

L0  [a–z][a–z][a–z]   ← may overlap; recently flushed
L1  [a–c][d–f][g–i]…  ← non-overlapping ranges (in leveled compaction)
L2  10× larger, also non-overlapping
L3  10× larger again
…

Reads check sources in order: memtable → L0 SSTables (newest first) → L1, L2, L3. The first version found wins. Bloom filters and per-SSTable index blocks make most checks O(1) microseconds.

The three moving parts

Memtable. An in-memory ordered structure — usually a skip list (RocksDB, Cassandra) or a balanced tree. Holds the most recent ~10 MB of writes. Cheap to insert and to scan in order; flushed when full.

Write-ahead log (WAL). A simple append-only file. The memtable is volatile, so every write is also logged here for crash recovery. After a crash, replay the WAL into a fresh memtable.

SSTable. An immutable, sorted-on-disk file: a sequence of (key, value) entries, plus an index block (sparse index of every Nth key) and a Bloom filter. Once written, an SSTable is never modified — only superseded by compaction.

When LSM trees win

  • Write-heavy workloads — ingest pipelines, time-series, IoT, logs.
  • Append-mostly workloads — event stores, audit logs, append-only ledgers.
  • SSDs and disk-tiered storage — sequential writes are far cheaper than random.
  • Workloads tolerant of read amplification — Bloom filters cover most lookups, but not all.

If your workload is read-heavy with strict tail-latency SLOs, B+ trees often still win — every LSM read can hit several files in the worst case, and tail latency suffers during compaction storms.

LSM vs B-tree write/read amplification

LSM (leveled)LSM (tiered)B+ treeB-treeAppend-only logIn-memory hash
Write throughputVery highHighestModerateModerateHighestHighest
Write amplification10× to 30×4× to 10×1× to 3×1× to 3×
Read amplification1× (Bloom + index)L× worst casen (full scan)
Space amplification1.1× to 1.4×2× to 10×1.5× (50% min fill)1.5×n× (no GC)≤2× (load factor)
Range scanMerge-iterateMerge-iterateExcellent (linked leaves)In-order traversalLinear scanNot supported
Background CPUHigh (compaction)ModerateNoneNoneNoneNone
Used byRocksDB, LevelDBCassandra, ScyllaDBInnoDB, PostgresFilesystemsKafka log segmentsMemcached, Redis

The amplification numbers are what most architects miss. An LSM has high internal write amplification — each byte you write may be rewritten 10–30 times during compaction — but the disk only sees sequential writes, which it loves. A B+ tree has 1× write amplification but every write is a random IO, which spinning disks and even SSDs handle poorly under load.

What the numbers actually look like

  • Write amplification ≈ 10× for leveled compaction with size ratio 10 between levels — each byte gets rewritten once per level, and there are usually 4–7 levels.
  • Read amplification ≤ 1.01× with Bloom filters at 10 bits per key (1% false positive rate). Without Bloom filters: O(number of SSTables).
  • Memtable flush at ~64 MB in RocksDB defaults, producing a similar-sized L0 SSTable.
  • Compaction debt — if writes outpace compaction, L0 grows and reads slow down. Production systems backpressure writes when L0 exceeds ~4 SSTables.
  • Tombstones can resurrect data if compaction garbage-collects them too eagerly across levels — Cassandra's gc_grace_seconds exists exactly for this reason.

JavaScript implementation (sketch)

// A toy LSM that holds memtable + a list of SSTables in memory.
// Real LSMs persist SSTables to disk and use Bloom filters per file.

class SSTable {
  constructor(entries) {
    this.entries = entries;          // sorted [key, value] pairs
    this.minKey  = entries[0][0];
    this.maxKey  = entries[entries.length - 1][0];
  }

  get(key) {
    if (key < this.minKey || key > this.maxKey) return undefined;
    // binary search the sorted entries
    let lo = 0, hi = this.entries.length - 1;
    while (lo <= hi) {
      const mid = (lo + hi) >> 1;
      const k = this.entries[mid][0];
      if (k === key) return this.entries[mid][1];
      if (k < key)   lo = mid + 1;
      else           hi = mid - 1;
    }
    return undefined;
  }
}

const TOMBSTONE = Symbol('TOMBSTONE');

class LSM {
  constructor(memtableLimit = 1024) {
    this.memtable = new Map();         // recent writes (insertion-order-iterable)
    this.ssTables = [];                // newest first
    this.limit = memtableLimit;
  }

  put(key, value) {
    this.memtable.set(key, value);
    if (this.memtable.size >= this.limit) this.flush();
  }

  delete(key) { this.put(key, TOMBSTONE); }

  flush() {
    const sorted = [...this.memtable.entries()].sort((a, b) => (a[0] > b[0]) - (a[0] < b[0]));
    this.ssTables.unshift(new SSTable(sorted));
    this.memtable = new Map();
    if (this.ssTables.length >= 4) this.compact();
  }

  // Tiered compaction: merge all current SSTables into one
  compact() {
    const merged = new Map();
    // walk newest → oldest, keep first occurrence (newer wins)
    for (const sst of this.ssTables) {
      for (const [k, v] of sst.entries) {
        if (!merged.has(k)) merged.set(k, v);
      }
    }
    // drop tombstones in the bottom level
    const sorted = [...merged.entries()]
      .filter(([, v]) => v !== TOMBSTONE)
      .sort((a, b) => (a[0] > b[0]) - (a[0] < b[0]));
    this.ssTables = [new SSTable(sorted)];
  }

  get(key) {
    if (this.memtable.has(key)) {
      const v = this.memtable.get(key);
      return v === TOMBSTONE ? undefined : v;
    }
    for (const sst of this.ssTables) {
      const v = sst.get(key);
      if (v !== undefined) return v === TOMBSTONE ? undefined : v;
    }
    return undefined;
  }
}

This is an in-memory caricature, but it captures every essential idea: memtable buffering, ordered SSTables, newest-first read order, tombstone deletion, and merge compaction.

Python implementation — leveled compaction strategy

from sortedcontainers import SortedDict
import bisect

TOMBSTONE = object()

class SSTable:
    __slots__ = ('keys', 'vals', 'min_key', 'max_key')
    def __init__(self, items):
        items.sort(key=lambda kv: kv[0])
        self.keys = [k for k, _ in items]
        self.vals = [v for _, v in items]
        self.min_key = self.keys[0]
        self.max_key = self.keys[-1]

    def get(self, key):
        if key < self.min_key or key > self.max_key: return None
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.vals[i]
        return None

    def overlaps(self, other):
        return not (self.max_key < other.min_key or other.max_key < self.min_key)

class LeveledLSM:
    LEVEL_RATIO = 10                      # L(k+1) is 10× the size of L(k)
    L0_TRIGGER  = 4                       # compact when L0 has 4 SSTables

    def __init__(self, memtable_limit=1024):
        self.memtable = SortedDict()
        self.levels   = [[] for _ in range(7)]   # L0..L6
        self.limit    = memtable_limit

    def put(self, k, v):
        self.memtable[k] = v
        if len(self.memtable) >= self.limit:
            self._flush()

    def delete(self, k): self.put(k, TOMBSTONE)

    def _flush(self):
        sst = SSTable(list(self.memtable.items()))
        self.levels[0].append(sst)
        self.memtable = SortedDict()
        if len(self.levels[0]) >= self.L0_TRIGGER:
            self._compact(0)

    # Leveled compaction: pick one SSTable from L(n), find every overlapping
    # SSTable in L(n+1), merge them all into a new set of L(n+1) SSTables.
    def _compact(self, level):
        if level == 0:
            sources = self.levels[0]
            self.levels[0] = []
        else:
            sources = [self.levels[level].pop(0)]   # round-robin
        target  = level + 1
        # gather overlapping SSTables in target level
        overlapping = [s for s in self.levels[target] if any(s.overlaps(x) for x in sources)]
        for s in overlapping:
            self.levels[target].remove(s)
        # merge by walking all source iterators in lock-step
        merged = {}
        for sst in sources + overlapping:
            for k, v in zip(sst.keys, sst.vals):
                if k not in merged: merged[k] = v       # first occurrence = newest
        # drop tombstones if this is the deepest level (no shadows below)
        is_bottom = target == len(self.levels) - 1 or not self.levels[target + 1:]
        items = [(k, v) for k, v in merged.items() if not (is_bottom and v is TOMBSTONE)]
        # split back into target-sized chunks
        chunk = self.limit * (self.LEVEL_RATIO ** target)
        for i in range(0, len(items), chunk):
            self.levels[target].append(SSTable(items[i:i + chunk]))
        # cascade if target now exceeds its budget
        budget = self.LEVEL_RATIO ** target
        if len(self.levels[target]) > budget:
            self._compact(target)

    def get(self, key):
        if key in self.memtable:
            v = self.memtable[key]
            return None if v is TOMBSTONE else v
        for sst in reversed(self.levels[0]):           # L0 newest-first
            v = sst.get(key)
            if v is not None: return None if v is TOMBSTONE else v
        for level in self.levels[1:]:                   # L1+ are non-overlapping
            for sst in level:
                if sst.min_key <= key <= sst.max_key:
                    v = sst.get(key)
                    if v is not None: return None if v is TOMBSTONE else v
                    break
        return None

The compaction strategy choice is the central tuning question for any LSM deployment. Leveled compaction (above) keeps each level's SSTables non-overlapping, so reads check at most one file per level. The cost: every byte written to L(n) eventually gets rewritten when L(n+1) is recompacted. Tiered (Cassandra default) instead waits for several similar-sized SSTables in a level, then merges them all into one SSTable in the next level — fewer total rewrites, but reads must check every overlap.

Compaction strategies and LSM variants

Leveled compaction (RocksDB default). Each level is non-overlapping. Best read amplification, worst write amplification. Used in production at Facebook, MyRocks, CockroachDB.

Tiered (size-tiered) compaction. Cassandra's default. SSTables of roughly equal size are merged together into one in the next "tier". High write throughput, but read amplification grows with tier count and space amplification can be 2× during compaction.

Time-window compaction. SSTables within a time bucket are compacted together; older buckets are never touched. Designed for time-series with TTL — drops whole buckets atomically when expired.

Universal (RocksDB). Variant of tiered that prefers fewer, larger SSTables. Lower write amplification at the cost of higher space amplification.

FIFO compaction. Drop the oldest SSTable when total size exceeds a threshold. No merging at all. Ideal for cache-style workloads where stale data simply ages out.

WiscKey / Lethe / Dostoevsky. Research variants that separate keys from large values to avoid rewriting values during compaction. Adopted in Badger and TiKV's Titan.

RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, Bigtable, DynamoDB, Riak — all LSM-based at heart. The differences are in compaction policy, threading model, and whether they're ranged or partitioned.

Common bugs and edge cases

  • Compaction starvation. If writes outpace background compaction, L0 fills up and reads must scan dozens of files. Production systems throttle writes (or die) when L0 backlog crosses a threshold.
  • Tombstone resurrection. If a tombstone is dropped during compaction but an older copy of the key still exists in a deeper level, the deletion appears to roll back. Cassandra's gc_grace_seconds (10 days default) prevents this.
  • Read amplification under range scans. Bloom filters help point lookups but not ranges. Range scans must merge-iterate every overlapping SSTable, even if only one of them contains the actual data.
  • Snapshot isolation breaking compaction. Open snapshots pin SSTables from being deleted; long-running snapshots can balloon disk usage.
  • Bloom filter on the wrong column. A Bloom filter on a UUID is great; on a timestamp range query, useless.
  • Sequential keys defeating compaction. Inserting strictly increasing keys (epochs, autoincrement IDs) creates pathological compaction patterns; partitioned tables and time-window compaction exist exactly for this.
  • WAL fsync forgotten. Without an fsync after every write, a crash can lose data even though the memtable accepted it. Most production systems offer "synchronous" vs "batched" durability knobs.

Frequently asked questions

Why is an LSM tree faster for writes than a B-tree?

Writes go to an in-memory buffer (the memtable) and are appended to a write-ahead log — both sequential operations. Disk writes happen in big sorted batches when the memtable flushes. A B-tree, by contrast, has to read-modify-write a page in random places on every insert.

What is compaction and why does it matter?

Compaction merges multiple SSTable files into fewer, larger ones, dropping deleted (tombstoned) and overwritten keys. Without compaction, reads slow down (more files to check) and disk fills with obsolete versions. Compaction strategy is the single biggest tuning knob in any LSM.

What's the difference between leveled and tiered compaction?

Tiered (used by Cassandra by default) merges several similar-sized SSTables into one — high write throughput, more space and read amplification. Leveled (RocksDB default) keeps each level non-overlapping, so reads check at most one file per level — lower read and space amplification, higher write amplification.

How does an LSM handle deletes?

It writes a tombstone — a small marker entry saying "this key was deleted at time T". The tombstone shadows older versions on read. Compaction eventually drops both the tombstone and the older versions, but only after enough time has passed that no live snapshot still needs them.

Why use a Bloom filter with an LSM tree?

A read in the worst case checks every SSTable. A per-SSTable Bloom filter answers "definitely not in this file" in microseconds, sparing the disk read. With a 1% false-positive rate, you typically only read the one file that actually contains the key, even with hundreds of SSTables.

When should I pick an LSM over a B+ tree?

When write throughput dominates: time-series ingest, write-heavy logs, append-mostly workloads, IoT firehoses. B+ trees still win when reads dominate, when range scans must be ultra-low-latency, or when the data fits comfortably in memory.