Storage Systems

Log-Structured Merge

Append-only writes, tiered SSTables, background compaction — how modern KV stores absorb writes

Log-structured merge architectures buffer writes in RAM, flush as immutable sorted files on disk, and merge files across levels. 10× faster writes; 2× slower reads.

  • Writes10× faster than B-tree
  • Reads (with Bloom)~2× slower than B-tree
  • Write amplification10× to 30× (leveled)
  • LevelsL0 to L6 typical
  • Size ratio10× between adjacent levels
  • Used inRocksDB, Cassandra, BigTable

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The log-structured merge architecture

Log-structured merge (LSM) is a write path designed around one observation: disks reward sequential writes and punish random ones. An NVMe SSD pushes 5-7 GB/s sequentially but barely 500 MB/s on small random writes. Spinning disks are worse by an order of magnitude. Random writes are the bottleneck of every traditional database.

LSM rearranges every write into the cheap pattern:

  1. Write to a log (sequential). A write-ahead log on disk records the change for durability.
  2. Insert into RAM (memtable). A sorted in-memory structure absorbs the write at memory speed.
  3. Flush to disk when full (sequential). The memtable becomes an immutable sorted on-disk SSTable.
  4. Merge in the background (sequential, batched). Compaction merges SSTables across levels to keep the count manageable.

At no point does the disk see a small random write. The cost of this rearrangement is write amplification — each byte gets rewritten multiple times as it migrates down levels — but the disk handles that as more sequential work, which it loves.

[WRITE] put(k, v)
  ↓
WAL append         (sequential, fsync at commit)
  ↓
Memtable insert    (RAM, O(log n))
  ↓ when full
SSTable flush      (sequential write of sorted file → L0)
  ↓ when L0 full
Compaction L0→L1   (sequential merge → L1 SSTables)
  ↓ when L1 full
Compaction L1→L2   (sequential merge → L2 SSTables, ~10× bigger)
  ↓
...
L6 (largest, oldest data)

The level pyramid

LSM disks are organized as a pyramid of levels — L0 at top (smallest), L6 at bottom (largest). Each level is roughly 10× the size of the one above. Reads check every level; compaction merges adjacent levels.

L0 contains recently-flushed SSTables that may overlap in key range (no global ordering yet). RocksDB typically caps L0 at 4-8 SSTables before triggering compaction.

L1 through L6 are non-overlapping in leveled compaction: each level partitions the keyspace into disjoint SSTables. A read in level k checks at most one SSTable — binary search to locate, then within-file binary search. Bloom filters skip files that definitely don't contain the key.

L0  [a..z]  [a..z]  [a..z]  [a..z]    ← may overlap; recent
L1  [a..c]  [d..f]  [g..i]  [j..l] ... ← non-overlapping, sorted
L2  [a..a3] [a3..b1] ...               ← 10× bigger, non-overlapping
L3  ...                                ← 100× bigger
...
L6  ...                                ← millions of SSTables

The LSM cost model — what the numbers say

  • LSM writes 10× faster than B-trees in production ingest benchmarks (RocksDB vs InnoDB, Cassandra vs Postgres on time-series). The sequential write pattern is the entire reason.
  • LSM reads ~2× slower than B-trees in cold-cache point lookups. With Bloom filters at 10 bits/key and 1% false-positive rate, the practical read amp is ~1.01× — usually only one SSTable is actually read.
  • Write amplification 10-30× with leveled compaction. Each byte rewritten once per level migration. Tiered compaction is 4-10× write amp at the cost of higher read amp.
  • Space amplification 1.1-1.4× with leveled — older versions and tombstones add small overhead. Tiered is 2-10× during compaction.
  • L0 size threshold ~ 4 SSTables before write stall in RocksDB. Compaction must keep up; otherwise the API blocks.
  • Compaction CPU overhead dominates many LSM workloads — 20-50% of total CPU during steady-state ingestion.

LSM vs B-tree — workload by workload

LSM (leveled)LSM (tiered)B+ treeHash indexIn-memoryAppend-only log
Sequential write IOYesYesNo (random)No (random)Yes
Write amplification10-30×4-10×1-3×1-3×
Read amplification~1× w/ BloomL× without Bloomn (full scan)
Space amplification1.1-1.4×2-10×1.5× (50% fill)1.5-2×~2×
Range scanMerge-iterateMerge-iterateExcellentNot supportedExcellentLinear scan
Used inRocksDB, BigTableCassandra, ScyllaDBPostgres, InnoDBMemcached, RedisRedis, sorted setKafka segments

LSM dominates write-heavy and ingest-heavy workloads. B-trees dominate read-heavy workloads with strict tail-latency goals. Most modern systems pick LSM because writes have caught up with reads in modern web/IoT/telemetry workloads.

Python — leveled LSM with compaction

from sortedcontainers import SortedDict
import bisect

TOMBSTONE = object()

class SSTable:
    __slots__ = ('keys', 'vals', 'min_key', 'max_key')
    def __init__(self, items):
        items.sort(key=lambda kv: kv[0])
        self.keys = [k for k, _ in items]
        self.vals = [v for _, v in items]
        self.min_key = self.keys[0]
        self.max_key = self.keys[-1]

    def get(self, k):
        if k < self.min_key or k > self.max_key: return None
        i = bisect.bisect_left(self.keys, k)
        if i < len(self.keys) and self.keys[i] == k:
            return self.vals[i]
        return None

    def overlaps(self, other):
        return not (self.max_key < other.min_key or other.max_key < self.min_key)

class LSM:
    LEVEL_RATIO = 10                          # each level 10× the previous
    L0_TRIGGER = 4                            # compact L0 when 4 SSTables present

    def __init__(self, memtable_limit=1024):
        self.memtable = SortedDict()
        self.levels = [[] for _ in range(7)]  # L0..L6
        self.limit = memtable_limit

    def put(self, k, v):
        self.memtable[k] = v
        if len(self.memtable) >= self.limit:
            self._flush()

    def delete(self, k): self.put(k, TOMBSTONE)

    def _flush(self):
        sst = SSTable(list(self.memtable.items()))
        self.levels[0].append(sst)
        self.memtable = SortedDict()
        if len(self.levels[0]) >= self.L0_TRIGGER:
            self._compact(0)

    def _compact(self, level):
        # Pick all of L0 (overlapping) or one SSTable of deeper levels
        if level == 0:
            sources = self.levels[0]
            self.levels[0] = []
        else:
            sources = [self.levels[level].pop(0)]
        target = level + 1
        overlapping = [s for s in self.levels[target]
                       if any(s.overlaps(x) for x in sources)]
        for s in overlapping:
            self.levels[target].remove(s)
        # Merge keeping newest version per key
        merged = {}
        for sst in sources + overlapping:
            for k, v in zip(sst.keys, sst.vals):
                if k not in merged: merged[k] = v
        # Bottom level: drop tombstones
        is_bottom = target == len(self.levels) - 1
        items = [(k, v) for k, v in merged.items()
                 if not (is_bottom and v is TOMBSTONE)]
        # Chunk into target-sized SSTables
        chunk = self.limit * (self.LEVEL_RATIO ** target)
        for i in range(0, len(items), chunk):
            self.levels[target].append(SSTable(items[i:i + chunk]))
        # Cascade if target now over budget
        if len(self.levels[target]) > self.LEVEL_RATIO ** target:
            self._compact(target)

    def get(self, k):
        if k in self.memtable:
            v = self.memtable[k]
            return None if v is TOMBSTONE else v
        for sst in reversed(self.levels[0]):
            v = sst.get(k)
            if v is not None: return None if v is TOMBSTONE else v
        for level in self.levels[1:]:
            for sst in level:
                if sst.min_key <= k <= sst.max_key:
                    v = sst.get(k)
                    if v is not None: return None if v is TOMBSTONE else v
                    break
        return None

Compaction strategies — leveled, tiered, time-windowed

Leveled compaction (RocksDB default). Each level is non-overlapping. Best read amp, worst write amp. Industrial standard.

Tiered (size-tiered) compaction (Cassandra default). Multiple similar-sized SSTables per level; merge several into one in the next tier. Higher write throughput, more read and space amp.

Universal compaction (RocksDB). A tiered variant preferring fewer, larger SSTables. Lower write amp at the cost of higher peak space amp during merges.

Time-window compaction. SSTables grouped by time bucket; older buckets are immutable and dropped wholesale at TTL. Designed for time-series workloads.

FIFO compaction. Drop oldest SSTable when total size exceeds threshold. No merging. Ideal for cache-style data where stale aging-out is acceptable.

Key-value separation (WiscKey, Titan, Lethe). Keep small keys in the LSM, store large values in a separate value log to avoid rewriting them during compaction. Drastically lowers write amp on big-value workloads.

Production LSM systems

  • RocksDB — Facebook's embedded LSM. Powers MyRocks, CockroachDB, TiKV, ClickHouse storage layer.
  • LevelDB — Google's original LSM design from 2011. Smaller, simpler, the educational reference.
  • Cassandra / ScyllaDB — distributed wide-column stores using tiered compaction by default.
  • BigTable / HBase — Google's original distributed LSM (Bigtable paper, 2006) and its Apache clone.
  • DynamoDB — Amazon's managed KV store; partitions are individual LSMs internally.
  • Riak, FoundationDB, etcd, CockroachDB — all LSM-backed at the storage layer.
  • InfluxDB, Druid, Pinot, ClickHouse — time-series and analytical stores using LSM-style time-window organization.

When LSM is the right call

  • Write throughput dominates — telemetry ingest, IoT firehoses, log/metric collection.
  • Time-series workloads — append-mostly with time-window compaction dropping expired data atomically.
  • Append-mostly event stores — Kafka log compaction is itself an LSM.
  • SSDs and disk-tiered storage — the sequential-write friendliness is the entire point.
  • When you can spare CPU for compaction — LSM trades CPU during compaction for ingest throughput. Modern multicore boxes handle this gladly.

Avoid LSM when reads dominate and tail-latency must be predictable (compaction storms hurt p99). Avoid when the dataset fits in RAM — at that point, in-memory structures win on every axis.

Common LSM gotchas

  • Compaction starvation. Writes outpace compaction; L0 fills; reads slow; the system stalls. RocksDB throttles writes via level0_slowdown_writes_trigger.
  • Tombstone resurrection. A tombstone GC'd before all older versions are GC'd brings deleted data back. Cassandra's gc_grace_seconds (10 days) guards against this.
  • Range scan cost. Bloom filters help point lookups, not ranges. Range scans must merge-iterate every overlapping SSTable.
  • Snapshot pinning. Open snapshots keep old SSTables alive; long-running scans inflate disk usage.
  • Sequential keys defeat compaction. Strictly increasing keys (epochs, autoincrement IDs) create hot SSTables and pathological compaction patterns. Use partitioning or time-window compaction.
  • WAL fsync forgotten. Without fsync after every write, a crash can lose data the memtable already accepted. Production engines expose synchronous/batched durability knobs.
  • Bloom filter on the wrong key. Bloom filter on a UUID is great; on a timestamp range, useless.

Frequently asked questions

What does 'log-structured' mean?

The on-disk format is append-only. Writes go to the end of a log; existing data is never modified in place. Outdated entries are reclaimed asynchronously through compaction. Compare to a B-tree, which modifies the existing data pages directly — the random-write pattern LSM was designed to avoid.

How is LSM different from a plain log?

A plain log is great for writes but terrible for reads — you'd have to scan everything to find a key. LSM solves this by buffering writes in an in-memory sorted memtable, flushing periodically as sorted on-disk SSTables, and merging those SSTables into deeper levels in the background. The result: writes stay sequential, reads use binary search per file.

Why is LSM faster for writes than B-trees?

B-tree inserts modify a random page on disk per write. LSM buffers writes in RAM and flushes them in big sequential batches. Modern SSDs handle sequential writes at 5-7 GB/s vs. 500 MB/s for small random writes. The 10× gap that justifies LSM in write-heavy workloads.

What is write amplification in LSM?

Each byte you write at the API gets rewritten multiple times as it migrates through compaction levels. With leveled compaction and a 10× size ratio between levels, write amplification is roughly 10× per level × number of levels. RocksDB's typical write amp is 10-30×; that's the LSM tax for converting random writes into sequential ones.

Why use Bloom filters with LSM?

A read can in the worst case probe every SSTable in every level. A per-SSTable Bloom filter answers 'definitely not here' in microseconds without a disk read. With a 1% false-positive rate and 10 bits per key, reads usually touch only the one SSTable that actually contains the key — even across hundreds of SSTables.

When would you NOT pick an LSM?

Read-heavy workloads with strict tail-latency requirements, especially if data fits in memory. Single-host OLTP with low concurrent writes. Workloads requiring large range scans of cold data — compaction storms can interrupt scan progress. B+ trees, hash indexes, or in-memory structures often win these.