Storage Systems
Log-Structured Merge
Append-only writes, tiered SSTables, background compaction — how modern KV stores absorb writes
Log-structured merge architectures buffer writes in RAM, flush as immutable sorted files on disk, and merge files across levels. 10× faster writes; 2× slower reads.
- Writes10× faster than B-tree
- Reads (with Bloom)~2× slower than B-tree
- Write amplification10× to 30× (leveled)
- LevelsL0 to L6 typical
- Size ratio10× between adjacent levels
- Used inRocksDB, Cassandra, BigTable
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The log-structured merge architecture
Log-structured merge (LSM) is a write path designed around one observation: disks reward sequential writes and punish random ones. An NVMe SSD pushes 5-7 GB/s sequentially but barely 500 MB/s on small random writes. Spinning disks are worse by an order of magnitude. Random writes are the bottleneck of every traditional database.
LSM rearranges every write into the cheap pattern:
- Write to a log (sequential). A write-ahead log on disk records the change for durability.
- Insert into RAM (memtable). A sorted in-memory structure absorbs the write at memory speed.
- Flush to disk when full (sequential). The memtable becomes an immutable sorted on-disk SSTable.
- Merge in the background (sequential, batched). Compaction merges SSTables across levels to keep the count manageable.
At no point does the disk see a small random write. The cost of this rearrangement is write amplification — each byte gets rewritten multiple times as it migrates down levels — but the disk handles that as more sequential work, which it loves.
[WRITE] put(k, v)
↓
WAL append (sequential, fsync at commit)
↓
Memtable insert (RAM, O(log n))
↓ when full
SSTable flush (sequential write of sorted file → L0)
↓ when L0 full
Compaction L0→L1 (sequential merge → L1 SSTables)
↓ when L1 full
Compaction L1→L2 (sequential merge → L2 SSTables, ~10× bigger)
↓
...
L6 (largest, oldest data)
The level pyramid
LSM disks are organized as a pyramid of levels — L0 at top (smallest), L6 at bottom (largest). Each level is roughly 10× the size of the one above. Reads check every level; compaction merges adjacent levels.
L0 contains recently-flushed SSTables that may overlap in key range (no global ordering yet). RocksDB typically caps L0 at 4-8 SSTables before triggering compaction.
L1 through L6 are non-overlapping in leveled compaction: each level partitions the keyspace into disjoint SSTables. A read in level k checks at most one SSTable — binary search to locate, then within-file binary search. Bloom filters skip files that definitely don't contain the key.
L0 [a..z] [a..z] [a..z] [a..z] ← may overlap; recent
L1 [a..c] [d..f] [g..i] [j..l] ... ← non-overlapping, sorted
L2 [a..a3] [a3..b1] ... ← 10× bigger, non-overlapping
L3 ... ← 100× bigger
...
L6 ... ← millions of SSTables
The LSM cost model — what the numbers say
- LSM writes 10× faster than B-trees in production ingest benchmarks (RocksDB vs InnoDB, Cassandra vs Postgres on time-series). The sequential write pattern is the entire reason.
- LSM reads ~2× slower than B-trees in cold-cache point lookups. With Bloom filters at 10 bits/key and 1% false-positive rate, the practical read amp is ~1.01× — usually only one SSTable is actually read.
- Write amplification 10-30× with leveled compaction. Each byte rewritten once per level migration. Tiered compaction is 4-10× write amp at the cost of higher read amp.
- Space amplification 1.1-1.4× with leveled — older versions and tombstones add small overhead. Tiered is 2-10× during compaction.
- L0 size threshold ~ 4 SSTables before write stall in RocksDB. Compaction must keep up; otherwise the API blocks.
- Compaction CPU overhead dominates many LSM workloads — 20-50% of total CPU during steady-state ingestion.
LSM vs B-tree — workload by workload
| LSM (leveled) | LSM (tiered) | B+ tree | Hash index | In-memory | Append-only log | |
|---|---|---|---|---|---|---|
| Sequential write IO | Yes | Yes | No (random) | No (random) | — | Yes |
| Write amplification | 10-30× | 4-10× | 1-3× | 1-3× | 1× | 1× |
| Read amplification | ~1× w/ Bloom | L× without Bloom | 1× | 1× | 1× | n (full scan) |
| Space amplification | 1.1-1.4× | 2-10× | 1.5× (50% fill) | 1.5-2× | ~2× | n× |
| Range scan | Merge-iterate | Merge-iterate | Excellent | Not supported | Excellent | Linear scan |
| Used in | RocksDB, BigTable | Cassandra, ScyllaDB | Postgres, InnoDB | Memcached, Redis | Redis, sorted set | Kafka segments |
LSM dominates write-heavy and ingest-heavy workloads. B-trees dominate read-heavy workloads with strict tail-latency goals. Most modern systems pick LSM because writes have caught up with reads in modern web/IoT/telemetry workloads.
Python — leveled LSM with compaction
from sortedcontainers import SortedDict
import bisect
TOMBSTONE = object()
class SSTable:
__slots__ = ('keys', 'vals', 'min_key', 'max_key')
def __init__(self, items):
items.sort(key=lambda kv: kv[0])
self.keys = [k for k, _ in items]
self.vals = [v for _, v in items]
self.min_key = self.keys[0]
self.max_key = self.keys[-1]
def get(self, k):
if k < self.min_key or k > self.max_key: return None
i = bisect.bisect_left(self.keys, k)
if i < len(self.keys) and self.keys[i] == k:
return self.vals[i]
return None
def overlaps(self, other):
return not (self.max_key < other.min_key or other.max_key < self.min_key)
class LSM:
LEVEL_RATIO = 10 # each level 10× the previous
L0_TRIGGER = 4 # compact L0 when 4 SSTables present
def __init__(self, memtable_limit=1024):
self.memtable = SortedDict()
self.levels = [[] for _ in range(7)] # L0..L6
self.limit = memtable_limit
def put(self, k, v):
self.memtable[k] = v
if len(self.memtable) >= self.limit:
self._flush()
def delete(self, k): self.put(k, TOMBSTONE)
def _flush(self):
sst = SSTable(list(self.memtable.items()))
self.levels[0].append(sst)
self.memtable = SortedDict()
if len(self.levels[0]) >= self.L0_TRIGGER:
self._compact(0)
def _compact(self, level):
# Pick all of L0 (overlapping) or one SSTable of deeper levels
if level == 0:
sources = self.levels[0]
self.levels[0] = []
else:
sources = [self.levels[level].pop(0)]
target = level + 1
overlapping = [s for s in self.levels[target]
if any(s.overlaps(x) for x in sources)]
for s in overlapping:
self.levels[target].remove(s)
# Merge keeping newest version per key
merged = {}
for sst in sources + overlapping:
for k, v in zip(sst.keys, sst.vals):
if k not in merged: merged[k] = v
# Bottom level: drop tombstones
is_bottom = target == len(self.levels) - 1
items = [(k, v) for k, v in merged.items()
if not (is_bottom and v is TOMBSTONE)]
# Chunk into target-sized SSTables
chunk = self.limit * (self.LEVEL_RATIO ** target)
for i in range(0, len(items), chunk):
self.levels[target].append(SSTable(items[i:i + chunk]))
# Cascade if target now over budget
if len(self.levels[target]) > self.LEVEL_RATIO ** target:
self._compact(target)
def get(self, k):
if k in self.memtable:
v = self.memtable[k]
return None if v is TOMBSTONE else v
for sst in reversed(self.levels[0]):
v = sst.get(k)
if v is not None: return None if v is TOMBSTONE else v
for level in self.levels[1:]:
for sst in level:
if sst.min_key <= k <= sst.max_key:
v = sst.get(k)
if v is not None: return None if v is TOMBSTONE else v
break
return None
Compaction strategies — leveled, tiered, time-windowed
Leveled compaction (RocksDB default). Each level is non-overlapping. Best read amp, worst write amp. Industrial standard.
Tiered (size-tiered) compaction (Cassandra default). Multiple similar-sized SSTables per level; merge several into one in the next tier. Higher write throughput, more read and space amp.
Universal compaction (RocksDB). A tiered variant preferring fewer, larger SSTables. Lower write amp at the cost of higher peak space amp during merges.
Time-window compaction. SSTables grouped by time bucket; older buckets are immutable and dropped wholesale at TTL. Designed for time-series workloads.
FIFO compaction. Drop oldest SSTable when total size exceeds threshold. No merging. Ideal for cache-style data where stale aging-out is acceptable.
Key-value separation (WiscKey, Titan, Lethe). Keep small keys in the LSM, store large values in a separate value log to avoid rewriting them during compaction. Drastically lowers write amp on big-value workloads.
Production LSM systems
- RocksDB — Facebook's embedded LSM. Powers MyRocks, CockroachDB, TiKV, ClickHouse storage layer.
- LevelDB — Google's original LSM design from 2011. Smaller, simpler, the educational reference.
- Cassandra / ScyllaDB — distributed wide-column stores using tiered compaction by default.
- BigTable / HBase — Google's original distributed LSM (Bigtable paper, 2006) and its Apache clone.
- DynamoDB — Amazon's managed KV store; partitions are individual LSMs internally.
- Riak, FoundationDB, etcd, CockroachDB — all LSM-backed at the storage layer.
- InfluxDB, Druid, Pinot, ClickHouse — time-series and analytical stores using LSM-style time-window organization.
When LSM is the right call
- Write throughput dominates — telemetry ingest, IoT firehoses, log/metric collection.
- Time-series workloads — append-mostly with time-window compaction dropping expired data atomically.
- Append-mostly event stores — Kafka log compaction is itself an LSM.
- SSDs and disk-tiered storage — the sequential-write friendliness is the entire point.
- When you can spare CPU for compaction — LSM trades CPU during compaction for ingest throughput. Modern multicore boxes handle this gladly.
Avoid LSM when reads dominate and tail-latency must be predictable (compaction storms hurt p99). Avoid when the dataset fits in RAM — at that point, in-memory structures win on every axis.
Common LSM gotchas
- Compaction starvation. Writes outpace compaction; L0 fills; reads slow; the system stalls. RocksDB throttles writes via
level0_slowdown_writes_trigger. - Tombstone resurrection. A tombstone GC'd before all older versions are GC'd brings deleted data back. Cassandra's
gc_grace_seconds(10 days) guards against this. - Range scan cost. Bloom filters help point lookups, not ranges. Range scans must merge-iterate every overlapping SSTable.
- Snapshot pinning. Open snapshots keep old SSTables alive; long-running scans inflate disk usage.
- Sequential keys defeat compaction. Strictly increasing keys (epochs, autoincrement IDs) create hot SSTables and pathological compaction patterns. Use partitioning or time-window compaction.
- WAL fsync forgotten. Without fsync after every write, a crash can lose data the memtable already accepted. Production engines expose synchronous/batched durability knobs.
- Bloom filter on the wrong key. Bloom filter on a UUID is great; on a timestamp range, useless.
Frequently asked questions
What does 'log-structured' mean?
The on-disk format is append-only. Writes go to the end of a log; existing data is never modified in place. Outdated entries are reclaimed asynchronously through compaction. Compare to a B-tree, which modifies the existing data pages directly — the random-write pattern LSM was designed to avoid.
How is LSM different from a plain log?
A plain log is great for writes but terrible for reads — you'd have to scan everything to find a key. LSM solves this by buffering writes in an in-memory sorted memtable, flushing periodically as sorted on-disk SSTables, and merging those SSTables into deeper levels in the background. The result: writes stay sequential, reads use binary search per file.
Why is LSM faster for writes than B-trees?
B-tree inserts modify a random page on disk per write. LSM buffers writes in RAM and flushes them in big sequential batches. Modern SSDs handle sequential writes at 5-7 GB/s vs. 500 MB/s for small random writes. The 10× gap that justifies LSM in write-heavy workloads.
What is write amplification in LSM?
Each byte you write at the API gets rewritten multiple times as it migrates through compaction levels. With leveled compaction and a 10× size ratio between levels, write amplification is roughly 10× per level × number of levels. RocksDB's typical write amp is 10-30×; that's the LSM tax for converting random writes into sequential ones.
Why use Bloom filters with LSM?
A read can in the worst case probe every SSTable in every level. A per-SSTable Bloom filter answers 'definitely not here' in microseconds without a disk read. With a 1% false-positive rate and 10 bits per key, reads usually touch only the one SSTable that actually contains the key — even across hundreds of SSTables.
When would you NOT pick an LSM?
Read-heavy workloads with strict tail-latency requirements, especially if data fits in memory. Single-host OLTP with low concurrent writes. Workloads requiring large range scans of cold data — compaction storms can interrupt scan progress. B+ trees, hash indexes, or in-memory structures often win these.