Storage Systems

Memtable Flush

The atomic write-path step that turns random inserts into one sequential SSTable

A memtable is the in-RAM sorted map that absorbs writes in an LSM engine. When it fills (typically 64 MB), it's frozen and flushed as an immutable SSTable on disk.

  • StructureSkip list (RocksDB, Cassandra)
  • Flush threshold64 MB default (RocksDB)
  • DurabilityWAL records every write
  • Flush typeBackground, non-blocking
  • On-disk outputImmutable sorted SSTable
  • Used inRocksDB, Cassandra, LevelDB

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a memtable flush works

In an LSM-tree engine like RocksDB or Cassandra, every put(key, value) follows a two-stage path:

  1. Append to the WAL. The change is recorded in the write-ahead log on disk — sequential, durable, replayable. This is what makes COMMIT survive a crash.
  2. Insert into the memtable. The in-memory sorted structure (usually a skip list) accepts the key-value pair. O(log n) insert, immediately visible to subsequent reads.

The memtable grows. New keys go in, existing keys get overwritten in place, deletes insert tombstones. Reads check the memtable first (in RAM, fastest), then fall back to on-disk SSTables.

When the memtable size crosses a threshold — typically 64 MB in RocksDB, 256 MB to 2 GB in Cassandra — the engine performs a flush:

  1. The current memtable becomes immutable — readable, no longer writable.
  2. A new mutable memtable is created instantly; new writes go there.
  3. The immutable memtable is walked in sorted order, producing a single SSTable file on disk: keys, values, index block, Bloom filter, footer.
  4. Once the SSTable is durable, the immutable memtable is freed and the corresponding WAL segments are deleted.

The actual write-path stall during flush is microseconds — long enough to swap pointers, not long enough to drain to disk. The disk write happens entirely in the background.

[CLIENT] put(k, v)
   │
   ├─→ WAL append (sequential, fsync at commit)
   │
   └─→ memtable.insert(k, v)
         skip list, O(log n)
         │
         │ size > threshold? (64 MB)
         ▼
       mark immutable; spawn new memtable
         │
         │ (background flush thread)
         ▼
       walk in sorted order → write SSTable
         │
         ▼
       free immutable memtable + delete WAL segments

Why hold writes in RAM at all

Without the memtable, every put would be a small random write to whichever on-disk page held the target key. On an NVMe SSD that's roughly 50-100 µs per write; on a spinning disk it's milliseconds. The memtable buffers thousands of writes and flushes them as one long sequential SSTable, which the disk handles at near-bandwidth speed.

The memtable is the sequentialization trick. Disks reward sequential writes — modern NVMe SSDs hit 5-7 GB/s sequentially but only ~500 MB/s on small random writes. The memtable turns the random write workload at the API into a sequential write workload at the disk.

Numbers from production engines

  • RocksDB default flush threshold: 64 MB via write_buffer_size. With min_write_buffer_number_to_merge=1, each flush produces one L0 SSTable.
  • Cassandra flush threshold: depends on heap. memtable_heap_space_in_mb often 1024-2048 MB; memtable_flush_writers controls flush parallelism.
  • LevelDB flush threshold: 4 MB via write_buffer_size. Small memtables = more L0 files = more compaction.
  • Memtable RAM cost ≈ 1.5-2× the data size due to skip list pointer overhead (typically 4-8 pointers per entry).
  • Flush rate on a moderately busy ingest: ~1 flush per 5-30 seconds at default sizes. RocksDB sustains millions of writes/sec by flushing 64 MB SSTables continuously.

Memtable structures compared

Skip listRed-black treeB+ treeHash tableSorted vectorTrie
InsertO(log n) avgO(log n)O(log n)O(1) avgO(n)O(L) — key length
Sorted scanO(n) — built-inO(n) — in-orderO(n) — leaf scanNot supportedO(n) — already sortedO(n) — DFS
ConcurrencyLock-free easyHard (rebalances)Latching pagesPer-bucket locksSingle lockSub-tree locks
Memory overhead2-4× data3× data1.5× data1.5-2×~1×1-2× depending on prefix overlap
Used byRocksDB, Cassandrasome C++ STL memtablesInnoDB redoRedis hash typeToy projectsSome prefix-heavy stores
Why chosenLock-free, range scansTight memoryDisk-friendlyPoint lookups onlyRead-mostlyCommon prefixes

Skip lists win in production because they support cheap lock-free concurrent inserts and produce already-sorted iteration — exactly what the flush needs.

Python — minimal memtable + flush

from sortedcontainers import SortedDict
import threading, time

class SSTable:
    """Immutable sorted file. In real engines, persisted to disk with index + Bloom filter."""
    def __init__(self, name: str, items: list[tuple]):
        self.name = name
        self.items = items                          # sorted list of (key, value)
        self.min_key = items[0][0] if items else None
        self.max_key = items[-1][0] if items else None
        self.size = sum(len(k) + len(v) for k, v in items)

class LSMStore:
    FLUSH_THRESHOLD = 64 * 1024 * 1024              # 64 MB like RocksDB

    def __init__(self):
        self.active = SortedDict()                  # mutable memtable (RAM)
        self.immutable = None                       # frozen, draining to disk
        self.active_size = 0
        self.sstables = []                          # newest first
        self.wal_records = []                       # write-ahead log
        self.lock = threading.RLock()
        self.flush_counter = 0

    def put(self, key: str, value: str):
        with self.lock:
            self.wal_records.append(('PUT', key, value))     # 1. append WAL
            old = self.active.get(key, '')
            self.active[key] = value
            self.active_size += len(key) + len(value) - len(old)
            if self.active_size >= self.FLUSH_THRESHOLD and self.immutable is None:
                self._trigger_flush()

    def _trigger_flush(self):
        self.immutable = self.active
        self.active = SortedDict()
        self.active_size = 0
        threading.Thread(target=self._flush, daemon=True).start()

    def _flush(self):
        with self.lock:
            items = list(self.immutable.items())
            self.flush_counter += 1
            name = f'sst-{self.flush_counter:04d}'
        # walk in sorted order, write SSTable (simulated as in-memory list)
        sst = SSTable(name, items)
        time.sleep(0.05)                            # simulate disk write
        with self.lock:
            self.sstables.insert(0, sst)
            self.immutable = None
            # WAL segments older than this flush can be deleted

    def get(self, key: str) -> str | None:
        with self.lock:
            if key in self.active: return self.active[key]
            if self.immutable and key in self.immutable: return self.immutable[key]
            for sst in self.sstables:
                if sst.min_key is None or not (sst.min_key <= key <= sst.max_key):
                    continue
                # binary search the sorted items
                lo, hi = 0, len(sst.items) - 1
                while lo <= hi:
                    mid = (lo + hi) // 2
                    k, v = sst.items[mid]
                    if k == key: return v
                    if k < key: lo = mid + 1
                    else: hi = mid - 1
            return None

JavaScript — memtable flush sketch

class LSMStore {
  constructor(flushThreshold = 64 * 1024 * 1024) {
    this.active = new Map();          // ordered if JS engine cooperates; use SortedMap in prod
    this.immutable = null;
    this.activeSize = 0;
    this.sstables = [];               // newest first
    this.wal = [];
    this.flushThreshold = flushThreshold;
  }

  async put(key, value) {
    this.wal.push(['PUT', key, value]);          // WAL append
    const old = this.active.get(key) || '';
    this.active.set(key, value);
    this.activeSize += key.length + value.length - old.length;
    if (this.activeSize >= this.flushThreshold && this.immutable === null) {
      this.triggerFlush();
    }
  }

  triggerFlush() {
    this.immutable = this.active;
    this.active = new Map();
    this.activeSize = 0;
    // background flush
    queueMicrotask(() => this.flush());
  }

  async flush() {
    const items = [...this.immutable.entries()]
      .sort((a, b) => (a[0] > b[0]) - (a[0] < b[0]));
    const sst = { name: `sst-${this.sstables.length + 1}`, items };
    await new Promise(r => setTimeout(r, 50));   // simulate disk write
    this.sstables.unshift(sst);
    this.immutable = null;
    // drop WAL segments older than this flush boundary
  }

  get(key) {
    if (this.active.has(key)) return this.active.get(key);
    if (this.immutable && this.immutable.has(key)) return this.immutable.get(key);
    for (const sst of this.sstables) {
      // sorted items → binary search omitted for brevity
      for (const [k, v] of sst.items) if (k === key) return v;
    }
    return undefined;
  }
}

Variants — flush strategies in production

Single-memtable flush (RocksDB default). One active memtable, one immutable while flushing. The simplest case.

Multi-memtable buffering. RocksDB max_write_buffer_number=4 allows up to 4 immutable memtables queued for flushing concurrently — buffers ingest spikes without write stalls.

Merge-on-flush. If multiple immutable memtables accumulate, the flush can merge them into a single larger SSTable, reducing L0 file count and downstream compaction work.

Parallel flush threads. Cassandra's memtable_flush_writers can flush multiple memtables in parallel, useful with many column families.

Off-heap memtables. Cassandra and Scylla offer off-heap memtable allocators to avoid GC pressure on multi-GB memtables. Native memory; data structures keep skip-list semantics.

Manual flush. Most engines expose a flush() RPC for backup or snapshot consistency — forces the active memtable to disk now, regardless of size.

When to tune the flush threshold

  • Larger memtable = fewer flushes, lower L0 file count, less compaction pressure — but more RAM and longer recovery time (more WAL to replay).
  • Smaller memtable = more L0 files, more compaction CPU, but lower RAM use and faster crash recovery.
  • Write-heavy ingest workloads benefit from larger memtables (64 MB → 256 MB to 1 GB) to amortize flush overhead.
  • Latency-sensitive workloads tune down the memtable size to keep flush pauses short and predictable.
  • Embedded / mobile deployments use small memtables (LevelDB default is 4 MB) to keep memory footprint bounded.

Common memtable bugs

  • Write stalls. If flushes can't keep up with ingest, the active memtable can't be swapped out. Writes block. RocksDB exposes level0_slowdown_writes_trigger to throttle clients before they hit a hard stall.
  • WAL retention bug. Forgetting to delete WAL segments after flush completes inflates disk usage; conversely, deleting too early loses durability for keys still only in the memtable.
  • Flush during shutdown. Without a synchronous final flush, recently-written keys live only in the WAL until next startup replay. Most engines force a flush on graceful shutdown.
  • Skip list contention. Lock-based memtables collapse under heavy concurrency. Lock-free skip lists (RocksDB) and CAS-based variants exist for this reason.
  • Memory leak via prepared statements. If the engine retains a snapshot pointer at every memtable, immutable memtables can't be freed even after flush. Snapshot lifecycle is a frequent source of GC pressure.
  • Per-CF flush ordering. Engines with multiple column families can flush one CF while another is full; cross-CF transaction atomicity can break if not coordinated.

Frequently asked questions

Why hold writes in memory at all — why not write straight to disk?

Disks reward sequential writes and punish random ones. The memtable batches recent writes into one sorted block; flushing produces one long sequential write of an immutable file. Without the memtable, each insert would be a small random write to whichever on-disk page held that key — orders of magnitude slower.

What data structure is a memtable usually?

Almost always a skip list or a balanced BST. RocksDB and Cassandra default to skip lists — lock-free, O(log n) insert and iteration in sorted order. The sorted iteration is what makes the flush produce an already-sorted SSTable in a single pass.

What size triggers a flush?

RocksDB defaults to 64 MB (write_buffer_size). Cassandra defaults to a configurable fraction of heap (memtable_heap_space_in_mb, often 256-2048 MB). LevelDB uses 4 MB. The size trades flush frequency against memory cost — bigger memtables mean fewer flushes but more RAM and longer pauses on flush.

What happens to writes during a flush?

The full memtable becomes 'immutable' (readable, not writable). A new mutable memtable is created instantly. New writes hit the new memtable. The immutable one drains to disk in the background. Read paths check both — newest first. Total write-path stall is microseconds.

Is the WAL replayed back into a memtable on restart?

Yes. After a crash, the engine walks the write-ahead log and reinserts every record into a fresh memtable until catching up to the last on-disk SSTable. Then it flushes that recovered memtable. The WAL segments older than the latest flush can be deleted safely.

How does the memtable affect read latency?

Reads check the active memtable first (in-RAM, fastest). If miss, then the immutable memtable in progress of flushing, then each on-disk SSTable level. The newest data is always in RAM; recently written keys read in microseconds. Cold reads pay the disk cost of the SSTable hierarchy.