Storage Systems

SSTable Format

An immutable, sorted, block-indexed, Bloom-filtered file — the page format every modern key-value store writes to disk

An SSTable is an immutable sorted file of key-value pairs, packed into compressed blocks with a sparse index and a per-file Bloom filter — the on-disk format powering BigTable, Cassandra, and RocksDB.

Block size4 KB to 64 KB target
IndexSparse, in RAM; one entry per block
FilterBloom, ~10 bits/key, ~1% FP
Lookup cost0 or 1 disk IO per file
MutationNone — file is immutable
Used inBigTable, HBase, Cassandra, LevelDB, RocksDB

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How an SSTable is laid out

An SSTable is a flat byte sequence on disk. Reading the file from top to bottom you see: a stream of data blocks holding the sorted key-value pairs, then a Bloom-filter block, then an index block pointing at every data block, then a small footer with magic bytes and pointers to the index and meta-index. Nothing in the file is updated after it lands — the whole layout is designed around append-once, read-many.

SSTable on disk
┌─────────────────────────────┐  offset 0
│  Data block 1               │   keys [aardvark .. apple]
│  Data block 2               │   keys [apricot  .. banana]
│  Data block 3               │   keys [berry    .. cherry]
│  ...                        │
│  Data block N               │   keys [yam      .. zebra]
├─────────────────────────────┤
│  Meta block: Bloom filter   │   ~10 bits/key, 1% false positive
│  Meta block: properties     │   #keys, min/max, codec, MVCC
├─────────────────────────────┤
│  Index block                │   last-key + offset per data block
│  Meta-index block           │   names → offsets of meta blocks
├─────────────────────────────┤
│  Footer (53 bytes)          │   ptr(index), ptr(meta-index), magic
└─────────────────────────────┘  offset = file_size

A reader opening the file performs one seek to file_size - 53, parses the footer, then reads the index and Bloom filter into the block cache. Every subsequent lookup probes Bloom first, binary-searches the index second, and only then reads the one data block that could contain the key.

The three accelerators

Bloom filter. A bit array hashed by the key — answers "definitely not in this file" or "maybe in this file". Tuned to ~10 bits/key for a 1 percent false-positive rate. For a 64 MB SSTable holding a million keys, the filter is ~1.2 MB and lives in the block cache. The hot path that says "skip this file" takes a handful of cache-resident hashes.

Sparse index. One entry per data block, stored as the last key in that block plus its file offset. A 100 MB SSTable with 16 KB blocks has roughly 6,400 index entries. Binary search across those entries narrows a lookup down to a single block in O(log B) memory ops, with no disk IO.

Block restarts. Inside a data block, keys are stored prefix-compressed (each entry stores only the suffix differing from the previous). To allow binary search inside the block, every 16 entries restart with a full key and the block ends with an array of restart offsets. Binary search on restarts plus a short linear scan inside the picked restart group finds any key in microseconds.

Where SSTables win

Write-heavy workloads. Data lands in a sorted memtable first; flushes produce one big sequential file. No in-place mutation, so the storage engine never re-reads a page just to update it.
Compaction-driven engines. Background workers merge several smaller SSTables into one larger SSTable — sequential read plus sequential write. The format is designed for that streaming merge.
Cold storage tiers. Files can be compressed (Snappy, Zstd, LZ4) per block; cold SSTables on object storage stay tiny.
Snapshots and replication. Pinning an immutable file is free — replication just copies bytes. Cassandra's snapshot is a hard link; HBase's HFile replication ships finished files.

SSTables lose to B-trees when the workload is dominated by random updates that don't accumulate enough writes to flush as one sorted batch — then write amplification spikes during compaction.

SSTable vs other on-disk row stores

	SSTable	B-tree page	Heap file	Columnar (Parquet)	WAL segment	Hash file
Layout	Sorted blocks + index + Bloom	Sorted node tree	Unordered rows	Column chunks + page index	Append-only entries	Hash buckets
Mutability	Immutable	In-place updates	In-place updates	Immutable	Append-only	In-place buckets
Point lookup	Bloom + 1 IO	O(log N) IO	Full scan	Page index + IO	Full scan	O(1) average
Range scan	Sequential block read	Linked leaves	Full scan	Column scan	Sequential	Not supported
Write pattern	One big sequential	Random	Append or seek	One big sequential	Sequential	Random
Compression	Per-block	Per-page	None typical	Per-column-chunk	Optional	Limited
Used by	RocksDB, Cassandra, BigTable	InnoDB, Postgres	Postgres heap, MyISAM	Spark, Iceberg, Trino	Kafka, Postgres WAL	Berkeley DB hash, Riak Bitcask

The SSTable's central trade is: pay write amplification (the same byte may be rewritten through 5–7 LSM levels) and read amplification (lookups may touch several files), in exchange for never doing a random disk write. On NVMe SSDs and especially on object storage (S3, GCS), that trade is paying off harder than ever — random writes dominate the cost model and SSTables avoid them entirely.

What an SSTable actually costs

Block size 4-64 KB. RocksDB defaults to 4 KB; Cassandra to 64 KB. Smaller blocks reduce read amplification but inflate index and Bloom-filter memory. Production tuning is workload-driven.
Sparse index in memory, full data on disk. A 256 MB SSTable with 16 KB blocks holds 16,384 index entries (~1 MB of pinned index memory). The 255 MB of data stays on disk; reads pull in one block per hit.
Bloom filter ~10 bits/key for 1% FP. A million-key SSTable has a ~1.2 MB filter. RocksDB's full-filter mode keeps it in the block cache; partitioned filters split it across pages.
Compression ratio 2-5x. Per-block Snappy or LZ4 typically halves disk space at gigabytes-per-second decompression. Zstd improves the ratio at higher CPU cost.
Write amplification 10-30x. A key written once may flow through L0 → L1 → ... → L6, getting rewritten in a new SSTable at every level boundary. That's the LSM tax for never doing a random write.
Compaction throughput. RocksDB sustains 100-500 MB/s of compaction per node on commodity SSDs; that ceiling, not raw write throughput, is what limits ingest at scale.

JavaScript implementation (sketch)

// A toy in-memory SSTable. Real ones serialize to disk with prefix-compressed
// blocks, restart arrays, CRCs, and a binary footer. The structure is the same.

class BloomFilter {
  constructor(numKeys, bitsPerKey = 10) {
    this.size  = Math.max(64, numKeys * bitsPerKey);
    this.bits  = new Uint8Array(Math.ceil(this.size / 8));
    this.hashK = 7;                            // 10 bits/key → ~7 hashes
  }
  _hash(key, seed) {                            // FNV-style
    let h = seed | 0;
    for (let i = 0; i < key.length; i++) h = Math.imul(h ^ key.charCodeAt(i), 16777619);
    return (h >>> 0) % this.size;
  }
  add(key) {
    for (let i = 0; i < this.hashK; i++) {
      const b = this._hash(key, 2166136261 + i * 0x9e3779b1);
      this.bits[b >> 3] |= 1 << (b & 7);
    }
  }
  mightContain(key) {
    for (let i = 0; i < this.hashK; i++) {
      const b = this._hash(key, 2166136261 + i * 0x9e3779b1);
      if (!(this.bits[b >> 3] & (1 << (b & 7)))) return false;
    }
    return true;
  }
}

class SSTable {
  // entries: sorted [[key, value], ...]
  // blockSize: rough target — emit a new block after this many entries
  constructor(entries, blockSize = 16) {
    this.blocks = [];
    this.index  = [];                           // [lastKeyOfBlock, blockIdx]
    this.bloom  = new BloomFilter(entries.length);

    for (let i = 0; i < entries.length; i += blockSize) {
      const block = entries.slice(i, i + blockSize);
      this.blocks.push(block);
      this.index.push([block[block.length - 1][0], this.blocks.length - 1]);
      for (const [k] of block) this.bloom.add(k);
    }
    this.minKey = entries[0][0];
    this.maxKey = entries[entries.length - 1][0];
  }

  // Point lookup: Bloom → sparse index → one block scan.
  get(key) {
    if (key < this.minKey || key > this.maxKey) return undefined;
    if (!this.bloom.mightContain(key)) return undefined;          // skip the file

    // Binary search the sparse index for the first block whose lastKey ≥ key.
    let lo = 0, hi = this.index.length - 1;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.index[mid][0] < key) lo = mid + 1; else hi = mid;
    }
    const block = this.blocks[this.index[lo][1]];

    // Binary search inside the block.
    let i = 0, j = block.length - 1;
    while (i <= j) {
      const m = (i + j) >> 1, k = block[m][0];
      if (k === key) return block[m][1];
      if (k < key) i = m + 1; else j = m - 1;
    }
    return undefined;
  }

  // Range scan: walk blocks in order, yielding entries within [from, to].
  * scan(from, to) {
    for (const block of this.blocks) {
      if (block[block.length - 1][0] < from) continue;
      if (block[0][0] > to) break;
      for (const [k, v] of block) if (k >= from && k <= to) yield [k, v];
    }
  }
}

Real SSTables make every byte count: variable-byte length encoding, prefix-compressed keys, restart arrays inside blocks, per-block CRC32C, a 53-byte footer with a magic number for version detection, and a separately addressable meta-index so engines can grow new metadata without breaking older readers.

Python implementation — block writer with a footer

import struct
import hashlib
from dataclasses import dataclass, field

MAGIC = b'SSTB\x00\x01'                # 6-byte magic
FOOTER_FMT = '<QQQ6s'                  # idx_off, meta_off, bloom_off, magic
FOOTER_SIZE = struct.calcsize(FOOTER_FMT)  # 30 bytes; real RocksDB footer is 53

@dataclass
class SSTableWriter:
    path: str
    block_target: int = 4096           # bytes
    keys: list = field(default_factory=list)
    bloom_bits: bytearray = field(default_factory=lambda: bytearray(1 << 20))

    def _bloom_set(self, key: bytes) -> None:
        for seed in range(7):
            h = int.from_bytes(hashlib.blake2b(key, digest_size=4, person=seed.to_bytes(2,'little')).digest(),'little')
            b = h % (len(self.bloom_bits) * 8)
            self.bloom_bits[b >> 3] |= 1 << (b & 7)

    def write(self, entries: list[tuple[bytes, bytes]]) -> None:
        # entries must be sorted by key.
        with open(self.path, 'wb') as f:
            block_index = []           # [(last_key, block_offset, block_size)]
            block_buf = bytearray()
            block_first = entries[0][0] if entries else b''

            for key, value in entries:
                self.keys.append(key)
                self._bloom_set(key)
                rec = struct.pack('<II', len(key), len(value)) + key + value
                block_buf += rec
                if len(block_buf) >= self.block_target:
                    off = f.tell()
                    f.write(block_buf)
                    block_index.append((key, off, len(block_buf)))
                    block_buf = bytearray()

            if block_buf:                          # flush trailing block
                off = f.tell()
                f.write(block_buf)
                block_index.append((entries[-1][0], off, len(block_buf)))

            bloom_off = f.tell()
            f.write(self.bloom_bits)

            idx_off = f.tell()
            for last_key, off, size in block_index:
                f.write(struct.pack('<HQI', len(last_key), off, size) + last_key)

            meta_off = f.tell()
            f.write(struct.pack('<Q', len(self.keys)))
            f.write(struct.pack('<Q', len(self.bloom_bits)))

            footer = struct.pack(FOOTER_FMT, idx_off, meta_off, bloom_off, MAGIC)
            f.write(footer)

To open the file, a reader seeks to file_size - FOOTER_SIZE, parses pointers, mmaps or reads the index and Bloom filter, and is ready to serve point and range lookups. Every operation is at most one block read past the always-resident metadata.

Format variants you'll meet in production

BigTable / HBase HFile. The original SSTable design (Google, 2006). HFile v3 adds per-block CRC, MVCC timestamps, and partitioned data blocks. The Apache HBase reader is the canonical reference implementation outside Google.

Cassandra Big format / mc / nc. Multi-file unit: Data.db holds the sorted entries, Index.db the primary index, Filter.db the Bloom filter, Summary.db a sparse summary of the index, CompressionInfo.db per-chunk compression offsets, plus Statistics.db and TOC.txt. One logical SSTable, seven files. The 4.0 release moved to a "new" (nc) format with binary numeric encodings.

LevelDB / RocksDB .sst. Single-file format. Block-based table v3 supports prefix Bloom, partitioned indexes, partitioned filters, and zstd dictionaries. RocksDB also ships PlainTable for keys-in-memory, hash-indexed datasets.

ScyllaDB. Cassandra-compatible binary format, plus a parallel "MX" (MX_4) layout aligned to per-shard data partitioning. Format compatibility with Cassandra is enforced to allow online migrations.

CockroachDB Pebble. A Go reimplementation of RocksDB's format with backwards compatibility plus extensions for range deletions ("range tombstones"), range keys, and value separation (similar to WiscKey).

WiscKey / Badger / Titan. Key-value separation. Keys and small metadata live in the SSTable; large values live in a separate value log addressed by file offset. Compaction rewrites only the SSTable, saving bytes when values are large.

Common bugs and edge cases

Footer corruption. The whole file is unreadable if the trailing footer is partial or the magic number is wrong. Production engines double-write the footer or include CRCs on every metadata block.
Bloom-filter version skew. Upgrading the filter hash family between writer and reader makes false negatives possible — and false negatives in a Bloom filter mean lost rows. Engines always tag the filter block with its construction parameters.
Block-size pathology. A block size that's smaller than your average value blows up the per-block fixed cost. A block size much bigger than the read unit wastes IO on every lookup.
Tombstone GC across snapshots. A tombstone in an upper-level SSTable cannot be dropped by compaction while older versions of the same key persist below — otherwise the deletion appears to roll back. RocksDB tracks the "newest snapshot" and gates tombstone GC behind it.
Index memory ceiling. Pinning every SSTable's full index in RAM works at hundreds of GB of data per node; past that, partitioned indexes are mandatory or the block cache evicts hot data blocks to make room.
Large value amplification. A single 1 MB value embedded in an SSTable gets rewritten every time that file gets compacted. WiscKey-style value separation exists exactly for this.
Range deletes on point-Bloom files. Bloom filters only accelerate point lookups. A range tombstone (RocksDB) requires a separate range-deletion block, missed by classic Bloom checks.

Frequently asked questions

Why is an SSTable immutable?

Immutability is what makes LSM-tree storage cheap and safe. Once written, an SSTable is never edited in place. Updates and deletes are recorded as new entries in a fresher SSTable (deletes use a tombstone marker). This means readers and writers never contend for the same bytes, snapshots are free (just pin the file from being deleted), checksum verification is one-shot, and compaction can rewrite a file atomically by writing a new one and renaming. The cost is write amplification — the same key may be rewritten through many SSTables as it migrates down LSM levels.

What is the block layout inside an SSTable?

Sorted key-value pairs are packed into fixed-target-size data blocks, typically 4 KB to 64 KB (RocksDB defaults to 4 KB, Cassandra to 64 KB). Each block ends with a restart-point array so a binary search inside the block can decode prefix-compressed keys. After all data blocks comes an index block listing the last key in each data block plus its offset, then a Bloom-filter block, a properties block (compression codec, count, min/max key, creation time), a meta-index pointing to those metadata blocks, and a footer with a magic number and pointers to the meta-index and index. The footer is fixed-size so a single seek to file_end - 53 bootstraps the whole structure.

How does a point lookup work?

Reader opens the SSTable, reads the footer, loads the index and Bloom filter into memory (or page cache). For a key lookup: probe the Bloom filter — if it says "definitely not" (with ~1 percent false positive rate at 10 bits/key) the file is skipped entirely. Otherwise binary search the in-memory index to find the one data block that could contain the key, read that single 4-64 KB block from disk, decompress it, and binary search the restart points inside. Worst case: one disk read for the block. Bloom plus sparse index turns a multi-gigabyte file lookup into roughly one IO.

Why a sparse index and not a full index?

A full per-key index would be as large as the keys themselves, defeating the point of a compact disk layout. The sparse index stores one entry per block (so a 100 MB SSTable with 16 KB blocks has roughly 6400 index entries, a few hundred KB) and binary-searches that. The on-disk block is what holds the per-key precision, decoded only after the index narrows down which block to read. This three-level layout — Bloom filter, sparse index, dense block — keeps the hot path in RAM and exactly one IO on the cold path.

How does compaction use SSTables?

Compaction is a streaming merge of N input SSTables into M output SSTables. The compactor opens an iterator over each input (block-by-block), heap-merges them by key, drops shadowed older versions and expired tombstones, and writes the merged stream into fresh output SSTables sized to a target. Because input files are sorted and never modified, the merge is sequential reads plus sequential writes — disk-friendly. Output files are written to a temp name, fsync'd, and renamed atomically, so a crash during compaction never corrupts the database.

What are partitioned indexes and filters?

When SSTables grow into the gigabytes — RocksDB's max_bytes_for_level_base regularly reaches 256 MB to 1 GB at L6 — the top-level index itself becomes large enough to cost cache. Partitioned indexes split the index into many small index blocks and add a top-level index pointing at them, paged into the block cache like data blocks. Partitioned Bloom filters do the same for filter memory. This is the difference between "whole SSTable index pinned in memory" and "just the hot 1 percent", and it's how production engines scale to TB-per-node without exhausting RAM.