Systems

Log-Structured File Systems

Treat the whole disk as one append-only log — random writes become sequential ones

A log-structured file system treats the entire disk as one append-only log, turning random small writes into large sequential ones for fast writes — at the cost of a background segment cleaner that reclaims space.

Write patternSequential append
Write throughput≈ full disk bandwidth
Read lookupcheckpoint → imap → inode → data
Cleaning write cost2 / (1 − u)
OriginRosenblum & Ousterhout, 1991

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The intuition: stop seeking, start appending

Picture a 1980s file system trying to save your work. It updates a data block here, an inode three tracks away there, a directory entry on the far side of the platter, and a free-space bitmap somewhere else. Each of those is a small write to a different place, and on a spinning disk the head has to physically seek and wait for the platter to rotate under it before every one. A seek plus rotational delay is on the order of 5–10 milliseconds; the actual transfer of a 4 KB block takes tens of microseconds. You are spending 99% of the time moving the head and 1% moving data.

Mendel Rosenblum and John Ousterhout looked at this in 1991 and made a radical observation: memory was getting cheap, so reads would increasingly be served from a large RAM cache, which meant the disk's job was about to become almost entirely writes. If writes dominate, design the whole disk for writing. Their answer was the log-structured file system (LFS): never overwrite anything in place. Instead, buffer all pending changes — data blocks, the inodes that point to them, even the directory and map updates — and write them out together as one big sequential transfer at the end of a log that grows across the disk.

The slogan is: turn random writes into sequential writes. A thousand scattered 4 KB updates that would have cost a thousand seeks instead collapse into a handful of multi-megabyte streaming writes with no seeks at all. The disk runs at nearly its full sequential bandwidth — often an order of magnitude faster than the in-place file system on the same hardware.

How the log is laid out

The disk is divided into large fixed-size segments — typically 512 KB or 1 MB. Writing is always to the current segment; when it fills, you move to the next free segment. LFS buffers updates in memory into a segment-sized chunk and flushes the whole segment in one I/O. This is the key to sequentiality: you never write a partial segment to a random spot.

When a file's data block changes, LFS writes the new block to the log. But that means the inode that pointed at the old block is now stale, so the inode is also written to the log, pointing at the new block. That inode now lives at a new disk address — so anything that pointed at it is stale too. This recursive chase upward is the same copy-on-write discipline used by modern snapshotting file systems: changing a leaf forces a new copy of every block on the path up to the root.

In a normal Unix file system inodes live at fixed, computable locations, so you find inode N by arithmetic. In LFS inodes move every time they're written, so that arithmetic no longer works. LFS solves it with one extra level of indirection — the inode map (imap):

  checkpoint region   (fixed location, two copies)
        │
        ▼
     inode map         imap[N] → disk address of inode N
        │
        ▼
      inode            inode → disk addresses of data blocks
        │
        ▼
     data block

The imap is itself written into the log in small pieces (so it inherits the same fast sequential writes), and a tiny checkpoint region at a fixed, known location on disk records where the latest imap pieces are. To read a file you go: checkpoint region → imap → inode → data. Because the imap is small, the whole thing is cached in RAM, so in practice a read costs the same one-or-two disk I/Os as a traditional file system. The extra indirection costs almost nothing on reads and buys you the freedom to put everything else wherever the log happens to be.

Segment cleaning: the price of never overwriting

An append-only log on a finite disk has an obvious problem: when you overwrite a block, the new version goes to the head of the log and the old version becomes dead — but the dead block still occupies its old segment. Keep going and the disk fills with garbage. LFS reclaims it with a background process called the segment cleaner, which is essentially a copying garbage collector for disk segments:

Read several partially-dead segments into memory.
Identify which blocks are still live (the imap and per-segment segment summary blocks tell you which inode/offset each block belongs to, and whether it's still current).
Write only the live blocks forward, compactly, into a new clean segment at the head of the log.
Mark the now-empty source segments as free for reuse.

This is why "append-only" never literally runs out of room: the disk is a circular array of segments, and the cleaner keeps recycling segments behind the write head. The art is in the policy — which segments to clean and when. The original LFS paper found that a simple "clean the emptiest segments" greedy policy is surprisingly poor, and introduced a cost-benefit policy that also favors segments whose data is cold (unlikely to be overwritten soon), because cold data, once compacted, stays compacted.

When LFS-style design wins — and when it hurts

Write-heavy, small-write workloads. Metadata-intensive workloads (creating/deleting many small files) were LFS's original killer app — exactly where in-place file systems seek themselves to death.
Flash and SSDs. NAND flash can't overwrite a page in place; it must erase a whole block first. Appending sequentially and cleaning in bulk maps perfectly onto flash's erase-block model — which is why every SSD's flash translation layer is essentially a log-structured store, and why F2FS (Flash-Friendly File System, in the Linux kernel) is log-structured.
Fast crash recovery. The disk is already a log, so recovery replays forward from the last checkpoint instead of scanning the whole volume.
Free snapshots. Because old versions are never overwritten until cleaned, a consistent point-in-time snapshot is almost free.

The flip side: on a near-full disk with a read-modify-write transaction workload, the cleaner can dominate. Margo Seltzer's 1993/1995 measurements showed a pure LFS could lose more than half its throughput to cleaning when the disk was full and the data was being constantly overwritten. If your disk stays nearly full and hot, the cleaner thrashes.

LFS vs other on-disk layouts

	Log-structured (LFS / F2FS)	Journaling (ext4, NTFS)	Copy-on-write (ZFS, Btrfs)	Classic in-place (FFS/ext2)
Write pattern	Always sequential append	Journal sequential, then in-place	New blocks anywhere, CoW B-tree	In place, scattered
Overwrites live data?	Never	Yes, after journaling	Never (until freed)	Yes
Space reclamation	Segment cleaner (copying GC)	Implicit (overwrite)	Per-block free, ref-counted	Free bitmap
Double-write penalty	Only on cleaning	Yes (journal + final)	No	No
Crash recovery	Roll forward from checkpoint	Replay journal	Atomic root pointer swap	Full fsck scan
Weak spot	Cleaning on full/hot disk	Write amplification 2×	Fragmentation, RAM	Slow small writes
Real-world use	F2FS, NetBSD LFS, SSD FTLs	ext4, XFS, NTFS	ZFS, Btrfs, APFS, bcachefs	ext2, original BSD FFS

Journaling and LFS solve overlapping problems from opposite ends. Journaling keeps the familiar in-place layout and pays a roughly 2× write-amplification tax to guarantee consistency (write to journal, then write the real location). LFS avoids the double write in the common case by making the log be the file system — but inherits the cleaning tax instead. Copy-on-write file systems are the modern synthesis: log-like never-overwrite semantics, but with a tree structure that frees dead blocks individually rather than running a whole-segment cleaner.

What the numbers actually say

A seek dwarfs a transfer. Mechanical seek + rotational latency ≈ 5–10 ms; a 4 KB transfer ≈ 30–80 µs. So 1,000 random 4 KB writes cost ≈ 5–10 seconds of pure seeking in-place, but ≈ a few tens of milliseconds as one 4 MB sequential write under LFS — a 100×+ difference on write-bound metadata workloads.
The cleaning write-cost formula. To free a segment that is fraction u live, the cleaner reads the whole segment (cost 1) and writes back u of live data, and the freed 1 − u is then filled with new data — so total bytes moved (1 + u + (1 − u) = 2) divided by new data (1 − u) yields an effective write cost of 2 / (1 − u). At u = 0.2 that's 2.5×; at u = 0.8 it explodes to 10×. Keeping segments empty before cleaning is everything.
Segment size. Segments are large (≈512 KB–1 MB) precisely so that the per-write seek amortizes to near zero over the segment's transfer time. Original Sprite LFS used 512 KB or 1 MB segments.
Recovery is bounded by recent work, not disk size. Old BSD fsck scanned the entire disk (minutes on a large volume); LFS roll-forward replays only the segments written since the last checkpoint (typically taken every few seconds), so recovery is seconds.
SSD reality. Because flash erase blocks are large and can't be partially overwritten, the log-structured pattern is not optional on SSDs — it's baked into the firmware's flash translation layer, where the same 2/(1−u) garbage-collection economics drive write-amplification factors of anywhere from ~1.1× to 3×+ depending on free space.

JavaScript: a tiny in-memory LFS

This model captures the three load-bearing ideas — append-only writes, the inode map indirection, and segment cleaning — without any real disk. The "disk" is an array of segments; writing appends to the current segment and updates the imap; cleaning compacts live blocks forward.

const SEG_SIZE = 4;            // blocks per segment (tiny, for illustration)

class MiniLFS {
  constructor(numSegments = 8) {
    this.segs = Array.from({ length: numSegments }, () => []); // each: list of blocks
    this.head = 0;             // current segment being written
    this.imap = new Map();     // inodeId -> { seg, idx }  (current location)
    this.live = new Set();     // "seg:idx" addresses still pointed-to
  }

  _append(block) {
    if (this.segs[this.head].length >= SEG_SIZE) this._advanceHead();
    const seg = this.head, idx = this.segs[seg].length;
    this.segs[seg].push(block);
    return { seg, idx };
  }

  _advanceHead() {
    // find a free (empty) segment ahead of the head — circular
    for (let i = 1; i <= this.segs.length; i++) {
      const cand = (this.head + i) % this.segs.length;
      if (this.segs[cand].length === 0) { this.head = cand; return; }
    }
    this.clean();                       // no room: reclaim, then retry
    this._advanceHead();
  }

  // Overwrite (or create) a file's data. Old version becomes dead automatically.
  write(inodeId, data) {
    const old = this.imap.get(inodeId);
    if (old) this.live.delete(`${old.seg}:${old.idx}`);   // old copy now dead
    const addr = this._append({ inodeId, data });
    this.imap.set(inodeId, addr);
    this.live.add(`${addr.seg}:${addr.idx}`);
    return addr;
  }

  read(inodeId) {
    const a = this.imap.get(inodeId);          // imap indirection
    return a ? this.segs[a.seg][a.idx].data : null;
  }

  // Segment cleaner: pick the segment with the fewest live blocks,
  // copy its live blocks forward, then free it.
  clean() {
    let best = -1, bestLive = Infinity;
    for (let s = 0; s < this.segs.length; s++) {
      if (s === this.head || this.segs[s].length === 0) continue;
      const liveCount = this.segs[s].filter((_, i) => this.live.has(`${s}:${i}`)).length;
      if (liveCount < bestLive) { bestLive = liveCount; best = s; }
    }
    if (best === -1) throw new Error('disk full — nothing to clean');

    const survivors = [];
    this.segs[best].forEach((blk, i) => {
      if (this.live.has(`${best}:${i}`)) survivors.push(blk);
    });
    this.live.forEach(a => { if (a.startsWith(best + ':')) this.live.delete(a); });
    this.segs[best] = [];                       // segment is now free
    for (const blk of survivors) {              // re-append survivors to head
      const addr = this._append(blk);
      this.imap.set(blk.inodeId, addr);
      this.live.add(`${addr.seg}:${addr.idx}`);
    }
  }
}

const fs = new MiniLFS();
fs.write(1, 'hello');
fs.write(2, 'world');
fs.write(1, 'HELLO');     // overwrites inode 1 — old 'hello' is now dead garbage
console.log(fs.read(1));  // 'HELLO'  — imap always points at the newest copy

Notice that write never mutates an existing block: it appends a new one and re-points the imap, leaving the old block as dead weight until the cleaner runs. That single discipline is the entire trick.

Python: the cost-benefit cleaning policy

The original LFS paper's most important practical finding was that the cleaning policy matters more than the mechanism. Greedily cleaning the emptiest segment ignores that cold data, once compacted, won't be dirtied again — so it's worth cleaning a slightly-fuller-but-cold segment. The cost-benefit score Rosenblum and Ousterhout used is below.

def cost_benefit(u, age):
    """Score a segment for cleaning. Higher = better candidate.

    u    : fraction of the segment that is still live (0..1)
    age  : time since the youngest block in the segment was written

    Benefit = free space generated * how long it stays free  = (1 - u) * age
    Cost    = read the whole segment + write back the live part = 1 + u
    """
    if u >= 1.0:          # fully live — cleaning frees nothing
        return 0.0
    benefit = (1 - u) * age
    cost = 1 + u
    return benefit / cost

def write_cost(u):
    """Effective disk writes per unit of useful data, for a segment u live."""
    if u >= 1.0:
        return float('inf')
    return 2.0 / (1.0 - u)   # read 1 + write-back u + ... → 2/(1-u)

# A near-empty COLD segment beats a near-empty HOT one, because the freed
# space stays free longer. Compare a 20%-live cold segment to a 20%-live hot one:
print(round(cost_benefit(u=0.2, age=100), 2))   # cold, old  -> 66.67
print(round(cost_benefit(u=0.2, age=5),   2))   # hot,  young ->  3.33

# The write-cost cliff: cleaning a mostly-live segment is brutal.
for u in (0.1, 0.3, 0.5, 0.8, 0.95):
    print(f"u={u:>4}  write_cost={write_cost(u):.1f}x")
# u= 0.1  write_cost=2.2x
# u= 0.3  write_cost=2.9x
# u= 0.5  write_cost=4.0x
# u= 0.8  write_cost=10.0x
# u=0.95  write_cost=40.0x

The two functions together explain LFS's whole personality: write_cost says "only clean nearly-dead segments," and cost_benefit adds "and prefer the ones whose freed space will last." Get those right and LFS flies; get them wrong and the cleaner eats your bandwidth.

Variants and descendants worth knowing

Sprite LFS (1991). The original, built into the Sprite network operating system at Berkeley. Introduced segments, the imap, the checkpoint/roll-forward recovery scheme, and the cost-benefit cleaner.

BSD-LFS (1993). Margo Seltzer's reimplementation for 4.4BSD, still maintained in NetBSD. Her work also produced the influential critique showing where cleaning costs bite on full, transaction-heavy disks.

F2FS (2012, Linux). "Flash-Friendly File System" from Samsung — a log-structured design tuned for NAND flash. It uses multiple logs segregated by hotness (hot/warm/cold for data and node blocks) and aligns segments to the flash erase block, and it ships in Android and mainline Linux.

NILFS2 (Linux). A continuous-snapshotting log-structured file system: because old versions are kept until cleaned, it can expose checkpoints as mountable read-only snapshots.

LSM-trees. The log-structured merge tree (LevelDB, RocksDB, Cassandra) applies the same "buffer writes, flush sequentially, compact in the background" idea to a key-value index. Compaction is the LSM analog of segment cleaning, and it faces the identical write-amplification economics. See LSM-tree and log-structured merge.

SSD flash translation layers. Not a file system at all, but firmware — yet every modern SSD runs a log-structured store internally, with garbage collection and over-provisioned spare space to keep the live fraction u low.

Common pitfalls and edge cases

Forgetting to update the imap on overwrite. If a write appends a new block but leaves the imap pointing at the old address, reads silently return stale data. The imap re-point is the write.
Treating cleaning as cheap. The killer mistake at design time is ignoring the 2/(1−u) cliff. A benchmark on an empty disk shows LFS flying; the same system on a 90%-full, constantly-overwritten disk can be slower than the in-place file system it replaced.
Cleaning hot data. Compacting a segment whose blocks are about to be overwritten anyway wastes the copy entirely — those survivors die seconds later. Hot/cold segregation (separate logs) avoids it.
The checkpoint is a single point of truth. Corrupt the checkpoint region and you lose the entry point to the latest imap. LFS keeps two checkpoint regions and alternates between them so a crash mid-checkpoint can fall back to the previous good one.
Roll-forward must validate. On recovery you replay segments written after the checkpoint, but a crash may have left a half-written segment. Each segment carries a summary and the recovery code must verify a segment is complete before trusting its blocks.
Read fragmentation. Because related blocks of one file are written whenever they happen to change, a file's blocks can scatter across many segments, hurting sequential read throughput even though writes were fast. The cleaner can re-coalesce them, which is one more reason it earns its keep.

Frequently asked questions

Why is appending to a log faster than overwriting in place?

Overwriting in place scatters small writes across the disk, forcing the head to seek between every block — on a spinning disk a seek plus rotational delay costs several milliseconds, dwarfing the microseconds the transfer itself takes. A log appends every new block to the end of one growing region, so a thousand scattered 4 KB writes collapse into a few large sequential transfers with no seeks. LFS was designed to make a disk run at near its full sequential bandwidth on a write-heavy workload.

If the log only appends, how does it ever run out of free space?

It doesn't append forever. When you overwrite a block, LFS writes the new version at the head of the log and the old copy becomes dead — but the dead block still occupies disk space. A background segment cleaner reads partially-dead segments, copies the few live blocks forward into the log, and frees the whole segment for reuse. The disk is treated as a circular array of fixed-size segments, not an infinite tape.

How does LFS find an inode if inodes move every time they're written?

It adds one level of indirection: the inode map (imap). The imap maps inode number to the current disk address of that inode. The imap itself is written to the log in pieces, and a small fixed checkpoint region at a known location points to the latest imap pieces. So a lookup is: checkpoint region → imap → inode → data block. The imap is small enough to cache in memory, so reads usually cost the same one or two I/Os as a traditional file system.

What is the write-cost problem and why did it doom pure LFS for some workloads?

Cleaning is not free: to free a segment that is fraction u live, the cleaner must read it and write back u worth of live data, so the effective write cost is 2/(1−u). When a disk is nearly full, u stays high and the cleaner thrashes — Seltzer's 1995 measurements showed LFS cleaning could cut throughput by more than half on a transaction-heavy, near-full disk. Modern designs mitigate this with hot/cold segregation, threading, and keeping free space in reserve.

Is a log-structured file system the same as a copy-on-write file system like ZFS or Btrfs?

They share DNA — both never overwrite live data and both update metadata bottom-up — but they are not identical. Pure LFS writes everything into one strictly sequential log and reclaims space with a segment cleaner. CoW file systems like ZFS and Btrfs use a copy-on-write B-tree and free dead blocks individually rather than running a log cleaner. The log idea also lives on inside SSD flash translation layers and in LSM-tree key-value stores, which face the exact same erase-block and compaction problems.

Does LFS recover faster from a crash than a journaling file system?

Usually yes. Because the disk is already a log, recovery just replays forward from the last checkpoint instead of scanning the whole file system like the old fsck. LFS combines a periodic checkpoint with roll-forward of the segments written since that checkpoint, so recovery time is proportional to the work done since the last checkpoint, not the size of the disk — typically seconds rather than minutes.