Databases

Write-Ahead Logging (WAL)

Log the change before you touch the data — so a crash is always replayable

Write-ahead logging (WAL) is a durability protocol that forces a log record describing a change to durable storage before the changed data page is written, so a crash can always be recovered by replaying committed changes (redo) and rolling back uncommitted ones (undo).

  • Core invariantlog fsync before data
  • Commit cost1 sequential fsync
  • Recoveryredo + undo
  • Idempotency keyper-page LSN
  • Canonical algorithmARIES (1992)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The one rule that makes a crash survivable

Picture a database in the middle of crediting your account $100. The change is made to the in-memory copy of a page — fast, but volatile. If the machine loses power right now, that page evaporates. The naive fix is to write the page straight to disk on every commit, but data pages are scattered randomly across the file, and forcing dozens of 8 KB random writes per transaction is brutally slow.

Write-ahead logging breaks the deadlock with one ordering rule: before any dirty data page is flushed to durable storage, the log records that describe its changes must already be durable. Instead of writing the page, you write a tiny record — "page P1, offset 40, old value 0, new value 100" — to a sequential log file and fsync that. The expensive random page writes are deferred and batched; only the cheap sequential log is forced on the commit path.

The payoff is durability without the cost. If the system crashes, the data files may be a mess — some committed changes never made it to disk, some uncommitted changes did. But the log is the source of truth. Recovery replays it: committed changes are redone, uncommitted changes are undone, and the database returns to a consistent state as if the crash never happened. WAL is how COMMIT keeps its promise even when the lights go out.

The precise mechanism: LSNs, redo, undo

Every log record gets a Log Sequence Number (LSN) — a monotonically increasing offset into the log. The LSN is the spine of the whole protocol:

  • Each data page stores a pageLSN: the LSN of the most recent log record that modified it. This is how recovery knows whether a change is already on disk.
  • Each log record stores a prevLSN: a back-pointer to the previous record of the same transaction, forming a per-transaction chain used for rollback.
  • Recovery applies a record's redo to a page only when pageLSN < record.LSN. If pageLSN ≥ record.LSN the change is already reflected, so redo is skipped. This single comparison is what makes redo idempotent — you can replay the log twice (or crash mid-recovery and restart) and get the same answer.

The canonical recovery algorithm is ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), published by C. Mohan and colleagues at IBM in 1992. It runs three passes over the log:

  1. Analysis. Scan forward from the last checkpoint to reconstruct the transaction table (who was active at the crash) and the dirty-page table (which pages had un-flushed changes, and from which recLSN onward).
  2. Redo. Replay all changes — even those of transactions that later aborted — starting from the oldest recLSN. This restores the database to its exact pre-crash state, a property ARIES calls "repeating history." The pageLSN ≥ LSN test skips work already on disk.
  3. Undo. Roll back every transaction that was still active at the crash, walking each transaction's prevLSN chain backward. Each undo writes a Compensation Log Record (CLR) describing the rollback, so a crash during recovery never re-undoes the same action.

Cost is dominated by the log scan and the I/O it triggers. Analysis and redo are O(records since checkpoint) in the log; undo is O(work of in-flight transactions). Checkpoints bound the first two, so restart time is proportional to recent activity, not the lifetime of the database.

When WAL is the right tool (and when it isn't)

  • Any system that must survive crashes with committed data intact — relational databases (PostgreSQL, SQLite, MySQL/InnoDB), and the recovery layer beneath LSM-tree key-value stores.
  • Write-heavy OLTP, where deferring random page writes and using group commit turns thousands of transactions into a handful of sequential fsyncs.
  • Replication and point-in-time recovery. The log is a serialized history of every change, so streaming it to a replica gives physical/logical replication for free, and archiving it enables restoring to any past instant.
  • Filesystems, where the same idea is called journaling: write the metadata change to the journal, fsync, then update the inodes and bitmaps.

WAL is the wrong tool when durability simply doesn't matter — an in-memory cache, a scratch table, or analytics over data you can re-derive. It also doubles your write volume in the worst case (every change is written once to the log and once to the data file), so on pure-append, write-once workloads a log-structured design that is the log (an LSM-tree) can be cheaper.

WAL vs other durability strategies

Write-ahead log (WAL)Shadow pagingForce-at-commit (no log)FS journalingLSM-tree
Commit I/O1 sequential fsync of logfsync new pages + atomic pointer swaprandom write every dirty page1 journal fsync (metadata)append to memtable + WAL
Random writes deferred?Yes (no-force, steal)No — copies whole pagesNoYes (data) / journal (meta)Yes (background compaction)
Recovery mechanismredo + undo replaydiscard uncommitted shadownone needed (but no atomicity)replay/discard journalreplay WAL into memtable
Write amplification≈ 2× (log + data)high (copy-on-write pages)1× but slow & unsafe1× metadata, full datahigh (compaction rewrites)
Concurrency friendlinessexcellent (fine-grained records)poor (page-level COW)poorgoodexcellent (sequential)
Real-world usePostgreSQL, SQLite, InnoDBold SQLite, CouchDB-ishtoy/embedded onlyext4, XFS, NTFSRocksDB, Cassandra, LevelDB

WAL's "no-force, steal" buffer policy is the headline. No-force means a commit does not force its dirty pages to disk (redo covers them later). Steal means the buffer manager is free to evict and flush a dirty page of an uncommitted transaction to make room (undo covers it later). Together they decouple commit latency from page-write latency — the entire point of the log.

What the numbers actually say

  • fsync latency dominates commit. A durable flush costs roughly tens of microseconds on an enterprise SSD with power-loss protection, but closer to ~1 ms on a consumer NVMe drive without it (NAND program time dominates), and 1–10 ms on a cloud network disk. Group commit batches concurrent transactions into one fsync — so at 10,000 TPS with a 1 ms flush, you do ~1,000 fsyncs/s, not 10,000, and each transaction waits ~1 ms regardless of load.
  • Sequential vs random I/O. A modern NVMe SSD sustains multiple GB/s sequentially but is far slower on scattered 8 KB random writes (and an HDD is ~100× worse on random). Funneling commits through one sequential log instead of N random page writes is the bulk of WAL's speedup.
  • Write amplification ≈ 2×. Each change is written once to the log and (eventually) once to the data file. PostgreSQL's full_page_writes adds more: the first change to a page after a checkpoint logs the entire 8 KB page to defend against torn writes, so post-checkpoint WAL volume spikes.
  • Recovery is bounded by checkpoint interval. PostgreSQL's checkpoint_timeout defaults to 5 minutes; restart only replays WAL generated since the last checkpoint, so recovery is seconds-to-minutes, not hours.

JavaScript implementation

A minimal redo/undo WAL over an in-memory "disk." It demonstrates the write-ahead rule, the per-page LSN idempotency test, and crash recovery.

// "Durable" storage (survives a crash); data pages are volatile until flushed.
class Storage {
  constructor() { this.log = []; this.disk = {}; }       // disk: pageId -> { val, pageLSN }
  appendLog(rec) { rec.lsn = this.log.length; this.log.push(rec); return rec.lsn; } // + fsync IRL
}

class Database {
  constructor(storage) {
    this.s = storage;
    this.pages = {};            // volatile buffer pool: pageId -> { val, pageLSN }
    this.activeTxn = new Map(); // txnId -> lastLSN (prevLSN chain head)
  }
  _page(id) {
    if (!this.pages[id]) this.pages[id] = this.s.disk[id]
      ? { ...this.s.disk[id] } : { val: 0, pageLSN: -1 };
    return this.pages[id];
  }
  begin(txnId) { this.activeTxn.set(txnId, -1); }

  // WRITE-AHEAD RULE: log the change (with before/after images) BEFORE touching the page.
  write(txnId, pageId, newVal) {
    const p = this._page(pageId);
    const lsn = this.s.appendLog({
      type: 'update', txnId, pageId,
      before: p.val, after: newVal,
      prevLSN: this.activeTxn.get(txnId),
    });
    this.activeTxn.set(txnId, lsn);
    p.val = newVal; p.pageLSN = lsn;   // now safe to mutate the in-memory page
  }
  commit(txnId) {
    const lsn = this.s.appendLog({ type: 'commit', txnId, prevLSN: this.activeTxn.get(txnId) });
    this.activeTxn.delete(txnId);      // fsync here = transaction is durable
    return lsn;
  }
  // STEAL policy: a dirty page may be flushed any time, even mid-transaction.
  flush(pageId) { this.s.disk[pageId] = { ...this.pages[pageId] }; }
}

// Crash + ARIES-style recovery (redo all, then undo losers).
function recover(storage) {
  const disk = storage.disk, log = storage.log;
  const committed = new Set(log.filter(r => r.type === 'commit').map(r => r.txnId));

  // REDO pass: repeat history. Idempotent via pageLSN < lsn test.
  for (const r of log) {
    if (r.type !== 'update') continue;
    const pg = disk[r.pageId] || (disk[r.pageId] = { val: 0, pageLSN: -1 });
    if (pg.pageLSN < r.lsn) { pg.val = r.after; pg.pageLSN = r.lsn; }
  }
  // UNDO pass: roll back transactions with no commit record (walk newest->oldest).
  for (let i = log.length - 1; i >= 0; i--) {
    const r = log[i];
    if (r.type === 'update' && !committed.has(r.txnId)) {
      disk[r.pageId].val = r.before;   // restore before-image (a CLR would be logged here)
    }
  }
  return disk;
}

Two details mirror real systems. First, write() appends the log record before mutating the page — reverse the two lines and a crash between them loses a change with no log evidence. Second, redo runs over every update, including those of transactions that later aborted ("repeating history"), and the pageLSN < lsn guard makes replaying an already-applied change a no-op.

Python implementation

The same protocol, plus a rollback that walks the prevLSN chain — the path an explicit ROLLBACK or an aborted transaction takes at runtime, distinct from crash recovery.

class Storage:
    def __init__(self):
        self.log = []          # durable, sequential
        self.disk = {}         # pageId -> {"val": int, "pageLSN": int}
    def append(self, rec):
        rec["lsn"] = len(self.log)
        self.log.append(rec)   # + fsync() to be truly durable
        return rec["lsn"]

class Database:
    def __init__(self, storage):
        self.s = storage
        self.pages = {}        # volatile buffer pool
        self.active = {}       # txnId -> lastLSN

    def _page(self, pid):
        if pid not in self.pages:
            self.pages[pid] = dict(self.s.disk.get(pid, {"val": 0, "pageLSN": -1}))
        return self.pages[pid]

    def begin(self, txn): self.active[txn] = -1

    def write(self, txn, pid, new):
        p = self._page(pid)
        lsn = self.s.append({"type": "update", "txn": txn, "pid": pid,
                             "before": p["val"], "after": new,
                             "prevLSN": self.active[txn]})   # LOG FIRST
        self.active[txn] = lsn
        p["val"], p["pageLSN"] = new, lsn                    # then touch the page

    def commit(self, txn):
        self.s.append({"type": "commit", "txn": txn, "prevLSN": self.active[txn]})
        del self.active[txn]

    def rollback(self, txn):
        lsn = self.active[txn]
        while lsn is not None and lsn >= 0:
            rec = self.s.log[lsn]
            if rec["type"] == "update":
                self.s.append({"type": "clr", "txn": txn, "pid": rec["pid"],
                               "after": rec["before"], "undoNext": rec["prevLSN"]})
                self._page(rec["pid"])["val"] = rec["before"]
            lsn = rec.get("prevLSN")
        del self.active[txn]


def recover(storage):
    log, disk = storage.log, storage.disk
    committed = {r["txn"] for r in log if r["type"] == "commit"}
    for r in log:                                    # REDO (repeat history)
        if r["type"] in ("update", "clr"):
            pg = disk.setdefault(r["pid"], {"val": 0, "pageLSN": -1})
            if pg["pageLSN"] < r["lsn"]:
                pg["val"], pg["pageLSN"] = r["after"], r["lsn"]
    for r in reversed(log):                          # UNDO losers
        if r["type"] == "update" and r["txn"] not in committed:
            disk[r["pid"]]["val"] = r["before"]
    return disk

Note the asymmetry between rollback() and recover(). Runtime rollback walks one transaction's chain backward and logs CLRs; crash recovery first redoes everything to rebuild state, then undoes all losers. ARIES treats rollback and crash-undo as the same code path precisely because both emit CLRs.

Variants worth knowing

Physical vs logical vs physiological logging. Physical logging stores raw byte images ("page 5, bytes 40–43 = …") — simple and idempotent but verbose. Logical logging stores operations ("insert key 42") — compact but hard to make idempotent across a torn page. ARIES uses physiological logging: physical-to-a-page, logical-within-a-page, getting the best of both.

Group commit. Instead of one fsync per commit, the engine collects commit records arriving within a small window and flushes them with a single fsync. This is why throughput can climb with concurrency even though each fsync is slow — the cost is shared across the batch.

full_page_writes / double-write buffer. A crash can leave a page torn — half old, half new — because 8 KB DB pages exceed the device's atomic 512 B/4 KB sector. PostgreSQL logs the full page image on its first post-checkpoint change; MySQL/InnoDB writes pages twice (the doublewrite buffer) so a clean copy always exists.

Logical/streaming replication. Because the WAL is a total order of changes, you can decode it into a logical change stream and ship it to replicas or downstream systems. PostgreSQL's logical decoding and MySQL's binlog are WAL-derived.

WAL-mode in SQLite. SQLite's default rollback journal copies original pages out before overwriting; its newer WAL mode appends new page versions to a side file and lets readers run concurrently with a writer — a different, append-the-new-version flavor of the same durability idea.

Common bugs and edge cases

  • Touching the page before logging. The whole protocol is the order: append + fsync the log first. Flip it and a crash in the gap leaves an on-disk change with no log record to redo or undo.
  • Trusting a lying fsync. If the storage stack acknowledges fsync before bytes are on stable media (a drive's volatile write cache, an OS that drops the flush), an acknowledged commit can vanish. PostgreSQL's 2018 "fsyncgate" — where a failed fsync could be silently swallowed by the kernel — forced a panic-on-fsync-error change.
  • Non-idempotent redo. Forgetting the pageLSN ≥ LSN test (or using logical redo that isn't replay-safe) double-applies a change when recovery itself crashes and restarts. Redo must be safe to run any number of times.
  • Undo without compensation records. If rollback doesn't log CLRs, a crash mid-rollback re-runs the undo from scratch and can over-correct. CLRs make undo idempotent and bound by their undoNext pointer.
  • Torn pages ignored. Assuming an 8 KB page write is atomic. It isn't — you need full-page-writes, a doublewrite buffer, or atomic-write hardware, or recovery can apply a delta to a half-written page.
  • Unbounded log growth. The WAL can only be truncated once a checkpoint guarantees its changes are on the data files and no replica still needs it. Forgetting to recycle WAL segments fills the disk — a classic PostgreSQL outage when an archiver or replication slot stalls.

Frequently asked questions

What exactly is the write-ahead rule?

Before a modified (dirty) data page is written to durable storage, the log records describing that modification must already be durable. Equivalently: the log record's LSN must hit disk no later than the page that depends on it. This single ordering constraint is what makes every committed change recoverable and every uncommitted change reversible.

Why is WAL faster than flushing data pages on every commit?

A commit only needs to fsync the log — a single sequential append — instead of writing each randomly-scattered dirty page to disk. Sequential I/O on an SSD runs at gigabytes per second while random 8 KB page writes are far slower, and group commit lets one fsync durably persist hundreds of concurrent transactions, amortizing the ~1 ms flush cost.

What is the difference between redo and undo logging?

Redo logging stores the after-image so a committed change that never reached the data file can be re-applied. Undo logging stores the before-image so an uncommitted change that did reach the data file can be rolled back. ARIES uses both — physiological redo plus logical undo via compensation log records — which is why it supports a no-force, steal buffer policy.

What is an LSN and why does every page store one?

A Log Sequence Number is a monotonically increasing address of a record in the log. Each data page stores the LSN of the last log record that modified it (its pageLSN). During recovery, redo is skipped for a page whenever pageLSN ≥ the record's LSN, because the change is already reflected on disk — that test makes redo idempotent.

What does a checkpoint do in WAL recovery?

A checkpoint records which transactions are active and which dirty pages are pending, then lets recovery start scanning from a recent log position instead of the beginning of time. Without checkpoints, recovery would have to replay the entire log; with them, redo starts at the oldest dirty-page recLSN, bounding restart time to the work since the last checkpoint.

Can WAL lose committed data?

Only if the durability assumption breaks. If fsync returns before the bytes are truly on stable media — because a disk lies about its write cache, or the OS buffers the flush — a crash can lose an acknowledged commit. This is why databases use fsync/fdatasync (or O_DSYNC) and why fsync correctness bugs, like the 2018 "fsyncgate", are treated as critical.