Databases

Write-Ahead Log (WAL)

Log first, mutate later, recover always

A write-ahead log records every change a database is about to make to durable storage before that change is applied to data files. It's the mechanism that makes COMMIT mean something even when the lights go out — and it's the backbone of replication, point-in-time recovery, and group commit.

  • Sequential write cost~1ms fsync (SSD)
  • Group commit speedup~100×
  • Postgres WAL segment16 MB default
  • Recovery modelARIES (redo + undo)
  • DrivesReplication, PITR

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a write-ahead log works

The rule is one sentence: write the log before you write the data. When a transaction modifies a row, the database first appends a log record describing the change to the WAL, then (eventually) updates the in-memory page, then (much later) flushes that page to the data file. COMMIT blocks only on the WAL fsync — not on the data file write.

This rearrangement does three things at once:

  1. Cheap durability. The WAL is one file written sequentially. One fsync makes a transaction durable, regardless of how many random pages it touched.
  2. Crash recovery. If the server dies before data pages are flushed, replay the WAL from the last checkpoint. Each redo record reapplies its change idempotently.
  3. Replication. Stream the WAL to a replica, replay it there. Postgres streaming replication, MySQL binary log, Kafka log compaction — all variants of the same idea.

The protocol that ties it together is called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics, IBM 1992). ARIES log records carry an LSN (log sequence number); each data page tracks the LSN of its last applied change. Recovery walks the log from the last checkpoint, redoing only changes whose LSN exceeds the page's current LSN. Then a second pass undoes any in-flight transactions that hadn't committed.

WAL vs shadow paging vs in-place writes

Write-ahead logShadow pagingIn-place (no log)
Commit cost1 sequential fsyncPage rewrites + pointer flipRandom fsyncs of all dirty pages
Crash safetyReplay log to last checkpointOld pages still valid until commitNone — corruption likely
Random write amplificationLowHigh (full-page copies)High
Replication friendlyStream the logHard — page-state diffsNone
Storage costWAL retention2× page spaceNone
Used byPostgres, InnoDB, SQLite WAL modeSQLite rollback journal, Lotus NotesMemory caches, ephemeral stores
Production fitOLTP, replicated databasesSingle-host embeddedCaches that tolerate data loss

Redo + undo logging in JavaScript

// Toy WAL with redo + undo records and group commit.
class WAL {
  constructor(fsync) {
    this.records = [];           // appended sequentially
    this.lsn = 0;
    this.flushedLsn = 0;
    this.pending = [];           // commits awaiting fsync
    this.fsync = fsync;          // a function: returns a promise once durable
  }

  append(record) {
    record.lsn = ++this.lsn;
    this.records.push(record);
    return record.lsn;
  }

  // Group commit: many concurrent commits batch into one fsync.
  commit(txnId) {
    const lsn = this.append({ type: 'COMMIT', txnId });
    return new Promise((resolve) => this.pending.push({ lsn, resolve }))
      .then(() => this.maybeFlush());
  }

  async maybeFlush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    await this.fsync();          // single disk sync for the entire batch
    this.flushedLsn = batch[batch.length - 1].lsn;
    batch.forEach(p => p.resolve(p.lsn));
  }

  // Crash recovery: redo committed work, undo in-flight work.
  recover(pages) {
    const committed = new Set();
    for (const r of this.records) {
      if (r.type === 'COMMIT') committed.add(r.txnId);
    }
    // Pass 1 — REDO: re-apply committed changes whose page LSN is older.
    for (const r of this.records) {
      if (r.type === 'UPDATE' && committed.has(r.txnId)
          && (pages[r.key]?.lsn ?? 0) < r.lsn) {
        pages[r.key] = { value: r.newValue, lsn: r.lsn };
      }
    }
    // Pass 2 — UNDO: roll back transactions that never committed.
    for (let i = this.records.length - 1; i >= 0; i--) {
      const r = this.records[i];
      if (r.type === 'UPDATE' && !committed.has(r.txnId)) {
        pages[r.key] = { value: r.oldValue, lsn: r.lsn };
      }
    }
    return pages;
  }
}

// Usage: many writers, one fsync per ~10ms window
const wal = new WAL(() => new Promise(r => setTimeout(r, 1)));  // ~1ms fsync
wal.append({ type: 'UPDATE', txnId: 7, key: 'x',
             oldValue: 'a', newValue: 'b' });
await wal.commit(7);

The key trick is maybeFlush: every concurrent caller of commit piles into pending, and the next fsync drains the entire queue. With 1000 concurrent writers and 1ms fsync, you get ~1000 commits per fsync window — group commit's ~100× throughput win in five lines of code.

Redo + undo logging in Python

import asyncio
from dataclasses import dataclass
from typing import Any

@dataclass
class LogRecord:
    lsn: int
    type: str            # 'UPDATE' or 'COMMIT'
    txn_id: int
    key: str | None = None
    old_value: Any = None
    new_value: Any = None

class WAL:
    def __init__(self, fsync_ms: float = 1.0):
        self.records: list[LogRecord] = []
        self.lsn = 0
        self.flushed_lsn = 0
        self.pending: list[tuple[int, asyncio.Future]] = []
        self.fsync_ms = fsync_ms
        self._flush_task: asyncio.Task | None = None

    def append(self, **fields) -> int:
        self.lsn += 1
        self.records.append(LogRecord(lsn=self.lsn, **fields))
        return self.lsn

    async def commit(self, txn_id: int) -> int:
        lsn = self.append(type='COMMIT', txn_id=txn_id)
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        self.pending.append((lsn, fut))
        if self._flush_task is None or self._flush_task.done():
            self._flush_task = asyncio.create_task(self._flush())
        return await fut

    async def _flush(self):
        await asyncio.sleep(self.fsync_ms / 1000)   # simulate fsync
        batch, self.pending = self.pending, []
        if batch:
            self.flushed_lsn = batch[-1][0]
            for lsn, fut in batch:
                fut.set_result(lsn)

    def recover(self, pages: dict) -> dict:
        committed = {r.txn_id for r in self.records if r.type == 'COMMIT'}
        # REDO pass — replay committed work whose page LSN trails the log.
        for r in self.records:
            if r.type == 'UPDATE' and r.txn_id in committed:
                if pages.get(r.key, {}).get('lsn', 0) < r.lsn:
                    pages[r.key] = {'value': r.new_value, 'lsn': r.lsn}
        # UNDO pass — revert transactions that never reached COMMIT.
        for r in reversed(self.records):
            if r.type == 'UPDATE' and r.txn_id not in committed:
                pages[r.key] = {'value': r.old_value, 'lsn': r.lsn}
        return pages

Variants — how engines implement WAL

  • PostgreSQL WAL: redo-only logging, 16MB segment files in pg_wal/. Streaming replication tails the WAL byte-for-byte. archive_command ships completed segments to remote storage for point-in-time recovery. Full-page writes after each checkpoint protect against torn pages.
  • InnoDB redo log: fixed-size circular log files (default 48MB × 2 in MySQL 8). Plus a separate undo log in the rollback segment for transaction rollback. Group commit is on by default; innodb_flush_log_at_trx_commit=1 is the durable setting.
  • SQLite journal mode: the legacy "rollback journal" actually does undo logging — copy original pages to a side file, mutate the data file, delete journal on success. Rolls back on crash. SQLite's newer "WAL mode" inverts this to a true write-ahead log and unlocks concurrent readers.
  • MongoDB / WiredTiger: per-collection journal with checkpoints every 60s; group commit window is 50ms by default.
  • Etcd / Raft-based stores: the Raft log doubles as the WAL. Each entry is replicated to a quorum before the leader applies it.
  • RocksDB / LSM trees: WAL fronts an in-memory memtable; on flush the memtable becomes an immutable SSTable, after which the log segment can be deleted.

What WAL costs (and saves)

The fsync is the headline. On a typical NVMe SSD, an fsync takes ~0.5-1ms; on a cloud-attached EBS volume with sync barriers, more like 5-10ms. Without group commit, that bounds throughput at 1000-2000 tx/s. With group commit batching hundreds of transactions per fsync, the same hardware sustains 100k+ tx/s — a clean two-order-of-magnitude win for free, just by waiting a millisecond before flushing.

Storage cost: the WAL grows at roughly 1.5-3× the data churn (overheads + full-page writes). A database doing 100GB of inserts per day generates 150-300GB of WAL. Postgres deletes WAL segments past the last checkpoint and any active replication slot; an orphaned replication slot is the classic way to fill a disk with WAL.

Recovery time scales with the distance from the last checkpoint. Postgres defaults to a checkpoint every 5 minutes or 1GB of WAL, whichever first. Tuning checkpoint_timeout and max_wal_size trades steady-state I/O smoothness against worst-case recovery time.

Common bugs and edge cases

  • Torn pages. An OS page is typically 4KB; a database page is 8KB. A crash mid-write can leave half-old, half-new bytes on disk. Postgres mitigates with full-page-image WAL records after each checkpoint; ext4 with data=ordered covers most of the rest.
  • fsync ordering. Some filesystems and controllers reorder writes; the WAL fsync must complete before the corresponding commit acknowledgement returns. The 2018 "fsyncgate" Postgres bug exposed Linux silently dropping fsync errors.
  • Disk-full during WAL write. Many systems handle this by panicking — write the log, or stop everything. Better than the alternative of returning success on a non-durable commit.
  • Replication slot orphaned. Postgres holds onto WAL until every active replication slot has consumed it. A dropped replica with a forgotten slot fills the WAL volume in days.
  • Long checkpoint pauses. Old engines did "stop-the-world" checkpoints. Modern fuzzy/incremental checkpoints spread page flushes across the interval but can starve foreground I/O if tuned aggressively.
  • Skipping fsync for speed. synchronous_commit=off trades a few seconds of committed data for throughput. Acceptable for telemetry, dangerous for ledgers.

When you need a WAL

  • Anything that calls itself a database and persists user data.
  • Replicated systems where followers must apply the leader's changes in order.
  • Systems offering point-in-time recovery: ship and replay WAL up to a chosen LSN.

Skip WAL only when data loss is genuinely fine — pure caches, ephemeral session stores, message brokers running with acks=none. For everything else, WAL is the smallest unit of database that can survive a crash.

Frequently asked questions

Why is a write-ahead log faster than writing directly to the data file?

Because the WAL is appended sequentially while data files are scattered random writes. A single fsync of a sequential log is dramatically cheaper than fsyncing dozens of randomly placed pages. Plus, the actual data pages can be flushed lazily in the background — the commit only needs the log durable.

What's the difference between redo and undo log records?

A redo record contains enough info to re-apply a change if the data page wasn't flushed before crash. An undo record contains the prior value, so an in-flight transaction can be rolled back. ARIES-style recovery uses both. PostgreSQL uses redo only; it relies on MVCC version chains for rollback.

What is group commit?

Instead of fsyncing the WAL on every transaction's commit, the database batches multiple in-flight commits into a single fsync. With 1ms fsync cost, group commit raises throughput from ~1000 tx/s to 100,000+ tx/s under concurrency by amortizing the disk sync.

Can I disable fsync to make commits faster?

You can — and you'll lose durability. PostgreSQL's fsync=off and synchronous_commit=off both make commits faster but a power loss can lose the last few seconds of committed transactions. It's the right knob for non-durable caches and replicas; never for the primary record of money.

How does crash recovery use the WAL?

On startup the database scans the WAL from the last checkpoint forward. For each committed transaction, redo log records are applied to bring data pages up to date. For incomplete transactions, undo records (if any) roll back partial changes. ARIES formalizes this as redo-then-undo recovery.