Databases

MVCC — Multi-Version Concurrency Control

Readers don't block writers, writers don't block readers

MVCC (multi-version concurrency control) lets a database run readers and writers concurrently without blocking each other by keeping multiple versions of every row, each tagged with the transaction ID that created or deleted it. Readers walk the version chain to find the version visible to their snapshot.

  • Read latencyLock-free
  • Bloat per dead row~24-32 bytes header + payload
  • Default inPostgres, Oracle, MySQL InnoDB
  • Garbage collectorvacuum / undo
  • Conflicts onWrites only

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How MVCC works

The core idea is simple: don't overwrite rows, append new versions. Every row carries two hidden columns — call them xmin (the transaction that created this version) and xmax (the transaction that deleted or replaced it). When a transaction starts, it captures a snapshot: the set of transaction IDs that have already committed.

To answer "is this row version visible to me?" the database asks two questions:

  1. Did the transaction that created this version (xmin) commit before my snapshot was taken? If no, skip.
  2. Did the transaction that deleted this version (xmax) commit before my snapshot? If yes, skip; otherwise it's still alive for me.

An UPDATE is implemented as DELETE + INSERT: stamp xmax on the old version, write a fresh version with xmin = current transaction ID. A concurrent reader from an older snapshot still sees the old version, because its snapshot took xmin from a transaction that committed before the snapshot started, and the new xmax hasn't committed yet (or its transaction ID is too high).

The cost: the table now holds two copies of the row. Repeat across millions of updates and you get table bloat. The fix is a background process — vacuum in PostgreSQL, undo segments in Oracle — that reclaims versions no active transaction could still see.

MVCC vs 2PL vs OCC

MVCCTwo-phase locking (2PL)Optimistic (OCC)
Reader blocks writerNoYes (shared lock vs exclusive)No
Writer blocks readerNoYesNo
Conflict detectionPer-row version compareLock acquisitionValidation phase at commit
Storage overheadOld versions until vacuumLock tableRead/write sets per txn
Behavior under contentionWrite-write conflicts waitLots of blocking, deadlock riskHigh abort rate
Best forRead-heavy mixed workloadsPredictable, write-heavyLow contention, distributed
Used byPostgres, Oracle, InnoDBSQL Server (default), DB2FoundationDB, Spanner-like

Snapshot read in JavaScript

// Toy MVCC engine — illustrates xmin/xmax visibility, not production code.
class MVCC {
  constructor() {
    this.rows = new Map();        // key -> [{xmin, xmax, value}]
    this.committed = new Set();   // committed txn IDs
    this.nextTxnId = 1;
  }

  begin() {
    return {
      id: this.nextTxnId++,
      snapshot: new Set(this.committed)   // freeze visible txns
    };
  }

  read(txn, key) {
    const versions = this.rows.get(key) || [];
    // Walk versions newest-to-oldest; return the first one visible to txn.
    for (let i = versions.length - 1; i >= 0; i--) {
      const v = versions[i];
      const visibleCreate = v.xmin === txn.id || txn.snapshot.has(v.xmin);
      const visibleDelete = v.xmax === 0
                          || (v.xmax !== txn.id && !txn.snapshot.has(v.xmax));
      if (visibleCreate && visibleDelete) return v.value;
    }
    return undefined;
  }

  write(txn, key, value) {
    const versions = this.rows.get(key) || [];
    const live = versions[versions.length - 1];
    // Write-write conflict: another committed txn already replaced what we read.
    if (live && live.xmax !== 0 && live.xmax !== txn.id) {
      throw new Error('serialization failure');
    }
    if (live) live.xmax = txn.id;
    versions.push({ xmin: txn.id, xmax: 0, value });
    this.rows.set(key, versions);
  }

  commit(txn) { this.committed.add(txn.id); }

  // Reclaim versions no active transaction could still see.
  vacuum(activeSnapshots) {
    const oldestSnapshot = Math.min(...activeSnapshots.map(s => Math.min(...s)));
    for (const versions of this.rows.values()) {
      while (versions.length > 1 && versions[0].xmax !== 0
             && versions[0].xmax < oldestSnapshot) {
        versions.shift();   // version superseded before any live snapshot
      }
    }
  }
}

Real engines store version chains as a linked list of tuples on disk pages, not arrays in memory, but the visibility logic is the same.

Snapshot read in Python

from dataclasses import dataclass, field
from typing import Any

@dataclass
class Version:
    xmin: int
    xmax: int            # 0 if alive
    value: Any

class MVCC:
    def __init__(self):
        self.rows: dict[str, list[Version]] = {}
        self.committed: set[int] = set()
        self.next_txn = 1

    def begin(self):
        txn = {'id': self.next_txn, 'snapshot': frozenset(self.committed)}
        self.next_txn += 1
        return txn

    def read(self, txn, key):
        for v in reversed(self.rows.get(key, [])):
            visible_create = v.xmin == txn['id'] or v.xmin in txn['snapshot']
            visible_delete = v.xmax == 0 or (
                v.xmax != txn['id'] and v.xmax not in txn['snapshot']
            )
            if visible_create and visible_delete:
                return v.value
        return None

    def write(self, txn, key, value):
        chain = self.rows.setdefault(key, [])
        if chain and chain[-1].xmax != 0 and chain[-1].xmax != txn['id']:
            raise RuntimeError('serialization failure')
        if chain:
            chain[-1].xmax = txn['id']
        chain.append(Version(xmin=txn['id'], xmax=0, value=value))

    def commit(self, txn):
        self.committed.add(txn['id'])

    def vacuum(self, active_snapshots):
        if not active_snapshots:
            return
        horizon = min(min(s) for s in active_snapshots if s) if active_snapshots else 0
        for chain in self.rows.values():
            while len(chain) > 1 and chain[0].xmax and chain[0].xmax < horizon:
                chain.pop(0)

Variants — engine differences

  • PostgreSQL: in-place version chains. Old versions sit in the heap until autovacuum reclaims them. Bloat is the operational issue; VACUUM FULL rewrites the whole table. Visibility uses xmin, xmax, and a per-snapshot xip_list (transactions in flight at snapshot time).
  • Oracle: stores old versions in a separate UNDO tablespace, not in the main table. Reads reconstruct prior versions by walking undo. Undo retention controls how far back snapshots can read; the famous "ORA-01555: snapshot too old" fires when undo is reclaimed before a long query finishes.
  • MySQL InnoDB: hybrid. Latest version lives in the clustered index; old versions live in undo logs in the rollback segment. Purge thread is InnoDB's vacuum equivalent.
  • SQL Server: opt-in via READ COMMITTED SNAPSHOT or ALLOW_SNAPSHOT_ISOLATION; old versions live in tempdb's version store.
  • FoundationDB / TiDB / CockroachDB: distributed MVCC keyed on (key, commit timestamp). Garbage collection is configurable per-keyspace, often defaulting to ~24 hours of history.
  • SQLite WAL mode: a stripped-down MVCC where readers see the database as of when their transaction began; the WAL holds newer writes invisible to old readers.

What MVCC costs

The win is read throughput: a Postgres replica serves thousands of concurrent read transactions while a writer churns through updates, because reads never wait for write locks. The catch is bloat. A row updated 100 times has 100 versions; until vacuum runs, every index lookup may walk through dead tuples to find the live one.

Numbers from a typical Postgres workload: an update-heavy table with autovacuum throttled can balloon to 5-10× its live size, slowing sequential scans by the same factor. On a busy 1TB table a full vacuum run touches hundreds of GB of pages over several hours. Tuning autovacuum_vacuum_scale_factor and autovacuum_naptime is one of the most common Postgres hot spots.

A second cost: 32-bit transaction ID wraparound. Postgres's XID cycles every ~4 billion transactions; if autovacuum can't freeze old rows fast enough, the database hits emergency shutdown to prevent silent data loss.

Common bugs and edge cases

  • Long-running transaction blocks vacuum. A 6-hour analytics query on a 1TB OLTP database can prevent vacuum from reclaiming any version newer than 6 hours, bloating tables disastrously. Set idle_in_transaction_session_timeout and watch pg_stat_activity.
  • Index bloat. Postgres indexes also accumulate dead entries because every update creates a new heap tuple and a new index entry. REINDEX CONCURRENTLY or HOT updates (when the indexed column doesn't change) are the mitigations.
  • Hot-row write contention. MVCC eliminates read-write blocking but not write-write. A counter row updated by every transaction becomes a serialization bottleneck regardless of isolation level.
  • Snapshot too old (Oracle / Postgres). When undo / dead versions are reclaimed before a long-running query finishes, the query aborts because it can no longer reconstruct its snapshot.
  • Write skew at snapshot isolation. Each transaction reads a consistent snapshot and writes a different row; both commit even though serializability would have forbidden it. Postgres SSI catches this; plain snapshot isolation does not.
  • XID wraparound. Underprovisioned vacuum on a high-write Postgres cluster can hit the 2-billion-transaction warning level, forcing emergency single-user-mode vacuum.

When MVCC is the right choice

  • Read-heavy workloads where blocking reads on writes would crater latency (analytics over OLTP).
  • Workloads needing snapshot isolation — repeatable reads without locking out writers.
  • Replication: streaming replicas trivially tail the version chain.

Skip MVCC and prefer pure 2PL when bloat is more painful than blocking, when writes vastly outnumber reads on the same rows, or when the workload is a single-writer (SQLite-style) embedded use case.

Frequently asked questions

Why does MVCC make readers not block writers?

Because every read sees a consistent snapshot of row versions that existed at the moment it started. A concurrent UPDATE creates a new row version with a higher transaction ID, but the reader's snapshot ignores it. No locks are needed for the read — readers and writers operate on different physical rows.

What is xmin and xmax in PostgreSQL?

Every row in PostgreSQL has hidden xmin (the transaction ID that inserted it) and xmax (the transaction ID that deleted or replaced it) columns. A row is visible to your transaction if xmin committed before your snapshot started AND xmax is either zero or hasn't committed yet.

What is vacuum and why does it matter?

Vacuum is the garbage collector that reclaims dead row versions — versions that are no longer visible to any active transaction. Without vacuum, the table grows forever (table bloat), index lookups slow down, and disk fills up. PostgreSQL runs autovacuum continuously; tuning it is one of the most common production hot spots.

Does MVCC eliminate locks entirely?

No. Two transactions trying to update the same row still conflict — the second waits or aborts. MVCC eliminates read-write conflicts but not write-write conflicts. SELECT FOR UPDATE explicitly takes row locks for read-modify-write patterns.

Why is a long-running transaction expensive in MVCC?

Vacuum can only remove a row version if no active transaction could still see it. A 10-hour analytics query holds the snapshot horizon open for 10 hours, so 10 hours of dead row versions accumulate and can't be reclaimed. Tables bloat, queries slow, and emergency intervention follows.

Is MVCC the same as snapshot isolation?

MVCC is the storage and visibility mechanism. Snapshot isolation is the isolation level you can implement on top of it. Most MVCC databases use it to provide snapshot isolation, but the same machinery can also support read-committed (with a fresh snapshot per statement) or serializable (Postgres SSI layers cycle detection on top).