Databases
MVCC — Multi-Version Concurrency Control
Readers don't block writers, writers don't block readers
MVCC (multi-version concurrency control) lets a database run readers and writers concurrently without blocking each other by keeping multiple versions of every row, each tagged with the transaction ID that created or deleted it. Readers walk the version chain to find the version visible to their snapshot.
- Read latencyLock-free
- Bloat per dead row~24-32 bytes header + payload
- Default inPostgres, Oracle, MySQL InnoDB
- Garbage collectorvacuum / undo
- Conflicts onWrites only
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How MVCC works
The core idea is simple: don't overwrite rows, append new versions. Every row carries two hidden columns — call them xmin (the transaction that created this version) and xmax (the transaction that deleted or replaced it). When a transaction starts, it captures a snapshot: the set of transaction IDs that have already committed.
To answer "is this row version visible to me?" the database asks two questions:
- Did the transaction that created this version (
xmin) commit before my snapshot was taken? If no, skip. - Did the transaction that deleted this version (
xmax) commit before my snapshot? If yes, skip; otherwise it's still alive for me.
An UPDATE is implemented as DELETE + INSERT: stamp xmax on the old version, write a fresh version with xmin = current transaction ID. A concurrent reader from an older snapshot still sees the old version, because its snapshot took xmin from a transaction that committed before the snapshot started, and the new xmax hasn't committed yet (or its transaction ID is too high).
The cost: the table now holds two copies of the row. Repeat across millions of updates and you get table bloat. The fix is a background process — vacuum in PostgreSQL, undo segments in Oracle — that reclaims versions no active transaction could still see.
MVCC vs 2PL vs OCC
| MVCC | Two-phase locking (2PL) | Optimistic (OCC) | |
|---|---|---|---|
| Reader blocks writer | No | Yes (shared lock vs exclusive) | No |
| Writer blocks reader | No | Yes | No |
| Conflict detection | Per-row version compare | Lock acquisition | Validation phase at commit |
| Storage overhead | Old versions until vacuum | Lock table | Read/write sets per txn |
| Behavior under contention | Write-write conflicts wait | Lots of blocking, deadlock risk | High abort rate |
| Best for | Read-heavy mixed workloads | Predictable, write-heavy | Low contention, distributed |
| Used by | Postgres, Oracle, InnoDB | SQL Server (default), DB2 | FoundationDB, Spanner-like |
Snapshot read in JavaScript
// Toy MVCC engine — illustrates xmin/xmax visibility, not production code.
class MVCC {
constructor() {
this.rows = new Map(); // key -> [{xmin, xmax, value}]
this.committed = new Set(); // committed txn IDs
this.nextTxnId = 1;
}
begin() {
return {
id: this.nextTxnId++,
snapshot: new Set(this.committed) // freeze visible txns
};
}
read(txn, key) {
const versions = this.rows.get(key) || [];
// Walk versions newest-to-oldest; return the first one visible to txn.
for (let i = versions.length - 1; i >= 0; i--) {
const v = versions[i];
const visibleCreate = v.xmin === txn.id || txn.snapshot.has(v.xmin);
const visibleDelete = v.xmax === 0
|| (v.xmax !== txn.id && !txn.snapshot.has(v.xmax));
if (visibleCreate && visibleDelete) return v.value;
}
return undefined;
}
write(txn, key, value) {
const versions = this.rows.get(key) || [];
const live = versions[versions.length - 1];
// Write-write conflict: another committed txn already replaced what we read.
if (live && live.xmax !== 0 && live.xmax !== txn.id) {
throw new Error('serialization failure');
}
if (live) live.xmax = txn.id;
versions.push({ xmin: txn.id, xmax: 0, value });
this.rows.set(key, versions);
}
commit(txn) { this.committed.add(txn.id); }
// Reclaim versions no active transaction could still see.
vacuum(activeSnapshots) {
const oldestSnapshot = Math.min(...activeSnapshots.map(s => Math.min(...s)));
for (const versions of this.rows.values()) {
while (versions.length > 1 && versions[0].xmax !== 0
&& versions[0].xmax < oldestSnapshot) {
versions.shift(); // version superseded before any live snapshot
}
}
}
}
Real engines store version chains as a linked list of tuples on disk pages, not arrays in memory, but the visibility logic is the same.
Snapshot read in Python
from dataclasses import dataclass, field
from typing import Any
@dataclass
class Version:
xmin: int
xmax: int # 0 if alive
value: Any
class MVCC:
def __init__(self):
self.rows: dict[str, list[Version]] = {}
self.committed: set[int] = set()
self.next_txn = 1
def begin(self):
txn = {'id': self.next_txn, 'snapshot': frozenset(self.committed)}
self.next_txn += 1
return txn
def read(self, txn, key):
for v in reversed(self.rows.get(key, [])):
visible_create = v.xmin == txn['id'] or v.xmin in txn['snapshot']
visible_delete = v.xmax == 0 or (
v.xmax != txn['id'] and v.xmax not in txn['snapshot']
)
if visible_create and visible_delete:
return v.value
return None
def write(self, txn, key, value):
chain = self.rows.setdefault(key, [])
if chain and chain[-1].xmax != 0 and chain[-1].xmax != txn['id']:
raise RuntimeError('serialization failure')
if chain:
chain[-1].xmax = txn['id']
chain.append(Version(xmin=txn['id'], xmax=0, value=value))
def commit(self, txn):
self.committed.add(txn['id'])
def vacuum(self, active_snapshots):
if not active_snapshots:
return
horizon = min(min(s) for s in active_snapshots if s) if active_snapshots else 0
for chain in self.rows.values():
while len(chain) > 1 and chain[0].xmax and chain[0].xmax < horizon:
chain.pop(0)
Variants — engine differences
- PostgreSQL: in-place version chains. Old versions sit in the heap until autovacuum reclaims them. Bloat is the operational issue;
VACUUM FULLrewrites the whole table. Visibility usesxmin,xmax, and a per-snapshotxip_list(transactions in flight at snapshot time). - Oracle: stores old versions in a separate UNDO tablespace, not in the main table. Reads reconstruct prior versions by walking undo. Undo retention controls how far back snapshots can read; the famous "ORA-01555: snapshot too old" fires when undo is reclaimed before a long query finishes.
- MySQL InnoDB: hybrid. Latest version lives in the clustered index; old versions live in undo logs in the rollback segment. Purge thread is InnoDB's vacuum equivalent.
- SQL Server: opt-in via
READ COMMITTED SNAPSHOTorALLOW_SNAPSHOT_ISOLATION; old versions live in tempdb's version store. - FoundationDB / TiDB / CockroachDB: distributed MVCC keyed on (key, commit timestamp). Garbage collection is configurable per-keyspace, often defaulting to ~24 hours of history.
- SQLite WAL mode: a stripped-down MVCC where readers see the database as of when their transaction began; the WAL holds newer writes invisible to old readers.
What MVCC costs
The win is read throughput: a Postgres replica serves thousands of concurrent read transactions while a writer churns through updates, because reads never wait for write locks. The catch is bloat. A row updated 100 times has 100 versions; until vacuum runs, every index lookup may walk through dead tuples to find the live one.
Numbers from a typical Postgres workload: an update-heavy table with autovacuum throttled can balloon to 5-10× its live size, slowing sequential scans by the same factor. On a busy 1TB table a full vacuum run touches hundreds of GB of pages over several hours. Tuning autovacuum_vacuum_scale_factor and autovacuum_naptime is one of the most common Postgres hot spots.
A second cost: 32-bit transaction ID wraparound. Postgres's XID cycles every ~4 billion transactions; if autovacuum can't freeze old rows fast enough, the database hits emergency shutdown to prevent silent data loss.
Common bugs and edge cases
- Long-running transaction blocks vacuum. A 6-hour analytics query on a 1TB OLTP database can prevent vacuum from reclaiming any version newer than 6 hours, bloating tables disastrously. Set
idle_in_transaction_session_timeoutand watchpg_stat_activity. - Index bloat. Postgres indexes also accumulate dead entries because every update creates a new heap tuple and a new index entry.
REINDEX CONCURRENTLYor HOT updates (when the indexed column doesn't change) are the mitigations. - Hot-row write contention. MVCC eliminates read-write blocking but not write-write. A counter row updated by every transaction becomes a serialization bottleneck regardless of isolation level.
- Snapshot too old (Oracle / Postgres). When undo / dead versions are reclaimed before a long-running query finishes, the query aborts because it can no longer reconstruct its snapshot.
- Write skew at snapshot isolation. Each transaction reads a consistent snapshot and writes a different row; both commit even though serializability would have forbidden it. Postgres SSI catches this; plain snapshot isolation does not.
- XID wraparound. Underprovisioned vacuum on a high-write Postgres cluster can hit the 2-billion-transaction warning level, forcing emergency single-user-mode vacuum.
When MVCC is the right choice
- Read-heavy workloads where blocking reads on writes would crater latency (analytics over OLTP).
- Workloads needing snapshot isolation — repeatable reads without locking out writers.
- Replication: streaming replicas trivially tail the version chain.
Skip MVCC and prefer pure 2PL when bloat is more painful than blocking, when writes vastly outnumber reads on the same rows, or when the workload is a single-writer (SQLite-style) embedded use case.
Frequently asked questions
Why does MVCC make readers not block writers?
Because every read sees a consistent snapshot of row versions that existed at the moment it started. A concurrent UPDATE creates a new row version with a higher transaction ID, but the reader's snapshot ignores it. No locks are needed for the read — readers and writers operate on different physical rows.
What is xmin and xmax in PostgreSQL?
Every row in PostgreSQL has hidden xmin (the transaction ID that inserted it) and xmax (the transaction ID that deleted or replaced it) columns. A row is visible to your transaction if xmin committed before your snapshot started AND xmax is either zero or hasn't committed yet.
What is vacuum and why does it matter?
Vacuum is the garbage collector that reclaims dead row versions — versions that are no longer visible to any active transaction. Without vacuum, the table grows forever (table bloat), index lookups slow down, and disk fills up. PostgreSQL runs autovacuum continuously; tuning it is one of the most common production hot spots.
Does MVCC eliminate locks entirely?
No. Two transactions trying to update the same row still conflict — the second waits or aborts. MVCC eliminates read-write conflicts but not write-write conflicts. SELECT FOR UPDATE explicitly takes row locks for read-modify-write patterns.
Why is a long-running transaction expensive in MVCC?
Vacuum can only remove a row version if no active transaction could still see it. A 10-hour analytics query holds the snapshot horizon open for 10 hours, so 10 hours of dead row versions accumulate and can't be reclaimed. Tables bloat, queries slow, and emergency intervention follows.
Is MVCC the same as snapshot isolation?
MVCC is the storage and visibility mechanism. Snapshot isolation is the isolation level you can implement on top of it. Most MVCC databases use it to provide snapshot isolation, but the same machinery can also support read-committed (with a fresh snapshot per statement) or serializable (Postgres SSI layers cycle detection on top).