Distributed Systems
Three-Phase Commit
An extra round trip to make sure a dead coordinator can't freeze the whole cluster
Three-phase commit (3PC) adds a pre-commit phase between the vote and the commit of two-phase commit, so a coordinator crash leaves a timeout-recoverable state instead of blocking every participant forever — at the cost of an extra round trip and no safety under network partitions.
- Message rounds3 (vs 2 for 2PC)
- Extra forced log write / node1 (pre-commit record)
- Blocking on single crashNo (timeout recovers)
- Safe under partitionNo
- Designed byDale Skeen, 1981
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The problem 3PC was built to fix
Two-phase commit has one fatal liveness flaw. After a participant votes yes in phase one, it is locked: it has promised to commit if asked, so it holds its locks and waits for the coordinator's verdict. If the coordinator crashes after collecting the votes but before broadcasting the decision, every participant that voted yes is stranded. It cannot commit (maybe someone voted no), it cannot abort (maybe everyone voted yes and others already committed), and it cannot ask its peers, because in 2PC a participant has no idea what anyone else voted. The transaction's locks stay held until the coordinator comes back — potentially minutes, potentially never. This is the blocking problem, and on a busy database it can cascade into a cluster-wide stall.
Dale Skeen analysed this formally in 1981 (and, with Michael Stonebraker, in the 1983 formal crash-recovery model) and proved a clean structural result: a commit protocol can be non-blocking only if no reachable state is simultaneously "committable" by one node and "abortable" by another. 2PC violates this because the state right after voting yes is both — a yes voter could end up committing or aborting depending on votes it can't see. Skeen's fix was to insert a new state that removes that ambiguity. That new state is the pre-commit phase, and the protocol built around it is three-phase commit.
How the three phases work
3PC keeps 2PC's vote phase and splits the second phase into two. The coordinator drives all three rounds:
- Phase 1 — canCommit (the vote). The coordinator sends
canCommit?to every participant. Each repliesyesorno. Unlike 2PC, a participant does not yet make its yes durable by force-writing — it just promises. If anyone says no (or times out), the coordinator aborts. - Phase 2 — preCommit (the new buffer). If all votes were yes, the coordinator sends
preCommitto everyone. A participant receivingpreCommitforce-writes a pre-commit record and repliesack. The crucial invariant: a participant only enters pre-commit if it knows every participant voted yes. So once you are in pre-commit, the outcome can only be commit. - Phase 3 — doCommit. After collecting the acks, the coordinator sends
doCommit. Each participant commits, releases locks, and repliesdone.
The pre-commit state is the entire point. It separates "we have decided to commit" (pre-commit) from "we have committed" (do-commit). Because every node reaches pre-commit before any node reaches do-commit, the dangerous overlapping state of 2PC simply does not exist.
The recovery rule — where the non-blocking magic lives
When the coordinator dies, the surviving participants run a termination protocol: they elect a new coordinator (or each consults the others) and decide based on the most advanced state anyone is in. The rule is deliberately simple:
- If any reachable participant is in pre-commit, the new coordinator pushes everyone to pre-commit and then commits. Safe, because pre-commit can only have been entered after a unanimous yes.
- If no participant is in pre-commit (they are all still in the uncertain "voted yes, waiting" state), the new coordinator aborts. Safe, because no node could have committed yet — do-commit is only reachable through pre-commit.
The timeout actions baked into each participant mirror this: a node that times out waiting in the uncertain state aborts; a node that times out waiting in pre-commit commits. No participant ever has to sit and wait indefinitely for the coordinator — every state has a unilateral, safe default action. That is what "non-blocking" means here: progress is always possible without the failed node.
When 3PC is the right tool (and when it isn't)
- Use it as a teaching scaffold. 3PC is the clearest illustration of why non-blocking atomic commit needs an extra state. It is on nearly every distributed-systems syllabus for exactly this reason.
- Use it inside a synchronous, well-monitored cluster where you can bound message delay and trust your failure detector — e.g. machines on a single fast switch with tight, accurate timeouts and no expectation of partitions. There, the fail-stop assumptions roughly hold.
- Do not use it across a WAN or anywhere partitions are plausible. The moment the network can split, 3PC can produce a split-brain decision. Reach for Paxos Commit, Raft-replicated 2PC, or a Saga instead.
- Do not use it when 2PC's blocking is acceptable. If your coordinator is itself highly available (replicated), plain 2PC on top of it gives you most of the benefit with one fewer round trip.
3PC versus the alternatives
| 2PC | 3PC | Paxos / Raft Commit | Saga | |
|---|---|---|---|---|
| Message rounds | 2 | 3 | 2 (replicated) | N steps + compensations |
| Forced log writes / node | 1–2 | 2–3 | O(1) per Paxos instance | per local txn |
| Non-blocking on coordinator crash | No | Yes (fail-stop) | Yes | Yes |
| Safe under network partition | Yes (but blocks) | No (split-brain) | Yes (majority quorum) | Yes (eventual) |
| Atomicity guarantee | Atomic | Atomic if no partition | Atomic | Eventual, not isolated |
| Failure model | Async-safe, blocks | Synchronous / fail-stop | Async, ≤ minority failures | Async |
| Real-world use | XA, distributed DBs | Essentially none | Spanner, CockroachDB, etcd | Microservices |
The honest summary: 3PC fixes 2PC's liveness problem but introduces a safety problem, and in distributed systems safety almost always wins over liveness. That trade is why 3PC is famous in textbooks and absent from production. Paxos Commit gets you non-blocking and partition-safe by replacing the timeout-based failure detector with a quorum that never lies — the cost is needing a majority alive to make progress.
What the extra phase actually costs
- ~50% more latency. 2PC needs 2 sequential round trips (prepare, commit); 3PC needs 3 (canCommit, preCommit, doCommit). On a 1 ms intra-datacenter link that is roughly 3 ms versus 2 ms of network time before the coordinator's
fsyncdominates — but on a 50 ms cross-region link the third round adds a full 50 ms of commit latency. - One extra forced disk write per participant. Each node fsyncs a pre-commit record it would not write under 2PC. With N participants that is N additional
fsynccalls — and an fsync on a spinning disk is ~10 ms, on enterprise SSD ~0.1–1 ms. - 2N extra messages per transaction. The preCommit round is N sends plus N acks on top of 2PC's traffic. For a 5-participant transaction that is 10 messages added to the ~20 a 2PC already sends.
- Blocking window shrinks but never reaches zero. 3PC's recovery still stalls during failure detection — typically one timeout interval (often 5–30 s in real deployments). It removes indefinite blocking, not all blocking.
JavaScript implementation
A compact, message-passing model of the coordinator and a participant. The states and timeout transitions are the load-bearing part — the network is abstracted as send.
const State = Object.freeze({
INIT: 'init', WAITING: 'waiting', // voted yes, uncertain
PRECOMMIT: 'precommit', // knows everyone voted yes
COMMITTED: 'committed', ABORTED: 'aborted'
});
class Participant {
constructor(id, willVoteYes = true) {
this.id = id; this.state = State.INIT; this.willVoteYes = willVoteYes;
}
// Coordinator -> participant messages
onCanCommit() { // Phase 1
if (!this.willVoteYes) { this.state = State.ABORTED; return 'no'; }
this.state = State.WAITING; return 'yes';
}
onPreCommit() { // Phase 2 — force-write pre-commit record
if (this.state !== State.WAITING) return 'ignored';
this.log('PRECOMMIT'); // durable fsync
this.state = State.PRECOMMIT; return 'ack';
}
onDoCommit() { // Phase 3
if (this.state !== State.PRECOMMIT) return 'ignored';
this.log('COMMIT'); this.state = State.COMMITTED; return 'done';
}
// Timeout: the non-blocking heart of 3PC
onTimeout() {
if (this.state === State.WAITING) this.state = State.ABORTED; // never saw preCommit -> abort
if (this.state === State.PRECOMMIT) this.state = State.COMMITTED; // everyone agreed -> commit
return this.state;
}
log(rec) { /* append rec to durable WAL, then fsync */ }
}
async function coordinatorCommit(participants, send) {
// Phase 1 — collect votes
const votes = await Promise.all(participants.map(p => send(p, 'canCommit')));
if (votes.some(v => v !== 'yes')) {
await Promise.all(participants.map(p => send(p, 'abort')));
return 'ABORTED';
}
// Phase 2 — pre-commit: tell everyone the vote was unanimous
await Promise.all(participants.map(p => send(p, 'preCommit'))); // wait for acks
// Phase 3 — do-commit
await Promise.all(participants.map(p => send(p, 'doCommit')));
return 'COMMITTED';
}
Notice the asymmetry in onTimeout: the same event (the coordinator went silent) produces opposite outcomes depending on which state the participant is in. A node still in WAITING aborts because it cannot prove anyone agreed; a node in PRECOMMIT commits because reaching pre-commit is itself the proof of unanimous agreement.
Python — the termination protocol
The piece students usually skip is what the survivors do when the coordinator dies mid-flight. This sketch elects the most-advanced survivor and drives everyone to a consistent outcome.
from enum import Enum
class S(Enum):
WAITING = 1 # voted yes, uncertain
PRECOMMIT = 2 # knows the vote was unanimous
COMMITTED = 3
ABORTED = 4
def terminate(survivors):
"""Run after the coordinator is suspected dead.
`survivors` is the list of reachable participants and their states."""
states = {p.id: p.state for p in survivors}
# If anyone already finished, copy that decision — it is final.
if any(s is S.COMMITTED for s in states.values()):
return commit_all(survivors)
if any(s is S.ABORTED for s in states.values()):
return abort_all(survivors)
# Otherwise decide by the most advanced *reachable* state.
if any(s is S.PRECOMMIT for s in states.values()):
# Someone reached pre-commit => the vote was unanimous yes.
# Push laggards to PRECOMMIT, then commit everyone.
for p in survivors:
if p.state is S.WAITING:
p.state = S.PRECOMMIT # force-write pre-commit
return commit_all(survivors)
# No one reached pre-commit => no one can have committed => safe to abort.
return abort_all(survivors)
def commit_all(ps):
for p in ps: p.state = S.COMMITTED
return "COMMITTED"
def abort_all(ps):
for p in ps: p.state = S.ABORTED
return "ABORTED"
The reason this is safe under a single crash is that the survivors can see every other survivor's state. The reason it is unsafe under a partition is that they cannot: a partition hides the pre-commit nodes from the waiting nodes, so the two groups run terminate on different inputs and reach different verdicts.
Variants and the protocols that replaced it
Skeen's quorum-based 3PC. In a follow-up work (the 1982 "A Quorum-Based Commit Protocol") Skeen pairs the protocol with a quorum requirement during recovery: a survivor may only commit if it can assemble a commit-quorum and abort if it can assemble an abort-quorum, with the quorums chosen so the two can't both form. This patches the partition hole at the cost of needing a majority — at which point you are most of the way to a consensus protocol.
Paxos Commit. Gray and Lamport (2006) reframed atomic commit as one consensus instance per participant: each participant's vote is decided by a Paxos run, and a transaction commits iff every per-participant instance decides "prepared." This is genuinely non-blocking and partition-safe, needing only a majority of acceptors alive. It is the conceptual basis for how Spanner commits.
Raft / Multi-Paxos-replicated 2PC. The pragmatic modern answer: keep ordinary 2PC, but make the coordinator a replicated state machine via Raft so its decision log survives crashes. You get 2PC's two rounds plus consensus latency, with no blocking and full partition safety as long as a majority is up. CockroachDB, TiDB, and YugabyteDB all work this way.
Saga pattern. A different escape hatch entirely: give up atomic isolation, run each step as a local transaction, and undo with compensating actions on failure. No coordinator to block on, at the price of intermediate states being visible.
Common bugs and edge cases
- Treating 3PC as partition-safe. The number-one misconception. 3PC's correctness proof assumes a synchronous network and fail-stop nodes. Deploy it where partitions happen and you will eventually get a transaction that commits on some nodes and aborts on others.
- Wrong default on timeout. The timeout action is state-dependent: abort from
WAITING, commit fromPRECOMMIT. Flip those and you violate atomicity on the very failure 3PC exists to handle. - Forgetting to force-write the pre-commit record. If a participant enters pre-commit in memory but crashes before fsync, on restart it has lost the knowledge that the vote was unanimous and may incorrectly abort during recovery.
- The "false suspicion" cascade. A too-aggressive timeout suspects a live but slow coordinator, triggers a termination protocol, and now two coordinators are issuing conflicting commands. Recovery must fence the old coordinator (e.g. with an epoch/term number) before taking over.
- Total cluster failure is still blocking. If every participant crashes, no survivor can run the termination protocol; the transaction's fate waits for a node to recover and read its log — 3PC only removes blocking caused by the coordinator, not blocking caused by losing everyone.
- Assuming the extra round is free. Engineers sometimes add 3PC "for safety" and are surprised by the 50% latency hit and the extra fsync per node on the hot commit path.
Frequently asked questions
What exactly does the pre-commit phase add over two-phase commit?
It inserts a buffer state between "everyone voted yes" and "everyone has committed." Once a participant enters pre-commit, it knows for certain that every other participant also voted yes — so any node can finish the transaction on its own if the coordinator vanishes. In 2PC a participant that voted yes has no idea whether the others agreed, so it must block until it hears back.
Is three-phase commit actually non-blocking?
It is non-blocking only under the fail-stop model: nodes crash cleanly, the network is synchronous, and timeouts reliably detect failure. Under those assumptions any single failure is recoverable by timeout. But on a real asynchronous network 3PC is not safe — a network partition can split the participants into two groups that independently time out into opposite decisions, committing on one side and aborting on the other. That is why production systems use Paxos or Raft instead.
Why does a network partition break 3PC when a single crash doesn't?
The recovery rule is "if you timed out in pre-commit, commit; if you timed out before pre-commit, abort." A partition can put some nodes in pre-commit and others not. The pre-commit group times out and commits; the wait group times out and aborts. Both followed the rules, yet the cluster disagrees — an atomicity violation. 3PC trades 2PC's liveness problem for a safety problem under partitions.
How many message rounds and disk writes does 3PC cost versus 2PC?
3PC uses three round trips (canCommit, preCommit, doCommit) versus 2PC's two, so latency is roughly 50% higher — about 3 network round trips plus the coordinator's fsync. Each participant performs one extra forced log write (the pre-commit record), so for N participants the cluster does N additional fsyncs and 2N additional messages per transaction.
Do any real systems use three-phase commit in production?
Almost none use textbook 3PC. The blocking problem it solves is real, but its unsafety under partitions made it a dead end. Modern systems instead replicate the coordinator's decision with a consensus protocol — Google Spanner runs Paxos-replicated 2PC, and many databases use Raft for the same purpose. 3PC survives mainly as a teaching device and as the conceptual ancestor of Skeen's quorum-based non-blocking commit.
What is the difference between 3PC and Paxos Commit?
3PC relies on synchronous timeouts to detect coordinator failure and is unsafe when timeouts lie (partitions). Paxos Commit replaces the single coordinator with a fault-tolerant replicated log: each participant's vote is stored via a Paxos instance, so the decision survives any minority of failures with no timing assumptions. It needs a majority quorum to make progress but never produces inconsistent decisions.