Distributed Systems
Two-Phase Commit (2PC)
A coordinator polls participants in a vote phase, then commits or aborts in unison
Two-phase commit (2PC) is a distributed consensus protocol that achieves atomic commit across multiple participants — either all of them commit or all abort. Phase 1 (prepare): the coordinator sends prepare to all participants; each replies yes (durably logged) or no. Phase 2 (commit/abort): if all said yes, coordinator sends commit; otherwise abort. Each participant's response is durably logged and replayed on crash. Designed by Jim Gray in the 1970s. Used in XA distributed transactions, distributed databases (Spanner uses Paxos-replicated 2PC), and message brokers. The fatal flaw: if the coordinator crashes between phases, participants block forever waiting — solved by 3PC, Paxos Commit, or replicated coordinators.
- Phasesprepare, commit/abort
- Network msgs~3n one-way (2 rounds × n)
- Blockingyes if coordinator crashes
- XA standard1991
- Spanner usesPaxos-replicated coord + 2PC
- 3PC variantnon-blocking but more chatty
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
Why 2PC matters
- Distributed database transactions. A transaction that updates two shards (or two databases via XA) needs atomicity — either both updates land or neither. 2PC is the canonical mechanism. CockroachDB, YugabyteDB, TiDB, and Spanner all use a 2PC variant under the hood.
- Banking transfers. Move $100 from account A in bank A's database to account B in bank B's database. Without 2PC, a crash between the debit and the credit creates or destroys money. With 2PC, the prepared state holds locks until the coordinator decides.
- Message broker exactly-once. Kafka transactions, RabbitMQ XA, and JMS XA use 2PC to guarantee that a message produced and a database row updated commit together — no duplicate processing on crash.
- Atomic config rollouts. A configuration change that must apply to a fleet of servers in lockstep can use 2PC: prepare on all, commit on all. Useful when partial rollout would leave the system inconsistent.
- Two-database joins. Legacy enterprise apps with one database for orders and another for inventory rely on XA + 2PC to maintain referential integrity across both.
- Filesystem snapshots. Some clustered filesystems (GPFS, Lustre) use 2PC-like protocols to take a coherent snapshot across all storage nodes simultaneously.
The protocol step by step
Coordinator C, participants P1..Pn. T = transaction ID.
- Begin. Each Pi has done its tentative work (writes are in its undo/redo log but not yet committed). C decides T should commit; sends
prepare(T)to every Pi. - Vote. Each Pi: acquires locks; force-writes prepare record + undo/redo to its log; replies
vote-yes(T). If anything fails, force-writes abort record and repliesvote-no(T). - Decide. C waits for votes (with a timeout). If all yes, C force-writes
commit(T)to its log and sendscommit(T)to every Pi. Otherwise C writesabort(T)and sendsabort(T). - Apply. Each Pi force-writes the commit/abort record, applies (or rolls back), releases locks, sends ack to C.
- Forget. When C has all acks, it can forget T (garbage-collect its log entries).
Counts: 2 message rounds, ~3n one-way messages on the happy path, plus one forced log write per participant per phase (so 4 fsyncs per participant on the happy path: prepare-write, vote, commit-write, ack). For 5 participants on NVMe (50 µs fsync) the protocol's lower bound is ~250 µs plus network round trips. On geo-distributed deployments the network dominates: 100-150 ms per cross-region round trip × 2 rounds = 200-300 ms minimum.
Failure modes and recovery
- Participant crashes before vote. No state on disk. On recovery, it has no record of T; C will time out, abort the transaction, and re-send abort. Safe.
- Participant crashes after vote-yes, before commit. Prepared record is on disk, locks are conceptually held. On recovery, participant queries C for the outcome ("presumed-abort" optimization: if no record, assume abort; "presumed-commit" assumes commit). Locks held until decision arrives.
- Coordinator crashes before sending decision. The blocking case. Surviving participants in prepared state cannot proceed. They can run cooperative termination (ask peers if any has heard the decision) — works iff the answer happens to be on a live peer. Else they wait for C to recover.
- Coordinator crashes after sending some commits. On recovery, C reads its commit record from log and re-sends commit to any participant that has not yet acked.
- Network partition between participants and coordinator. Same as coordinator crash from the participants' viewpoint.
2PC vs Paxos / Raft
People conflate them. 2PC is an atomic-commit protocol — it guarantees agreement on a single decision (commit or abort) for one transaction, but the coordinator is a single point of failure. Paxos and Raft are consensus protocols — they guarantee agreement on a sequence of values across a quorum, tolerating up to f failures with 2f+1 nodes. They solve different problems: you can use Paxos to replicate the coordinator's state in a 2PC, getting fault-tolerant 2PC. That is exactly what Spanner, CockroachDB, and Calvin do. The 2PC structure stays; the coordinator's single-point-of-failure is removed.
Common misconceptions
- "2PC = ACID across services." It only gives atomicity (and via locks, isolation against other 2PC transactions). Durability is each participant's responsibility. Consistency is application-level.
- "Blocking is rare in practice." In a single datacenter, coordinator failure is rare. Across regions or with public-cloud network blips, it happens often enough that operators prefer Paxos-replicated coordinators or saga patterns.
- "More participants is fine." Probability of any participant failing scales with n; latency is bounded by the slowest participant; lock-hold time is at minimum 2 round trips. 2PC degrades quickly past 5–10 participants.
- "Saga is just 2PC done worse." They have different guarantees. Saga gives no isolation (intermediate states are visible) but full liveness; 2PC gives isolation but can block. The right choice depends on whether your business invariants tolerate intermediate states.
- "3PC is widely used." 3PC is a textbook protocol; production systems mostly skip it for Paxos-Commit because 3PC fails under partitions, while Paxos-Commit handles them.
- "Two-phase locking and 2PC are the same." 2PL is a per-participant concurrency control mechanism (acquire locks during growing phase, release in shrinking phase). 2PC is a multi-participant agreement protocol. They are often used together but distinct.
XA in 30 lines, conceptually
- Application calls
tm.begin()to start a global transaction. TM allocates an XID. - Application calls into RM1 (database) and RM2 (queue); each RM enlists in the global transaction (the JDBC/JMS driver does
xa_start(XID)). - Application does its writes. Each RM stages them locally with the global XID attached.
- Application calls
tm.commit(). TM iterates:rm.xa_prepare(XID)for each RM. Each RM forces its prepare log entry, replies XA_OK or XA_RBROLLBACK. - If all OK, TM forces its commit record, then iterates:
rm.xa_commit(XID). If any prepare failed,xa_rollback(XID)on all RMs. - If TM crashes mid-decision: on restart, it re-reads its log; for any in-doubt XID, it re-issues commit or rollback to RMs (the RMs replay their prepared state via
xa_recover()).
Frequently asked questions
Why is 2PC blocking?
Once a participant has voted yes in phase 1 and durably logged its prepared state, it cannot unilaterally abort or commit — only the coordinator's phase-2 decision can resolve it. If the coordinator crashes after collecting some yes votes but before sending commit/abort, every prepared participant must hold its locks and wait, possibly forever, until the coordinator recovers. Surviving participants can ask each other (cooperative termination) but if any prepared peer is also unreachable, they cannot decide unilaterally without risking inconsistency. This is the canonical failure mode of 2PC and the reason 'blocking' is the protocol's defining flaw.
What is the prepare phase actually doing?
Prepare is a promise to be able to commit. When a participant receives prepare, it: (1) acquires all locks needed for the transaction's writes; (2) writes the redo and undo log entries to durable storage with fsync; (3) writes a 'prepared' record naming the transaction and the coordinator; only then does it reply yes. After this point, the participant must be able to either commit or abort on demand even after a crash — the prepared record will be replayed during recovery. The cost is real: each prepare incurs a forced log write (1-2 ms on spinning disk, 50-200 µs on NVMe), times every participant. Saying yes is a contract.
How does 3PC fix the blocking problem?
Three-phase commit (Skeen, 1982) inserts a pre-commit phase between prepare and commit. After all participants vote yes, the coordinator sends pre-commit. Each participant acknowledges, then waits for the final commit. The added phase ensures that if the coordinator fails after enough participants have entered pre-commit, the survivors can elect a new coordinator and decide based on whether any peer has reached pre-commit (in which case all survivors commit) or none have (all abort). 3PC is non-blocking under fail-stop assumptions but vulnerable to network partitions. It also doubles the message rounds (6n vs 3n for 2PC). In practice, real systems prefer Paxos Commit (Gray and Lamport 2006) which uses Paxos to make the coordinator role itself fault-tolerant.
What is XA in JTA / java.transaction?
XA is the X/Open standard from 1991 that defines the interface between a transaction manager (TM) and a resource manager (RM, e.g. a database or message queue). It encodes the 2PC protocol as a set of C functions: xa_prepare, xa_commit, xa_rollback, xa_recover. The TM acts as the coordinator; each XA-compliant RM is a participant. In Java this surfaces as java.transaction.UserTransaction (JTA), with implementations in JBoss/Wildfly, Atomikos, Bitronix, and Oracle WebLogic. XA is the reason your enterprise app can put a JMS send and a JDBC update in the same transaction. It is also why those apps can hang in 'in-doubt transaction' state when a coordinator dies — recovering them by hand from RM logs is a real operational chore.
How does Spanner combine 2PC and Paxos?
Google Spanner gets the best of both: 2PC for atomicity across shards, Paxos for fault tolerance of each shard's state. Each shard is a Paxos group (typically 5 replicas across data centers) that internally agrees on its own log; one replica is the leader. For a multi-shard transaction, Spanner picks one shard's Paxos group as the 2PC coordinator and the others as participants. The coordinator's prepared state is replicated through Paxos, so if the leader crashes, a new leader can resume the 2PC decision — the protocol no longer blocks on a single coordinator. TrueTime gives global timestamp ordering on top. The cost is real: a multi-shard write is a 2PC over Paxos, taking tens of milliseconds round-trip.
Why is 2PC discouraged in microservices (saga pattern)?
2PC requires every participant to expose a prepare/commit/abort protocol, durable logging, and to hold locks across phase boundaries. That is fine for two databases under a transaction manager but doesn't compose for HTTP-based microservices: services are owned by different teams, may have private storage technologies (some without prepared-state semantics), can be slow or temporarily down, and locking across them holds resources for human-perceptible durations. The saga pattern replaces atomicity with eventual consistency: each step is a local transaction, with a compensating action defined for each forward step (debit-account → refund-account). On failure, you run the compensations backward. You lose isolation but gain liveness and operational simplicity.