Distributed Systems

Paxos Consensus

Two phases. One value. The algorithm that taught us consensus is hard.

Paxos is a family of consensus protocols that lets a set of unreliable processes agree on a single value despite crashes and partitions. Single-decree Paxos uses a two-phase prepare/accept handshake, and Multi-Paxos amortizes that cost over a log.

  • Single-decree latency2 RTTs
  • Multi-Paxos steady state1 RTT per decree
  • Fault tolerancef crashes with 2f+1 nodes
  • Authored byLeslie Lamport, 1989/1998
  • Used inChubby, Spanner, Megastore

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How Paxos works

Paxos solves a precisely stated problem: a group of processes must agree on a single value, and once any value is chosen, no other value can be chosen, even if the network drops messages or processes crash and restart with stale state. The original paper splits the participants into three roles: proposers propose values, acceptors vote, and learners learn the outcome. In practice every node usually plays all three.

The protocol runs in two phases, each a request/response round trip:

  1. Phase 1 — Prepare/Promise. A proposer picks a globally unique, monotonically increasing proposal number n and sends prepare(n) to a majority of acceptors. An acceptor that has not seen a higher number replies with promise(n), attaching the highest-numbered proposal it has previously accepted (if any).
  2. Phase 2 — Accept/Accepted. If a majority promises, the proposer picks a value: the value attached to the highest-numbered promise it received, or — if no acceptor had previously accepted anything — its own proposed value. It then sends accept(n, value). An acceptor accepts unless it has already promised a higher number. Once a majority accepts, the value is chosen.

The genius of Paxos lies in two invariants. Invariant 1: any quorum (majority) intersects every other quorum in at least one acceptor — that overlapping acceptor is the witness that prevents conflicting values. Invariant 2: if a value v has been chosen, every higher-numbered proposal will also propose v, because the prepare phase forces the new proposer to inherit any previously accepted value. Together, these guarantee that once consensus is reached, no later round can disturb it.

When to use Paxos

  • You're maintaining or extending an existing Multi-Paxos deployment (Chubby, Spanner, Megastore-style metadata stores). Don't switch a working production cluster to Raft just to feel modern.
  • You need leaderless commits across geographically distant replicas. Egalitarian Paxos lets any node drive a decree, avoiding the cross-region hop a Raft leader would impose.
  • You want asymmetric quorum sizes — a fast common-case quorum and a larger reconfiguration quorum. Flexible Paxos formalizes this trade-off.
  • You're building a research system or teaching the foundations of consensus, where Paxos's minimal core (two phases, three roles) is pedagogically valuable.

For a brand-new replicated state machine, Raft is almost always the better engineering choice — better-specified, easier to debug, and well-served by mature open-source libraries.

Paxos vs Multi-Paxos vs Raft vs Zab vs PBFT vs EPaxos

Single-decree PaxosMulti-PaxosRaftZabPBFTEPaxos
Steady-state RTTs per commit21112 (3 phases)1 fast / 2 slow
Distinguished proposer / leaderOptionalYesYes (strong)Yes (strong)RotatingNone
Failure modelCrashCrashCrashCrashByzantineCrash
Min nodes for f failures2f+12f+12f+12f+13f+12f+1
Membership change specImplicitReconfiguration via decreeJoint consensusFastLeaderElectionView changeConfiguration change
Original publication1989 (preprint), 199820012014201019992013
Production examples(rarely shipped raw)Chubby, Spanner, Megastoreetcd, Consul, CockroachDBZooKeeperHyperledgerResearch, MongoDB-style optims

Multi-Paxos and Raft converge on the same steady-state shape: stable leader, one RTT per commit, majority quorum. Where they differ is documentation. Raft's spec is a 16-page paper; Multi-Paxos is a constellation of papers and folklore.

What Paxos actually costs

Single-decree Paxos pays two RTTs for every value: prepare/promise, then accept/accepted. With LAN latencies of 0.5 ms that's 1 ms per decree; cross-region (50–100 ms) it's 100–200 ms. For any system that needs to commit thousands of decrees per second, raw single-decree Paxos is unusable.

Multi-Paxos collapses this. Once a leader is established, phase 1 is run once for the entire log; subsequent decrees only execute phase 2. Steady state is one RTT per commit, matching Raft. The phase-1 cost is paid only on leader change. This optimization is what makes Paxos viable in production — and also what makes the pedagogy confusing, because most descriptions mix single-decree safety arguments with Multi-Paxos performance numbers.

Quorum size scales with cluster size. A 5-node cluster (f=2) needs majorities of 3 — every commit waits for 3 acks; the 2 slowest nodes don't matter. A 7-node cluster (f=3) waits for 4. Doubling the cluster doesn't double throughput; it raises commit latency by waiting for one more replica. The real cost lives in the tail latency of the slowest majority follower.

Failure-recovery cost is asymmetric. A new leader must run phase 1 across the entire uncommitted log range to learn what was already chosen — O(log_gap) RPCs the first time. Production implementations cache these results aggressively to avoid redoing work after every transient leader change.

JavaScript: classic single-decree Paxos

// A minimal acceptor with stable-storage semantics
class Acceptor {
  constructor() {
    this.promised = -Infinity;        // highest n we promised not to accept below
    this.accepted = null;             // { n, value } or null
  }

  // Phase 1: respond to prepare(n)
  prepare(n) {
    if (n <= this.promised) return { ok: false };
    this.promised = n;                // persist BEFORE replying in real code
    return { ok: true, accepted: this.accepted };
  }

  // Phase 2: respond to accept(n, value)
  accept(n, value) {
    if (n < this.promised) return { ok: false };
    this.promised = n;
    this.accepted = { n, value };     // persist BEFORE replying in real code
    return { ok: true };
  }
}

async function propose(proposer, acceptors, myValue) {
  const n = proposer.nextProposalNumber();
  // PHASE 1
  const promises = await Promise.all(acceptors.map(a => a.prepare(n)));
  const granted = promises.filter(p => p.ok);
  if (granted.length <= acceptors.length / 2) return { chosen: false };

  // Inherit the highest-numbered previously accepted value, if any
  const prior = granted
    .map(p => p.accepted)
    .filter(Boolean)
    .sort((a, b) => b.n - a.n)[0];
  const value = prior ? prior.value : myValue;

  // PHASE 2
  const accepts = await Promise.all(acceptors.map(a => a.accept(n, value)));
  const accepted = accepts.filter(r => r.ok);
  if (accepted.length <= acceptors.length / 2) return { chosen: false };

  return { chosen: true, value };     // value is now CHOSEN
}

Note the inherited value in phase 2. If any acceptor reports "I previously accepted (n', v')", the proposer is no longer free to push myValue — it must propose v' from the highest-numbered previous accept. This is the rule that makes once-chosen values immutable.

Python: proposal-number generation and dueling-proposer livelock

def next_proposal_number(node_id: int, last: int, peer_max: int) -> int:
    """
    Proposal numbers must be globally unique and monotonically increasing.
    Common scheme: high bits are a logical counter, low bits are the node id.
    Bumping past the highest seen number prevents stale rounds from winning.
    """
    counter = max(last >> 16, peer_max >> 16) + 1
    return (counter << 16) | (node_id & 0xFFFF)


def propose_with_backoff(proposer, acceptors, value, max_attempts=10):
    """
    Naive retry causes livelock when two proposers preempt each other
    forever. Exponential backoff with jitter breaks the symmetry.
    """
    import random, time
    delay = 0.001                          # 1 ms initial
    for attempt in range(max_attempts):
        result = proposer.run_round(acceptors, value)
        if result.chosen:
            return result
        time.sleep(delay + random.random() * delay)
        delay = min(delay * 2, 1.0)        # cap at 1 s
    raise RuntimeError("Could not reach consensus — livelock?")

The retry loop is the practical defence against the dueling-proposer livelock. Without it, two proposers issuing increasing prepare numbers will preempt each other indefinitely — Paxos's safety holds, but no value ever gets chosen. FLP impossibility (Fischer–Lynch–Paterson, 1985) proves you can't fix this with timing alone in an asynchronous system; you need a leader election or randomization to break ties. Multi-Paxos picks the former.

Variants and extensions

  • Multi-Paxos. Run phase 1 once per leader epoch; reuse the promise across all subsequent decrees. Drops steady-state cost from 2 RTTs to 1. Every production "Paxos" deployment is really Multi-Paxos.
  • Fast Paxos (Lamport, 2005). Lets a client send its proposal directly to acceptors, saving one RTT. Trade-off: requires a larger fast-quorum (⌈3N/4⌉ instead of N/2+1) and falls back to classic Paxos on collision.
  • Egalitarian Paxos (EPaxos). Leaderless. Any node can drive a decree by piggy-backing dependency information on its prepare. Non-conflicting commits commit in 1 RTT regardless of geography; only interfering commits pay the slow path.
  • Flexible Paxos (Howard et al., 2016). The only intersection requirement is between phase-1 quorums and phase-2 quorums. You can shrink phase-2 quorums to 2 nodes if you grow phase-1 quorums to N-1, optimizing for steady-state commit latency at the cost of leader-change cost.
  • Cheap Paxos. Use auxiliary "witness" nodes that participate only when an active node fails — reduces hardware cost while preserving fault tolerance.
  • Disk Paxos / Stoppable Paxos. Variants for shared-storage setups and graceful reconfiguration respectively.

Common bugs and edge cases

  • Dueling-proposer livelock. Two proposers indefinitely preempt each other. Fix: stable distinguished proposer (Multi-Paxos) plus exponential backoff with jitter.
  • Forgetting to persist promised before replying. If an acceptor crashes after sending a promise but before durably writing it, recovery can let the same acceptor promise a lower number, breaking the safety invariant. Stable storage must precede the network reply.
  • Reusing proposal numbers. Two proposers must never issue the same number. The standard fix encodes the node id in the low bits — but failing to update last_proposal_seen from the highest peer_max seen in promises causes silent rounds that always lose.
  • Confusing "accepted" with "chosen". An acceptor that accepts a value is not enough — a majority must accept the same (n, value) pair before the value is chosen. A learner reading from one acceptor sees accepted values that may still be overwritten.
  • Membership change without joint consensus. Naively swapping the acceptor set risks two non-overlapping quorums during the transition, allowing two different values to be chosen. Use a transitional configuration that requires acks from both old and new quorums.
  • Stale leader continuing to commit. A Multi-Paxos leader that has been silently superseded keeps appending entries to its local log. Without a leader-lease check, those entries get rejected on the next phase-1, wasting an RTT and looking like a flapping cluster.

Frequently asked questions

Why is Paxos so hard to implement?

Lamport's original paper specifies the safety properties of single-decree Paxos but leaves the path to a working replicated log as an exercise. Real systems need Multi-Paxos, leader stability, log compaction, membership change, and crash recovery. Each is its own subprotocol with its own invariants. Google's Chubby paper famously documented bug after bug in their first implementation, and concluded that "between the description of the Paxos algorithm and the needs of a real-world system there is a significant gap."

What is the difference between Paxos and Multi-Paxos?

Single-decree Paxos agrees on one value with two RTTs: phase-1 prepare/promise, then phase-2 accept/accepted. Multi-Paxos elects a stable distinguished proposer and skips phase 1 for subsequent decrees, dropping the steady-state cost to one RTT per decision. Without that optimization, every commit pays the prepare round-trip — a 2× latency penalty plus extra contention risk.

Can two proposers cause a livelock?

Yes — this is the classic "dueling proposers" failure. Proposer A's prepare(n) is promised, but before A sends accept, proposer B's prepare(n+1) preempts the acceptors, who refuse A's accept. A then prepares with n+2, preempts B, and the cycle repeats. The fix is a stable distinguished proposer (Multi-Paxos) and exponential backoff on the loser. FLP impossibility means you cannot eliminate livelock with timing alone.

Does a Paxos majority have to be a strict majority?

It has to be any quorum that overlaps with every other quorum used by the protocol. Classic Paxos uses simple majorities (⌈(N+1)/2⌉) because they are the smallest sets with that property. Flexible Paxos generalizes this: you only need phase-1 quorums and phase-2 quorums to overlap, not each phase's quorums to overlap with themselves. That allows asymmetric trade-offs — a small phase-2 quorum for cheap commits at the cost of larger phase-1 quorums during leader change.

Why doesn't Paxos prevent decided values from being lost?

It doesn't lose them — that's the safety guarantee. Once any value is chosen (accepted by a majority), every future round is forced to propose that same value. The "promise" phase rule is precisely this: when an acceptor promises proposal n, it returns the highest-numbered value it previously accepted, and the new proposer must use that value if any was returned. Decided values cannot be overwritten.

Should I use Paxos in 2026?

Probably not from scratch. The big production deployments — Spanner, Megastore, Chubby — use Multi-Paxos because they predate Raft. New systems almost always pick Raft for its better-specified leader election and membership change. Paxos remains valuable when you need leaderless commits (EPaxos) or asymmetric quorum sizes (Flexible Paxos). For a vanilla replicated log, Raft is the safer engineering choice.