Distributed Systems

Quorum Consensus

Pick any read set R and write set W. If R + W > N, they must overlap.

Quorum consensus guarantees consistency by requiring any read quorum and any write quorum to share at least one replica. Choose R and W on a per-call basis to slide between strict consistency (R + W > N) and high availability (R = W = 1). Dynamo, Cassandra, and Riak made it a household idiom.

  • Core invariantR + W > N ⇒ overlap
  • Strict consistencyR = W = ⌈(N+1)/2⌉
  • Max availability (AP)R = W = 1
  • Read latencymax of slowest R replica RTTs
  • Originated byGifford 1979 (weighted voting)
  • Used inDynamo, Cassandra, Riak, Voldemort

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The overlap invariant

Place N replicas in a row. Pick any W of them for a write; pick any R of them for a read. If R + W > N, the two subsets are too big to be disjoint — by the pigeonhole principle, they must share at least one replica. That shared replica saw the write (it was in the write quorum) and is now visible to the reader (it's in the read quorum). The reader is guaranteed to find at least one copy of the freshest acknowledged value.

Concretely, for N = 5: a write to W = 3 replicas and a read from R = 3 replicas. R + W = 6 > 5, so the read set must intersect the write set in at least one node. The reader collects three responses, finds the response with the highest timestamp (or version vector), and returns it.

If you set R + W ≤ N, overlap is not guaranteed: a write can land on replicas {1, 2} while a read hits {3, 4} — disjoint sets, no shared replica, stale read. That's the AP knob: maximum availability, eventual consistency only.

The three knobs: N, R, W

  • N — replication factor. The total number of replicas that hold each key. Set per keyspace or table (Cassandra's replication_factor). Common values: 3 in single-datacenter, 5 across two datacenters, 9 across three regions.
  • W — write quorum. Number of replicas that must acknowledge a write before the client gets success. W = 1: fastest writes, weakest guarantee. W = N: strict, slowest.
  • R — read quorum. Number of replicas a read must collect responses from before returning. Same trade-off: R = 1 fast, R = N slow.

The genius of Dynamo's API was exposing all three to the application — and letting each call pick its own. A "post the comment" call might run W = ALL (don't lose it); the subsequent "fetch comments" can run R = ONE (stale is fine for a refresh). One database, two consistency models, chosen at the call site.

Configuration variants

ConfigurationCassandra levelPropertyUse case
R = W = 1ONE / ANYEventual consistency (AP)High-volume telemetry, counters
R = 1, W = NALL writes, ONE readsRead-optimized — every read freshRead-heavy workloads where writes are rare
R = N, W = 1ONE writes, ALL readsWrite-optimized — writes never blockBursty ingestion with tolerable read latency
R + W > N (both quorums)QUORUMRead-your-writes, balancedGeneral-purpose strong-ish consistency
R = W = NALLStrongest — every replica agreesCritical config; rare in practice
R + W > N + DC_localLOCAL_QUORUMStrong consistency within one DCGeo-replicated apps with regional reads

Cassandra's LOCAL_QUORUM is the most common production setting: a quorum within the local datacenter (not the whole cluster), so reads and writes don't cross regional latency. Cross-region consistency is sacrificed for speed; eventual consistency between regions is handled by replication.

When quorum consensus is the right pick

  • You want per-call consistency control. A single keyspace serving both critical user data and bulk telemetry — tune R and W at the call site rather than partitioning into separate databases.
  • You can tolerate sibling resolution. Two clients writing concurrently to the same key produce siblings (vector clock concurrent versions). Your application must reconcile (LWW, set union, CRDT) or accept whatever the database picks.
  • You need partition tolerance plus tunable consistency. Unlike Paxos/Raft (which block under partition), quorum stores stay up and accept writes to whatever majority is reachable.
  • You're storing high-volume, low-criticality data. Time-series metrics, session data, user-generated content where loss of single events is tolerable but throughput is paramount.

Quorum consensus is the wrong choice for linearizable transactions across multiple keys (use Spanner, CockroachDB, or a Raft-based store), for strong consistency with a single source of truth (use Postgres), or when sibling resolution would be impossible to define (financial ledger entries).

Quorum vs Paxos/Raft vs leader-based

Quorum (Dynamo)Paxos / RaftSingle-leader (PG-style)
Consistency modelTunable (eventual to strong)LinearizableLinearizable (writes) + read-your-writes
Per-call tunabilityYesNoNo
Write availability under partitionIf any W replicas reachableOnly on majority sideOnly on primary side
Read latency (typical)max of R RTTsLocal on follower (with bounded staleness)Round-trip to primary
Write latency (typical)max of W RTTs2 round-trips (propose + accept)fsync + replication ack
Conflict resolutionApp-level (siblings)Built-in (consensus)None needed (single writer)
Production examplesCassandra, DynamoDB, Riaketcd, ZooKeeper, CockroachDBPostgreSQL, MySQL with sync replicas

The mental model: Paxos/Raft is "vote to pick a single value, all replicas converge." Quorum is "let everyone write whatever, but make sure read and write sets overlap so the latest acknowledged value is visible." Different problems, different trade-offs — and worth understanding both before reaching for either.

Pseudo-code

// Quorum write — coordinator-side
write(key, value, W):
    timestamp = clock.now()
    replicas  = replicas_for(key)         // N nodes
    futures   = [send_async(r, key, value, timestamp) for r in replicas]
    successes = 0
    errors    = []
    for f in await_any(futures, count=W, timeout=write_timeout):
        if f.ok: successes++
        else:    errors.append(f.err)
        if successes >= W:
            return OK
    if successes < W:
        return TIMEOUT_OR_PARTIAL(errors)

// Quorum read — coordinator-side
read(key, R):
    replicas = replicas_for(key)
    futures  = [send_async(r, key) for r in replicas]
    responses = []
    for f in await_any(futures, count=R, timeout=read_timeout):
        if f.ok: responses.append(f.value_with_timestamp)
    // Pick freshest. For LWW: max by timestamp.
    // For vector clocks: union of concurrent versions → app resolves.
    winner = resolve_conflict(responses)
    // Optional: read-repair — push winner to lagging replicas
    for r in replicas:
        if r.version < winner.version: send_async(r, key, winner)
    return winner.value

Python implementation sketch

import asyncio
from dataclasses import dataclass
from typing import List

@dataclass
class Response:
    node: str
    value: str
    timestamp: int

class QuorumClient:
    def __init__(self, replicas: List[str], N: int):
        self.replicas = replicas        # all N replica nodes
        self.N = N

    async def write(self, key, value, W, timestamp):
        sends = [self._write_one(r, key, value, timestamp) for r in self.replicas]
        successes = 0
        for coro in asyncio.as_completed(sends, timeout=2.0):
            try:
                if await coro:
                    successes += 1
                    if successes >= W:
                        return True
            except Exception:
                pass
        raise QuorumTimeout(f"only {successes} of {W} writes succeeded")

    async def read(self, key, R):
        sends = [self._read_one(r, key) for r in self.replicas]
        responses: List[Response] = []
        for coro in asyncio.as_completed(sends, timeout=1.5):
            try:
                resp = await coro
                if resp is not None:
                    responses.append(resp)
                    if len(responses) >= R:
                        break
            except Exception:
                pass
        if len(responses) < R:
            raise QuorumTimeout(f"only {len(responses)} of {R} replicas responded")
        # Last-write-wins resolution
        winner = max(responses, key=lambda r: r.timestamp)
        # Read repair (fire and forget)
        for r in responses:
            if r.timestamp < winner.timestamp:
                asyncio.create_task(self._write_one(r.node, key, winner.value, winner.timestamp))
        return winner.value

Common pitfalls

  • Treating QUORUM as linearizable. Quorum reads guarantee you see some latest acknowledged write, not the very latest write that completed in real-time order. Two clients reading at the same wall time can see different values if a write is in flight. For linearizability you need Paxos/Raft or read-with-write-fence protocols.
  • Setting W = 1 thinking it's "fast and probably consistent." R + W = R + 1 ≤ N for any reasonable R, so overlap is not guaranteed. You'll read stale values until anti-entropy (Merkle tree comparison, hinted handoff) catches up.
  • Confusing strict and sloppy quorums. Cassandra's QUORUM is strict — counts only the original N replicas. ANY accepts hinted writes to nodes that aren't replicas at all. Both call themselves quorums; only one preserves the invariant in real-time.
  • Missing tail latency. Quorum read latency = max of the R slowest replica RTTs. If your p99 replica latency is 10x p50, your QUORUM read p99 is much worse than you'd guess from R replica averages. Always benchmark with realistic tail behavior.
  • Forgetting cross-DC LOCAL_QUORUM semantics. A LOCAL_QUORUM read in DC1 doesn't see writes that have only landed in DC2 — even ones acknowledged at DC2's LOCAL_QUORUM. Use EACH_QUORUM (write into a quorum in every DC) when you need cross-DC consistency.

Performance characteristics

  • Quorum read latency = max of R replica RTTs. For N=5, R=3, intra-DC: typically p50 ~2 ms, p99 ~15 ms. Cross-DC R=3: p50 ~50 ms (intercontinental). Always tail-latency-dominated.
  • Quorum write latency = max of W replica fsync times. For commodity SSDs at p99 fsync ~10 ms, W=3 sees p99 ~30 ms. Battery-backed write cache cuts this to ~3 ms.
  • Throughput scales with N × W^-1. Each write uses W replicas of disk bandwidth out of N total. W=3, N=5 means each write occupies 60% of cluster write capacity per key.
  • Sibling overhead. With R = W = 1 and concurrent writers, you'll see siblings on roughly 0.1–1% of reads under contention — each requires application-level resolution. With QUORUM, siblings vanish.
  • Read repair amplification. Inline read repair turns 10–30% of reads into writes (to lagging replicas), adding 5–15% to total write load. Cassandra's read_repair_chance = 0.1 bounds this.

Frequently asked questions

What does R + W > N actually guarantee?

By the pigeonhole principle, any subset of W replicas and any subset of R replicas from a total of N must share at least one node when R + W is strictly greater than N. That shared node has both the latest write (it was in the write quorum) and is visible to the current reader (it's in the read quorum). So the reader is guaranteed to see at least one replica carrying the latest acknowledged write — last-write-wins consistency without consensus. For N=5, configurations like (R=3, W=3), (R=4, W=2), or (R=2, W=4) all satisfy the rule.

What is sloppy quorum and how does it differ?

A strict quorum requires W writes to land on the N replicas originally responsible for the key. A sloppy quorum (Dynamo's term) accepts any W successful writes — even to non-replica nodes that happen to be alive — and stores them as hints to be forwarded later. Sloppy quorums sacrifice the R + W > N invariant to maintain write availability during partitions; correctness is restored later via hinted handoff. Cassandra's ANY consistency level is sloppy; QUORUM and ALL are strict.

Is R = W = N/2 + 1 the same as Paxos or Raft?

No. Paxos and Raft are consensus algorithms that pick a single value that all replicas eventually agree on, using majority quorums for both prepare and accept phases. Quorum consensus in Dynamo-style stores has no agreement protocol — concurrent writes coexist as siblings until a read merges them (LWW by timestamp, vector clock comparison, or application-supplied resolver). They share the term "quorum" but the guarantees differ: Paxos/Raft guarantee linearizability; quorum reads-writes guarantee read-your-writes within a session and eventual consistency overall.

What does R = W = 1 mean in practice?

R = W = 1 is the cheapest, fastest, weakest setting: a write acknowledges as soon as any single replica accepts it; a read returns whatever the first replica says. R + W = 2 ≤ N=3, so the overlap property fails — a reader can hit a replica that missed the latest write. You get AP behavior: full availability under partitions, weak consistency. Use it for high-write throughput workloads where staleness is tolerable (counters, view counts, telemetry) — Cassandra's ONE consistency level.

What does R = W = N mean?

R = W = N is the strictest knob: every read and every write must reach every replica. Reads see the absolute latest value because they touch every node; writes succeed only if every node accepts. Availability is at the mercy of the slowest replica: if any one is down, all writes (or reads) fail. Cassandra's ALL consistency level. Read latency = max(N replica RTTs); write latency = max(N replica fsyncs). Usually overkill — R + W > N with R or W set to a quorum buys consistency at much lower latency.

How does quorum latency scale with cluster size?

Quorum read latency = max of the slowest R replica RTTs (the call returns when R responses are in). For R = N/2 + 1, that's the (N/2 + 1)-th fastest replica — typically dominated by tail latency. With N=5 and replicas at p50=2ms / p99=20ms, a QUORUM (R=3) read returns around the p60 latency of one replica — maybe 4-6ms. A read with R=5 (ALL) would wait for the slowest, often 30+ms. The tail-latency penalty is why most systems default to QUORUM, not ALL.

Where does the quorum idea originate?

Quorum-based replication predates Dynamo by decades. Gifford's 1979 paper Weighted Voting for Replicated Data introduced the R + W > N rule for read-write consistency. Thomas's 1979 majority consensus algorithm came in parallel. Dynamo (DeCandia et al., Amazon, 2007) popularized the R, W, N notation as user-tunable knobs and brought it into the operational vocabulary of cloud-scale systems. Cassandra inherited the model directly; Riak adopted it with vector-clock-based conflict resolution.