Distributed Systems

Quorum Consensus

Pick any read set R and write set W. If R + W > N, they must overlap.

Quorum consensus guarantees consistency by requiring any read quorum and any write quorum to share at least one replica. Choose R and W on a per-call basis to slide between strict consistency (R + W > N) and high availability (R = W = 1). Dynamo, Cassandra, and Riak made it a household idiom.

Core invariantR + W > N ⇒ overlap
Strict consistencyR = W = ⌈(N+1)/2⌉
Max availability (AP)R = W = 1
Read latencymax of slowest R replica RTTs
Originated byGifford 1979 (weighted voting)
Used inDynamo, Cassandra, Riak, Voldemort

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The overlap invariant

Place N replicas in a row. Pick any W of them for a write; pick any R of them for a read. If R + W > N, the two subsets are too big to be disjoint — by the pigeonhole principle, they must share at least one replica. That shared replica saw the write (it was in the write quorum) and is now visible to the reader (it's in the read quorum). The reader is guaranteed to find at least one copy of the freshest acknowledged value.

Concretely, for N = 5: a write to W = 3 replicas and a read from R = 3 replicas. R + W = 6 > 5, so the read set must intersect the write set in at least one node. The reader collects three responses, finds the response with the highest timestamp (or version vector), and returns it.

If you set R + W ≤ N, overlap is not guaranteed: a write can land on replicas {1, 2} while a read hits {3, 4} — disjoint sets, no shared replica, stale read. That's the AP knob: maximum availability, eventual consistency only.

The three knobs: N, R, W

N — replication factor. The total number of replicas that hold each key. Set per keyspace or table (Cassandra's replication_factor). Common values: 3 in single-datacenter, 5 across two datacenters, 9 across three regions.
W — write quorum. Number of replicas that must acknowledge a write before the client gets success. W = 1: fastest writes, weakest guarantee. W = N: strict, slowest.
R — read quorum. Number of replicas a read must collect responses from before returning. Same trade-off: R = 1 fast, R = N slow.

The genius of Dynamo's API was exposing all three to the application — and letting each call pick its own. A "post the comment" call might run W = ALL (don't lose it); the subsequent "fetch comments" can run R = ONE (stale is fine for a refresh). One database, two consistency models, chosen at the call site.

Configuration variants

Configuration	Cassandra level	Property	Use case
R = W = 1	ONE / ANY	Eventual consistency (AP)	High-volume telemetry, counters
R = 1, W = N	ALL writes, ONE reads	Read-optimized — every read fresh	Read-heavy workloads where writes are rare
R = N, W = 1	ONE writes, ALL reads	Write-optimized — writes never block	Bursty ingestion with tolerable read latency
R + W > N (both quorums)	QUORUM	Read-your-writes, balanced	General-purpose strong-ish consistency
R = W = N	ALL	Strongest — every replica agrees	Critical config; rare in practice
R + W > N + DC_local	LOCAL_QUORUM	Strong consistency within one DC	Geo-replicated apps with regional reads

Cassandra's LOCAL_QUORUM is the most common production setting: a quorum within the local datacenter (not the whole cluster), so reads and writes don't cross regional latency. Cross-region consistency is sacrificed for speed; eventual consistency between regions is handled by replication.

When quorum consensus is the right pick

You want per-call consistency control. A single keyspace serving both critical user data and bulk telemetry — tune R and W at the call site rather than partitioning into separate databases.
You can tolerate sibling resolution. Two clients writing concurrently to the same key produce siblings (vector clock concurrent versions). Your application must reconcile (LWW, set union, CRDT) or accept whatever the database picks.
You need partition tolerance plus tunable consistency. Unlike Paxos/Raft (which block under partition), quorum stores stay up and accept writes to whatever majority is reachable.
You're storing high-volume, low-criticality data. Time-series metrics, session data, user-generated content where loss of single events is tolerable but throughput is paramount.

Quorum consensus is the wrong choice for linearizable transactions across multiple keys (use Spanner, CockroachDB, or a Raft-based store), for strong consistency with a single source of truth (use Postgres), or when sibling resolution would be impossible to define (financial ledger entries).

Quorum vs Paxos/Raft vs leader-based

	Quorum (Dynamo)	Paxos / Raft	Single-leader (PG-style)
Consistency model	Tunable (eventual to strong)	Linearizable	Linearizable (writes) + read-your-writes
Per-call tunability	Yes	No	No
Write availability under partition	If any W replicas reachable	Only on majority side	Only on primary side
Read latency (typical)	max of R RTTs	Local on follower (with bounded staleness)	Round-trip to primary
Write latency (typical)	max of W RTTs	2 round-trips (propose + accept)	fsync + replication ack
Conflict resolution	App-level (siblings)	Built-in (consensus)	None needed (single writer)
Production examples	Cassandra, DynamoDB, Riak	etcd, ZooKeeper, CockroachDB	PostgreSQL, MySQL with sync replicas

The mental model: Paxos/Raft is "vote to pick a single value, all replicas converge." Quorum is "let everyone write whatever, but make sure read and write sets overlap so the latest acknowledged value is visible." Different problems, different trade-offs — and worth understanding both before reaching for either.

Pseudo-code

// Quorum write — coordinator-side
write(key, value, W):
    timestamp = clock.now()
    replicas  = replicas_for(key)         // N nodes
    futures   = [send_async(r, key, value, timestamp) for r in replicas]
    successes = 0
    errors    = []
    for f in await_any(futures, count=W, timeout=write_timeout):
        if f.ok: successes++
        else:    errors.append(f.err)
        if successes >= W:
            return OK
    if successes < W:
        return TIMEOUT_OR_PARTIAL(errors)

// Quorum read — coordinator-side
read(key, R):
    replicas = replicas_for(key)
    futures  = [send_async(r, key) for r in replicas]
    responses = []
    for f in await_any(futures, count=R, timeout=read_timeout):
        if f.ok: responses.append(f.value_with_timestamp)
    // Pick freshest. For LWW: max by timestamp.
    // For vector clocks: union of concurrent versions → app resolves.
    winner = resolve_conflict(responses)
    // Optional: read-repair — push winner to lagging replicas
    for r in replicas:
        if r.version < winner.version: send_async(r, key, winner)
    return winner.value

Python implementation sketch

import asyncio
from dataclasses import dataclass
from typing import List

@dataclass
class Response:
    node: str
    value: str
    timestamp: int

class QuorumClient:
    def __init__(self, replicas: List[str], N: int):
        self.replicas = replicas        # all N replica nodes
        self.N = N

    async def write(self, key, value, W, timestamp):
        sends = [self._write_one(r, key, value, timestamp) for r in self.replicas]
        successes = 0
        for coro in asyncio.as_completed(sends, timeout=2.0):
            try:
                if await coro:
                    successes += 1
                    if successes >= W:
                        return True
            except Exception:
                pass
        raise QuorumTimeout(f"only {successes} of {W} writes succeeded")

    async def read(self, key, R):
        sends = [self._read_one(r, key) for r in self.replicas]
        responses: List[Response] = []
        for coro in asyncio.as_completed(sends, timeout=1.5):
            try:
                resp = await coro
                if resp is not None:
                    responses.append(resp)
                    if len(responses) >= R:
                        break
            except Exception:
                pass
        if len(responses) < R:
            raise QuorumTimeout(f"only {len(responses)} of {R} replicas responded")
        # Last-write-wins resolution
        winner = max(responses, key=lambda r: r.timestamp)
        # Read repair (fire and forget)
        for r in responses:
            if r.timestamp < winner.timestamp:
                asyncio.create_task(self._write_one(r.node, key, winner.value, winner.timestamp))
        return winner.value

Common pitfalls

Treating QUORUM as linearizable. Quorum reads guarantee you see some latest acknowledged write, not the very latest write that completed in real-time order. Two clients reading at the same wall time can see different values if a write is in flight. For linearizability you need Paxos/Raft or read-with-write-fence protocols.
Setting W = 1 thinking it's "fast and probably consistent." R + W = R + 1 ≤ N for any reasonable R, so overlap is not guaranteed. You'll read stale values until anti-entropy (Merkle tree comparison, hinted handoff) catches up.
Confusing strict and sloppy quorums. Cassandra's QUORUM is strict — counts only the original N replicas. ANY accepts hinted writes to nodes that aren't replicas at all. Both call themselves quorums; only one preserves the invariant in real-time.
Missing tail latency. Quorum read latency = max of the R slowest replica RTTs. If your p99 replica latency is 10x p50, your QUORUM read p99 is much worse than you'd guess from R replica averages. Always benchmark with realistic tail behavior.
Forgetting cross-DC LOCAL_QUORUM semantics. A LOCAL_QUORUM read in DC1 doesn't see writes that have only landed in DC2 — even ones acknowledged at DC2's LOCAL_QUORUM. Use EACH_QUORUM (write into a quorum in every DC) when you need cross-DC consistency.

Performance characteristics

Quorum read latency = max of R replica RTTs. For N=5, R=3, intra-DC: typically p50 ~2 ms, p99 ~15 ms. Cross-DC R=3: p50 ~50 ms (intercontinental). Always tail-latency-dominated.
Quorum write latency = max of W replica fsync times. For commodity SSDs at p99 fsync ~10 ms, W=3 sees p99 ~30 ms. Battery-backed write cache cuts this to ~3 ms.
Throughput scales with N × W^-1. Each write uses W replicas of disk bandwidth out of N total. W=3, N=5 means each write occupies 60% of cluster write capacity per key.
Sibling overhead. With R = W = 1 and concurrent writers, you'll see siblings on roughly 0.1–1% of reads under contention — each requires application-level resolution. With QUORUM, siblings vanish.
Read repair amplification. Inline read repair turns 10–30% of reads into writes (to lagging replicas), adding 5–15% to total write load. Cassandra's read_repair_chance = 0.1 bounds this.

Frequently asked questions

What does R + W > N actually guarantee?

By the pigeonhole principle, any subset of W replicas and any subset of R replicas from a total of N must share at least one node when R + W is strictly greater than N. That shared node has both the latest write (it was in the write quorum) and is visible to the current reader (it's in the read quorum). So the reader is guaranteed to see at least one replica carrying the latest acknowledged write — last-write-wins consistency without consensus. For N=5, configurations like (R=3, W=3), (R=4, W=2), or (R=2, W=4) all satisfy the rule.

What is sloppy quorum and how does it differ?

A strict quorum requires W writes to land on the N replicas originally responsible for the key. A sloppy quorum (Dynamo's term) accepts any W successful writes — even to non-replica nodes that happen to be alive — and stores them as hints to be forwarded later. Sloppy quorums sacrifice the R + W > N invariant to maintain write availability during partitions; correctness is restored later via hinted handoff. Cassandra's ANY consistency level is sloppy; QUORUM and ALL are strict.

Is R = W = N/2 + 1 the same as Paxos or Raft?

No. Paxos and Raft are consensus algorithms that pick a single value that all replicas eventually agree on, using majority quorums for both prepare and accept phases. Quorum consensus in Dynamo-style stores has no agreement protocol — concurrent writes coexist as siblings until a read merges them (LWW by timestamp, vector clock comparison, or application-supplied resolver). They share the term "quorum" but the guarantees differ: Paxos/Raft guarantee linearizability; quorum reads-writes guarantee read-your-writes within a session and eventual consistency overall.

What does R = W = 1 mean in practice?

R = W = 1 is the cheapest, fastest, weakest setting: a write acknowledges as soon as any single replica accepts it; a read returns whatever the first replica says. R + W = 2 ≤ N=3, so the overlap property fails — a reader can hit a replica that missed the latest write. You get AP behavior: full availability under partitions, weak consistency. Use it for high-write throughput workloads where staleness is tolerable (counters, view counts, telemetry) — Cassandra's ONE consistency level.

What does R = W = N mean?

R = W = N is the strictest knob: every read and every write must reach every replica. Reads see the absolute latest value because they touch every node; writes succeed only if every node accepts. Availability is at the mercy of the slowest replica: if any one is down, all writes (or reads) fail. Cassandra's ALL consistency level. Read latency = max(N replica RTTs); write latency = max(N replica fsyncs). Usually overkill — R + W > N with R or W set to a quorum buys consistency at much lower latency.

How does quorum latency scale with cluster size?

Quorum read latency = max of the slowest R replica RTTs (the call returns when R responses are in). For R = N/2 + 1, that's the (N/2 + 1)-th fastest replica — typically dominated by tail latency. With N=5 and replicas at p50=2ms / p99=20ms, a QUORUM (R=3) read returns around the p60 latency of one replica — maybe 4-6ms. A read with R=5 (ALL) would wait for the slowest, often 30+ms. The tail-latency penalty is why most systems default to QUORUM, not ALL.

Where does the quorum idea originate?

Quorum-based replication predates Dynamo by decades. Gifford's 1979 paper Weighted Voting for Replicated Data introduced the R + W > N rule for read-write consistency. Thomas's 1979 majority consensus algorithm came in parallel. Dynamo (DeCandia et al., Amazon, 2007) popularized the R, W, N notation as user-tunable knobs and brought it into the operational vocabulary of cloud-scale systems. Cassandra inherited the model directly; Riak adopted it with vector-clock-based conflict resolution.