Distributed Systems

CAP Theorem

The forced choice every distributed database has to make

The CAP theorem says a distributed data store cannot simultaneously offer linearizable consistency, total availability, and tolerance to network partitions. Once a partition occurs, the system must choose between answering with stale data or refusing to answer at all.

  • Stated byEric Brewer, 2000 (proven 2002)
  • Choices during a partitionCP or AP — not both
  • CP examplesetcd, ZooKeeper, Spanner, HBase
  • AP examplesDynamoDB, Cassandra, Riak, Voldemort
  • Modern refinementPACELC (2010)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

What CAP actually says

Eric Brewer stated the conjecture in a 2000 PODC keynote; Gilbert and Lynch proved it formally in 2002. The three properties:

  • Consistency (C) — every read returns the most recent write or an error. Formally, this is linearizability: there exists some total order of operations consistent with real time, and every read sees a value at least as recent as that order requires.
  • Availability (A) — every request to a non-failing node returns a response (not necessarily the latest one). No timeouts, no error replies, no "I don't know."
  • Partition tolerance (P) — the system continues to operate when arbitrary messages between nodes are dropped or delayed indefinitely. The cluster is split into two groups that can't communicate.

The theorem says no system that replicates data across nodes can guarantee all three at once when a partition is in progress. The proof is a one-paragraph thought experiment: partition the cluster into halves A and B. Client writes value v1 to A, then immediately reads from B. If the read returns the old value, you broke C. If the read blocks waiting for A, you broke A. If the system claims partition tolerance, you can't avoid the partition. So one of {C, A, P} must give.

The crucial subtlety: P is not a choice. In any wide-area distributed system, partitions happen — link flaps, switch reboots, dropped packets, region-wide outages. Brewer himself clarified in 2012 that the real CAP choice is how to behave during a partition: prioritize C (refuse some requests) or prioritize A (return possibly stale data).

How to apply CAP in design

  • Pick CP when correctness matters more than uptime: financial ledgers, inventory counters, configuration metadata, distributed locks. Refuse a write rather than commit a duplicate.
  • Pick AP when uptime matters more than freshness: shopping cart, social feed, user profile, sensor telemetry. A slightly stale response is better than no response.
  • Reach for stronger eventual-consistency guarantees (read-your-writes, monotonic reads, causal) before falling back to plain "eventual." They cost only metadata bytes per request.
  • Use CRDTs when you need automatic, deterministic merge of concurrent updates without human intervention.

The same product often needs both shapes. A bank stores accounts in a CP system but caches the home-page balance in an AP cache. The AP cache says "balance approximately $1,234, refreshed 8 seconds ago"; the transfer button consults the CP store. Don't pick one religion — pick per data type.

CAP triangle: where do real systems sit?

TypeConsistency offeredDuring partitionNotes
etcd / ZooKeeperCPLinearizable (Raft / Zab quorum)Minority side returns errorsConfiguration / lock service; small data, few writers
Google SpannerCPExternal consistency via TrueTimeMinority side stops accepting writesGlobal SQL with bounded clock skew; PACELC PC/EC
HBase / BigtableCPStrong per rowRegion offline if RegionServer unreachableOne server per region; partition = region outage
CockroachDB / TiKVCPSerializable per range (Raft)Minority range unavailableSharded Raft groups; failure scope is per range, not cluster-wide
Amazon DynamoDBAP (default)Eventual or strongly-consistent reads (opt-in)Replicas serve last-known valueStrong-read mode trades partition availability for read consistency
Apache CassandraAPTunable (ONE / QUORUM / ALL)Behaviour follows the tunable levelQUORUM reads + QUORUM writes approximate CP at the cost of unavailability when half is down
Riak / VoldemortAPEventual + version vectors / sibling resolutionAll replicas always writableConflict resolution pushed to client; sibling values returned on read
Single-node Postgres"CA" (degenerate)LinearizableNo partition possibleReplication adds CAP trade-offs to the picture

Two dimensions matter beyond the triangle. First, granularity: Cassandra's "AP" applies per-key, not per-cluster — you can run QUORUM operations that behave CP for individual reads. Second, what consistency level the AP system provides during partitions — eventual, monotonic, causal, or strong-eventual-via-CRDT — covers a wide spectrum of programmer experience.

What the trade-off actually costs

CP under partition: the minority side of a partition is unavailable. If your 5-node cluster splits 3-2, the 3 keep working at full throughput; the 2 reject every write and most reads. For a globally distributed service, that means whatever fraction of users hit the minority partition see errors until the network heals. With well-tuned timeouts, partitions visible to clients are typically 30–60 seconds — but cross-region partitions in 2025 cloud incidents have lasted hours.

AP under partition: every node accepts writes locally, but the same key may now have conflicting versions on different sides of the partition. When the partition heals, the system has to merge them. Last-writer-wins discards updates silently. Vector clocks or CRDTs preserve them but cost ~16–64 bytes of metadata per record (vector clocks scale with the number of writers). Riak famously returns "siblings" — multiple concurrent values — and forces the application to resolve them. That's a real engineering tax on the read path.

Even outside partitions, the consistency choice has steady-state cost (PACELC). A linearizable read in a 3-region Spanner deployment requires reading from a quorum of regions — 80–150 ms. A DynamoDB eventual-consistency read hits the local replica in 5–10 ms. The CP cost is paid on every operation, not just during outages. That's why latency-sensitive services like ad serving and personalization almost always run on AP infrastructure.

JavaScript: simulating partition and merge

// Two-replica AP store; simulate a partition, then merge with last-writer-wins
class APReplica {
  constructor(id) { this.id = id; this.data = new Map(); }

  // Each entry: { value, timestamp }
  write(key, value) {
    this.data.set(key, { value, timestamp: Date.now() });
  }
  read(key) { return this.data.get(key)?.value; }

  // Merge another replica's state — last writer wins by timestamp
  merge(other) {
    for (const [key, theirs] of other.data) {
      const mine = this.data.get(key);
      if (!mine || theirs.timestamp > mine.timestamp) {
        this.data.set(key, theirs);
      }
    }
  }
}

const a = new APReplica('A');
const b = new APReplica('B');

// === HEALTHY: both replicas in sync ===
a.write('counter', 1); b.merge(a);            // both see 1

// === PARTITION: A and B accept conflicting writes ===
a.write('counter', 5);                        // A says 5
b.write('counter', 7);                        // B says 7
console.log(a.read('counter'));               // 5 — A's view
console.log(b.read('counter'));               // 7 — B's view (diverged!)

// === HEAL: gossip the deltas ===
a.merge(b); b.merge(a);
console.log(a.read('counter'), b.read('counter'));   // both 7 (later timestamp wins)
// Note: the "5" write is silently lost. LWW is convergent but not lossless.

The simulation makes the LWW pathology concrete. Both a.write(5) and b.write(7) happened, but only the later timestamp survives. If you actually wanted to preserve both updates — say, two users adding items to a shopping cart — you need version vectors to detect concurrency, or a CRDT that merges values rather than overwriting them.

Python: probing a CP store during partition

import time, random

class CPCluster:
    """3 nodes; writes require a 2-of-3 quorum."""
    def __init__(self):
        self.nodes = {0: None, 1: None, 2: None}
        self.partition = set()                # node ids that are unreachable

    def write(self, value, timeout=0.5):
        reachable = [i for i in self.nodes if i not in self.partition]
        if len(reachable) < 2:                # no majority — refuse
            raise TimeoutError("CP store unavailable: no majority reachable")
        for i in reachable[:2]:
            self.nodes[i] = value
        return True

    def read(self):
        # Linearizable read: ask a majority and return the most recent value
        reachable = [i for i in self.nodes if i not in self.partition]
        if len(reachable) < 2:
            raise TimeoutError("CP store unavailable: no majority reachable")
        return self.nodes[reachable[0]]


cluster = CPCluster()
cluster.write("hello")                        # OK
print(cluster.read())                         # "hello"

# === PARTITION: nodes 1 and 2 unreachable from this client ===
cluster.partition = {1, 2}
try:
    cluster.write("world")                    # raises — only 1 node reachable
except TimeoutError as e:
    print(f"CP behaviour: {e}")               # refuses rather than commit

# An AP store would have written to node 0 alone and silently risked divergence.

The CP behaviour is what surprises engineers: the cluster could physically write to the one node it can reach, but doing so would risk creating a value that conflicts with whatever the majority side is doing. So the system refuses, exposing a hard error to the client instead of accepting an inconsistent write. That refusal is the "no A" half of the CAP trade-off, in code.

Refinements and extensions

  • PACELC (Daniel Abadi, 2010). Adds the steady-state question: when there is no partition (Else), do you trade Latency for Consistency? Spanner is PC/EC (strict everywhere). DynamoDB is PA/EL (loose everywhere). MongoDB with majority-write/local-read is PC/EL. PACELC captures the everyday cost most CAP discussions miss.
  • Tunable consistency. Cassandra and DynamoDB let each request specify its own quorum size. ONE/ONE is fastest but weakest; QUORUM/QUORUM gives sequential consistency with reduced availability; ALL/ALL is essentially CP with no fault tolerance. The trade-off becomes a parameter rather than a system property.
  • Strong eventual consistency. A property of CRDTs: replicas that have received the same set of updates are guaranteed to converge to the same state, regardless of order. Stronger than plain eventual consistency, weaker than linearizability. Lets you build AP systems where conflicts resolve mathematically without operator intervention.
  • Bounded staleness. Azure Cosmos DB's "bounded staleness" consistency: reads can lag the most recent write by at most K versions or T seconds. A middle ground that's enforceable and useful for many applications.
  • Harvest and yield (Fox & Brewer, 1999). A pre-CAP framing: harvest = fraction of data reflected in the result; yield = fraction of requests answered. Lets you reason about partial answers ("returned 80% of the search shards within the timeout") instead of strict A or C.
  • Probabilistically Bounded Staleness (Bailis et al., 2012). Quantifies how stale eventually-consistent reads actually are in practice — typically <10 ms even on AP systems with normal latency.

Common confusions and edge cases

  • Confusing consistency levels. "Strong consistency" in DynamoDB is per-item linearizability, not cross-item — a transaction across two keys is not atomic at strong consistency, only at strict transaction mode. Always check what the database vendor's "consistency" means for the specific guarantee you need.
  • Treating CAP as a static label. Many "AP" systems can be configured to behave CP per request (Cassandra QUORUM); many "CP" systems offer relaxed reads from local replicas (Spanner stale reads). The label on the box is the default, not the only mode.
  • Forgetting that partitions are usually asymmetric. Real partitions don't cleanly split the cluster; they often manifest as one slow link, one flapping NIC, or unidirectional packet loss. The theorem still applies, but choosing 'partition' as a binary state hides operational pain.
  • Assuming CA systems exist on networks. If you can name a CA distributed system, it's either secretly CP (refuses writes during partition) or has redefined "consistency" to something weaker than linearizability.
  • Using LWW under the assumption it's safe. Last-writer-wins silently drops concurrent updates. For a counter, a shopping cart, or a presence list, LWW corrupts data. Use CRDTs or vector-clock siblings.
  • Reading "eventually consistent" as "soon." Eventually means at some unspecified future time. Production AP systems typically converge in milliseconds, but during a partition convergence is delayed indefinitely. Design clients to tolerate read-write inconsistency during partitions, not just at steady state.

Frequently asked questions

Is CAP a strict trichotomy — can I really only pick two?

No. The pop-sci "pick two" framing is misleading. Network partitions in a wide-area system are not optional — they will happen. So you are really choosing what to do during a partition: stay consistent (CP, refuse some requests) or stay available (AP, serve possibly stale data). When the network is healthy, well-built systems offer both consistency and availability. CAP is a constraint that activates only during partitions.

What does the C in CAP actually mean?

Linearizability: every operation appears to happen instantaneously between its invocation and response, in a single total order consistent with real time. It is strictly stronger than serializability (transactions) and stronger than sequential consistency. The C in CAP is the strongest consistency model — most production "CP" systems offer slightly weaker variants like sequential or causal consistency to recover performance.

Are there really CA systems?

Only in non-distributed settings. A single-machine database is trivially CA because there are no partitions. Some replicated systems claim CA by treating any partition as a fatal cluster-wide outage — but that's just relabeling unavailability. Within a single rack with redundant networking, partitions are rare enough that CA is a defensible operational stance, but it is not a principled CAP choice.

How is PACELC different from CAP?

PACELC (Daniel Abadi, 2010) extends CAP with the question "what about when the network is fine?" During a partition (P), a system trades availability (A) for consistency (C), as in CAP. Else (E), in the absence of a partition, it trades latency (L) for consistency (C). DynamoDB is PA/EL — eventual everywhere. Spanner is PC/EC — strict consistency at the cost of higher tail latency. PACELC captures the everyday cost of consistency, not just the partition cost.

Does CAP apply to non-replicated databases?

No. CAP is about replication across nodes that can be partitioned. A single-node database has no replicas to disagree, so its trade-off is between availability and durability (does a crash mean data loss?), not consistency vs availability. Sharding without replication is also outside CAP — a partition takes a shard offline entirely, but the surviving shards never see conflicting writes.

Is eventual consistency the same as AP?

Eventual consistency is one common AP behavior: replicas may diverge during a partition but converge once it heals. AP systems can also offer stronger guarantees — read-your-writes, monotonic reads, causal consistency — by tracking metadata such as version vectors. Conversely, eventual consistency is not the strongest model an AP system can provide; CRDTs deliver strong eventual consistency where convergence is mathematically guaranteed regardless of message order.