Distributed Systems

Chain Replication

A straight line that turns N replicas into one strongly-consistent store

Chain replication arranges replicas in a linear chain: writes enter at the head and propagate node-by-node to the tail, while all reads are served by the tail, giving strong consistency with throughput that scales across the chain.

ConsistencyLinearizable
Write pathHead → … → Tail
Read pathTail only
Write latencyO(N) hops
Fault toleranceN − 1 failures

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How chain replication works

Most replication schemes broadcast: a primary fans every write out to all its backups at once and waits for them to answer. Chain replication does the opposite. It lines the replicas up in a fixed order — call them R₁, R₂, … , Rₙ — and pushes each write down the line one hop at a time. The first node is the head, the last is the tail, and the protocol splits responsibilities cleanly between them:

Writes (any update operation) go only to the head. The head applies the update locally, then forwards it to its successor, which applies it and forwards again, all the way to the tail.
Reads (any query) go only to the tail. The tail answers from its local state.
Acknowledgement. When the tail applies a write, it both replies to the client and sends an ack back up the chain. Only when a node receives that ack does it consider the update committed and free to discard from its pending log.

The protocol was introduced by Robbert van Renesse and Fred Schneider in their 2004 OSDI paper "Chain Replication for Supporting High Throughput and Availability." Their key observation: because the tail is both the last node to apply a write and the only node that answers reads, a value is never visible to a reader until it lives on every replica. That single ordering fact is what buys strong consistency — no quorums, no version vectors, no read-repair.

Formally, each node keeps two pieces of state per object: the latest applied value, and a queue of pending updates it has forwarded but not yet seen acknowledged. The tail's queue is always empty (it is the source of acks), so the tail's value is by definition the committed value. Reads from the tail are therefore guaranteed to reflect all writes that have completed.

The update-propagation invariant

The correctness of chain replication rests on one invariant, which van Renesse and Schneider call the Update Propagation Invariant. Number the updates a node has processed; for any node, the set of updates it has seen is a prefix of the set its predecessor has seen. Written out for nodes i and i+1:

Hist(R_{i+1})  is a prefix of  Hist(R_i)

In plain terms: a downstream node can never be ahead of an upstream one. The head has seen the most updates, the tail the fewest, and every node in between sits at a consistent point in that linear history. Because updates flow strictly in order and only forward, no node ever needs to reconcile conflicting versions — the chain order is the total order.

Two derived facts follow immediately:

The tail's history is a prefix of every other node's history, so the tail holds exactly the set of committed updates. Reading the tail is reading the committed state.
If node R_i fails, its successor R_{i+1} already holds a prefix of R_i's history. The predecessor R_{i-1} holds a superset. So recovery is just "send R_{i+1} the updates it is missing, then splice it onto R_{i-1}" — no data is lost.

Failures and reconfiguration

Chain replication assumes fail-stop nodes (they crash cleanly and detectably; they don't lie) and relies on a separate, fault-tolerant master — typically a small Paxos or Raft group — that monitors liveness and tells nodes who their neighbours are. The master is on the control path, not the data path, so it never becomes a throughput bottleneck. Three failure cases:

Head fails. The master promotes R₂ to head. Any updates the old head accepted but never forwarded are simply lost — they were never acknowledged, so no client ever saw them, and consistency holds.
Tail fails. The master promotes R_{n-1} to tail. By the invariant the old tail's history was a prefix of the new tail's, so the new tail already holds every update the old tail had committed — plus, possibly, updates it received but had not yet seen acknowledged. Those extra updates simply become committed under the new tail and are acked back up the chain. No committed write is lost.
Middle node fails. The master splices it out: it tells R_{i-1} that its new successor is R_{i+1}, and has R_{i-1} re-send any pending updates that R_{i+1} had not yet received (the difference of their pending queues). The chain shortens by one and keeps running.

The system stays available for writes and reads as long as at least one replica survives, tolerating up to N − 1 simultaneous failures — strictly stronger than a majority-quorum system, which stops accepting writes once it loses a majority (it tolerates only ⌊(N−1)/2⌋ failures).

When to choose chain replication

Read-heavy strongly-consistent stores where you want linearizable reads without the cost of quorum reads. The tail (or, with CRAQ, every node) answers reads from local state.
High write throughput on commodity networks — each node sends to exactly one successor, so no single machine's NIC has to fan out to the whole replica set.
Storage backends and metadata services — it underpins systems like FAWN-KV, Hibari, and Microsoft's chain-replicated object store described in the original research, with CRAQ as the production-oriented refinement for read-heavy workloads.
Simple operational reasoning — "writes left, reads right, master fixes the chain" is far easier to debug than version-vector reconciliation.

Avoid it when write latency is critical (the linear path is slow), when nodes are geographically distant (each hop adds a WAN round-trip), or when you face Byzantine/non-fail-stop faults the basic protocol doesn't model.

Chain replication vs other replication schemes

	Chain replication	CRAQ	Primary-backup	Majority quorum (Paxos/Raft)	Quorum read/write (Dynamo)
Consistency	Linearizable	Linearizable	Linearizable	Linearizable	Eventual (tunable)
Write path	N sequential hops	N sequential hops	1 fan-out, wait for all	1 fan-out, wait for majority	Send to N, wait for W
Write latency	O(N) hops	O(N) hops	1 round-trip	1 round-trip (parallel)	1 round-trip (parallel)
Read path	Tail only	Any node	Primary (or backups, stale)	Leader or majority	Read R replicas
Read scalability	1 node (tail)	N nodes	1 node (primary)	1 node (leader)	R of N nodes
Failures tolerated	N − 1	N − 1	N − 1	⌊(N−1)/2⌋	N − W (writes), N − R (reads)
Per-node send fan-out	1 successor	1 successor	N − 1 backups	N − 1 peers	N − 1 peers
Real-world use	FAWN-KV, Hibari, Azure store	read-heavy KV stores	MySQL replication, HDFS	etcd, ZooKeeper, Spanner	Dynamo, Cassandra, Riak

The headline trade-off: chain replication moves all the latency cost onto writes and all the throughput benefit onto the system as a whole. Because the head fans out to exactly one node, you can add replicas without overloading any single machine's network — unlike primary-backup, where the primary's bandwidth is divided among N−1 backups. The price is that a write must traverse the full chain before it commits.

What the numbers actually say

Write latency is linear in chain length. An N-node chain has N − 1 forward links, so on a 0.1 ms intra-datacenter hop a 3-node chain costs roughly 2 × 0.1 ms = 0.2 ms for the update to reach the tail, plus the ack path back up. A majority-quorum write on the same hardware commits in ~0.1 ms because the two fastest replies arrive in parallel. Stretch the chain to 7 nodes and the forward path grows to 6 × 0.1 ms = 0.6 ms, while the quorum write stays ~0.1 ms.
Throughput is bounded by the slowest link, not the sum. Each node forwards once, so steady-state write throughput is set by a single hop's bandwidth — adding replicas doesn't reduce it, unlike primary-backup where the primary's uplink is split N−1 ways.
Reads bottleneck at the tail. In the original protocol, all read load lands on one machine. The van Renesse–Schneider simulations showed chain replication delivering comparable or higher throughput than primary-backup precisely because the head's write fan-out is 1 instead of N−1 — but a read-skewed workload saturates the tail. CRAQ's measurements report read throughput scaling near-linearly with N on read-mostly workloads (e.g. ~3× on a 3-node chain, ~7× on 7 nodes) when reads are mostly clean.
Fault tolerance is N − 1. A 3-node chain survives 2 failures; a 3-node Paxos group survives only 1. To match chain replication's N−1 tolerance, a quorum system would need 2N−1 replicas.

JavaScript implementation

A minimal in-memory chain. Each node forwards writes to its successor and sends acks back to its predecessor; the tail commits and serves reads. This models the data path, not the master/reconfiguration logic.

class Node {
  constructor(id) {
    this.id = id;
    this.next = null;       // successor (toward tail)
    this.prev = null;       // predecessor (toward head)
    this.store = new Map(); // key -> committed value
    this.pending = new Map(); // seq -> { key, value }
  }
  get isHead() { return this.prev === null; }
  get isTail() { return this.next === null; }

  // Write enters at the head only.
  write(seq, key, value, resolve) {
    this.pending.set(seq, { key, value });
    this.store.set(key, value);     // apply locally (dirty until acked)
    if (this.isTail) {
      this.commit(seq);
      resolve(value);               // reply to client from the tail
    } else {
      this.next.write(seq, key, value, resolve); // forward one hop
    }
  }

  // Ack flows back up the chain; each node commits and forwards the ack.
  commit(seq) {
    this.pending.delete(seq);       // now committed; drop from pending
    if (this.prev) this.prev.commit(seq);
  }

  // Reads are served by the tail.
  read(key) {
    if (!this.isTail) throw new Error('reads must hit the tail');
    return this.store.get(key);
  }
}

class Chain {
  constructor(n) {
    this.nodes = Array.from({ length: n }, (_, i) => new Node(i));
    for (let i = 0; i < n - 1; i++) {
      this.nodes[i].next = this.nodes[i + 1];
      this.nodes[i + 1].prev = this.nodes[i];
    }
    this.head = this.nodes[0];
    this.tail = this.nodes[n - 1];
    this.seq = 0;
  }
  write(key, value) {
    return new Promise(resolve =>
      this.head.write(++this.seq, key, value, resolve));
  }
  read(key) { return this.tail.read(key); }
}

// Usage
const chain = new Chain(3);
await chain.write('x', 42);
console.log(chain.read('x')); // 42 — visible only after reaching the tail

Two details mirror the real protocol. First, a node applies the write locally before forwarding, but the value is only "committed" once the ack returns — that pending/committed distinction is exactly what CRAQ exploits. Second, reads are physically rejected anywhere but the tail, enforcing the read-path rule in code rather than convention.

Python implementation — with CRAQ-style reads

This version adds the CRAQ optimization: any node may serve a read, returning its clean value directly, but querying the tail for the committed version number when it holds a dirty (uncommitted) update.

class Node:
    def __init__(self, nid):
        self.id = nid
        self.next = None
        self.prev = None
        self.versions = {}   # key -> list of (seq, value)
        self.clean = {}      # key -> seq that is fully committed

    @property
    def is_head(self): return self.prev is None
    @property
    def is_tail(self): return self.next is None

    def write(self, seq, key, value):
        self.versions.setdefault(key, []).append((seq, value))
        if self.is_tail:
            self.clean[key] = seq          # tail commits immediately
            self._ack(seq, key)            # send ack upstream
        else:
            self.next.write(seq, key, value)

    def _ack(self, seq, key):
        self.clean[key] = seq
        # drop versions older than the committed one
        self.versions[key] = [(s, v) for (s, v) in self.versions[key] if s >= seq]
        if self.prev:
            self.prev._ack(seq, key)

    # CRAQ read: serve locally if clean, else ask the tail which seq is committed.
    def read(self, key, tail):
        vers = self.versions.get(key, [])
        if not vers:
            return None
        latest_seq = vers[-1][0]
        if self.clean.get(key) == latest_seq:
            return vers[-1][1]             # clean -> serve locally
        committed_seq = tail.committed_seq(key)   # one hop to the tail
        for s, v in vers:
            if s == committed_seq:
                return v
        return vers[0][1]

    def committed_seq(self, key):
        return self.clean.get(key)


class Chain:
    def __init__(self, n):
        self.nodes = [Node(i) for i in range(n)]
        for i in range(n - 1):
            self.nodes[i].next = self.nodes[i + 1]
            self.nodes[i + 1].prev = self.nodes[i]
        self.head, self.tail = self.nodes[0], self.nodes[-1]
        self.seq = 0

    def write(self, key, value):
        self.seq += 1
        self.head.write(self.seq, key, value)

    # Apportioned query: any node can answer.
    def read(self, key, node_index=0):
        return self.nodes[node_index].read(key, self.tail)


chain = Chain(3)
chain.write('x', 42)
print(chain.read('x', node_index=0))  # 42, served from the head (clean)

The CRAQ insight is subtle: a node serves a stale-looking local value only when it knows that value is committed (clean). The instant a write is in flight, the node has a dirty version and must defer to the tail's authority — one extra hop, but only on contended keys. On read-mostly workloads almost every read is clean, so reads scale across all N replicas.

Variants worth knowing

CRAQ (Chain Replication with Apportioned Queries). Terrace and Freedman, 2009. The variant above: read load is spread across the whole chain instead of pinned to the tail, turning read throughput from O(1) into O(N) on read-heavy workloads while preserving linearizability.

FAWN-KV (chain replication over low-power flash nodes). A storage system that ran chain replication across an array of wimpy, flash-backed nodes, demonstrating that the one-successor write pattern keeps low-power, low-bandwidth machines from saturating their network links.

Object-distributed chains (sharded chains). Rather than one global chain, partition the key space and run an independent chain per shard, with each physical node playing head for some shards, middle for others, tail for the rest. This balances the tail-read load across the cluster even without CRAQ and underlies Microsoft's chain-replicated object store.

Hibari / "long chains for durability." Lengthening the chain raises durability (more copies) at the cost of write latency — an explicit dial that quorum systems don't expose as directly.

Geo-chains with relaxed reads. When replicas span datacenters, some deployments allow bounded-staleness reads from intermediate nodes to avoid the full WAN traversal, trading a slice of consistency for latency.

Common bugs and edge cases

Serving reads from a non-tail node without CRAQ. The whole point of tail-only reads is that the tail holds exactly the committed set. Read a middle node directly and you may return an uncommitted value that a later failure erases — a linearizability violation.
Acknowledging the client before the tail commits. If the head replies as soon as it applies a write, a head-side failure can lose data the client believes is durable. The ack must originate at the tail.
Split-brain from a partitioned master. If two master instances each believe they own the chain, they can form two chains that diverge. The master must itself be a real consensus group (Paxos/Raft) with a single leader.
Forgetting to replay the pending-queue diff on splice. When a middle node is removed, the new successor may be missing updates the predecessor already forwarded downstream; failing to re-send the difference silently drops them.
Assuming N−1 fault tolerance under correlated failure. The N−1 bound holds for independent fail-stop crashes. A rack power loss or a bad deploy can take the whole chain at once — durability still needs cross-failure-domain placement.
Long chains for "more safety." Each extra node adds a full hop of write latency for marginal durability gains. Past 3–5 nodes the latency cost usually outweighs the benefit; shard instead.

Frequently asked questions

Why does chain replication give strong consistency for free?

Because a write is not acknowledged to the client until it reaches the tail, and the tail is the single node that serves all reads. By the time any client can observe a value, every replica already holds it, so reads and writes are linearizable without quorum voting or version reconciliation.

How does chain replication differ from primary-backup replication?

Primary-backup sends every write from one primary to all backups in parallel and waits for all acks, so the primary's outbound network is the bottleneck. Chain replication forwards a write one hop at a time down a line, so each node only talks to its successor — the per-node forwarding load is constant regardless of chain length, and read load is offloaded entirely to the tail.

What happens when a node in the chain fails?

A separate, fault-tolerant master detects the failure and reconfigures the chain. A failed head promotes its successor to head; a failed tail promotes its predecessor to tail; a failed middle node is spliced out by reconnecting its predecessor to its successor, replaying any in-flight updates the successor missed. The chain stays available as long as one replica survives.

What is CRAQ and why was it invented?

CRAQ (Chain Replication with Apportioned Queries) lets every node in the chain serve reads, not just the tail, while keeping strong consistency. A node serves a clean (fully-committed) value locally, but if it holds a dirty (not-yet-acknowledged) version it asks the tail which version is committed. This spreads read load across all N replicas instead of overloading one tail.

Does chain replication have higher write latency than quorum systems?

Yes — write latency grows linearly with chain length because the update visits every node in sequence. A 3-node chain traverses 2 forward links to reach the tail (N − 1 hops in general), whereas a majority-quorum system commits after the fastest 2 of 3 replicas reply in parallel. Chain replication trades write latency for simpler consistency reasoning and cheaper, tail-only reads.

Can chain replication lose committed data?

Not under its fail-stop assumption: once the tail commits, every upstream node already has the update, so any single failure leaves the value intact. It can lose data if the entire chain is lost simultaneously, if the master is partitioned and elects two chains (split-brain), or under non-fail-stop faults like silent disk corruption that the protocol does not model.