Distributed Systems

Read Repair

Catch stale replicas while you're already paying for the read

Read repair piggybacks anti-entropy on regular read traffic. When a coordinator queries multiple replicas, it compares their versions; if any are stale, it pushes the latest value back to them inline. Cheap convergence for the hot path — Cassandra and DynamoDB lean on it heavily.

  • Triggered byEvery read with divergent replicas
  • Latency cost~1–5 ms blocking, ~0 async
  • Write amplification+10–30% when drift is high
  • Hot-key convergenceWithin seconds of write
  • Cold-key convergenceNeeds Merkle tree repair
  • Used inCassandra, DynamoDB, Riak, ScyllaDB

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How read repair works

A Dynamo-style store replicates each key to N replicas, but writes don't always land everywhere. A write at W = 3 against N = 5 leaves two replicas unpatched. A network hiccup or a brief node restart can leave gaps even at higher W. Over hours and days, replicas drift apart — same key, different versions, depending on which replica you ask.

Read repair closes the drift cheaply. Whenever a coordinator runs a read at quorum R, it has already collected R replica responses with their version metadata. Comparing those responses tells you which replicas are stale; pushing the freshest value back to the laggards is one extra write per laggard. The cost is amortized over read traffic you were going to do anyway.

The mechanic:

  1. Coordinator sends a read to all N replicas (or to R explicitly, depending on speculative-execution config).
  2. Responses come back with (value, timestamp_or_vclock).
  3. The coordinator picks the freshest version — by max timestamp for LWW, or by vector-clock dominance.
  4. For every replica that returned an older version, the coordinator sends a repair write with the fresh value.
  5. The freshest value is returned to the client (foreground: after repairs ack; background: before).

Foreground, background, and probabilistic

  • Foreground (blocking) read repair. The coordinator waits for the repair writes to acknowledge before returning the response to the client. Guarantees read-monotonicity: a subsequent read at the same consistency level sees the repaired state. Costs an extra round-trip on every read with divergence (~1–5 ms intra-DC).
  • Background (async) read repair. The coordinator returns the response immediately and fires the repair writes as fire-and-forget. No latency penalty, but a follow-up read might hit a not-yet-patched replica. Cassandra's default for ONE-level reads.
  • Probabilistic read repair. Repair only fires with some probability p per read (Cassandra's read_repair_chance, historically 0.1). Caps overhead on hot keys while still ensuring eventual convergence. Deprecated in newer Cassandra in favor of DC-local versus global controls.
  • Speculative read repair. The coordinator pre-emptively reads from all N replicas at every quorum read (not just R). Wastes bandwidth on the extra N - R reads but eliminates the gap on lower consistency levels. Used in DynamoDB's strongly-consistent reads.
  • Selective digest read. Cassandra optimization — first replica returns full data, the rest return only digests (hashes); only on mismatch does the coordinator re-fetch full data. Cuts bandwidth ~75% for the common case.

When read repair earns its keep

  • Hot-key workloads. Frequently-read keys converge within seconds even at low write quorums. Read repair scales with read traffic, so it naturally pours effort into the keys that need it.
  • Sloppy quorum environments. When writes can land on temporarily-substituted replicas (hinted handoff in flight), read repair stitches the actual replica set back together once the substitutes drain.
  • Partition recovery. When a partition heals, the two sides have divergent state. Read repair catches the discrepancy on the first read of any key touched during the partition.
  • Cross-DC consistency at LOCAL_QUORUM. Each DC's quorum reads catch divergence within the DC; cross-DC consistency comes from eventual replication plus read repair when a global read happens.

Read repair is the wrong fit when divergence is concentrated on cold keys (a key never read won't converge), when latency budgets exclude an extra round-trip for stragglers (use async only), or when application semantics demand siblings be surfaced rather than auto-resolved (then disable LWW and propagate vector-clock siblings).

Read repair vs alternatives

Read repairHinted handoffMerkle tree repairSynchronous replication
Triggered byReads with divergenceWrites to down replicasScheduled sweepsEvery write
Latency overhead per op1–5 ms blocking; 0 async0 (background)0 (offline)Slowest replica fsync
Bandwidth costR - 1 extra writes per divergent readOne replay per missed writeO(keyspace) per sweepN × value size per write
Covers cold keys?NoNoYesYes
Convergence timeSeconds (hot) → never (cold)Minutes after replica returnsHours per sweep cycleReal-time
Operational costBuilt-inBuilt-inManual schedulingThroughput hit

The pragmatic answer is to combine. Cassandra uses all three of the eventual-consistency techniques: read repair for hot paths, hinted handoff for brief outages, Merkle tree repair (nodetool repair) for the long tail. None of them alone is sufficient; together they bound replica drift in practice.

Pseudo-code

// Foreground read repair — runs at quorum read
read_with_repair(key, R):
    replicas = replicas_for(key)
    futures  = [send_read(r, key) for r in replicas]
    responses = await_first(futures, count=R, timeout=read_timeout)

    if len(responses) < R:
        return TIMEOUT

    // Determine the freshest version
    winner = max(responses, key=lambda r: r.timestamp)

    // Identify laggards
    laggards = [r for r in responses if r.timestamp < winner.timestamp]

    // Push repair writes (blocking variant — wait for ack)
    repair_futures = [send_write(r.node, key, winner.value, winner.timestamp) for r in laggards]
    await_all(repair_futures, timeout=repair_timeout)

    return winner.value

// Background variant — fire and forget
async_read_with_repair(key, R):
    responses = collect_replicas(key, R)
    winner = max(responses, key=...)

    // Fire-and-forget repair
    for r in responses:
        if r.timestamp < winner.timestamp:
            schedule_async_write(r.node, key, winner.value, winner.timestamp)

    return winner.value

JavaScript implementation

class QuorumCoordinator {
  constructor(replicas, N) {
    this.replicas = replicas; // node URLs
    this.N = N;
  }

  // Foreground (blocking) read repair
  async readWithRepair(key, R, opts = { foreground: true }) {
    const responses = await this.collectReplicas(key, R);
    if (responses.length < R) throw new Error('quorum not met');

    // Pick latest (LWW)
    const winner = responses.reduce((a, b) => (b.ts > a.ts ? b : a));

    // Identify and patch laggards
    const laggards = responses.filter(r => r.ts < winner.ts);
    const repairs = laggards.map(r =>
      this.sendWrite(r.node, key, winner.value, winner.ts)
    );

    if (opts.foreground) {
      // Block on repair acks (1-5 ms intra-DC)
      await Promise.allSettled(repairs);
    }
    // else: fire-and-forget — caller doesn't wait

    return winner.value;
  }

  async collectReplicas(key, R) {
    const allReads = this.replicas.map(node => this.sendRead(node, key));
    const responses = [];
    const settled = await Promise.allSettled(allReads);
    for (const s of settled) {
      if (s.status === 'fulfilled' && s.value) responses.push(s.value);
    }
    return responses;
  }
}

Common pitfalls

  • Background repair on critical reads. If consistency matters for follow-up reads, use foreground (blocking). Otherwise the next read can hit a still-stale replica even after the "successful" repair from the prior call.
  • Ignoring tombstones. A deleted value is represented as a tombstone (a value with the delete flag and a timestamp). Read repair must propagate tombstones the same way as live values — a buggy repair that treats "no value" as "older" will resurrect deleted keys. Cassandra's gc_grace_seconds determines how long tombstones live before being collected.
  • Read repair on time-skewed clusters. LWW conflict resolution uses timestamps. If your clocks are skewed, "latest" might actually be a write from a node with a wrong clock — and repair propagates that wrong version everywhere. Always synchronize NTP and use HLC if drift is a concern.
  • Repair amplification under high write contention. A frequently-updated counter with low read traffic can churn its repairs faster than they propagate, creating waves of "winner" oscillations. Counters are better stored as CRDTs (PN-counter, G-counter) that merge by union rather than LWW.
  • Skipping the digest optimization. Sending full values from every replica wastes 75%+ of the bandwidth for small clusters. Have stragglers return only a hash of their value; only re-fetch full data on hash mismatch.

Performance characteristics

  • Latency: blocking adds 1–5 ms intra-DC, 30–80 ms cross-DC. Single extra round-trip from coordinator to laggard. Async adds zero observable latency but loses the read-monotonicity guarantee.
  • Write amplification: 10–30% during normal drift, 50%+ after partition. Each divergent read becomes 1 read + up to (R - 1) writes. Cassandra historically capped this with read_repair_chance = 0.1 — meaning only 10% of reads triggered the repair branch.
  • Bandwidth: digest-mode cuts payload ~75%. Only the first replica sends full data; others send hashes. Full-data re-fetch only on hash mismatch — the common case (no drift) costs N × hash_size + 1 × value_size.
  • Convergence: hot keys within seconds, cold keys never. A key read once per second converges almost as fast as it's written. A key never read drifts indefinitely until Merkle repair runs.
  • Read repair load is self-tuning. More drift = more divergent reads = more repairs = faster convergence. Less drift = less work. Naturally elastic with cluster health.

Frequently asked questions

What problem does read repair solve?

In a Dynamo-style replicated store, writes don't always reach every replica. A node might be briefly partitioned, slow, restarting, or simply not in the write quorum. Over time those gaps accumulate as replica drift — same key, different values on different replicas. Read repair closes the gaps cheaply by piggybacking on existing read traffic: every time a coordinator collects replica responses for a read, it spots the divergence and pushes the latest value back to the lagging replicas. No separate background sweep, no extra read fan-out.

Is read repair foreground or background?

Both, depending on the system and the consistency level. Foreground (blocking) read repair holds the response to the client until the lagging replicas are patched — guarantees that subsequent reads see the repaired state. Background (async) read repair returns the response immediately and fires the patches as fire-and-forget — faster, but a follow-up read might still hit unpatched replicas. Cassandra defaults to blocking repair for consistency levels at or above LOCAL_QUORUM, async for lower levels.

How is read repair different from hinted handoff?

Hinted handoff catches writes intended for a replica that was down: the coordinator stores a hint locally and replays it when the replica returns. Read repair catches writes that already happened but didn't propagate everywhere — discovered passively when a read fans out and the responses disagree. Hinted handoff fixes "I missed this write entirely"; read repair fixes "I have a stale version of this key." Cassandra and DynamoDB use both as complementary anti-entropy mechanisms.

How is read repair different from anti-entropy Merkle tree repair?

Merkle tree repair is a periodic full-keyspace sweep — every replica computes a Merkle tree of its keys, replicas exchange hashes, and divergences trigger key transfers. It catches keys that haven't been read recently. Read repair only catches keys that are being read — hot keys converge quickly, cold keys can drift for weeks. Cassandra runs both: incremental read repair for hot paths and scheduled Merkle "nodetool repair" for the long tail.

How much overhead does read repair add?

Background read repair adds roughly 10 to 30% extra writes when replica drift is significant — every read that finds divergence becomes 1 read plus up to (R - 1) writes. Cassandra's read_repair_chance parameter (default 0.1 for many older versions, deprecated in newer ones in favor of DC-local versus global controls) caps the probability that a read triggers extra repair work, limiting overhead at the cost of slower convergence. Foreground repair adds 1 to 5 ms of latency per read with divergence.

Does read repair work with vector clocks?

Yes — and the resolution is richer. With timestamp-only versions, the coordinator picks the latest and overwrites the laggards. With vector clocks, the coordinator can detect concurrent (incomparable) versions and either present them to the application as siblings (Riak) or apply a deterministic resolver (LWW, set-union for CRDTs). Read repair in Riak uses the vector-clock comparison to decide who needs which update — sometimes the "latest" is a merged version, not one of the inputs.

Why not just use a stronger consistency level instead of read repair?

Even with QUORUM reads and writes, replicas outside the quorum can drift. A write at W=3 leaves 2 of 5 replicas unpatched; later, those replicas may be queried at a lower consistency level (say, by background analytics jobs) or become the only available replicas during a partition. Read repair brings them into line opportunistically, without requiring every read to fan out to all N replicas (which is what R=N would imply). It's the cheap glue that holds eventual consistency together.