Distributed Systems
Anti-Entropy
The background process that keeps eventual consistency honest
Anti-entropy periodically compares replicas using Merkle trees and ships only the divergent rows. It is the eventual-consistency floor that Cassandra, DynamoDB, and Riak rely on.
- Comparison costO(log N) for sparse divergence
- Bandwidth (no diff)~32 KB / range / replica pair
- Typical scheduleNightly incremental, weekly full
- Repair windowMust beat tombstone GC (10d default)
- Used inCassandra, DynamoDB, Riak, ScyllaDB
- CPU overhead10–30% during active repair
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How anti-entropy works
Replicated databases don't stay in sync by accident. Every distributed write is racing the network — a packet dropped here, a replica restarted there, a write quorum of 2 out of 3 that simply skipped the third copy. Over hours and days, the replicas drift apart. Reads from different nodes start returning different answers. Anti-entropy is the background process that crawls the dataset, finds those discrepancies, and repairs them.
The naive approach is unthinkable: ship every row from replica A to replica B and compare. On a 10-TB Cassandra node that's hours of cross-data-center bandwidth — for a sync where 99.999% of the data is already identical. Real systems use Merkle trees. Each replica hashes its data into a binary tree of digests. Two replicas compare their root hashes in one round-trip; if they match, the entire dataset is identical and we move on. If they differ, both sides descend only the mismatched subtrees, halving the search space at each step.
The cost of finding a single divergent row this way is O(log N) hash comparisons plus the row itself. For a partition with one bad key out of a million, that's about 20 hash comparisons and one row transfer — instead of a million rows.
Why Merkle trees are the right primitive
A Merkle tree partitions the keyspace into ranges and recursively hashes:
leaf_hash(range) = H(row₁ ‖ row₂ ‖ … ‖ row_k)
inner_hash(L, R) = H(L.hash ‖ R.hash)
root_hash = inner_hash(left_subtree, right_subtree)
Any change to any row invalidates the chain of parents all the way to the root — the root hash is a fingerprint of the entire dataset. Two replicas can sit on opposite sides of a continent, exchange 32 bytes, and learn whether their data is identical.
The depth of the tree controls the granularity of comparison. Cassandra uses 15 levels — 32 768 leaves per token range. Each leaf covers roughly 30 rows for a 1-million-row range. When two leaves disagree, the system ships those 30 rows; when subtrees disagree the system descends. The result is bandwidth proportional to the amount of disagreement, not the size of the data.
When to use anti-entropy
- You run a multi-replica AP store — Cassandra, ScyllaDB, DynamoDB, Riak, Couchbase. The replica set is intentionally eventually consistent, and you need a floor under "eventual."
- You have cold data that nobody reads. Read repair only fixes keys someone touches. Cold rows would drift forever without a background sweep.
- You delete data. Tombstones plus a missed replica plus garbage collection equals resurrected ghosts. Anti-entropy within the GC grace period prevents zombie writes.
- You operate across multiple data centers. Cross-region partitions are routine; anti-entropy is the reconciliation pass after the partition heals.
You do not need anti-entropy in a strongly consistent system — Raft, Paxos, Spanner, CockroachDB. There, every commit already requires a majority, so divergence cannot persist by construction. The cost is latency and write throughput, which is exactly why AP systems trade strong consistency for anti-entropy in the first place.
Anti-entropy vs read repair vs hinted handoff
| Anti-entropy | Read repair | Hinted handoff | Quorum write | |
|---|---|---|---|---|
| Trigger | Scheduled / periodic | Client read | Replica unreachable at write time | Client write |
| Coverage | Entire dataset | Keys clients touch | Writes since hint window opened | Single write |
| Cold-data drift | Repaired | Not repaired | Not relevant | Not relevant |
| Latency impact | None on critical path | Adds to read tail | Slight queue cost | 1 RTT to majority |
| Bandwidth (no diff) | ~32 KB / range / pair | 0 | 0 | 0 |
| Bandwidth (with diff) | Δ rows + log(N) hashes | 1 row per stale replica | Buffered write replayed | 0 |
| Required for correctness | Yes (eventual floor) | No (optimization) | No (optimization) | Yes (read-your-write) |
The three repair mechanisms are complementary, not redundant. Hinted handoff catches replicas that were briefly down. Read repair catches keys hot enough to be requested. Anti-entropy catches everything else — the long tail of cold rows that escape both other paths.
What anti-entropy actually costs
A full repair on a 5-TB Cassandra node takes roughly 4–10 hours, depending on the replication factor, the number of token ranges, and how much divergence the Merkle tree turns up. Network throughput peaks at the replica's outbound bandwidth; CPU runs 10–30% above baseline for the duration of the validator and merkle-build phases.
Incremental repair — introduced in Cassandra 2.1 and refined heavily through 4.0 — only rebuilds Merkle trees over partitions that have changed since the last repair. On a typical workload where 1% of partitions change per day, incremental repair reduces the cost by roughly 99%: 4 minutes instead of 4 hours. The Merkle build still touches every changed row but skips the unchanged majority entirely.
The bandwidth bill on a 32-token-range repair with no divergence is about 32 ranges × 32 KB of hashes × (RF−1) replicas, so 32 × 32 KB × 2 = 2 MB total — vanishingly small. Add divergence and the bill grows linearly with the number of divergent rows. The slope is what kills you if your replicas have been drifting for weeks.
Pseudo-code
// Background repair worker on replica A talking to replica B for a range.
repair(range, B):
treeA = build_merkle_tree(range) // hash my partition
treeB = B.build_merkle_tree(range) // ask B to build hers
diff_leaves = []
descend(treeA.root, treeB.root, diff_leaves)
for leaf in diff_leaves:
myRows = read_rows(range, leaf)
theirRows = B.read_rows(range, leaf)
reconcile(myRows, theirRows) // pick winner per row
push_winners(B, range, leaf)
descend(a, b, out):
if a.hash == b.hash:
return // identical subtree, skip
if a.is_leaf():
out.append(a)
return
descend(a.left, b.left, out)
descend(a.right, b.right, out)
reconcile(rowsA, rowsB):
// Last-write-wins by timestamp; vector clocks in Riak.
for key in union(rowsA.keys, rowsB.keys):
winner = max_by_timestamp(rowsA.get(key), rowsB.get(key))
update_local(key, winner)
Python: a minimal Merkle-tree repair
import hashlib
def H(*parts):
h = hashlib.sha256()
for p in parts:
h.update(p if isinstance(p, bytes) else str(p).encode())
return h.digest()
class MerkleNode:
__slots__ = ("hash", "left", "right", "rows", "lo", "hi")
def __init__(self, hash, lo, hi, left=None, right=None, rows=None):
self.hash, self.lo, self.hi = hash, lo, hi
self.left, self.right, self.rows = left, right, rows
def build(rows, lo, hi, depth=15):
bucket = [(k, v) for (k, v) in rows if lo <= k < hi]
if depth == 0 or len(bucket) <= 1:
return MerkleNode(H(*[H(k, v) for k, v in bucket]), lo, hi, rows=bucket)
mid = (lo + hi) // 2
L = build(rows, lo, mid, depth - 1)
R = build(rows, mid, hi, depth - 1)
return MerkleNode(H(L.hash, R.hash), lo, hi, left=L, right=R)
def diff(a, b, out):
if a.hash == b.hash:
return
if a.left is None or b.left is None:
out.append((a.lo, a.hi))
return
diff(a.left, b.left, out)
diff(a.right, b.right, out)
def repair(replicaA, replicaB):
tA = build(replicaA.rows, 0, 1 << 30)
tB = build(replicaB.rows, 0, 1 << 30)
ranges = []
diff(tA, tB, ranges)
for lo, hi in ranges:
a_rows = {k: (v, ts) for k, v, ts in replicaA.scan(lo, hi)}
b_rows = {k: (v, ts) for k, v, ts in replicaB.scan(lo, hi)}
for key in a_rows.keys() | b_rows.keys():
av, at = a_rows.get(key, (None, -1))
bv, bt = b_rows.get(key, (None, -1))
if at > bt: replicaB.put(key, av, at)
elif bt > at: replicaA.put(key, bv, bt)
This is the entire repair loop in 30 lines. Production systems add SSTable-aware merkle build, validator/streamer separation, throttling, and incremental state, but the algorithm is exactly this: build, compare roots, descend disagreements, reconcile leaves.
Variants and extensions
- Incremental repair (Cassandra). Tracks which SSTables have been repaired since the last run and only rebuilds Merkle trees over the unrepaired set. Reduces repair cost by 90%+ on stable workloads. Default in Cassandra 4.x.
- Subrange repair. Repairs one small token range at a time instead of an entire vnode, so a single failure or restart doesn't lose hours of progress.
nodetool repair -st ... -et ...in Cassandra. - Reaper / scheduled orchestration. Tools like Cassandra Reaper drive thousands of subrange repairs across a cluster, throttling per host and retrying failures. The bare protocol is fine; managing it at scale is the hard part.
- Active anti-entropy (Riak). Maintains the Merkle tree continuously as writes arrive, so the build phase is amortized across the workload instead of paid in one burst.
- DynamoDB internal repair. Amazon's hosted variant runs Merkle-tree comparisons across replica sets every few minutes; the user never sees a "run repair" knob, but the same algorithm is doing the work.
- Hash-based vs row-by-row. Some systems skip Merkle trees entirely and stream sorted-key digests, comparing in O(N) bandwidth but with simpler code. Acceptable when ranges are small.
Common pitfalls and edge cases
- Repair window exceeds tombstone GC grace. Cassandra's default is 10 days. If a repair misses a deletion within that window and the tombstone is purged on one replica, the un-deleted copy on the other replica resurrects the row on the next anti-entropy round. Always:
repair_interval < gc_grace_seconds. - Clock skew across replicas. Last-write-wins reconciliation hinges on timestamps. A node with a 5-second-fast clock wins every conflict; a 5-second-slow clock loses every one. NTP is non-negotiable. Riak's vector clocks dodge this, but pay a per-row metadata cost.
- Merkle build saturates I/O. Reading every SSTable to compute hashes turns the validator into a sequential-scan storm. Throttle with
compaction_throughput_mb_per_secor run during off-peak. - Repair coordinator failure leaves orphan state. Older Cassandra versions could leave repair sessions half-finished, blocking subsequent runs. Modern incremental repair tracks per-SSTable state on disk to survive restarts.
- Network amplification on highly divergent data. If replicas have drifted by a million rows, anti-entropy ships a million rows. The fast path assumes divergence is sparse. After a long outage, expect a bandwidth surge.
- Forgetting to repair after a node replacement. A newly bootstrapped node fills mostly from streaming, but any in-flight writes during the bootstrap window land only on the new node's peers. The first post-bootstrap repair is mandatory.
Frequently asked questions
Why do replicas drift apart in the first place?
Dropped messages, brief network partitions, replica restarts during a write, hinted-handoff buffers that expire before delivery, or a write that hit a quorum of 2 out of 3 — leaving the third replica permanently behind. In a system handling millions of writes per day, even a 0.01% loss rate produces hundreds of divergent rows per replica, per day.
Why use a Merkle tree instead of comparing rows directly?
Comparing rows directly costs O(N) bandwidth — you ship the whole dataset across the network. A Merkle tree lets two replicas compare their root hashes in one round; if those match, the entire subtree is identical and you skip it. Sparse divergence — the common case — costs only O(log N) hash comparisons plus the actual diverging leaves. For a 1 TB keyspace with 100 changed rows, that means transferring a few kilobytes of hashes plus the 100 rows, instead of 1 TB.
How is anti-entropy different from read repair?
Read repair fixes divergence on the read path: when a client read returns mismatched values from replicas, the coordinator picks the winning version and pushes it back to the stale replicas. It is reactive and only repairs keys that someone reads. Anti-entropy is proactive — it runs in the background and reconciles cold data that nobody is reading. Together they form a complete eventual-consistency floor.
Does anti-entropy slow down production traffic?
Yes, measurably. A full repair on a multi-TB Cassandra node can saturate the network for hours and add 10–30% CPU load. Operators throttle it heavily — Cassandra's incremental repair only validates partitions that have changed since the last repair, which cuts the cost by 90%+ on typical workloads. Off-peak scheduling and per-node concurrency caps are standard.
How often should anti-entropy run?
Within the tombstone GC grace period, which defaults to 10 days in Cassandra. If you go longer than that without a repair, deleted rows can resurrect — a deleted row's tombstone gets garbage-collected on one replica but the original write still exists on another, and the next gossip round happily resurrects it. Weekly is the practical sweet spot; nightly incremental + weekly full is the production pattern.
What's the trade-off in Merkle tree depth?
Shallow trees mean fewer hash comparisons but each leaf covers a larger range, so any single divergence pulls a big chunk of data across. Deep trees give precise diff localization but more hash storage and more comparison rounds. Cassandra uses 15 levels (32k leaves) per range by default; DynamoDB and Riak tune similarly. The sweet spot is when leaf size roughly equals one network MTU's worth of rows.
Why not just use anti-entropy and skip quorum writes?
Anti-entropy provides eventual consistency, not linearizability. Between the moment a write hits one replica and the moment anti-entropy reconciles the others, a read can return stale data — sometimes for hours. Quorum writes give read-your-writes immediately; anti-entropy is the cleanup crew that catches the long tail of writes the quorum missed.