Distributed Systems

Hinted Handoff

When a replica is down, a peer stores the write — and replays it when the replica returns

Hinted handoff keeps writes alive during transient replica outages. The coordinator parks an "I'll deliver this for you" hint on a healthy node; when the downed replica reappears, the hint holder replays the buffered writes. The trick Dynamo used to stay available without losing data.

  • Triggered byWrites to unreachable replicas
  • Default hint window3 hours (Cassandra max_hint_window_in_ms)
  • Hint storageLocal disk, system.hints table
  • Replay triggerGossip "node alive" event
  • Conflict resolutionLWW by timestamp
  • Used inDynamo, Cassandra, ScyllaDB, Riak

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The handoff sequence

A Dynamo-style write enters the coordinator with a list of N replicas to receive it. Suppose one of those N replicas — call it node-3 — is unreachable. The naive options are bad: fail the write (loss of availability), or silently drop the missing replica (data loss). Hinted handoff takes the middle path:

  1. Coordinator detects unreachability. Gossip protocol marks node-3 as DOWN, or the write attempt times out.
  2. Coordinator picks a hint holder. Another live node — typically the next one on the consistent hash ring — agrees to hold the write on node-3's behalf.
  3. Hint persisted locally. The hint holder writes (target=node-3, key, value, timestamp) to its local system table (system.hints in Cassandra).
  4. Coordinator counts the hint toward write quorum. With sloppy quorum, the hint counts as a "successful" write for the purpose of meeting W. With strict quorum, only writes to actual replicas count.
  5. Replay on return. When gossip reports node-3 is alive again, the hint holder iterates through its hints for node-3 and pushes them across.
  6. Hint deletion. Once acknowledged by node-3, the hint is deleted from local storage.

If node-3 never returns within the configured hint window (3 hours by default in Cassandra), the hints are discarded and the replica must catch up via Merkle tree anti-entropy (nodetool repair) or read repair. The window prevents hint storage from growing unboundedly during long outages.

Variants and refinements

  • Strict vs sloppy quorum. In strict quorum, the hint does not count toward W — the write still needs W acks from actual replicas. In sloppy quorum (Dynamo's design), the hint does count, so writes succeed even when fewer than W original replicas are alive. Cassandra's ANY consistency level is sloppy; QUORUM is strict.
  • Coordinator-local hints. Cassandra historically stored hints on the coordinator that wrote them. Later versions added "transitive hints" — hints can be passed between nodes, which helps when the coordinator itself fails before replay.
  • Hint pressure backoff. When a node's hint store grows beyond max_hints_size_per_host (default 128 GB in modern Cassandra), the coordinator stops creating new hints for that target and the write degrades to "failed for this replica" — preserving the rest of the cluster's stability.
  • Dynamo's transient handoff. Amazon's original Dynamo used a slightly different model — any node could become a temporary replica during partitions, with a "transient" flag, and the data would migrate back when the canonical replicas were reachable.
  • Cassandra 4.0 hints batching. Modern Cassandra batches hint replays into larger writes (typically 128 KB) to reduce protocol overhead — the same idea as sstable compaction, applied to hint storage.

When hinted handoff earns its keep

  • Rolling restarts. Patch a node, restart it, takes 30 seconds. During that gap, every write to its replicas would either fail or skip the node. Hinted handoff absorbs the gap invisibly.
  • Brief network partitions. A few-minute partition between rack switches or AZs. Hints from the surviving side hold the data; replay reunites the two sides once the partition heals.
  • Slow nodes. A node under heavy GC pressure or a temporarily-overloaded disk subsystem can fall behind without going fully offline. Hints accumulate and drain when load eases.
  • Bursty write workloads where one node is temporarily saturated. Better to hint and replay than to drop writes or back-pressure clients.

Hinted handoff is the wrong fit for long-duration outages (use Merkle repair), for workloads where stale data is unacceptable (the replica catches up only on replay, so reads to the recovering replica can be stale until then), or for systems requiring linearizability (Raft-style consensus handles availability differently — by waiting for a majority to be reachable).

Hinted handoff vs alternatives

Hinted handoffRead repairMerkle repairRaft/Paxos
Triggered byWrites to down replicasReads finding divergenceScheduled sweepQuorum unavailable
Latency impact~0 (extra disk write per hint)1–5 ms blocking0 (offline)Blocked writes
Storage costHint per missed write, capped at 128 GBNoneMerkle tree per rangeLog entries
Convergence timeSeconds after replica returnsOn next read of affected keyPer-sweep cycle (hours)On commit
Handles long outagesNo (3-hour cap)Yes if reads continueYesNot applicable
Counts toward W?Yes (sloppy) / No (strict)N/A — after writeN/AAlways yes

The three anti-entropy mechanisms complement each other. Hinted handoff is the cheap, fast fix for short outages. Read repair catches the gaps that hints didn't fill, opportunistically. Merkle repair is the periodic deep clean. Together they bound replica drift at the cost of careful tuning per workload.

Pseudo-code

// Coordinator-side write with hinted handoff
write(key, value, timestamp, W):
    replicas = replicas_for(key)
    successes = 0
    hint_holders = []

    for r in replicas:
        if gossip.is_alive(r):
            send_async(r, key, value, timestamp)
        else:
            // pick a healthy node to hold the hint
            holder = next_alive_neighbor(r)
            store_hint(holder, target=r, key=key, value=value, timestamp=timestamp)
            hint_holders.append(holder)
            successes += 1   // counts toward W in sloppy mode

    // wait for actual replica acks
    for ack in await_first(replicas_alive, count=W - len(hint_holders)):
        successes += 1
        if successes >= W: return OK

    if successes < W: return TIMEOUT

// Hint replay — runs on each node when a target comes back alive
on_node_alive(target_node):
    hints = local_hint_store.iter(target=target_node)
    for hint in hints:
        if age_of(hint) > MAX_HINT_WINDOW_MS:
            local_hint_store.delete(hint)
            continue
        try:
            send_sync(target_node, hint.key, hint.value, hint.timestamp)
            local_hint_store.delete(hint)
        except UnreachableError:
            return  // give up; will retry on next alive event

Java-flavored implementation sketch

// Cassandra-style hint storage
public class HintService {
  private final HintStore localHints;
  private final GossipService gossip;
  private final Duration maxWindow = Duration.ofHours(3);
  private final long maxHintBytes = 128L * 1024 * 1024 * 1024; // 128 GB

  public void storeHint(InetAddress target, ByteBuffer key,
                        ByteBuffer value, long timestampMicros) {
    if (localHints.totalBytes() > maxHintBytes) {
      throw new HintCapacityException("hint store full for " + target);
    }
    Hint h = new Hint(target, key, value, timestampMicros, System.currentTimeMillis());
    localHints.append(h);  // persisted to local SSTable
  }

  // Called on gossip "endpoint alive" event for nodes we hold hints for
  public void onTargetAlive(InetAddress target) {
    Iterator<Hint> iter = localHints.iterForTarget(target);
    while (iter.hasNext()) {
      Hint h = iter.next();
      long age = System.currentTimeMillis() - h.createdAtMs;
      if (age > maxWindow.toMillis()) {
        iter.remove();  // expired — let Merkle repair handle it
        continue;
      }
      try {
        sendMutation(target, h.key, h.value, h.timestampMicros);
        iter.remove();  // delivered — drop the hint
      } catch (UnreachableException e) {
        return;  // target went down again; wait for next alive event
      }
    }
  }
}

Common pitfalls

  • Treating hints as guaranteed durability. Hints live on the coordinator's local disk only. If that disk dies before replay, the hints are lost — eventual consistency via Merkle repair becomes the safety net. Don't disable repair scheduling just because hinted handoff is on.
  • Running with sloppy quorum without understanding the trade-off. Sloppy quorum lets writes succeed when fewer than W canonical replicas are reachable — but then a strict-quorum read might see the write only via the hint replay, not from a canonical replica. Some consistency anomalies become possible. Stick to strict quorum unless you've truly thought through the implications.
  • Forgetting to monitor hint backlog. A growing hint backlog on a coordinator is a leading indicator of cluster trouble — usually one replica is consistently slow or down. Cassandra's nodetool tablestats system.hints and JMX hint metrics expose this.
  • Mismatched clocks. Hints replay with their original timestamp. If the writing node had a clock 5 minutes in the future, the hint will appear to be the "latest" value when replayed — potentially overwriting newer writes that happened with correct clocks during the gap. NTP sync is non-negotiable.
  • Tuning the hint window without context. Pushing max_hint_window_in_ms to 24 hours sounds safe — until a node truly fails and you've buffered 24 hours of writes on every other node. Hint storage can balloon into hundreds of GB during a real outage.

Performance characteristics

  • ~0 latency on the hot path. A hint is one extra local disk write on a healthy node — typically < 1 ms with battery-backed write cache. Coordinator returns success as soon as the hint is durable.
  • Storage cost: ~100 bytes per hint at rest. Key + value + timestamp + target node + serialization overhead. A node with 10 minutes of buffered hints at 10,000 writes/sec holds ~600 MB of hints.
  • Replay rate: 100k–1M hints/sec. Cassandra 4.0+ batches replays into 128 KB chunks; the bottleneck is usually the target replica's commit-log write throughput, not the hint dispatcher.
  • Hint window: 3 hours default, configurable up to ~24 hours practical. Beyond that, hint storage growth dominates over replay efficiency — Merkle repair becomes cheaper.
  • Convergence: seconds after node returns for the typical case. With 600 MB of hints replaying at 100 MB/sec, the recovering replica catches up in ~6 seconds. Slower if the target is throttling its own commit-log.

Frequently asked questions

What is hinted handoff?

Hinted handoff is a write-availability mechanism for replicated stores. When a write needs to go to N replicas and one of them is unreachable, the coordinator forwards the write to a healthy node along with a hint — a small piece of metadata saying "this write was for replica X, please replay it when X comes back." The hint sits on the hint holder's local disk until it can be delivered. The Dynamo paper (DeCandia et al., 2007) introduced the term; Cassandra inherited the design.

How is hinted handoff different from read repair?

Hinted handoff fixes writes that never reached a replica because the replica was down at write time. Read repair fixes writes that did reach the replica eventually, but the replica's version became stale before a read fanned out to it. Hinted handoff replays writes proactively when a node returns; read repair patches them reactively when a read happens to expose the drift. Both run in production simultaneously in Cassandra and Dynamo-style stores.

How long are hints stored?

Cassandra defaults to a max_hint_window_in_ms of 3 hours (10,800,000 ms). After that window, hints for an unreachable replica are discarded and the replica must catch up via Merkle tree anti-entropy (nodetool repair) instead. The bound prevents hints from accumulating indefinitely if a node is permanently lost rather than briefly down. The hint store on a single coordinator is typically capped at a few gigabytes — beyond that, hints stop being written and the write succeeds at lower W.

Does a write succeed if all its replicas are down?

It depends on the consistency level and whether sloppy quorum is enabled. Cassandra's ANY consistency level lets a write succeed even if zero original replicas are reachable — the hint alone counts as a write. QUORUM and ALL require successful writes to actual replicas. Most production deployments use QUORUM with sloppy quorum disabled, so writes during a major outage simply fail; hinted handoff covers the transient single-node case.

What happens to hints if the coordinator that holds them crashes?

Hints are persisted to local disk before the coordinator acknowledges the write, so a coordinator restart doesn't lose them — replay resumes after restart. If the coordinator's disk dies, the hints are lost; the cluster relies on Merkle tree anti-entropy or read repair to catch the missed writes. Cassandra hints live in a system table (system.hints) that is replicated within the local data directory but not across coordinators.

Why does hinted handoff need a max window?

Without a bound, hints accumulate indefinitely whenever a node is down — and a permanently-failed node creates a permanent backlog. The hint store would grow without limit on every other node, eventually exhausting disk. A 3-hour window assumes transient outages (rolling restart, brief network partition, slow recovery) but switches to bulk recovery (Merkle repair) for longer absences. Tune max_hint_window_in_ms upward if your typical outage exceeds a few hours; tune downward if you have many short-lived nodes.

Can hinted handoff cause out-of-order writes?

Yes, and the system relies on timestamps to resolve them. When a hint is replayed minutes or hours after the original write, the absent replica may have already received fresher writes via normal paths. Cassandra uses last-write-wins by timestamp — the hint write only takes effect if its timestamp is newer than what the replica already holds. Without timestamps (or with skewed clocks), hinted handoff can resurrect deleted values or roll back updates. Always sync clocks and use cell-level timestamps.