Distributed Systems
Raft Consensus
A leader, a log, and a quorum — that's the whole algorithm
Raft is a consensus algorithm for replicated state machines that elects a single leader to serialize log entries across a cluster. It tolerates f failures with 2f+1 nodes and was designed to be understandable, unlike Paxos.
- Fault tolerancef failures with 2f+1 nodes
- Replication latency (steady state)~1 RTT to a majority
- Leader election~150–300 ms typical
- Failure modelCrash-fault, not Byzantine
- Used inetcd, Consul, CockroachDB, TiKV
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How Raft works
Raft splits consensus into three pieces: leader election, log replication, and safety. At any moment every node is in one of three roles — follower, candidate, or leader. Followers passively receive RPCs. The leader is the single node that accepts client writes and pushes them to the rest of the cluster. A candidate is a follower that has timed out waiting for a leader and is campaigning to become one.
Time is divided into terms, monotonically increasing integers. Each term begins with an election. Inside a term there is at most one leader. Every RPC carries the sender's term, and any node that sees a term newer than its own immediately reverts to follower and adopts that term. That single rule — "always defer to a higher term" — is what keeps split-brain from corrupting the log.
Once a leader is elected, replication is straightforward. The client sends a command to the leader. The leader appends it to its local log, sends an AppendEntries RPC to every follower, and waits for a majority to acknowledge. Once acknowledged, the entry is committed: the leader applies it to its state machine and tells followers to do the same on the next heartbeat.
When to use Raft
- You need linearizable reads and writes across 3, 5, or 7 nodes — a strongly consistent metadata store, a configuration service, or a transaction coordinator.
- You can tolerate roughly 1 RTT per write (between the leader and its slowest majority follower).
- Crash-fault tolerance is enough — you control all nodes and don't need to defend against malicious peers.
- You want a small, well-specified algorithm with mature reference implementations (etcd, hashicorp/raft, tikv/raft-rs).
Raft is not a fit for cross-region hot paths where 100ms+ inter-region RTT is the dominant write latency, nor for anything that needs to scale horizontally on writes — there is exactly one leader. For those, layer Raft per shard or pick an AP store.
Raft vs Paxos vs Multi-Paxos vs Zab vs PBFT vs EPaxos
| Raft | Classic Paxos | Multi-Paxos | Zab | PBFT | EPaxos | |
|---|---|---|---|---|---|---|
| Leader model | Strong, single | None (per-decree) | Stable distinguished proposer | Strong, single | Rotating primary | Leaderless |
| RTTs per commit (steady state) | 1 | 2 | 1 | 1 | 2 (3 phases) | 1 fast / 2 slow |
| Failure model | Crash | Crash | Crash | Crash | Byzantine | Crash |
| Min nodes for f failures | 2f+1 | 2f+1 | 2f+1 | 2f+1 | 3f+1 | 2f+1 |
| Designed for understandability | Yes (explicit goal) | No | No | Partly | No | No |
| Membership change | Joint consensus, single-server | Not specified | Reconfiguration via decree | FastLeaderElection | View change | Configuration change |
| Production users | etcd, Consul, CockroachDB | (rarely shipped raw) | Chubby, Spanner, Megastore | ZooKeeper | Hyperledger, Tendermint | Research, some at scale |
Multi-Paxos and Raft converge on essentially the same steady-state behavior: a stable leader, one round-trip per commit. Raft's contribution is making the leader-election and log-matching rules explicit instead of leaving them as exercises for the implementer.
What Raft actually costs
In a healthy 5-node cluster with a stable leader, every client write costs roughly one round-trip from the leader to its slowest majority follower. If the leader's three closest followers reply in 5 ms, the commit takes 5 ms — even nodes 4 and 5 can be slow or down without delaying anything.
Leader election is the expensive case. When the current leader dies or partitions, every follower waits a randomized timeout (typically 150–300 ms). The first to time out becomes a candidate, increments the term, and starts a vote. If it wins, it sends an immediate heartbeat to suppress further candidates. If two candidates split the vote, both back off and retry — randomized timeouts make repeated splits exponentially unlikely. Election typically completes in 2–3 timeout intervals, so plan for sub-second unavailability per leader change.
Log replication scales linearly with the number of followers — the leader sends N AppendEntries RPCs in parallel, so the bottleneck is the leader's outbound bandwidth and the slowest majority. A 7-node cluster doesn't make commits faster, but it does let you tolerate 3 simultaneous failures instead of 2. The trade-off: quorum size grows from 3 to 4, so the leader waits for one more acknowledgement per write.
JavaScript: leader-election state machine
// Raft node states
const FOLLOWER = 'follower', CANDIDATE = 'candidate', LEADER = 'leader';
class RaftNode {
constructor(id, peers) {
this.id = id;
this.peers = peers;
this.state = FOLLOWER;
this.currentTerm = 0;
this.votedFor = null;
this.log = []; // [{ term, command }, ...]
this.commitIndex = 0;
this.electionTimeout = null;
this.resetElectionTimer();
}
// Randomized to avoid split votes (150–300 ms)
resetElectionTimer() {
clearTimeout(this.electionTimeout);
const ms = 150 + Math.random() * 150;
this.electionTimeout = setTimeout(() => this.startElection(), ms);
}
startElection() {
this.state = CANDIDATE;
this.currentTerm += 1;
this.votedFor = this.id;
let votes = 1; // vote for self
this.peers.forEach(peer => {
peer.requestVote({
term: this.currentTerm,
candidateId: this.id,
lastLogIndex: this.log.length - 1,
lastLogTerm: this.log.at(-1)?.term ?? 0,
}).then(({ term, voteGranted }) => {
if (term > this.currentTerm) return this.stepDown(term);
if (voteGranted && this.state === CANDIDATE) {
votes += 1;
if (votes > (this.peers.length + 1) / 2) this.becomeLeader();
}
});
});
this.resetElectionTimer();
}
stepDown(term) {
this.state = FOLLOWER;
this.currentTerm = term;
this.votedFor = null;
this.resetElectionTimer();
}
becomeLeader() {
this.state = LEADER;
clearTimeout(this.electionTimeout);
this.heartbeat(); // assert leadership immediately
}
}
The randomized election timeout is the unsung hero. Without it, every follower would time out at the same instant after a leader crash, every node would campaign in the same term, and votes would split N ways. Random jitter makes one candidate consistently win the race and shut the others up with a heartbeat before they can start their own election.
Python: AppendEntries RPC handler
def append_entries(self, term, leader_id, prev_log_index,
prev_log_term, entries, leader_commit):
"""Handle an AppendEntries RPC from a leader."""
# Reply false if leader's term is older
if term < self.current_term:
return {"term": self.current_term, "success": False}
# Newer term seen — step down
if term > self.current_term:
self.current_term = term
self.voted_for = None
self.state = "follower"
self.reset_election_timer()
# Log-matching property: previous entry must agree on term
if prev_log_index >= 0:
if prev_log_index >= len(self.log):
return {"term": self.current_term, "success": False}
if self.log[prev_log_index]["term"] != prev_log_term:
# Conflict — leader will retry with an earlier prev_log_index
return {"term": self.current_term, "success": False}
# Truncate any conflicting suffix, then append the new entries
self.log = self.log[: prev_log_index + 1] + entries
# Advance commit index, but never beyond what we actually have
if leader_commit > self.commit_index:
self.commit_index = min(leader_commit, len(self.log) - 1)
self.apply_committed()
return {"term": self.current_term, "success": True}
Two invariants make this safe. First, log matching: if two logs contain an entry at the same index with the same term, all entries before it are identical. The prev_log_index / prev_log_term check enforces this — followers reject any append that would create a divergent history. Second, the leader never overwrites or deletes its own log entries; it only appends. Combined with the election restriction (a candidate can't win unless its log is at least as up-to-date as a majority), this guarantees committed entries are durable.
Variants and extensions
- Pre-vote. Before incrementing its term, a candidate first asks "would you vote for me?" without persisting a vote. This stops a partitioned node from churning the cluster's term number every time it rejoins. Used in etcd since 2018.
- Joint consensus. The original membership-change algorithm: enter a transitional configuration where commits require majorities of both the old and new node sets, then finalize. Avoids the dual-majority bug that single-server changes can hit during cascading reconfigurations.
- Single-server membership change. The pragmatic alternative used in production: add or remove one server at a time, since adjacent configurations always overlap by a majority. Simpler to implement, suffices for almost all real reconfiguration patterns.
- Log compaction via snapshotting. The log can't grow forever. Periodically the state machine writes a snapshot, the leader truncates the prefix, and lagging followers receive snapshots via
InstallSnapshotRPCs instead of replaying millions of entries. - Leader leases. A leader that's recently heard heartbeats from a majority can serve linearizable reads from its own state without a quorum round-trip. The lease's expiry must be shorter than any successor's election timeout.
- Witness replicas. A non-voting member that receives the log but doesn't apply it — used to maintain a remote disaster-recovery copy without paying the latency cost on every commit.
Common bugs and edge cases
- Split brain after partition heal. Two leaders briefly coexist if the old leader hasn't yet noticed it lost majority. The term check makes their writes harmless — once either contacts the other, the lower term steps down — but client-side libraries that retry on a stale leader can observe spurious "no leader" errors.
- Forgetting to persist
currentTermandvotedFor. Both must be on stable storage before sending RPCs, or a crash-restart can vote twice in the same term and elect two leaders. - Committing an entry from a previous term. A new leader cannot mark an old-term entry committed solely because a majority replicates it. It must first commit at least one entry from its own term — otherwise a follower's vote could re-elect the old leader and overwrite the "committed" entry.
- Slow follower starves the leader. If the leader's
nextIndexfor a far-behind follower keeps decrementing one entry at a time, recovery takes O(log size) RPCs. Production implementations batch the rollback or fall back to InstallSnapshot. - Read-your-write inconsistency on follower reads. A read served by a follower can lag the leader's commit index. Either route reads through the leader, use a read-index protocol, or accept the staleness explicitly.
- Clock skew breaking leader leases. Leases are sound only if no node's clock drifts faster than a documented bound. NTP-skewed nodes have caused real production split-brain reads.
Frequently asked questions
Why does Raft need an odd number of nodes?
Raft commits an entry once a majority replicates it. With 2f+1 nodes, you tolerate f failures and still have f+1 left to form a majority. Going from 5 to 6 nodes doesn't increase fault tolerance — both tolerate 2 failures — but raises the quorum cost from 3 to 4 acks per write.
How does Raft prevent two leaders at once?
Each leader is elected for a numbered term. A node only votes once per term. To win, a candidate needs majority votes — and any two majorities must overlap by at least one node. That overlapping voter blocks the second candidate. An old leader that comes back from a partition sees a higher term and steps down.
Is Raft strictly easier than Paxos to implement?
The basic algorithm is, yes. Raft was explicitly designed for understandability and ships with a complete spec for log replication, leadership, and membership change. But edge cases — log compaction, joint consensus, pre-vote, leader leases for safe reads — still trip implementers. etcd and HashiCorp's raft library both shipped subtle bugs years after their initial release.
Does Raft tolerate Byzantine failures?
No. Raft assumes crash-fault tolerance: nodes either behave correctly or stop. A malicious or buggy node that lies in AppendEntries can corrupt the cluster. For Byzantine resistance you need PBFT, Tendermint, or HotStuff at the cost of 3f+1 nodes and more rounds per decision.
Why does Raft use heartbeats instead of an explicit lease?
Heartbeats are AppendEntries with no payload, sent every 50–150 ms. They simultaneously assert leadership and probe followers' liveness. A separate lease primitive would still need its own renewal protocol — heartbeats fold both jobs into one mechanism. Most production deployments add a leader-lease layer on top for safe local reads without a quorum round-trip.
What happens when the network partitions?
The partition with the majority continues making progress under the existing or a newly elected leader. The minority partition cannot elect a leader (no majority of votes) and cannot commit anything new. When the partition heals, the stale-term leader steps down, its uncommitted entries are overwritten, and the cluster resyncs.