Distributed Systems
Leader Election
Bully algorithm, Raft term elections, ZooKeeper ephemeral nodes — the building block of replication
Leader election is the problem of choosing a single coordinator from a group of distributed processes — necessary because operations like writes, lock acquisitions, and consensus rounds need a unique decision-maker. Classic algorithms: the Bully (Garcia-Molina 1982), where the highest-ID node wins after a cascade of broadcast messages. Modern: Raft uses term-based randomized timeout elections (any follower can become candidate after election timeout 150-300 ms). ZooKeeper uses ephemeral sequential znodes — the lowest-numbered node wins. Pitfalls: split brain (two leaders), election thrashing, and the FLP impossibility (no leader election in asynchronous systems with even one failure).
- Classic algorithmBully (Garcia-Molina 1982)
- Raft election timeout150-300 ms randomized
- ZooKeeperephemeral sequential znodes
- Split braintwo-leader hazard
- FLP impossibility1985
- Real consensusRaft, Paxos, ZAB
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
Why leader election matters
- Kafka controller. One broker in the cluster is the controller — owner of partition assignments, ISR membership, and topic creation. Pre-KRaft, ZooKeeper elected it via ephemeral znode; KRaft now does it with Raft. Failover takes ~10 seconds.
- Kubernetes leader-elector. kube-controller-manager and kube-scheduler each run as multi-replica deployments but only the elected leader actually acts. Election goes through etcd's
LeaseAPI; the active replica refreshes the lease every 2 seconds (default), failover is ~15 seconds. - Distributed locks. Redlock and ZooKeeper-based locks reduce to leader election: the lock holder is the "leader" of that resource. Clients race to create an ephemeral key; lowest sequence wins; others queue with watches.
- Primary-replica databases. PostgreSQL with Patroni, MySQL with Orchestrator, MongoDB replica sets — all elect a primary that accepts writes; replicas tail the log. Failover requires re-electing among the survivors.
- Cluster managers. HBase HMaster, HDFS NameNode HA, Yarn ResourceManager — all use ZooKeeper-based leader election so only one master is active at a time.
- SaaS feature flags / config rollouts. A scheduled job runs once per cluster, not once per replica; leader election ensures exactly one replica acquires the "scheduler" role.
What a leader-election protocol must guarantee
- Safety (uniqueness). At any time, at most one node is recognized as leader by the system as a whole. Note "recognized" — old leaders may still believe they are leader during a partition, which is why fencing matters.
- Liveness. Eventually, when the network is reasonable, some node becomes leader. FLP impossibility shows you cannot have both safety and liveness in an asynchronous model with crash failures, so practical protocols compromise on liveness during pathological asynchrony.
- Termination. The election round itself completes in finite time under reasonable assumptions.
- Fairness (optional). Any candidate can eventually be elected. ZooKeeper's sequential znodes give FIFO fairness; Raft's randomized timeouts are uniformly random.
Raft elections in detail
Each Raft node is in one of three states: follower, candidate, or leader. Time is divided into terms — monotonically increasing integers, restarted on every election. A follower with no leader heartbeat for its election timeout (150–300 ms randomized) becomes a candidate: it increments its term, votes for itself, and sends RequestVote RPCs to all peers. Each peer grants its vote at most once per term, to the first request whose log is at least as up-to-date as the peer's. If the candidate gets a majority of votes, it becomes leader and starts sending heartbeats (AppendEntries with empty log). If it receives a higher-term message, it reverts to follower. If the timeout expires without majority (split vote), it increments term again and retries.
The "log up-to-date" constraint matters: it ensures a leader has all committed entries, otherwise a leader could overwrite committed values. The randomized timeout makes split votes self-healing — typical convergence is 1-2 rounds (~300-600 ms) on a cluster with no faults besides the failed leader.
ZooKeeper election in detail
- Each candidate creates an ephemeral sequential znode under
/election/leader_. ZooKeeper appends a globally-monotonic counter:leader_0000000007. - The candidate calls
getChildren(/election)and finds the smallest sequence number among the children. - If its own number is smallest, it is the leader. Done.
- Otherwise, it identifies the immediate predecessor (the next-smaller number) and sets an exists-watch on it.
- If the candidate's session times out (default 30 s) or it crashes, ZooKeeper atomically deletes its znode — the next-smallest candidate's watch fires and it becomes leader.
- Watching only the predecessor (not all candidates) avoids the herd: a single deletion wakes a single watcher.
This works because ZooKeeper guarantees session semantics: an ephemeral znode lives exactly as long as the creating client's session. Heartbeats keep the session alive; failure deletes the znode atomically.
Fencing tokens — the safety net
Even with quorum-based election, there is a brief window where a deposed leader does not yet know it has lost. If it can still reach the storage system, it can send stale writes and corrupt state. Fencing solves this: the election authority issues a monotonically increasing token (epoch, term number, ZooKeeper zxid) with the leadership grant. Every storage operation includes the token. The storage system tracks the highest token it has seen and rejects any operation with a smaller token. Old leaders, even if reachable, are silently fenced out. Implementations: HBase epoch numbers in WAL, GFS chunk-server leases with version numbers, Kubernetes resourceVersion in optimistic-concurrency writes.
Common misconceptions
- "Leaders never fail." The whole reason for an election protocol is failover. Leaders die, network partitions occur, GC pauses cause apparent death — every production system must elect a new leader regularly.
- "Any node can elect itself." A node can declare candidacy but needs a majority quorum to actually become leader. Without quorum, two partition halves cannot both elect a leader.
- "Raft and Paxos are leader-election protocols." They include leader election as a sub-step, but they are full consensus protocols — they replicate a log of operations, not just a leader identity. Pure leader election is simpler.
- "ZooKeeper itself does not need leader election." It does — internally, ZooKeeper is a Zab-replicated cluster that elects its own leader for write serialization. The application-level leader election we describe sits on top.
- "GC pauses are not a problem." A long GC pause (Java >30 seconds was historically common) makes a leader appear dead, election fires, but the original leader resumes and thinks it is still in charge — classic split-brain. Fencing tokens are the only safe answer.
- "Lower-ID-wins is fair." Bully and ZooKeeper-sequential are not balanced — the same low-ID node tends to win repeatedly under churn. Raft's randomized timeouts give better load distribution.
Practical numbers
- Raft. 150-300 ms election timeout, ~600 ms typical leader failover including detection.
- ZooKeeper. Session timeout 30 s default; ephemeral-znode-based election typically detects failover in under 1 s after session expires.
- etcd. 1 s lease + 1 s heartbeat = ~2-3 s failover for application leader election on top.
- Kubernetes (default). 15 s lease duration, 10 s renewal deadline, 2 s retry — failover in 10-15 s.
- Kafka KRaft. 18 s controller failover (lower than the 30 s+ of the ZooKeeper-based predecessor).
Frequently asked questions
What is the Bully algorithm?
Garcia-Molina's Bully algorithm (1982) elects the highest-ID alive node. When any node N notices the leader is dead, it sends an ELECTION message to every node with a higher ID. If anyone replies with OK (meaning 'I'm alive, I'll handle it'), N stops. The OK senders then start their own ELECTION rounds. Eventually only the highest live ID receives no OKs and broadcasts COORDINATOR to declare itself leader. Worst case is O(n²) messages and one round per ID gap. The Bully is fine in small clusters with stable IDs but assumes synchronous timing — in real networks delayed messages can cause two nodes to both think they won, hence the name 'bully' for the way the highest-ID forces all others to defer.
How does Raft randomize election timeouts to avoid splits?
If every follower used the same election timeout, a leader failure would trigger many simultaneous candidacies — each requesting votes from peers, each splitting the votes. No candidate would get a majority. Raft fixes this by giving each follower a randomized election timeout in [150 ms, 300 ms] (other implementations use [500, 1000] or other ranges). The first follower whose timer expires becomes a candidate, increments its term, and asks for votes — usually before any peer's timer fires. If a vote split does occur (rare), the term increments fail and a new randomized round starts. The probability of repeated splits drops exponentially with the timeout's variance, so elections converge in 1-2 rounds typically.
How does ZooKeeper's ephemeral-znode protocol work?
Each candidate creates an ephemeral sequential znode under a known path (e.g. /election/n_). ZooKeeper assigns a strictly increasing suffix: n_0000000005, n_0000000006, etc. Each candidate then lists the parent path and checks whether its sequence number is the lowest. If yes, it is the leader. If no, it sets a watch on the immediately-preceding znode and waits. When the predecessor's session expires (the candidate dies, network partitions), ZooKeeper deletes its ephemeral znode atomically, which triggers the watch — the next-lowest candidate now checks and finds itself the leader. This avoids thundering herd (each waiter only watches one node), is fair (FIFO by sequence number), and exploits ZooKeeper's session semantics for liveness.
What is split-brain and how does fencing prevent it?
Split-brain occurs when two nodes both believe they are leader at the same time — typically because of a network partition or a leader being slow but not dead. Both accept writes, leading to divergent state. Prevention: (1) Quorum — a leader must hold a majority's confirmation; on partition only one side can have a majority. (2) Lease — a leader holds a time-bounded lease and must renew it; old leaders self-demote when their lease expires. (3) Fencing tokens — every operation a leader performs carries a monotonically increasing token; the resource (storage system) rejects operations from older tokens. Even if two leaders co-exist briefly, the older one's writes are rejected. Martin Kleppmann's 'how to do distributed locking' essay popularized fencing tokens as the rigorous solution.
What does FLP impossibility actually say?
Fischer, Lynch, and Paterson (1985) proved that in a fully asynchronous model with even one faulty process, no deterministic protocol can guarantee consensus on every execution. The intuition: you cannot tell a slow node from a dead one without a timeout, and any timeout-based protocol can be defeated by adversarial timing. Real systems sidestep FLP with partial synchrony (Dwork-Lynch-Stockmeyer, 1988) — assume eventual message delivery — or with randomization (Ben-Or, 1983) — use coin flips so the adversary cannot stall every execution. Raft, Paxos, and ZAB all rely on partial synchrony: they make progress when timing is reasonable and remain safe even when it is not. FLP is a liveness result, not a safety result.
How does etcd v3 use Raft for leader election?
etcd is a distributed key-value store implemented as a Raft replicated state machine. The Raft cluster (typically 3 or 5 nodes) elects its own log leader, which serializes all writes. On top of this, etcd exposes a leader-election API for application use: clients call Campaign() with a desired key prefix; etcd creates a lease and a key with the client's value. The client whose key has the lowest revision number is the leader; others watch the prefix for revision changes. When the leader's lease expires (default 10 s, refreshed every 1 s), its key is deleted, the next-lowest revision wins. Kubernetes uses this exact pattern via leader-election libraries — kube-controller-manager and kube-scheduler each elect a leader through etcd to ensure only one active instance per role.