Question 1

What is the Bully algorithm?

Accepted Answer

Garcia-Molina's Bully algorithm (1982) elects the live node with the highest ID. When any node N notices that the leader is dead, it sends an ELECTION message to every node with a higher ID. If any of them replies with OK (meaning 'I'm alive, I'll handle it'), N stops and waits for the result. Each OK sender then starts its own ELECTION round. Eventually the highest live ID receives no OKs within its timeout and broadcasts COORDINATOR to declare itself leader; the name 'bully' comes from the way the highest-ID node forces all others to defer. The worst case is O(n²) messages, which occurs when the lowest-ID node starts the election and every higher node then runs its own round. The Bully algorithm is fine in small clusters with stable IDs, but it assumes synchronous timing: a missing OK is taken as proof that a node is dead. In real networks, delayed messages can lead two nodes to both conclude that they won.
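
The message flow above can be traced with a small simulation. This is a minimal sketch, not a networked implementation: it assumes perfectly synchronous, reliable delivery (so a missing OK really does mean a dead node), and the function name `bully_election` and its return shape are hypothetical choices made for illustration.

```python
from collections import deque


def bully_election(all_ids, alive, initiator):
    """Simulate one Bully election; return (leader_id, messages_sent).

    Assumes synchronous, reliable delivery: a node that does not answer
    an ELECTION message is treated as dead.
    """
    alive = set(alive)
    if initiator not in alive:
        raise ValueError("initiator must be a live node")

    messages = 0
    leader = None
    started = set()              # nodes that have already run their round
    pending = deque([initiator])

    while pending:
        node = pending.popleft()
        if node in started:
            continue
        started.add(node)

        higher = [i for i in all_ids if i > node]
        messages += len(higher)              # ELECTION to every higher ID
        responders = [i for i in higher if i in alive]
        messages += len(responders)          # one OK per live higher node

        if responders:
            pending.extend(responders)       # each OK sender starts its own round
        else:
            leader = node                    # nobody higher is alive
            messages += len(all_ids) - 1     # COORDINATOR to every other node

    return leader, messages


if __name__ == "__main__":
    # Node 5 was the leader and has died; node 1 notices first.
    print(bully_election([1, 2, 3, 4, 5], {1, 2, 3, 4}, initiator=1))
```

Starting the election from the lowest live ID is the expensive case, because every higher live node runs its own round; starting it from node 4 in the same cluster needs only one ELECTION message plus the COORDINATOR broadcast.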

Question 2

How does Raft randomize election timeouts to avoid splits?

Accepted Answer

If every follower used the same election timeout, a leader failure would trigger many simultaneous candidacies, each requesting votes from peers and each splitting the vote, so no candidate would reach a majority. Raft fixes this by having each follower pick a randomized election timeout, for example from [150 ms, 300 ms] as suggested in the Raft paper (other implementations use [500 ms, 1000 ms] or other ranges). The first follower whose timer expires becomes a candidate, increments its term, votes for itself, and requests votes, usually before any peer's timer fires. If a split vote does occur (rare), no candidate reaches a majority; each candidate times out, increments its term again, and starts a new election with a freshly randomized timeout. A split requires timers to expire within roughly one message broadcast time of each other, so the chance of repeated splits falls off geometrically with each round, and it falls faster the wider the timeout range is relative to the broadcast time. Elections typically converge in one or two rounds.
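
The effect of the timeout range can be estimated with a rough simulation. This is a sketch under simplifying assumptions, not Raft itself: a single fixed `broadcast_ms` stands in for the time a vote request needs to reach peers, and a round counts as "contended" when more than one timer fires inside that window (a contended round does not always end in a split vote). All function names are hypothetical.

```python
import random


def draw_timeouts(node_ids, low_ms=150.0, high_ms=300.0, rng=None):
    """Give each follower an independent, uniformly random election timeout."""
    rng = rng or random.Random()
    return {n: rng.uniform(low_ms, high_ms) for n in node_ids}


def first_candidate(timeouts):
    """The follower whose timer expires first becomes the candidate."""
    return min(timeouts, key=timeouts.get)


def contenders(timeouts, broadcast_ms):
    """Followers whose timers fire before the first vote request can reach them."""
    earliest = min(timeouts.values())
    return sorted(n for n, t in timeouts.items() if t - earliest <= broadcast_ms)


def contended_fraction(n_nodes, broadcast_ms, trials=10_000, seed=0):
    """Estimate how often more than one follower becomes a candidate."""
    rng = random.Random(seed)
    contended = 0
    for _ in range(trials):
        timeouts = draw_timeouts(range(n_nodes), rng=rng)
        if len(contenders(timeouts, broadcast_ms)) > 1:
            contended += 1
    return contended / trials


if __name__ == "__main__":
    for broadcast_ms in (1.0, 10.0, 50.0):
        print(broadcast_ms, contended_fraction(5, broadcast_ms))
```

Running it for a few broadcast times shows why the timeout range should be much larger than the network's broadcast time: the closer the two get, the more often several followers time out before hearing from the first candidate.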

Question 3

How does ZooKeeper's ephemeral-znode protocol work?

Accepted Answer

Each candidate creates an ephemeral sequential znode under a known path (e.g. /election/n_). ZooKeeper appends a monotonically increasing, zero-padded suffix: n_0000000005, n_0000000006, etc. Each candidate then lists the children of the parent path and checks whether its sequence number is the lowest. If yes, it is the leader. If no, it sets a watch on the immediately preceding znode and waits. When the predecessor's session ends (the candidate dies, or a partition outlasts the session timeout), ZooKeeper deletes its ephemeral znode, which triggers the watch. The waiter then lists the children again: if its znode is now the lowest, it is the leader; if the deleted predecessor was only another waiter, it watches its new predecessor instead. This avoids a thundering herd (each waiter watches only one node), is fair (FIFO by sequence number), and exploits ZooKeeper's session semantics for liveness.
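
The bookkeeping of this recipe can be modelled in memory. The sketch below is not a ZooKeeper client and talks to no server; the class `ElectionPath` and its methods are hypothetical stand-ins for creating an ephemeral sequential znode, listing children, and reacting to session expiry.

```python
class ElectionPath:
    """In-memory stand-in for a parent znode such as /election."""

    def __init__(self):
        self._next_seq = 0
        self._owners = {}  # znode name -> candidate name

    def join(self, candidate):
        """Model creating an ephemeral sequential znode; return its name."""
        name = f"n_{self._next_seq:010d}"
        self._next_seq += 1
        self._owners[name] = candidate
        return name

    def children(self):
        # Zero-padded suffixes sort lexicographically in sequence order.
        return sorted(self._owners)

    def leader(self):
        """The candidate owning the lowest sequence number, if any."""
        kids = self.children()
        return self._owners[kids[0]] if kids else None

    def watch_target(self, name):
        """The znode this candidate should watch: its immediate predecessor."""
        kids = self.children()
        idx = kids.index(name)
        return kids[idx - 1] if idx > 0 else None

    def expire(self, name):
        """Model session expiry: delete the znode, return whose watch fires."""
        notified = [k for k in self.children() if self.watch_target(k) == name]
        del self._owners[name]
        return notified


if __name__ == "__main__":
    path = ElectionPath()
    a, b, c = path.join("A"), path.join("B"), path.join("C")
    print(path.leader(), path.watch_target(c))
    print(path.expire(b), path.watch_target(c))   # C is notified, then watches A
    print(path.expire(a), path.leader())
```

Each deletion notifies at most one waiter, which is the property that avoids the thundering herd. Note that when a waiter in the middle of the queue disappears, its successor is notified but does not become leader; it simply re-points its watch.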

Question 4

What is split-brain and how does fencing prevent it?

Accepted Answer

Split-brain occurs when two nodes both believe they are leader at the same time, typically because of a network partition or because the old leader is slow but not dead. Both accept writes, leading to divergent state. Prevention: (1) Quorum: a leader must hold a majority's confirmation; during a partition only one side can have a majority. (2) Lease: a leader holds a time-bounded lease and must renew it; old leaders demote themselves when their lease expires, which is only safe if clock drift and process pauses are bounded. (3) Fencing tokens: every operation a leader performs carries a monotonically increasing token, and the resource (the storage system) rejects operations that carry a token older than the newest one it has seen. Even if two leaders co-exist briefly, the older one's writes are rejected. Martin Kleppmann's essay 'How to do distributed locking' popularized fencing tokens as the rigorous solution.
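
The fencing check lives in the resource, not in the leaders. A minimal sketch, assuming the lock or election service hands out integer tokens that only ever increase; the class `FencedStore` and its `write` method are hypothetical names for illustration.

```python
class FencedStore:
    """Storage that rejects writes carrying an older fencing token."""

    def __init__(self):
        self.highest_token = 0   # newest token seen so far
        self.data = {}

    def write(self, token, key, value):
        """Apply the write only if the token is not older than the newest seen."""
        if token < self.highest_token:
            return False         # stale leader: reject
        self.highest_token = token
        self.data[key] = value
        return True


if __name__ == "__main__":
    store = FencedStore()
    print(store.write(33, "config", "from old leader"))   # accepted
    print(store.write(34, "config", "from new leader"))   # accepted
    print(store.write(33, "config", "late write"))        # rejected
    print(store.data["config"])
```

The old leader (token 33) may still believe it holds the lock after a long pause, but once the store has seen token 34 its late write is refused, so the state written by the new leader survives.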

Question 5

What does FLP impossibility actually say?

Accepted Answer

Fischer, Lynch, and Paterson (1985) proved that in a fully asynchronous model where even one process may crash, no deterministic protocol can guarantee that consensus is reached in every execution. The intuition: you cannot tell a slow node from a dead one without a timeout, and in an asynchronous model any timeout can be defeated by adversarial timing. Real systems sidestep FLP with partial synchrony (Dwork, Lynch, and Stockmeyer, 1988), which assumes that message delays are eventually bounded, or with randomization (Ben-Or, 1983), which uses coin flips so that the adversary cannot stall every execution. Raft, Paxos, and ZAB all rely on partial synchrony: they make progress when timing is reasonable and remain safe even when it is not. FLP is a result about liveness (termination), not about safety.

Question 6

How does etcd v3 use Raft for leader election?

Accepted Answer

etcd is a distributed key-value store implemented as a Raft replicated state machine. The Raft cluster (typically 3 or 5 nodes) elects its own log leader, which serializes all writes. On top of this, etcd exposes a leader-election API for application use. In the Go client's concurrency package, a client opens a session, which is backed by a lease, and calls Campaign() on an election tied to a key prefix; etcd creates a key under that prefix that holds the client's value and is attached to the lease. The client whose key has the lowest creation revision is the leader; each of the others waits for the keys with lower creation revisions to be deleted. When the leader's lease expires (the concurrency package's default session TTL is 60 s, and the client sends keep-alives well before it runs out), its key is deleted and the next-lowest revision wins. Kubernetes uses a related lease-based pattern rather than this API: kube-controller-manager and kube-scheduler use the client-go leader-election library, which competes for a Lease object through the API server (itself backed by etcd), to ensure only one active instance per role.
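
The interaction between creation revisions and leases can be modelled without an etcd cluster. The sketch below is an in-memory toy with a logical clock, not the etcd client API; `ElectionModel`, `campaign`, `keep_alive`, and `tick` are hypothetical names, and the 10-second TTL in the usage example is just an illustrative value.

```python
class ElectionModel:
    """Toy model: lowest creation revision leads; expired leases drop keys."""

    def __init__(self):
        self.revision = 0   # store-wide revision counter
        self.now = 0        # logical clock, in seconds
        self.keys = {}      # client -> (create_revision, lease_deadline)

    def campaign(self, client, ttl):
        """Create the client's election key, attached to a lease of `ttl` seconds."""
        self.revision += 1
        self.keys[client] = (self.revision, self.now + ttl)

    def keep_alive(self, client, ttl):
        """Refresh the lease without changing the key's creation revision."""
        create_revision, _ = self.keys[client]
        self.keys[client] = (create_revision, self.now + ttl)

    def tick(self, seconds):
        """Advance time and delete keys whose lease has expired."""
        self.now += seconds
        self.keys = {c: v for c, v in self.keys.items() if v[1] > self.now}

    def leader(self):
        """The client whose key has the lowest creation revision, if any."""
        if not self.keys:
            return None
        return min(self.keys, key=lambda c: self.keys[c][0])


if __name__ == "__main__":
    m = ElectionModel()
    m.campaign("a", ttl=10)
    m.campaign("b", ttl=10)
    print(m.leader())        # "a" campaigned first
    m.tick(5)
    m.keep_alive("b", ttl=10)
    m.tick(6)                # "a" never refreshed its lease
    print(m.leader())
```

Because a keep-alive refreshes only the lease deadline and leaves the creation revision untouched, a long-running leader keeps its place at the front of the queue for as long as it stays alive.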

Leader Election

Why leader election matters

What a leader-election protocol must guarantee

Raft elections in detail

ZooKeeper election in detail

Fencing tokens — the safety net

Common misconceptions

Practical numbers

Frequently asked questions