Networking
Maglev Load Balancing
Google's load balancer that splits traffic evenly and barely flinches when a server dies
Maglev is Google's software load balancer that builds a fixed-size lookup table by consistent hashing, spreading connections almost perfectly evenly while remapping only a tiny fraction of flows when a backend is added or removed.
- Lookup per packetO(1)
- Table build cost≈ O(M log M)
- Table size M65537 (prime) ≫ N
- Load imbalance< 1% with M ≫ N
- Flows remapped on 1 backend change≈ 1/N
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How Maglev spreads connections
Maglev is the software load balancer Google described in its 2016 NSDI paper, "Maglev: A Fast and Reliable Software Network Load Balancer." A fleet of commodity Linux machines sits in front of every Google-facing service. Routers spray incoming packets across these machines using ECMP (equal-cost multi-path), and each Maglev machine forwards every packet to one of N backend servers. The whole point is to pick that backend so that (a) load is split evenly, and (b) the choice survives a packet landing on a different Maglev machine, and (c) the choice barely changes when backends come and go.
The naive approach — backend = backends[hash(5-tuple) % N] — fails condition (c) spectacularly. Change N from 100 to 101 and the modulus shifts under almost every flow: about (N−1)/N ≈ 99% of connections get reassigned to a different server. Every one of those is a live TCP session that breaks. You need consistent hashing: a scheme where adding or removing a backend only disturbs the connections that have to move.
Ring-based consistent hashing (the Karger-style hash ring) solves the disruption problem but spreads load unevenly unless you sprinkle hundreds of virtual nodes per backend — and even then the variance is meaningful. Maglev's insight is to precompute a flat lookup table of fixed size M (a prime, much larger than N). Each of the M entries holds the index of one backend. A packet's destination is just table[hash(5-tuple) % M] — a single array read. Because every Maglev machine builds the identical table from the identical backend list, they all agree on where any flow goes, even when ECMP reshuffles which machine sees it.
The mechanism: permutations and the fill loop
Maglev fills the table with a beautiful little algorithm. Give each backend a preference list — a permutation of all M slot indices, telling that backend the order in which it would like to claim slots. The permutation for backend i is generated from two independent hashes:
offset = h1(name[i]) mod M
skip = h2(name[i]) mod (M − 1) + 1 // in [1, M−1], so gcd(skip, M)=1 since M prime
permutation[i][j] = (offset + j · skip) mod M // j = 0 .. M−1
Because M is prime and skip ∈ [1, M−1], the sequence offset + j·skip (mod M) hits every slot exactly once — a genuine permutation. Now run the fill loop: go round-robin through the backends; each backend proposes its next-most-preferred slot; if that slot is empty, the backend takes it; if it's already claimed, the backend advances its own cursor and tries again next turn. Repeat until all M slots are full.
next[i] = 0 for all backends i // each backend's cursor into its permutation
entry[j] = -1 for all slots j // -1 = empty
filled = 0
while filled < M:
for i in 0 .. N-1: // round-robin over backends
c = permutation[i][next[i]] // backend i's current top choice
while entry[c] != -1: // skip slots already taken
next[i] += 1
c = permutation[i][next[i]]
entry[c] = i // backend i claims slot c
next[i] += 1
filled += 1
if filled == M: break
Round-robin is what makes the result fair: every backend gets the same number of turns, so each ends up owning either ⌊M/N⌋ or ⌈M/N⌉ slots — they differ by at most one. The permutations are what make it consistent: when a backend disappears, its slots reopen and the same fill process largely refills them from the same preference orders, so most surviving assignments are untouched.
The fill loop runs in expected O(M log M) time — the log factor comes from the rising cost of skipping already-filled slots as the table saturates (the worst case is technically O(M²) if permutations collide pathologically, but with independent hashes that never happens in practice). The paper's microbenchmark builds an M = 65537 table in about 1.8 ms (and a 655373 table in 22.9 ms), and the table is rebuilt only on a membership change, not per packet.
When to reach for Maglev hashing
- L4 load balancing at line rate. When you forward millions of packets per second and need an O(1) per-packet decision, a precomputed table beats per-request scoring.
- Stateless, replicated balancers behind ECMP. Maglev's determinism lets you run many identical balancer instances with no shared state — any instance routes a flow the same way.
- You care about connection stability. If a backend deploy or autoscale event must not tear down healthy connections to other backends, minimal disruption is the whole game.
- Near-uniform load with cheap memory. When you'd rather spend a few hundred KB of table than tolerate the load skew of a bare hash ring.
Skip Maglev when you need weighted routing that changes per request (least-connections, latency-aware), when backends are heterogeneous and you want true weighting (Maglev supports weights but only coarsely, via duplicated entries), or when N is so small that simple round-robin already balances fine and you don't need flow stickiness.
Maglev vs other load-balancing strategies
| Maglev table | Ring consistent hashing | Rendezvous (HRW) | Modulo hash | Round-robin | Least-connections | |
|---|---|---|---|---|---|---|
| Lookup cost | O(1) | O(log V) (V vnodes) | O(N) | O(1) | O(1) | O(log N) heap |
| Build / update cost | O(M log M) rebuild | O(V) insert | none | none | none | per-request bookkeeping |
| Flows moved on 1 backend change | ≈ 1/N | ≈ 1/N | exactly 1/N (optimal) | ≈ (N−1)/N | all (no affinity) | all (no affinity) |
| Load evenness | < 1% with M ≫ N | needs 100s of vnodes | near-perfect | perfect (if N fixed) | perfect | perfect by design |
| Per-flow stickiness | yes | yes | yes | only if N fixed | no | no |
| Memory | O(M) table | O(V) ring | O(N) names | O(N) | O(1) | O(N) counters |
| Used by | Google L4 LB, Katran, Cilium | Memcached, Cassandra, Dynamo | CRUSH (Ceph), GLB | toy/naïve LBs | HAProxy default | HAProxy, NGINX |
The headline trade-off: rendezvous hashing achieves exactly the optimal 1/N disruption but pays O(N) per lookup; Maglev gets within a hair of that disruption bound while making every lookup a single array index. Ring hashing matches Maglev's disruption but needs heavy virtual-node tuning to match its evenness. Modulo hashing is the cautionary tale — fast and even, but it reshuffles almost everything the instant N changes.
What the numbers actually say
- Disruption per backend change ≈ 1/N. Remove 1 backend from a 1,000-backend fleet and only ~0.1% of flows move. Compare with modulo hashing, where ~99.9% of flows would jump.
- Load imbalance under 1%. With M = 65537 and N in the hundreds, each backend owns 65537/N slots ± 1. At N = 400 that's ~164 slots each, and the busiest/least-busy gap is one slot — about 0.6%. Even at the paper's N = 1,000 test point the gap is still one slot (66 vs 65 entries, ~1.5%), versus ~30% overprovisioning that ring hashing needed at the same table size.
- O(1) lookup vs O(N) scoring. A 256-backend rendezvous lookup hashes 256 times per packet; Maglev does one modulo and one array read. At 10 Mpps that's the difference between feasible and not.
- Table memory is tiny. 65537 entries × 2 bytes (16-bit backend index) ≈ 128 KB per service — it fits comfortably in L2 cache, so the lookup is effectively free.
- Rebuild is a couple of milliseconds. The paper clocks a 65537-slot table at about 1.8 ms to build (rising to 22.9 ms at 655373), so a membership change is absorbed without dropping line rate — and the table is rebuilt only on membership changes, never per packet.
JavaScript implementation
A compact, faithful Maglev table builder. We use a simple deterministic string hash for the demo; production uses two strong independent hashes (e.g. two seeds of SipHash or FNV).
// Two independent string hashes (toy FNV-1a variants — replace with SipHash in prod)
function h1(s) { let h = 2166136261 >>> 0; for (const c of s) { h ^= c.charCodeAt(0); h = Math.imul(h, 16777619); } return h >>> 0; }
function h2(s) { let h = 0x811c9dc5 ^ 0x9e3779b9; for (const c of s) { h = Math.imul(h ^ c.charCodeAt(0), 0x01000193); } return h >>> 0; }
function buildMaglev(backends, M) { // M must be prime, M >> backends.length
const N = backends.length;
// permutation[i] is generated lazily via offset + j*skip mod M
const offset = backends.map(b => h1(b) % M);
const skip = backends.map(b => (h2(b) % (M - 1)) + 1); // in [1, M-1]
const perm = (i, j) => (offset[i] + j * skip[i]) % M;
const entry = new Int32Array(M).fill(-1);
const next = new Int32Array(N); // each backend's cursor
let filled = 0;
while (filled < M) {
for (let i = 0; i < N; i++) {
let c = perm(i, next[i]);
while (entry[c] !== -1) { next[i]++; c = perm(i, next[i]); }
entry[c] = i;
next[i]++;
if (++filled === M) break;
}
}
return entry; // entry[slot] = backend index
}
function lookup(table, flowKey, M) {
return table[h1(flowKey) % M]; // O(1) per packet
}
// --- demo: how few flows move when a backend dies ---
const M = 65537;
const A = ['s0','s1','s2','s3','s4'];
const tA = buildMaglev(A, M);
const B = ['s0','s1','s2','s4']; // s3 removed
const tB = buildMaglev(B, M);
let moved = 0;
for (let s = 0; s < M; s++) if (A[tA[s]] !== B[tB[s]]) moved++;
console.log((100 * moved / M).toFixed(2) + '% of slots reassigned'); // ~20% = 1/N for the dead backend's share
Two details matter. First, skip is forced into [1, M−1] and M is prime, which guarantees gcd(skip, M) = 1 so the permutation visits every slot. Second, the per-backend next[] cursor is what skips already-claimed slots cheaply — you never rescan from the top.
Python implementation
def _h1(s: str) -> int:
h = 2166136261
for ch in s:
h = ((h ^ ord(ch)) * 16777619) & 0xFFFFFFFF
return h
def _h2(s: str) -> int:
h = 0x811c9dc5 ^ 0x9e3779b9
for ch in s:
h = ((h ^ ord(ch)) * 0x01000193) & 0xFFFFFFFF
return h
def build_maglev(backends, M): # M prime, M >> len(backends)
N = len(backends)
offset = [_h1(b) % M for b in backends]
skip = [(_h2(b) % (M - 1)) + 1 for b in backends] # in [1, M-1]
perm = lambda i, j: (offset[i] + j * skip[i]) % M
entry = [-1] * M
nxt = [0] * N
filled = 0
while filled < M:
for i in range(N):
c = perm(i, nxt[i])
while entry[c] != -1:
nxt[i] += 1
c = perm(i, nxt[i])
entry[c] = i
nxt[i] += 1
filled += 1
if filled == M:
break
return entry # entry[slot] = backend index
def lookup(table, flow_key, M):
return table[_h1(flow_key) % M]
# disruption check
M = 65537
A = ['s0', 's1', 's2', 's3', 's4']
tA = build_maglev(A, M)
B = ['s0', 's1', 's2', 's4'] # s3 removed
tB = build_maglev(B, M)
moved = sum(1 for s in range(M) if A[tA[s]] != B[tB[s]])
print(f"{100 * moved / M:.2f}% of slots reassigned") # ~1/N for the dead backend
Notice the demo: removing one of five backends reassigns roughly a fifth of the slots (the dead backend's share), and almost nothing else moves. With modulo hashing the same change would reassign nearly the entire table.
Variants and relatives worth knowing
Weighted Maglev. To give a backend more load, list it multiple times in the backend array (or scale its number of turns in the fill loop). Each listing claims its own ~M/N slots, so a backend listed 3× gets roughly triple the traffic. Weights are coarse — granularity is one slot — but with M = 65537 that's fine.
Connection tracking. Maglev pairs the table with a per-flow tracking table inside each instance. Even the small set of flows the rebuild would move can be pinned by remembering their existing backend, so a backend addition causes zero disruption to already-established connections on a given Maglev — the table only governs new flows.
Rendezvous / highest-random-weight (HRW) hashing. The minimal-disruption ideal: for each flow, compute hash(flow, backend) for every backend and pick the maximum. Exactly 1/N of keys move on a change, but lookup is O(N). CRUSH (Ceph) and GitHub's GLB use HRW ideas; Maglev is the O(1) table-baked cousin.
Ring (Karger) consistent hashing. Place backends and keys on a hash ring; a key goes to the next backend clockwise. Add virtual nodes (100–200 per backend) to smooth load. Maglev avoids the vnode tuning by spreading evenly through the fill loop instead.
Open-source descendants. Facebook's Katran (an XDP/eBPF L4 LB) and Cilium both implement Maglev hashing for backend selection; Google's own GFE and Cloud Load Balancing trace to the Maglev design.
Common bugs and edge cases
- Choosing a non-prime M. If M isn't prime,
skipcan share a factor with M, the "permutation" misses slots, and the fill loop spins forever (or fills unevenly). M must be prime — 65537 and 655373 are the canonical choices. - Letting skip be 0. If
skip = 0, the permutation degenerates to a constant — every term equalsoffset. Always map skip into[1, M−1], not[0, M−1]. - Reusing one hash for both offset and skip. If h1 and h2 are correlated, backends with similar names get similar permutations and load skews. Use two genuinely independent hashes (different seeds or different functions).
- M too close to N. Evenness depends on M ≫ N. If M is only a few times N, the ±1-slot rounding becomes a large percentage and load skews. Keep M at least ~100× the maximum backend count.
- Forgetting that ECMP also reshuffles. Maglev keeps the backend stable, but a router rehash can send a live flow to a different Maglev machine. Without per-instance connection tracking, the table alone still preserves the backend — but only because every instance built the identical table. Diverging backend lists across instances silently breaks consistency.
- Treating the table as a session store. The table maps flow hashes to backends; it is not application session affinity. If two different clients hash to the same slot, they share a backend — that's correct for L4, but don't assume one slot equals one user.
Frequently asked questions
What problem does Maglev hashing solve that plain modulo hashing doesn't?
Plain hash(flow) % N reshuffles almost every connection when N changes — add or remove one backend and roughly (N−1)/N of all flows jump to a different server, breaking live TCP sessions. Maglev keeps the table size M fixed and only reassigns the slots that belonged to the changed backend, so a single backend change disturbs about 1/N of flows instead of nearly all of them.
How does Maglev decide the lookup table size M?
M is a fixed prime chosen so M is much larger than the number of backends N — Google used 65537 and 655373 in production. Each backend ends up owning roughly M/N slots, so the larger M is relative to N, the more evenly load is spread. The trade-off is memory and build time: the table is O(M) and building it is roughly O(M log M).
Is Maglev hashing perfectly even?
Almost. The busiest and least-busy backend differ by at most one slot, so when M is at least ~100× the backend count the imbalance stays under about 1%. The paper reports almost-perfect balancing for Maglev at every table size it tested — far tighter than ring-based consistent hashing, which needed backends overprovisioned by ~30% at M = 65537 to absorb its skew.
Why does Maglev sit behind ECMP routers instead of replacing them?
Routers use ECMP (equal-cost multi-path) to spray packets across many Maglev machines by 5-tuple hash. Because every Maglev builds the same lookup table from the same backend list, any Maglev that receives a packet maps it to the same backend — so even if ECMP reshuffles which Maglev a flow hits, the backend stays consistent. Maglev adds per-flow consistency that ECMP alone can't guarantee.
What happens to existing connections when a backend is removed?
Maglev rebuilds the table without the dead backend. Only the slots that pointed to the removed backend get reassigned to survivors; every other slot keeps its old target, so all unaffected flows stay pinned to the same server. Maglev also keeps a per-flow connection-tracking table so even the small set of disturbed flows can be held steady within a single Maglev instance.
How is Maglev different from rendezvous (HRW) hashing?
Rendezvous hashing computes hash(key, backend) for every backend at lookup time and picks the max — O(N) per request, perfect minimal disruption. Maglev front-loads that work into a precomputed table, so each request is one O(1) array index. Maglev trades a slightly-less-than-perfect disruption bound for constant-time lookups at line rate, which matters when you forward millions of packets per second.