Compilers
Register Allocation
Unlimited virtual registers, sixteen physical ones — pick wisely
Register allocation maps a compiler's unlimited virtual registers to a CPU's fixed physical registers. Graph coloring (Chaitin) or linear scan (most JITs). Spill cost minimization. The pass that decides whether your hot loop runs at full speed.
- ProblemMap virtual registers → physical registers
- ComplexityNP-complete (Chaitin 1981) — heuristic algorithms only
- x86-64 GPRs available13–16 (after ABI/reserved registers)
- Main algorithmsGraph coloring (Chaitin-Briggs), linear scan, PBQP, SSA-based
- JITs typically useLinear scan or greedy — 10–100× faster than coloring
- Spill cost in tight loop5–30% wall-time per spilled value
Interactive visualization
Build an interference graph from a live-range overlap, then color it with k physical registers. Watch a node spill when the graph won't color.
How register allocation works
A compiler IR (LLVM, GIMPLE, V8's Sea of Nodes) uses an unbounded set of "virtual registers" — every SSA value gets its own name. The CPU, however, has a fixed register file. x86-64 exposes 16 general-purpose integer registers, but several are reserved by ABI or hardware: RSP for the stack pointer, RDI/RSI/RDX/RCX/R8/R9 as the System V argument-passing registers, RAX/RDX clobbered by mul/div. The allocator typically has 13–16 usable GPRs.
The job: assign each virtual register a physical register, or — if you can't — spill it to a stack slot. Constraints:
- Two values simultaneously live (holding different live values at the same instruction) cannot share a physical register.
- Some instructions force specific registers (the result of
divgoes in RAX:RDX; the count for variable shifts must be in CL). - Calling conventions reserve registers (RBP often as frame pointer; callee-saved must be preserved).
The first constraint is the hard one — it makes register allocation equivalent to graph coloring, which Chaitin proved NP-complete in 1981.
The interference graph
Run liveness analysis: for each virtual register, compute the set of instructions where it's live (defined before, used after). Two virtual registers interfere if their live ranges overlap at any point. Build a graph — vertices are virtual registers, edges connect interfering pairs.
Now register allocation is a graph coloring problem: color each vertex with one of k colors (= k physical registers) so adjacent vertices have different colors. If k-coloring fails, you need to spill — pick a vertex, remove it from the graph, generate load/store code for it, and try again.
// Example: 4 virtual registers, 3 physical registers (R1, R2, R3)
// Live ranges (instruction numbers):
// a: lives 1..5
// b: lives 2..7 (overlaps with a)
// c: lives 4..6 (overlaps with a and b)
// d: lives 6..9 (overlaps with b and c)
// Interference graph:
// a — b — d
// | |
// +— c —+
// 3-coloring:
// a: R1 b: R2 c: R3 d: R1 (a and d don't overlap)
Chaitin–Briggs graph coloring
The classical algorithm runs in two phases — simplify and select — with a spill fallback when simplification gets stuck.
// Chaitin-Briggs register allocation
// k = number of physical registers
func allocate(graph, k):
stack = []
while graph has vertices:
v = find_vertex(graph, degree < k)
if v exists:
stack.push(v)
graph.remove(v) // simplify
else:
v = pick_spill_candidate(graph) // optimistic / Briggs
stack.push(v)
graph.remove(v)
// Select phase — pop and color
while stack:
v = stack.pop()
used = { color(u) for u in v.original_neighbors if colored }
if exists c in [1..k] - used:
color(v) = c
else:
spill(v) // actual spill
if any spills:
rewrite_with_spills(); restart()
The Briggs improvement over Chaitin's original is the "optimistic" coloring: even if a node has ≥ k neighbors in the current graph, push it on the stack anyway — at coloring time some of its neighbors may have ended up with the same color and it might still succeed. Reduces unnecessary spills by 20–30% in practice.
Linear scan
Poletto and Sarkar published linear scan in 1999. The insight — for JIT compilation, you don't need the optimal coloring, you need fast allocation. Don't build the interference graph at all; instead, sort live ranges by their start position and sweep linearly.
// Linear-scan allocator
func linear_scan(ranges, k):
sort ranges by start_pos
active = [] // intervals currently holding a reg
free_regs = {R1..Rk}
for range in ranges:
expire_old(range.start, active, free_regs)
if free_regs is empty:
spill_at(range, active)
else:
range.reg = free_regs.pop()
active.add(range)
func expire_old(now, active, free_regs):
for r in active where r.end < now:
free_regs.add(r.reg); active.remove(r)
func spill_at(range, active):
longest = max(active, key = end_pos)
if longest.end > range.end:
range.reg = longest.reg
spill(longest)
active.remove(longest); active.add(range)
else:
spill(range)
Linear scan runs in O(n log n) (the sort dominates), is trivial to implement, and produces 5–15% slower code than graph coloring on benchmarks. For JITs that need to compile in milliseconds, that's the right trade. V8's TurboFan, HotSpot's C1, and .NET's RyuJIT all use variants of linear scan; LLVM's RegAllocGreedy is a more sophisticated linear-scan derivative that has become the default since LLVM 4.0.
Allocator comparison
| Graph coloring | Linear scan | PBQP | |
|---|---|---|---|
| Time complexity | O(n²) typical, up to O(n³) | O(n log n) | O(n³) worst, polynomial-approx |
| Code quality | Best | 5–15% slower than coloring | Best (with constraint modeling) |
| Implementation effort | High — liveness, IG, simplify, select, spill rewrite | Low — sort + sweep | Very high — quadratic-form solver |
| Handles constraints | Yes (precolored nodes) | Partial (extra fixup) | Excellent (encoded in cost matrix) |
| Used in production | GCC (old), HotSpot C2 | V8, HotSpot C1, JikesRVM, .NET RyuJIT | LLVM (optional), research compilers |
| Compile speed (1k lines) | ~50 ms | ~3 ms | ~150 ms |
| Best for | AOT, optimizing tier | JIT, fast compile | Architectures with weird constraints (DSPs) |
When this matters for you
- Writing inline assembly with constraints. GCC and Clang's
asm()blocks accept constraint letters that pin operands to specific registers — you need to know what the allocator is doing so you don't over-constrain it. - Profiling shows surprising load/store activity. If your hot loop has
movs to and from the stack, you're spilling. Reduce live ranges (split variables, hoist) or restructure to reduce simultaneous liveness. - Targeting register-poor architectures. 32-bit x86 has only 8 GPRs (and EAX is special) — spills are common even on simple code. AVR microcontrollers have 32 8-bit registers but operations work on pairs. ARM Thumb-2 restricts you to R0–R7 for many encodings.
- Debugging compiler-generated assembly. Understanding why
-O2chose a particular register layout requires understanding the allocator's heuristics.
Worked example — a 4-color interference graph
The animation walks through this case. Six virtual registers v1–v6, four physical registers R1–R4, with this interference pattern:
Interference edges:
v1 — v2, v1 — v3, v1 — v4
v2 — v3, v2 — v5
v3 — v4, v3 — v5
v4 — v6
v5 — v6
Chaitin simplify:
v6 has degree 2 (< 4) — push, remove
v1 now has degree 3 (< 4) — push, remove
v4 now has degree 1 (< 4) — push, remove
v5 now has degree 1 (< 4) — push, remove
v2 has degree 1, v3 has degree 1 — push, remove
Select (pop in reverse):
v3: assign R1
v2: neighbors {v3=R1} → R2
v5: neighbors {v2=R2, v3=R1} → R3
v4: neighbors {v3=R1} → R2
v1: neighbors {v2=R2, v3=R1, v4=R2} → R3
v6: neighbors {v4=R2, v5=R3} → R1
Result: 4-colorable, no spills. All in registers.
Now add one edge — v1 ↔ v6. Suddenly v1 interferes with v6's neighbors v4 and v5 too transitively, and the graph requires 5 colors. With only 4 physical registers, the allocator spills the lowest-cost vertex (typically the one with the fewest uses inside loops).
C-style allocator skeleton
// Minimal linear-scan allocator
typedef struct { int vreg, start, end, preg; bool spilled; } Range;
void linear_scan(Range *ranges, int n, int k) {
qsort(ranges, n, sizeof(Range), cmp_start);
int free_regs[k]; for (int i = 0; i < k; i++) free_regs[i] = 1;
Range *active[k]; int nactive = 0;
for (int i = 0; i < n; i++) {
Range *r = &ranges[i];
// Expire
for (int j = 0; j < nactive; ) {
if (active[j]->end < r->start) {
free_regs[active[j]->preg] = 1;
active[j] = active[--nactive];
} else j++;
}
// Find free reg
int reg = -1;
for (int p = 0; p < k; p++) if (free_regs[p]) { reg = p; break; }
if (reg == -1) {
// Spill the longest-living active or self
Range *longest = active[0];
for (int j = 1; j < nactive; j++)
if (active[j]->end > longest->end) longest = active[j];
if (longest->end > r->end) {
r->preg = longest->preg;
longest->spilled = true;
// Replace longest with r in active
} else {
r->spilled = true;
}
} else {
r->preg = reg;
free_regs[reg] = 0;
active[nactive++] = r;
}
}
}
Real allocators add coalescing (eliminate copies left over from SSA deconstruction), live-range splitting (allow a value to occupy different registers in different regions), rematerialization (recompute a value instead of reloading from stack if it's cheap), and per-register-class handling (integer vs floating-point vs vector files).
Common register allocation pitfalls
- Pre-coloring constraints poison the graph. Instructions that force a specific physical register (x86
divwants RAX, shifts want CL) create "pre-colored" nodes. Pre-colored nodes are non-removable in simplification — they create high-degree pseudo-vertices that easily trigger spills nearby. - Calling conventions consume registers. Every function call clobbers caller-saved registers. Across a call, any live virtual register holding a value in a caller-saved physical register must spill (or be moved to a callee-saved register first). Callee-saved register usage shows up in function prologue/epilogue.
- Loop-carried dependencies inflate register pressure. A reduction like
sum += a[i]needs sum live the entire loop. Multiple parallel reductions for vectorization multiply pressure. Strip-mining or partial-sums restructuring helps. - Phi-node coalescing failures. SSA deconstruction inserts copies; if the allocator can't coalesce them away (because the two φ-operands interfere), you pay real
movs. Sreedhar's destruction algorithm minimizes this. - Floating-point pressure. x86-64 has 16 SSE/AVX registers (XMM0–XMM15). Hot floating-point code easily blows past this. AVX-512 adds 32 ZMM registers — much better.
Frequently asked questions
Why does register allocation matter for performance?
A CPU register access is roughly 1 cycle. A stack/memory access — even with the L1 cache — is 3–5 cycles, and an L2 miss is 10+. Every variable that doesn't fit in a register turns into load/store instructions. For tight inner loops where the working set exceeds the register file, register allocation quality is the single biggest performance lever the compiler has. Spilling one variable in a hot loop can cost 5–30% wall-time.
What's the interference graph?
A graph where vertices are virtual registers (live ranges) and edges connect any two that are simultaneously live. Two vertices that share an edge cannot occupy the same physical register because they hold different values at the same time. Register allocation becomes graph coloring — assign one of k colors (k = number of physical registers) to each vertex so no edge has both endpoints the same color. Chaitin proved in 1981 that this problem is NP-complete in general.
What's graph coloring register allocation?
The Chaitin–Briggs algorithm. Build the interference graph from liveness analysis. Repeatedly remove any vertex with fewer than k neighbors (it will be colorable). Push those vertices on a stack. If you can't find any low-degree vertex, mark one as a spill candidate and remove it. When the graph is empty, pop the stack and assign colors. Used by GCC's IRA, LLVM's older RegAllocGreedy fallback, and most static optimizing compilers.
What's linear scan register allocation?
Poletto and Sarkar 1999 — sort all live ranges by start position, scan through them allocating physical registers in O(n log n) time. When two ranges overlap and all registers are taken, spill the longest one. Doesn't build the full interference graph. Linear scan is what JITs use — V8 (TurboFan), HotSpot C1, .NET RyuJIT — because graph coloring is too slow at runtime. Produces 5–15% slower code than full graph coloring but compiles 10–100× faster.
What's spilling?
When a virtual register can't get a physical register, the compiler "spills" it to a stack slot — inserts a store before every definition and a load before every use. Spilling is correct but slow. Spill cost is a heuristic: roughly (uses + defs) × 10^(loop_depth), so a value used inside a triply-nested loop costs 1000× more to spill than one used at top level. The allocator picks low-cost candidates first.
How many physical registers does x86-64 have?
x86-64 exposes 16 general-purpose integer registers, but several are reserved or constrained — RSP is the stack pointer, RBP often the frame pointer, RAX/RDX are clobbered by div/mul, RCX is the shift-count register, RDI/RSI/RDX/RCX/R8/R9 are the System V ABI argument registers. Practically the allocator has 13–16 GPRs to work with depending on calling convention and ABI choices. ARM64 gives you 31 general-purpose registers — almost double the pressure budget.
What's coalescing?
Eliminating unnecessary move instructions. SSA deconstruction inserts copies on φ-function edges; many of those copies are redundant if the source and destination would naturally land in the same register. The allocator merges (coalesces) their live ranges into one node before coloring. Aggressive coalescing (Chaitin) merges anything mergeable; conservative coalescing (Briggs) only merges if it won't cause the combined node to become high-degree. Modern compilers use iterated register coalescing — George and Appel 1996.