Computer Architecture

Branch Prediction

Modern CPUs guess if/else outcomes 95%+ accurately — and pay 15-cycle penalties when wrong

Branch prediction is a CPU technique that guesses whether a conditional branch (if/while/for) will be taken before knowing the actual outcome — letting the pipeline keep fetching/decoding speculatively. A modern x86 CPU achieves 95-99% prediction accuracy via two-level adaptive predictors and TAGE/perceptron predictors. A misprediction costs 15-20 cycles (the entire pipeline must be flushed). Famous example: a sorted array runs ~6× faster than an unsorted one in a tight branchy loop, because branches become predictable. Spectre (2018) showed branch prediction can leak data via speculative execution side channels.

Modern accuracy95-99%
Misprediction penalty15-20 cycles
Predictor typesBimodal, two-level, TAGE, perceptron
Sorted vs unsorted~6× difference
History bits32-256
Spectre2018 disclosure

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Why branch prediction matters

Pipeline depth. Modern Intel and AMD pipelines are 14-20 stages deep. Without prediction, every branch stalls the pipeline waiting for the condition to evaluate — losing 15+ cycles per branch. With ~20% of all instructions being branches, that would cap IPC (instructions per cycle) below 1; predictors are why we get IPC of 4-8 on modern cores.
Tight loops. Loop back-edges are nearly always taken; the predictor learns this in 1-2 iterations and sustains 99%+ accuracy. Mostly-predictable loops are why CPUs achieve their advertised peak throughput.
Profile-guided optimization (PGO). Compilers like GCC, Clang, MSVC use runtime branch profiles to lay out code so the likely branch falls through, packing hot code densely in I-cache and aligning branches favorably for the predictor.
Branchless code. When prediction fails (e.g. random/data-dependent conditions), branchless equivalents using cmov, masks, or min/max avoid the misprediction penalty even at the cost of always doing both arms of the conditional.
JIT compilers. V8, HotSpot, .NET emit branch prediction-aware code, including type guards that trade an occasional misprediction (deopt) for the common-case speed.
Security. Spectre, Meltdown, and successors weaponize speculative execution to leak microarchitectural state. Mitigations (retpoline, IBPB, IBRS) cost 5-30% on syscall-heavy workloads.

How a modern predictor works

A current Intel/AMD predictor combines several mechanisms in parallel:

BTB (Branch Target Buffer). Cache of recently-encountered branch instructions and their targets. Hit means the CPU knows where to fetch from before decoding the branch instruction. Sized 2K-8K entries.
Direction predictor. Predicts taken/not-taken. Modern designs use TAGE: multiple tables tagged with different history-length hashes, the longest matching tag wins. Captures both short-range and long-range branch correlations.
Indirect branch predictor. For computed jumps (vtable, switch tables, function pointers), a separate target-prediction structure indexed by recent history.
Return Stack Buffer (RSB). Mini hardware stack tracking call/return pairs; predicts return targets without consulting the BTB.
Loop predictor. Recognizes simple loops (e.g. for i in 0..100) and predicts the exit branch correctly on the final iteration.

The end-to-end accuracy on SPECint 2017 reaches 97-99% for loops and structured code, dropping to 90-95% on hash-table walks and indirect-call-heavy OOP code.

Anatomy of a misprediction

A misprediction unfolds as follows:

Branch fetched. Predictor guesses NOT TAKEN. Pipeline fetches sequential instructions speculatively.
Speculative instructions decode, dispatch, execute out-of-order. Some commit, some buffered in reorder buffer (ROB).
10-20 cycles later, the branch resolves: it WAS taken.
Pipeline is flushed: ROB drained, register renaming reset, fetched instructions discarded.
Fetch redirected to the correct branch target. Pipeline refills (15-20 cycles to a fully-issued state).

Net cost: 15-20 cycles of useful work lost per misprediction. At a 3 GHz clock that's 5-7 nanoseconds — equivalent to an L2 cache hit. Frequent mispredictions tank IPC: a workload mispredicting 5% of branches with 15-cycle penalty effectively pays 0.75 cycles per branch instruction, halving observed throughput.

The famous sorted-array example

The Stack Overflow question "Why is processing a sorted array faster than an unsorted array?" (2012, 1.5M views) demonstrated branch prediction's impact:

// Unsorted array, random values 0..255
int sum = 0;
for (int i = 0; i < ARRAYSIZE; ++i) {
    if (data[i] > 128) sum += data[i];
}

With unsorted data, data[i] > 128 is true ~50% of the time at random positions. The predictor's accuracy plummets to 50%; every other iteration eats a 15-cycle penalty. Measured: ~11 seconds for 100M iterations on an Intel i7 of that era.

Sort the array first: now the branch is taken for the first half, not-taken for the second half. The predictor learns each segment in 2 iterations and stays correct. Measured: ~1.9 seconds — a 5.7× speedup. The instructions executed are identical; only the branch outcome pattern changes.

Branchless alternatives

When data is genuinely unpredictable, branchless code avoids the penalty:

// Branchless: always compute, mask via comparison
int t = (data[i] - 128) >> 31;  // 0 if >=128, -1 otherwise
sum += data[i] & ~t;

No conditional jump — the CPU pipeline never has to predict. On the unsorted benchmark, branchless is ~2.0 seconds, comparable to the sorted version. The trade-off: branchless does both "halves" of work always; for highly imbalanced or predictable branches, branchful + good prediction is faster because skipped work stays skipped.

Measurement tools

Linux perf. perf stat -e branch-misses,branches reports miss rate. perf record -e branch-misses samples mispredicting PCs for source-level attribution.
Intel VTune. Microarchitecture Exploration analysis breaks down "Bad Speculation" as a top-level bottleneck category.
AMD μProf. Equivalent to VTune for AMD CPUs.
Apple Instruments. CPU Counters template exposes branch_mispred events.
llvm-mca / uica.uops.info. Static analysis of expected branch behavior in tight loops.

A first-cut diagnostic: branch miss rate >1% of total branches indicates either intrinsic data unpredictability or insufficient predictor warmup; >5% is a serious bottleneck worth restructuring code for.

Common misconceptions

"Modern CPUs predict perfectly." 99% accuracy still means a misprediction every 100 branches; with 20% branch density, that's one penalty every 500 instructions. On a 4-IPC core, that's 6-7 ns of stall per ~150 ns of work. Far from invisible.
"Spectre is fully fixed." Mitigations exist for known variants but new speculative side channels keep appearing (Branch History Injection 2022, Inception 2023). The fundamental tension between speculation and isolation remains unresolved.
"Branchless is always faster." Only when prediction would have failed. For highly predictable branches, branchful code skips the unused arm entirely; branchless always evaluates both. Profile before refactoring.
"likely/unlikely macros directly improve runtime." They mostly improve I-cache layout and rarely-taken-path attribution; on modern x86 the dynamic predictor learns the same information from the first few executions. Their effect is real but small (typically 1-3% on hot code, more on cold paths).
"Switch statements always use a jump table." Modern compilers prefer binary-search trees of comparisons for sparse switches and small jump tables for dense ones, plus they sometimes lay out cases by frequency from PGO. Indirect-jump prediction is harder than direct-branch prediction; misprediction rates 5-15% are common on virtual dispatch.
"You don't have to think about branches in high-level languages." JIT compilers (HotSpot, V8, .NET) emit branchful machine code with the same prediction behavior. A poorly-laid-out conditional in JavaScript still pays misprediction penalties.

Design implications

Sort before scan when downstream code branches on values.
Use branchless idioms — min, max, abs, bit tricks — for hot loops with unpredictable conditions.
Hoist predicates out of loops when possible; one branch per loop is better than per-iteration.
Group similar work together. Polymorphic dispatch on heterogeneous lists devastates indirect predictors.
Exploit PGO/BOLT/Propeller to align branches and lay out hot code densely in I-cache.
Tolerate misprediction on cold paths (error handling, logging) — readability wins.

Frequently asked questions

How does the CPU 'guess' a branch?

The CPU maintains a Branch Target Buffer (BTB) and Branch History Table (BHT) indexed by the program counter of each branch. On encountering a branch in the fetch stage, the CPU consults these tables: the BHT predicts taken/not-taken, the BTB predicts the target address. The pipeline speculatively fetches from the predicted target. When the branch actually resolves (10-20 stages later), if the prediction was correct, work continues; if wrong, the speculative work is discarded and the pipeline restarts. Modern predictors index by recent global branch history and program counter together for context-aware prediction.

What is the bimodal vs two-level predictor?

Bimodal: per-branch 2-bit saturating counter (Strongly Not Taken, Weakly NT, Weakly T, Strongly T). Predicts based on counter sign; updates on resolution. Cheap, ~80-90% accuracy. Two-level (Yeh & Patt 1991): a global history register (last N branches taken/not) indexes a pattern history table. Captures correlations like 'after the first branch was taken, this one usually is too'. Hits ~95%. TAGE (Seznec 2006): tagged geometric history lengths — multiple tables with different history depths competing. Modern Intel/AMD use TAGE-derivatives reaching 99% on regular code.

Why does sorting an array speed up a loop with conditionals?

Famous Stack Overflow question (2012, 1.5M views). Loop sums elements only if value > 128. With unsorted random data, the branch is 50/50 — predictor random-guesses, hits ~50% accuracy, eats a 15-cycle penalty on every other iteration. Net: ~7-9 ns per element. With the array sorted, the branch outcome flips at most once (when crossing 128) — predictor learns 'always taken' or 'always not' segments, accuracy ~99%, no penalty. Net: ~1.2 ns per element. ~6× speedup from the same instructions in a different data order.

How does Spectre exploit branch prediction?

Spectre v1 (Bounds Check Bypass): attacker trains the branch predictor on legitimate inputs to predict 'within bounds'. Then feeds an out-of-bounds index. The CPU speculatively executes the array load past the bound, fetches secret memory into cache. The branch eventually resolves as not taken; speculative work is discarded but the cache state remains. Attacker probes the cache to recover the secret bit-by-bit. Mitigations: lfence after bounds checks, masking (and-bound), Speculative Store Bypass disable, retpoline for indirect branches. Per-process kernel mitigations cost 5-30% performance on syscalls.

What are likely/unlikely macros for?

GCC __builtin_expect(cond, 1) and Linux kernel likely(cond)/unlikely(cond) hint to the compiler which branch direction is more probable. Compiler uses this for two purposes: (1) reorder code so the likely path is the fall-through (fewer taken-branch cache fetches), (2) on architectures with static prediction hints, encode the hint into the branch instruction. On modern x86 the dynamic predictor learns quickly so static hints matter mostly for cold code (first execution) and rarely-taken paths (error handling).

How do hardware perceptrons work?

Perceptron predictor (Jimenez & Lin 2001): a per-branch perceptron, with N weights (one per global history bit) plus a bias. To predict, compute the dot product of weights with history (-1 for not-taken, +1 for taken); if positive, predict taken. After resolution, update weights via the perceptron learning rule. Captures linear correlations longer than tagged-table approaches. AMD has used perceptron-based predictors since Bulldozer (2011); Apple Silicon's Firestorm uses a hybrid TAGE+perceptron design. Reach 99%+ on typical code.