Speculative Execution
Run first, prove right later — the engine of high IPC, the door to Spectre
Speculative execution lets a CPU run instructions past unresolved branches, throwing away the wrong path. It buys high IPC but leaks data through cache timing — the Spectre family.
- In-flight speculation window: 200–600 µops (ROB)
- Branch-mispredict penalty: 15–20 cycles
- IPC without speculation: ~0.3 (vs 4+ today)
- Spectre v1 leakage: ~2 KB/s cross-process
- Mitigation overhead: 1–30% (workload-dependent)
- First disclosed: January 2018 (Spectre, Meltdown)
How speculative execution works
A modern x86 or ARM core can issue four to eight instructions per cycle. To keep those slots filled it has to fetch four to eight new instructions per cycle from the front-end. The problem: real code has a conditional branch roughly every five to seven instructions. If the core had to wait until each branch resolved before fetching past it, the pipeline would stall constantly — IPC would collapse to about 0.3.
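The ~0.3 figure can be sanity-checked with a back-of-envelope model. The sketch below assumes a core that stops fetching at every branch until the branch resolves; the function name and the resolve latency passed to it are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* Rough IPC for a hypothetical core that stalls fetch at every branch.
 *   instrs_per_branch: average dynamic instructions between branches
 *   issue_width:       instructions issued per cycle when not stalled
 *   resolve_cycles:    assumed stall while each branch resolves */
double ipc_without_speculation(double instrs_per_branch,
                               double issue_width,
                               double resolve_cycles)
{
    assert(instrs_per_branch > 0 && issue_width > 0 && resolve_cycles >= 0);
    /* Cycles to issue one branch-to-branch run, plus the stall at its end. */
    double cycles = instrs_per_branch / issue_width + resolve_cycles;
    return instrs_per_branch / cycles;
}
```

With a branch every 6 instructions, 4-wide issue, and an assumed 14-cycle stall, this gives 6 / 15.5 ≈ 0.39; with a branch every 5 instructions and a 16-cycle stall it gives about 0.29. Both land in the neighborhood of the ~0.3 quoted above.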
The fix is to guess. The branch predictor picks a direction before the branch executes. The front-end keeps fetching down the predicted path, decodes it, allocates rename registers, and dispatches the work into the back-end. By the time the branch actually resolves, dozens or hundreds of speculated instructions may already have executed. If the guess was right (predictors hit 95–99% on typical code), those instructions retire normally. If the guess was wrong, every instruction younger than the branch is squashed: their Reorder Buffer entries are discarded, the rename map is restored to its state at the branch, their pending stores are dropped from the store buffer before they reach the cache, and the front-end restarts at the correct target. The misprediction penalty is roughly the pipeline depth from fetch to branch resolution, about 15–20 cycles on current cores.
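To make "the predictor picks a direction" concrete, here is a minimal sketch of a 2-bit saturating counter, the textbook building block of branch prediction. Real predictors are far more elaborate; the type and function names here are illustrative, and the starting state is an arbitrary choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One 2-bit saturating counter: states 0 and 1 predict not-taken,
 * states 2 and 3 predict taken. */
typedef struct { unsigned state; } counter2;

static bool predict(const counter2 *c) { return c->state >= 2; }

/* Nudge the counter toward the observed outcome, saturating at 0 and 3. */
static void train(counter2 *c, bool taken)
{
    if (taken && c->state < 3) c->state++;
    if (!taken && c->state > 0) c->state--;
}

/* Replays a trace of branch outcomes (true = taken) through one counter
 * and returns how many of them were mispredicted. */
size_t count_mispredicts(const bool *outcomes, size_t n)
{
    assert(outcomes != NULL || n == 0);
    counter2 c = { 1 };                 /* start weakly not-taken */
    size_t wrong = 0;
    for (size_t i = 0; i < n; i++) {
        if (predict(&c) != outcomes[i]) wrong++;
        train(&c, outcomes[i]);
    }
    return wrong;
}
```

For a loop branch that is taken nine times and then falls through, repeated ten times, this counter mispredicts only the ten loop exits plus one warm-up miss. The same property, a predictor that follows recent history, is what an attacker leans on when training a bounds check with valid inputs.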
Speculation lives in microarchitectural state that is invisible to the architecture spec. From the programmer's view, only retired instructions are real. From the cache's view, every load that ran — even the ones that were squashed — left a footprint. That asymmetry is the entire premise of Spectre.
The Reorder Buffer and the speculation window
The Reorder Buffer (ROB) is where speculation lives. Each in-flight micro-op gets a ROB entry; it stays there until it can retire in program order. Modern cores carry hundreds of entries:
| Microarchitecture | ROB size (µops) | Load buffer | Store buffer |
|---|---|---|---|
| Intel Skylake (2015) | 224 | 72 | 56 |
| Intel Sunny Cove (2019) | 352 | 128 | 72 |
| Intel Golden Cove (2021) | 512 | 192 | 114 |
| AMD Zen 3 (2020) | 256 | 72 | 64 |
| AMD Zen 4 (2022) | 320 | 136 | 64 |
| Apple M1 Firestorm (2020) | ~630 | ~210 | ~140 |
A 600-entry ROB means the core can have up to 600 µops in flight at once, and every one of them that is younger than an unresolved branch becomes transient if that branch turns out to be mispredicted. The bigger the ROB, the more parallelism the core can extract, and the larger the window available to a side-channel attack.
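The allocate, complete, retire, and squash life cycle can be sketched as a small circular buffer. This is a toy model under simplifying assumptions (eight entries, no register renaming, no store buffer); all names are illustrative and do not describe any vendor's design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ROB_SIZE 8   /* toy size; real cores carry hundreds of entries */

typedef struct {
    bool done[ROB_SIZE];   /* has this micro-op finished executing? */
    size_t head;           /* oldest in-flight micro-op (next to retire) */
    size_t count;          /* number of in-flight micro-ops */
} rob;

/* Allocates an entry at the tail in program order.
 * Returns its slot, or -1 if the ROB is full (the front-end must stall). */
int rob_alloc(rob *r)
{
    if (r->count == ROB_SIZE) return -1;
    size_t slot = (r->head + r->count) % ROB_SIZE;
    r->done[slot] = false;
    r->count++;
    return (int)slot;
}

/* Marks a micro-op as executed; completion may happen out of order. */
void rob_complete(rob *r, int slot)
{
    assert(slot >= 0 && slot < ROB_SIZE);
    r->done[slot] = true;
}

/* Retires finished micro-ops from the head, strictly in program order.
 * Returns how many retired. */
size_t rob_retire(rob *r)
{
    size_t n = 0;
    while (r->count > 0 && r->done[r->head]) {
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        n++;
    }
    return n;
}

/* Squashes every micro-op younger than the mispredicted branch held in
 * branch_slot. Returns the number squashed: the transient micro-ops. */
size_t rob_squash_after(rob *r, int branch_slot)
{
    assert(branch_slot >= 0 && branch_slot < ROB_SIZE);
    size_t age = ((size_t)branch_slot + ROB_SIZE - r->head) % ROB_SIZE;
    assert(age < r->count);             /* the branch must be in flight */
    size_t squashed = r->count - (age + 1);
    r->count = age + 1;
    return squashed;
}
```

Note what the model leaves out: squashing only shrinks `count`. Nothing here, and nothing in a real core's rollback, undoes cache fills that the squashed micro-ops triggered.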
Spectre and the side-channel surface
Spectre v1 is the bounds-check bypass. The attacker convinces a victim function to speculatively read out-of-bounds memory by training the branch predictor:
// Victim gadget: correct architecturally.
if (x < array1_size) {
    y = array2[array1[x] * 4096];
}
// Train the predictor with valid x, then attack with a malicious x.
// The CPU predicts that the bounds check passes, speculatively loads
// array1[x] (a secret byte), uses it as an index into array2, and
// brings one specific cache line into L1. The mispredicted path is
// rolled back architecturally, but the cache line stays hot.
The attacker then probes each of the 256 candidate cache lines in array2 with a timed load. The line at offset v * 4096 is hot exactly when the secret byte was v, so the one fast load reveals the byte. With Flush+Reload, the timing difference between an L1 hit (about 4 cycles) and a DRAM miss (~250 cycles) is unambiguous.
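The decoding half of the probe is simple enough to sketch on its own. The function below assumes the 256 load latencies have already been measured (the flush and timing steps are deliberately left out); the function name and the threshold parameter are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CANDIDATES 256   /* one probe line per possible byte value */

/* latency[v] is the measured load time, in cycles, of array2[v * 4096].
 * Returns the byte value whose probe line loaded fastest, or -1 if no
 * line came in under hit_threshold (nothing was cached, so retry). */
int recover_byte(const uint32_t latency[CANDIDATES], uint32_t hit_threshold)
{
    assert(latency != NULL);
    int best = -1;
    uint32_t best_latency = hit_threshold;
    for (int v = 0; v < CANDIDATES; v++) {
        if (latency[v] < best_latency) {
            best_latency = latency[v];
            best = v;
        }
    }
    return best;
}
```

With hit latencies of a few cycles and misses in the hundreds, any threshold between the two (100 cycles, say) separates them cleanly. Practical attacks typically repeat the measurement and vote, since a single round can be noisy.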
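The original disclosure reported around 10 KB/s in-process and ~2 KB/s cross-process leakage. Later work broadened the attack surface: NetSpectre demonstrated a remote variant over the network, SgxPectre targeted SGX enclaves, and the MDS attacks (RIDL, Fallout, ZombieLoad) leaked data from other microarchitectural buffers (line-fill buffers, store buffer, load ports). Retbleed (2022) revived branch-target injection on Intel Skylake-class and AMD Zen 1/2 cores by getting return instructions to be predicted like indirect branches, which undermines retpolines.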
Mitigations
The defenses split into software, microcode, and hardware fixes:
- LFENCE after bounds checks (Spectre v1). The compiler or kernel inserts a speculation barrier after the check so that dependent loads cannot issue until the branch has resolved. Where a fence is too costly, the index is clamped instead; array_index_nospec() in Linux is the canonical pattern. __builtin_expect mainly influences code layout and is not a mitigation.
- Retpoline (Spectre v2). Indirect branches are replaced with a return-to-trampoline sequence, so a possibly poisoned BTB is never consulted for them. Costs roughly 1–5% on most code, more on heavy indirect-call workloads like Python interpreters.
- IBRS, STIBP, IBPB. MSR-based controls, added through microcode updates, that flush (IBPB) or restrict (IBRS, STIBP) indirect-branch predictor state. Higher overhead, so they are used selectively (e.g., on kernel entry or context switch).
- KPTI (Meltdown). The kernel unmaps almost all of itself from user page tables. Every syscall now switches page tables, which without PCID also flushes the TLB; the cost was measured at 5–30% on syscall-heavy benchmarks at launch.
- Hardware fixes. Intel Cascade Lake and Ice Lake close the original Meltdown in silicon (AMD cores were not affected by it), and AMD Zen 2+ adds hardware Spectre mitigations; later parts add Enhanced IBRS so that predictor isolation has near-zero per-context cost.
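The index clamp named in the first item can be sketched in portable C. The version below is modeled on the generic fallback mask in the Linux kernel; the function names are illustrative. It assumes size <= LONG_MAX and relies on two implementation-defined behaviors that hold on GCC and Clang: wrapping unsigned-to-signed conversion, and arithmetic right shift of negative values.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Returns all-ones when index < size and zero otherwise, without a branch.
 * If index >= size, either index itself or (size - 1 - index) has its top
 * bit set, so the complement is non-negative and shifts down to zero.
 * ASSUMES: size <= LONG_MAX, arithmetic right shift of negative longs. */
unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    assert(size <= (unsigned long)LONG_MAX);
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

/* Forces an out-of-range index to 0. Because the result depends on data
 * rather than on a predicted branch, a mispredicted bounds check still
 * sees an in-range index. */
unsigned long clamp_index(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}
```

This is an illustration of the arithmetic only. An optimizing compiler is free to turn it back into a branch, which is one reason the kernel uses per-architecture assembly for the mask rather than plain C.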
Inspecting and controlling speculation
// Spectre v1 mitigation in the Linux kernel.
// array_index_nospec(): clamp the index so even speculative
// loads cannot go out of bounds.
if (idx < array_size) {
    idx = array_index_nospec(idx, array_size);
    // The macro builds a mask with branch-free arithmetic
    // (cmp/sbb on x86; arm64 follows it with a CSDB barrier).
    // After it, any speculative path with idx >= array_size
    // sees idx == 0.
    val = array[idx];
}

// Manual LFENCE: speculation barrier on x86.
asm volatile("lfence" ::: "memory");

// Compiler hint that the condition is rare. It affects code
// layout; it is not a speculation barrier.
if (__builtin_expect(rare_condition, 0)) {
    handle_rare();
}

// Retpoline-style indirect call (GCC -mindirect-branch=thunk).
// GCC emits a thunk that uses a ret-to-trampoline loop instead
// of a raw indirect jump, defeating BTB poisoning.
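On Intel x86 cores, perf stat -e br_misp_retired.all_branches counts retired branch mispredictions. The same event family can be selected by raw encoding, e.g. cpu/event=0xC5,umask=0x4/, but umask meanings differ between microarchitectures, so check the event list for your core. cat /sys/devices/system/cpu/vulnerabilities/spectre_v2 reports the active mitigation strategy. On ARM, CSDB (Consumption of Speculative Data Barrier) plays a role similar to LFENCE in Spectre v1 mitigations.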
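The sysfs status line can also be read programmatically. This is a minimal Linux-only sketch; the helper names are illustrative, and the file is simply absent on kernels or platforms that do not expose it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads the first line of path into buf, NUL-terminated, newline stripped.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
int read_first_line(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL && len > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL) return -1;
    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (ok == NULL) return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Convenience wrapper for the Spectre v2 entry mentioned above. */
int spectre_v2_status(char *buf, size_t len)
{
    return read_first_line(
        "/sys/devices/system/cpu/vulnerabilities/spectre_v2", buf, len);
}
```

The same directory holds one file per known vulnerability, so read_first_line() can be pointed at any of them. The text in each file is meant for humans, and its wording varies across kernel versions.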
Performance numbers
- Without speculation, IPC on SPEC CPU2017 drops from ~3.5 to ~0.3 — a 10× regression.
- Each branch misprediction costs ~15–20 cycles of bubble; at 5 GHz that is 3–4 ns wasted per mispredict.
- A modern branch predictor hits 98% on integer SPEC and 99%+ on floating-point. The remaining 1–2% accounts for most of the wasted speculation energy.
- Spectre v1 leakage: ~10 KB/s in-process, ~2 KB/s cross-process in the original paper. A 4096-bit RSA key is only 512 bytes, so at those rates the transfer itself takes well under a second once the key's address is known.
- Meltdown KPTI overhead on a 32K-syscall/sec workload: ~22% throughput loss on pre-PCID hardware, ~3% with PCID and INVPCID.
- The Reorder Buffer on Apple M1 Firestorm (~630 entries) is roughly 3× Intel Skylake's; this is one reason M1 hits ~5 IPC on SPECint vs ~3.5 for Skylake.
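The misprediction figures above combine into a simple cost model. The sketch below is arithmetic only; the function names are illustrative, and the inputs in the note that follows are picks from the ranges quoted in this section, not measurements.

```c
#include <assert.h>

/* Converts a cycle count to nanoseconds at the given clock rate in GHz. */
double cycles_to_ns(double cycles, double ghz)
{
    assert(ghz > 0);
    return cycles / ghz;
}

/* Average stall cycles added per instruction by branch mispredictions.
 *   instrs_per_branch: average dynamic instructions between branches
 *   accuracy:          predictor hit rate, between 0.0 and 1.0
 *   penalty_cycles:    pipeline bubble per misprediction */
double mispredict_cpi(double instrs_per_branch, double accuracy,
                      double penalty_cycles)
{
    assert(instrs_per_branch > 0 && accuracy >= 0.0 && accuracy <= 1.0);
    return (1.0 - accuracy) * penalty_cycles / instrs_per_branch;
}
```

At 5 GHz, cycles_to_ns() turns the 15–20 cycle penalty into the 3–4 ns quoted above. At 98% accuracy, a branch every 6 instructions, and an 18-cycle penalty, mispredict_cpi() gives 0.06 stall cycles per instruction, which is significant next to the roughly 0.25 cycles per instruction of a core sustaining 4 IPC.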
Common pitfalls
- Assuming architectural rollback erases everything. It does not erase cache state, TLB state, predictor state, or DRAM row-buffer state. Every Spectre variant exploits something that survives rollback.
- Disabling SMT as the universal fix. SMT (hyperthreading) shares more microarchitectural state between threads, which is why several MDS variants required disabling it on affected parts. But disabling SMT costs 15–30% on parallel workloads. For Spectre v2, STIBP plus EIBRS is usually enough to isolate sibling threads.
- Trusting if as a security boundary. A bounds check that is correct architecturally can still be bypassed speculatively. Use array_index_nospec() or a CMOV-based clamp.
- Forgetting indirect branches. Spectre v2 attacks the BTB, not the conditional predictor. Function pointers, virtual calls, and switch tables are all potential gadgets.
- Believing "I have no secrets to leak." Stack canaries, ASLR base addresses, KASLR offsets, and TLS keys are all worth stealing. So is the kernel direct-map.
Frequently asked questions
What is speculative execution?
Speculative execution is a CPU technique that issues and executes instructions before it knows they are needed — typically past an unresolved branch. The predictor guesses which way the branch will go, the back-end runs the predicted path, and if the guess was right those results retire normally. If the guess was wrong, the speculated work is squashed and the pipeline restarts.
How is it different from out-of-order execution?
Out-of-order execution reorders ready instructions to fill execution units. Speculation lets the front-end keep fetching past an unresolved branch so there are instructions to reorder. Modern cores do both: a Reorder Buffer can hold 200–600 in-flight µops.
What is the Spectre family of vulnerabilities?
Spectre is a class of side-channel attacks that abuse speculative execution. The attacker trains the predictor to mispredict, the CPU speculatively loads secret data and indexes a cache line with it, the misprediction is detected and architecturally rolled back — but the cache line remains hot. A flush-and-reload probe then reads the secret indirectly.
How much data can a Spectre attack leak per second?
The original Spectre v1 paper reported around 10 KB/s in-process and ~2 KB/s cross-process. Practical exploits range from 100 B/s to 10 KB/s depending on gadget and microarchitecture. Slow for bulk data, plenty fast for SSH keys, cookies, or password hashes.
What mitigations have CPUs and operating systems shipped?
LFENCE or index clamping after bounds checks; retpolines and IBRS/IBPB for indirect branches; STIBP for SMT siblings; KPTI page-table isolation, with PCID/ASID to keep its TLB cost down; and hardware fixes in later silicon (Cascade Lake, Ice Lake, Zen 2+). Costs range from 1% to 30% depending on workload.
Why don't CPUs just stop speculating?
Because speculation is responsible for most of the IPC. A branch occurs every 5–7 dynamic instructions; without speculation, a 20-stage pipeline would stall for something like 14 cycles at each one while the branch travels from fetch to resolution. Estimated slowdown of disabling speculation entirely: 3–5× on most code, 10×+ on branch-heavy workloads.
What is a transient instruction?
A transient instruction is one that executes speculatively, then never retires because the speculation was wrong. Architecturally it is as if it never happened. Microarchitecturally it still touched the cache, TLB, and branch predictor — and that footprint is where Spectre lives.