Computer Architecture
Superscalar Execution
One core, many instructions per tick — parallelism hiding inside a single thread
Superscalar execution issues multiple instructions per clock cycle to several parallel execution units on one core, exploiting instruction-level parallelism (ILP) to raise IPC above 1 — limited by data dependencies, hazards, and the issue width.
- Peak throughput: = issue width (IPC > 1)
- First mainstream CPU: Intel Pentium, 1993 (2-wide)
- Typical wide core today: 4–8 issue slots
- Practical ILP ceiling: ≈ 2–6 for integer code
- Hard limit: true (read-after-write) dependencies
How superscalar execution works
A scalar pipelined processor is a single conveyor belt. It overlaps the stages of consecutive instructions — fetch, decode, execute, memory, write-back — so a new instruction enters every cycle and one finishes every cycle. The ceiling is exactly one instruction per cycle: IPC = 1. Crank the clock all you want; the belt is still one item wide.
A superscalar processor widens the belt. It fetches a bundle of instructions each cycle, decodes them in parallel, checks them for dependencies, and dispatches as many as it can to a bank of duplicated execution units — several integer ALUs, one or two load/store units, a multiplier, floating-point and SIMD pipes. If the machine is 4-wide and the next four instructions are mutually independent, all four execute in the same cycle and IPC hits 4.
The trick is that this happens transparently. The program is an ordinary sequential instruction stream — the same binary that ran on a scalar chip. The hardware extracts parallelism on the fly by inspecting a sliding window of upcoming instructions and finding ones with no data dependence between them. No new ISA, no annotations, no recompile. That is the whole appeal of superscalar over its explicit cousin VLIW: the parallelism is the CPU's job, not the compiler's.
A modern superscalar core does four things every cycle:
- Fetch a wide block of instructions, steered by the branch predictor so the window doesn't stall at every branch.
- Decode and rename — translate architectural registers (the ~16–32 names in the ISA) into a much larger pool of physical registers, erasing false dependencies.
- Dispatch renamed micro-ops into issue queues / reservation stations.
- Issue ready ops — those whose inputs are all available — to whichever execution units are free, up to the issue width.
The mechanism: dependencies set the real ceiling
Issue width is the peak. The achievable rate is governed by data dependencies between nearby instructions. There are three kinds, and only one of them is fundamental:
- RAW (read-after-write) — a true dependency. Instruction B reads a register that A writes. B genuinely needs A's result, so B cannot execute before A finishes. This is the irreducible limit — no hardware trick removes it.
- WAR (write-after-read) and WAW (write-after-write): false dependencies. These exist only because the ISA has a small fixed set of register names, so two unrelated computations happen to reuse r3. Register renaming maps each write to a fresh physical register, dissolving WAR and WAW entirely.
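The three dependency kinds and the effect of renaming can be sketched in a few lines. This is a toy model rather than how any real rename stage is built: the `(destination, sources)` instruction format, the register names, and the `p0, p1, ...` physical-register naming are all assumptions made for illustration.

```python
from itertools import count

# Hypothetical mini-program: (destination register, [source registers]).
# r1 is reused by two unrelated computations.
program = [
    ("r1", ["r2", "r3"]),   # I0: r1 = r2 + r3
    ("r4", ["r1", "r5"]),   # I1: r4 = r1 * r5   (RAW on r1 from I0)
    ("r1", ["r6", "r7"]),   # I2: r1 = r6 + r7   (WAW with I0, WAR with I1)
    ("r8", ["r1", "r2"]),   # I3: r8 = r1 - r2   (RAW on r1 from I2)
]

def hazards(prog):
    """Return sorted (kind, earlier_index, later_index) register hazards."""
    last_writer = {}   # register -> index of its most recent writer
    readers = {}       # register -> indices that read it since that write
    found = set()
    for j, (dst, srcs) in enumerate(prog):
        for s in srcs:
            if s in last_writer:
                found.add(("RAW", last_writer[s], j))
            readers.setdefault(s, set()).add(j)
        for i in readers.get(dst, set()):
            if i != j:
                found.add(("WAR", i, j))
        if dst in last_writer:
            found.add(("WAW", last_writer[dst], j))
        last_writer[dst] = j
        readers[dst] = set()
    return sorted(found)

def rename(prog):
    """Give every write a fresh physical register; sources follow the latest mapping."""
    fresh = (f"p{n}" for n in count())
    mapping = {}   # architectural register -> current physical register
    out = []
    for dst, srcs in prog:
        new_srcs = []
        for s in srcs:
            if s not in mapping:          # live-in value: allocate on first use
                mapping[s] = next(fresh)
            new_srcs.append(mapping[s])
        mapping[dst] = next(fresh)
        out.append((mapping[dst], new_srcs))
    return out

print(hazards(program))
# [('RAW', 0, 1), ('RAW', 2, 3), ('WAR', 1, 2), ('WAW', 0, 2)]
print(hazards(rename(program)))
# [('RAW', 0, 1), ('RAW', 2, 3)]
```

After renaming, only the two RAW edges remain, so I2 and I3 no longer have to wait behind I0 and I1.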
Consider four instructions:
I1: r1 = r2 + r3
I2: r4 = r1 * r5 ; RAW on r1 — must wait for I1
I3: r6 = r7 + r8 ; independent — can pair with I1
I4: r9 = r6 - r2 ; RAW on r6 — must wait for I3
On a 4-wide machine, and treating every operation as single-cycle, the best schedule is two cycles, not one: {I1, I3} issue together, then {I2, I4}. The dependency chains I1→I2 and I3→I4 are each length 2, so the critical path is 2 cycles regardless of width. The fundamental bound is:
cycles ≥ max( ⌈N / issue_width⌉,            // throughput bound
              longest dependency chain,     // latency / critical-path bound
              ⌈N_op / units_of_that_op⌉ )   // structural / resource bound
where N is the instruction count and N_op is the number of instructions that need a given unit type, taken for the most oversubscribed type. The processor's job is to schedule so the realized cycle count approaches this lower bound. The wider the window of in-flight instructions it can search, the more independent work it finds to fill issue slots, which is why deep speculative and out-of-order machines extract far more ILP than narrow in-order ones.
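As a rough sketch, the bound can be evaluated directly. The instruction format (`id`, `deps`, `latency`, `unit`) and the unit counts below are assumptions chosen for illustration, and the resource term assumes fully pipelined units that each accept one new operation per cycle.

```python
from collections import Counter
from math import ceil

def cycle_lower_bound(instrs, issue_width, units):
    """Lower bound on cycles: max of throughput, critical-path and resource terms."""
    by_id = {op["id"]: op for op in instrs}
    finish = {}  # id -> earliest cycle at which its result can be ready

    def finish_time(op_id):
        if op_id not in finish:
            op = by_id[op_id]
            start = max((finish_time(d) for d in op["deps"]), default=0)
            finish[op_id] = start + op["latency"]
        return finish[op_id]

    critical_path = max(finish_time(op_id) for op_id in by_id)
    throughput = ceil(len(instrs) / issue_width)
    demand = Counter(op["unit"] for op in instrs)
    resource = max(ceil(n / units[u]) for u, n in demand.items())
    return max(throughput, critical_path, resource)

# The four-instruction example, with every op assumed to take one cycle.
prog = [
    {"id": "I1", "deps": [], "latency": 1, "unit": "alu"},
    {"id": "I2", "deps": ["I1"], "latency": 1, "unit": "mul"},
    {"id": "I3", "deps": [], "latency": 1, "unit": "alu"},
    {"id": "I4", "deps": ["I3"], "latency": 1, "unit": "alu"},
]
print(cycle_lower_bound(prog, 4, {"alu": 2, "mul": 1}))  # 2
print(cycle_lower_bound(prog, 1, {"alu": 2, "mul": 1}))  # 4
```

At width 4 the critical-path and resource terms (three ALU ops on two ALUs) both give 2; at width 1 the throughput term takes over.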
Where superscalar pays off — and where it doesn't
- Branchy, dependency-light scalar code — interpreters, parsers, OS kernels, business logic. There's no SIMD to exploit, so width plus good branch prediction is the only lever left, and it's the reason general-purpose cores are wide.
- Pointer-chasing with overlap — a wide out-of-order window can keep many independent cache misses in flight (memory-level parallelism), overlapping their latencies even when the code looks serial.
- Mixed workloads where you can't predict the instruction mix, so duplicated heterogeneous units (ALU + load + FP) keep utilization up.
Where extra width earns little: tight loops that are already memory-bandwidth-bound (the bottleneck is DRAM, not issue slots), long single dependency chains (a serial reduction, a linked-list traversal with a true loop-carried dependency), and code with a branch every few instructions and a mispredict rate the predictor can't tame. For data-parallel numeric kernels, throwing the work at SIMD or a GPU beats widening a scalar core. Superscalar is the answer to "make one sequential thread faster," not "process a giant array."
Superscalar vs other ways to go faster
| | Scalar pipelined | Superscalar (in-order) | Superscalar (OoO) | VLIW | SIMD / vector | SMT / multicore |
|---|---|---|---|---|---|---|
| Peak IPC (one thread) | 1 | = issue width | = issue width | = bundle width | 1 op on N lanes | 1 per core × cores |
| Who finds parallelism | — | hardware, same-cycle pairs only | hardware, large window | compiler, at build time | programmer / compiler | programmer (threads) |
| Needs recompile | no | no | no | yes (ISA-specific) | yes (vector code) | yes (threading) |
| Handles unpredictable deps | n/a | poorly — stalls | well — reorders | poorly — fixed schedule | n/a (regular data) | n/a |
| Hardware cost of going wider | linear | ~quadratic (ports, bypass) | ~quadratic + window logic | linear (simple HW) | linear per lane | linear per core |
| Energy per instruction | low | moderate | high (rename, schedule, ROB) | low | very low (amortized) | moderate |
| Canonical example | classic 5-stage RISC | Intel Pentium (1993) | Intel P6, Apple Firestorm | Itanium, TI C6x DSP | AVX-512, ARM SVE | any modern multicore |
The headline split is "who finds the parallelism." Superscalar puts it on the hardware at run time, so old binaries get faster on new chips — but the scheduler logic grows roughly quadratically with width. VLIW pushes the job to the compiler, keeping hardware simple, but a fixed compile-time schedule can't react to a cache miss whose latency only shows up at run time — which is why Itanium underdelivered on general-purpose code. SIMD and multicore attack a different axis (data and thread parallelism) and compose with superscalar rather than competing with it: a real core is pipelined, superscalar, out-of-order, SMT, and has SIMD units all at once.
What the numbers actually say
- The Pentium (1993) was 2-wide in-order, with rigid pairing rules — the famous "U-pipe / V-pipe." It could only co-issue when the second instruction was simple and independent, so sustained IPC on real code was well under 2, often near 1.1–1.3.
- Wall's 1991 ILP study found that with realistic (not perfect) branch prediction and renaming, typical programs expose only about 2–6 instructions of parallelism per cycle — and integer code sits at the low end. This is the empirical reason no one ships a 16-wide general-purpose core.
- Doubling issue width costs roughly quadratic hardware. Going from 4-wide to 8-wide doubles the register-file ports, and because register-file area grows roughly with the square of the port count, that area about quadruples; the operand-bypass network and the dependency-check matrix at dispatch likewise grow as O(W²). The performance return is sub-linear, so the cost/benefit collapses past ~6–8 wide.
- Apple's Firestorm (M1, 2020) is famously ~8-wide decode with a ~630-entry reorder buffer, and sustains 6+ IPC on favorable code — but on a branchy database or compiler workload the same core still averages roughly 1–2 IPC, dominated by mispredicts and cache misses.
- Amdahl in miniature: if 30% of a hot loop is a serial dependency chain, even an infinitely wide machine caps speedup at 1 / 0.30 ≈ 3.3×. Width only helps the parallelizable 70%.
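The Amdahl figure in the last bullet is easy to reproduce for finite widths too. This is a back-of-the-envelope sketch that assumes the non-serial 70% scales perfectly with width, which real code will not; the function name is made up for illustration.

```python
def amdahl_speedup(serial_fraction, width):
    """Speedup over a 1-wide machine if only the non-serial part scales with width."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / width)

for width in (2, 4, 8, 16):
    print(width, round(amdahl_speedup(0.30, width), 2))
# 2 1.54
# 4 2.11
# 8 2.58
# 16 2.91
print(round(1.0 / 0.30, 2))  # 3.33, the limit as width grows without bound
```

Most of the achievable gain is already captured by 4 to 8 wide; doubling again to 16 adds about a third of a point.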
JavaScript: a list scheduler that models superscalar issue
You can't directly observe issue slots from JavaScript, but you can model the scheduling problem the hardware solves: given instructions with latencies and RAW dependencies, how few cycles does a W-wide, R-unit machine need? This greedy list scheduler issues every ready, resource-available instruction each cycle — exactly what an aggressive out-of-order issue stage approximates.
// Each instr: { id, deps: [ids it reads], latency, unit }
// units: how many execution units of each type exist, e.g. { alu: 2, mul: 1, mem: 1 }
// Assumes fully pipelined units (each accepts one new op per cycle), acyclic deps,
// and that every dep id and unit type actually exists; otherwise the loop never ends.
function schedule(instrs, issueWidth, units) {
  const byId = new Map(instrs.map(i => [i.id, i]));
  const done = new Map(); // id -> cycle its result becomes available
  const remaining = new Set(instrs.map(i => i.id));
  let cycle = 0, issued = 0, lastDone = 0;
  while (remaining.size) {
    let slots = issueWidth;
    const freeUnits = { ...units }; // every unit is free again each cycle
    // A pass over remaining ops in program order; issue the ones that are ready.
    for (const id of [...remaining]) {
      if (slots === 0) break;
      const op = byId.get(id);
      // RAW check: every dependency must have COMPLETED by this cycle.
      const ready = op.deps.every(d => done.has(d) && done.get(d) <= cycle);
      if (!ready) continue;
      if ((freeUnits[op.unit] ?? 0) === 0) continue; // structural hazard
      freeUnits[op.unit]--;
      slots--;
      remaining.delete(id);
      done.set(id, cycle + op.latency); // completes later
      lastDone = Math.max(lastDone, cycle + op.latency);
      issued++;
    }
    cycle++;
  }
  // Total time runs until the last result is ready, not just until the last issue.
  return { cycles: lastDone, issued, ipc: +(issued / lastDone).toFixed(2) };
}
const prog = [
  { id: 'I1', deps: [], latency: 1, unit: 'alu' },
  { id: 'I2', deps: ['I1'], latency: 3, unit: 'mul' }, // RAW + slow multiply
  { id: 'I3', deps: [], latency: 1, unit: 'alu' },
  { id: 'I4', deps: ['I3'], latency: 1, unit: 'alu' },
];
console.log(schedule(prog, 4, { alu: 2, mul: 1, mem: 1 }));
// { cycles: 4, issued: 4, ipc: 1 }
// I1,I3 issue cycle 0; I2,I4 issue cycle 1; I2's result is ready at cycle 4 (3-cycle multiply).
// Width 4 can't beat the I1->I2 latency chain: that's the critical path.
Two lessons fall out immediately. First, the done.get(d) <= cycle check is the RAW constraint — a dependent op waits for its producer to complete, which for a 3-cycle multiply is three cycles later. Second, freeUnits is the structural hazard: even with free issue slots, you can't run two multiplies if there's one multiplier.
Python: counting available ILP in a basic block
A useful companion question: what is the maximum IPC this code could ever reach on an infinitely wide machine with single-cycle ops? That equals the instruction count divided by the length of the longest dependency chain (the critical path). This is the classic limit-study question, "how much ILP is even in here?", and it reduces to a longest-path computation on the dependency DAG.
from functools import lru_cache

# instrs: dict id -> list of ids it depends on (reads the result of)
def ideal_ipc(instrs):
    @lru_cache(maxsize=None)
    def depth(i):  # longest dependency chain ending at i
        ds = instrs[i]
        return 1 + (max(depth(d) for d in ds) if ds else 0)

    critical_path = max(depth(i) for i in instrs)  # min cycles, infinite width
    n = len(instrs)
    return {
        "instructions": n,
        "critical_path": critical_path,
        "ideal_ipc": round(n / critical_path, 2),  # IPC at infinite width
    }

prog = {
    "I1": [],       # r1 = r2 + r3
    "I2": ["I1"],   # r4 = r1 * r5 (true dep on I1)
    "I3": [],       # r6 = r7 + r8 (independent)
    "I4": ["I3"],   # r9 = r6 - r2 (true dep on I3)
}
print(ideal_ipc(prog))
# {'instructions': 4, 'critical_path': 2, 'ideal_ipc': 2.0}
# Even an infinitely wide CPU needs 2 cycles: two independent length-2 chains.
If ideal_ipc comes back near 1, the block is essentially serial and no issue width will help it. You need to restructure the algorithm: break the dependency chain (for example by unrolling with several independent accumulators) or vectorize. If it comes back at 4 but your real machine only hits 2, the bottleneck is the machine: too few units, too narrow a window, or branch mispredicts truncating the window.
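To make "break the dependency chain" concrete, here is a small sketch comparing a serial sum with the same sum split across several accumulators. The graph builders `serial_sum` and `split_sum` are hypothetical helpers that model only the add operations, each taken as single-cycle; loads and loop overhead are ignored.

```python
def critical_path(deps):
    """deps: dict id -> list of ids it depends on. Longest chain, counted in ops."""
    depth = {}

    def walk(i):
        if i not in depth:
            depth[i] = 1 + max((walk(d) for d in deps[i]), default=0)
        return depth[i]

    return max(walk(i) for i in deps)

def serial_sum(n):
    """acc += x[k] for k in 0..n-1: every add depends on the previous add."""
    return {k: ([k - 1] if k > 0 else []) for k in range(n)}

def split_sum(n, accumulators):
    """The same n adds spread round-robin over several accumulators (n >= accumulators),
    followed by a serial combine of the partial sums."""
    deps = {k: ([k - accumulators] if k >= accumulators else []) for k in range(n)}
    last = [max(k for k in range(n) if k % accumulators == a)
            for a in range(accumulators)]
    prev = last[0]
    for a in range(1, accumulators):
        node = n + a - 1              # id of the combining add
        deps[node] = [prev, last[a]]
        prev = node
    return deps

for name, graph in (("serial", serial_sum(16)), ("4 accumulators", split_sum(16, 4))):
    cp = critical_path(graph)
    print(name, len(graph), cp, round(len(graph) / cp, 2))
# serial 16 16 1.0
# 4 accumulators 19 7 2.71
```

The split version executes three extra adds yet has a critical path of 7 instead of 16. For floating-point data this reassociation can change rounding, which is why compilers usually need explicit permission to do it.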
Variants and the surrounding machinery
In-order superscalar. Issues a fixed-position bundle and stalls the whole bundle if the leading instruction isn't ready. Cheap and power-efficient — used in many mobile efficiency cores and the original Pentium and Atom. It only captures ILP that's adjacent in program order.
Out-of-order superscalar. Buffers a large window of decoded ops (the reorder buffer / ROB), issues any op whose operands are ready regardless of program order, then retires them in order to preserve correct architectural state and precise exceptions. This is the design that actually extracts the available ILP, and it's what "superscalar" colloquially means today. The classic algorithms here are Tomasulo's algorithm (1967, IBM 360/91) for dynamic scheduling with reservation stations, and the reorder buffer for precise state.
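The difference between the two issue policies shows up even in a toy model. The sketch below is illustrative only: the `(deps, latency)` program format and the `window` parameter are assumptions, latencies must be at least 1, and execution-unit limits, renaming, and in-order retirement are all left out.

```python
def simulate(prog, width, in_order, window=8):
    """prog: list of (deps, latency) in program order; deps are earlier indices.
    Returns the cycle at which the last result becomes available."""
    done_at = {}                       # index -> cycle its result is ready
    pending = list(range(len(prog)))   # un-issued instructions, oldest first
    cycle = 0
    while pending:
        issued = []
        for idx in pending[:window]:
            if len(issued) == width:
                break
            deps, latency = prog[idx]
            ready = all(d in done_at and done_at[d] <= cycle for d in deps)
            if ready:
                done_at[idx] = cycle + latency
                issued.append(idx)
            elif in_order:
                break                  # a stalled op blocks everything behind it
        pending = [i for i in pending if i not in issued]
        cycle += 1
    return max(done_at.values())

# A slow load (4 cycles) whose consumer sits ahead of independent work.
prog = [
    ([], 4),    # 0: load
    ([0], 1),   # 1: uses the load
    ([], 1),    # 2: independent
    ([2], 1),   # 3: uses 2
    ([], 1),    # 4: independent
    ([4], 1),   # 5: uses 4
]
print(simulate(prog, width=2, in_order=True))   # 7
print(simulate(prog, width=2, in_order=False))  # 5
```

With the same width, the in-order machine idles behind instruction 1 while the load is outstanding; the out-of-order one finishes the independent work in the load's shadow.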
VLIW (Very Long Instruction Word). Same goal — multiple ops per cycle — but the compiler bundles independent ops into one wide instruction at build time, and the hardware just executes the bundle with no dependency-checking. Simpler hardware, but the schedule is frozen at compile time and can't adapt to run-time events like cache misses. Itanium and many DSPs.
SMT (simultaneous multithreading / Hyper-Threading). A superscalar core often can't find enough ILP in one thread to fill all its slots. SMT feeds it instructions from two (or more) threads at once, so when thread A stalls on a miss, thread B's independent work fills the empty issue slots — raising utilization of the same expensive wide back end.
Common misconceptions and gotchas
- "Superscalar means out-of-order." No. The Pentium was superscalar and strictly in-order. Width (how many you issue per cycle) and ordering (whether you may reorder) are independent axes.
- "Wider is always faster." ILP is capped at ~2–6 for typical integer code, and width costs ~quadratic hardware. Past ~6–8 wide, you spend a lot of transistors and energy for almost no gain — which is why cores went multicore instead of ever-wider.
- "More instructions = slower." On a superscalar machine, two cheap independent instructions can be free relative to one that sits on the critical path. Optimizing for instruction count can be the wrong target; optimizing the dependency chain length is what matters.
- "Renaming removes all dependencies." It removes only the false ones (WAR, WAW). True RAW dependencies are the program's actual data flow and cannot be renamed away — they set the critical path.
- Branch mispredicts truncate the window. A wide machine needs a deep stream of correctly-predicted instructions to find independent work. A mispredict flushes the speculative window, so superscalar performance is hostage to branch prediction quality.
- Memory aliasing blocks reordering. If the hardware can't prove two memory accesses target different addresses, it must keep them ordered — store-to-load forwarding and memory disambiguation are where a lot of real superscalar complexity (and bugs) live.
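The mispredict point can be put into numbers with a common first-order model: add the average stall cycles per instruction caused by mispredicts to the ideal cycles per instruction. The parameter values below (20% branches, a 15-cycle flush penalty) are assumptions picked for illustration, not measurements of any particular core.

```python
def effective_ipc(ideal_ipc, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order estimate: CPI = 1/ideal_ipc + mispredict stalls per instruction."""
    base_cpi = 1.0 / ideal_ipc
    stall_cpi = branch_fraction * mispredict_rate * penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)

# Hypothetical workload: 20% branches, 15-cycle mispredict penalty.
print(round(effective_ipc(4, 0.20, 0.05, 15), 2))  # 2.5   4-wide, 5% mispredicts
print(round(effective_ipc(8, 0.20, 0.05, 15), 2))  # 3.64  8-wide, 5% mispredicts
print(round(effective_ipc(8, 0.20, 0.01, 15), 2))  # 6.45  8-wide, 1% mispredicts
```

Under these assumptions, doubling the width at a 5% mispredict rate lifts IPC by less than half, while improving the predictor on the 8-wide machine lifts it far more.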
Frequently asked questions
What is the difference between superscalar and pipelined execution?
Pipelining overlaps the stages of one instruction stream so a new instruction starts each cycle — peak throughput is still one instruction per cycle. Superscalar adds width: it fetches, decodes, and issues several instructions in the same cycle to duplicated execution units, so peak throughput is the issue width, e.g. 4 or 6 instructions per cycle. Real CPUs are both pipelined and superscalar.
What limits how many instructions a superscalar CPU can issue per cycle?
Three things: the issue width (how many decode/issue slots exist), the number and type of execution units (you can't run three multiplies if there are only two multipliers), and instruction-level parallelism in the code itself. Data dependencies are the hard ceiling — if instruction B reads the result of A, B cannot issue in the same cycle as A no matter how wide the machine is.
Does superscalar require out-of-order execution?
No. The original Intel Pentium (1993) was superscalar but in-order — it issued two instructions per cycle only when the next pair happened to be independent. Out-of-order execution dramatically increases how often a wide machine finds independent work, so nearly every high-performance superscalar core since the mid-1990s pairs the two, but they are separate ideas.
What is IPC and why can it exceed 1?
IPC is instructions per cycle, the average number of instructions a core retires each clock. A scalar pipelined CPU tops out at IPC = 1. A superscalar core issues to multiple units in parallel, so IPC can exceed 1. Modern cores like Apple's Firestorm sustain 6+ IPC on favorable code, though typical general-purpose workloads land closer to 1–2 because of dependencies, branch mispredicts, and cache misses.
Why doesn't doubling the issue width double performance?
Available instruction-level parallelism is limited and has diminishing returns. Studies since Wall (1991) show typical integer code has an ILP ceiling of roughly 2–6 under realistic branch prediction and renaming, because of true data dependencies and unpredictable branches. Going from 4-wide to 8-wide also costs roughly quadratically more hardware for register-file ports, bypass networks, and dependency-check logic, so the last increments of width buy very little.
How do superscalar CPUs handle two instructions that write the same register?
Register renaming. False dependencies — write-after-write and write-after-read on the same architectural register — are removed by mapping each write to a fresh physical register from a large pool. Only true read-after-write dependencies, where one instruction genuinely needs another's value, remain and serialize execution.