Computer Architecture
CPU Pipelining
An assembly line for instructions — overlap the stages, finish one every cycle
CPU pipelining overlaps the fetch, decode, execute, memory, and write-back stages of consecutive instructions so a new instruction can finish almost every clock cycle — an assembly line that boosts throughput without speeding up any single instruction.
- Classic stages5 (IF, ID, EX, MEM, WB)
- Ideal speedup≈ number of stages
- Steady-state throughput1 instruction / cycle
- Single-instruction latencyunchanged (slightly worse)
- Hazard typesstructural, data, control
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How pipelining overlaps instructions
Imagine doing laundry for five loads. You could wash, dry, fold, and put away one load completely before touching the next — that's the non-pipelined CPU, finishing one "instruction" before fetching the next. Or you can start washing load two the moment load one moves to the dryer. The washer, dryer, and folding table all stay busy. Nothing got faster, but you finish five loads in roughly the time it used to take to finish two. That's pipelining.
A processor splits the work of executing an instruction into fixed stages. The textbook RISC pipeline, popularized by the MIPS designs of John Hennessy at Stanford in the early 1980s, uses five:
- IF — Instruction Fetch. Read the instruction at the program counter from instruction memory; increment PC.
- ID — Instruction Decode. Decode the opcode and read source registers from the register file.
- EX — Execute. Do the ALU work — add, compare, compute a memory address, or evaluate a branch condition.
- MEM — Memory access. Load from or store to data memory (only loads and stores do real work here).
- WB — Write-back. Write the result back into the destination register.
Between every pair of stages sits a pipeline register — a bank of latches that holds the partial result so the next stage can pick it up on the following clock edge. Once the pipeline is full, all five stages are working on five different instructions at once. A space-time diagram makes this obvious:
cycle: 1 2 3 4 5 6 7 8 9
I1: IF ID EX MEM WB
I2: IF ID EX MEM WB
I3: IF ID EX MEM WB
I4: IF ID EX MEM WB
I5: IF ID EX MEM WB
I1 finishes at cycle 5. After that, I2 finishes at cycle 6, I3 at cycle 7 — one instruction per cycle. The first instruction took 5 cycles to fill the pipe; every instruction after it appears to take just one.
The speedup math
Let the unpipelined instruction take time T, split evenly into k stages of T/k each. Add a per-stage register overhead d for the latch latency, so each pipeline stage clocks at T/k + d. Running n instructions:
Unpipelined time = n · T
Pipelined time = (k + n − 1) · (T/k + d)
Speedup = n·T / [(k + n − 1)·(T/k + d)]
As n → ∞ the k − 1 fill cost vanishes and the speedup approaches:
Speedup_max = T / (T/k + d) = k / (1 + k·d/T)
If the register overhead d were zero you'd get a perfect k× speedup. It isn't zero — and the k·d/T term is exactly why you can't pipeline forever. The real-world figure of merit is CPI (cycles per instruction): an ideal pipeline has CPI = 1, and every stall pushes it above 1:
CPI = 1 + stall_cycles_per_instruction
Effective speedup = pipeline_depth / CPI (vs. non-pipelined)
Hazards — why CPI is never exactly 1
In a perfect world the diagram above always holds. In reality three classes of hazard force the pipeline to stall — insert a bubble, a no-op that wastes a stage.
Structural hazards. Two instructions want the same hardware in the same cycle. If instruction memory and data memory share a single port, an IF and a MEM colliding in the same cycle stall. Real designs split instruction and data caches (a Harvard split) precisely to remove this hazard.
Data hazards. The most common. A read-after-write (RAW) dependence: instruction I2 needs a register that I1 hasn't written back yet.
ADD r1, r2, r3 ; r1 = r2 + r3 (result ready end of EX, cycle 3)
SUB r4, r1, r5 ; needs r1 in EX (cycle 4) — but WB is cycle 5!
Without help, SUB must wait two cycles for the ADD to reach WB. The fix is forwarding (bypassing): wire the ALU output directly back to the ALU input, so r1's value reaches SUB's EX a full two cycles early. Forwarding kills almost every RAW stall. The one it can't fix is a load-use hazard — a load result isn't ready until end of MEM, one cycle too late for the next instruction's EX, so it costs exactly one bubble.
Control hazards. A branch changes the program counter, but the pipeline has already fetched the next sequential instructions speculatively. If the branch is taken (or mispredicted), those in-flight instructions must be flushed. The penalty equals the number of cycles between fetch and where the branch resolves. Modern CPUs lean hard on branch prediction to guess the outcome early, and on speculative execution to run down the predicted path before the branch is confirmed.
Pipelining vs. other ways to go faster
| Scalar pipeline | Superscalar | Out-of-order (OoO) | VLIW | SIMD / vector | Multicore | |
|---|---|---|---|---|---|---|
| Best-case IPC | 1 | 2–8 | 2–8 (sustained higher) | = issue width | 1 (but wide data) | cores × per-core IPC |
| Instruction order | in-order | in-order issue | reordered by hardware | fixed by compiler | in-order | independent streams |
| Who schedules | hardware (trivial) | hardware | hardware (ROB, reservation stations) | compiler | compiler / intrinsics | OS + programmer |
| Handles data hazards via | forwarding + stalls | forwarding + stalls | register renaming | compiler scheduling | n/a within a lane | n/a across cores |
| Hardware cost | low (pipeline regs) | moderate (dup. units) | high (ROB, RAT, schedulers) | low core, fat compiler | moderate (wide ALUs) | high (replicate cores) |
| Misprediction cost | ~3 bubbles | flushes the whole width | flush ROB to branch | full pipeline flush | same as base pipeline | per-core, isolated |
| Canonical example | MIPS R2000 | Intel Pentium | Pentium Pro, modern x86/ARM | Itanium, some DSPs | AVX-512, ARM SVE | any modern CPU |
These compose. A modern core like Apple's Firestorm or AMD's Zen 4 is pipelined and superscalar and out-of-order and has SIMD units and ships many cores. Pipelining is the foundation the rest are built on.
What the numbers actually say
- 5-stage pipeline, ideal throughput ≈ 5×. In practice, hazards and the fill/drain cost pull a classic in-order 5-stage core to a CPI around 1.1–1.5 on typical integer code — call it a real 3.5–4.5× over a non-pipelined design of the same logic.
- A load-use stall costs 1 cycle. On a 3 GHz core that's ~0.33 ns of dead time per occurrence. Compilers reorder code to slot an independent instruction into that slot and erase it.
- A branch mispredict in a deep pipeline costs 14–20 cycles. At 3 GHz that's ~5–7 ns thrown away per mispredict. With branches every ~5 instructions and 95% predictor accuracy, that's roughly one mispredict per 100 instructions — and at 15 cycles each, it inflates CPI by ~0.15, a 15% throughput hit. This is why predictors are pushed to 99%+.
- The Pentium 4 (NetBurst) ran a 20–31 stage pipeline. It hit 3.8 GHz but burned power and stalled badly on mispredicts. Intel abandoned it; the Core 2 (2006) returned to a ~14-stage pipeline and was faster per clock.
- Register overhead is real. Each pipeline latch adds setup + clock-to-Q delay (tens of picoseconds). Past ~15–20 stages the latches eat enough of the cycle that adding stages stops helping.
JavaScript implementation
This simulator models a 5-stage in-order pipeline. It tracks which instruction occupies each stage per cycle, detects RAW data hazards, and inserts bubbles unless forwarding is enabled (with a forced load-use bubble that forwarding can't remove).
const STAGES = ['IF', 'ID', 'EX', 'MEM', 'WB'];
// instr: { op, dst, src: [a, b], isLoad }
function simulate(program, { forwarding = true } = {}) {
const n = program.length;
// start[i] = cycle in which instruction i enters IF
const start = new Array(n).fill(0);
// writeReadyCycle[reg] = cycle a result is usable by a dependent's EX
// WB completes at start+4; with forwarding EX output is usable at start+2
// (start+3 for a load, since its value is ready after MEM).
let cycles = 0;
for (let i = 0; i < n; i++) {
let earliest = i === 0 ? 1 : start[i - 1] + 1; // one IF per cycle, in order
for (const r of program[i].src) {
// find the most recent earlier writer of r
for (let j = i - 1; j >= 0; j--) {
if (program[j].dst === r) {
const prodEX = start[j] + 2; // EX stage of producer
let ready;
if (!forwarding) {
ready = start[j] + 4 + 1; // must wait for WB, read next cycle
} else if (program[j].isLoad) {
ready = prodEX + 1 + 1; // load value ready after MEM
} else {
ready = prodEX + 1; // ALU result forwarded to next EX
}
// consumer needs the value in its EX, which is start[i] + 2
earliest = Math.max(earliest, ready - 2); // push IF later if value is late
break; // nearest writer dominates
}
}
}
start[i] = earliest;
cycles = Math.max(cycles, start[i] + 4); // WB of last instr
}
const stalls = cycles - (n + 4); // ideal = n + (k-1) = n + 4
return { cycles, stalls, cpi: cycles / n, start };
}
const prog = [
{ op: 'LW', dst: 'r1', src: ['r0'], isLoad: true },
{ op: 'ADD', dst: 'r3', src: ['r1', 'r2'] }, // load-use: 1 bubble even with forwarding
{ op: 'SUB', dst: 'r4', src: ['r3', 'r1'] }, // RAW on r3, killed by forwarding
{ op: 'OR', dst: 'r5', src: ['r6', 'r7'] }, // independent
];
console.log(simulate(prog, { forwarding: true }));
// { cycles: 9, stalls: 1, cpi: 2.25, start: [1, 3, 4, 5] } // 1 load-use bubble
console.log(simulate(prog, { forwarding: false }).stalls); // many more
The key idea: each instruction's IF cycle is one past the previous instruction's (in-order issue), then pushed later if a source register isn't ready in time. With forwarding, an ALU result is usable two cycles after the producer's IF; a load's result is usable three cycles after, which is what creates the unavoidable load-use bubble.
Python implementation
The same model, plus a function that renders the space-time diagram so you can see the bubbles.
STAGES = ['IF', 'ID', 'EX', 'MEM', 'WB']
def simulate(program, forwarding=True):
n = len(program)
start = [0] * n
for i, ins in enumerate(program):
earliest = 1 if i == 0 else start[i - 1] + 1 # in-order, one IF/cycle
for r in ins['src']:
for j in range(i - 1, -1, -1): # nearest earlier writer
if program[j]['dst'] == r:
prod_ex = start[j] + 2 # producer's EX cycle
if not forwarding:
ready = start[j] + 4 + 1 # wait for WB
elif program[j].get('is_load'):
ready = prod_ex + 2 # value ready after MEM
else:
ready = prod_ex + 1 # forwarded ALU result
earliest = max(earliest, ready - 2) # consumer EX = start+2
break
start[i] = earliest
cycles = max(start[i] + 4 for i in range(n))
return {'cycles': cycles, 'stalls': cycles - (n + 4),
'cpi': cycles / n, 'start': start}
def diagram(program, start):
last = max(start[i] + 4 for i in range(len(program)))
print(' ' + ''.join(f'{c:<5}' for c in range(1, last + 1)))
for i, ins in enumerate(program):
row = [''] * last
for k, st in enumerate(STAGES):
row[start[i] - 1 + k] = st
print(f"{ins['op']:<5} " + ''.join(f'{cell:<5}' for cell in row))
prog = [
{'op': 'LW', 'dst': 'r1', 'src': ['r0'], 'is_load': True},
{'op': 'ADD', 'dst': 'r3', 'src': ['r1', 'r2']}, # load-use bubble
{'op': 'SUB', 'dst': 'r4', 'src': ['r3', 'r1']},
{'op': 'OR', 'dst': 'r5', 'src': ['r6', 'r7']},
]
res = simulate(prog, forwarding=True)
print(res) # one bubble between LW and ADD
diagram(prog, res['start'])
Run it and the diagram shows ADD's IF slipping one cycle to the right of where it would sit in a hazard-free pipeline — the visible cost of the load-use dependency.
Variants worth knowing
Superscalar pipelines. Widen the pipe so two-to-eight instructions can occupy the same stage in the same cycle, lifting peak IPC above 1. The Intel Pentium (1993) was the first mainstream superscalar x86, issuing up to two instructions per cycle.
Out-of-order execution. Add a reorder buffer and reservation stations so the hardware can execute a ready instruction even if an earlier one is stalled, then retire results in program order. Register renaming removes false (WAR/WAW) dependencies. This is how modern x86 and ARM cores hide cache-miss latency.
VLIW (Very Long Instruction Word). Push scheduling entirely onto the compiler — each instruction word bundles several operations the compiler proved independent. Intel's Itanium bet on this; it underperformed because compilers couldn't extract enough static parallelism. DSPs still use it.
Deep pipelining (superpipelining). Slice stages thinner to clock higher. The MIPS R4000 used 8 stages; the Pentium 4 pushed to 20–31. The trade-off is a brutal misprediction penalty and diminishing returns from latch overhead.
Branch delay slots. Early MIPS exposed the control hazard to the compiler: the instruction right after a branch always executes, so the compiler fills that slot with useful work. Clever for a 5-stage pipe, but it doesn't scale to deep/superscalar designs, so modern ISAs dropped it in favor of prediction.
Common pitfalls and misconceptions
- Thinking pipelining lowers latency. It doesn't. One instruction takes just as long (a hair longer, with latch overhead). Pipelining raises throughput, not single-instruction speed.
- Forwarding eliminates all data hazards. It eliminates ALU-result RAW stalls, but the load-use case still costs one bubble — a load result simply isn't ready early enough to forward.
- Assuming deeper is always faster. Past ~15–20 stages, latch latency and misprediction penalties dominate; the Pentium 4 is the canonical cautionary tale.
- Ignoring the fill/drain cost in short loops. The speedup formula's
k − 1fill term only amortizes over many instructions. A 5-stage pipe running a 3-instruction burst gets almost no benefit. - Forgetting WAR/WAW hazards in out-of-order designs. A simple in-order pipeline only sees RAW. Reorder instructions and you must also handle write-after-read and write-after-write — which is what register renaming exists to solve.
- Confusing CPI < 1 with pipelining. A plain scalar pipeline caps at CPI = 1. Getting below 1 (IPC > 1) requires superscalar issue, not just more pipeline stages.
Frequently asked questions
Does pipelining make a single instruction run faster?
No. The latency of one instruction is the same or slightly worse — it still passes through every stage, and pipeline registers add a small per-stage delay. What pipelining improves is throughput: with k stages you can have k instructions in flight at once, so in the steady state one instruction completes every cycle instead of every k cycles.
What is a pipeline hazard?
A hazard is a situation that prevents the next instruction from entering or advancing through the pipeline on its intended cycle. The three kinds are structural (two instructions need the same hardware), data (an instruction needs a result that hasn't been written back yet), and control (a branch makes the next address unknown until it resolves).
How is a data hazard usually fixed without stalling?
Forwarding (also called bypassing). Instead of waiting for a result to be written to the register file, the hardware routes it directly from the output of the EX or MEM stage back to the input of a later instruction's EX stage. Forwarding eliminates most RAW stalls; the one case it can't fix is a load followed immediately by a use, which still costs one bubble.
Why does a deeper pipeline not always run faster?
More stages let you raise the clock, but each pipeline register adds latch latency and clock skew, so per-stage useful work shrinks. Deeper pipelines also pay a bigger misprediction penalty — flushing 14 stages costs far more than flushing 5. The Pentium 4's 31-stage NetBurst pipeline famously hit a wall, and Intel retreated to shorter pipelines with the Core architecture.
What's the difference between pipelining and superscalar execution?
Pipelining overlaps the stages of consecutive instructions in one execution lane — its best-case throughput is one instruction per cycle (IPC = 1). Superscalar adds multiple parallel lanes, so several instructions can be in the same stage in the same cycle, pushing IPC above 1. Modern cores are both deeply pipelined and superscalar, and add out-of-order execution on top.
How many bubbles does a mispredicted branch cost?
Roughly the number of stages between fetch and where the branch resolves. In a classic 5-stage MIPS pipeline with the branch resolved in MEM, a mispredict flushes about 3 instructions. In a 14–20 stage modern core it can flush 15+ µops, which is why branch predictors are pushed to 95–99% accuracy — every missed branch wastes a pipeline's worth of work.