Computer Architecture

Out-of-Order Execution

The CPU doesn't run your program in the order you wrote it — and that's why it's fast

Out-of-order execution lets a CPU run independent instructions as soon as their inputs are ready instead of in program order, hiding cache-miss and divide stalls behind useful work while a reorder buffer retires results in order.

  • InventedTomasulo, IBM 360/91, 1967
  • Reorder buffer (modern)200–600 entries
  • Issue width (modern)4–8 instr/cycle
  • Memory-miss latency hidden~100–300 cycles
  • Retire orderstrictly in program order

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The core idea: program order is a lie the CPU tells you

You write code as an ordered list of instructions, and you assume the machine runs them one after another. It doesn't. A modern CPU treats your instruction stream as a dependency graph, not a queue. The only thing that genuinely forces instruction B to wait for instruction A is a real data dependency — B reads a value A produces. Everything else is just the accident of having written A before B.

Out-of-order execution exploits that. The processor fetches and decodes instructions in program order, but then it lets them execute in whatever order their operands become ready. Consider:

load  r1, [r9]      ; cache MISS — 200 cycles to memory
add   r2, r1, r3    ; depends on r1 — must wait
mul   r4, r5, r6    ; independent! can run NOW
xor   r7, r5, r8    ; independent! can run NOW

An in-order core stalls on the load and burns 200 cycles doing nothing, because the very next instruction needs r1. An out-of-order core sees that mul and xor don't touch r1 at all, runs them during the miss, and only the add actually waits. The stall is hidden behind useful work. That overlap — squeezing independent instructions into the shadow of a long-latency operation — is the entire point.

The mechanism: rename, dispatch, wake up, retire

The reference design is Robert Tomasulo's algorithm, built for the IBM System/360 Model 91's floating-point unit in 1967. Every out-of-order core since — from the Intel P6 (1995) through today's Apple and AMD designs — is a descendant. Four structures do the work:

  1. Register renaming. Each architectural register write is mapped to a fresh physical register. This is the single most important trick: it eliminates false dependencies (write-after-write, write-after-read) and leaves only true read-after-write dependencies, which are the real edges in the dependency graph.
  2. Reservation stations / the scheduler. A renamed instruction sits here waiting for its source operands. It holds tags pointing at the physical registers it needs.
  3. Wakeup and select. When a result is produced, its tag is broadcast on a result bus (the "common data bus" in Tomasulo's terms). Every waiting instruction whose operand matches that tag wakes up; the select logic picks which ready instructions issue this cycle, up to the issue width.
  4. The reorder buffer (ROB). Completed results are parked here, in program order. An instruction retires — commits its result to the architecturally visible register file and memory — only when it reaches the head of the ROB and every older instruction has already retired.

The crucial split: execution is out of order, retirement is strictly in order. That in-order retirement is what makes the whole thing safe. It gives you precise exceptions (a fault looks like it happened exactly at the faulting instruction), it lets you roll back mispredicted branches cleanly, and it means software never sees the reordering at all. The hardware reorders aggressively, then carefully pretends it didn't.

The performance ceiling is set by two numbers. The issue width W is how many instructions can start per cycle. The window size N (≈ ROB entries) is how far ahead the core can look for independent work. To sustain IPC (instructions per cycle) of W, you need roughly N ≥ W × (average latency you're trying to hide). Hiding a 200-cycle miss at 4-wide issue would need a 800-entry window in the limit — which is why real windows are large (Apple's M-series cores reportedly exceed 600 ROB entries) yet still rarely fully hide a miss to DRAM.

When out-of-order helps — and when it doesn't

  • Memory-bound code with mixed dependencies. Pointer chasing, hash lookups, and graph traversal stall constantly; an out-of-order core overlaps independent iterations and extracts memory-level parallelism — multiple outstanding misses at once.
  • Branchy, irregular code. Anything a compiler can't statically schedule (interpreters, JITs, business logic) benefits, because the hardware schedules dynamically at run time with information the compiler never had.
  • Variable-latency operations. Integer divide, floating-point sqrt, and cache hierarchy all have latencies that depend on data; static scheduling has to assume the worst case, dynamic scheduling reacts to the actual case.

It does not help when: the code is a pure dependency chain (each instruction needs the previous result — no independent work exists to fill the shadow); the working set fits in registers with no stalls to hide; or you are power- or area-constrained. A long dependent chain of a = a*a runs at the same speed in-order or out-of-order, because there is nothing else to do.

In-order vs out-of-order, and the alternatives

In-order scalarIn-order superscalarVLIWOut-of-order superscalarSMT (hyper-threading)
Who schedulesnobody (fixed)nobody (fixed)compiler, staticallyhardware, dynamicallyhardware + multiple threads
Hides cache miss?no — full stallno — full stallonly what compiler foresawyes — runs window aheadyes — switches to other thread
Reacts to runtime latencynonono (static assumptions)yesyes
Hardware costtinymoderatelow core, fat compilerhigh (ROB, rename, scheduler)added on top of OoO
Energy per instructionlowestlowlowhigh (wakeup/select scales ~N²)amortized over threads
Binary compatibilitystablestablebrittle — recompile per chipstablestable
Real-world exampleCortex-M, RISC-V rocketCortex-A53, Atom (early)Itanium, TI C6x DSPsIntel Core, Apple M, AMD ZenIntel HT, IBM POWER (8-way)

VLIW is the instructive failure. Itanium's bet was "let the compiler find the parallelism statically, keep the hardware simple." It lost because the compiler can't predict cache misses — a miss is a runtime event, and the schedule was baked in at compile time. Out-of-order won precisely because it moves scheduling to run time where the latencies are actually known.

What the numbers actually say

  • A last-level cache miss to DRAM costs ~100–300 cycles. At 3 GHz that is 30–100 nanoseconds — enough time to execute roughly 400–1200 instructions if you had independent work. Out-of-order's job is to find that work; in-order throws every one of those cycles away.
  • Real IPC gains are 1.5×–3× over a comparable in-order core on general-purpose code, not the theoretical issue width. A 6-wide core rarely sustains 6 IPC; sustained IPC of 2–4 is typical because of branch mispredicts, true dependencies, and window limits.
  • The scheduler wakeup/select logic scales roughly as O(N²) in window size N — every producer must potentially wake every consumer. This quadratic cost is the main reason windows can't grow without bound and why doubling the ROB does not double performance.
  • Out-of-order cores spend a large fraction of transistors and energy on bookkeeping. The rename tables, physical register file (often 180+ entries), ROB, and load/store queues are pure overhead from the program's point of view — none of it computes your result, it only reorders.
  • Branch misprediction penalty is ~15–20 cycles on a deep out-of-order pipeline, because the entire window of speculative work past the bad branch must be squashed. This is why branch prediction accuracy of 95%+ is mandatory for an out-of-order core to pay off.

A JavaScript model of the scheduler

You can capture the heart of out-of-order issue — wakeup, ready-set selection, and in-order retirement — in a compact simulator. Each instruction lists the registers it reads and the one it writes; the scheduler issues any instruction whose sources are all produced, respecting a fixed issue width, and retires in program order.

// Each instr: { id, dst, srcs:[...], latency }. "ready" = all srcs produced.
function runOoO(program, issueWidth = 2) {
  const producedAt = new Map();   // register -> cycle its value is available
  const completed  = new Set();   // instruction ids that have finished executing
  const inFlight   = [];          // { id, dst, doneAt }
  let cycle = 0, retired = 0;
  const trace = [];

  // An instruction can ISSUE when every source register is available this cycle.
  const sourcesReady = (insn) =>
    insn.srcs.every(r => producedAt.has(r) && producedAt.get(r) <= cycle);

  while (retired < program.length) {
    // 1. Complete anything whose latency has elapsed; publish its result (wakeup).
    for (let i = inFlight.length - 1; i >= 0; i--) {
      if (inFlight[i].doneAt <= cycle) {
        const f = inFlight[i];
        completed.add(f.id);
        if (f.dst) producedAt.set(f.dst, cycle); // result on the bus
        inFlight.splice(i, 1);
      }
    }

    // 2. SELECT: issue up to issueWidth ready, not-yet-issued instructions
    //    out of program order — pick the oldest ready ones.
    let issuedThisCycle = 0;
    for (const insn of program) {
      if (issuedThisCycle >= issueWidth) break;
      if (insn.issued || !sourcesReady(insn)) continue;
      insn.issued = true;
      inFlight.push({ id: insn.id, dst: insn.dst, doneAt: cycle + insn.latency });
      trace.push({ cycle, event: 'issue', id: insn.id });
      issuedThisCycle++;
    }

    // 3. RETIRE in strict program order: commit the head only when complete.
    while (retired < program.length && completed.has(program[retired].id)) {
      trace.push({ cycle, event: 'retire', id: program[retired].id });
      retired++;
    }
    cycle++;
    if (cycle > 10000) throw new Error('deadlock: a source is never produced');
  }
  return { cycles: cycle, trace };
}

// load r1 misses (latency 8); mul/xor are independent and slip into the shadow.
const prog = [
  { id: 'load r1', dst: 'r1', srcs: ['r9'],     latency: 8 },
  { id: 'add  r2', dst: 'r2', srcs: ['r1','r3'], latency: 1 }, // waits on r1
  { id: 'mul  r4', dst: 'r4', srcs: ['r5','r6'], latency: 3 }, // independent
  { id: 'xor  r7', dst: 'r7', srcs: ['r5','r8'], latency: 1 }, // independent
];
// Seed the inputs that already exist in registers at cycle 0.
['r3','r5','r6','r8','r9'].forEach(/* see note below */ () => {});

One subtlety the toy above glosses over: real hardware needs the source registers that were "already there" at the start to count as ready. In a full model you seed producedAt with the initial architectural registers at cycle -∞. The important behaviors are all present: mul and xor issue while load is still in flight, but they cannot retire until load and add ahead of them have committed — out of order in execution, in order at retirement.

A Python model with register renaming

The JavaScript version skips renaming, so it can't show why write-after-write hazards disappear. Here is a Python sketch that renames each destination to a fresh physical register before scheduling, which is what makes false dependencies vanish.

from dataclasses import dataclass, field
from itertools import count

@dataclass
class Insn:
    name: str
    dst: str | None          # architectural destination register
    srcs: list[str]          # architectural source registers
    latency: int
    # filled in by renaming:
    phys_dst: int | None = None
    phys_srcs: list[int] = field(default_factory=list)

def rename(program):
    """Map each architectural write to a fresh physical register.
    This is what removes WAW / WAR false dependencies."""
    rat = {}                 # register alias table: arch reg -> physical reg
    pool = count()           # infinite supply of physical registers
    # Architectural inputs that exist before the program runs:
    for r in ('r3', 'r5', 'r6', 'r8', 'r9'):
        rat[r] = next(pool)
    for insn in program:
        insn.phys_srcs = [rat[s] for s in insn.srcs]   # read current mapping
        if insn.dst is not None:
            insn.phys_dst = next(pool)                  # fresh reg for the write
            rat[insn.dst] = insn.phys_dst               # later readers see this
    return program

def run_ooo(program, issue_width=2):
    program = rename(program)
    # Only the architectural inputs (phys regs no instruction writes) are ready
    # at cycle 0; a phys reg produced in-program isn't ready until it completes.
    produced = {insn.phys_dst for insn in program if insn.phys_dst is not None}
    ready_at = {p: 0 for insn in program for p in insn.phys_srcs
                if p not in produced}                            # inputs ready @0
    completed, in_flight = set(), []
    cycle = retired = issued_flags = 0
    issued = [False] * len(program)

    while retired < len(program):
        # complete + publish (wakeup)
        for f in [x for x in in_flight if x['done'] <= cycle]:
            completed.add(f['idx'])
            if f['phys_dst'] is not None:
                ready_at[f['phys_dst']] = cycle
            in_flight.remove(f)
        # select: oldest ready instructions, up to issue width
        n = 0
        for i, insn in enumerate(program):
            if n >= issue_width:
                break
            if issued[i]:
                continue
            if all(ready_at.get(p, 1 << 30) <= cycle for p in insn.phys_srcs):
                issued[i] = True
                in_flight.append({'idx': i, 'phys_dst': insn.phys_dst,
                                  'done': cycle + insn.latency})
                n += 1
        # retire in program order
        while retired < len(program) and retired in completed:
            retired += 1
        cycle += 1
    return cycle

prog = [
    Insn('load r1', 'r1', ['r9'],       8),
    Insn('add  r2', 'r2', ['r1', 'r3'], 1),
    Insn('mul  r4', 'r4', ['r5', 'r6'], 3),
    Insn('xor  r7', 'r7', ['r5', 'r8'], 1),
]
print(run_ooo(prog))   # far fewer cycles than 8 + 1 + 3 + 1 done in order

The payoff of renaming shows when you add a fifth instruction that reuses r4, e.g. mov r4, r8. Without renaming that write looks like it must wait for the earlier mul r4 (a write-after-write hazard). With renaming each gets its own physical register, so the mov can run immediately — the false dependency is gone.

Variants and the broader family

Scoreboarding (CDC 6600, 1964). The ancestor. It tracks per-register and per-unit status to issue out of order, but it has no register renaming, so it stalls on write-after-write and write-after-read hazards. Tomasulo's renaming-by-tag is the upgrade that fixed this three years later.

Tomasulo with a reorder buffer. Plain Tomasulo retires as results complete, which gives imprecise exceptions. Adding the ROB (Smith and Pleszkun, 1985) layers in-order retirement on top, delivering precise exceptions and clean branch recovery. This is the design everyone ships now.

Physical register file vs. ROB-as-storage. Two ways to hold renamed values. The Intel P6/Pentium III stored results in the ROB itself; modern cores (Pentium 4 onward, AMD Zen, Apple) use a separate large physical register file and keep only pointers in the ROB. The PRF approach scales better because you don't copy data on retirement.

Memory disambiguation and the load/store queue. Loads and stores can't be freely reordered — a load might alias an earlier store. The load/store queue checks addresses; aggressive cores speculatively let a load go ahead of an unresolved store and squash it if an alias is later detected. This is out-of-order execution applied to memory, and it's where many subtle bugs and side channels live.

Simultaneous multithreading (SMT). Layered on top of an out-of-order core: when one thread's window is full of stalls, the scheduler issues from a second thread's instructions instead, filling the execution units. It reuses the exact same wakeup/select machinery.

Common misconceptions and hardware edge cases

  • "Out-of-order means results come back scrambled." No — retirement is in order. Architectural state always updates in program order; only the internal execution timeline is reordered.
  • Confusing out-of-order with speculative. They're orthogonal but co-travel. Speculative execution runs past unresolved branches; out-of-order runs ready instructions early. A core can do one without the other in principle, but in practice deep out-of-order needs speculation to keep the window full.
  • Assuming a wider issue automatically helps. Issue width without a matching window and enough independent work just leaves slots empty. The bottleneck is usually true dependencies and branch mispredicts, not issue ports.
  • Memory ordering surprises. Because loads and stores reorder, multi-threaded code can observe writes in an order the source never specified. This is why you need memory barriers and a defined memory consistency model — the reordering that speeds up one core leaks into what other cores see.
  • Forgetting the rollback gap. On a branch mispredict or exception, registers and the ROB are rolled back — but micro-architectural state like the cache is not erased. That residue is exactly what Spectre-class attacks read out.
  • Thinking the compiler is now irrelevant. It still matters enormously: good instruction selection, keeping the dependency chains short, and avoiding spills to memory all change how much parallelism the hardware can find. Out-of-order helps most when the compiler hands it independent work to discover.

Frequently asked questions

If instructions finish out of order, how does the program still produce correct results?

Instructions execute out of order but retire (commit their results to architectural state) in strict program order. A reorder buffer holds completed results until every older instruction has retired, so to software the machine looks perfectly sequential. Out-of-order is execution; in-order is the visible result.

What problem does register renaming actually solve?

It removes false dependencies. When two instructions reuse the same architectural register without a real data flow between them (write-after-write or write-after-read hazards), they appear dependent even though they aren't. Renaming maps each write to a fresh physical register, so the only dependencies left are true read-after-write ones — the ones that genuinely constrain order.

Why does a cache miss benefit so much from out-of-order execution?

A miss to main memory costs roughly 100–300 cycles. An in-order core stalls for all of them. An out-of-order core keeps issuing independent instructions from its window during the miss — with a ~200-entry reorder buffer it can have dozens of instructions in flight, overlapping the latency with useful work. This is called memory-level parallelism.

What is the difference between the issue width and the reorder buffer size?

Issue width is how many instructions can start executing per cycle (4–8 on modern cores). Reorder buffer size is how far ahead the core can look for independent work — the instruction window. A wide issue with a tiny window starves; a deep window with narrow issue can't drain fast enough. They scale together.

Is out-of-order execution related to the Spectre and Meltdown vulnerabilities?

Indirectly. Those attacks exploit speculative execution — running instructions past an unresolved branch and rolling them back if wrong. Out-of-order machines are aggressively speculative because they need a deep window to find independent work, so the speculative side effects (cache footprints) that Spectre/Meltdown read out are a consequence of the same hardware. The rollback erases registers but not cache state.

Why don't simple cores like microcontrollers use out-of-order execution?

It is expensive in area and power. Reorder buffers, reservation stations, a large physical register file, and the wakeup/select logic can be a large fraction of the core's transistors and energy. For low-power or real-time parts where predictable timing matters more than peak throughput, a simple in-order pipeline wins. The ARM Cortex-A53 and most microcontrollers stay in-order for exactly this reason.