Compilers

Bytecode Virtual Machine

A fake CPU made of software — the loop behind Python and the JVM

A bytecode virtual machine executes a compact stream of compiled instructions on a software stack machine — the dispatch loop that runs Python, the JVM, and the CLR by pushing operands, popping them into operations, and pushing results back.

  • ArchitectureStack or register
  • Per-instruction cost~10–30 host cycles
  • Interpreter overhead vs native~10–50×
  • Dispatch speedup (threading)20–50%
  • Opcode size1 byte

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a bytecode VM runs your program

Your source code never runs directly. A compiler first lowers it to bytecode — a flat array of one-byte opcodes, each optionally followed by operand bytes. x = a + b * 2 in Python compiles to roughly: load a, load b, load constant 2, multiply, add, store x. Six instructions, no variable names left, no parentheses — just a recipe a machine can follow blindly.

The virtual machine is the program that follows that recipe. It is not hardware; it is a loop in C (CPython), C++ (the HotSpot JVM), or Rust (many newer VMs). It keeps three things in ordinary local variables:

  • The program counter (PC / instruction pointer) — an index into the bytecode array marking the next opcode to fetch.
  • The operand stack — a small array where intermediate values live. Almost every instruction works by popping its inputs off this stack and pushing its result back.
  • The frame — local variables, the constant pool, and a return address, bundled per function call.

The whole engine is a three-step cycle repeated millions of times per second, called fetch-decode-execute:

  1. Fetch the opcode at code[PC] and advance the PC.
  2. Decode — branch to the handler for that opcode (a switch case or a jump-table entry).
  3. Execute — run the handler, which mutates the stack, the locals, or the PC (for jumps).

That loop is the heart of every interpreted language runtime on the planet. The reason a .pyc or .class file runs unmodified on Windows, macOS, and Linux is that the fictional CPU it targets is identical everywhere — only the VM that simulates it gets recompiled per platform.

Stack machine vs register machine

The defining design choice is where operands live. A stack machine keeps them on the operand stack, so most instructions carry no operand bytes at all — IADD just means "pop two, add, push." A register machine names its operands explicitly: ADD r3, r1, r2.

Compute (a + b) on each:

Stack VM (JVM, CPython, .NET CLR):     Register VM (Dalvik, Lua 5, Wasm-ish):

  LOAD a      ; stack: [a]              ADD r2, r0, r1   ; r2 = r0 + r1
  LOAD b      ; stack: [a, b]
  ADD         ; stack: [a+b]
  STORE x     ; stack: []

The stack version is four single-byte instructions; the register version is one wider instruction. Stack code is denser per operation but executes more instructions, and each instruction means another trip through the slow dispatch loop. Register code does more per instruction — fewer dispatches — but needs a register allocator in the compiler and bigger opcodes. The 2008 Dalvik VM measured roughly 30% fewer executed instructions than the equivalent stack JVM bytecode for the same Java source, at the cost of larger instructions (its register opcodes carry explicit operand fields).

When a bytecode VM is the right tool

  • You need portability. Ship one artifact, run it on any OS/CPU that has the VM. This is the entire premise of Java and .NET.
  • You want fast startup. The interpreter executes bytecode the instant it loads — no native compile step, so cold scripts and short-lived processes start in milliseconds.
  • You want a small, debuggable implementation. A stack VM is a few hundred lines. You can single-step opcodes, inspect the stack, and reason about semantics far more easily than with hand-written native codegen.
  • You need sandboxing. Because every instruction passes through your loop, you can meter execution, enforce memory limits, and verify safety (the JVM verifier, WebAssembly's type checks) before anything runs.

Reach for something else when raw throughput dominates: a tree-walking interpreter is simpler if you only run tiny scripts; an ahead-of-time native compiler (Go, Rust, C) wins when you control the deployment target and want maximum speed with no warm-up; a JIT on top of the VM is the answer when you need both portability and native-level hot-loop performance.

Bytecode VM vs other execution models

Tree-walk interpreterBytecode VM (stack)Bytecode VM (register)JIT compilerAOT native
Startup latencyInstantInstantInstantWarm-up neededInstant
Steady-state speedSlowest (~100× native)~10–50× native~8–30× native~1–3× native1× (native)
PortabilityHighHigh (one VM port)HighHighNone (per-target)
Compiler complexityNoneLowMedium (reg alloc)HighHigh
Code densityAST in memoryCompact (1-byte ops)Larger opsNative (large)Native (large)
Memory safety / sandboxEasyEasy (verifiable)EasyHarderManual
Real-world exampleEarly Ruby (MRI 1.8), bashCPython, JVM, CLRDalvik (Android), Lua 5HotSpot C2, V8, PyPyGo, Rust, C/C++

The headline tradeoff is the same axis every time: portability and simplicity at the top, raw speed at the bottom. Production runtimes don't pick one — the JVM and modern CPython start in the interpreter for fast startup, then JIT the hot methods to native code, getting both.

What the numbers actually say

  • Interpreter overhead is roughly 10–50× over native for the same algorithm. CPython famously runs a pure-Python numeric loop on the order of 30–50× slower than the equivalent C; that gap is almost entirely the dispatch loop, not the arithmetic.
  • Each bytecode costs about 10–30 host CPU cycles. A native add is one cycle; the VM's BINARY_ADD spends the rest on fetch, decode, the indirect branch, and type checks on the popped operands.
  • The single biggest interpreter cost is branch misprediction on dispatch. A monolithic switch compiles to one indirect jump the CPU can't predict, so it stalls 10–20 cycles per instruction. Computed-goto "threaded" dispatch gives each opcode handler its own jump site; the original GCC computed-goto patch to CPython reported a ~15–20% speedup, and tight numeric loops can see more.
  • Register VMs execute fewer instructions. Translating stack JVM bytecode to Dalvik's register form cut executed instruction counts by about 30% for the same program, which is why Android's early VM chose registers despite the bigger compiler.
  • Bytecode is tiny. A Python .pyc or Java .class is typically a few KB — orders of magnitude smaller than the AST it came from or the native code it could become — which is why it ships well over a network.

A working stack VM in JavaScript

Here is a complete bytecode VM small enough to read in one sitting. It compiles nothing — we hand-write the bytecode — but it has a real opcode set, operand stack, locals, and a jump for control flow. It computes (3 + 4) * 5 and prints the result.

// Opcodes
const OP = { PUSH:0, ADD:1, MUL:2, STORE:3, LOAD:4, PRINT:5, JMP_IF_LT:6, HALT:7 };

function run(code, constants) {
  const stack = [];
  const locals = [];
  let pc = 0;

  while (true) {
    const op = code[pc++];               // FETCH
    switch (op) {                        // DECODE + EXECUTE
      case OP.PUSH:  stack.push(constants[code[pc++]]); break;
      case OP.ADD:   { const b = stack.pop(), a = stack.pop(); stack.push(a + b); break; }
      case OP.MUL:   { const b = stack.pop(), a = stack.pop(); stack.push(a * b); break; }
      case OP.STORE: locals[code[pc++]] = stack.pop(); break;
      case OP.LOAD:  stack.push(locals[code[pc++]]); break;
      case OP.PRINT: console.log(stack[stack.length - 1]); break;
      case OP.JMP_IF_LT: {              // pop b,a; if a < b jump to target
        const target = code[pc++];
        const b = stack.pop(), a = stack.pop();
        if (a < b) pc = target;
        break;
      }
      case OP.HALT:  return stack.pop();
      default: throw new Error(`bad opcode ${op} at ${pc - 1}`);
    }
  }
}

// (3 + 4) * 5
const consts = [3, 4, 5];
const program = [
  OP.PUSH, 0,   // 3
  OP.PUSH, 1,   // 4
  OP.ADD,       // 7
  OP.PUSH, 2,   // 5
  OP.MUL,       // 35
  OP.PRINT,
  OP.HALT,
];
run(program, consts);   // logs 35, returns 35

Three details carry the whole design. First, pc++ after each fetch is the entire control-flow story — a jump is just an assignment to pc. Second, every arithmetic opcode follows the same pop-pop-push shape, which is why the compiler that produces this bytecode is trivial. Third, the switch is the dispatch loop, and it is exactly the part a serious VM replaces with a computed-goto jump table for speed.

The same VM in Python

Python's own runtime is a bytecode VM, so this is a VM written in bytecode-VM-executed code — turtles all the way down. The dis module lets you peek at CPython's real bytecode for any function.

PUSH, ADD, MUL, STORE, LOAD, PRINT, JMP_IF_LT, HALT = range(8)

def run(code, consts):
    stack, locals_, pc = [], {}, 0
    while True:
        op = code[pc]; pc += 1                  # FETCH
        if op == PUSH:
            stack.append(consts[code[pc]]); pc += 1
        elif op == ADD:
            b = stack.pop(); a = stack.pop(); stack.append(a + b)
        elif op == MUL:
            b = stack.pop(); a = stack.pop(); stack.append(a * b)
        elif op == STORE:
            locals_[code[pc]] = stack.pop(); pc += 1
        elif op == LOAD:
            stack.append(locals_[code[pc]]); pc += 1
        elif op == PRINT:
            print(stack[-1])
        elif op == JMP_IF_LT:
            target = code[pc]; pc += 1
            b = stack.pop(); a = stack.pop()
            if a < b: pc = target
        elif op == HALT:
            return stack.pop()
        else:
            raise RuntimeError(f"bad opcode {op} at {pc-1}")

consts = [3, 4, 5]
program = [PUSH,0, PUSH,1, ADD, PUSH,2, MUL, PRINT, HALT]
run(program, consts)        # prints 35

# See the real thing — CPython's bytecode for a tiny function:
import dis
dis.dis(lambda a, b: (a + b) * 5)
# LOAD_FAST a / LOAD_FAST b / BINARY_OP + / LOAD_CONST 5 / BINARY_OP * / RETURN_VALUE

Run the dis.dis line and you see CPython's actual stack-machine opcodes — LOAD_FAST, BINARY_OP, RETURN_VALUE — the same pop-pop-push shape as the toy above, just with a far larger opcode set and reference-counted objects on the stack instead of raw numbers.

Faster dispatch: switch vs threading

The switch in the loops above is the performance bottleneck. The compiler emits a single indirect jump at the bottom of the loop, and the CPU's branch predictor can't guess which opcode comes next, so it stalls on nearly every instruction. Threaded dispatch fixes this by giving each opcode its own copy of the "jump to the next handler" code, so the predictor learns per-opcode patterns (e.g. a comparison is usually followed by a branch).

/* Computed-goto threading (GCC/Clang extension) — the CPython technique */
static void *table[] = { &&do_push, &&do_add, &&do_mul, &&do_halt };
#define DISPATCH() goto *table[code[pc++]]

    DISPATCH();
do_push: stack[++sp] = consts[code[pc++]]; DISPATCH();
do_add:  { int b = stack[sp--], a = stack[sp--]; stack[++sp] = a + b; } DISPATCH();
do_mul:  { int b = stack[sp--], a = stack[sp--]; stack[++sp] = a * b; } DISPATCH();
do_halt: return stack[sp];

Each handler ends with its own DISPATCH(), so there are N branch sites instead of one. This is a pure micro-optimization — same semantics, same opcodes — but it is where 20–50% of an interpreter's speedup comes from in practice. The next rung up the ladder is a JIT, which removes the dispatch loop entirely for hot code.

Variants worth knowing

Register-based VMs. Dalvik (pre-ART Android), Lua 5, and the Parrot VM keep operands in virtual registers instead of a stack. Fewer, fatter instructions mean fewer dispatches — measurably faster — but the compiler must do register allocation.

Direct-threaded and subroutine-threaded code. Forth-style VMs store actual handler addresses (or call instructions) in the bytecode itself, so "dispatch" is just following pointers. Maximally fast interpretation, minimally portable bytecode.

Verified bytecode. The JVM and WebAssembly run a verifier at load time that proves stack depths and types are consistent on every path. Once verified, handlers can drop bounds and type checks entirely — safety moves from runtime to load time.

Superinstructions and quickening. The VM fuses common opcode pairs into one (e.g. LOAD_FAST; LOAD_FASTLOAD_FAST_LOAD_FAST), or rewrites a generic opcode into a specialized one after seeing the runtime types. CPython 3.11+'s "adaptive specializing interpreter" does exactly this, inlining type-specialized handlers on the fly.

WebAssembly. A modern stack-machine bytecode designed for the browser. Same fetch-decode-execute lineage, but built for streaming compilation to native code and provable memory safety.

Common bugs and edge cases

  • Pop order on the stack. For non-commutative operations the second pop is the left operand: b = pop(); a = pop(); push(a - b). Swapping them silently computes b - a — a classic, hard-to-spot subtraction/division bug.
  • Advancing the PC past operands. Multi-byte instructions must consume all their operand bytes. Forget one and the next fetch reads an operand byte as an opcode, corrupting everything downstream.
  • Stack overflow and underflow. Unbounded recursion or deeply nested expressions blow the operand or call stack. Verified VMs prove this can't happen; unverified ones must cap the stack and raise an error (Python's RecursionError, the JVM's StackOverflowError) rather than scribble past the array.
  • Jump targets into the middle of an instruction. A jump must land on an opcode boundary. Landing on an operand byte executes garbage; this is both a correctness bug and a security hole, which is why the JVM verifier checks every branch target.
  • Forgetting reference counting / GC roots. When the stack holds heap objects (real languages, not the toy), values popped and pushed must be tracked by the garbage collector. A value live only on the operand stack is still reachable — miss it and the GC frees memory still in use.
  • Assuming the interpreter is fast. A tight pure-bytecode loop is 10–50× slower than native. If a hot loop matters, push it into a native library (NumPy, JNI) or let the JIT take over — don't micro-optimize the bytecode.

Frequently asked questions

What is the difference between a bytecode VM and a real CPU?

A real CPU decodes and executes machine code in silicon. A bytecode VM is a program that simulates a fictional CPU: it reads instructions from a byte array, keeps the program counter and operand stack in ordinary variables, and runs each opcode as a chunk of host-language code. The bytecode is portable because the simulated CPU is identical on every platform, while the host program is recompiled per platform.

Why do most bytecode VMs use a stack instead of registers?

A stack machine needs no operand fields in most instructions — ADD just pops two values and pushes one — so the bytecode is compact and the compiler is trivial to write. Register machines like the Dalvik VM and Lua 5 use fewer, larger instructions and run faster per operation, but the compiler must allocate virtual registers. Stack VMs trade raw speed for simplicity and density.

What makes the dispatch loop slow, and how is it sped up?

A naive switch-based dispatch loop branch-mispredicts on almost every instruction because the CPU cannot guess which case comes next, costing 10–20 cycles per opcode. Computed-goto threading (a jump table indexed by opcode, with the dispatch baked into the end of each handler) gives the branch predictor a separate site per opcode and can cut interpreter overhead by 20–50%.

How does a bytecode VM relate to a JIT compiler?

The interpreter runs every bytecode the moment it loads, with no warm-up. A JIT watches which methods run hot, then compiles those to native machine code so they skip the dispatch loop entirely. The JVM and modern Python (3.13+) keep the bytecode interpreter as the baseline tier and JIT only the hot paths — interpreted code stays available for cold code and deoptimization.

Why is bytecode portable but machine code is not?

Bytecode targets a virtual instruction set that the VM defines, not a physical chip. "Write once, run anywhere" works because only the VM is ported to each OS and CPU; the .class or .pyc file is byte-for-byte identical everywhere. Native machine code is locked to one instruction set (x86-64, ARM64) and one calling convention, so it must be recompiled per target.

What happens when the operand stack underflows or overflows?

In a verified VM like the JVM it never happens at runtime: the bytecode verifier proves at load time that every instruction's stack depth is consistent, so handlers can skip bounds checks. In an unverified VM, a malformed program can pop an empty stack or push past its limit, which is a memory-safety bug — production VMs cap the stack and raise an error (Python's RecursionError, the JVM's StackOverflowError) rather than corrupting memory.