Theory

Compilation Pipeline

Source code through 8 stages — text in, executable out

A compiler turns source code into a runnable executable through a pipeline of transformations — lexer → parser → AST → semantic analysis → IR → optimizer → codegen → assembler → linker. Each stage transforms one representation into a more machine-friendly one. Understanding the pipeline is what makes you good at debugging compiler errors, writing fast code, and reading assembly output.

Stages (typical)Lex, parse, AST build, semantic check, IR gen, optimize, codegen, assemble, link
Compile time vs run timeCompilation is one-shot; runtime is forever
AST size~10× source LOC
IR examplesLLVM IR, GIMPLE/RTL (GCC), bytecode (JVM, .NET)
Common optimizationsConstant folding, dead code, inlining, loop unrolling, SSA-based passes
Used inGCC, Clang/LLVM, MSVC, Rust, Go, Swift, JVM, V8

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Watch on YouTube

The pipeline, stage by stage

Stage	Input	Output	Job
1. Lexer (scanner, tokenizer)	Source text	Token stream	Group characters into tokens (identifiers, keywords, operators, literals)
2. Parser	Token stream	Concrete syntax tree	Apply grammar rules; check syntactic structure
3. AST builder	Concrete tree	Abstract Syntax Tree	Drop irrelevant syntax (parens, semicolons); keep meaning
4. Semantic analysis	AST	Annotated AST	Type checking, name resolution, scope analysis
5. IR generation	Annotated AST	IR (e.g., LLVM IR)	Lower high-level constructs to low-level operations
6. Optimization	IR	Optimized IR	Constant folding, dead code, inlining, loop transforms, etc.
7. Code generation	Optimized IR	Assembly	Translate IR to target machine instructions
8. Assembler	Assembly	Object file (machine code + relocations)	Encode instructions to bytes
9. Linker	Object files + libraries	Executable	Resolve symbols, finalize addresses

Modern compilers blur stages — frontends produce IR directly without an explicit assembly intermediate; some optimizations work on AST instead of IR; LTO (link-time optimization) merges files for cross-module optimization. The 9-stage breakdown is conceptual.

Lexing and parsing

Take the source x = a + 42;. The lexer produces:

IDENT("x")  EQUALS  IDENT("a")  PLUS  INT(42)  SEMI

The parser consumes the token stream and builds a tree per the language grammar:

Assign
├── Target: Identifier("x")
└── Value:
    └── BinaryOp("+")
        ├── Left:  Identifier("a")
        └── Right: IntLiteral(42)

Lexers are typically finite state machines (regex-based). Parsers are often LL(k) or LR(1) for simple grammars; recursive descent for more complex languages. Modern parser generators (ANTLR, tree-sitter) handle ambiguous grammars and produce useful error messages.

From AST to IR

The AST is high-level — it has functions, classes, control structures. IR is low-level — basic blocks of instructions, branches, register-like SSA values. The "lowering" step does this translation.

For the C statement if (x > 0) y = 1; else y = -1;:

; LLVM IR (simplified)
entry:
  %cmp = icmp sgt i32 %x, 0
  br i1 %cmp, label %if.then, label %if.else

if.then:
  store i32 1, i32* %y
  br label %if.end

if.else:
  store i32 -1, i32* %y
  br label %if.end

if.end:
  ret void

Each function becomes a control-flow graph of basic blocks. Each basic block is a sequence of instructions that executes top-to-bottom with no branches in the middle. Optimizations operate on this CFG.

Common compiler optimizations

Optimization	Effect	Example
Constant folding	Compute constant expressions at compile time	`3 * 4` → `12`
Constant propagation	Replace uses of known constants	`x = 5; y = x + 1;` → `y = 6;`
Dead code elimination	Remove instructions whose results aren't used	`x = expensive(); /* never used */` deleted
Common subexpression elimination	Compute repeated expressions once	`(a+b)c + (a+b)d` → `t=a+b; tc + td`
Inlining	Replace function call with body	Eliminates call overhead; enables further optimization
Loop unrolling	Repeat loop body to reduce iteration overhead	`for(i=0;i<4;i++) a[i]=0;` → 4 stores
Loop invariant code motion	Move invariant computations out of loops	`while(...){ x=ab; ... }` → `x=ab; while(...){ ... }`
Strength reduction	Replace expensive ops with cheaper	`x*2` → `x<<1`
Tail call optimization	Reuse stack frame for tail calls	Recursion uses O(1) stack instead of O(depth)

Production compilers (LLVM, GCC) ship hundreds of optimization passes. -O2 and -O3 are bundles of these passes; the tradeoff is compile time vs runtime speed.

Static compilation vs JIT

	Static compilation (AOT)	Just-in-time (JIT)
When code is compiled	Once, before execution	During execution, on hot paths
Optimizations	Profile-guided possible (rarely used in practice)	Profile-guided automatic (real runtime data)
Startup speed	Fast (binary loads, runs)	Slow (warm-up while JIT compiles)
Steady-state speed	Limited by static analysis	Often faster (specializes on actual types)
Memory	Smaller — no compiler in process	Larger — JIT compiler resident
Examples	C/C++/Rust/Go executables	JVM, V8, Python (PyPy), .NET

Modern languages mix approaches — V8 first interprets JS, then JIT-compiles hot functions, then re-optimizes them as more profiling data arrives. Java's HotSpot has tiered compilation (interpreter → C1 → C2 → optimized C2). Hybrid is the norm for dynamic languages.

When understanding the pipeline matters

Debugging compiler errors. Each stage has distinct error messages; recognizing "lexer error" vs "parse error" vs "type error" tells you what part of your code to inspect.
Reading assembly to optimize hot code. gcc -S or godbolt.org shows the assembly your code compiles to. Spotting unnecessary memory loads, inefficient branches, missed vectorization — all require reading the codegen output.
Writing language tooling. Linters, formatters, refactoring tools all operate on AST. Understanding the AST shape of your language is a prerequisite.
Knowing which optimizations are automatic. Don't manually inline a function the compiler already inlines. Don't manually constant-fold; the compiler does it. Focus optimization effort on what compilers can't do (algorithmic improvements, data layout).
Cross-compilation and target tuning. Same source, different backends — x86 vs ARM, server vs embedded. The IR layer is what makes this practical.

Common compilation pitfalls

Thinking -O3 is always faster. Aggressive optimization can be slower in cache-pressured workloads (more inlining → larger code → more I-cache misses). Profile both.
Undefined behavior optimizations. C/C++ compilers exploit UB to skip checks ("this can't happen, so I'll assume it doesn't"). Signed integer overflow, null dereference, strict aliasing violations all get aggressively optimized. Use UB sanitizers (UBSan) to catch.
Forgetting that JIT warmup matters. JVM benchmarks must run for 30+ seconds before measurements; first iterations are interpreter-paced. Microbenchmark frameworks (JMH) handle this; ad-hoc measurements often don't.
Ignoring linker errors as "weird messages." "Undefined symbol" means a function/variable was declared but never defined — fix the linkage, not the calling code.
Compile times exploding from template instantiation. C++ templates and Rust generics produce one specialization per type combination. Heavy template use can blow compile times to minutes; LTO/PGO compounds this.

Frequently asked questions

What's the difference between an interpreter and a compiler?

A compiler translates source to machine code (or another lower-level representation) once; the result is run many times. An interpreter executes source directly, parsing and evaluating each statement on the fly. Modern systems blur the line — JVM and V8 compile bytecode just-in-time during execution, achieving near-compiled speeds. Pure interpreters are mostly historical (early BASIC, Tcl); modern "interpreters" are usually JIT compilers.

What's an AST?

Abstract Syntax Tree. The compiler's structured representation of the source code after parsing. Nodes are language constructs (function definitions, expressions, statements); children are sub-expressions. Easier to manipulate than raw source text — semantic analysis, type checking, and most optimizations work on the AST or downstream IR.

What's IR (intermediate representation) and why is it needed?

A platform-independent middle layer between the AST and machine code. The compiler does most optimizations on IR — independent of source language and target architecture. LLVM IR is the most famous; it lets dozens of frontends (Clang, Rust, Swift) share dozens of backends (x86, ARM, RISC-V) with optimizations written once.

What's SSA form?

Static Single Assignment — every variable is assigned exactly once. <code>x = 1; x = 2;</code> becomes <code>x_1 = 1; x_2 = 2;</code> Phi functions handle merge points (control-flow joins). SSA simplifies many optimizations because each variable's definition is unambiguous. Used by LLVM, GCC's GIMPLE, V8's TurboFan, and most modern optimizing compilers.

What does a linker do?

Combines object files (compiled translation units) plus libraries into a single executable. Resolves symbol references (function calls, global variables) across files. Performs relocations (adjusts addresses to final positions). Static linking copies all dependencies into the executable; dynamic linking leaves them as references resolved at load/run time. The "undefined symbol" error you see is the linker failing this step.

What's JIT compilation?

Just-In-Time. The compiler runs at execution time, compiling hot code paths from bytecode/source to native machine code. Lets the runtime exploit profile information (which branches are common, which functions are hot, which objects are real types). V8 (Chrome's JavaScript engine), HotSpot (Java), and PyPy (Python) all use JITs. Trade-off — startup is slower because compilation happens on first execution; steady-state is fast.

Why are compiler errors so cryptic?

Errors are reported by stage. Lexer errors are "unexpected character"; parser errors are "expected ; got }"; semantic errors are "undefined variable foo." Each stage has its own error vocabulary. A confusing error often means the compiler recovered incorrectly from an earlier mistake — the actual bug is several lines before where the error is reported. Pattern recognition over time helps; modern compilers (Rust, Clang) put significant effort into clearer messages.