Theory
Compilation Pipeline
Source code through 8 stages — text in, executable out
A compiler turns source code into a runnable executable through a pipeline of transformations — lexer → parser → AST → semantic analysis → IR → optimizer → codegen → assembler → linker. Each stage transforms one representation into a more machine-friendly one. Understanding the pipeline is what makes you good at debugging compiler errors, writing fast code, and reading assembly output.
- Stages (typical)Lex, parse, AST build, semantic check, IR gen, optimize, codegen, assemble, link
- Compile time vs run timeCompilation is one-shot; runtime is forever
- AST size~10× source LOC
- IR examplesLLVM IR, GIMPLE/RTL (GCC), bytecode (JVM, .NET)
- Common optimizationsConstant folding, dead code, inlining, loop unrolling, SSA-based passes
- Used inGCC, Clang/LLVM, MSVC, Rust, Go, Swift, JVM, V8
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The pipeline, stage by stage
| Stage | Input | Output | Job |
|---|---|---|---|
| 1. Lexer (scanner, tokenizer) | Source text | Token stream | Group characters into tokens (identifiers, keywords, operators, literals) |
| 2. Parser | Token stream | Concrete syntax tree | Apply grammar rules; check syntactic structure |
| 3. AST builder | Concrete tree | Abstract Syntax Tree | Drop irrelevant syntax (parens, semicolons); keep meaning |
| 4. Semantic analysis | AST | Annotated AST | Type checking, name resolution, scope analysis |
| 5. IR generation | Annotated AST | IR (e.g., LLVM IR) | Lower high-level constructs to low-level operations |
| 6. Optimization | IR | Optimized IR | Constant folding, dead code, inlining, loop transforms, etc. |
| 7. Code generation | Optimized IR | Assembly | Translate IR to target machine instructions |
| 8. Assembler | Assembly | Object file (machine code + relocations) | Encode instructions to bytes |
| 9. Linker | Object files + libraries | Executable | Resolve symbols, finalize addresses |
Modern compilers blur stages — frontends produce IR directly without an explicit assembly intermediate; some optimizations work on AST instead of IR; LTO (link-time optimization) merges files for cross-module optimization. The 9-stage breakdown is conceptual.
Lexing and parsing
Take the source x = a + 42;. The lexer produces:
IDENT("x") EQUALS IDENT("a") PLUS INT(42) SEMI
The parser consumes the token stream and builds a tree per the language grammar:
Assign
├── Target: Identifier("x")
└── Value:
└── BinaryOp("+")
├── Left: Identifier("a")
└── Right: IntLiteral(42)
Lexers are typically finite state machines (regex-based). Parsers are often LL(k) or LR(1) for simple grammars; recursive descent for more complex languages. Modern parser generators (ANTLR, tree-sitter) handle ambiguous grammars and produce useful error messages.
From AST to IR
The AST is high-level — it has functions, classes, control structures. IR is low-level — basic blocks of instructions, branches, register-like SSA values. The "lowering" step does this translation.
For the C statement if (x > 0) y = 1; else y = -1;:
; LLVM IR (simplified)
entry:
%cmp = icmp sgt i32 %x, 0
br i1 %cmp, label %if.then, label %if.else
if.then:
store i32 1, i32* %y
br label %if.end
if.else:
store i32 -1, i32* %y
br label %if.end
if.end:
ret void
Each function becomes a control-flow graph of basic blocks. Each basic block is a sequence of instructions that executes top-to-bottom with no branches in the middle. Optimizations operate on this CFG.
Common compiler optimizations
| Optimization | Effect | Example |
|---|---|---|
| Constant folding | Compute constant expressions at compile time | 3 * 4 → 12 |
| Constant propagation | Replace uses of known constants | x = 5; y = x + 1; → y = 6; |
| Dead code elimination | Remove instructions whose results aren't used | x = expensive(); /* never used */ deleted |
| Common subexpression elimination | Compute repeated expressions once | (a+b)*c + (a+b)*d → t=a+b; t*c + t*d |
| Inlining | Replace function call with body | Eliminates call overhead; enables further optimization |
| Loop unrolling | Repeat loop body to reduce iteration overhead | for(i=0;i<4;i++) a[i]=0; → 4 stores |
| Loop invariant code motion | Move invariant computations out of loops | while(...){ x=a*b; ... } → x=a*b; while(...){ ... } |
| Strength reduction | Replace expensive ops with cheaper | x*2 → x<<1 |
| Tail call optimization | Reuse stack frame for tail calls | Recursion uses O(1) stack instead of O(depth) |
Production compilers (LLVM, GCC) ship hundreds of optimization passes. -O2 and -O3 are bundles of these passes; the tradeoff is compile time vs runtime speed.
Static compilation vs JIT
| Static compilation (AOT) | Just-in-time (JIT) | |
|---|---|---|
| When code is compiled | Once, before execution | During execution, on hot paths |
| Optimizations | Profile-guided possible (rarely used in practice) | Profile-guided automatic (real runtime data) |
| Startup speed | Fast (binary loads, runs) | Slow (warm-up while JIT compiles) |
| Steady-state speed | Limited by static analysis | Often faster (specializes on actual types) |
| Memory | Smaller — no compiler in process | Larger — JIT compiler resident |
| Examples | C/C++/Rust/Go executables | JVM, V8, Python (PyPy), .NET |
Modern languages mix approaches — V8 first interprets JS, then JIT-compiles hot functions, then re-optimizes them as more profiling data arrives. Java's HotSpot has tiered compilation (interpreter → C1 → C2 → optimized C2). Hybrid is the norm for dynamic languages.
When understanding the pipeline matters
- Debugging compiler errors. Each stage has distinct error messages; recognizing "lexer error" vs "parse error" vs "type error" tells you what part of your code to inspect.
- Reading assembly to optimize hot code.
gcc -Sorgodbolt.orgshows the assembly your code compiles to. Spotting unnecessary memory loads, inefficient branches, missed vectorization — all require reading the codegen output. - Writing language tooling. Linters, formatters, refactoring tools all operate on AST. Understanding the AST shape of your language is a prerequisite.
- Knowing which optimizations are automatic. Don't manually inline a function the compiler already inlines. Don't manually constant-fold; the compiler does it. Focus optimization effort on what compilers can't do (algorithmic improvements, data layout).
- Cross-compilation and target tuning. Same source, different backends — x86 vs ARM, server vs embedded. The IR layer is what makes this practical.
Common compilation pitfalls
- Thinking
-O3is always faster. Aggressive optimization can be slower in cache-pressured workloads (more inlining → larger code → more I-cache misses). Profile both. - Undefined behavior optimizations. C/C++ compilers exploit UB to skip checks ("this can't happen, so I'll assume it doesn't"). Signed integer overflow, null dereference, strict aliasing violations all get aggressively optimized. Use UB sanitizers (UBSan) to catch.
- Forgetting that JIT warmup matters. JVM benchmarks must run for 30+ seconds before measurements; first iterations are interpreter-paced. Microbenchmark frameworks (JMH) handle this; ad-hoc measurements often don't.
- Ignoring linker errors as "weird messages." "Undefined symbol" means a function/variable was declared but never defined — fix the linkage, not the calling code.
- Compile times exploding from template instantiation. C++ templates and Rust generics produce one specialization per type combination. Heavy template use can blow compile times to minutes; LTO/PGO compounds this.
Frequently asked questions
What's the difference between an interpreter and a compiler?
A compiler translates source to machine code (or another lower-level representation) once; the result is run many times. An interpreter executes source directly, parsing and evaluating each statement on the fly. Modern systems blur the line — JVM and V8 compile bytecode just-in-time during execution, achieving near-compiled speeds. Pure interpreters are mostly historical (early BASIC, Tcl); modern "interpreters" are usually JIT compilers.
What's an AST?
Abstract Syntax Tree. The compiler's structured representation of the source code after parsing. Nodes are language constructs (function definitions, expressions, statements); children are sub-expressions. Easier to manipulate than raw source text — semantic analysis, type checking, and most optimizations work on the AST or downstream IR.
What's IR (intermediate representation) and why is it needed?
A platform-independent middle layer between the AST and machine code. The compiler does most optimizations on IR — independent of source language and target architecture. LLVM IR is the most famous; it lets dozens of frontends (Clang, Rust, Swift) share dozens of backends (x86, ARM, RISC-V) with optimizations written once.
What's SSA form?
Static Single Assignment — every variable is assigned exactly once. <code>x = 1; x = 2;</code> becomes <code>x_1 = 1; x_2 = 2;</code> Phi functions handle merge points (control-flow joins). SSA simplifies many optimizations because each variable's definition is unambiguous. Used by LLVM, GCC's GIMPLE, V8's TurboFan, and most modern optimizing compilers.
What does a linker do?
Combines object files (compiled translation units) plus libraries into a single executable. Resolves symbol references (function calls, global variables) across files. Performs relocations (adjusts addresses to final positions). Static linking copies all dependencies into the executable; dynamic linking leaves them as references resolved at load/run time. The "undefined symbol" error you see is the linker failing this step.
What's JIT compilation?
Just-In-Time. The compiler runs at execution time, compiling hot code paths from bytecode/source to native machine code. Lets the runtime exploit profile information (which branches are common, which functions are hot, which objects are real types). V8 (Chrome's JavaScript engine), HotSpot (Java), and PyPy (Python) all use JITs. Trade-off — startup is slower because compilation happens on first execution; steady-state is fast.
Why are compiler errors so cryptic?
Errors are reported by stage. Lexer errors are "unexpected character"; parser errors are "expected ; got }"; semantic errors are "undefined variable foo." Each stage has its own error vocabulary. A confusing error often means the compiler recovered incorrectly from an earlier mistake — the actual bug is several lines before where the error is reported. Pattern recognition over time helps; modern compilers (Rust, Clang) put significant effort into clearer messages.