Theory

Compilation Pipeline

Source code through 8 stages — text in, executable out

A compiler turns source code into a runnable executable through a pipeline of transformations — lexer → parser → AST → semantic analysis → IR → optimizer → codegen → assembler → linker. Each stage transforms one representation into a more machine-friendly one. Understanding the pipeline is what makes you good at debugging compiler errors, writing fast code, and reading assembly output.

  • Stages (typical)Lex, parse, AST build, semantic check, IR gen, optimize, codegen, assemble, link
  • Compile time vs run timeCompilation is one-shot; runtime is forever
  • AST size~10× source LOC
  • IR examplesLLVM IR, GIMPLE/RTL (GCC), bytecode (JVM, .NET)
  • Common optimizationsConstant folding, dead code, inlining, loop unrolling, SSA-based passes
  • Used inGCC, Clang/LLVM, MSVC, Rust, Go, Swift, JVM, V8

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The pipeline, stage by stage

StageInputOutputJob
1. Lexer (scanner, tokenizer)Source textToken streamGroup characters into tokens (identifiers, keywords, operators, literals)
2. ParserToken streamConcrete syntax treeApply grammar rules; check syntactic structure
3. AST builderConcrete treeAbstract Syntax TreeDrop irrelevant syntax (parens, semicolons); keep meaning
4. Semantic analysisASTAnnotated ASTType checking, name resolution, scope analysis
5. IR generationAnnotated ASTIR (e.g., LLVM IR)Lower high-level constructs to low-level operations
6. OptimizationIROptimized IRConstant folding, dead code, inlining, loop transforms, etc.
7. Code generationOptimized IRAssemblyTranslate IR to target machine instructions
8. AssemblerAssemblyObject file (machine code + relocations)Encode instructions to bytes
9. LinkerObject files + librariesExecutableResolve symbols, finalize addresses

Modern compilers blur stages — frontends produce IR directly without an explicit assembly intermediate; some optimizations work on AST instead of IR; LTO (link-time optimization) merges files for cross-module optimization. The 9-stage breakdown is conceptual.

Lexing and parsing

Take the source x = a + 42;. The lexer produces:

IDENT("x")  EQUALS  IDENT("a")  PLUS  INT(42)  SEMI

The parser consumes the token stream and builds a tree per the language grammar:

Assign
├── Target: Identifier("x")
└── Value:
    └── BinaryOp("+")
        ├── Left:  Identifier("a")
        └── Right: IntLiteral(42)

Lexers are typically finite state machines (regex-based). Parsers are often LL(k) or LR(1) for simple grammars; recursive descent for more complex languages. Modern parser generators (ANTLR, tree-sitter) handle ambiguous grammars and produce useful error messages.

From AST to IR

The AST is high-level — it has functions, classes, control structures. IR is low-level — basic blocks of instructions, branches, register-like SSA values. The "lowering" step does this translation.

For the C statement if (x > 0) y = 1; else y = -1;:

; LLVM IR (simplified)
entry:
  %cmp = icmp sgt i32 %x, 0
  br i1 %cmp, label %if.then, label %if.else

if.then:
  store i32 1, i32* %y
  br label %if.end

if.else:
  store i32 -1, i32* %y
  br label %if.end

if.end:
  ret void

Each function becomes a control-flow graph of basic blocks. Each basic block is a sequence of instructions that executes top-to-bottom with no branches in the middle. Optimizations operate on this CFG.

Common compiler optimizations

OptimizationEffectExample
Constant foldingCompute constant expressions at compile time3 * 412
Constant propagationReplace uses of known constantsx = 5; y = x + 1;y = 6;
Dead code eliminationRemove instructions whose results aren't usedx = expensive(); /* never used */ deleted
Common subexpression eliminationCompute repeated expressions once(a+b)*c + (a+b)*dt=a+b; t*c + t*d
InliningReplace function call with bodyEliminates call overhead; enables further optimization
Loop unrollingRepeat loop body to reduce iteration overheadfor(i=0;i<4;i++) a[i]=0; → 4 stores
Loop invariant code motionMove invariant computations out of loopswhile(...){ x=a*b; ... }x=a*b; while(...){ ... }
Strength reductionReplace expensive ops with cheaperx*2x<<1
Tail call optimizationReuse stack frame for tail callsRecursion uses O(1) stack instead of O(depth)

Production compilers (LLVM, GCC) ship hundreds of optimization passes. -O2 and -O3 are bundles of these passes; the tradeoff is compile time vs runtime speed.

Static compilation vs JIT

Static compilation (AOT)Just-in-time (JIT)
When code is compiledOnce, before executionDuring execution, on hot paths
OptimizationsProfile-guided possible (rarely used in practice)Profile-guided automatic (real runtime data)
Startup speedFast (binary loads, runs)Slow (warm-up while JIT compiles)
Steady-state speedLimited by static analysisOften faster (specializes on actual types)
MemorySmaller — no compiler in processLarger — JIT compiler resident
ExamplesC/C++/Rust/Go executablesJVM, V8, Python (PyPy), .NET

Modern languages mix approaches — V8 first interprets JS, then JIT-compiles hot functions, then re-optimizes them as more profiling data arrives. Java's HotSpot has tiered compilation (interpreter → C1 → C2 → optimized C2). Hybrid is the norm for dynamic languages.

When understanding the pipeline matters

  • Debugging compiler errors. Each stage has distinct error messages; recognizing "lexer error" vs "parse error" vs "type error" tells you what part of your code to inspect.
  • Reading assembly to optimize hot code. gcc -S or godbolt.org shows the assembly your code compiles to. Spotting unnecessary memory loads, inefficient branches, missed vectorization — all require reading the codegen output.
  • Writing language tooling. Linters, formatters, refactoring tools all operate on AST. Understanding the AST shape of your language is a prerequisite.
  • Knowing which optimizations are automatic. Don't manually inline a function the compiler already inlines. Don't manually constant-fold; the compiler does it. Focus optimization effort on what compilers can't do (algorithmic improvements, data layout).
  • Cross-compilation and target tuning. Same source, different backends — x86 vs ARM, server vs embedded. The IR layer is what makes this practical.

Common compilation pitfalls

  • Thinking -O3 is always faster. Aggressive optimization can be slower in cache-pressured workloads (more inlining → larger code → more I-cache misses). Profile both.
  • Undefined behavior optimizations. C/C++ compilers exploit UB to skip checks ("this can't happen, so I'll assume it doesn't"). Signed integer overflow, null dereference, strict aliasing violations all get aggressively optimized. Use UB sanitizers (UBSan) to catch.
  • Forgetting that JIT warmup matters. JVM benchmarks must run for 30+ seconds before measurements; first iterations are interpreter-paced. Microbenchmark frameworks (JMH) handle this; ad-hoc measurements often don't.
  • Ignoring linker errors as "weird messages." "Undefined symbol" means a function/variable was declared but never defined — fix the linkage, not the calling code.
  • Compile times exploding from template instantiation. C++ templates and Rust generics produce one specialization per type combination. Heavy template use can blow compile times to minutes; LTO/PGO compounds this.

Frequently asked questions

What's the difference between an interpreter and a compiler?

A compiler translates source to machine code (or another lower-level representation) once; the result is run many times. An interpreter executes source directly, parsing and evaluating each statement on the fly. Modern systems blur the line — JVM and V8 compile bytecode just-in-time during execution, achieving near-compiled speeds. Pure interpreters are mostly historical (early BASIC, Tcl); modern "interpreters" are usually JIT compilers.

What's an AST?

Abstract Syntax Tree. The compiler's structured representation of the source code after parsing. Nodes are language constructs (function definitions, expressions, statements); children are sub-expressions. Easier to manipulate than raw source text — semantic analysis, type checking, and most optimizations work on the AST or downstream IR.

What's IR (intermediate representation) and why is it needed?

A platform-independent middle layer between the AST and machine code. The compiler does most optimizations on IR — independent of source language and target architecture. LLVM IR is the most famous; it lets dozens of frontends (Clang, Rust, Swift) share dozens of backends (x86, ARM, RISC-V) with optimizations written once.

What's SSA form?

Static Single Assignment — every variable is assigned exactly once. <code>x = 1; x = 2;</code> becomes <code>x_1 = 1; x_2 = 2;</code> Phi functions handle merge points (control-flow joins). SSA simplifies many optimizations because each variable's definition is unambiguous. Used by LLVM, GCC's GIMPLE, V8's TurboFan, and most modern optimizing compilers.

What does a linker do?

Combines object files (compiled translation units) plus libraries into a single executable. Resolves symbol references (function calls, global variables) across files. Performs relocations (adjusts addresses to final positions). Static linking copies all dependencies into the executable; dynamic linking leaves them as references resolved at load/run time. The "undefined symbol" error you see is the linker failing this step.

What's JIT compilation?

Just-In-Time. The compiler runs at execution time, compiling hot code paths from bytecode/source to native machine code. Lets the runtime exploit profile information (which branches are common, which functions are hot, which objects are real types). V8 (Chrome's JavaScript engine), HotSpot (Java), and PyPy (Python) all use JITs. Trade-off — startup is slower because compilation happens on first execution; steady-state is fast.

Why are compiler errors so cryptic?

Errors are reported by stage. Lexer errors are "unexpected character"; parser errors are "expected ; got }"; semantic errors are "undefined variable foo." Each stage has its own error vocabulary. A confusing error often means the compiler recovered incorrectly from an earlier mistake — the actual bug is several lines before where the error is reported. Pattern recognition over time helps; modern compilers (Rust, Clang) put significant effort into clearer messages.