Compilers

Recursive Descent Parser

One function per grammar rule — the parser you write by hand

A recursive descent parser is a top-down LL parser written by hand — one mutually-recursive function per grammar nonterminal. Used in GCC, Clang, TypeScript, and Rust. Predictive, fast, and easy to extend with great error messages.

  • Parser classTop-down LL — one function per nonterminal
  • LookaheadTypically LL(1), extended with backtracking
  • Time complexityO(n) for LL(1), O(n²) with backtracking
  • Cannot handleLeft-recursive rules directly (rewrite to iteration)
  • Used inGCC, Clang, TypeScript, rustc, V8 (JavaScript), CPython, Ruby MRI
  • Why hand-writtenBetter error messages, debugging, language-specific shortcuts

Interactive visualization

Token stream "1 + 2 * 3" enters the parser. Watch parseExpr call parseTerm call parseFactor, and an AST build bottom-up while the call stack tracks where we are.

Open visualization fullscreen ↗

How recursive descent works

You have a grammar. Each rule (nonterminal) becomes a function. To parse an expression, you call the function for the start symbol; it consumes tokens and calls other functions to recognize sub-structures. The call stack tracks where you are in the parse tree.

Take a tiny arithmetic grammar:

Expr   → Term ( ('+' | '-') Term )*
Term   → Factor ( ('*' | '/') Factor )*
Factor → NUMBER | '(' Expr ')'

Three rules, three functions. The grammar's recursion structure becomes the parser's call-graph structure. To parse 1 + 2 * 3:

parseExpr()
├── parseTerm()
│   └── parseFactor() → NUMBER(1)
├── peek '+' → consume, loop
└── parseTerm()
    ├── parseFactor() → NUMBER(2)
    ├── peek '*' → consume, loop
    └── parseFactor() → NUMBER(3)
    → BinOp(*, 2, 3)
→ BinOp(+, 1, BinOp(*, 2, 3))

The output is an AST that already encodes the right precedence — * binds tighter than + because parseTerm runs before the + loop body sees the next token.

The left-recursion problem

The naïve grammar form is left-recursive:

Expr → Expr '+' Term | Term     // LEFT-RECURSIVE — won't work

If you translate this directly to a parser function, parseExpr calls parseExpr as its first action with no progress made on the input. Stack overflow.

The standard transformation rewrites left recursion as iteration:

Expr → Term ( '+' Term )*

One Term, then zero or more "+" Term suffixes. The parser function becomes a Term call followed by a while loop that consumes additional + tokens. The call stack stays bounded; the loop captures the unbounded structure.

LL(1) and lookahead

An LL(1) grammar is one where the parser can pick the correct production by looking at one token of lookahead. peek() the next token; switch on its type:

parseFactor():
    t = peek()
    if t.type == NUMBER:
        consume()
        return Number(t.value)
    elif t.type == '(':
        consume('(')
        e = parseExpr()
        consume(')')
        return e
    else:
        error("expected number or '('")

Real languages aren't always LL(1). Java's "<" can be a generics opening or a less-than operator. C++ "A *B" can be a declaration or a multiplication. Hand-written parsers handle these with tentative parsing — speculatively try one production, snapshot the token state, and if it fails, restore and try the other. Clang's C++ parser has hundreds of these speculation points.

Recursive descent vs other parsers

Recursive descentLALR (yacc/bison)PEG / packratEarley
ImplementationHand-written or generatedTable-driven, generatedGeneratedAlgorithm, generated
Time complexityO(n) LL(1), O(n²) backtrackingO(n)O(n) packrat memoized, O(2ⁿ) naïve PEGO(n³) any context-free, O(n²) unambiguous, O(n) LR
MemoryO(depth) stackO(n) tableO(n × rules) memoO(n²) chart
Handles ambiguityProgrammer decidesConflicts → errorsAlways picks first alternativeReturns all parses
Left recursionNo — must rewriteYesNo (without extension)Yes
Error messagesExcellentMediocre — "syntax error"MediocreMediocre
Used inGCC, Clang, TypeScript, rustc, V8Old Bison-generated, GLR variantsLua, some researchResearch, natural language

The error-message column is what flipped the industry. Bison emits "syntax error at line 42" — useless. A hand-written recursive descent parser knows it's in parseFunctionDeclaration after the open paren, expecting a parameter, and can say "expected parameter type or ')', got '++'". That difference is why GCC, Clang, TypeScript, Rust's rustc, and every modern compiler ships hand-written recursive descent.

When to use recursive descent

  • Writing a compiler or interpreter. The default choice for the front end. Easy to debug (set a breakpoint in parseExpression); easy to extend (add a new keyword by adding a new branch).
  • Building a DSL parser. Configuration languages, query languages, template engines — recursive descent in 200 lines beats pulling in ANTLR or PEG.js for simple grammars.
  • Parsing a protocol or wire format. Many protocols (HTTP, SMTP, JSON) are LL(1) and parse cleanly with recursive descent. JSON parsers in Node and Python are essentially hand-written recursive descent.
  • You need helpful error messages. If your users are humans (not machines feeding you input), parser error quality matters more than parser theoretical elegance.

Skip recursive descent when: you have a heavily ambiguous grammar (use GLR or Earley), the grammar changes frequently and is owned by non-experts (parser generator), or performance matters more than maintainability and you can profile-justify an LALR table-driven parser (rare).

JavaScript implementation

// Recursive descent parser for arithmetic: 1 + 2 * (3 - 4) / 5

class Parser {
  constructor(tokens) {
    this.tokens = tokens;
    this.pos = 0;
  }

  peek() { return this.tokens[this.pos]; }
  consume(type) {
    const t = this.peek();
    if (!t || t.type !== type) throw new Error(`expected ${type}, got ${t?.type || 'EOF'}`);
    this.pos++;
    return t;
  }

  // Expr → Term (('+'|'-') Term)*
  parseExpr() {
    let left = this.parseTerm();
    while (this.peek() && (this.peek().type === '+' || this.peek().type === '-')) {
      const op = this.consume(this.peek().type);
      const right = this.parseTerm();
      left = { type: 'BinaryOp', op: op.type, left, right };
    }
    return left;
  }

  // Term → Factor (('*'|'/') Factor)*
  parseTerm() {
    let left = this.parseFactor();
    while (this.peek() && (this.peek().type === '*' || this.peek().type === '/')) {
      const op = this.consume(this.peek().type);
      const right = this.parseFactor();
      left = { type: 'BinaryOp', op: op.type, left, right };
    }
    return left;
  }

  // Factor → NUMBER | '(' Expr ')'
  parseFactor() {
    const t = this.peek();
    if (t.type === 'NUMBER') {
      this.consume('NUMBER');
      return { type: 'Number', value: t.value };
    }
    if (t.type === '(') {
      this.consume('(');
      const e = this.parseExpr();
      this.consume(')');
      return e;
    }
    throw new Error(`unexpected token ${t.type}`);
  }
}

// Tokens for "1 + 2 * 3":
const tokens = [
  { type: 'NUMBER', value: 1 }, { type: '+' },
  { type: 'NUMBER', value: 2 }, { type: '*' },
  { type: 'NUMBER', value: 3 }
];
const ast = new Parser(tokens).parseExpr();
// → BinaryOp(+, Number(1), BinaryOp(*, Number(2), Number(3)))

Python implementation with Pratt extension

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self, type=None):
        t = self.peek()
        if type and (not t or t.type != type):
            raise SyntaxError(f"expected {type}, got {t.type if t else 'EOF'}")
        self.pos += 1
        return t

    # Pratt-style: parse with minimum binding power
    PRECEDENCE = {'+': 10, '-': 10, '*': 20, '/': 20}

    def parse_expr(self, min_prec=0):
        left = self.parse_atom()
        while self.peek() and self.peek().type in self.PRECEDENCE:
            prec = self.PRECEDENCE[self.peek().type]
            if prec < min_prec: break
            op = self.consume()
            right = self.parse_expr(prec + 1)  # left-associative
            left = ('binop', op.type, left, right)
        return left

    def parse_atom(self):
        t = self.peek()
        if t.type == 'NUMBER':
            self.consume('NUMBER')
            return ('num', t.value)
        if t.type == '(':
            self.consume('(')
            e = self.parse_expr()
            self.consume(')')
            return e
        raise SyntaxError(f"unexpected {t.type}")

The Pratt variant collapses multiple precedence-level functions into one parameterized function. For a language with 10+ precedence levels (C, C++, JavaScript), Pratt parsing is far more compact than 10 separate parseLevelN functions.

Performance — how fast is hand-written recursive descent?

  • Throughput: A well-tuned recursive descent parser processes 5–20 MB/s of source. Clang's parser hits ~10 MB/s on real C++ code; rustc's around 3–5 MB/s (slower due to richer semantic checks).
  • Memory: O(parse-tree-depth) call-stack frames + O(AST-size) heap allocation. Typical AST is 8–15× the source size.
  • Parsing share of compile time: 5–15% of total compile time at -O0; falls to under 2% at -O3 because optimization dominates.
  • Function-call overhead: Modern CPUs predict and pipeline calls well, but extreme inlining (Pratt) shaves another 10–20% off mature recursive descent parsers.
  • Compilers vs parser generators: Hand-written recursive descent beats Bison-generated LALR by 20–40% on real workloads because of branch predictor friendliness and better cache locality (no jump tables).

Common recursive descent pitfalls

  • Forgetting to advance the parser. If a branch doesn't call consume(), you loop forever. Always ensure every successful production consumes at least one token.
  • Conflating LL ambiguity with grammar ambiguity. The dangling-else problem ("which if does this else attach to?") is fixed by parser convention, not grammar restructuring. Match else greedily — attach it to the nearest if.
  • Stack overflow on deeply nested input. ((((((1)))))) recurses 6 deep; (((..1..))) 10000 deep blows the stack. Hand-written parsers either bound recursion depth or convert deep parens to iteration.
  • Lookahead leak. If peek() requires consuming and putting back a token, you bypass tokenizer invariants (whitespace, position tracking). Use a proper lookahead buffer or token stream wrapper.
  • Error recovery. When a parse fails, where do you resume? Common heuristic: skip until you see a synchronization token (semicolon, close brace). Without recovery, one syntax error halts the whole compile — terrible UX.

Frequently asked questions

What's a recursive descent parser?

A top-down parser written by hand as a set of mutually-recursive functions — one function per nonterminal in the grammar. To parse an expression you call parseExpression(); it calls parseTerm() for each operand; parseTerm() calls parseFactor() for each operand of multiplication; parseFactor() recognizes numbers or recursively parses a parenthesized expression. The recursion mirrors the grammar's structure, and the call stack tracks where you are in the parse tree.

Why can't recursive descent handle left-recursive grammars?

A rule like 'Expr → Expr + Term' makes parseExpression() call parseExpression() as its first action with no progress — infinite recursion, stack overflow. The standard fix is to rewrite the grammar to use right recursion or iteration: 'Expr → Term ((+ | -) Term)*' — a Term followed by zero or more (+|-) Term suffixes. Hand-written parsers use a while-loop instead of recursion for the suffix part.

What's LL(k)?

L stands for "left-to-right scan" (input is read left to right), L stands for "leftmost derivation" (the parser builds the leftmost expansion of nonterminals), and k is the number of lookahead tokens. LL(1) parsers decide what rule to use based on the next single token; LL(k) parsers peek k tokens ahead. Recursive descent is naturally LL(1) but easy to extend to k=2 or higher with a buffered tokenizer.

Why do major compilers hand-write recursive descent?

Better error messages, debugging, and the ability to encode language-specific shortcuts. GCC switched from Bison to hand-written recursive descent for C and C++ around 2004 specifically for error recovery. Clang has always been hand-written. The TypeScript compiler, Rust's rustc, and the V8 JavaScript parser are all hand-written recursive descent. Parser generators are convenient for prototypes but most production compilers reject them for compilers people actually use.

How do you parse operator precedence in recursive descent?

One function per precedence level. parseExpression() handles + and -, calling parseTerm() for operands. parseTerm() handles * and /, calling parseFactor() for operands. parseFactor() handles primary tokens (numbers, identifiers, parenthesized expressions). Because parseTerm runs before each call to parseExpression's loop body, multiplication binds tighter. For many precedence levels, Pratt parsing (operator-precedence parsing with semantic actions) is a more compact alternative.

What's the lookahead problem?

Some grammars need more than one token of lookahead to decide which production to apply. C++ is famously not LL(k) for any fixed k — "A *B" could be a multiplication or a pointer declaration depending on whether A is a type. Hand-written parsers handle this by "backtracking" — speculatively trying one production, and if it fails, rewinding and trying the other. The C++ parser in Clang has many such backtracking sites.

What's a Pratt parser?

An elegant variant of recursive descent for expressions with many operators and precedence levels. Each token has an "nud" (null denotation, what to do when this token starts an expression) and an "led" (left denotation, what to do when this token follows another). A single parseExpression(min_prec) function handles all precedence levels by looping while the next token's binding power exceeds min_prec. Used in JavaScript engines, Crockford's old article popularized it.