Compilers

Lexer / Tokenizer

The first pass — where text becomes a stream of tokens

A lexer (tokenizer) is the first compiler stage: it scans raw source characters left-to-right with a finite-state machine and emits a stream of tokens — identifiers, numbers, operators — using maximal-munch to pick the longest match, in O(n) time.

  • Time complexityO(n)
  • Language classRegular
  • EngineDFA / hand-written switch
  • LookaheadUsually 1 char
  • Match ruleMaximal munch

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a lexer turns characters into tokens

Before a compiler can understand x = count * 2; as an assignment, something has to chop that string of 14 characters into meaningful chunks: the name x, the operator =, the name count, the operator *, the number 2, and the semicolon. That chopping is lexical analysis, and the program that does it is the lexer (also called the tokenizer or scanner). Its output is a flat list of tokens — the alphabet the parser actually reads.

The lexer is a finite-state machine. It keeps a single read position into the source and one piece of state: "what kind of thing am I in the middle of?" It reads one character, consults a transition rule, and either stays in the current token, finishes one, or starts a new one. Crucially, it never recurses and never looks at nesting — it has no notion that a ( must later be balanced by a ). That job belongs to the parser. The lexer only recognizes a regular language: things describable by regular expressions, with no memory of how deep you are.

A token is a small record with three or four fields: a type (NUMBER, IDENT, PLUS, …), the matched lexeme (the exact substring), the source position (line and column, for error messages), and often a decoded value — the integer 255 for the lexeme 0xff, so the parser never re-reads the text.

Maximal munch — the longest-match rule

The single most important rule in a lexer is maximal munch: at each position, consume the longest string of characters that forms a valid token, then stop. This is why == reads as one equality operator and not two assignments, why >>= reads as a single shift-assign token, and why 3.14 reads as one float, not a 3, a dot, and a 14.

Maximal munch occasionally bites. The classic case is C's a+++b. A human reads it as ambiguous, but the lexer is not ambiguous at all: at the first + it greedily grabs the longest operator it can, finds ++, and is done — so the stream is a ++ + b, never a + ++ b. Older C++ had a worse one: vector<vector<int>> lexed the >> as a right-shift operator, forcing programmers to write > > with a space until C++11 special-cased it in the parser.

When two token rules can match the same maximal-length string — say a keyword pattern and an identifier pattern both matching if — a second rule breaks the tie: rule priority, where the rule listed first (or the more specific one) wins. Most hand-written lexers sidestep this entirely by reading the identifier with one rule and then doing a keyword-table lookup.

When you need a separate lexing stage

  • Any compiler or interpreter. Splitting lexing from parsing keeps the parser's grammar small — it reasons about NUMBER PLUS NUMBER, not about which digits make a number.
  • Syntax highlighters and editors. They lex but rarely parse; coloring keywords and strings needs only the token stream, and lexing is fast enough to run on every keystroke.
  • Linters, formatters, and minifiers. A formatter often works at the token level so it can preserve comments and reflow whitespace without a full parse.
  • Config and data formats. JSON, TOML, and protocol parsers all start with a scanner.

You can skip a separate lexer when the input is trivially structured (split on spaces) or when you deliberately fuse lexing into the parser — a scannerless parser, common in PEG and parser-combinator libraries, where there is no token boundary at all. The trade-off: scannerless parsers handle some context-sensitive lexing more cleanly but lose the speed and simplicity of a dedicated O(n) DFA pass.

Lexer vs parser vs other text-processing stages

Lexer / tokenizerParserRegex engineScannerless parser
InputCharactersTokensCharactersCharacters
OutputToken stream (flat)Parse / syntax tree (nested)Match / capture groupsParse tree
Language classRegularContext-freeRegularContext-free
Computational modelFinite-state machine (DFA)Pushdown automaton (stack)NFA/DFA + backtrackingPushdown automaton
Handles nestingNoYesNo (classic regex)Yes
Time complexityO(n)O(n) (LL/LR)O(n) DFA, worst-case exp. backtrackingO(n) to O(n³)
MemoryO(1) state + token bufferO(depth) stackO(1) DFA / O(n) backtrackO(n) packrat memo
Typical useFirst compiler pass, highlightersGrammar checking, AST buildSearch, validationDSLs, parser combinators

The headline distinction is the language class. A lexer recognizes a regular language, so it needs no stack and runs in strict linear time. A parser recognizes a context-free language, so it needs a stack (explicit or via recursion) to track nesting. Trying to make a lexer balance parentheses is the canonical "use the wrong tool" mistake — it is provably impossible with a finite-state machine alone.

What the numbers actually say

  • One pass, one lookup per character. A table-driven DFA does a single array index per input byte. On modern hardware that is a handful of nanoseconds per character, so lexing a 100,000-line, ~5 MB source file is a few milliseconds — typically under 5–10% of total compile time.
  • Lexing is often the bottleneck in fast pipelines. Because everything downstream depends on it, real compilers obsess over it. Clang hand-writes its lexer in C++ and processes characters with a tight switch; the Go compiler's scanner reads UTF-8 runes inline rather than calling a generic decoder, precisely to shave per-character cost.
  • Lookahead is usually exactly 1. Most tokens are decided by the current character plus one peek (is the char after = another =?). Languages needing more — like distinguishing .. from ..< from a float 1.0 after an integer — push the lexer toward small amounts of buffered lookahead.
  • Generated vs hand-written. A tool like flex compiles your regexes into a DFA transition table; it is correct and fast to build but can be slightly slower than a hand-tuned scanner because of the indirect table dispatch. Most production language compilers (Clang, GCC, Go, Rust) hand-write the lexer for the last 10–20% of speed and for better error messages.

JavaScript implementation

A compact hand-written scanner for a small arithmetic language. Note the structure shared by almost every real lexer: a single index i, a dispatch on the current character, and helper loops that implement maximal munch by running while the character class continues.

const KEYWORDS = new Set(['let', 'if', 'else', 'return']);

function tokenize(src) {
  const tokens = [];
  let i = 0, line = 1, col = 1;
  const peek = () => src[i];
  const adv  = () => { const c = src[i++]; if (c === '\n') { line++; col = 1; } else col++; return c; };
  const push = (type, value) => tokens.push({ type, value, line, col });

  while (i < src.length) {
    const c = peek();

    // 1. whitespace — recognized, then discarded (no token)
    if (/\s/.test(c)) { adv(); continue; }

    // 2. line comment — consume to end of line, emit nothing
    if (c === '/' && src[i + 1] === '/') { while (i < src.length && peek() !== '\n') adv(); continue; }

    // 3. numbers — maximal munch over digits (+ one dot)
    if (/[0-9]/.test(c)) {
      let s = '';
      while (i < src.length && /[0-9.]/.test(peek())) s += adv();
      push('NUMBER', parseFloat(s));
      continue;
    }

    // 4. identifiers and keywords — one rule, then a table lookup
    if (/[A-Za-z_]/.test(c)) {
      let s = '';
      while (i < src.length && /[A-Za-z0-9_]/.test(peek())) s += adv();
      push(KEYWORDS.has(s) ? 'KEYWORD' : 'IDENT', s);
      continue;
    }

    // 5. multi-char operators — longest match first (maximal munch)
    const two = src.slice(i, i + 2);
    if (['==', '!=', '<=', '>=', '&&', '||'].includes(two)) { adv(); adv(); push('OP', two); continue; }
    if ('+-*/=<>(){};'.includes(c)) { adv(); push('OP', c); continue; }

    throw new SyntaxError(`Unexpected '${c}' at ${line}:${col}`);
  }
  push('EOF', null);
  return tokens;
}

Two details that separate a toy from the real thing. First, the operator step checks the two-character slice before the single character — that ordering is maximal munch in code form; flip it and == becomes two = tokens. Second, the EOF token: the parser wants a sentinel so it never indexes past the end.

Python implementation

The same scanner in Python, plus a famous follow-on problem: emitting INDENT / DEDENT tokens for a significant-whitespace language. This is the part of Python's own lexer that surprises people — indentation isn't free, the scanner manufactures synthetic tokens for it.

import re
from collections import namedtuple

Token = namedtuple('Token', 'type value line col')
KEYWORDS = {'let', 'if', 'else', 'return'}

def tokenize(src):
    tokens, i, line, col = [], 0, 1, 1
    while i < len(src):
        c = src[i]
        if c in ' \t':                      # whitespace: skip
            i += 1; col += 1; continue
        if c == '\n':
            i += 1; line += 1; col = 1; continue
        m = re.match(r'[0-9]+(\.[0-9]+)?', src[i:])   # maximal munch via regex
        if m:
            s = m.group(); tokens.append(Token('NUMBER', float(s), line, col))
            i += len(s); col += len(s); continue
        m = re.match(r'[A-Za-z_]\w*', src[i:])
        if m:
            s = m.group()
            t = 'KEYWORD' if s in KEYWORDS else 'IDENT'
            tokens.append(Token(t, s, line, col)); i += len(s); col += len(s); continue
        for op in ('==', '!=', '<=', '>='):          # two-char ops first
            if src.startswith(op, i):
                tokens.append(Token('OP', op, line, col)); i += 2; col += 2; break
        else:
            tokens.append(Token('OP', c, line, col)); i += 1; col += 1
    tokens.append(Token('EOF', None, line, col))
    return tokens


def emit_indents(lines):
    """Famous problem: turn leading whitespace into INDENT / DEDENT tokens."""
    out, stack = [], [0]
    for n, raw in enumerate(lines, 1):
        if not raw.strip():                 # blank lines never change indentation
            continue
        indent = len(raw) - len(raw.lstrip(' '))
        if indent > stack[-1]:
            stack.append(indent); out.append(Token('INDENT', indent, n, 1))
        while indent < stack[-1]:            # may close several blocks at once
            stack.pop(); out.append(Token('DEDENT', indent, n, 1))
        out.append(Token('LINE', raw.strip(), n, 1))
    while len(stack) > 1:                    # close everything at EOF
        stack.pop(); out.append(Token('DEDENT', 0, len(lines), 1))
    return out

Notice the DEDENT loop pops potentially several levels at once: closing a deeply nested block with a single outdent must emit one DEDENT per level. Getting this stack discipline wrong is the most common bug when people hand-roll a Python-like lexer.

Variants worth knowing

Hand-written vs generated. A generated lexer (flex, re2c, ANTLR's lexer) compiles your token regexes into a DFA table — fast to write, easy to maintain. A hand-written lexer is a giant switch on the current character — more code, but faster and far better at producing helpful error messages, which is why every major production compiler hand-writes it.

DFA vs NFA. Token regexes naturally compile to a nondeterministic finite automaton (NFA), which can be in several states at once. Tools convert the NFA to a deterministic one (DFA) via subset construction so each character causes exactly one transition. The DFA is what gives you guaranteed O(n) with no backtracking.

Scannerless / PEG. Parser-combinator and PEG libraries fold lexing into parsing — there are no tokens, the parser matches characters directly. This handles some context-sensitive cases (like a language where a word is a keyword only in some positions) more naturally, at a speed cost.

Lexer modes / start conditions. Real languages need a stateful lexer for string interpolation, here-docs, and template literals: inside a `${ ... }` the scanner switches back into "expression mode." flex calls these start conditions; they make the otherwise-flat scanner a small state machine of modes.

Incremental / re-lexing. Editors don't re-lex the whole file on every keystroke. They re-lex only from the changed line, reusing cached tokens before it — the basis of fast syntax highlighting in IDEs.

Common bugs and edge cases

  • Checking one-char operators before two-char ones. The textbook maximal-munch bug: test the longest candidate first, or <= silently becomes < then =.
  • Off-by-one on lookahead at end of input. Peeking src[i + 1] when i is the last index returns undefined/raises — guard the buffer length.
  • Forgetting to track line and column. Without position info, every downstream error says "syntax error" with no location. Update line/col inside the single advance helper so it can't be skipped.
  • Unterminated strings and comments. A /* with no */, or a string with no closing quote, must error at the opening position — not silently swallow the rest of the file.
  • Trying to balance brackets in the lexer. A finite-state machine cannot count nesting depth; matching {} is the parser's job. Attempting it in the lexer is the classic regular-vs-context-free confusion.
  • Number edge cases. 3., .5, 1e10, 0xFF, and 1_000 all need explicit handling; a naive "digits and one dot" loop mis-lexes range syntax like 1..5 by greedily eating the dots.
  • Unicode and encoding. Identifiers may contain non-ASCII letters; reading raw bytes instead of decoded code points breaks on UTF-8 source. Decode to runes first.

Frequently asked questions

What is the difference between a lexer and a parser?

A lexer turns characters into tokens; a parser turns tokens into a tree. The lexer is a finite-state machine recognizing a regular language (no nesting), so it runs in linear O(n) time. The parser handles a context-free grammar with nested structure — balanced parentheses, block scopes — which a lexer's flat state machine cannot describe.

What is maximal munch in a lexer?

Maximal munch (the longest-match rule) says the lexer consumes the longest sequence of characters that forms a valid token before stopping. It is why "==" lexes as one equality operator rather than two assignments, and why ">>=" is a single shift-assign token. The rule occasionally bites: C's "a+++b" tokenizes as "a ++ + b", never "a + ++ b".

How does a lexer distinguish a keyword from an identifier?

It doesn't, at first. The scanner reads the whole identifier with one rule, then looks the lexeme up in a fixed keyword table. "if" and "while" match the identifier pattern exactly like "width" does; the table lookup reclassifies them. Baking every keyword into the state machine bloats it for no benefit.

Why is a lexer O(n) and not slower?

A DFA-based scanner reads each input character exactly once and makes a single O(1) table lookup per character, so total work is linear in the source length. The only place this breaks is on languages needing backtracking; a true regular-language DFA never backtracks, which is the whole point of compiling regexes to a DFA up front.

Does the lexer handle whitespace and comments?

Usually yes — it recognizes them and then discards them, emitting no token. The exceptions are significant-whitespace languages like Python, where the lexer must track indentation and emit synthetic INDENT and DEDENT tokens, and languages that keep comments for documentation tooling.

What does a token actually contain?

A token is a small record: a type (e.g. NUMBER, IDENT, PLUS), the matched lexeme text, and source position (line and column) for error messages. Literal tokens often also carry a decoded value — the integer 255 for the lexeme "0xff" — so the parser doesn't re-parse the text.