Compilers
Lexer / Tokenizer
The first pass — where text becomes a stream of tokens
A lexer (tokenizer) is the first compiler stage: it scans raw source characters left-to-right with a finite-state machine and emits a stream of tokens — identifiers, numbers, operators — using maximal-munch to pick the longest match, in O(n) time.
- Time complexityO(n)
- Language classRegular
- EngineDFA / hand-written switch
- LookaheadUsually 1 char
- Match ruleMaximal munch
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a lexer turns characters into tokens
Before a compiler can understand x = count * 2; as an assignment, something has to chop that string of 14 characters into meaningful chunks: the name x, the operator =, the name count, the operator *, the number 2, and the semicolon. That chopping is lexical analysis, and the program that does it is the lexer (also called the tokenizer or scanner). Its output is a flat list of tokens — the alphabet the parser actually reads.
The lexer is a finite-state machine. It keeps a single read position into the source and one piece of state: "what kind of thing am I in the middle of?" It reads one character, consults a transition rule, and either stays in the current token, finishes one, or starts a new one. Crucially, it never recurses and never looks at nesting — it has no notion that a ( must later be balanced by a ). That job belongs to the parser. The lexer only recognizes a regular language: things describable by regular expressions, with no memory of how deep you are.
A token is a small record with three or four fields: a type (NUMBER, IDENT, PLUS, …), the matched lexeme (the exact substring), the source position (line and column, for error messages), and often a decoded value — the integer 255 for the lexeme 0xff, so the parser never re-reads the text.
Maximal munch — the longest-match rule
The single most important rule in a lexer is maximal munch: at each position, consume the longest string of characters that forms a valid token, then stop. This is why == reads as one equality operator and not two assignments, why >>= reads as a single shift-assign token, and why 3.14 reads as one float, not a 3, a dot, and a 14.
Maximal munch occasionally bites. The classic case is C's a+++b. A human reads it as ambiguous, but the lexer is not ambiguous at all: at the first + it greedily grabs the longest operator it can, finds ++, and is done — so the stream is a ++ + b, never a + ++ b. Older C++ had a worse one: vector<vector<int>> lexed the >> as a right-shift operator, forcing programmers to write > > with a space until C++11 special-cased it in the parser.
When two token rules can match the same maximal-length string — say a keyword pattern and an identifier pattern both matching if — a second rule breaks the tie: rule priority, where the rule listed first (or the more specific one) wins. Most hand-written lexers sidestep this entirely by reading the identifier with one rule and then doing a keyword-table lookup.
When you need a separate lexing stage
- Any compiler or interpreter. Splitting lexing from parsing keeps the parser's grammar small — it reasons about
NUMBER PLUS NUMBER, not about which digits make a number. - Syntax highlighters and editors. They lex but rarely parse; coloring keywords and strings needs only the token stream, and lexing is fast enough to run on every keystroke.
- Linters, formatters, and minifiers. A formatter often works at the token level so it can preserve comments and reflow whitespace without a full parse.
- Config and data formats. JSON, TOML, and protocol parsers all start with a scanner.
You can skip a separate lexer when the input is trivially structured (split on spaces) or when you deliberately fuse lexing into the parser — a scannerless parser, common in PEG and parser-combinator libraries, where there is no token boundary at all. The trade-off: scannerless parsers handle some context-sensitive lexing more cleanly but lose the speed and simplicity of a dedicated O(n) DFA pass.
Lexer vs parser vs other text-processing stages
| Lexer / tokenizer | Parser | Regex engine | Scannerless parser | |
|---|---|---|---|---|
| Input | Characters | Tokens | Characters | Characters |
| Output | Token stream (flat) | Parse / syntax tree (nested) | Match / capture groups | Parse tree |
| Language class | Regular | Context-free | Regular | Context-free |
| Computational model | Finite-state machine (DFA) | Pushdown automaton (stack) | NFA/DFA + backtracking | Pushdown automaton |
| Handles nesting | No | Yes | No (classic regex) | Yes |
| Time complexity | O(n) | O(n) (LL/LR) | O(n) DFA, worst-case exp. backtracking | O(n) to O(n³) |
| Memory | O(1) state + token buffer | O(depth) stack | O(1) DFA / O(n) backtrack | O(n) packrat memo |
| Typical use | First compiler pass, highlighters | Grammar checking, AST build | Search, validation | DSLs, parser combinators |
The headline distinction is the language class. A lexer recognizes a regular language, so it needs no stack and runs in strict linear time. A parser recognizes a context-free language, so it needs a stack (explicit or via recursion) to track nesting. Trying to make a lexer balance parentheses is the canonical "use the wrong tool" mistake — it is provably impossible with a finite-state machine alone.
What the numbers actually say
- One pass, one lookup per character. A table-driven DFA does a single array index per input byte. On modern hardware that is a handful of nanoseconds per character, so lexing a 100,000-line, ~5 MB source file is a few milliseconds — typically under 5–10% of total compile time.
- Lexing is often the bottleneck in fast pipelines. Because everything downstream depends on it, real compilers obsess over it. Clang hand-writes its lexer in C++ and processes characters with a tight switch; the Go compiler's scanner reads UTF-8 runes inline rather than calling a generic decoder, precisely to shave per-character cost.
- Lookahead is usually exactly 1. Most tokens are decided by the current character plus one peek (is the char after
=another=?). Languages needing more — like distinguishing..from..<from a float1.0after an integer — push the lexer toward small amounts of buffered lookahead. - Generated vs hand-written. A tool like
flexcompiles your regexes into a DFA transition table; it is correct and fast to build but can be slightly slower than a hand-tuned scanner because of the indirect table dispatch. Most production language compilers (Clang, GCC, Go, Rust) hand-write the lexer for the last 10–20% of speed and for better error messages.
JavaScript implementation
A compact hand-written scanner for a small arithmetic language. Note the structure shared by almost every real lexer: a single index i, a dispatch on the current character, and helper loops that implement maximal munch by running while the character class continues.
const KEYWORDS = new Set(['let', 'if', 'else', 'return']);
function tokenize(src) {
const tokens = [];
let i = 0, line = 1, col = 1;
const peek = () => src[i];
const adv = () => { const c = src[i++]; if (c === '\n') { line++; col = 1; } else col++; return c; };
const push = (type, value) => tokens.push({ type, value, line, col });
while (i < src.length) {
const c = peek();
// 1. whitespace — recognized, then discarded (no token)
if (/\s/.test(c)) { adv(); continue; }
// 2. line comment — consume to end of line, emit nothing
if (c === '/' && src[i + 1] === '/') { while (i < src.length && peek() !== '\n') adv(); continue; }
// 3. numbers — maximal munch over digits (+ one dot)
if (/[0-9]/.test(c)) {
let s = '';
while (i < src.length && /[0-9.]/.test(peek())) s += adv();
push('NUMBER', parseFloat(s));
continue;
}
// 4. identifiers and keywords — one rule, then a table lookup
if (/[A-Za-z_]/.test(c)) {
let s = '';
while (i < src.length && /[A-Za-z0-9_]/.test(peek())) s += adv();
push(KEYWORDS.has(s) ? 'KEYWORD' : 'IDENT', s);
continue;
}
// 5. multi-char operators — longest match first (maximal munch)
const two = src.slice(i, i + 2);
if (['==', '!=', '<=', '>=', '&&', '||'].includes(two)) { adv(); adv(); push('OP', two); continue; }
if ('+-*/=<>(){};'.includes(c)) { adv(); push('OP', c); continue; }
throw new SyntaxError(`Unexpected '${c}' at ${line}:${col}`);
}
push('EOF', null);
return tokens;
}
Two details that separate a toy from the real thing. First, the operator step checks the two-character slice before the single character — that ordering is maximal munch in code form; flip it and == becomes two = tokens. Second, the EOF token: the parser wants a sentinel so it never indexes past the end.
Python implementation
The same scanner in Python, plus a famous follow-on problem: emitting INDENT / DEDENT tokens for a significant-whitespace language. This is the part of Python's own lexer that surprises people — indentation isn't free, the scanner manufactures synthetic tokens for it.
import re
from collections import namedtuple
Token = namedtuple('Token', 'type value line col')
KEYWORDS = {'let', 'if', 'else', 'return'}
def tokenize(src):
tokens, i, line, col = [], 0, 1, 1
while i < len(src):
c = src[i]
if c in ' \t': # whitespace: skip
i += 1; col += 1; continue
if c == '\n':
i += 1; line += 1; col = 1; continue
m = re.match(r'[0-9]+(\.[0-9]+)?', src[i:]) # maximal munch via regex
if m:
s = m.group(); tokens.append(Token('NUMBER', float(s), line, col))
i += len(s); col += len(s); continue
m = re.match(r'[A-Za-z_]\w*', src[i:])
if m:
s = m.group()
t = 'KEYWORD' if s in KEYWORDS else 'IDENT'
tokens.append(Token(t, s, line, col)); i += len(s); col += len(s); continue
for op in ('==', '!=', '<=', '>='): # two-char ops first
if src.startswith(op, i):
tokens.append(Token('OP', op, line, col)); i += 2; col += 2; break
else:
tokens.append(Token('OP', c, line, col)); i += 1; col += 1
tokens.append(Token('EOF', None, line, col))
return tokens
def emit_indents(lines):
"""Famous problem: turn leading whitespace into INDENT / DEDENT tokens."""
out, stack = [], [0]
for n, raw in enumerate(lines, 1):
if not raw.strip(): # blank lines never change indentation
continue
indent = len(raw) - len(raw.lstrip(' '))
if indent > stack[-1]:
stack.append(indent); out.append(Token('INDENT', indent, n, 1))
while indent < stack[-1]: # may close several blocks at once
stack.pop(); out.append(Token('DEDENT', indent, n, 1))
out.append(Token('LINE', raw.strip(), n, 1))
while len(stack) > 1: # close everything at EOF
stack.pop(); out.append(Token('DEDENT', 0, len(lines), 1))
return out
Notice the DEDENT loop pops potentially several levels at once: closing a deeply nested block with a single outdent must emit one DEDENT per level. Getting this stack discipline wrong is the most common bug when people hand-roll a Python-like lexer.
Variants worth knowing
Hand-written vs generated. A generated lexer (flex, re2c, ANTLR's lexer) compiles your token regexes into a DFA table — fast to write, easy to maintain. A hand-written lexer is a giant switch on the current character — more code, but faster and far better at producing helpful error messages, which is why every major production compiler hand-writes it.
DFA vs NFA. Token regexes naturally compile to a nondeterministic finite automaton (NFA), which can be in several states at once. Tools convert the NFA to a deterministic one (DFA) via subset construction so each character causes exactly one transition. The DFA is what gives you guaranteed O(n) with no backtracking.
Scannerless / PEG. Parser-combinator and PEG libraries fold lexing into parsing — there are no tokens, the parser matches characters directly. This handles some context-sensitive cases (like a language where a word is a keyword only in some positions) more naturally, at a speed cost.
Lexer modes / start conditions. Real languages need a stateful lexer for string interpolation, here-docs, and template literals: inside a `${ ... }` the scanner switches back into "expression mode." flex calls these start conditions; they make the otherwise-flat scanner a small state machine of modes.
Incremental / re-lexing. Editors don't re-lex the whole file on every keystroke. They re-lex only from the changed line, reusing cached tokens before it — the basis of fast syntax highlighting in IDEs.
Common bugs and edge cases
- Checking one-char operators before two-char ones. The textbook maximal-munch bug: test the longest candidate first, or
<=silently becomes<then=. - Off-by-one on lookahead at end of input. Peeking
src[i + 1]wheniis the last index returns undefined/raises — guard the buffer length. - Forgetting to track line and column. Without position info, every downstream error says "syntax error" with no location. Update line/col inside the single advance helper so it can't be skipped.
- Unterminated strings and comments. A
/*with no*/, or a string with no closing quote, must error at the opening position — not silently swallow the rest of the file. - Trying to balance brackets in the lexer. A finite-state machine cannot count nesting depth; matching
{}is the parser's job. Attempting it in the lexer is the classic regular-vs-context-free confusion. - Number edge cases.
3.,.5,1e10,0xFF, and1_000all need explicit handling; a naive "digits and one dot" loop mis-lexes range syntax like1..5by greedily eating the dots. - Unicode and encoding. Identifiers may contain non-ASCII letters; reading raw bytes instead of decoded code points breaks on UTF-8 source. Decode to runes first.
Frequently asked questions
What is the difference between a lexer and a parser?
A lexer turns characters into tokens; a parser turns tokens into a tree. The lexer is a finite-state machine recognizing a regular language (no nesting), so it runs in linear O(n) time. The parser handles a context-free grammar with nested structure — balanced parentheses, block scopes — which a lexer's flat state machine cannot describe.
What is maximal munch in a lexer?
Maximal munch (the longest-match rule) says the lexer consumes the longest sequence of characters that forms a valid token before stopping. It is why "==" lexes as one equality operator rather than two assignments, and why ">>=" is a single shift-assign token. The rule occasionally bites: C's "a+++b" tokenizes as "a ++ + b", never "a + ++ b".
How does a lexer distinguish a keyword from an identifier?
It doesn't, at first. The scanner reads the whole identifier with one rule, then looks the lexeme up in a fixed keyword table. "if" and "while" match the identifier pattern exactly like "width" does; the table lookup reclassifies them. Baking every keyword into the state machine bloats it for no benefit.
Why is a lexer O(n) and not slower?
A DFA-based scanner reads each input character exactly once and makes a single O(1) table lookup per character, so total work is linear in the source length. The only place this breaks is on languages needing backtracking; a true regular-language DFA never backtracks, which is the whole point of compiling regexes to a DFA up front.
Does the lexer handle whitespace and comments?
Usually yes — it recognizes them and then discards them, emitting no token. The exceptions are significant-whitespace languages like Python, where the lexer must track indentation and emit synthetic INDENT and DEDENT tokens, and languages that keep comments for documentation tooling.
What does a token actually contain?
A token is a small record: a type (e.g. NUMBER, IDENT, PLUS), the matched lexeme text, and source position (line and column) for error messages. Literal tokens often also carry a decoded value — the integer 255 for the lexeme "0xff" — so the parser doesn't re-parse the text.