Theory

Regex Matching

Pattern languages compile to NFAs and DFAs — Ω(n) when done right

A regular expression compiles to a finite automaton (NFA or DFA) that consumes input one character at a time. NFA matching runs in O(m × n) — m pattern, n input — guaranteeing linear time. Backtracking implementations (PCRE, JavaScript) can be exponential on adversarial inputs ("ReDoS" — regular expression denial of service). The choice between NFA and backtracking is one of the most consequential decisions in modern systems software.

  • NFA matching timeO(m × n) — guaranteed linear
  • Backtracking time (worst)O(2^n) — exponential ReDoS risk
  • NFA vs DFANFA = thompsons; DFA = subset construction; DFA is faster but exponential to build
  • Used bygrep, ripgrep, RE2 (Go regex) — NFA-based; PCRE, JS, Java — backtracking
  • Famous attackReDoS — adversarial pattern + input causes seconds-to-hours of CPU
  • Pattern compiled toBytecode (PCRE), DFA table (RE2), NFA states (Thompson)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How regex matching works

A regular expression is a pattern language whose semantics map exactly to a finite automaton. a*b matches any string of zero-or-more 'a's followed by 'b' — that's a 3-state NFA:

     ε                  a
  s0 ─────► s1 ◄──────────►  s2 ────b────► s3 (accept)
            (loop on 'a')

To match input, simulate the NFA — track all states it could be in, advance them character by character. The match succeeds if any active state is an accept state at the end.

Two main algorithmic approaches:

  • NFA simulation (Thompson): O(m × n). Maintain a set of active states; for each input character, compute the new set of states. Linear in input size, regardless of pattern complexity.
  • Backtracking: Walk the pattern recursively, trying each possible match path; when a path fails, back up and try the next. Can be exponential in pathological cases. Used by PCRE, Perl, Python re, Java, JavaScript.

NFA simulation vs backtracking

NFA simulationBacktracking
Worst-case timeO(m × n)O(2^n) on adversarial input
Backreferences (\1)Not supported (NP-hard)Supported
LookaroundLimited / not supportedSupported
Common-case speedFastOften fast — but unpredictable
ReDoS resistantYesNo (mitigations exist)
Used byRE2 (Go, Chrome), grep, ripgrep, awkPCRE, Perl, Python, Java, JavaScript

The NFA approach gives up some features (backreferences, full lookaround) for guaranteed performance. The backtracking approach gives up performance guarantees for richer features. Modern systems increasingly bias toward the NFA approach because production systems can't afford ReDoS.

Catastrophic backtracking — the ReDoS attack

A regex like (a+)+$ on input "aaaaaaaaaaaaa!" takes:

Input lengthTime (Node.js)
10 a's~1 ms
20 a's~1 second
30 a's~17 minutes
40 a's~12 days (extrapolated)

Why? The pattern allows multiple ways to partition the a's. The engine tries every combination — 2^n possibilities. The trailing ! never matches, so every combination fails before the engine concludes failure. Adversarial inputs trigger this on real services.

Mitigations:

  • Use NFA-based engines (Go's regexp, RE2). Linear time, no backtracking.
  • Set timeouts on regex matching (Node.js doesn't natively support this; need worker threads).
  • Static analysis tools (rxxr2, redos-detector) flag suspicious patterns at lint time.
  • Avoid nested quantifiers in user-controlled patterns.
  • Use atomic groups ((?>...)) or possessive quantifiers (a++, a*+) to disable backtracking on specific subgroups.

Thompson's NFA construction

Each regex operator has a small construction:

  • Single character a: 2 states, transition labeled 'a'.
  • Concatenation AB: connect A's accept to B's start with epsilon.
  • Alternation A|B: new start with epsilon to both A and B; new accept reached by epsilon from both A and B's accepts.
  • Kleene star A*: new start with epsilon to A's start AND to new accept; loop epsilon from A's accept back to A's start AND forward to new accept.

The result is at most 2 × pattern_length states — linear in pattern size. Building takes O(pattern); matching takes O(pattern × input).

JavaScript: matching with the built-in engine

// Standard match
const re = /\b(\w+)@(\w+\.\w+)\b/g;
'Email me at [email protected] or [email protected]'.match(re);
// → ['[email protected]', '[email protected]']

// Named groups
const re2 = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
'2025-01-15'.match(re2)?.groups;
// → { year: '2025', month: '01', day: '15' }

// ReDoS-vulnerable pattern (do NOT use on user input)
const dangerous = /^(a+)+$/;
// dangerous.test('aaaaaaaaaaaaaaaaaaaaaaaaaa!');  // hangs

// Atomic group (disables backtracking) — not supported in JavaScript
// In Java/.NET/Perl: (?>a+)+$ — provably linear

Python re module

import re

# Compile once, match many
pattern = re.compile(r'(?P<name>\w+)@(?P<domain>[\w.]+)')
match = pattern.search('Send to [email protected]')
if match:
    print(match.group('name'), match.group('domain'))

# findall — all non-overlapping matches
emails = pattern.findall('[email protected] [email protected]')
# [('alice', 'example.com'), ('bob', 'example.com')]

# sub — substitute
re.sub(r'\d+', '[NUM]', 'order 42 has 7 items')  # 'order [NUM] has [NUM] items'

# Python's re uses backtracking — same ReDoS risks as PCRE
# For ReDoS safety, use the regex package (not re):
# pip install regex; import regex; regex.match(..., timeout=1.0)

When to use a regex (and when not)

  • Use: Pattern matching (extracting fields from text), validation (checking a string format), search-and-replace, lexers, log parsing, simple scraping. The regex is a clear, declarative spec.
  • Don't use: Parsing recursive structures (HTML, JSON, programming languages). Famous: "Regex can't parse HTML" — the language has nesting; finite automata can't count. Use a real parser.
  • Don't use: Complex multi-line context-sensitive matching. Stitching together regex with conditionals quickly becomes unreadable; switch to a real parser or state machine.
  • Don't use: User-supplied patterns in a backtracking engine. ReDoS risk. Either restrict to safe subsets or switch to RE2.

Famous regex engines and their algorithms

EngineAlgorithmReDoS-safeBackreferences
RE2 (Go regexp, Chrome)NFA simulationYesNo
grep, ripgrepNFA / DFA hybridYesLimited
PCRE (PHP, nginx)BacktrackingNo (mitigated)Yes
JavaScript (V8 / WebKit)Backtracking with optimizationsNoYes
Java java.util.regexBacktrackingNoYes
Python reBacktrackingNoYes
HyperscanMulti-pattern NFA, SIMDYesNo

Common regex bugs

  • Greedy vs lazy quantifiers. .* matches as much as possible; .*? matches as little. Wrong choice produces wrong matches. <.*> on "<a><b>" matches the whole thing; <.*?> matches just <a>.
  • Forgetting to escape special characters. ., *, (, +, ?, |, $, ^, {, }, [, ], \ all need escaping when meant literally. Use \., \*, etc.
  • ReDoS-vulnerable patterns. Nested quantifiers ((a+)+), overlapping alternations ((a|aa)*), excessive lookaheads. Easy to write, hard to spot in review.
  • Multiline vs singleline mode. . doesn't match newlines by default; ^ and $ match start/end of input by default. Flags vary by language: /m, /s, (?m), (?s).
  • Unicode handling. \w, \d, [a-z] may or may not match Unicode characters depending on flags and language. Always test with non-ASCII input if relevant.
  • Compiling in a hot loop. Compiling the same pattern thousands of times kills performance. Compile once outside the loop, match many times.

Frequently asked questions

What's the difference between an NFA and a DFA?

NFA (Non-deterministic Finite Automaton) can be in multiple states simultaneously and has epsilon (empty) transitions. DFA (Deterministic Finite Automaton) has exactly one state at a time and no epsilons. Every NFA can be converted to an equivalent DFA via subset construction — but the DFA can have up to 2^N states for an N-state NFA. NFAs are smaller (linear in pattern); DFAs are faster to match but expensive to build.

What's ReDoS?

Regular Expression Denial of Service. An attacker crafts input that causes catastrophic backtracking in a backtracking regex engine — exponential time on inputs that look benign. The classic vulnerable pattern is <code>(a+)+$</code> with input <code>"aaaaaaaaaaaaa!"</code> — the engine tries every way to split the as among the nested quantifiers. Real attacks have taken down production servers; CVE-2019-12041 (Stack Exchange), CVE-2017-15707 (LinkedIn).

Why does Go's regex engine refuse certain patterns?

Go's standard library uses RE2, an NFA-based engine that guarantees linear time. Backreferences (<code>\1</code> referring to a previously-matched group) and lookaround require backtracking, so RE2 doesn't support them. The trade-off is intentional — Go gives up some pattern expressiveness for guaranteed performance and ReDoS immunity.

How does Thompson's construction work?

Each regex operator becomes a small NFA building block. Concatenation glues two NFAs end-to-end. Alternation adds an epsilon-branching start state. The Kleene star creates a loop with epsilon back-edges. The constructed NFA has at most 2 × (length of pattern) states — linear, simple to build, simple to match.

When should I use a DFA over an NFA?

For static patterns matched many times against varying input — DFA matching is simpler (one state at a time) and faster per character. The build cost (subset construction, potentially exponential states) only pays off when amortized across many matches. Lexers in compilers convert their regex rules to a single combined DFA at compile time. For one-shot matches with dynamic patterns, NFA simulation is preferred.

Why are some regex engines so much faster than others?

Different algorithm. ripgrep / grep / RE2 use NFA simulation — guaranteed linear time. PCRE / Perl / JS / Java use backtracking — sometimes faster on simple patterns, exponentially slower on adversarial ones. For searching huge files (gigabytes), the NFA-based ones can be 10-100× faster — and never crash on adversarial patterns.

How does regex compilation work?

Parse the pattern into an AST. Lower the AST to NFA states (Thompson construction) or to DFA tables (subset construction) or to bytecode (PCRE). Cache the compiled representation. Match input against the compiled form. Compilation is one-time per pattern; matching is the hot path.