String Algorithms

Aho-Corasick Multi-Pattern Matching

Q: How is Aho-Corasick different from running KMP per pattern?

KMP per pattern is k separate scans of the text: O(k * n + sum(|P_i|)) — n is the text length and k the pattern count. For 1000 patterns over 1 GB of text that's a trillion character compares. Aho-Corasick replaces those k scans with one scan over the same n characters, supported by one combined automaton built once. Time becomes O(n + m + z) where m is total pattern length and z the number of matches. Crossover happens around 4 to 8 patterns; past that, AC dominates by orders of magnitude.

Q: Why do you need output links in addition to failure links?

A failure link sends you to the longest proper suffix that is also a prefix of some pattern. That gives you O(1) state transition but does not by itself tell you which patterns end at the current state's ancestors via fail edges. Without output links, reporting all matches at a node u costs O(depth) — walk fail(u), fail(fail(u)), ... checking each for accept status. Output (or 'dictionary suffix') links precompute that chain so reporting all matches at u takes O(matches at u). It is what turns the algorithm from O(n * depth) reporting into O(n + z) reporting.

Q: How are failure links constructed?

BFS from the root. The root's failure link is itself; every child of the root has failure pointing to the root. For a deeper node u reached from parent p via character c, walk f = fail(p), fail(fail(p)), ... until you reach a node f' that has a child via c; set fail(u) = that child. If you fall off the root without finding such a child, fail(u) = root. This mirrors KMP's prefix function computation but generalized across the entire trie. The whole BFS is O(total_pattern_length * alphabet_size) — linear in the size of the structure.

Q: What does the linear-time argument look like?

Each text character either advances the active state by one trie edge or follows one or more failure links. The active state's depth increases by at most one per consumed character. Failure links strictly decrease depth (they point to shorter prefixes). So the total depth decreases across the run cannot exceed the total depth increases, i.e. n. Therefore the cumulative failure-link work across all n text characters is bounded by n: O(n) total. Plus O(z) to emit matches via output links. The preprocessing is O(m) where m is total pattern length.

Q: What real systems use Aho-Corasick?

Snort and Suricata intrusion-detection engines build an AC automaton over tens of thousands of rule patterns and scan line-rate packet payloads in one pass. ClamAV's multi-pattern matcher scans email attachments against millions of malware signatures. Hyperscan (Intel) compiles literal subsets of regex sets to AC plus bit-parallel verifiers. Lucene's keyword analyzer uses AC for synonym and stop-word recognition. bowtie2 and BWA bioinformatics tools use double-array AC variants over k-mer dictionaries to seed read alignment. Search engines run AC for auto-complete prefix expansion.

Q: How is the automaton represented in memory?

Naive: per-node array of sigma child pointers — fast lookup, O(sigma) memory per state. Better: per-node hash map of children — slower constant-time lookup, O(degree) memory. Production: double-array trie packs the entire goto function into two integer arrays base[] and check[], so transition is base[s] + c with check[base[s] + c] == s as the validity guard. Eliminates pointer chasing and shrinks per-state memory to ~8 bytes amortized. ahocorasick-rs and Java's AhoCorasickDoubleArrayTrie ship multi-million-pattern dictionaries at hundreds of MB/s scan rates.

Trie + failure links + output links — find every occurrence of every pattern in one pass

Aho-Corasick scans a text once for every pattern in a dictionary — O(n + m + z). Build a trie, BFS the failure links, add output links; the search never re-examines a character.

TimeO(n + m + z) — n text, m total pattern length, z matches
PreprocessingO(m × σ) trie + BFS
MemoryO(m × σ) goto + O(m) fail
Failure linksBFS from root (KMP generalized)
Output linksAll matches at a node in O(matches)
Used infgrep, Snort, Suricata, ClamAV, Hyperscan

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How Aho-Corasick works

You have a dictionary of patterns — say {he, she, his, hers} — and a text to search — say "ushers". The naive approach runs each pattern's matcher over the text, paying O(k * n) for k patterns. AC compiles all four patterns into one automaton that walks the text once, never going back, and emits every match the moment it ends.

Construction has three phases:

Trie. Insert every pattern character-by-character starting at the root. Mark each terminal node with the pattern ID(s) ending there.
Failure links. BFS the trie. For each non-root node u reached from parent p via character c, set fail(u) to the longest proper suffix of the string at u that is also a prefix of some pattern (i.e. exists as a path from root). Computed by walking p's failure chain.
Output links. Second BFS pass. For each node u, output(u) is the union of u's own terminal markers with output(fail(u)) — collecting every pattern that ends at u or at any failure-link ancestor.

Search: maintain an "active state" pointer s, starting at root. For each text character c: while s has no child labelled c and s is not root, set s = fail(s); if s does have a child labelled c, advance to that child. Emit every pattern in output(s). The state always knows the longest matched prefix of some pattern; failure links collapse irrelevant context in O(1) amortized.

Trie + failure links for {he, she, his, hers}

           root
         ╱  │  ╲
        h   s
       ╱│   ╲
      e i    h
      │ │    │
      r s    e ── fail: he  (output adds 'he')
      │
      s
     (hers)

Failure links (subset):
  fail(h)       = root
  fail(s)       = root
  fail(he)      = root
  fail(she.h)   = h         (because "h" is suffix-and-prefix)
  fail(she.he)  = he        ← critical: emits 'he' when 'she' completes
  fail(hers)    = s         (suffix "s" is also a prefix)

Run on "ushers":
  u → root (no child)
  s → root.s              (state: s)
  h → s.h via trie        (state: sh)
  e → sh.e via trie       (state: she) → output: [she, he]   ← two matches!
  r → she has no child r; fail(she) = he; he has child r? no; fail(he) = root; root.r? no → root
  s → root.s              (state: s)
  total matches: {she @ 1..3, he @ 2..3}

The "two matches in one step" moment is the algorithm's whole point: output links package up the dictionary suffixes ending at "she" so you don't have to walk the failure chain at runtime.

The pieces

Trie (goto function). The skeleton. Inserting all patterns is O(m * σ) where m is total pattern length and σ is alphabet size. Choice of representation — array per node, hash per node, or compressed double-array — drives memory and cache behavior.

Failure links (fail function). One pointer per non-root node. Together with the trie they form an NFA-like recognizer. Walked at runtime to compensate for "this character doesn't extend the current match".

Output links (dictionary-suffix links). Per-node list (or chain) of pattern IDs that end at this node or at any failure-suffix node. Precomputed via a second BFS so reporting all matches at a node takes O(matches at this node), not O(depth).

Strict DFA (optional). By precomputing goto(s, c) = trie_child_or_resolve(fail-chain(s), c) for every (state, character) pair, the runtime collapses to one table lookup per character — no fail-link walks needed. Memory cost: O(states * σ). Time cost: O(states * σ) one-time. Gain: ~3-5× faster scan throughput.

Why AC matters

Intrusion detection at line rate. Snort and Suricata index 50,000 to 200,000 attack signatures into a single AC automaton. A 10 Gbps link stays line-rate on commodity x86 because the search cost is independent of rule count.
Antivirus signature scanning. ClamAV's MPM (multi-pattern matcher) holds millions of literal signatures in one AC; scanning a megabyte attachment is one linear pass.
Content filtering / DLP. Data-loss-prevention products index tens of thousands of literal terms (credit-card prefixes, employee names, regulated phrases) and check every outbound HTTP/SMTP payload in one pass.
NLP and bioinformatics. Word segmenters for languages without spaces (Chinese, Japanese) build a dictionary AC. Genome aligners (bowtie2, BWA) use AC over k-mer dictionaries to filter candidate matches before detailed alignment.
Search-engine autocomplete. Index the prefix set of a query log; AC finds every completion candidate present in the current input string in O(input).

AC vs other multi/single pattern matchers

	Aho-Corasick	KMP × k	Boyer-Moore × k	Rabin-Karp set	Commentz-Walter	Hyperscan
Patterns	Any set	One at a time	One at a time	Any set (equal length)	Any set	Any regex set
Pre-processing	O(m × σ)	O(\|P\|) per pattern	O(\|P\| + σ) per pattern	O(m)	O(m)	Heavy compile
Search	O(n + m + z)	O(k × n + m)	O(k × n / \|P\|) average	O(n + z) expected	O(n) average, O(n × m) worst	O(n) with SIMD vectorization
Output overlapping	Native	Manual per pattern	Manual per pattern	Native if hashes match	Native	Native
Streaming	Yes — state-resumable	Yes	Limited	Yes	Yes	Yes
Used in	fgrep, Snort, ClamAV	Single-pattern grep	String literals in regex	Set membership checks	Some grep flavors	Suricata, AWS WAF

AC is the asymptotic winner whenever the pattern count and total pattern length matter; Boyer-Moore is faster for single very long patterns; Rabin-Karp is competitive when all patterns share a common length; Hyperscan extends AC with regex support and SIMD batching for production WAFs and IDS.

What AC actually costs

O(n + m + z) — n text, m patterns total length, z matches. The complexity bound that the algorithm is named for. Tight: the amortized argument shows total fail-link traversals across n characters is <= n.
Construction O(m × σ). Inserting each character of each pattern into the trie plus the BFS sweep. σ is alphabet size — 256 for byte-oriented matching, larger for Unicode.
Search throughput on commodity x86. Strict DFA representation: 500 MB/s to 1 GB/s per core with a 10,000-pattern dictionary. NFA representation (lazy fail-link walks): 200-400 MB/s. Hyperscan's vectorized variant: 5-10 GB/s.
Memory for a million-pattern dictionary. Double-array trie: 50-200 MB. Naive sigma-array trie: 1-2 GB. Difference is what production systems choose.
Pattern count scales for free. Adding a thousand patterns to a 10,000-pattern set increases preprocessing by milliseconds; search cost is unchanged because z (match count) is dominated by the actual hits in the text, not the dictionary size.
fgrep performance origin story. Alfred Aho and Margaret Corasick built the algorithm at Bell Labs in 1975 specifically to make fgrep -f patternsfile scale beyond running grep N times.

JavaScript implementation

// Aho-Corasick: insert, build, search. Reports (pattern_id, end_index).

class AhoCorasick {
  constructor() {
    this.goto = [new Map()];      // goto[state] : Map(char → next state)
    this.fail = [0];              // fail[state] = state
    this.out = [[]];              // out[state]  = [patternId, ...]
    this.patterns = [];
  }

  add(pattern) {
    const id = this.patterns.length;
    this.patterns.push(pattern);
    let s = 0;
    for (const c of pattern) {
      const m = this.goto[s];
      if (!m.has(c)) {
        m.set(c, this.goto.length);
        this.goto.push(new Map());
        this.fail.push(0);
        this.out.push([]);
      }
      s = m.get(c);
    }
    this.out[s].push(id);
    return id;
  }

  build() {
    const queue = [];
    // Depth-1 children point fail to root.
    for (const [c, s] of this.goto[0]) {
      this.fail[s] = 0;
      queue.push(s);
    }
    // BFS to set fail and merge output.
    while (queue.length) {
      const r = queue.shift();
      for (const [c, u] of this.goto[r]) {
        queue.push(u);
        // walk fail chain to find a state with edge c
        let state = this.fail[r];
        while (state !== 0 && !this.goto[state].has(c)) state = this.fail[state];
        let target = this.goto[state].get(c);
        if (target === undefined || target === u) target = 0;
        this.fail[u] = target;
        // accumulate output via dictionary-suffix links
        this.out[u] = [...this.out[u], ...this.out[target]];
      }
    }
  }

  * scan(text) {
    let s = 0;
    for (let i = 0; i < text.length; i++) {
      const c = text[i];
      // follow fail links until c is a valid edge from s (or s is root and stops)
      while (s !== 0 && !this.goto[s].has(c)) s = this.fail[s];
      if (this.goto[s].has(c)) s = this.goto[s].get(c);
      // emit matches recorded at this state
      for (const id of this.out[s]) {
        yield { patternId: id, pattern: this.patterns[id], endIndex: i };
      }
    }
  }
}

const ac = new AhoCorasick();
ac.add('he'); ac.add('she'); ac.add('his'); ac.add('hers');
ac.build();
for (const m of ac.scan('ushers')) console.log(m);
// → {pattern: "she", endIndex: 3}, {pattern: "he", endIndex: 3}, {pattern: "hers", endIndex: 5}

The runtime is dominated by the inner while walking the fail chain. For most inputs that loop runs zero times — the active state extends naturally. The amortized argument bounds total fail walks across the whole scan by n.

Python implementation — with strict-DFA flattening

from collections import deque
from typing import Generator

class AhoCorasick:
    """Aho-Corasick automaton with optional strict-DFA flattening."""

    def __init__(self):
        self.goto: list[dict[str, int]] = [{}]
        self.fail: list[int] = [0]
        self.out: list[list[int]] = [[]]
        self.patterns: list[str] = []
        self._dfa: list[dict[str, int]] | None = None

    def add(self, pattern: str) -> int:
        pid = len(self.patterns)
        self.patterns.append(pattern)
        s = 0
        for c in pattern:
            if c not in self.goto[s]:
                self.goto[s][c] = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            s = self.goto[s][c]
        self.out[s].append(pid)
        return pid

    def build(self) -> None:
        q: deque[int] = deque()
        for c, s in self.goto[0].items():
            self.fail[s] = 0
            q.append(s)
        while q:
            r = q.popleft()
            for c, u in self.goto[r].items():
                q.append(u)
                state = self.fail[r]
                while state and c not in self.goto[state]:
                    state = self.fail[state]
                self.fail[u] = self.goto[state].get(c, 0) if self.goto[state].get(c, 0) != u else 0
                self.out[u] = self.out[u] + self.out[self.fail[u]]
        self._dfa = None  # invalidate any cached DFA

    def flatten_to_dfa(self) -> None:
        """Precompute goto(s, c) for every (state, char) — strict DFA, no fail walks."""
        # For each state, fill any 'missing' edge by resolving through the fail chain.
        # BFS so parents are processed first.
        dfa: list[dict[str, int]] = [dict(self.goto[s]) for s in range(len(self.goto))]
        q: deque[int] = deque([0])
        while q:
            r = q.popleft()
            for c, u in self.goto[r].items():
                if u != r: q.append(u)
            if r == 0:
                continue
            # Fill missing edges from fail[r]'s edges (already flattened by BFS order).
            f = self.fail[r]
            for c, target in dfa[f].items():
                dfa[r].setdefault(c, target)
        self._dfa = dfa

    def scan(self, text: str) -> Generator[tuple[int, str, int], None, None]:
        if self._dfa is not None:
            s = 0
            for i, c in enumerate(text):
                s = self._dfa[s].get(c, 0)
                for pid in self.out[s]:
                    yield pid, self.patterns[pid], i
            return
        s = 0
        for i, c in enumerate(text):
            while s and c not in self.goto[s]:
                s = self.fail[s]
            s = self.goto[s].get(c, 0)
            for pid in self.out[s]:
                yield pid, self.patterns[pid], i

The DFA flattening trades memory for throughput: one table lookup per character with no runtime fail traversals. Production implementations such as ahocorasick-rs ship both representations and pick by pattern-set size — small dictionaries run the strict DFA, million-pattern dictionaries run NFA mode to keep memory under a few hundred MB.

Variants and real-world deployments

Original Aho-Corasick (1975). Bell Labs CACM paper that introduced the algorithm for fgrep -f. Two automata: NFA with fail links, or strict DFA with flattened goto.

Double-array Aho-Corasick. The goto function packed into two integer arrays (base[] and check[]) — O(1) transitions, ~8 bytes per state. Used by ahocorasick-rs, Java's stupidcoder/AhoCorasickDoubleArrayTrie, and Chinese/Japanese word segmenters with multi-million-pattern dictionaries.

Hyperscan (Intel). Production WAF/IDS regex engine. Compiles regex sets into a hybrid of literal-set AC and bit-parallel matchers, vectorized with SSE/AVX. Used by Suricata, AWS WAF, F5.

ClamAV MPM. Aho-Corasick over million-signature dictionaries plus Bloom-filter prefilters for short signatures. The reason ClamAV scans gigabyte mailboxes in seconds.

Wu-Manber. Multi-pattern matcher using a block-based shift heuristic similar to Boyer-Moore. Faster than AC for very large alphabets and patterns with rare characters; loses on small alphabets.

Commentz-Walter. AC combined with Boyer-Moore right-to-left shifting. Common in some grep implementations for the case of moderate pattern count.

Aho-Corasick with mismatches. Generalized variants (Cole-Hariharan) that allow up to k mismatches; bioinformatics applications.

Suffix automata. Dual problem: AC indexes a small dictionary against a long text; a suffix automaton indexes a long text against many small pattern queries.

Common bugs and edge cases

Forgetting output links. Without precomputed dictionary-suffix links, reporting all matches at a state walks the failure chain at runtime, costing O(depth) per text position. Loses the O(n + z) bound.
Overlapping patterns. "he" and "she" both match at position 3 in "ushers". Beginners deduplicate too aggressively and lose matches. AC naturally reports overlaps; if you want only longest-leftmost, post-filter.
Wrong fail-link computation for depth-1 children. Their failure pointer is root, not back to themselves. Off-by-one here breaks every subsequent fail chain.
Alphabet mismatch between build and scan. Building over ASCII but scanning UTF-8 multibyte sequences double-counts characters and misses matches across multi-byte boundaries.
Streaming state checkpoint. AC's state is a single integer; you can stop and resume across chunked input. Implementations that reset state to root at chunk boundaries miss patterns spanning the boundary.
Memory blowup for σ = 256 nodes. Per-node 256-pointer arrays at millions of states explode to GB of RAM. Switch to hash maps or double arrays.
Constructing the DFA on Unicode. Flattening goto over a 1M-codepoint alphabet costs absurd memory. Production Unicode AC keeps NFA mode and resolves fail walks at runtime.
Empty pattern. Insertion of an empty string puts the accept marker at root; every position trivially matches. Reject empty patterns or document the behavior.

Frequently asked questions

How is Aho-Corasick different from running KMP per pattern?

KMP per pattern is k separate scans of the text: O(k * n + sum(|P_i|)) — n is the text length and k the pattern count. For 1000 patterns over 1 GB of text that's a trillion character compares. Aho-Corasick replaces those k scans with one scan over the same n characters, supported by one combined automaton built once. Time becomes O(n + m + z) where m is total pattern length and z the number of matches. Crossover happens around 4 to 8 patterns; past that, AC dominates by orders of magnitude.

Why do you need output links in addition to failure links?

A failure link sends you to the longest proper suffix that is also a prefix of some pattern. That gives you O(1) state transition but does not by itself tell you which patterns end at the current state's ancestors via fail edges. Without output links, reporting all matches at a node u costs O(depth) — walk fail(u), fail(fail(u)), ... checking each for accept status. Output (or 'dictionary suffix') links precompute that chain so reporting all matches at u takes O(matches at u). It is what turns the algorithm from O(n * depth) reporting into O(n + z) reporting.

How are failure links constructed?

BFS from the root. The root's failure link is itself; every child of the root has failure pointing to the root. For a deeper node u reached from parent p via character c, walk f = fail(p), fail(fail(p)), ... until you reach a node f' that has a child via c; set fail(u) = that child. If you fall off the root without finding such a child, fail(u) = root. This mirrors KMP's prefix function computation but generalized across the entire trie. The whole BFS is O(total_pattern_length * alphabet_size) — linear in the size of the structure.

What does the linear-time argument look like?

Each text character either advances the active state by one trie edge or follows one or more failure links. The active state's depth increases by at most one per consumed character. Failure links strictly decrease depth (they point to shorter prefixes). So the total depth decreases across the run cannot exceed the total depth increases, i.e. n. Therefore the cumulative failure-link work across all n text characters is bounded by n: O(n) total. Plus O(z) to emit matches via output links. The preprocessing is O(m) where m is total pattern length.

What real systems use Aho-Corasick?

Snort and Suricata intrusion-detection engines build an AC automaton over tens of thousands of rule patterns and scan line-rate packet payloads in one pass. ClamAV's multi-pattern matcher scans email attachments against millions of malware signatures. Hyperscan (Intel) compiles literal subsets of regex sets to AC plus bit-parallel verifiers. Lucene's keyword analyzer uses AC for synonym and stop-word recognition. bowtie2 and BWA bioinformatics tools use double-array AC variants over k-mer dictionaries to seed read alignment. Search engines run AC for auto-complete prefix expansion.

How is the automaton represented in memory?

Naive: per-node array of sigma child pointers — fast lookup, O(sigma) memory per state. Better: per-node hash map of children — slower constant-time lookup, O(degree) memory. Production: double-array trie packs the entire goto function into two integer arrays base[] and check[], so transition is base[s] + c with check[base[s] + c] == s as the validity guard. Eliminates pointer chasing and shrinks per-state memory to ~8 bytes amortized. ahocorasick-rs and Java's AhoCorasickDoubleArrayTrie ship multi-million-pattern dictionaries at hundreds of MB/s scan rates.