String Algorithms

Aho-Corasick Algorithm

Build a trie, add failure links, find all occurrences of k patterns in O(n + m + z)

Aho-Corasick is a string-matching algorithm that locates all occurrences of any pattern from a finite set within a text in a single pass, with time complexity O(n + m + z) — n = text length, m = total pattern length, z = number of matches. Built in 1975 by Alfred Aho and Margaret Corasick (Bell Labs) for fgrep. The data structure is a trie of all patterns augmented with failure links (analogous to KMP's failure function but generalized to multiple patterns), enabling the search to fall back instantly without re-examining text characters.

  • Time complexityO(n + m + z)
  • Pre-processingO(m × σ)
  • MemoryO(m × σ) goto, O(m) failure
  • InventedAho & Corasick 1975
  • Used infgrep, malware scanners, IDS
  • Failure linksBFS from root

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Why Aho-Corasick matters

  • Network intrusion detection. Snort 3 and Suricata index 50,000 to 200,000 rule patterns into a single AC automaton; line-rate matching on 10 Gbps links is achievable on commodity x86 because the search is one pass over the packet stream regardless of rule-set size. Add a pattern and the matching cost stays O(n).
  • Antivirus signatures. ClamAV's MPM (multi-pattern matcher) builds an Aho-Corasick automaton over its 8 million-signature database, augmented with prefilter Bloom hashing. A Boyer-Moore loop per signature would take 8 million passes; AC takes one.
  • Content filters and DLP. Data-loss-prevention products check outbound email/HTTP for tens of thousands of regex'd or literal terms (credit-card prefixes, employee names, project codes). AC compiles the literal alternation into a single automaton.
  • Bioinformatics. bowtie2 and BWA use AC variants over k-mer dictionaries to seed read alignment; this preselects candidate genome positions before performing detailed alignment with Smith-Waterman.
  • Auto-complete and tokenizers. Chinese/Japanese word segmenters (Jieba, MeCab) use double-array AC to find all dictionary words in a sentence simultaneously, then choose a segmentation by dynamic programming over the candidate set.
  • Compiler/lexer warm-up. Lex-style tools that match a fixed set of keywords build an AC automaton; the failure links collapse to the standard DFA edges that recognize the alternation.
  • Log mining. Splunk, Elastic, and custom log pipelines run AC for "matched any of these literal phrases" filters because it is the asymptotically optimal multi-pattern primitive.

Common misconceptions

  • Only for fgrep. Modern WAFs (Cloudflare, AWS WAF), log analyzers (Splunk, Datadog), bioinformatics tools, NLP tokenizers, and search-engine query analyzers all use Aho-Corasick under the hood. fgrep is the historical reference, not the only deployment.
  • Memory-heavy. Naively each node stores a sigma-sized child array — for 8-bit alphabets that is 256 pointers per state. Production implementations use double-array tries, hash-mapped children for sparse states, and quantized state IDs to fit a million-pattern dictionary in 50 to 200 MB. Hyperscan compresses further with bytecode.
  • Same speed as running KMP k times. KMP repeated k times is O(kn) per query. AC is O(n + z) per query after a one-time O(m) build. For k = 1000 patterns and n = 1 GB, that's three orders of magnitude difference.
  • Only handles exact matches. Wu-Manber, Commentz-Walter, and Aho-Corasick-with-mismatches generalize to k-mismatch matching. Hyperscan compiles regex sets to AC plus bit-parallel acceptors for non-anchored regex matching.
  • Output links are optional. Without precomputed output (dictionary-suffix) links, finding all matches at a given state requires walking the failure chain, costing O(depth) per state. Output links collapse this to O(1) per match and are required to attain the O(n + z) bound.
  • The trie has to be a literal pointer tree. Once built, the goto function can be flattened into a sparse-matrix DFA that fits in CPU caches. ripgrep's aho-corasick crate ships both noncontiguous-NFA and DFA representations and switches by pattern-set size.

Construction and search in detail

Construction proceeds in three phases. Phase 1: insert each pattern into a trie, marking the terminal node with the pattern ID. Phase 2: BFS over the trie computing fail(u) for every non-root node. The root and its children point fail to root. For deeper u with parent p reached via character c, walk f = fail(p), fail(fail(p)), ... until you reach a node f' with a child via c (call it g) — set fail(u) = g — or until you fall off the root, in which case fail(u) = root. Phase 3: compute output links by another BFS, where output(u) = u if u is terminal, else union of output(fail(u)) with the pattern-IDs at u.

Search reads the text one character at a time, maintaining a current state s. To consume character c: while s has no child labelled c and s is not root, set s = fail(s); then if s does have a child labelled c, set s = that child. Emit every pattern in output(s). The amortized argument: state depth increases by at most 1 per text character and decreases monotonically along fail edges, so the total fail-link work across all n characters is O(n).

Practical implementation notes

  • Choose representation by k and sigma. Small k and small sigma — fully compiled DFA fits easily. Large k (millions) and small sigma — double-array trie. Large sigma (Unicode) — hash-mapped children.
  • Pre-flatten goto. Replace runtime fail-link walks by precomputing goto(s, c) for every (state, character) pair. This is a strict DFA: O(1) work per text character, at the cost of O(states × sigma) memory.
  • Aho-Corasick handles overlapping matches naturally. Output links report all patterns ending at a given state, including patterns that are suffixes of other patterns. Be careful when you only want non-overlapping matches — those need an external longest-leftmost greedy pass.
  • Streaming. Unlike Boyer-Moore, AC is purely online: state s captures everything you need to know about the past. You can checkpoint and resume across reads with a single integer.
  • Memory-aligned states. Pack state IDs into 32-bit integers, lay out goto tables in row-major order, and many queries become single cache-line loads. Hyperscan goes further with bit-parallel state vectors.

Frequently asked questions

How does Aho-Corasick build the failure links?

Failure links are computed by a breadth-first sweep of the trie. The root and its direct children get failure pointers to the root. For each deeper node u with parent p reached via character c, fail(u) is computed by following p's failure chain, looking for a node that has a child via c, and pointing fail(u) to that child (or root if none exists). This mirrors KMP's prefix function but is built across the entire pattern set in O(total_pattern_length × alphabet_size). Each node's fail link points to the longest proper suffix of the string ending at that node that is also a prefix of some pattern.

Why is it linear in input + matches?

Each text character is consumed exactly once and triggers at most one trie transition or one failure-link traversal. An amortized argument identical to KMP's shows the total fail-link traversal across all n input characters is O(n). Reporting the z matches takes O(z) total because output links (a precomputed list at each node of all patterns ending in any suffix of the path to that node) let you emit them in constant time per match. The O(m) preprocessing dominates only when patterns are large; the O(n) search is independent of how many patterns there are.

When does it beat naive multi-pattern?

Naive multi-pattern is k applications of single-pattern search — O(k × n) for k patterns and text length n. Aho-Corasick is O(n + m + z), where m is the sum of pattern lengths. Crossover happens around k = 4 to 8 patterns of equal length. For an antivirus database with 100,000 signatures, AC scans a 1 MB file in a single pass; Boyer-Moore would need 100,000 passes. ClamAV's MPM (multi-pattern matcher) is essentially Aho-Corasick augmented with bloom-filter fast paths for short signatures.

What is double-array Aho-Corasick?

A double-array trie stores transitions in two integer arrays (base[] and check[]) instead of per-node hash maps or pointer arrays. State s with character c transitions to base[s] + c, valid if check[base[s] + c] = s. This eliminates pointer chasing and shrinks per-state memory from O(sigma) to O(1) amortized, at the cost of more elaborate construction. Production implementations such as ahocorasick-rs and Java's stupidcoder/AhoCorasickDoubleArrayTrie handle multi-million-pattern dictionaries (Chinese word segmentation, gene sequences) at hundreds of megabytes per second.

Where is it used in practice?

Snort and Suricata intrusion-detection systems run AC over packet payloads to match thousands of attack signatures. ClamAV scans email attachments and files for malware. Hyperscan (Intel's regex engine, used by Suricata and CDN WAFs) compiles literal subsets of regex sets to AC. Lucene's payload analyzer uses AC for stop-word and synonym detection. Bioinformatics tools (bowtie2's seed matcher, BLAST's prefilters) use AC to locate k-mers. Search engines run AC for query-suggestion auto-complete. Anywhere you need to find any of N strings inside text in a single pass, AC is the textbook answer.

How does it relate to suffix automata?

An Aho-Corasick automaton is a DFA that accepts the language {x : x ends with at least one pattern in the dictionary}. Adding the failure links and following them while reading input produces the same language; the automaton can be made into a strict DFA by precomputing goto(s, c) = either the trie child if it exists, or goto(fail(s), c) recursively, eliminating runtime fail-link traversal. A suffix automaton instead recognizes every substring of a fixed text — a complementary problem. AC indexes a small set of strings and scans a long text; the suffix automaton indexes a long text and answers many string queries.