String Algorithms

Suffix Automaton

O(n) construction, O(n) states — minimal automaton accepting every substring

A suffix automaton (SAM) of string s is the smallest deterministic finite automaton that recognizes every suffix (and therefore every substring) of s. It has at most 2n − 1 states and 3n − 4 transitions, and is built online in O(n × σ) time where σ = alphabet size — invented by Anatoly Blumer in 1985 and refined by Crochemore. Each state corresponds to an equivalence class of right-extensions; the automaton compactly encodes all substrings in linear space, enabling O(n+m) substring search, O(n²) longest-common-substring of two strings (or O(n log n) via SAM intersection), and online queries.

  • States≤ 2n−1
  • Transitions≤ 3n−4
  • ConstructionOnline O(n × σ)
  • ApplicationSubstring queries, LCS, distinct substrings
  • InventedBlumer 1985
  • Equivalent toSuffix tree of reverse + DAWG

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Why suffix automaton matters

  • Compressed substring index. SAM stores all O(n²) substrings of s in O(n) space — the strongest possible representation. A 1 MB string has up to 500 GB of distinct substrings stored implicitly in roughly 16 MB of SAM nodes and edges.
  • Online construction. Append a character, update SAM in amortized O(σ) extra work. No batch rebuild required, unlike suffix arrays. Crucial for streaming text — e.g., real-time log search, chat-server search-index updates, IDE incremental indexers.
  • Competitive programming workhorse. Russian and Eastern European competitive programming (Codeforces, ACM ICPC) treat SAM as a default tool for any string problem more sophisticated than KMP. Problems involving "count distinct substrings", "longest common substring of two strings", "kth lexicographic substring", or "occurrence count of substring" reduce to SAM in 30 lines.
  • Bioinformatics motif search. Building a SAM over a chromosome lets you query every k-mer in O(k) time, supporting motif scans, primer-design candidate filtering, and approximate-repeat detection. SuffixSAM-rs ships SAM-based gene-annotation tools.
  • Search-engine index compaction. Lucene's auto-suggest builds a SAM-like FST (Finite State Transducer) over the lexicon. Same min-DFA principle, allows O(prefix-length) lookup over millions of terms in tens of megabytes.
  • Plagiarism detection and string similarity. Longest common substring of two strings A and B in O(|A| + |B|): build SAM(A), feed B through it tracking maximum match length. Classic 30-line solution.
  • Duality with suffix tree. Many algorithms expressed on suffix trees translate directly to SAMs and run with smaller constants. The Crochemore-Vérin construction merges the two.

Common misconceptions

  • Harder than suffix tree. SAM's online construction by Blumer is in fact simpler than Ukkonen's online suffix-tree construction — about 50 lines of code versus 200, with no edge labels and no implicit-leaf tricks. Most CP libraries include SAM but not Ukkonen for this reason.
  • Only academic. Used in indexers (Lucene/Solr's FST), Chinese word segmenters, code-search tools (e.g., the literal-string fast paths in Hyperscan), and pattern miners. The data structure is operational, not just theoretical.
  • Need full alphabet table per state. Naively each state has σ outgoing pointers, costing O(n × σ) memory. Production SAMs use hash maps or sorted small-vectors per state, dropping to O(n) memory at the cost of O(log σ) per transition. For ASCII (σ = 128) the dense table is fine.
  • SAM is just a suffix tree. Distinct structure. SAM is a DAG of states keyed by endpos equivalence. The suffix tree of reverse(s) is structurally isomorphic to SAM(s), but the algorithms and the "what each state represents" stories differ. Use whichever fits the query.
  • You always need cloning. The clone step in Blumer's construction is conditional — only when the existing transition target's len is strictly greater than parent.len + 1. Roughly half of insertions don't clone in practice. Avoiding unnecessary clones is the difference between 2n − 1 states and a much larger automaton.
  • Memory is small. Per-state overhead is len, link, and a transition map: typically 30 to 80 bytes. A 100 MB string yields a SAM with up to 200M states and ~16 GB. Compact representations (bit-packed transitions, varint-encoded lens) shave this to 4 GB. Still substantial.

Construction in detail

Initialize with one state, the initial state q0 with len = 0 and link = null. Maintain a pointer 'last' to the state that corresponds to the entire current string seen so far (initially q0). To extend the string by character c: create cur with len(cur) = len(last) + 1; let p = last and walk along suffix links: while p is not null and p has no c-transition, set next(p, c) = cur and p = link(p). If p becomes null, set link(cur) = q0. Otherwise, let q = next(p, c). If len(q) = len(p) + 1, set link(cur) = q. Otherwise, clone q: create clone with len = len(p) + 1, copy q's transitions and link to clone, set link(q) = link(cur) = clone, then while p is not null and next(p, c) = q, set next(p, c) = clone and p = link(p). Set last = cur. Each insertion does O(amortized 1) suffix-link walks, plus the transition-redirect loop, totaling O(n × σ) for the entire construction.

To find a pattern P in s: walk SAM(s) starting at q0 reading P character by character. If a transition exists at every step, P is a substring of s. Cost is O(|P| log σ) using map-based transitions or O(|P|) with array-based.

Applications and idioms

  • Count distinct substrings. Sum of (len(v) − len(link(v))) over all non-initial states. O(n).
  • Longest common substring of two strings. Build SAM(A). Walk B through SAM(A) tracking current state v and length L: on character c, while v has no c-transition follow link, decrementing L appropriately; record max L. O(|A| + |B|).
  • k-th lexicographic substring. Topologically sort SAM, count number of paths from each state, then descend choosing the smallest character at each step until you reach the kth path.
  • Number of occurrences of pattern P in s. Walk to the state for P, return |endpos(state)|. Endpos sizes are computed by topological sort and DAG aggregation in O(n).
  • Generalized SAM. Build a SAM over a set of strings simultaneously by sharing the initial state. Common in plagiarism-detection over corpora, where many strings need joint substring analysis.

Frequently asked questions

How does a suffix automaton differ from a suffix tree?

A suffix tree is a compacted trie of all suffixes of s — a tree with O(n) leaves and internal nodes. A suffix automaton is a DAG of states; each state is an equivalence class of substrings sharing the same set of right-extension positions in s. The two structures are duals: SAM(s) is isomorphic in structure to a compacted form of the suffix tree of reverse(s), and vice versa. SAM has at most 2n − 1 states and 3n − 4 transitions versus the suffix tree's 2n − 1 nodes and 2n − 2 edges; the constants are similar but the SAM construction (Blumer's online algorithm) is significantly easier to implement than suffix-tree construction (Ukkonen, McCreight).

What is a 'state' in SAM?

A state represents an equivalence class of substrings of s. Two substrings are equivalent if they have exactly the same set of ending positions in s — formally, the same 'endpos' set. For each state v we record three things: 'len(v)' the maximum length of a substring in its class; 'link(v)' the suffix-link pointer to the state of the longest proper suffix not in v's class; and 'next(v, c)' the transitions out. The number of distinct endpos sets is at most 2n − 1, which is why SAM has linear states.

How is online construction done in linear time?

Blumer's online algorithm extends SAM(s) to SAM(s + c) in amortized O(σ) time per appended character. Maintain 'last' — the state corresponding to the entire current string. To add character c: create a new state cur with len = len(last) + 1. Walk up the suffix-link chain from last, adding transitions to cur for any state without a c-transition. When you find a state p with a c-transition to q: if len(q) = len(p) + 1, set link(cur) = q; else clone q into a new state q' with len = len(p) + 1, retarget p's c-transition to q', set link(q) = link(cur) = q', and continue up the chain redirecting any other p with c-transition to q. Total work across all insertions amortizes to O(n × σ); aggregate analysis using the suffix-link depths bounds it.

How do you count distinct substrings using SAM?

The number of distinct substrings of s equals the sum over all non-initial states v of (len(v) − len(link(v))) — that is, the number of distinct substring lengths represented at v. This evaluates in O(n) over the SAM. Equivalently, each substring corresponds to a unique path of length equal to its length from the initial state, and counting paths gives the same total. For a string of length n the count can be as large as n(n + 1)/2, so storing all substrings explicitly would be O(n²); SAM gives O(n) representation and O(n) query.

When is SAM preferable to suffix array?

SAM is online — you can append characters and update the automaton in amortized O(σ) per character. Suffix arrays require a full re-sort or DC3 reconstruction. SAM is preferable for streaming inputs (text being typed, log streams) and for problems that need substring queries during construction. SAM is also more natural for problems involving counts of substrings or distinct-substring enumeration. Suffix arrays win for problems that need lexicographic ordering of suffixes, range LCP queries with sparse tables, or pattern matching when memory layout matters more than online updates. Most competitive-programming problems can be solved either way; the choice often comes down to which the contestant has memorized.

What's the link to Directed Acyclic Word Graphs (DAWGs)?

A Directed Acyclic Word Graph (DAWG, also called a Directed Acyclic Subsequence Graph for some variants) is the same structure as a suffix automaton but recognizing all substrings rather than only suffixes. The two terms are used interchangeably for SAM in much of the literature; Blumer's 1985 paper is titled 'The smallest automaton recognizing the subwords of a text' and used 'DAWG'. Modern competitive-programming references (e-maxx, CP-Algorithms) prefer 'suffix automaton'. The DAWG name persists in lexicon-compression libraries (e.g., MARISA-trie, dawg-rust) where the same machinery compresses dictionaries.