String Algorithms
Suffix Tree (Ukkonen's Algorithm)
Every suffix in one tree — built left to right, in linear time
A suffix tree is a compressed trie of every suffix of a string, built in O(n) time by Ukkonen's algorithm, that answers "is P a substring?" in O(|P|) and powers longest-common-substring, repeats, and matching statistics.
- ConstructionO(n)
- Substring queryO(|P|)
- Nodes≤ 2n
- Memory≈ 20n bytes
- InventedWeiner 1973 · Ukkonen 1995
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
A trie of every suffix, squeezed
Take the string banana. Its suffixes are banana, anana, nana, ana, na, and a. If you insert all six into an ordinary trie, you get a tree where any substring of banana is exactly a path that starts at the root — because every substring is a prefix of some suffix. That single observation is the whole point: a substring search becomes a root-to-somewhere walk, costing one step per character of the pattern, regardless of how long the text is.
The naive trie is too big — it can have O(n²) nodes, one per character of every suffix. The fix is path compression: any chain of single-child nodes collapses into one edge labeled with a whole substring. A compressed suffix trie is a suffix tree. After compression the tree has at most 2n nodes and exactly n leaves (with a sentinel — more on that below). Edge labels are stored not as copied strings but as a pair of indices [start, end) into the original text, so the whole structure is O(n) space even though the suffixes together contain O(n²) characters.
Peter Weiner built the first linear-time suffix tree in 1973 — Donald Knuth reportedly called it "the algorithm of 1973." McCreight simplified it in 1976. But both build the tree right-to-left and are notoriously hard to follow. In 1995 Esko Ukkonen published an online, left-to-right construction that processes the text one character at a time and is the version everyone teaches today.
How Ukkonen's algorithm works
Ukkonen builds the suffix tree of S[0..i] incrementally, extending it to S[0..i+1] at each step. The brilliance is three tricks that turn an obvious O(n²) (or O(n³)) process into O(n):
- Open-ended leaf edges ("global end"). Once an edge runs to a leaf, that leaf's suffix keeps growing as the string grows. Rather than re-extending it each step, store the leaf edge's end as a shared variable
endthat you increment once per phase. This applies Rule 1 ("leaf stays a leaf") to all leaves in O(1). Gusfield calls this "once a leaf, always a leaf." - Suffix links. When an internal node for string
xαexists, it gets a pointer — a suffix link — to the node forα. After inserting one suffix, you follow the suffix link to land near the insertion point of the next-shorter suffix instead of descending from the root again. - The active point and "skip/count." Instead of re-reading characters one at a time when walking down an edge, you jump by edge length (you already know the substring matches), so a down-walk costs O(1) amortized per node passed.
The state between phases is captured by an active point: a triple (active_node, active_edge, active_length) describing where in the tree the next insertion begins, plus a remainder counter of suffixes still owed. Each phase i does remainder insertions, and three rules govern them:
- Rule 1 (leaf extension): handled implicitly by the global
end. - Rule 2 (new branch): the next character isn't on the active edge → split the edge, create an internal node and a new leaf. If we created an internal node on the previous insertion in this phase, set its suffix link to this one.
- Rule 3 (already there): the next character is already present → stop the phase early (a "show stopper"), bump
active_length, and carry the remaining suffixes into the next phase.
Why it's linear — the accounting
Each of the n phases looks like it could do up to n insertions, suggesting O(n²). The key amortized arguments:
- Rule 1 leaf extensions are free (the shared
end), so they never appear in the phase loop. - Every Rule 2 insertion creates exactly one leaf, and a leaf is never destroyed. There are at most
nleaves, so there are at mostnRule-2 operations across the entire run. - Rule 3 ends a phase immediately, so it happens at most once per phase:
ntimes total. - The skip/count down-walk's total cost telescopes: each suffix-link jump can only decrease the current depth by at most a constant, and depth increases at most
ntimes, so all walking is O(n) summed.
Net result: O(n) time and O(n) space for a constant-size alphabet. For a general alphabet of size σ, the child-lookup at each node costs O(log σ) with a balanced map (or O(σ) space with arrays), giving O(n log σ) time. Substring search is O(|P|) — purely the length of the pattern, never the text. That is the headline: searching a 3-billion-character genome for a 30-character read costs 30 steps once the tree is built.
When to use a suffix tree
- Repeated substring queries against a fixed text. Build once, query forever: each query is O(|P|), independent of text length.
- Longest repeated substring. The deepest internal node (by string depth) spells the answer — a single tree traversal.
- Longest common substring of two (or k) strings. Build a generalized suffix tree over
S1#S2$and find the deepest internal node whose subtree contains leaves from both strings. - Matching statistics, maximal repeats, Lempel-Ziv factorization, and constant-time longest-common-extension when paired with an LCA structure.
- Online / streaming construction. Ukkonen's left-to-right build is the only classic linear method that ingests characters as they arrive.
Skip it when memory is tight or the text is huge — a suffix array or FM-index gives the same answers in a fraction of the space with far better cache behavior. And for one-off "does P occur?" on a small text, plain KMP or Z-algorithm is simpler and avoids the index.
Suffix tree vs other substring indexes
| Suffix tree | Suffix array | Suffix automaton (DAWG) | Trie of suffixes | FM-index | |
|---|---|---|---|---|---|
| Build time | O(n) | O(n) (SA-IS) | O(n) | O(n²) | O(n) |
| Substring query | O(|P|) | O(|P| + log n) | O(|P|) | O(|P|) | O(|P|) backward |
| Memory (typical) | ≈ 20n bytes | ≈ 4n bytes (+4n LCP) | ≈ 8n–12n bytes | O(n²) bytes | ≈ n + o(n) bits |
| Cache locality | Poor (pointer chasing) | Excellent (flat arrays) | Poor | Poor | Moderate |
| Node count | ≤ 2n | n (array slots) | ≤ 2n − 1 states | up to O(n²) | — |
| Online build | Yes (Ukkonen) | No (needs full text) | Yes (incremental) | Yes | No |
| Best for | Tree-structure queries, LCA, repeats | General substring search | Counting distinct substrings, automaton ops | Teaching only | Compressed genomics search |
The practical takeaway: the suffix array (with an LCP array) replicates almost everything a suffix tree does at one-fifth the memory and with sequential access patterns the CPU loves. The suffix tree earns its keep when you genuinely need the explicit tree — suffix links, subtree marking for multi-string problems, or O(1) lowest-common-ancestor between two suffixes.
What the numbers actually say
- Memory blow-up is real. Even a carefully engineered implementation costs roughly 20 bytes per input character (a node carries several index and pointer fields, and there are up to 2n nodes), so a suffix tree over the 3.1-billion-base human genome lands on the order of 60 GB. A 32-bit suffix array of the same genome is ~12.4 GB, and an FM-index compresses it to roughly 1–4 GB. This memory gap is exactly why BWA, Bowtie, and other read aligners use FM-indexes, not suffix trees.
- Query cost is text-independent. Matching a 100-character pattern against a 1 GB text is 100 character comparisons after the build — about 100 nanoseconds of pure comparison work, ignoring cache misses. A linear scan would touch all 10⁹ bytes.
- Construction throughput. Ukkonen's algorithm does O(1) amortized work per character but with poor locality; in practice it builds a few million characters per second, slower per byte than SA-IS suffix-array construction despite the same asymptotic bound, because of pointer chasing and cache misses.
- Leaf count is exactly n. With a unique sentinel, the tree has precisely n leaves (one per suffix) and at most n − 1 internal branching nodes, bounding total nodes at 2n − 1.
JavaScript implementation
A compact Ukkonen builder. Edge labels are [start, end) index pairs; leaf edges share a mutable leafEnd so Rule 1 is automatic.
function buildSuffixTree(text) {
const s = text + ' '; // unique sentinel
const n = s.length;
const ROOT = 0;
// node: { children: Map(char->edge), link }; edge: { start, end (fn|null), node }
const nodes = [{ children: new Map(), link: -1 }];
const newNode = () => (nodes.push({ children: new Map(), link: ROOT }), nodes.length - 1);
let leafEnd = -1; // global end for all leaf edges
let active = { node: ROOT, edge: -1, len: 0 };
let remainder = 0, lastNew = -1;
const edgeLen = (e) => (e.end === null ? leafEnd : e.end) - e.start + 1;
for (let i = 0; i < n; i++) {
leafEnd = i; // Rule 1: extend every leaf for free
remainder++;
lastNew = -1;
while (remainder > 0) {
if (active.len === 0) active.edge = i;
const ac = s[active.edge];
let e = nodes[active.node].children.get(ac);
if (!e) { // Rule 2: no edge -> new leaf
const leaf = newNode();
nodes[active.node].children.set(ac, { start: i, end: null, node: leaf });
if (lastNew !== -1) { nodes[lastNew].link = active.node; lastNew = -1; }
} else {
const len = edgeLen(e);
if (active.len >= len) { // skip/count: hop to the next node
active.edge += len; active.len -= len; active.node = e.node; continue;
}
if (s[e.start + active.len] === s[i]) { // Rule 3: already present
if (lastNew !== -1) nodes[lastNew].link = active.node;
active.len++; break; // show stopper — end the phase
}
// Rule 2 with split: edge -> internal -> (old, new leaf)
const split = newNode();
const internalEdge = { start: e.start, end: e.start + active.len - 1, node: split };
nodes[active.node].children.set(ac, internalEdge);
const leaf = newNode();
nodes[split].children.set(s[i], { start: i, end: null, node: leaf });
e.start += active.len;
nodes[split].children.set(s[e.start], e);
if (lastNew !== -1) nodes[lastNew].link = split;
lastNew = split;
}
remainder--;
if (active.node === ROOT && active.len > 0) { // rooted: shrink the active length
active.len--; active.edge = i - remainder + 1;
} else if (active.node !== ROOT) { // follow suffix link
active.node = nodes[active.node].link;
}
}
}
return { nodes, text: s, edgeLen };
}
// Substring query: walk edges, matching pattern characters.
function contains(tree, pat) {
const { nodes, text } = tree;
let node = 0, k = 0;
while (k < pat.length) {
const e = nodes[node].children.get(pat[k]);
if (!e) return false;
const end = (e.end === null ? text.length - 1 : e.end);
for (let j = e.start; j <= end && k < pat.length; j++, k++)
if (text[j] !== pat[k]) return false;
node = e.node;
}
return true;
}
Two things to watch. First, leaf edges carry end: null and read leafEnd dynamically — never hard-code a leaf's end. Second, after each insertion you adjust the active point one of two ways: from the root you decrement active_length; from any internal node you follow its suffix link. Mixing these up is the classic bug.
Python implementation
The same algorithm in Python, structured the same way so you can read them side by side.
class Node:
__slots__ = ('children', 'link')
def __init__(self):
self.children = {} # char -> [start, end_or_None, node]
self.link = 0
def build_suffix_tree(text):
s = text + '\0' # unique sentinel
n = len(s)
nodes = [Node()] # root = index 0
def new_node():
nodes.append(Node()); return len(nodes) - 1
leaf_end = [-1] # boxed so edges can read it live
a_node, a_edge, a_len = 0, -1, 0
remainder = 0
def edge_len(e):
end = leaf_end[0] if e[1] is None else e[1]
return end - e[0] + 1
for i in range(n):
leaf_end[0] = i # Rule 1: free leaf extension
remainder += 1
last_new = -1
while remainder > 0:
if a_len == 0:
a_edge = i
e = nodes[a_node].children.get(s[a_edge])
if e is None: # Rule 2: brand-new leaf
nodes[a_node].children[s[a_edge]] = [i, None, new_node()]
if last_new != -1:
nodes[last_new].link = a_node; last_new = -1
else:
if a_len >= edge_len(e): # skip/count down-walk
a_edge += edge_len(e); a_len -= edge_len(e); a_node = e[2]
continue
if s[e[0] + a_len] == s[i]: # Rule 3: already present
if last_new != -1:
nodes[last_new].link = a_node
a_len += 1
break # show stopper
split = new_node() # Rule 2 with split
nodes[a_node].children[s[a_edge]] = [e[0], e[0] + a_len - 1, split]
nodes[split].children[s[i]] = [i, None, new_node()]
e[0] += a_len
nodes[split].children[s[e[0]]] = e
if last_new != -1:
nodes[last_new].link = split
last_new = split
remainder -= 1
if a_node == 0 and a_len > 0:
a_len -= 1; a_edge = i - remainder + 1
elif a_node != 0:
a_node = nodes[a_node].link
return nodes, s
def longest_repeated_substring(text):
"""Deepest internal node spells the longest repeat."""
nodes, s = build_suffix_tree(text)
best = ''
def dfs(idx, depth):
nonlocal best
node = nodes[idx]
if not node.children: # leaf — no repeat through here
return
if depth and len(depth) > len(best):
best = depth
for start, end, child in node.children.values():
end = (len(s) - 1) if end is None else end
dfs(child, depth + s[start:end + 1])
dfs(0, '')
return best
longest_repeated_substring shows the payoff of the tree shape: a single depth-first walk, stopping at the deepest branching node, yields the answer in O(n). The same traversal, with leaf-subtree counts, gives all maximal repeats.
Variants worth knowing
Generalized suffix tree. Index several strings at once by concatenating them with distinct sentinels (S1#S2$). Marking which input each leaf belongs to turns it into a longest-common-substring and document-retrieval machine.
Suffix array + LCP array. The flat, cache-friendly cousin. Built in O(n) by SA-IS or DC3, it answers the same substring queries in ~4n bytes. Most production systems use this instead — see our suffix array deep dive.
Suffix automaton (DAWG). The minimal deterministic automaton recognizing every substring, with ≤ 2n − 1 states. Built online in O(n), it counts distinct substrings and does automaton-style queries the tree can't express as cleanly.
FM-index / compressed suffix tree. Built on the Burrows-Wheeler transform, these shrink the index to near the entropy of the text — n + o(n) bits — at the cost of slower constant factors. This is what real DNA aligners ship.
Weiner and McCreight. The two earlier linear-time constructions. Weiner (1973) builds right-to-left; McCreight (1976) is more memory-frugal but offline. Ukkonen wins on being online and easiest to explain.
Common bugs and edge cases
- Forgetting the sentinel. Without a unique terminator, a suffix that is a prefix of another (the
aninsideanan) ends mid-edge and has no leaf — the tree is "implicit" and many queries break. Always append a character that appears nowhere else. - Hard-coding leaf-edge ends. Leaf edges must read the shared
leafEnd/global end; storing a fixed value defeats Rule 1 and reintroduces O(n²) work. - Wrong active-point update. From the root, decrement
active_lengthand recomputeactive_edge; from an internal node, follow the suffix link. Swapping these silently produces a malformed tree. - Setting suffix links late. An internal node created by a split must receive its suffix link on the next Rule-2 or Rule-3 step of the same phase. Track the "last new internal node" and wire it up before moving on.
- Skip/count off-by-one. When
active_lengthequals the edge length exactly, you must hop to the child node and reset, not stop on the edge. - Reaching for a suffix tree when an array would do. If you don't need suffix links or LCA, a suffix array is smaller, faster to build, and far kinder to the cache.
Frequently asked questions
How does Ukkonen's algorithm build a suffix tree in linear time?
It builds the tree online, one character at a time, keeping all leaf edges open-ended so they grow for free (the 'global end' trick), and using suffix links to jump between the active points of consecutive suffixes instead of re-walking from the root. The amortized work per character is O(1) over an alphabet of constant size, giving O(n) total.
What is a suffix link?
If an internal node represents the string xα (a single character x followed by a possibly-empty string α), its suffix link points to the node representing α. Suffix links let Ukkonen's algorithm move from the locus of one suffix to the locus of the next-shorter suffix in O(1), which is the key to linear-time construction.
Why do you append a sentinel character like $ to the string?
Without a unique terminator, a suffix that is a prefix of another suffix (for example 'an' inside 'anan') has no leaf of its own — it ends in the middle of an edge. Appending a sentinel that occurs nowhere else forces every suffix to end at a distinct leaf, so the tree has exactly n leaves and the structure is well defined.
Suffix tree vs suffix array — which should I use?
A suffix array stores the same information in about 4n bytes versus 20 or more bytes per character for a pointer-heavy suffix tree, and it has far better cache behavior. Use a suffix array (plus an LCP array) for almost all practical substring indexing. Reach for an explicit suffix tree when you need its tree structure directly — suffix links, lowest-common-ancestor queries, or streaming online construction.
How much memory does a suffix tree use?
A suffix tree has at most 2n nodes and 2n − 1 edges. Storing children in hash maps or arrays, each node typically costs 20 to 40 bytes, so a realistic implementation uses roughly 20n bytes — about 20 GB for a 1 GB genome, which is why suffix arrays and FM-indexes dominate in bioinformatics.
What problems do suffix trees solve in linear time?
Substring existence and counting in O(|P|), longest repeated substring (deepest internal node), longest common substring of two strings (a generalized suffix tree with two sentinels), all maximal repeats, matching statistics, and the Lempel-Ziv factorization — each in time linear or near-linear in the input.