Data Structures

Suffix Array

All n suffixes, sorted, in 4n bytes

A suffix array is the sorted list of every suffix of a string, stored as starting indices. Combined with an LCP array, it answers substring search, longest-repeated-substring, and many string-matching problems in O(log n) per query and O(n) total memory.

  • Build (SA-IS)O(n)
  • Build (sort-based)O(n log² n)
  • Substring searchO(m log n)
  • LCP build (Kasai)O(n)
  • Memory≈4n bytes

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a suffix array works

A suffix of a string is whatever's left when you delete a prefix. The string "banana$" (with sentinel) has seven suffixes: banana$, anana$, nana$, ana$, na$, a$, $. Sort them lexicographically and you get:

SA[0] = 6  $
SA[1] = 5  a$
SA[2] = 3  ana$
SA[3] = 1  anana$
SA[4] = 0  banana$
SA[5] = 4  na$
SA[6] = 2  nana$

The suffix array is just the column of starting indices: SA = [6, 5, 3, 1, 0, 4, 2]. Despite throwing away the actual suffix strings, the array carries enough information to (a) binary-search for any pattern in O(m log n), (b) find the longest repeated substring once you also have an LCP array, and (c) answer dozens of other string queries in time that the underlying string itself wouldn't naively support.

The naive build is "sort all n suffixes lexicographically" — O(n² log n) because each comparison is up to n characters. Better: prefix-doubling sorts in rounds, where round k orders suffixes by their first 2^k characters using ranks computed in round k-1. Total: O(n log² n) with a simple sort, or O(n log n) with radix sort. SA-IS and DC3 reach the asymptotic floor of O(n).

When to use a suffix array

  • Indexing a static or rarely-changing text for many subsequent pattern queries (genome indexing, full-text search).
  • Longest repeated substring, longest common substring of two strings, count of distinct substrings.
  • Computing the Burrows-Wheeler transform — both bzip2 and the FM-index in Bowtie/BWA depend on it.
  • Anywhere a suffix tree would work but you can't afford 20-40 bytes per character of memory overhead.

Skip the suffix array when the text changes frequently — rebuilding is O(n). Use a suffix automaton or compressed suffix tree instead. For one-shot single-pattern search, KMP or Boyer-Moore are simpler and don't require the linear-time preprocessing.

Suffix array vs suffix tree vs FM-index

Suffix array (+ LCP)Suffix treeFM-index
Memory (bytes per char)~4-9~20-40~0.5-2 (compressed)
Construction timeO(n) with SA-ISO(n) UkkonenO(n) — derived from SA
Substring searchO(m log n) plain, O(m) with LCPO(m)O(m) backward search
Find all k occurrencesO(m + k)O(m + k)O(m + k)
Distinct substringsn(n+1)/2 - sum(LCP)Sum of edge labelsSame as suffix array
Streaming / appendNot reallyOnline with UkkonenNo
Ease of implementationMedium (SA-IS is fiddly)Hard (Ukkonen has 5 cases)Hard (depends on SA + BWT)
Best forGeneral string indexingOnline queries while buildingGenomes, compressed indexes

Real bioinformatics tools (BWA, Bowtie 2, minimap2) use FM-indexes built from suffix arrays. They get suffix-tree-like query times with sub-byte-per-character memory — the human genome (3 Gbp) fits in roughly 3-4 GB of RAM with the right tuning, versus 60+ GB for a naive suffix tree.

Pseudo-code (prefix doubling)

function buildSA(s):
    n = len(s)
    sa = [0..n-1]                      # initial: indices in input order
    rank = [s[i] for i in 0..n-1]      # ranks by single chars
    k = 1
    while k < n:
        # Compare suffixes by (rank[i], rank[i+k])
        sort sa by key i -> (rank[i], rank[i+k] if i+k<n else -1)
        new_rank = [0]*n
        new_rank[sa[0]] = 0
        for j in 1..n-1:
            prev = (rank[sa[j-1]],   rank[sa[j-1]+k]   if sa[j-1]+k<n   else -1)
            curr = (rank[sa[j]],     rank[sa[j]+k]     if sa[j]+k<n     else -1)
            new_rank[sa[j]] = new_rank[sa[j-1]] + (1 if curr != prev else 0)
        rank = new_rank
        if rank[sa[n-1]] == n - 1: break   # all unique
        k *= 2
    return sa

JavaScript implementation — prefix doubling

function buildSuffixArray(s) {
  const n = s.length;
  let sa = Array.from({ length: n }, (_, i) => i);
  let rank = Array.from(s, c => c.charCodeAt(0));
  const tmp = new Int32Array(n);

  for (let k = 1; k < n; k *= 2) {
    const cmp = (a, b) => {
      if (rank[a] !== rank[b]) return rank[a] - rank[b];
      const ra = a + k < n ? rank[a + k] : -1;
      const rb = b + k < n ? rank[b + k] : -1;
      return ra - rb;
    };
    sa.sort(cmp);
    tmp[sa[0]] = 0;
    for (let i = 1; i < n; i++) {
      tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) !== 0 ? 1 : 0);
    }
    rank = Array.from(tmp);
    if (rank[sa[n - 1]] === n - 1) break;
  }
  return sa;
}

// Kasai's O(n) LCP construction given SA and the original string
function buildLCP(s, sa) {
  const n = s.length;
  const rank = new Int32Array(n);
  for (let i = 0; i < n; i++) rank[sa[i]] = i;
  const lcp = new Int32Array(n);
  let h = 0;
  for (let i = 0; i < n; i++) {
    if (rank[i] > 0) {
      const j = sa[rank[i] - 1];
      while (i + h < n && j + h < n && s[i + h] === s[j + h]) h++;
      lcp[rank[i]] = h;
      if (h > 0) h--;
    } else {
      h = 0;
    }
  }
  return lcp;
}

This prefix-doubling implementation is O(n log² n) — log n rounds, each with an O(n log n) sort. For n = 10^5 that's a few hundred million ops; comfortable in JavaScript. For n = 10^6+ on a tight time limit, port to a radix-sort-based round and you get O(n log n), or implement SA-IS for true O(n).

Python implementation — longest repeated substring

The canonical "interview problem" application: given a string, find its longest substring that occurs at least twice. With SA + LCP, the answer is one max-and-argmax pass.

def build_suffix_array(s):
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]
    tmp = [0] * n
    k = 1
    while k < n:
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        tmp[sa[0]] = 0
        for i in range(1, n):
            tmp[sa[i]] = tmp[sa[i - 1]] + (1 if key(sa[i - 1]) != key(sa[i]) else 0)
        rank = tmp[:]
        if rank[sa[-1]] == n - 1:
            break
        k *= 2
    return sa

def build_lcp(s, sa):
    n = len(s)
    rank = [0] * n
    for i, idx in enumerate(sa):
        rank[idx] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
        else:
            h = 0
    return lcp

def longest_repeated_substring(s):
    sa = build_suffix_array(s)
    lcp = build_lcp(s, sa)
    if not lcp or max(lcp) == 0:
        return ""
    i = lcp.index(max(lcp))
    return s[sa[i]:sa[i] + lcp[i]]

print(longest_repeated_substring("banana"))  # "ana"

For a 1-million-character text the sort-based pipeline finishes in ~3 seconds in Python, and a few hundred ms in C++. SA-IS would beat it by 5-10×, which only matters past 10^7 characters or in tight contest budgets.

Variants

Naive suffix sort (O(n² log n))

Sort the list of n actual suffix strings using the language's default comparator. One line of code, fastest to type. Acceptable for n up to a few thousand. For larger n the per-comparison cost (up to n characters) dominates.

Prefix doubling with radix sort (O(n log n))

Each round sorts pairs of integer ranks. With radix sort instead of comparison sort, each round is O(n), so the whole build is O(n log n). Standard in competitive-programming templates.

DC3 / Skew (O(n))

Kärkkäinen-Sanders (2003) split the suffix indices by i mod 3, sort the i ≡ 1, 2 (mod 3) suffixes recursively at one third the size, then merge with the i ≡ 0 (mod 3) suffixes. Total time O(n) but with a hefty constant factor.

SA-IS (O(n))

Suffix Array via Induced Sorting (Nong, Zhang, Chan 2009) classifies positions as L-type or S-type, recursively sorts only the LMS substrings (a specific subset), then induces the rest in two linear passes. Tighter constants than DC3 and the de facto choice in production. The implementation is roughly 200 lines and correctly debugging it is a rite of passage.

Compressed suffix array (CSA)

Stores the suffix array implicitly using BWT and the rank/select primitives of succinct data structures. Memory drops from 4n bytes to roughly H₀(s) · n bits — close to the empirical entropy of the string. The trade-off is slower queries (O(m log n) becomes O(m log² n)) and significantly harder implementation.

2D suffix array

For 2D pattern matching (image patches in a larger image), you can build a Burrows-Wheeler-style index over row-by-row scans and a column-by-column refinement. Useful in computational geometry and a few image-recognition niches but rarely worth implementing from scratch.

Common bugs and edge cases

  • Forgetting the sentinel. A common convention is to append a unique sentinel character (typically $ or \0) that's lexicographically smaller than every other character. Without it, the algorithm can't distinguish "ab" from "abab" when one is a prefix of the other.
  • Using chr(0) in a Python string and forgetting to test. Pythons strings handle null bytes fine; many C parsers do not. Use a sentinel that's outside the alphabet but valid in your representation.
  • Off-by-one in Kasai. The h variable can drop below 0 if you decrement unconditionally — guard with if h > 0.
  • Comparing pairs with -1 sentinel. When the suffix is shorter than k characters, the synthetic "missing" rank must be smaller than every real rank. Use -1 (or any reserved minimum), not 0.
  • Trying to use a suffix array on a streaming or growing text. Suffix arrays don't support efficient append or insert. Switch to a suffix automaton (online, O(n) total) or rebuild lazily.
  • Memory blow-up in JavaScript. Using regular arrays (boxed Int) for n = 10^6 costs ~80 MB. Switch to Int32Array and you drop to ~4 MB.
  • Naive search forgetting to verify the prefix. Binary search in the suffix array is comparing on the prefix of s[sa[mid]..]; if you compare full suffixes by accident, each comparison is O(n) and the search is O(n log n).

Frequently asked questions

What is the difference between a suffix array and a suffix tree?

Both index every suffix of a string. A suffix tree is a compressed trie that uses 20-40 bytes per character and answers substring search in O(m). A suffix array is a sorted list of suffix start positions, uses 4 bytes per character, and answers substring search in O(m log n) — or O(m) when augmented with an LCP array. In practice suffix arrays win on memory and constant factors, suffix trees win on a few specialised queries.

What is an LCP array?

The longest-common-prefix array stores, for each adjacent pair of suffixes in the sorted suffix array, the length of their shared prefix. Kasai's algorithm computes the entire LCP array in O(n). With it, you can answer "longest repeated substring" as max(LCP[]) and "longest common substring of two strings" by concatenating with a separator.

How do you build a suffix array in O(n)?

SA-IS (Suffix Array via Induced Sorting, Nong et al. 2009) and DC3 / Skew (Kärkkäinen-Sanders, 2003) both run in O(n) on integer alphabets. SA-IS has the smallest constant factor in practice and is the default in competitive bioinformatics. The naive O(n log² n) sort-based version is usually good enough for inputs under 10^7 characters.

Why are suffix arrays popular in bioinformatics?

Genomes are gigabytes of mostly-static text on a 4-letter alphabet. Suffix arrays use 4× the genome size for the index, versus 20×+ for suffix trees. The Burrows-Wheeler transform (the trick behind bzip2) can be derived from a suffix array and powers BWA and Bowtie, the dominant short-read alignment tools.

Can a suffix array find all occurrences of a pattern?

Yes — binary-search the suffix array for the leftmost suffix that starts with the pattern, then again for the rightmost. Every index in that range is an occurrence. Total time O(m log n) for setup plus O(occ) to list occurrences. With LCP-augmented binary search it drops to O(m + log n).

What is the longest repeated substring problem?

Given a string s, find the longest substring that appears at least twice. With a suffix array plus LCP array, the answer is the substring of length max(LCP[i]) starting at SA[argmax]. The whole pipeline runs in O(n) once SA-IS and Kasai's algorithm are in place.