String Algorithms
Z-Algorithm
Compute Z[i] = longest substring at i matching a prefix — O(n+m) pattern search
The Z-algorithm computes the Z-array Z[i] of a string s — defined as the length of the longest substring starting at position i that matches a prefix of s. It runs in O(n) time using a sliding [l, r] window of the rightmost-known prefix match. For pattern search, concatenate pattern + '$' + text, compute Z, and locate every position where Z equals pattern length. Discovered by Gusfield in 1997 (book Algorithms on Strings, Trees, and Sequences) but the technique is folklore. Cleaner to teach than KMP.
- TimeO(n)
- MemoryO(n)
- Z[i] definitionLongest prefix match
- Pattern searchO(n + m)
- WindowRightmost match interval [l, r]
- NotableSimpler than KMP for many uses
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
Why Z-algorithm matters
- Cleaner intuition than KMP. The Z-array is symmetric: Z[i] simply asks "how far does s, starting at i, agree with s starting at 0." KMP's failure function answers a less directly stated question ("longest proper prefix-suffix"); proving correctness of KMP's preprocessing trips up most students. The [l, r] window argument for Z is one short paragraph.
- Competitive programming staple. The Z-algorithm fits in 15 lines, has no off-by-one quirks, and solves period-finding, longest-prefix-also-suffix, and pattern occurrences with one pass. Codeforces, AtCoder and ICPC editorials default to Z-array because it is harder to write incorrectly than KMP.
- Subtask in suffix-tree construction. Gusfield's textbook builds a suffix tree in O(n) by computing the Z-array of a related string and using it to derive suffix links. Without Z, the linear-time construction (Ukkonen, McCreight) requires considerably more bookkeeping.
- String period detection. The smallest period p of a string s of length n is the smallest p such that Z[p] = n - p. This is the basis for fast periodicity tests used in compression and signal processing.
- Detecting tandem repeats. Bioinformatics tools find sequences such as ACTACTACT by scanning for indices i with high Z[i]. Run-length-encoded blocks in compression algorithms similarly leverage Z to find self-similar regions.
- String hashing alternative. Some problems that competitive programmers solve with rolling hashes (collision-prone) can be solved with Z-arrays deterministically, eliminating false positives.
- Two-way matching. Compute Z on s and on the reverse of s to get prefix and suffix information about every position — useful in palindrome detection and approximate matching.
Common misconceptions
- Slower than KMP. Same asymptotic complexity O(n + m). The constant factors are nearly identical because both algorithms make at most 2n character comparisons across the whole input. Benchmarks on English-language text show typical differences under 5%.
- Only academic. Used in Gusfield-style suffix construction, in many competitive-programming libraries, and as a pedagogical baseline before introducing suffix trees and arrays. While production string-search code more often uses Boyer-Moore or memchr-vectorized matchers, Z-array is the standard reference for textbook linear-time matching.
- Z[0] is the length of the string. Z[0] is conventionally undefined (or set to 0); some references set it to n, which causes off-by-one errors when used with the pattern-search trick. Stick to one convention and document it.
- You always need a separator character. The pattern + $ + text trick requires a separator that occurs in neither string. For binary alphabets choose a third symbol (e.g., 2). For arbitrary input, hash to integers and use n + 1 as the separator. Or use a slight variant that avoids the separator by constraining Z[i] ≤ min(length(P), n - i).
- The window [l, r] is two-sided. [l, r] is monotonically forward — l only ever increases when r jumps. There is no left-shrink. Implementations that fail to maintain this invariant accidentally report O(n²) behavior.
- Z-array can find approximate matches. Pure Z handles only exact matches. Approximate matching needs additional machinery (bitmap-based shift-or, Levenshtein automaton, or banded DP).
Walkthrough of the [l, r] update
Initialize Z as an array of zeros and set l = r = 0. For i from 1 to n - 1: if i > r, restart with k = 0 and a fresh comparison loop while i + k < n and s[k] = s[i + k], incrementing k; assign Z[i] = k; if k > 0 update l = i, r = i + k - 1. If i ≤ r, mirror by computing k = i - l and looking at Z[k]. Two sub-cases. If Z[k] < r - i + 1, copy: Z[i] = Z[k] and r remains valid. If Z[k] ≥ r - i + 1, extend: start with k = r - i + 1 and continue character comparisons forward of r, updating r and l afterwards.
The amortized analysis. Every successful character comparison either stays inside [l, r] (free, by the mirror) or extends r forward by exactly one position. Since r is monotonically nondecreasing and bounded by n - 1, the total number of "extending" comparisons is O(n). Each "mirror" operation is O(1). Therefore total work is O(n).
Worked example
Let s = "aabcaabxaab". Indices 0 through 10. Z[1]: s[0] = 'a' equals s[1] = 'a', s[1] = 'a' equals s[2] = 'b'? No, mismatch — Z[1] = 1. Update l = 1, r = 1. Z[2]: i > r so fresh compare: s[0] = 'a' vs s[2] = 'b' — Z[2] = 0. Z[3]: fresh — Z[3] = 0. Z[4]: fresh — s[0] = 'a' vs s[4] = 'a', s[1] = 'a' vs s[5] = 'a', s[2] = 'b' vs s[6] = 'b', s[3] = 'c' vs s[7] = 'x' — Z[4] = 3, l = 4, r = 6. Z[5]: i ≤ r so mirror — k = 1, Z[1] = 1, 1 < r - i + 1 = 2, so Z[5] = 1. Z[6]: mirror — k = 2, Z[2] = 0, copy Z[6] = 0. Z[7]: i > r, fresh — Z[7] = 0. Z[8]: fresh — Z[8] = 3, update l = 8, r = 10. The complete Z-array is [0, 1, 0, 0, 3, 1, 0, 0, 3, 1, 0].
Frequently asked questions
How is the Z-array computed in O(n)?
Maintain a window [l, r] representing the rightmost prefix-match interval encountered so far. For each new index i: if i is greater than r, fall back to a fresh character-by-character comparison from index 0 of the string; otherwise i lies inside [l, r] and we can mirror Z[i - l] from earlier in the array. If Z[i - l] is less than r - i + 1, copy it; otherwise extend the match beyond r to find the true value. Each character of s is examined at most twice across all iterations — once during the mirroring copy and once during a window-extending comparison — proving O(n) total work.
Why is it equivalent to KMP for pattern search?
Z-search runs on the concatenation P$T where $ is a separator that does not occur in either string. Compute the Z-array of P$T. Every index i in T-region with Z[i] equal to length(P) corresponds to an occurrence of P at offset i - length(P) - 1 in T. The total time is O(length(P) + 1 + length(T)) = O(n + m), exactly matching KMP. The two are different algorithmic descriptions of the same complexity class but Z-array computation is symmetric in pattern and text whereas KMP's failure function is asymmetric.
When is Z-algorithm easier than KMP?
When you also need information about the original string itself — not just where a pattern occurs but, e.g., where every substring of length k that equals a prefix occurs. The Z-array directly answers many such questions. Pedagogically the [l, r] window is more concrete than KMP's failure function, which often confuses students; the cleaner invariant 's[i..i + Z[i]) = s[0..Z[i])' is easier to prove correct. In competitive programming the Z-algorithm is commonly the first tool reached for because its 15-line implementation is harder to get wrong than KMP's.
How does the [l, r] window work?
[l, r] is the leftmost-l, rightmost-r interval such that s[l..r] equals s[0..r - l] — that is, the rightmost prefix-match interval discovered so far. Whenever processing index i computes a Z[i] whose match extends beyond the current r, update l to i and r to i + Z[i] - 1. The next index can then mirror inside [l, r] for free. The amortized argument: each step either extends r or operates inside [l, r] using the mirror, so the right boundary is monotonically nondecreasing — every text position is included in [l, r] at most once.
Can it find all occurrences with overlaps?
Yes — overlapping matches are reported naturally. After computing Z(P$T), every index i with Z[i] equal to length(P) is a valid match offset, including overlapping ones (e.g., pattern 'aba' in text 'ababa' produces matches at positions 0 and 2, both reported). Z-array does not collapse overlapping matches the way some greedy implementations do; it is up to the caller to filter to non-overlapping if that is desired.
What problems beyond pattern matching does it solve?
Period-finding: smallest period of s is the smallest p with Z[p] = n - p. Longest substring that is both prefix and suffix: max Z[i] over indices i with i + Z[i] = n. Counting distinct substrings via suffix array construction: Z-array is a building block in Gusfield's linear-time suffix-tree construction. String tandem-repeat finding. Run-length compression heuristics. Fast computation of periodicity arrays in bioinformatics. Many string DP problems reduce to one Z-array pass.