String Algorithms

Boyer-Moore String Search

Skip ahead by entire pattern lengths — O(n/m) best case, the secret behind grep

Boyer-Moore is a string-search algorithm that compares the pattern to the text from right to left and uses two heuristics — bad-character and good-suffix — to skip over portions of the text that cannot match. Best case O(n/m), where n is text length and m is pattern length, making it sublinear. Designed by Robert Boyer and J Strother Moore in 1977 and used in grep, agrep, and most text editors. The key insight: if the last character of the pattern doesn't match a character that doesn't appear in the pattern at all, you can skip the entire pattern length.

  • Best caseO(n/m) sublinear
  • Worst caseO(nm)
  • HeuristicsBad-character, good-suffix
  • InventedBoyer & Moore 1977
  • Pre-processingO(m + σ) where σ = alphabet size
  • Used ingrep, agrep, most editors

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Why Boyer-Moore matters

  • Sublinear in practice. On English text with a 256-byte alphabet, Boyer-Moore inspects about n/m text characters when the pattern is 8 or longer. For a 1 GiB log and a 16-byte search term, that is roughly 64 MiB of memory traffic versus 1 GiB for a left-to-right scanner — a 16x reduction.
  • Default in grep. GNU grep uses a Galil-augmented Boyer-Moore for plain (non-regex, non-fixed-string) patterns. fgrep falls back to Aho-Corasick when there are multiple needles, but a single literal needle goes through the same Boyer-Moore kernel that has been hardened since 1990.
  • Galil's variant achieves O(n) worst case. Plain Boyer-Moore is O(nm) on adversarial periodic inputs. Zvi Galil's 1979 modification records partial matches across alignments, eliminating redundant comparisons and proving a true O(n) worst-case bound while preserving the O(n/m) best case.
  • Good with long patterns. Pre-processing is O(m + σ), so doubling the pattern length doubles only the small one-time cost. The search loop benefits, because the maximum legal shift grows linearly with m.
  • Cache-friendly memory access. Skipping forward by m positions per mismatch means contiguous strides, often crossing cache lines and pages in fewer steps than left-to-right scanning. Branch predictors also benefit from the regular shift pattern.
  • Trivial to vectorize. Modern implementations replace the inner mismatch loop with SSE/AVX comparisons against the last pattern character, then verify candidates. memmem in glibc and strstr in many libcs are essentially Two-Way (Crochemore-Perrin) plus a Boyer-Moore-style sentinel.
  • Foundation for other algorithms. Sunday's Quick Search, Horspool's variant, Turbo Boyer-Moore, and the Apostolico-Giancarlo algorithm all start from the right-to-left mismatch idea and trade preprocessing complexity for tighter shift bounds.

Common misconceptions

  • Always faster than KMP. Not on small alphabets. With a 4-letter DNA alphabet (sigma = 4) the bad-character shift is rarely close to m because the mismatched character almost always occurs in the pattern; KMP's strict O(n + m) wins for short patterns. Boyer-Moore needs roughly sigma ≥ m to dominate.
  • Linear time guaranteed. Only with Galil's rule. The textbook 1977 Boyer-Moore is O(nm). Pattern aaab in text aaaaa…aaaa is the canonical adversarial input that defeats unmodified Boyer-Moore.
  • Bad-character alone is enough. Bad-character can suggest a negative shift, which the algorithm clamps to 1; in periodic patterns this collapses to repeated 1-character shifts. Good-suffix is what guarantees nonzero progress on those inputs.
  • Right-to-left means we read text backwards. The text is read in forward order across alignments, but within each alignment the m comparisons proceed from the rightmost pattern character leftward. The pattern always advances forward across the text.
  • One table fits all alphabets. The bad-character table indexes by character code. For Unicode you cannot allocate a 1.1 million entry table; production engines hash the last byte or use a compact map. ripgrep works on UTF-8 byte streams precisely so it can use a 256-entry bad-character table.
  • Boyer-Moore is dead. SIMD-based engines (Hyperscan, glibc's memmem) descend from Crochemore's Two-Way algorithm, but they still embed the right-to-left mismatch idea and a bad-character-style shift. The algorithm is over 45 years old and still ships in every major libc.

How the two heuristics combine

The skeleton of Boyer-Moore is short. Align the pattern P[0..m-1] under the text T at offset i = 0. Compare P[m-1] to T[i + m - 1], P[m-2] to T[i + m - 2], and so on, walking left. If you reach j = 0 with everything equal, report a match at i. If P[j] != T[i + j], compute two shifts: bad_shift(T[i + j], j) and good_shift(j). The bad shift moves the rightmost occurrence of the mismatch character in P[0..j-1] under T[i + j]; if that character is absent, shift by j + 1. The good shift uses the precomputed good-suffix table to slide so the matched suffix re-aligns with another copy or with a pattern prefix. Set i to i plus the larger of the two shifts and repeat until i > n - m.

The bad-character preprocessing is one pass over the pattern: bad[c] = m - 1 - max_index(c) for every character c that appears in P, otherwise bad[c] = m. The good-suffix tables are computed by a Z-array-style sweep in O(m). Total preprocessing is O(m + sigma), and the search loop performs at most 3n character comparisons in the unmodified algorithm, exactly n with Galil's rule, and roughly n/m on average for English text.

Practical implementation notes

  • Skip the good-suffix table for short patterns. If m < 8, the good-suffix tables are larger than the saved comparisons. Most production code falls back to Horspool below this threshold.
  • Use memchr as a sentinel. Before running the full Boyer-Moore inner loop, scan for the last pattern byte with a SIMD-accelerated memchr. This is a 30 to 60% wall-clock speedup on long texts where most alignments will mismatch on the first comparison anyway.
  • Hot path is the bad-character lookup. Inline it. Profile-guided optimization in clang produces 1.5x speedups versus naive lookups by hoisting the table and unrolling the inner mismatch walk.
  • Streaming text. The algorithm is not online — it needs to look back j characters within the current alignment. To search a stream, buffer the last m characters; flush them when committing a match-free region.
  • Multiple needles. Use Aho-Corasick for k > 4 patterns. Boyer-Moore generalizations (Commentz-Walter, set Boyer-Moore) exist but are dominated by AC at moderate k.

Frequently asked questions

How does Boyer-Moore differ from KMP?

KMP scans the text strictly left to right and never re-examines a text character, guaranteeing O(n + m) in the worst case. Boyer-Moore scans the pattern right to left after aligning it to the text and, on a mismatch, slides the pattern forward by the larger of two heuristic shifts. The right-to-left scan is what enables skipping multiple text characters per mismatch — KMP's left-to-right scan can only ever advance by one position per character it reads. In practice on natural-language text with a 256-symbol alphabet, Boyer-Moore inspects roughly n/m characters, while KMP inspects all n. The trade is that Boyer-Moore's basic form is O(nm) worst-case until you bolt on Galil's rule.

What is the bad-character rule?

When the right-to-left scan finds a mismatch at text character c, the pattern is shifted so that the rightmost occurrence of c in the pattern lines up with that text position. If c does not appear in the pattern at all, the entire pattern (m characters) is shifted past it. The table is precomputed in O(m + sigma) time where sigma is the alphabet size, indexed as bad[c] = m - 1 - last_occurrence(c) or m if c is absent. This single rule alone gives the Boyer-Moore-Horspool variant its sublinear behavior on most real text.

What is the good-suffix rule?

When a suffix of the pattern of length k has matched and then a mismatch occurs, the good-suffix rule shifts the pattern so that another occurrence of that matched suffix (or its prefix that is also a pattern suffix) aligns with the text. Two cases are precomputed in O(m). Case 1: another copy of the matched suffix exists earlier in the pattern — shift so that copy aligns. Case 2: only a prefix of the suffix is itself a pattern prefix — shift so the prefix aligns. The good-suffix table mirrors the failure function in KMP and contributes the second of Boyer-Moore's two shift candidates; the algorithm always takes the maximum of bad-character and good-suffix shifts.

When does worst-case O(nm) actually happen?

When the pattern is highly periodic and the text is constructed adversarially. Pattern aaab in text aaaa…aaaa: every alignment matches three characters (aaa) before mismatching at b, and the bad-character shift is only 1, so the work per alignment is m and there are n - m + 1 alignments — O(nm). Galil's modification fixes this by recording, after a partial match of length k, that the next k characters of text agree with the prefix of the pattern, so the next alignment can skip those comparisons. With Galil's rule Boyer-Moore is O(n) worst case while preserving the O(n/m) best case. GNU grep ships the Galil-augmented version.

Why is Boyer-Moore-Horspool a simplification?

Horspool's 1980 simplification keeps only the bad-character heuristic, but uses the rightmost text character of the current alignment (not the mismatch character) to compute the shift. Implementation collapses to a single 256-entry table and a tight inner loop — about 20 lines in C. Empirically on English text Horspool runs within 5 to 15% of the speed of full Boyer-Moore for patterns up to 32 characters, while being substantially easier to write correctly. It is the version many textbooks teach and many production search libraries (e.g. Python's str.find for short needles in CPython) use as the inner kernel.

Why is right-to-left scanning crucial?

Right-to-left lets the very first comparison disqualify a long stretch of text. If you align the pattern so its last character lands on text position i and the text has a character c there that does not occur anywhere in the pattern, you can move the entire pattern past i — m positions in a single step — without ever inspecting positions i - 1 through i - m + 1. Left-to-right scanning has no analogous shortcut, because the first mismatch happens at the leftmost character and gives no information about what's further right. The right-to-left choice is what turns string search from O(n) to O(n/m) on average.