Algorithms

Rabin-Karp String Search

Slide a hash, not the alphabet

Rabin-Karp finds a pattern inside a longer text by hashing a sliding window. A rolling hash updates in O(1) per shift, giving O(n+m) average time and making it the algorithm of choice for multi-pattern and plagiarism detection.

  • Time (average)O(n + m)
  • Time (worst)O(n · m)
  • Time (k patterns)O(n + km)
  • SpaceO(1) (single pattern)
  • OnlineYes

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How Rabin-Karp works

Naive substring search slides a length-m pattern across a length-n text and compares character-by-character at every position. That's O(n · m). Rabin-Karp does something cleverer: it computes a hash of the pattern, then walks the text and hashes each m-character window. If the window's hash matches the pattern's hash, it does the full character comparison; otherwise it slides on.

The breakthrough is the rolling hash. If the window "abcde" at position i has hash H, then the window "bcdef" at position i+1 doesn't need a from-scratch recomputation. You subtract the contribution of 'a', multiply the remainder by the base, and add 'f'. One multiplication, one addition, one subtraction — independent of m.

Concretely, the polynomial hash of a window s[i..i+m-1] is:

H = (s[i]·b^(m-1) + s[i+1]·b^(m-2) + ... + s[i+m-1]·b^0) mod p

where b is a base (often 31, 257, or 1000003) and p is a large prime modulus (often 10^9+7 or a 64-bit prime). The recurrence to slide is:

H' = ((H - s[i]·b^(m-1)) · b + s[i+m]) mod p

Why it's fast in practice

Each shift costs 1 multiplication, 1 addition, 1 subtraction, and 1 modulo — call it five cycles on a modern CPU. Compare that with naive matching, which on average compares 1.06 characters per shift on random text but can blow up to m comparisons on pathological input like searching "aaaab" in "aaaaaa...aaaab".

On commodity hardware, a tight Rabin-Karp loop runs at roughly 100 million characters per second. The dominant cost is memory bandwidth, not arithmetic. For multi-pattern search of k patterns of equal length, you preload all k pattern hashes into a hash set and check membership on each window in O(1) — so total work stays O(n + km), independent of how many patterns you're hunting.

When to use Rabin-Karp

  • You need to search the same text for many patterns — plagiarism detection, dictionary screening, virus signature scanning.
  • The patterns share a fixed length, or can be bucketed by length.
  • You need an algorithm that's easy to extend to 2D matching (image patches) or k-mer counting in genomics.
  • You don't need worst-case guarantees, only fast average performance.

If you only have one pattern and want a hard worst-case bound, KMP or Boyer-Moore are usually the better picks. For modern hardware with SIMD, the simple naive loop with vectorised 16-byte compares often beats both.

Rabin-KarpKMPBoyer-Moore
Worst caseO(n · m)O(n + m)O(n · m)
Average caseO(n + m)O(n + m)O(n / m) sublinear
PreprocessingO(m)O(m) failure functionO(m + σ) bad-char table
Multi-patternNative — one pass for k patternsAho-Corasick variant neededCommentz-Walter variant needed
Best forMany patterns, same textOne pattern, online streamingOne pattern, large alphabet (English text)
2D extensionTrivial (hash 2D windows)Requires Bird-BakerPossible but messy
Real-world usegit diff, MOSS, dedupgrep -F (with extensions)GNU grep, ripgrep (with simd)

Boyer-Moore's sublinear average is a curiosity: it skips ahead by a full pattern length on a mismatch in the bad-character table. That works brilliantly on English (large alphabet, low repetition) and badly on DNA (4-letter alphabet, high repetition).

Pseudo-code

function rabinKarp(text, pattern):
    n = len(text); m = len(pattern)
    if m > n: return -1
    base = 256          # number of possible byte values
    prime = 1_000_000_007
    h_pat = h_win = 0
    high_pow = 1        # base^(m-1) mod prime

    for i in 0..m-1:
        h_pat = (h_pat * base + pattern[i]) mod prime
        h_win = (h_win * base + text[i])    mod prime
        if i < m-1:
            high_pow = (high_pow * base) mod prime

    for i in 0..n-m:
        if h_pat == h_win and text[i..i+m] == pattern:
            return i
        if i < n-m:
            h_win = ((h_win - text[i]*high_pow) * base + text[i+m]) mod prime
            if h_win < 0: h_win += prime
    return -1

JavaScript implementation — rolling hash

function rabinKarp(text, pattern) {
  const n = text.length, m = pattern.length;
  if (m === 0) return 0;
  if (m > n) return -1;

  const BASE = 256n;
  const MOD  = 1000000007n;

  let hPat = 0n, hWin = 0n, highPow = 1n;
  for (let i = 0; i < m; i++) {
    hPat = (hPat * BASE + BigInt(pattern.charCodeAt(i))) % MOD;
    hWin = (hWin * BASE + BigInt(text.charCodeAt(i)))    % MOD;
    if (i < m - 1) highPow = (highPow * BASE) % MOD;
  }

  for (let i = 0; i <= n - m; i++) {
    if (hPat === hWin && text.substr(i, m) === pattern) return i;
    if (i < n - m) {
      hWin = (hWin - BigInt(text.charCodeAt(i)) * highPow) % MOD;
      hWin = (hWin * BASE + BigInt(text.charCodeAt(i + m))) % MOD;
      if (hWin < 0n) hWin += MOD;
    }
  }
  return -1;
}

The BigInt calls look heavy but JavaScript engines fold them into native 64-bit ops as long as values stay under 2^53. If you're hashing megabytes, switch to a 32-bit base × 32-bit char product into a 64-bit accumulator with a Mersenne prime like 2^61 - 1 for branch-free modulo.

Python implementation — plagiarism detection

The classic application: scan a corpus of student submissions, hash every k-shingle (sliding window of k words), and report which submissions share suspiciously many fingerprints.

def shingle_hashes(text, k=5, base=257, mod=10**9 + 7):
    """Yield rolling hashes of every k-word shingle."""
    words = text.split()
    if len(words) < k:
        return
    high_pow = pow(base, k - 1, mod)
    h = 0
    for i in range(k):
        h = (h * base + hash(words[i]) % mod) % mod
    yield h
    for i in range(k, len(words)):
        h = (h - hash(words[i - k]) % mod * high_pow) % mod
        h = (h * base + hash(words[i]) % mod) % mod
        yield h

def jaccard_overlap(doc_a, doc_b, k=5):
    a = set(shingle_hashes(doc_a, k))
    b = set(shingle_hashes(doc_b, k))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Real moss-style detector keeps only every w-th shingle (winnowing)
# to bound storage at O(n / w) hashes per document.

Stanford's MOSS plagiarism detector uses exactly this construction — k-gram rolling hashes plus the winnowing scheme of Schleimer, Wilkerson and Aiken (2003) — to compare millions of submissions against each other in roughly linear time.

Variants and design choices

Choosing base and modulus

The base should be larger than the alphabet (256 for bytes, 27+ for lowercase English, 4 for DNA). The modulus should be prime and roughly 10^9 to 10^18. Composite moduli or moduli with small factors leak structure that adversaries can exploit.

Single vs double hashing

A single 64-bit hash collides on roughly 1 in 10^9 pairs. For competitive programming or untrusted input, compute two hashes with different (base, prime) pairs and only flag a match when both agree — collision probability drops to ~10^-18. The cost is exactly 2× the rolling step, which is still negligible.

Anti-hash test cases

Codeforces problem-setters know how to construct "anti-hash" inputs that force collisions when the modulus is known. Defenses: use a randomly chosen prime at startup, use 64-bit Mersenne primes, or run two hashes with independent randomly chosen bases. Pre-set 10^9+7 with base 31 is a classic anti-hash target — avoid it for contests.

2D Rabin-Karp

To find an a×b pattern in an h×w image, hash each row with a 1D rolling hash, then run a vertical rolling hash over the column of row-hashes. Total time O(h·w + a·b) regardless of pattern size — useful for image patch matching and 2D pattern recognition.

Multi-pattern search

Hash all k patterns of length m into a hash set. As you slide the length-m window over the text, check whether the window hash is in the set. Total time O(n + km) versus O(n·k) for running k separate searches. This is how a primitive antivirus signature scanner works.

Common bugs and edge cases

  • Negative hashes after subtraction. In languages with signed integers, (h - x) % p can be negative. Always add p and re-modulo, or use unsigned types.
  • Forgetting the verification step. Hashing alone is not matching. After every hash hit, do a character-by-character compare. Skipping this turns rare collisions into wrong answers.
  • Recomputing high_pow inside the loop. The factor base^(m-1) mod p is constant — precompute it once. A surprising number of slow Rabin-Karp submissions recompute it every iteration.
  • Using String.substring on the verify step in Java. Java's String.substring allocates; use regionMatches instead. Same trap exists in JavaScript with substr versus a manual loop.
  • Pattern longer than text. Guard against m > n before the precompute loop, or you'll index out of bounds.
  • Empty pattern. By convention indexOf("") returns 0. Decide your contract and test it.

Frequently asked questions

Why does Rabin-Karp use modular hashing instead of normal hashing?

A naive polynomial value like 31^m for m=1000 has hundreds of digits. Modular arithmetic keeps every intermediate hash inside a 64-bit integer, so every operation is constant-time on real hardware. The cost is occasional collisions, which the algorithm catches with a verification step.

What is a rolling hash?

A rolling hash is a hash function that, when the input window slides one character to the right, can be updated in O(1) by subtracting the contribution of the leaving character and adding the new one. Rabin-Karp uses a polynomial rolling hash with a prime modulus.

Is Rabin-Karp faster than KMP?

For one pattern in one text, KMP is faster and has guaranteed O(n+m) worst case. Rabin-Karp shines when you search for many patterns in the same text — you hash all patterns once, then slide a single window across the text and look up each hash in a hash set.

What is Rabin-Karp's worst-case complexity?

O(n·m) — every window can hash-collide with the pattern and require character-by-character verification. With a well-chosen prime modulus around 10^9, collisions are rare enough that the average case stays at O(n+m).

Why use double hashing in competitive programming?

Single-modulus hashes can be defeated by adversarial inputs that deliberately collide. Computing two independent hashes (different bases or different primes) and requiring both to match drops collision probability to roughly 1 in 10^18, which is safe even against contest attackers.

Can Rabin-Karp find approximate matches?

Not directly — a one-character change produces a totally different hash. For approximate search use locality-sensitive hashing (MinHash, SimHash) which is what production plagiarism detectors and document deduplicators actually use.