Probabilistic Algorithms
MinHash
Estimate set similarity in a few hundred bytes, not gigabytes
MinHash estimates Jaccard similarity between two sets in tiny constant memory. Hash every element, keep the minimum, repeat with k functions. P[match] = J(A, B). Foundation of LSH and near-duplicate detection at web scale.
- EstimatesJaccard similarity J(A, B) = |A∩B|/|A∪B|
- Signature sizek integers (typically 128–1024)
- Time per elementO(k) hashes per insert
- Estimator error±O(1/√k) with high probability
- CompositionPairs with LSH banding for sub-linear search
- InventorAndrei Broder (AltaVista, 1997)
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How MinHash works
Suppose two news sites publish nearly the same wire-service article. Comparing them character by character takes time proportional to article length. Comparing every pair across a million-article archive takes time proportional to a trillion. MinHash collapses each article into a fingerprint of, say, 200 integers, and lets you compare any two articles in 200 cheap integer compares — while still recovering the true Jaccard similarity to within a percent.
The trick is this: pick a random hash function h. Apply it to every element of set A; remember only the minimum. Apply the same h to every element of set B; remember only the minimum. Ask: are the two minima equal?
Imagine the elements of A ∪ B sorted by their hash value. Exactly one element holds the overall minimum. That element is in A ∩ B iff it's in both sets. Since the hash function shuffles elements uniformly, every element is equally likely to be the minimum — so the probability that the winner sits in the intersection equals |A ∩ B| / |A ∪ B|. That ratio is the Jaccard similarity. So P[minHash(A) = minHash(B)] = J(A, B) exactly.
One hash function gives one Bernoulli trial. Repeat with k independent hash functions, count how many positions agree, divide by k — that's an unbiased estimate of Jaccard, with standard error O(1/√k).
The signature
The signature of a set A is the vector:
sig(A) = [ min_{x ∈ A} h_1(x),
min_{x ∈ A} h_2(x),
...,
min_{x ∈ A} h_k(x) ]
For k = 128 32-bit hashes, the signature is 512 bytes — regardless of whether the underlying set has 100 elements or 100 million. Two signatures are compared in O(k) time. Storage is independent of element count or element size.
To estimate J(A, B), count the matching positions and divide:
estimate = #{ i : sig(A)[i] = sig(B)[i] } / k
Turning documents into sets
MinHash measures set similarity. To apply it to text you first turn each document into a set. The standard trick is shingling: slide a window of w consecutive tokens across the document; each window is a shingle. The document's set is the set of all its shingles.
doc = "the quick brown fox jumps"
shingles(w=3) = { "the quick brown",
"quick brown fox",
"brown fox jumps" }
w = 5 words or 8 characters are common choices. Smaller w means more overlap (high similarity for similar topics); larger w means stricter equality (only near-verbatim copies match).
MinHash vs SimHash vs naive comparison
| MinHash | SimHash | Exact Jaccard | |
|---|---|---|---|
| Similarity measured | Jaccard (sets) | Cosine (weighted vectors) | Jaccard |
| Signature size | k × 32 or 64 bits | 64 bits (single fingerprint) | Full set |
| Distance metric | Hamming on signature positions | Hamming on bit vector | |A∩B|/|A∪B| direct |
| Insertion cost | O(k) hashes per element | O(d) ops per feature | O(1) per element |
| Pairwise compare | O(k) | O(64) — one xor + popcount | O(|A| + |B|) |
| Error | ±O(1/√k) | ±O(1/√bits) | Zero |
| Famous deployment | AltaVista web clustering | Google near-dup crawler | Doesn't scale to billions |
| Pairs well with | LSH banding | LSH on bit slices | Nothing — too expensive |
MinHash and SimHash both compress sets/vectors into fingerprints small enough that pairwise comparison becomes feasible. The pick depends on whether your similarity is set-based (Jaccard, MinHash) or weighted/angular (cosine, SimHash).
Walk-through example
Two sets:
A = {apple, banana, cherry, date}
B = {banana, cherry, date, elderberry}
|A ∩ B| = 3 |A ∪ B| = 5
True Jaccard = 3 / 5 = 0.60
Pick 4 hash functions and apply them (using a toy hash for clarity):
h1 h2 h3 h4
apple 12 45 23 67
banana 3 78 9 12
cherry 18 11 34 8
date 25 33 17 90
elderberry 14 7 55 21
sig(A) = [ min(12,3,18,25), min(45,78,11,33), min(23,9,34,17), min(67,12,8,90) ]
= [ 3, 11, 9, 8 ]
sig(B) = [ min(3,18,25,14), min(78,11,33,7), min(9,34,17,55), min(12,8,90,21) ]
= [ 3, 7, 9, 8 ]
Matches at positions 1, 3, 4 (positions 1-indexed).
Estimate = 3 / 4 = 0.75
The estimate (0.75) overshoots the true value (0.60) — with only 4 hash functions, variance is huge. Real systems use 128–1024 functions and the law of large numbers kicks in.
Pseudo-code
// One-pass signature builder.
minhash_signature(set S, hash_functions H[1..k]):
sig = [ +∞ for i in 1..k ]
for x in S:
for i in 1..k:
sig[i] = min(sig[i], H[i](x))
return sig
jaccard_estimate(sig_A, sig_B):
matches = 0
for i in 1..k:
if sig_A[i] == sig_B[i]: matches += 1
return matches / k
JavaScript implementation
// 32-bit MurmurHash3-style mixer; cheap, decent.
function mix32(x, seed) {
let h = (x ^ seed) >>> 0;
h = Math.imul(h ^ (h >>> 16), 0x85ebca6b);
h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35);
return (h ^ (h >>> 16)) >>> 0;
}
class MinHash {
constructor(k = 128, seed = 42) {
this.k = k;
this.seeds = Array.from({ length: k }, (_, i) => mix32(seed, i + 1));
this.sig = new Uint32Array(k).fill(0xFFFFFFFF);
}
add(elementHash) {
// elementHash is a pre-hashed 32-bit integer for the element.
for (let i = 0; i < this.k; i++) {
const h = mix32(elementHash, this.seeds[i]);
if (h < this.sig[i]) this.sig[i] = h;
}
}
static jaccard(a, b) {
if (a.k !== b.k) throw new Error('signature length mismatch');
let matches = 0;
for (let i = 0; i < a.k; i++) if (a.sig[i] === b.sig[i]) matches++;
return matches / a.k;
}
}
// Usage — shingled documents
function shingles(text, w = 5) {
const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
const out = new Set();
for (let i = 0; i + w <= tokens.length; i++) out.add(tokens.slice(i, i + w).join(' '));
return out;
}
function fingerprint(text) {
const mh = new MinHash(128);
for (const sh of shingles(text)) {
// simple string hash → 32-bit int
let h = 2166136261;
for (let i = 0; i < sh.length; i++) { h ^= sh.charCodeAt(i); h = Math.imul(h, 16777619); }
mh.add(h >>> 0);
}
return mh;
}
Python implementation
import hashlib, struct
class MinHash:
def __init__(self, k=128, seed=42):
self.k = k
self.seeds = [self._mix(seed, i + 1) for i in range(k)]
self.sig = [0xFFFFFFFF] * k
@staticmethod
def _mix(x, s):
return int.from_bytes(
hashlib.blake2b(struct.pack('>QQ', x, s), digest_size=4).digest(),
'big',
)
def add(self, element):
h_elem = int.from_bytes(
hashlib.blake2b(element.encode(), digest_size=4).digest(),
'big',
)
for i in range(self.k):
h = self._mix(h_elem, self.seeds[i])
if h < self.sig[i]:
self.sig[i] = h
@staticmethod
def jaccard(a, b):
assert a.k == b.k
return sum(1 for i in range(a.k) if a.sig[i] == b.sig[i]) / a.k
def shingles(text, w=5):
toks = text.lower().split()
return { ' '.join(toks[i:i+w]) for i in range(len(toks) - w + 1) }
def fingerprint(text, k=128):
mh = MinHash(k)
for sh in shingles(text):
mh.add(sh)
return mh
When to use MinHash
- Document near-duplicate detection. Shingled news articles, blog posts, scraped pages. The web is full of near-duplicates; MinHash plus LSH finds them in sub-linear time.
- LLM training-data deduplication. Pipelines like CCNet, C4, and RefinedWeb run MinHash + LSH over web crawls before training. Removing ~30% near-duplicates measurably improves perplexity and reduces memorization.
- Recommendation overlap. Each user's interaction set (items clicked, songs liked) becomes a MinHash signature. Similar users — high Jaccard on these sets — are nearest neighbors for collaborative filtering.
- Plagiarism / copy detection. Academic and journalistic plagiarism detectors fingerprint documents this way; the signatures compare in milliseconds against millions of source documents.
- Genomic similarity (k-mer sets). Mash and sourmash use MinHash signatures to compare microbial genomes in seconds — what BLAST would take days to do exactly.
Pairing with LSH banding
MinHash gives signatures that you can compare in O(k). But comparing N documents pairwise is still O(N²·k). The standard accelerator is LSH banding:
- Choose
bbands ofrrows each, wherek = b · r. - For each band, hash the
rsignature integers in that band to a bucket. - Two documents become candidate pairs if any band hashes them to the same bucket.
Probability of becoming a candidate when true Jaccard is s: 1 − (1 − s^r)^b. That's an S-curve. Choose b, r so the inflection point lands at your target similarity threshold (e.g. 0.8). See Locality-Sensitive Hashing for the full treatment.
Common pitfalls
- Reusing the same seed across hash functions. The k functions must be independent — using the same seed gives k identical results and zero variance reduction. Use independent seeds, or parameterize a universal family like
h_i(x) = (a_i · x + b_i) mod pwith distinct random(a_i, b_i). - Cryptographic overkill. SHA-256 per element per hash function is wasteful — MurmurHash, xxhash, or affine-permutation tricks are an order of magnitude faster and equally good for similarity.
- Forgetting to normalize input. Two documents differing only in whitespace, capitalization, or HTML tags will produce wildly different shingles. Always tokenize and casefold before shingling.
- Wrong shingle width. w = 1 (just words) gives bag-of-words similarity, near 1.0 for any two English texts. w = 20 gives only exact-paragraph matches. Most papers settle on 5–9 words or 8–12 characters.
- Treating MinHash as exact. Two documents with signature Jaccard 0.7 might have true Jaccard anywhere between 0.6 and 0.8 if you only used k = 100. Always verify candidate pairs with an exact computation before flagging duplicates.
Performance analysis
Insertion cost per element: k hash evaluations, k comparisons, k potential writes. With a good mixer (~3 ns per hash on modern CPUs), k = 128 costs roughly 400 ns per element added. A 10,000-shingle document fingerprints in about 4 ms.
Storage per signature: k × 4 bytes for 32-bit hashes. k = 128 → 512 bytes per document. A 100-million-document corpus needs 51 GB of signatures — fits in RAM on a single beefy node. The same corpus exact-Jaccard would need every document's shingles in memory, easily petabyte-scale.
Pairwise comparison: O(k) integer compares — k = 128 is ~50 ns with SIMD. Without LSH banding, comparing 100 million documents pairwise costs 5 × 10¹⁵ ns ≈ 60 days on a single core. With LSH banding (b = 20, r = 6.4 effectively), the candidate set shrinks to roughly k² near-duplicate pairs, finished in minutes.
Estimator variance: with k independent hashes and true Jaccard s, the estimator's standard deviation is √(s(1-s)/k). At s = 0.5, k = 128, σ ≈ 0.044 — typical 95% confidence interval is ±0.087. Doubling k shrinks the interval by √2.
Frequently asked questions
Why does P[minHash(A) = minHash(B)] equal J(A, B)?
Imagine the elements of A ∪ B sorted by their hash values. The minimum hash across A ∪ B comes from exactly one element. It's the same minimum for both sets iff that element belongs to A ∩ B. Out of |A ∪ B| equally-likely candidates for 'has the smallest hash,' |A ∩ B| of them are in the intersection. So the probability of a match is |A ∩ B| / |A ∪ B| — that's the definition of Jaccard.
How accurate is MinHash with k hash functions?
Each signature position is a Bernoulli trial with success probability J(A, B). The estimate is the fraction of matching positions across k functions. By Hoeffding's inequality, the additive error is bounded by O(1/√k) with high probability. k = 100 gives roughly ±0.1 error at 95% confidence; k = 400 gives ±0.05. Real systems use k between 128 and 1024.
What's the difference between MinHash and SimHash?
Both produce compact fingerprints, but for different similarities. MinHash estimates Jaccard similarity over sets — great for shingled text where order doesn't matter. SimHash estimates cosine similarity over weighted feature vectors and produces a single binary fingerprint where Hamming distance approximates angular distance. Google's web crawl uses SimHash for near-duplicate detection; AltaVista and Broder's original web-clustering work used MinHash.
How do you turn documents into sets for MinHash?
Shingling. Slide a window of k consecutive words (or characters) across the text — each window is a shingle. k=5 words is a common choice. A document becomes the set of all its shingles. Two documents with similar shingle sets have similar wording at the local level; MinHash then estimates that overlap without storing every shingle.
Do you actually use k separate hash functions?
Conceptually yes, but in practice you usually parameterize one good hash family with k different seeds — h_i(x) = (a_i * h(x) + b_i) mod p for random pairs (a_i, b_i). Cheaper than picking k independent functions, and the universal-hashing analysis still goes through. Cryptographic hashes like SHA-256 are overkill; xxhash, MurmurHash, or simple linear congruential variants are standard.
How is MinHash combined with LSH for fast retrieval?
MinHash gives signatures of length k. To find candidate pairs quickly, split each signature into b bands of r rows (k = b·r). Two documents become candidates if any band hashes to the same bucket. Choosing b and r tunes the recall-precision tradeoff — a band match happens with probability 1 − (1 − s^r)^b where s is the true Jaccard. Pick b and r so that this S-curve crosses 0.5 near your target similarity.
Where is MinHash deployed in real systems?
AltaVista used Broder's MinHash to cluster the entire 1997 web crawl into near-duplicate groups. Modern uses — Common Crawl deduplication, training-data filtering for large language models (the C4 and RefinedWeb pipelines run MinHash + LSH to remove paraphrases), recommendation systems comparing user-interest sets, Cassandra's secondary-index dedup, and academic-paper plagiarism detectors. Whenever the entities are sets, MinHash applies.