Probabilistic Algorithms
Locality-Sensitive Hashing
Hashes that collide on similar inputs — sub-linear nearest-neighbor search
LSH is a family of hashes where similar inputs collide with high probability, dissimilar with low. Banding on MinHash, hyperplanes for cosine, p-stable for Euclidean. Powers near-duplicate detection, recommendations, and ANN search at scale.
- Property(r1, r2, p1, p2)-sensitive: similar collide more
- Query timeO(N^ρ) sub-linear in N, ρ = log p1 / log p2
- Hash familiesMinHash · random hyperplane · p-stable
- For JaccardMinHash + banding (b·r = k signature length)
- For cosineRandom hyperplanes (sign of v·x)
- OriginatorsIndyk & Motwani 1998
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How LSH works
A regular hash function tries to spread inputs uniformly — even small input changes produce wildly different outputs (that's the avalanche property). LSH does the opposite: similar inputs are designed to collide. If you can build such a hash, you can use it as an index for similarity search: lookup the bucket and the items inside are your candidate near-neighbors.
Formally, a hash family H is (r1, r2, p1, p2)-sensitive for distance d when:
- If
d(x, y) ≤ r1thenP[h(x) = h(y)] ≥ p1. - If
d(x, y) ≥ r2thenP[h(x) = h(y)] ≤ p2. r1 < r2andp1 > p2.
The trick is then to amplify the gap between p1 and p2 by concatenating and OR-ing hashes (AND-construction within a band, OR-construction across bands). Indyk and Motwani proved this gives sub-linear O(N^ρ) query time with ρ = log p1 / log p2 < 1.
Banding on MinHash signatures
The most-used LSH scheme starts with MinHash signatures and applies banding:
- Compute a MinHash signature of length
kfor each set. - Split the signature into
bbands ofrrows each (k = b · r). - For each band, hash the
rintegers in that band to a bucket. Each band uses its own hash table. - Two items are candidate pairs if any band hashes them to the same bucket.
The collision probability for one band, given true similarity s, is s^r (all r positions must match — AND construction). The probability of being a candidate is 1 − (1 − s^r)^b across all b bands (OR across bands). Plot this against s and you get a sharp S-curve. The inflection point sits near s ≈ (1/b)^(1/r).
Tune your S-curve:
s = target similarity threshold (e.g. 0.85)
k = signature length (e.g. 100)
Pick b, r with b·r = k so (1/b)^(1/r) ≈ s.
Example — s = 0.85:
b = 20, r = 5 → inflection at (1/20)^(1/5) ≈ 0.55, recall at 0.85 ≈ 99.5%
b = 25, r = 4 → inflection at (1/25)^(1/4) ≈ 0.45, more candidates, more recall
b = 10, r = 10 → inflection at (1/10)^(1/10) ≈ 0.79, fewer candidates, less recall
Random hyperplanes for cosine
For vectors compared by cosine similarity, the natural LSH is the random hyperplane hash. Pick a random unit vector v; the hash of x is the sign of v · x (one bit). The probability that two vectors agree on the sign is:
P[ sign(v·x) = sign(v·y) ] = 1 − θ(x, y) / π
Where θ is the angle between them. Smaller angle → higher probability of match. Concatenate b independent random hyperplanes to get a b-bit fingerprint; build L independent hash tables; candidates are anything that matches in any table. This is essentially how SimHash works under the hood.
p-stable LSH for Euclidean
For Lp norms (with p ∈ (0, 2]), Datar et al. introduced p-stable LSH. Draw a random vector a from a p-stable distribution (Gaussian for L2, Cauchy for L1). The hash of x is:
h(x) = ⌊ (a · x + b) / w ⌋
Where b ~ Uniform[0, w] shifts buckets and w sets bucket width. Closer points project to nearby values and land in the same bucket. Larger w → more recall but bigger candidate sets.
LSH variants compared
| MinHash + banding | Hyperplane | p-stable | SimHash | |
|---|---|---|---|---|
| Distance/similarity | Jaccard (sets) | Cosine (angles) | L1, L2 (vectors) | Hamming-on-cosine |
| Input type | Sets / shingles | Real vectors | Real vectors | Weighted features |
| Hash output | k integers, grouped into bands | b bits | integer per dimension | 64 bits (one fingerprint) |
| Tune-able knob | b, r in banding | b bits, L tables | bucket width w | bit length |
| Famous user | AltaVista web clustering, LLM dedup | Music similarity (Spotify) | FAISS HKM, image retrieval | Google near-dup crawl |
| Theoretical ρ | ~log(1/s) / log(b) | ~log p1 / log p2 | Depends on c, w | Hyperplane analysis |
Why LSH beats tree methods in high dimensions
K-d trees, ball trees, and cover trees give exact nearest-neighbor in low dimensions. They degrade catastrophically above ~20 dimensions — the curse of dimensionality means every query touches nearly the whole tree. LSH sidesteps this by being randomized and approximate: it accepts that you might miss a few neighbors in exchange for query times that don't blow up with dimension.
For dimension d and N points, k-d trees query in Θ(N^(1−1/d)), which approaches Θ(N) for large d. LSH queries in O(N^ρ · d) where ρ < 1 depends on the gap between target and threshold similarity. For typical settings (similarity 0.8 vs 0.5), ρ sits around 0.4 — a query touches roughly N^0.4 items, e.g. ~1000 candidates out of a billion.
Pseudo-code — banded MinHash LSH
build(documents, b, r):
signatures = [ minhash_signature(d) for d in documents ] // each length k = b·r
buckets = [ HashMap() for i in 1..b ]
for doc_id, sig in enumerate(signatures):
for band in 1..b:
band_slice = sig[(band-1)*r : band*r]
key = hash(band_slice)
buckets[band].setdefault(key, []).append(doc_id)
return buckets
query(q, b, r, buckets):
sig = minhash_signature(q)
candidates = set()
for band in 1..b:
band_slice = sig[(band-1)*r : band*r]
key = hash(band_slice)
if key in buckets[band]:
candidates.update(buckets[band][key])
// Verify each candidate with exact Jaccard or full MinHash compare
return [ c for c in candidates if exact_jaccard(q, c) ≥ threshold ]
Python implementation
from collections import defaultdict
import hashlib
class BandedLSH:
def __init__(self, b, r):
self.b, self.r = b, r
self.k = b * r
self.tables = [defaultdict(list) for _ in range(b)]
def _band_key(self, band_slice):
return hashlib.blake2b(
b'|'.join(str(x).encode() for x in band_slice),
digest_size=8,
).digest()
def insert(self, doc_id, signature):
assert len(signature) == self.k
for i in range(self.b):
band = signature[i*self.r:(i+1)*self.r]
self.tables[i][self._band_key(band)].append(doc_id)
def query(self, signature):
cands = set()
for i in range(self.b):
band = signature[i*self.r:(i+1)*self.r]
cands.update(self.tables[i].get(self._band_key(band), []))
return cands
# Tuning helper: pick (b, r) so the S-curve crosses 0.5 at threshold s
def tune(k, s):
best = None
for r in range(1, k + 1):
if k % r: continue
b = k // r
# inflection where 1 − (1−s^r)^b ≈ 0.5
threshold = (1 / b) ** (1 / r)
gap = abs(threshold - s)
if best is None or gap < best[2]:
best = (b, r, gap, threshold)
return best[:2], best[3]
JavaScript implementation
function hashBand(band) {
// FNV-1a 32-bit for the band tuple
let h = 2166136261;
for (const v of band) {
// mix the integer in 4 bytes
for (let s = 24; s >= 0; s -= 8) {
h ^= (v >>> s) & 0xff;
h = Math.imul(h, 16777619);
}
}
return h >>> 0;
}
class BandedLSH {
constructor(b, r) {
this.b = b; this.r = r; this.k = b * r;
this.tables = Array.from({ length: b }, () => new Map());
}
insert(id, sig) {
for (let i = 0; i < this.b; i++) {
const band = sig.slice(i * this.r, (i + 1) * this.r);
const key = hashBand(band);
let list = this.tables[i].get(key);
if (!list) { list = []; this.tables[i].set(key, list); }
list.push(id);
}
}
query(sig) {
const cands = new Set();
for (let i = 0; i < this.b; i++) {
const band = sig.slice(i * this.r, (i + 1) * this.r);
const list = this.tables[i].get(hashBand(band));
if (list) for (const id of list) cands.add(id);
}
return cands;
}
}
When to use LSH
- Large-scale near-duplicate detection. Web crawls, training-corpus deduplication, plagiarism detection — anywhere N is in the millions to billions and pairwise comparison is impossible.
- Streaming similarity search. Append-only datasets where you can't rebuild an index every day. LSH inserts are O(b) per item; queries are O(b + candidate_size).
- Approximate nearest neighbor in high dimensions. Word embeddings (300d), image features (2048d), user-behavior vectors. LSH gives you a candidate set in
O(N^ρ)time; you verify with exact distance. - Privacy-preserving similarity. Hashes don't reveal originals. LSH lets you exchange signatures without exposing raw data — used in private record linkage, federated entity resolution.
- Out-of-core scenarios. When data won't fit in RAM. The hash tables can shard onto disk or across machines; bucket lookups are local.
Common pitfalls
- Wrong choice of (b, r). The most common bug. Plot the S-curve
1 − (1 − s^r)^bagainstsbefore deploying. Make sure the inflection lands at your threshold, not above or below. - Hashing the band tuple incorrectly. If your band hash collides for different tuples, you'll get spurious candidates. Use a strong 64-bit hash (Blake2b, xxhash64) on the concatenated bytes.
- Forgetting to verify candidates. LSH gives candidate pairs — most have similarity well below threshold. Always run an exact computation (full Jaccard, full cosine) on each candidate before reporting.
- Using cosine LSH on raw counts. Cosine on un-normalized count vectors weights long documents disproportionately. Normalize first (TF-IDF, L2 norm).
- Confusing distance and similarity directions. LSH families are sometimes stated for distances (closer = more likely to collide), sometimes for similarities. Mix them up and your S-curve points the wrong way.
- Skipping the dedup step. The same pair can appear as a candidate in multiple bands. Dedup the candidate set before verification, or you'll do duplicate work.
Performance analysis
Insertion: O(b) bucket inserts per item. With b = 20, that's 20 hash-map writes — sub-microsecond. Building an index of 100 million documents takes minutes once signatures exist.
Memory: O(N · b) entries across all tables. For N = 100 million and b = 20, that's 2 billion entries, roughly 32 GB with 16-byte keys plus integer ids — fits comfortably in RAM. Adding more bands grows linearly.
Query: O(b) bucket lookups, returning a candidate set. Expected candidate-set size depends on the data distribution and S-curve choice. For typical near-duplicate detection (target similarity 0.85), each query yields tens to hundreds of candidates from a 100M-document corpus; verification (exact Jaccard) costs O(|cands| · k).
End-to-end speedup: pairwise exact-Jaccard over N items is O(N²). LSH-banded MinHash brings that down to O(N · b + total_candidates · k). On the C4 deduplication pipeline (~400 million docs), LSH found near-dups in ~12 hours on a single 64-core machine — the brute-force equivalent would have taken years.
Modern context — HNSW, FAISS, and the rise of vector DBs
LSH is no longer the state of the art for in-memory ANN benchmarks. HNSW (Hierarchical Navigable Small World) graphs and IVF-PQ (FAISS) generally win recall-vs-latency comparisons in the milliseconds-per-query regime. LSH remains essential in three places:
- Set-similarity dedup. MinHash + banding is unbeatable for Jaccard. No HNSW variant matches.
- Streaming/append-only. HNSW's incremental insertion is messy; LSH's is trivial.
- Theoretical guarantees. LSH has provable sub-linear bounds; HNSW is empirically excellent but theoretically opaque.
Frequently asked questions
What does 'locality-sensitive' actually mean?
A hash family H is (r1, r2, p1, p2)-sensitive for a distance d if d(x, y) ≤ r1 implies P[h(x) = h(y)] ≥ p1, and d(x, y) ≥ r2 implies P[h(x) = h(y)] ≤ p2, with p1 > p2 and r1 < r2. In plain terms — similar points collide often, dissimilar points rarely. Different distances (Jaccard, cosine, Euclidean) have different hash families that exhibit this property.
How does LSH banding work on MinHash signatures?
Split a k-position MinHash signature into b bands of r rows (k = b·r). Hash each band independently to a bucket. Two documents are candidate pairs if any band hashes to the same bucket. Probability of being a candidate, given true Jaccard s: 1 − (1 − s^r)^b. The S-curve has an inflection near s ≈ (1/b)^(1/r) — tune b and r to place it at your similarity threshold.
How do you LSH on cosine similarity?
Random hyperplanes. Pick a random unit vector v. The hash of point x is the sign of (v · x) — one bit. The probability that two points hash the same is 1 − θ/π, where θ is the angle between them. Cosine-similar pairs (small θ) collide often. Concatenate b independent random hyperplanes to get b-bit hashes; repeat with L tables and OR the results.
What is p-stable LSH?
For Euclidean and Lp distances, project the point onto a random vector drawn from a p-stable distribution (Gaussian for L2, Cauchy for L1), then quantize the projection into buckets of width w. Closer points project to nearby values and end up in the same bucket. Datar-Indyk-Immorlica-Mirrokni 2004. The width w trades off recall against bucket size.
How is LSH different from k-d trees or ball trees?
Tree-based methods build a deterministic partition of space; they're exact but suffer the curse of dimensionality — for dimensions above 20, nearly every query visits the whole tree. LSH is randomized and approximate, but query time stays sub-linear up to dimensions in the thousands. The trade-off — you get back candidate near-neighbors, then verify exactly. Modern alternatives include HNSW graphs and IVF-PQ; LSH is still the simplest method with provable guarantees.
How many LSH tables do you actually need?
Recall scales as 1 − (1 − p^k)^L, where k is hash length per table and L is number of tables, and p is collision probability at the target similarity. For typical web-scale near-duplicate detection (target similarity 0.85), b = 20 bands of r = 5 rows on a k = 100 MinHash gives roughly 99% recall and only a few candidate pairs per document. Memory is O(N·L); pick L to balance recall against RAM.
What replaces LSH today?
For high-dimensional nearest-neighbor search, HNSW (Hierarchical Navigable Small World) and FAISS's IVF-PQ generally beat LSH on recall-vs-latency benchmarks. But LSH remains the only family with strong sub-linear theoretical guarantees, and it stays dominant for streaming and out-of-core scenarios (massive datasets, GPU-unfriendly workloads, append-only indexes). MinHash + banding LSH is still the standard for set-similarity deduplication.