Data Structures
Bloom Filter
Probably yes, definitely no — set membership in 10 bits per item
A Bloom filter is a compact probabilistic data structure that tells you whether an element is "probably in the set" or "definitely not in the set" — never gives a false negative, but allows tunable false positives. Uses ~10 bits per item for a 1% false-positive rate, 100× more space-efficient than storing the items directly. Used by browsers for safe-browsing checks, databases to skip disk reads, and CDN caches to avoid origin requests.
- LookupO(k) — k hash functions
- InsertO(k)
- Space~10 bits per item for 1% FPR; ~14 bits for 0.1% FPR
- False positivesTunable — fewer = more space
- False negativesNever (definitive "not in set")
- Cannot doDelete (without counting variants), enumerate, count
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a Bloom filter works
Three pieces:
- A bit array of size m, initially all zeros.
- k independent hash functions, each producing an index into the bit array.
- An empty initial set with no elements.
Insert(x): compute hash_1(x), hash_2(x), ..., hash_k(x) modulo m. Set those k bits to 1.
Query(x): compute the same k indices. If any is 0, x is definitely not in the set. If all are 1, x is probably in the set (might be a false positive).
The asymmetry is the key insight. False positives happen because some other elements happened to set the same bits. False negatives can't happen because bits never go from 1 back to 0.
The false positive math
After inserting n items into an m-bit array with k hash functions, the probability that a specific bit is still 0 is roughly (1 − 1/m)^(kn) ≈ e^(−kn/m).
The probability of a false positive (all k checked bits happen to be 1 for an element that wasn't inserted) is approximately:
p = (1 − e^(−kn/m))^k
To minimize p for given m and n, the optimal k is (m/n) × ln(2) ≈ 0.693 × (m/n). At that optimum, p ≈ (1/2)^k.
| Bits per item (m/n) | Optimal k | False positive rate |
|---|---|---|
| 4 | 3 | ~14% |
| 8 | 6 | ~2% |
| 10 | 7 | ~1% — sweet spot |
| 14 | 10 | ~0.1% |
| 20 | 14 | ~0.0005% |
| 32 | 22 | ~10^-9 |
The standard "10 bits per item, ~1% FPR" balance is what most production deployments target. Cutting FPR by 10× costs ~4 more bits per item; cutting it by 100× costs ~10 more bits.
Bloom filter vs hash table for membership
| Bloom filter | Hash set | |
|---|---|---|
| Space (1M strings) | ~1.2 MB at 1% FPR | ~80 MB |
| Insert | O(k) — set k bits | O(L) where L is key length |
| Query | O(k) — read k bits | O(L) |
| False positives | Tunable rate | None |
| Delete | Not possible (without counting variant) | Trivial |
| Enumerate elements | No — element data not stored | Yes |
| Best for | "Probably in or definitely not" | Exact membership + lookups |
The 10-100× space saving is what matters. For a billion-element set, the difference is 10 GB vs 100 GB. The trade-off — tunable false positives — is acceptable when downstream code can handle them.
When to use a Bloom filter
- Cascade in front of slow lookups. The textbook pattern. Database key not found in Bloom filter? Skip the disk read. Browser URL not in malware list? Skip the network call. CDN cache miss prediction? Probably not in cache.
- Distributed system membership. Each node maintains a Bloom filter of its keys. Other nodes ask "do you have X?" by querying the filter — no full state synchronization needed. Cassandra, RocksDB, Apache Spark all do this.
- Spam filtering. Maintain a Bloom filter of known-malicious URLs, IPs, or hashes. Lookup is O(k); the filter fits in tens of MB even for hundreds of millions of entries.
- Block storage deduplication. Before checking if a block is already stored, query the Bloom filter. Misses skip the deduplication index lookup entirely.
- Web crawlers. "Have I already crawled this URL?" queried billions of times. A Bloom filter prevents revisits without storing the full URL set.
JavaScript implementation
class BloomFilter {
constructor(expectedItems, falsePositiveRate = 0.01) {
// m = -n × ln(p) / (ln 2)^2; k = (m/n) × ln 2
this.m = Math.ceil(-expectedItems * Math.log(falsePositiveRate) / Math.LN2 ** 2);
this.k = Math.ceil((this.m / expectedItems) * Math.LN2);
this.bits = new Uint8Array(Math.ceil(this.m / 8));
}
// FNV-1a hash with two seeds — gives k pseudo-independent hashes via double hashing
_hashes(item) {
const str = typeof item === 'string' ? item : JSON.stringify(item);
let h1 = 2166136261, h2 = 5381;
for (let i = 0; i < str.length; i++) {
h1 = ((h1 ^ str.charCodeAt(i)) * 16777619) >>> 0;
h2 = ((h2 << 5) + h2 + str.charCodeAt(i)) >>> 0;
}
const result = [];
for (let i = 0; i < this.k; i++) {
result.push((h1 + i * h2) % this.m);
}
return result;
}
add(item) {
for (const idx of this._hashes(item)) {
this.bits[idx >> 3] |= 1 << (idx & 7);
}
}
has(item) {
for (const idx of this._hashes(item)) {
if ((this.bits[idx >> 3] & (1 << (idx & 7))) === 0) return false;
}
return true; // probably
}
// Estimated cardinality (number of items inserted) — useful for monitoring
approximateSize() {
let setBits = 0;
for (const byte of this.bits) {
let b = byte;
while (b) { setBits += b & 1; b >>= 1; }
}
return Math.round(-this.m * Math.log(1 - setBits / this.m) / this.k);
}
}
const bf = new BloomFilter(1000000, 0.01);
bf.add('[email protected]');
bf.has('[email protected]'); // true
bf.has('[email protected]'); // false (or rarely true — false positive)
Bloom filter variants
| Variant | Adds capability | Cost |
|---|---|---|
| Counting Bloom filter | Delete | 4× space (4-bit counters instead of 1-bit) |
| Scalable Bloom filter | Grows automatically | Slightly higher FPR; multiple filters in series |
| Cuckoo filter | Delete + better cache locality | More complex insertion logic; slightly more memory |
| Quotient filter | Delete + resize + merge | More complex; comparable space |
| HyperLogLog | Cardinality estimation (not membership) | O(log log n) space; different problem |
For most use cases, plain Bloom filter or Cuckoo filter are the practical choices. Cuckoo filters have edged ahead in some benchmarks due to better cache behavior on modern hardware.
Common Bloom filter bugs
- Saturating the filter. If you insert more items than the filter was designed for, the false positive rate spikes — eventually every query returns "probably yes." Monitor the fill ratio; rebuild when it exceeds the design point.
- Non-independent hash functions. Using k hashes that are correlated (e.g., k+1 different prefixes of one hash) produces clustered bit patterns and elevated FPR. Use double hashing (two independent hashes combined linearly) for cheap pseudo-independent k hashes.
- Confusing "probably yes" with "yes." Always assume some queries return false positives. Downstream code must verify or be tolerant. The Bloom filter is a fast pre-filter, not the source of truth.
- Trying to delete. If you need delete, use a counting Bloom filter or a Cuckoo filter. Naively clearing bits in a plain Bloom filter creates false negatives — far worse than false positives.
- Mismatched hashes when sharing. If two systems share a Bloom filter, they must use the exact same hash functions and bit array size. Subtle differences (one uses MurmurHash3, the other CityHash) make the filter useless.
- Querying without first inserting. Querying an empty filter for many items returns "probably yes" only at random rates. The benefit appears when there's a meaningful inserted set; for tiny n, just use a hash set.
Frequently asked questions
How does a Bloom filter avoid false negatives?
When you insert an element, k bits are set in the bit array (one per hash function). Querying tests those same k bits — if any is 0, the element was never inserted (definitively absent). False positives happen when an element's k bits all happen to be set by other inserted items, giving a false "yes." The asymmetry — bits only ever go from 0 to 1, never reverse — is what eliminates false negatives.
What's the optimal number of hash functions?
For m bits and n inserted items, optimal k = (m/n) × ln(2) ≈ 0.693 × (m/n). For 10 bits per item, that's ~7 hash functions. Too few hash functions and the bit array fills slowly but each lookup misses; too many and the bit array saturates quickly and false positives spike. The 7-hash setup at 10 bits per item gives ~1% false positive rate — the standard balance.
Why can't you delete from a Bloom filter?
Setting a bit back to 0 might also "delete" some other element that happened to have its bits set in the same positions. Counting Bloom filters use small counters (4 bits) instead of single bits per slot, allowing decrement on delete — at the cost of 4× memory. For pure-bit Bloom filters, deletion is impossible without reconstruction.
How is a Bloom filter different from a hash table?
A hash table stores the actual elements and their values; a Bloom filter stores no elements, only bit fingerprints. Hash tables answer "what value is associated with key X?" and "is X in the set?" exactly. Bloom filters answer only "is X probably in the set?" Bloom filters are 10-100× smaller than hash tables for the same n elements, but lose exact membership and value lookup.
When should I use a Bloom filter?
When the set is huge and you can tolerate occasional false positives. Common pattern — cascade a small Bloom filter in front of an expensive lookup (database query, disk read, network call). Most negative results are caught by the Bloom filter; the few false positives proceed to the slower path. Net win: 10-100× faster average for negative-heavy workloads.
How are Bloom filters used in browsers?
Google Chrome's Safe Browsing checks every URL you visit against a list of known malicious sites. The list has ~5M entries; storing it as a hash set would be 50+ MB. The Bloom-filter-based version is ~2 MB. False positives trigger a second-stage check (real network request); ~99% of safe URLs short-circuit at the local Bloom filter.
Can multiple Bloom filters be combined?
Yes — bitwise OR for set union (with the same hash functions and same bit array size). Bitwise AND for intersection (gives an upper bound — items definitely in neither aren't in the AND). Cannot do exact difference. This is why Bloom filters scale well horizontally — many machines build local filters, then OR them together for the global view.