Data Structures

Bloom Filter

Probably yes, definitely no — set membership in 10 bits per item

A Bloom filter is a compact probabilistic data structure that tells you whether an element is "probably in the set" or "definitely not in the set" — never gives a false negative, but allows tunable false positives. Uses ~10 bits per item for a 1% false-positive rate, 100× more space-efficient than storing the items directly. Used by browsers for safe-browsing checks, databases to skip disk reads, and CDN caches to avoid origin requests.

LookupO(k) — k hash functions
InsertO(k)
Space~10 bits per item for 1% FPR; ~14 bits for 0.1% FPR
False positivesTunable — fewer = more space
False negativesNever (definitive "not in set")
Cannot doDelete (without counting variants), enumerate, count

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a Bloom filter works

Three pieces:

A bit array of size m, initially all zeros.
k independent hash functions, each producing an index into the bit array.
An empty initial set with no elements.

Insert(x): compute hash_1(x), hash_2(x), ..., hash_k(x) modulo m. Set those k bits to 1.

Query(x): compute the same k indices. If any is 0, x is definitely not in the set. If all are 1, x is probably in the set (might be a false positive).

The asymmetry is the key insight. False positives happen because some other elements happened to set the same bits. False negatives can't happen because bits never go from 1 back to 0.

The false positive math

After inserting n items into an m-bit array with k hash functions, the probability that a specific bit is still 0 is roughly (1 − 1/m)^(kn) ≈ e^(−kn/m).

The probability of a false positive (all k checked bits happen to be 1 for an element that wasn't inserted) is approximately:

p = (1 − e^(−kn/m))^k

To minimize p for given m and n, the optimal k is (m/n) × ln(2) ≈ 0.693 × (m/n). At that optimum, p ≈ (1/2)^k.

Bits per item (m/n)	Optimal k	False positive rate
4	3	~14%
8	6	~2%
10	7	~1% — sweet spot
14	10	~0.1%
20	14	~0.0005%
32	22	~10^-9

The standard "10 bits per item, ~1% FPR" balance is what most production deployments target. Cutting FPR by 10× costs ~4 more bits per item; cutting it by 100× costs ~10 more bits.

Bloom filter vs hash table for membership

	Bloom filter	Hash set
Space (1M strings)	~1.2 MB at 1% FPR	~80 MB
Insert	O(k) — set k bits	O(L) where L is key length
Query	O(k) — read k bits	O(L)
False positives	Tunable rate	None
Delete	Not possible (without counting variant)	Trivial
Enumerate elements	No — element data not stored	Yes
Best for	"Probably in or definitely not"	Exact membership + lookups

The 10-100× space saving is what matters. For a billion-element set, the difference is 10 GB vs 100 GB. The trade-off — tunable false positives — is acceptable when downstream code can handle them.

When to use a Bloom filter

Cascade in front of slow lookups. The textbook pattern. Database key not found in Bloom filter? Skip the disk read. Browser URL not in malware list? Skip the network call. CDN cache miss prediction? Probably not in cache.
Distributed system membership. Each node maintains a Bloom filter of its keys. Other nodes ask "do you have X?" by querying the filter — no full state synchronization needed. Cassandra, RocksDB, Apache Spark all do this.
Spam filtering. Maintain a Bloom filter of known-malicious URLs, IPs, or hashes. Lookup is O(k); the filter fits in tens of MB even for hundreds of millions of entries.
Block storage deduplication. Before checking if a block is already stored, query the Bloom filter. Misses skip the deduplication index lookup entirely.
Web crawlers. "Have I already crawled this URL?" queried billions of times. A Bloom filter prevents revisits without storing the full URL set.

JavaScript implementation

class BloomFilter {
  constructor(expectedItems, falsePositiveRate = 0.01) {
    // m = -n × ln(p) / (ln 2)^2; k = (m/n) × ln 2
    this.m = Math.ceil(-expectedItems * Math.log(falsePositiveRate) / Math.LN2 ** 2);
    this.k = Math.ceil((this.m / expectedItems) * Math.LN2);
    this.bits = new Uint8Array(Math.ceil(this.m / 8));
  }

  // FNV-1a hash with two seeds — gives k pseudo-independent hashes via double hashing
  _hashes(item) {
    const str = typeof item === 'string' ? item : JSON.stringify(item);
    let h1 = 2166136261, h2 = 5381;
    for (let i = 0; i < str.length; i++) {
      h1 = ((h1 ^ str.charCodeAt(i)) * 16777619) >>> 0;
      h2 = ((h2 << 5) + h2 + str.charCodeAt(i)) >>> 0;
    }
    const result = [];
    for (let i = 0; i < this.k; i++) {
      result.push((h1 + i * h2) % this.m);
    }
    return result;
  }

  add(item) {
    for (const idx of this._hashes(item)) {
      this.bits[idx >> 3] |= 1 << (idx & 7);
    }
  }

  has(item) {
    for (const idx of this._hashes(item)) {
      if ((this.bits[idx >> 3] & (1 << (idx & 7))) === 0) return false;
    }
    return true; // probably
  }

  // Estimated cardinality (number of items inserted) — useful for monitoring
  approximateSize() {
    let setBits = 0;
    for (const byte of this.bits) {
      let b = byte;
      while (b) { setBits += b & 1; b >>= 1; }
    }
    return Math.round(-this.m * Math.log(1 - setBits / this.m) / this.k);
  }
}

const bf = new BloomFilter(1000000, 0.01);
bf.add('[email protected]');
bf.has('[email protected]');  // true
bf.has('[email protected]');    // false (or rarely true — false positive)

Bloom filter variants

Variant	Adds capability	Cost
Counting Bloom filter	Delete	4× space (4-bit counters instead of 1-bit)
Scalable Bloom filter	Grows automatically	Slightly higher FPR; multiple filters in series
Cuckoo filter	Delete + better cache locality	More complex insertion logic; slightly more memory
Quotient filter	Delete + resize + merge	More complex; comparable space
HyperLogLog	Cardinality estimation (not membership)	O(log log n) space; different problem

For most use cases, plain Bloom filter or Cuckoo filter are the practical choices. Cuckoo filters have edged ahead in some benchmarks due to better cache behavior on modern hardware.

Common Bloom filter bugs

Saturating the filter. If you insert more items than the filter was designed for, the false positive rate spikes — eventually every query returns "probably yes." Monitor the fill ratio; rebuild when it exceeds the design point.
Non-independent hash functions. Using k hashes that are correlated (e.g., k+1 different prefixes of one hash) produces clustered bit patterns and elevated FPR. Use double hashing (two independent hashes combined linearly) for cheap pseudo-independent k hashes.
Confusing "probably yes" with "yes." Always assume some queries return false positives. Downstream code must verify or be tolerant. The Bloom filter is a fast pre-filter, not the source of truth.
Trying to delete. If you need delete, use a counting Bloom filter or a Cuckoo filter. Naively clearing bits in a plain Bloom filter creates false negatives — far worse than false positives.
Mismatched hashes when sharing. If two systems share a Bloom filter, they must use the exact same hash functions and bit array size. Subtle differences (one uses MurmurHash3, the other CityHash) make the filter useless.
Querying without first inserting. Querying an empty filter for many items returns "probably yes" only at random rates. The benefit appears when there's a meaningful inserted set; for tiny n, just use a hash set.

Frequently asked questions

How does a Bloom filter avoid false negatives?

When you insert an element, k bits are set in the bit array (one per hash function). Querying tests those same k bits — if any is 0, the element was never inserted (definitively absent). False positives happen when an element's k bits all happen to be set by other inserted items, giving a false "yes." The asymmetry — bits only ever go from 0 to 1, never reverse — is what eliminates false negatives.

What's the optimal number of hash functions?

For m bits and n inserted items, optimal k = (m/n) × ln(2) ≈ 0.693 × (m/n). For 10 bits per item, that's ~7 hash functions. Too few hash functions and the bit array fills slowly but each lookup misses; too many and the bit array saturates quickly and false positives spike. The 7-hash setup at 10 bits per item gives ~1% false positive rate — the standard balance.

Why can't you delete from a Bloom filter?

Setting a bit back to 0 might also "delete" some other element that happened to have its bits set in the same positions. Counting Bloom filters use small counters (4 bits) instead of single bits per slot, allowing decrement on delete — at the cost of 4× memory. For pure-bit Bloom filters, deletion is impossible without reconstruction.

How is a Bloom filter different from a hash table?

A hash table stores the actual elements and their values; a Bloom filter stores no elements, only bit fingerprints. Hash tables answer "what value is associated with key X?" and "is X in the set?" exactly. Bloom filters answer only "is X probably in the set?" Bloom filters are 10-100× smaller than hash tables for the same n elements, but lose exact membership and value lookup.

When should I use a Bloom filter?

When the set is huge and you can tolerate occasional false positives. Common pattern — cascade a small Bloom filter in front of an expensive lookup (database query, disk read, network call). Most negative results are caught by the Bloom filter; the few false positives proceed to the slower path. Net win: 10-100× faster average for negative-heavy workloads.

How are Bloom filters used in browsers?

Google Chrome's Safe Browsing checks every URL you visit against a list of known malicious sites. The list has ~5M entries; storing it as a hash set would be 50+ MB. The Bloom-filter-based version is ~2 MB. False positives trigger a second-stage check (real network request); ~99% of safe URLs short-circuit at the local Bloom filter.

Can multiple Bloom filters be combined?

Yes — bitwise OR for set union (with the same hash functions and same bit array size). Bitwise AND for intersection (gives an upper bound — items definitely in neither aren't in the AND). Cannot do exact difference. This is why Bloom filters scale well horizontally — many machines build local filters, then OR them together for the global view.