Probabilistic Data Structures
Count-Min Sketch
Frequency estimates in d × w counters — always an upper bound
The Count-Min Sketch estimates item frequencies with sub-linear memory. d hash functions × w columns; query returns the min of d cells. Always overestimates, never undercounts. Powers stream analytics everywhere.
- Insert / queryO(d) — d hash functions
- Memoryd × w counters · independent of stream size
- Error boundε = e/w with probability ≥ 1 − 2⁻ᵈ
- BiasAlways overestimates, never undercounts
- MergeableYes — sum corresponding cells
- Used byDataSketches, Druid, Prometheus, NetFlow, DDoS detection
Interactive visualization
Watch each insert touch d cells, and queries take the min across rows.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How Count-Min Sketch works
A Count-Min Sketch is a 2-D grid of small integer counters with d rows and w columns. Each row has its own pairwise-independent hash function h₁, h₂, ..., h_d, each mapping items to one of the w columns. The grid starts zeroed.
Insert(x): for each row i = 1..d, increment grid[i][h_i(x)].
Query(x): for each row i = 1..d, read grid[i][h_i(x)]. Return the minimum across all d cells.
That's the entire algorithm. Each insert touches exactly d cells; each query reads exactly d cells. Memory is fixed at d × w × counter_width bits — independent of how many items you insert.
Why the minimum
Each cell grid[i][h_i(x)] is incremented every time x appears, plus every time some other item y appeared that happened to hash to the same column in row i. So:
grid[i][h_i(x)] = count(x) + Σ_{y ≠ x : h_i(y) = h_i(x)} count(y)
Every cell is therefore an upper bound on count(x). The tightest upper bound is the smallest cell — the row whose collisions, by luck, summed to the least. So min ≥ count(x) always. The estimate never undercounts.
How much overestimation? Cormode-Muthukrishnan's analysis: with d = ⌈ln(1/δ)⌉ rows and w = ⌈e/ε⌉ columns, the estimate exceeds the true count by more than ε·N (where N is total stream length) with probability at most δ. For ε = 0.1% and δ = 0.01% — sharp bounds — you need ~10 rows and ~2,718 columns, totaling ~108 KB at 4-byte counters.
Why not just use a hash table?
A hash table giving exact frequencies stores one (key, count) entry per distinct key. For a stream with 10⁹ distinct items, that's ~16-32 GB. Count-Min Sketch is 108 KB for the same problem with ε = 0.1% error.
The trade: you lose exact counts, you can't enumerate the items (the sketch stores no keys), and you can't query items that weren't in the stream (you'll get a non-zero "estimate" from collisions). But you scale to billions of items in megabytes of memory, mergeable across shards.
Choosing d and w
| Use case | Width w | Depth d | Memory | Guarantee |
|---|---|---|---|---|
| Demo / toy | 16 | 4 | 256 B | ε ≈ 17%, δ ≈ 6% |
| Coarse heavy-hitters | 200 | 5 | 4 KB | ε ≈ 1.4%, δ ≈ 3% |
| Twitter-scale trending | 2,000 | 7 | 56 KB | ε ≈ 0.14%, δ ≈ 0.8% |
| Sharp analytics | 2,718 (e × 1000) | 10 | 108 KB | ε = 0.1%, δ ≈ 0.01% |
| NetFlow per-IP | 10,000 | 5 | 200 KB | ε ≈ 0.03%, δ ≈ 3% |
Doubling d halves the failure probability (multiplicative). Doubling w halves the expected error (additive). Wider sketches help when you care about magnitude of error; deeper sketches help when you want strong worst-case guarantees.
Count-Min variants
| Variant | Trick | Use case |
|---|---|---|
| Plain CMS | Min across d cells | Frequency estimation, heavy hitters |
| Count Sketch (Charikar) | Signed updates ±1, take median | Supports decrements, unbiased estimator |
| Count-Min-Mean Sketch | Min, then subtract estimated noise floor | Better accuracy at small counts |
| Conservative update CMS | Only increment the minimum cell, not all d | Tighter overestimation, harder to merge |
| Cuckoo Count | Cuckoo hash with counters | Decrement support; cache-friendlier |
| HeavyKeeper | Probabilistic eviction with min-heap | Top-k items with bounded memory |
When to use Count-Min Sketch
- You need per-key frequencies on a high-volume stream. Network packet flows, log line frequencies, search query rates. CMS hits 10M+ inserts per second per core.
- You need heavy-hitters. "Which URLs are trending?" "Which IPs are flooding?" — CMS + min-heap finds the top items without storing all keys.
- Cardinality estimation in a database query planner. Sketches per column let the planner estimate join sizes without scanning the table.
- Distributed aggregation. Each shard maintains a local CMS; central node sums them cell-by-cell for the global picture.
- You're OK with overestimates. Most CMS use cases — alerting, trending, traffic shaping — are robust to a few percent of overcounting.
Skip CMS when you need exact counts, when you need to enumerate items (CMS stores no keys), or when your stream has tiny cardinality (just use a hash table).
Count-Min Sketch vs other counters
| Hash table | Count-Min Sketch | HyperLogLog | Bloom Filter | |
|---|---|---|---|---|
| Stores | Exact counts per key | Approx counts per key | Approx cardinality only | Approx membership |
| Memory for 10⁹ keys | ~16-32 GB | ~100 KB | ~12 KB | ~1.2 GB at 1% FPR |
| Insert | O(L) hash | O(d) ≈ O(1) | O(1) | O(k) |
| Query type | Exact count | Count ≥ true count | Distinct count | Probably in / not in |
| Error mode | None | Overestimate, never under | ±1.6% relative | ~1% false positive |
| Decrement support | Yes | No (use Count Sketch) | No | Only counting variant |
| Mergeable | Yes, expensive | Yes — sum cells | Yes — max cells | Yes — OR cells |
Pseudo-code
// Count-Min Sketch with d hash functions and w columns.
function CMS(d, w):
grid = d × w array of zeros
hashes = d different hash functions h_1, ..., h_d // pairwise independent
function insert(x):
for i in 1..d:
grid[i][h_i(x) mod w] += 1
function query(x):
return min over i in 1..d of grid[i][h_i(x) mod w]
function merge(a, b): // both same d, w
for i in 1..d:
for j in 1..w:
a.grid[i][j] += b.grid[i][j]
function heavy_hitters(threshold, items_seen):
heap = []
for x in items_seen:
c = query(x)
if c >= threshold:
heap.push((c, x))
return heap
JavaScript implementation
class CountMinSketch {
constructor({ width = 2718, depth = 10 } = {}) {
this.w = width;
this.d = depth;
this.grid = Array.from({ length: depth }, () => new Uint32Array(width));
// d different seeds for d different hash functions
this.seeds = Array.from({ length: depth }, (_, i) => (0x9e3779b9 + i * 0x1234567) >>> 0);
}
_hash(item, seed) {
const s = typeof item === 'string' ? item : JSON.stringify(item);
let h = seed;
for (let i = 0; i < s.length; i++) {
h = ((h ^ s.charCodeAt(i)) * 16777619) >>> 0;
}
return h % this.w;
}
add(item, count = 1) {
for (let i = 0; i < this.d; i++) {
this.grid[i][this._hash(item, this.seeds[i])] += count;
}
}
estimate(item) {
let min = Infinity;
for (let i = 0; i < this.d; i++) {
const v = this.grid[i][this._hash(item, this.seeds[i])];
if (v < min) min = v;
}
return min;
}
merge(other) {
if (other.d !== this.d || other.w !== this.w) throw new Error('shape mismatch');
for (let i = 0; i < this.d; i++) {
for (let j = 0; j < this.w; j++) {
this.grid[i][j] += other.grid[i][j];
}
}
}
}
const cms = new CountMinSketch({ width: 1000, depth: 5 });
const corpus = 'apple banana cherry apple apple banana'.split(' ');
for (const word of corpus) cms.add(word);
console.log(cms.estimate('apple')); // 3 (true count, perfect)
console.log(cms.estimate('banana')); // 2
console.log(cms.estimate('grape')); // small overestimate from collisions
Python implementation
import math
import hashlib
class CountMinSketch:
def __init__(self, epsilon=0.001, delta=0.0001):
# w = e/eps, d = ln(1/delta)
self.w = math.ceil(math.e / epsilon)
self.d = math.ceil(math.log(1 / delta))
self.grid = [[0] * self.w for _ in range(self.d)]
def _hashes(self, item):
data = str(item).encode()
out = []
for i in range(self.d):
h = hashlib.blake2b(data + i.to_bytes(2, 'big'), digest_size=8).digest()
out.append(int.from_bytes(h, 'big') % self.w)
return out
def add(self, item, count=1):
for i, j in enumerate(self._hashes(item)):
self.grid[i][j] += count
def estimate(self, item):
return min(self.grid[i][j] for i, j in enumerate(self._hashes(item)))
def merge(self, other):
for i in range(self.d):
for j in range(self.w):
self.grid[i][j] += other.grid[i][j]
cms = CountMinSketch(epsilon=0.001, delta=0.0001)
# Insert a Zipf-distributed stream
import random
words = ['the'] * 1000 + ['a'] * 500 + ['banana'] * 5
random.shuffle(words)
for w in words: cms.add(w)
print(cms.estimate('the')) # ~1000
print(cms.estimate('a')) # ~500
print(cms.estimate('banana')) # ~5
print(cms.estimate('grape')) # 0 or small overestimate
Common Count-Min Sketch bugs and edge cases
- Correlated hash functions. CMS requires pairwise-independent hashes. Using d different hash functions that share state (e.g., the same MurmurHash with sequentially-bumped seeds) can produce correlated outputs and elevated collision rates. Use double hashing or genuinely independent seeds.
- Modulo collisions when w is a power of 2. If your hash function has weak low bits,
hash mod wconcentrates collisions. Make w prime, or use a high-quality hash function (xxHash, MurmurHash3, BLAKE2). - Treating CMS as exact for popular items. Even heavy hitters can be overestimated by collisions with other heavy hitters. The error is bounded by ε·N (not 0).
- Querying never-inserted items. CMS returns a non-zero value for items that were never inserted, due to collisions. This is fine for frequency estimates of items you know are in the stream, but breaks naive "have I seen this?" usage. Pair with a Bloom filter for the membership question.
- Trying to decrement. Plain CMS doesn't support decrement. Use Count Sketch (median of signed updates) if you need it. Decrementing in CMS can produce values lower than the true count — breaking the never-undercount guarantee.
- Wrong merge semantics. Merging CMS = sum corresponding cells (a + b). Merging HLL = max corresponding registers. Merging Bloom filters = OR. Easy to mix up; use a typed library if you can.
Performance in real systems
- Apache DataSketches CMS: ~5-15 million inserts per second per core. ~100 KB for sharp guarantees.
- Druid topN queries: Use CMS for approximate top-k aggregations across billions of rows per second.
- NetFlow per-IP counters: Backbone routers maintain CMS of ~1 MB per port for per-source-IP flow counts. Without sketches, exact per-IP counts would require gigabytes per port.
- Twitter trending topics: CMS with d=5, w=10000 (~200 KB) per region; identifies heavy hitters in real time across 100k+ tweets/sec.
- Prometheus high-cardinality: Some Prometheus-style monitoring systems use CMS for per-label-set counters when cardinality is too high for direct storage.
- vs exact hash table: On a 1B-element stream with 10M distinct keys, exact counting needs ~160 MB (16 bytes/entry). CMS with 0.1% error needs ~108 KB. 1500× space saving.
- Error in practice: For ε=0.001, 99% of queries have error < 0.1% of stream total. The 1% tail can hit ε; the formal bound is on probability of any bad estimate, but in practice the average-case is far better.
The Count-Min Sketch is one of the most-used streaming data structures because it has a clean trade-off: sub-linear memory, O(d) per operation, simple to implement, never undercounts (a property many downstream systems require), and mergeable across shards. It's the workhorse of approximate stream analytics — the right tool any time you need "roughly how often does X show up?" on a stream too big to count exactly.
Frequently asked questions
Why does the Count-Min Sketch take the minimum across rows?
Each counter is incremented every time any item hashes to it — so it sees the queried item's true count plus the counts of every other item that collided with it in that row. Each row's count is therefore an upper bound on the true count. Taking the minimum across rows gives the tightest upper bound: the row whose collisions happened to be smallest. The true count is always ≤ the minimum, so the sketch never undercounts.
How big should the d × w sketch be?
Choose width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉ where ε is the additive error tolerance (as a fraction of the total stream count) and δ is the failure probability. For ε = 0.1% and δ = 0.01%: w = 2718, d = 10. That's 27,180 counters × 4 bytes = ~108 KB. Doubling d halves the failure probability; doubling w halves the expected error.
What's the difference between Count-Min Sketch and Bloom filter?
A Bloom filter stores membership (in / probably-in / definitely-not-in) using bits. A Count-Min Sketch stores frequency (counts) using small integer counters (typically 32-bit). Bloom filters can't tell you how many times something was inserted, only whether it was ever inserted. Both share the d-hash-function design, but their output and use cases differ: Bloom for set membership, CMS for frequency analysis.
How does Count-Min Sketch find heavy hitters?
The classic heavy-hitters problem is 'find all items whose count exceeds threshold φN where N is the total stream count.' CMS combined with a min-heap of size 1/φ works: for each insert, query the CMS for the item's estimated count; if it exceeds threshold, replace the heap's minimum element. After processing, the heap contains the top-φ candidates. CMS guarantees no false negatives (real heavy hitters are always returned) but may include some false positives.
Can you decrement a Count-Min Sketch?
The standard CMS supports increments only — decrement breaks the never-undercount property because the cells the decrement targets may have been incremented by other colliding items. The Count-Min-Mean Sketch and Count Sketch (Charikar) variants support decrements by using signed updates with random ±1 multipliers and taking the median rather than the minimum. These trade some accuracy for the ability to handle deletions.
What's the time complexity of Count-Min Sketch?
O(d) per insert and per query — d hash computations and d cell accesses. For d = 10 hash functions, each operation is ~10 hashes and 10 cache lines accessed — well under 1 microsecond on modern hardware. Memory is fixed at d × w × counter_width regardless of input size — the defining feature of streaming sketches.
Where is Count-Min Sketch used in practice?
Twitter's heavy-hitter detection (trending topics). Akamai's DDoS detection (per-IP request counts). Cisco's NetFlow analytics. Database query optimizers for cardinality estimation. AT&T's network anomaly detection. Apache DataSketches library (used by Druid, Pinot). Prometheus and other monitoring systems use CMS variants for high-cardinality time series. Any time you need per-key frequencies on a fast stream and can tolerate slight overestimates, CMS is the default tool.