Probabilistic Data Structures

Count-Min Sketch

Frequency estimates in d × w counters — always an upper bound

The Count-Min Sketch estimates item frequencies with sub-linear memory. d hash functions × w columns; query returns the min of d cells. Always overestimates, never undercounts. Powers stream analytics everywhere.

  • Insert / queryO(d) — d hash functions
  • Memoryd × w counters · independent of stream size
  • Error boundε = e/w with probability ≥ 1 − 2⁻ᵈ
  • BiasAlways overestimates, never undercounts
  • MergeableYes — sum corresponding cells
  • Used byDataSketches, Druid, Prometheus, NetFlow, DDoS detection

Interactive visualization

Watch each insert touch d cells, and queries take the min across rows.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How Count-Min Sketch works

A Count-Min Sketch is a 2-D grid of small integer counters with d rows and w columns. Each row has its own pairwise-independent hash function h₁, h₂, ..., h_d, each mapping items to one of the w columns. The grid starts zeroed.

Insert(x): for each row i = 1..d, increment grid[i][h_i(x)].

Query(x): for each row i = 1..d, read grid[i][h_i(x)]. Return the minimum across all d cells.

That's the entire algorithm. Each insert touches exactly d cells; each query reads exactly d cells. Memory is fixed at d × w × counter_width bits — independent of how many items you insert.

Why the minimum

Each cell grid[i][h_i(x)] is incremented every time x appears, plus every time some other item y appeared that happened to hash to the same column in row i. So:

grid[i][h_i(x)] = count(x) + Σ_{y ≠ x : h_i(y) = h_i(x)} count(y)

Every cell is therefore an upper bound on count(x). The tightest upper bound is the smallest cell — the row whose collisions, by luck, summed to the least. So min ≥ count(x) always. The estimate never undercounts.

How much overestimation? Cormode-Muthukrishnan's analysis: with d = ⌈ln(1/δ)⌉ rows and w = ⌈e/ε⌉ columns, the estimate exceeds the true count by more than ε·N (where N is total stream length) with probability at most δ. For ε = 0.1% and δ = 0.01% — sharp bounds — you need ~10 rows and ~2,718 columns, totaling ~108 KB at 4-byte counters.

Why not just use a hash table?

A hash table giving exact frequencies stores one (key, count) entry per distinct key. For a stream with 10⁹ distinct items, that's ~16-32 GB. Count-Min Sketch is 108 KB for the same problem with ε = 0.1% error.

The trade: you lose exact counts, you can't enumerate the items (the sketch stores no keys), and you can't query items that weren't in the stream (you'll get a non-zero "estimate" from collisions). But you scale to billions of items in megabytes of memory, mergeable across shards.

Choosing d and w

Use caseWidth wDepth dMemoryGuarantee
Demo / toy164256 Bε ≈ 17%, δ ≈ 6%
Coarse heavy-hitters20054 KBε ≈ 1.4%, δ ≈ 3%
Twitter-scale trending2,000756 KBε ≈ 0.14%, δ ≈ 0.8%
Sharp analytics2,718 (e × 1000)10108 KBε = 0.1%, δ ≈ 0.01%
NetFlow per-IP10,0005200 KBε ≈ 0.03%, δ ≈ 3%

Doubling d halves the failure probability (multiplicative). Doubling w halves the expected error (additive). Wider sketches help when you care about magnitude of error; deeper sketches help when you want strong worst-case guarantees.

Count-Min variants

VariantTrickUse case
Plain CMSMin across d cellsFrequency estimation, heavy hitters
Count Sketch (Charikar)Signed updates ±1, take medianSupports decrements, unbiased estimator
Count-Min-Mean SketchMin, then subtract estimated noise floorBetter accuracy at small counts
Conservative update CMSOnly increment the minimum cell, not all dTighter overestimation, harder to merge
Cuckoo CountCuckoo hash with countersDecrement support; cache-friendlier
HeavyKeeperProbabilistic eviction with min-heapTop-k items with bounded memory

When to use Count-Min Sketch

  • You need per-key frequencies on a high-volume stream. Network packet flows, log line frequencies, search query rates. CMS hits 10M+ inserts per second per core.
  • You need heavy-hitters. "Which URLs are trending?" "Which IPs are flooding?" — CMS + min-heap finds the top items without storing all keys.
  • Cardinality estimation in a database query planner. Sketches per column let the planner estimate join sizes without scanning the table.
  • Distributed aggregation. Each shard maintains a local CMS; central node sums them cell-by-cell for the global picture.
  • You're OK with overestimates. Most CMS use cases — alerting, trending, traffic shaping — are robust to a few percent of overcounting.

Skip CMS when you need exact counts, when you need to enumerate items (CMS stores no keys), or when your stream has tiny cardinality (just use a hash table).

Count-Min Sketch vs other counters

Hash tableCount-Min SketchHyperLogLogBloom Filter
StoresExact counts per keyApprox counts per keyApprox cardinality onlyApprox membership
Memory for 10⁹ keys~16-32 GB~100 KB~12 KB~1.2 GB at 1% FPR
InsertO(L) hashO(d) ≈ O(1)O(1)O(k)
Query typeExact countCount ≥ true countDistinct countProbably in / not in
Error modeNoneOverestimate, never under±1.6% relative~1% false positive
Decrement supportYesNo (use Count Sketch)NoOnly counting variant
MergeableYes, expensiveYes — sum cellsYes — max cellsYes — OR cells

Pseudo-code

// Count-Min Sketch with d hash functions and w columns.

function CMS(d, w):
    grid = d × w array of zeros
    hashes = d different hash functions h_1, ..., h_d   // pairwise independent

function insert(x):
    for i in 1..d:
        grid[i][h_i(x) mod w] += 1

function query(x):
    return min over i in 1..d of grid[i][h_i(x) mod w]

function merge(a, b):                                   // both same d, w
    for i in 1..d:
        for j in 1..w:
            a.grid[i][j] += b.grid[i][j]

function heavy_hitters(threshold, items_seen):
    heap = []
    for x in items_seen:
        c = query(x)
        if c >= threshold:
            heap.push((c, x))
    return heap

JavaScript implementation

class CountMinSketch {
  constructor({ width = 2718, depth = 10 } = {}) {
    this.w = width;
    this.d = depth;
    this.grid = Array.from({ length: depth }, () => new Uint32Array(width));
    // d different seeds for d different hash functions
    this.seeds = Array.from({ length: depth }, (_, i) => (0x9e3779b9 + i * 0x1234567) >>> 0);
  }

  _hash(item, seed) {
    const s = typeof item === 'string' ? item : JSON.stringify(item);
    let h = seed;
    for (let i = 0; i < s.length; i++) {
      h = ((h ^ s.charCodeAt(i)) * 16777619) >>> 0;
    }
    return h % this.w;
  }

  add(item, count = 1) {
    for (let i = 0; i < this.d; i++) {
      this.grid[i][this._hash(item, this.seeds[i])] += count;
    }
  }

  estimate(item) {
    let min = Infinity;
    for (let i = 0; i < this.d; i++) {
      const v = this.grid[i][this._hash(item, this.seeds[i])];
      if (v < min) min = v;
    }
    return min;
  }

  merge(other) {
    if (other.d !== this.d || other.w !== this.w) throw new Error('shape mismatch');
    for (let i = 0; i < this.d; i++) {
      for (let j = 0; j < this.w; j++) {
        this.grid[i][j] += other.grid[i][j];
      }
    }
  }
}

const cms = new CountMinSketch({ width: 1000, depth: 5 });
const corpus = 'apple banana cherry apple apple banana'.split(' ');
for (const word of corpus) cms.add(word);
console.log(cms.estimate('apple'));   // 3 (true count, perfect)
console.log(cms.estimate('banana'));  // 2
console.log(cms.estimate('grape'));   // small overestimate from collisions

Python implementation

import math
import hashlib

class CountMinSketch:
    def __init__(self, epsilon=0.001, delta=0.0001):
        # w = e/eps, d = ln(1/delta)
        self.w = math.ceil(math.e / epsilon)
        self.d = math.ceil(math.log(1 / delta))
        self.grid = [[0] * self.w for _ in range(self.d)]

    def _hashes(self, item):
        data = str(item).encode()
        out = []
        for i in range(self.d):
            h = hashlib.blake2b(data + i.to_bytes(2, 'big'), digest_size=8).digest()
            out.append(int.from_bytes(h, 'big') % self.w)
        return out

    def add(self, item, count=1):
        for i, j in enumerate(self._hashes(item)):
            self.grid[i][j] += count

    def estimate(self, item):
        return min(self.grid[i][j] for i, j in enumerate(self._hashes(item)))

    def merge(self, other):
        for i in range(self.d):
            for j in range(self.w):
                self.grid[i][j] += other.grid[i][j]

cms = CountMinSketch(epsilon=0.001, delta=0.0001)
# Insert a Zipf-distributed stream
import random
words = ['the'] * 1000 + ['a'] * 500 + ['banana'] * 5
random.shuffle(words)
for w in words: cms.add(w)
print(cms.estimate('the'))      # ~1000
print(cms.estimate('a'))        # ~500
print(cms.estimate('banana'))   # ~5
print(cms.estimate('grape'))    # 0 or small overestimate

Common Count-Min Sketch bugs and edge cases

  • Correlated hash functions. CMS requires pairwise-independent hashes. Using d different hash functions that share state (e.g., the same MurmurHash with sequentially-bumped seeds) can produce correlated outputs and elevated collision rates. Use double hashing or genuinely independent seeds.
  • Modulo collisions when w is a power of 2. If your hash function has weak low bits, hash mod w concentrates collisions. Make w prime, or use a high-quality hash function (xxHash, MurmurHash3, BLAKE2).
  • Treating CMS as exact for popular items. Even heavy hitters can be overestimated by collisions with other heavy hitters. The error is bounded by ε·N (not 0).
  • Querying never-inserted items. CMS returns a non-zero value for items that were never inserted, due to collisions. This is fine for frequency estimates of items you know are in the stream, but breaks naive "have I seen this?" usage. Pair with a Bloom filter for the membership question.
  • Trying to decrement. Plain CMS doesn't support decrement. Use Count Sketch (median of signed updates) if you need it. Decrementing in CMS can produce values lower than the true count — breaking the never-undercount guarantee.
  • Wrong merge semantics. Merging CMS = sum corresponding cells (a + b). Merging HLL = max corresponding registers. Merging Bloom filters = OR. Easy to mix up; use a typed library if you can.

Performance in real systems

  • Apache DataSketches CMS: ~5-15 million inserts per second per core. ~100 KB for sharp guarantees.
  • Druid topN queries: Use CMS for approximate top-k aggregations across billions of rows per second.
  • NetFlow per-IP counters: Backbone routers maintain CMS of ~1 MB per port for per-source-IP flow counts. Without sketches, exact per-IP counts would require gigabytes per port.
  • Twitter trending topics: CMS with d=5, w=10000 (~200 KB) per region; identifies heavy hitters in real time across 100k+ tweets/sec.
  • Prometheus high-cardinality: Some Prometheus-style monitoring systems use CMS for per-label-set counters when cardinality is too high for direct storage.
  • vs exact hash table: On a 1B-element stream with 10M distinct keys, exact counting needs ~160 MB (16 bytes/entry). CMS with 0.1% error needs ~108 KB. 1500× space saving.
  • Error in practice: For ε=0.001, 99% of queries have error < 0.1% of stream total. The 1% tail can hit ε; the formal bound is on probability of any bad estimate, but in practice the average-case is far better.

The Count-Min Sketch is one of the most-used streaming data structures because it has a clean trade-off: sub-linear memory, O(d) per operation, simple to implement, never undercounts (a property many downstream systems require), and mergeable across shards. It's the workhorse of approximate stream analytics — the right tool any time you need "roughly how often does X show up?" on a stream too big to count exactly.

Frequently asked questions

Why does the Count-Min Sketch take the minimum across rows?

Each counter is incremented every time any item hashes to it — so it sees the queried item's true count plus the counts of every other item that collided with it in that row. Each row's count is therefore an upper bound on the true count. Taking the minimum across rows gives the tightest upper bound: the row whose collisions happened to be smallest. The true count is always ≤ the minimum, so the sketch never undercounts.

How big should the d × w sketch be?

Choose width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉ where ε is the additive error tolerance (as a fraction of the total stream count) and δ is the failure probability. For ε = 0.1% and δ = 0.01%: w = 2718, d = 10. That's 27,180 counters × 4 bytes = ~108 KB. Doubling d halves the failure probability; doubling w halves the expected error.

What's the difference between Count-Min Sketch and Bloom filter?

A Bloom filter stores membership (in / probably-in / definitely-not-in) using bits. A Count-Min Sketch stores frequency (counts) using small integer counters (typically 32-bit). Bloom filters can't tell you how many times something was inserted, only whether it was ever inserted. Both share the d-hash-function design, but their output and use cases differ: Bloom for set membership, CMS for frequency analysis.

How does Count-Min Sketch find heavy hitters?

The classic heavy-hitters problem is 'find all items whose count exceeds threshold φN where N is the total stream count.' CMS combined with a min-heap of size 1/φ works: for each insert, query the CMS for the item's estimated count; if it exceeds threshold, replace the heap's minimum element. After processing, the heap contains the top-φ candidates. CMS guarantees no false negatives (real heavy hitters are always returned) but may include some false positives.

Can you decrement a Count-Min Sketch?

The standard CMS supports increments only — decrement breaks the never-undercount property because the cells the decrement targets may have been incremented by other colliding items. The Count-Min-Mean Sketch and Count Sketch (Charikar) variants support decrements by using signed updates with random ±1 multipliers and taking the median rather than the minimum. These trade some accuracy for the ability to handle deletions.

What's the time complexity of Count-Min Sketch?

O(d) per insert and per query — d hash computations and d cell accesses. For d = 10 hash functions, each operation is ~10 hashes and 10 cache lines accessed — well under 1 microsecond on modern hardware. Memory is fixed at d × w × counter_width regardless of input size — the defining feature of streaming sketches.

Where is Count-Min Sketch used in practice?

Twitter's heavy-hitter detection (trending topics). Akamai's DDoS detection (per-IP request counts). Cisco's NetFlow analytics. Database query optimizers for cardinality estimation. AT&T's network anomaly detection. Apache DataSketches library (used by Druid, Pinot). Prometheus and other monitoring systems use CMS variants for high-cardinality time series. Any time you need per-key frequencies on a fast stream and can tolerate slight overestimates, CMS is the default tool.