Probabilistic Data Structures

HyperLogLog

Billion distinct items in 12 KB, with 1% error

HyperLogLog estimates unique counts using ~12 KB for cardinalities up to 10⁹ with ~1.625% standard error. Counts leading zeros in hashes; takes the harmonic mean across registers. Powers Redis PFCOUNT.

Memory~12 KB for 1.625% error
InsertO(1)
Query (cardinality)O(m) — m registers
Standard error1.04/√m — precision 14 → 0.81%
Cardinality rangeUp to 10⁹+ at precision 14
Used byRedis PFCOUNT, BigQuery, Snowflake, ClickHouse

Interactive visualization

Watch a stream of hashed values land into buckets — each bucket tracks the maximum leading-zero count it has seen.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How HyperLogLog works

The core insight is from probability. Hash a uniformly random value to a long bit string. The leading-zero count is geometrically distributed — the probability of seeing k zeros at the start is 2⁻ᵏ. Among n random hashes, you expect to eventually observe a hash with about log₂(n) leading zeros, just because of the birthday-paradox-style scaling of extreme values.

Flajolet's leap (1980s) was to invert this: given the maximum leading-zero count observed over a stream, you can estimate n ≈ 2^(max leading zeros). The estimator has horrible variance on its own — but split the stream into m buckets, take a stabilized average across all m, and the variance drops as 1/√m.

The HyperLogLog algorithm:

Choose a precision p; let m = 2ᵖ be the number of registers. Each register is 6 bits.
For each incoming element x:
- Compute a 64-bit hash h = hash(x).
- Use the top p bits of h as the register index i.
- Count leading zeros in the remaining 64 − p bits, call it ρ.
- Set register[i] = max(register[i], ρ + 1).
To estimate cardinality:
- E = α_m × m² × (Σ 2⁻ʳⁱ)⁻¹ — the bias-corrected harmonic mean.
- Apply small-range and large-range corrections (HLL++).

That's it. Each element is one hash, one bucket update, one byte of work. Cardinality query is one harmonic-mean computation over m registers.

Precision vs memory vs error

Precision p	Registers m = 2ᵖ	Memory (6 bits/reg)	Standard error
10	1,024	768 B	~3.25%
12	4,096	3 KB	~1.625%
14	16,384	12 KB	~0.81% — Redis default
16	65,536	48 KB	~0.41%
18	262,144	192 KB	~0.20%

The standard error is 1.04/√m. Every increase of p by 1 doubles the memory and reduces error by ~30%. Beyond p=14 you usually stop — 1% is good enough for analytics dashboards.

Why harmonic mean

The naive way to combine register estimates is the arithmetic mean: average 2^ρᵢ across all i. Bad idea — one register with ρ=20 means 2²⁰ = 1 million, dominating any reasonable average. The geometric mean (used in the original LogLog) tamps this down but is still noisy.

The harmonic mean uniquely down-weights large values. Mathematically:

E = α_m × m² / Σᵢ 2⁻ʳⁱ

where α_m ≈ 0.7213 / (1 + 1.079/m) is the bias-correction constant.

This estimator has theoretically near-optimal variance for a fixed memory budget. The 2007 HyperLogLog paper proved that improving on it requires more memory, not a better formula.

HyperLogLog++ corrections

Plain HLL has two failure modes:

Small cardinalities (n ≪ m). When most registers are still 0, the harmonic mean is biased high. HLL++ uses linear counting in this regime: estimate = m × ln(m/V) where V is the count of zero registers.
Large cardinalities (n > 2³²). The 32-bit version of HLL saturates near 2³². HLL++ uses 64-bit hashes to push the saturation point above 10¹⁸ — practically unlimited.

The sparse representation is the other Google contribution: when few registers are populated, store them as a sorted list of (index, value) pairs instead of a dense array. Massively reduces memory for small streams. Once the sparse representation crosses ~6 KB it converts to dense.

When to use HyperLogLog

You need approximate COUNT(DISTINCT) at scale. Exact distinct counts on a billion-row table need O(unique items) memory — that's gigabytes for high-cardinality keys. HLL needs 12 KB.
You need to merge counts from multiple shards. HLL union is associative — each worker computes a local HLL, the coordinator does register-wise max. Distributed unique-user counts solved.
You're tracking many similar metrics. Daily-active-users per region, per country, per feature flag — each as its own 12 KB HLL. A million HLLs cost 12 GB; storing the underlying sets would cost terabytes.
You're OK with relative error. HLL gives ±1% on the result, regardless of true cardinality. If you need ±1 unit precision, use exact counting.

Skip HLL when cardinalities are tiny (just use a hash set), when you need exact counts (regulatory or billing contexts), or when you also need to enumerate the elements (HLL stores nothing about the items themselves).

HyperLogLog vs other counters

	Hash set	HyperLogLog	Count-Min Sketch	Bloom Filter
What it counts	Distinct elements	Distinct elements (estimate)	Frequency per element	Membership
Memory for 10⁹ items	~10 GB	~12 KB	~1 MB	~1.2 GB at 1% FPR
Error type	None	~1.6% std error	Always overestimates by ≤ ε	~1% false positive
Insert	O(L) hash + lookup	O(1)	O(d)	O(k)
Query type	Exact count	Approximate count	Approximate frequency	Probably-in / not-in
Mergeable across shards	Yes, expensive	Yes, O(m) max	Yes, addition	Yes, OR

Pseudo-code

// HyperLogLog with precision p (m = 2^p registers).

function HLL(p):
    m = 2^p
    registers = [0] * m
    alpha = 0.7213 / (1 + 1.079 / m)  // small-m corrected

function add(x):
    h = hash64(x)
    i = h >> (64 - p)                  // top p bits → register index
    w = (h << p) | (1 << (p - 1))      // bottom bits with sentinel
    rho = leading_zeros(w) + 1
    registers[i] = max(registers[i], rho)

function estimate():
    sum = 0
    zeros = 0
    for r in registers:
        sum += 2 ^ (-r)
        if r == 0: zeros += 1
    E = alpha * m * m / sum

    if E <= 2.5 * m and zeros > 0:
        return m * ln(m / zeros)         // linear counting for small n
    return E                              // large n: HLL formula

JavaScript implementation

class HyperLogLog {
  constructor(precision = 14) {
    this.p = precision;
    this.m = 1 << precision;
    this.registers = new Uint8Array(this.m);
    this.alpha = 0.7213 / (1 + 1.079 / this.m);
  }

  // Simple 32-bit string hash (use a real one in production: xxhash, MurmurHash3).
  hash(item) {
    const s = typeof item === 'string' ? item : JSON.stringify(item);
    let h = 2166136261;
    for (let i = 0; i < s.length; i++) {
      h = ((h ^ s.charCodeAt(i)) * 16777619) >>> 0;
    }
    return h;
  }

  add(item) {
    const h = this.hash(item);
    // Top p bits → register index.
    const i = h >>> (32 - this.p);
    // Bottom 32 - p bits → leading zeros.
    const w = (h << this.p) | (1 << (this.p - 1));
    let rho = 1;
    let m = 0x80000000;
    while ((w & m) === 0 && m !== 0) { rho++; m >>>= 1; }
    if (rho > this.registers[i]) this.registers[i] = rho;
  }

  estimate() {
    let sum = 0, zeros = 0;
    for (const r of this.registers) {
      sum += Math.pow(2, -r);
      if (r === 0) zeros++;
    }
    const E = this.alpha * this.m * this.m / sum;
    if (E <= 2.5 * this.m && zeros > 0) {
      return this.m * Math.log(this.m / zeros);   // linear counting
    }
    return E;
  }

  merge(other) {
    if (other.m !== this.m) throw new Error('precision mismatch');
    for (let i = 0; i < this.m; i++) {
      this.registers[i] = Math.max(this.registers[i], other.registers[i]);
    }
  }
}

const hll = new HyperLogLog(14);
for (let i = 0; i < 1_000_000; i++) hll.add('user_' + i);
console.log(hll.estimate());   // ~1,000,000 ± 8,100 (0.81% std err)

Python implementation

import math
import hashlib

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash(self, item):
        return int.from_bytes(
            hashlib.blake2b(str(item).encode(), digest_size=8).digest(),
            'big'
        )

    def add(self, item):
        h = self._hash(item)
        i = h >> (64 - self.p)                 # top p bits
        w = (h << self.p) & ((1 << 64) - 1) | (1 << (self.p - 1))
        # leading zeros + 1
        rho = 64 - w.bit_length() + 1
        if rho > self.registers[i]:
            self.registers[i] = rho

    def estimate(self):
        s = sum(2.0 ** -r for r in self.registers)
        E = self.alpha * self.m * self.m / s
        zeros = self.registers.count(0)
        if E <= 2.5 * self.m and zeros > 0:
            return self.m * math.log(self.m / zeros)
        return E

    def merge(self, other):
        for i in range(self.m):
            if other.registers[i] > self.registers[i]:
                self.registers[i] = other.registers[i]

hll = HyperLogLog(14)
for i in range(1_000_000):
    hll.add(f'user_{i}')
print(round(hll.estimate()))  # ~1,000,000 ± 0.81%

Common HyperLogLog bugs and edge cases

Forgetting the small-range correction. Plain HLL massively overestimates when most registers are zero. Always check whether the estimate is < 2.5m and fall back to linear counting in that regime.
Mismatched precisions during merge. Merging two HLLs requires the same m. If one is precision 12 and the other 14, the merge is meaningless. Downsample by ORing register groups before merging.
Bad hash function. HLL assumes near-uniform random hash output. Truncated string hashes or non-cryptographic hashes with biases skew the estimate. Use MurmurHash3, xxHash, or BLAKE2.
32-bit hash saturation. With 32-bit hashes, all hash values eventually collide once you've seen ~2³² distinct items. HLL++ uses 64-bit hashes to avoid this. For high-cardinality streams, 64-bit is mandatory.
Confusing standard error with maximum error. HLL's 1% std error means ~95% of estimates are within ±2%, ~99% within ±3%. Individual estimates can be further off. Don't claim "exactly 1% accurate."
Off-by-one in leading-zero count. The standard formulation counts ρ as 1-indexed (leading zeros + 1 for the first 1-bit). Some implementations use 0-indexed counts and get half-correct estimates. Test against a known sequence.

Performance in real systems

Redis PFADD: ~1-2 microseconds per insert. ~12 KB per HLL. Linearly scales: 1M HLLs cost ~12 GB.
BigQuery APPROX_COUNT_DISTINCT: ~10-100× faster than COUNT(DISTINCT) on high-cardinality columns. Uses HLL++ with precision 11-15 depending on query.
Snowflake APPROX_COUNT_DISTINCT: Default error ~1.6%. Tunable via HLL_ACCURACY.
Twitter: Real-time HLL for unique-tweet authors per hashtag. Each hashtag's HLL is ~12 KB; tens of millions of hashtags fit comfortably in a Redis cluster.
ClickHouse uniq(): Uses adaptive HLL — switches to exact counting for small streams, HLL above ~30k unique items.
Memory comparison: Storing 10⁹ raw 16-byte UUIDs takes 16 GB. Storing them as a hash set takes ~32 GB after overhead. HLL precision 14: 12 KB. Six orders of magnitude smaller.

HyperLogLog is the rare data structure that's mathematically optimal for its problem: minimum memory for a given variance bound, given a sketch-only constraint. Every modern analytics database has it under some name — and most of them use HLL++ with precision 14 by default.

Frequently asked questions

Why do leading zeros in hash values estimate cardinality?

If a uniform random hash has k leading zeros, that's an event of probability 2^-k. Among n distinct hashes, the expected maximum number of leading zeros is roughly log_2 n. So 'observed max leading zeros' gives a back-of-envelope estimate of distinct count: cardinality ≈ 2^(max leading zeros). The estimator has high variance on its own — HyperLogLog uses thousands of registers and harmonic mean to reduce variance dramatically.

Why does HyperLogLog use the harmonic mean?

The arithmetic mean of 2^(LZ) values is dominated by outliers — a single anomalously high register inflates the estimate. The harmonic mean (1 / mean(1/x)) gives lower weight to large values, suppressing variance. Combined with the standard correction constant α_m, this brings the standard error down to 1.04/√m where m is the register count. For m = 16384 registers (precision 14), error is ~0.81%.

What's the difference between LogLog and HyperLogLog?

The 2003 LogLog paper used the geometric mean and gave ~1.3/√m relative standard error. The 2007 HyperLogLog paper swapped in the harmonic mean, gaining a ~25% improvement to 1.04/√m. HyperLogLog++ (Google, 2013) added bias correction for low cardinalities and a sparse representation that saves memory when few buckets are populated. The 'HyperLogLog' label today usually means the HLL++ variant.

How much memory does HyperLogLog use?

m registers × 6 bits each (the max useful leading-zero count is 64 for a 64-bit hash). At precision 14 (m=16384), that's 16384 × 6 = 98304 bits ≈ 12 KB. Standard error: 1.04/√16384 ≈ 0.81%. Redis defaults to precision 14 and exactly 12 KB per HLL. Doubling memory halves the standard error — precision 15 (m=32768) uses 24 KB for ~0.57% error.

Can you merge two HyperLogLogs?

Yes — element-wise maximum of corresponding registers. If both HLLs use the same hash function and same precision, taking the max across all m registers produces a valid HLL representing the union of their input sets. The merge is associative and idempotent, which is why HLL is the canonical structure for distributed unique counts: shard the stream across workers, build local HLLs, merge them on a coordinator.

How accurate is HyperLogLog?

Standard error ≈ 1.04/√m where m is the register count. At m=16384 (Redis default): 0.81% std error, so 95% of estimates are within ±1.6% and 99% within ±2.4% of true cardinality. For 1 million distinct items, that's ±8,100 (1σ). For 1 billion items, ±8.1 million. The error is multiplicative (1.04/√m × true count), not additive — HLL is best when relative accuracy matters.

Where is HyperLogLog used in practice?

Redis PFADD/PFCOUNT/PFMERGE — the canonical end-user interface. BigQuery's APPROX_COUNT_DISTINCT (also Snowflake, Presto, ClickHouse). Twitter's unique-user counts. Reddit's daily-active-user dashboards. Web analytics — Google Analytics uses HLL-like structures for unique-visitor counts. Anywhere COUNT(DISTINCT) on billions of rows would be too expensive to compute exactly.