Data Structures

Wavelet Tree

Q: How does recursion on alphabet halves work?

At the root, partition the alphabet [0, sigma) into lower half [0, sigma/2) and upper half [sigma/2, sigma). The root bitvector has length n: position i is 0 if s[i] is in the lower half, 1 if in the upper. Now form two subsequences — the lower-half characters in order (left child) and the upper-half characters in order (right child) — and recurse on each with the corresponding alphabet half. Tree depth is log2(sigma); leaves represent single characters. Total bits stored across all levels is exactly n * ceil(log2(sigma)). The structure exists implicitly — only the bitvectors plus their rank/select index are stored.

Q: What is rank and select on a bitvector?

rank_b(B, i) = number of bits equal to b in B[0..i-1]. select_b(B, k) = position of the k-th bit equal to b. Both are O(1) using a precomputed index of about 0.05n extra bits — superblock counts (every 512 bits), block counts (every 64 bits), and word-level popcount. The classic Jacobson 1989 / Clark 1996 indexes plus Vigna's broadword tricks deliver rank/select on any 64-bit popcount-supporting CPU in 5 to 20 nanoseconds. The wavelet tree's O(log sigma) query bound depends on each level being O(1) — every bitvector in the tree carries its own rank/select index.

Q: How does the FM-index use wavelet trees?

The FM-index (Ferragina, Manzini 2000) is a compressed full-text index built on the Burrows-Wheeler transform L of a string T, plus a count array C and a wavelet tree over L for rank queries. Backward search for a pattern P of length m: start with [0, n] and shrink by repeated rank queries on the BWT — for each character P[i], compute new range [C[c] + rank_c(L, l), C[c] + rank_c(L, r)]. Each step is O(log sigma); total query is O(m log sigma) plus locate. This is a sub-linear query on a compressed text — used in Bowtie, BWA, and SDSL succinct libraries for genomic search.

Q: When do I want this over a hash or array?

When the alphabet is large, the sequence is long, and you need rank/select/quantile/range queries. A hash table can answer membership in O(1) but cannot answer rank_c(s, i) (count of c in prefix) or quantile queries (the k-th smallest in a range) without auxiliary structures. A character-indexed 2D array can answer rank in O(1) but uses sigma * n bits — prohibitive when sigma > 256. Wavelet trees give O(log sigma) for all these queries with only n log sigma bits — within o(1) of the entropy bound. Used in bioinformatics (alphabet 4 to 20), text search (Unicode), graph adjacency lists, and time-series compressed indexes.

Recursively partition the alphabet — answer rank/select/quantile in O(log σ) on a packed bitvector

A wavelet tree is a succinct data structure for sequences over an alphabet of size σ, supporting rank, select, and access queries in O(log σ) time using only n log σ + o(n log σ) bits of space. The construction recursively partitions the alphabet into halves; at each level, a bitvector marks which characters fell in the upper half. Combined with constant-time bitvector rank, you get rank_c(s, i) — the count of character c in s[0..i] — in O(log σ). Invented by Grossi, Gupta, Vitter in 2003. Foundational for FM-index full-text search, range quantiles, and 2D dominance queries.

Spacen log σ + o(n log σ) bits
Rank/selectO(log σ)
ConstructionO(n log σ)
Foundational useFM-index
InventedGrossi Gupta Vitter 2003
Range quantileO(log σ)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Why wavelet tree matters

Compressed full-text index. The FM-index pairs the Burrows-Wheeler transform with a wavelet tree to compress a text T to roughly n H_k(T) bits while supporting backward-search queries in O(m log σ) time. Tools like BWA and Bowtie use it to align billions of DNA reads against the human genome (3 * 10^9 bases) in a few gigabytes of RAM rather than tens of gigabytes uncompressed.
Bioinformatics. DNA over alphabet {A, C, G, T}, σ = 4, so log2(σ) = 2 — every wavelet-tree operation costs 2 bitvector queries. Protein sequences have σ ≈ 20, log2(σ) ≈ 4.3. Both fit easily in a few CPU cycles.
Document retrieval. Inverted indexes augmented with wavelet trees support snippet ranking, top-k document retrieval, and range-restricted queries (find docs in date range mentioning a term) in O(log σ) per access without unrolling the entire posting list.
Range quantile queries. Given a sequence of integers and a query range [i, j], find the median or k-th smallest in O(log σ). Used in Online Analytical Processing (OLAP) over compressed columnar data.
2D dominance queries. Sort points by x; build a wavelet tree on y values. Range counting, range maximum, and dominance queries reduce to wavelet-tree operations in O(log σ) per query.
Sequence summarization. Histogram queries, frequent elements, and top-k extraction over arbitrary substring ranges — all O(log σ) per element returned. The basis for "succinct analytics" research.

Construction algorithm

Input. A sequence s of length n over alphabet [0, σ).
Root level. Build bitvector B_root of length n. Set B_root[i] = 1 if s[i] >= σ/2 else 0. Build the rank/select index over B_root.
Partition. Form left subsequence s_L = characters in s with value < σ/2, in original order. Form right subsequence s_R = characters >= σ/2 (typically minus σ/2 to renumber).
Recurse. Build wavelet tree of s_L over alphabet [0, σ/2); build wavelet tree of s_R over alphabet [0, σ/2). Stop when alphabet has size 1.
Total time. Each level processes n bits, and there are log2(σ) levels — total O(n log σ).
Total space. Sum of bitvector lengths across levels is exactly n at every level (each character contributes 1 bit per level), giving n * log2(σ) bits for the bitvectors plus o(n log σ) for the rank/select indexes. Optimal up to lower-order terms.

Query algorithms

Access(i). Walk root to leaf. At each node, read B[i]: if 0, recurse left with i' = rank_0(B, i+1) - 1; if 1, recurse right with i' = rank_1(B, i+1) - 1. Reaching a leaf yields the character. O(log σ).
Rank_c(i). Walk root to leaf following the binary expansion of c. At each node, if c's bit is 0, set i = rank_0(B, i) and go left; else i = rank_1(B, i) and go right. The final i at the leaf is rank_c. O(log σ).
Select_c(k). Walk leaf to root. At the leaf for c, start with k. At each step up, transform: if coming from a left child, the new position is select_0(B_parent, k); else select_1(B_parent, k). O(log σ).
Range count. Count occurrences of c in s[i..j] = rank_c(j+1) - rank_c(i). O(log σ).
Range quantile. Walk root to leaf. At each node compute zero_count = rank_0(B, j+1) - rank_0(B, i). If k <= zero_count, recurse left with new range; else subtract zero_count from k and recurse right. O(log σ).
Range top-k. Best-first search through the tree using a priority queue keyed by node frequency. O((k + 1) log σ).

Variants

Balanced wavelet tree. Partition the alphabet at the median (numeric center). Depth = ceil(log2(σ)). Worst-case query O(log σ).
Huffman-shaped wavelet tree. Use character frequencies to determine partition. Depth proportional to character entropy. Total bits = n * H_0(s) where H_0 is zero-order entropy. Average query time O(H_0 + 1).
Wavelet matrix. Claude and Navarro 2012 — concatenate all level bitvectors into one big bitvector, with offsets per level. Same asymptotic bounds, much better cache behavior on real CPUs (single sequential bitvector instead of pointer-chasing per level). Typical 2 to 4x query speedup. Default in modern succinct libraries.
Generalized wavelet tree. Use sigma-ary alphabet partition instead of binary, reducing depth to log_sigma(σ) at the cost of more bits per level. Useful when σ is small and CPU has wide SIMD.
Multiary / k-ary wavelet tree. Partition into k subsets per level; bitvectors generalize to k-ary sequences encoded with log2(k) bits each. Speeds up rank/select for moderate σ.
Wavelet tree over implicit alphabet. Rank-Space encode the input first to compress σ to the count of distinct characters. Combined with an inverse map, this drops space and accelerates queries when only a small subset of the alphabet appears.

Space-saving math

Plain encoding. n characters at log2(σ) bits each = n log2(σ) bits. Cannot answer rank/select without auxiliary structure.
Wavelet tree. n * log2(σ) bits for the bitvectors + o(n log σ) for the rank/select indexes (typically 5 to 10% overhead). Answers rank/select/quantile in O(log σ).
Compressed bitvectors. Replace each level's plain bitvector with a compressed encoding (RRR, gap encoding, sparse Elias-Fano). Total bits drop to n * H_0(s) + o(n log σ).
Concrete example. Human genome n = 3 * 10^9, σ = 4. Plain: 6 * 10^9 bits = 750 MB. Wavelet tree: ~6 * 10^9 + 5% overhead ≈ 790 MB. With BWT compression: ~600 MB total — fits in RAM on a laptop.
Bigger alphabets. Wikipedia text n = 10^10 characters, σ = 256 (bytes). Plain: 80 * 10^9 bits = 10 GB. Wavelet tree: 10 GB + 0.5 GB index. With Huffman shape: ~60 * 10^9 bits = 7.5 GB.

Common misconceptions

"Succinct = compressed." Different goals. Compressed structures aim for n H_k bits of space, often without random access. Succinct structures aim for the information-theoretic minimum bits while keeping rank/select queries O(1) or near. Wavelet trees are succinct in this sense — within o(1) of the entropy lower bound.
"Complex implementation." A first-cut wavelet tree fits in 200 lines of C++. Production-quality versions (with cache-friendly bitvector packing and rank/select indexes) are bigger but readily available in libraries like SDSL-lite.
"Worse than a hash table." Different operations. Hash tables answer point membership in O(1). Wavelet trees answer rank, select, range count, and range quantile that hashes cannot answer at all without auxiliary indexes. The right structure depends on the query mix.
"Only useful for text." Wavelet trees apply to any integer sequence over a bounded alphabet — graph adjacency lists, time series, sorted permutations. Range-distinct, range-quantile, and 2D dominance queries on such sequences map directly to wavelet-tree primitives.
"σ must be a power of 2." Not required. Pad the alphabet to the next power of 2 with dummy characters that never appear, or implement non-power-of-2 partitions explicitly (slightly more bookkeeping, same asymptotics).
"FM-index needs the whole tree at query time." FM-index uses only rank queries, never select; it only needs the root-to-leaf walk per character. Memory-mapped wavelet matrices answer FM-index queries with one disk page per level.
"O(log σ) is slower than a hash." For σ = 256, log2(σ) = 8. With a wavelet matrix and broadword rank, each query is ~50 ns — within 3x of a hashmap lookup, while supporting orders-of-magnitude richer queries.

Frequently asked questions

How does recursion on alphabet halves work?

At the root, partition the alphabet [0, sigma) into lower half [0, sigma/2) and upper half [sigma/2, sigma). The root bitvector has length n: position i is 0 if s[i] is in the lower half, 1 if in the upper. Now form two subsequences — the lower-half characters in order (left child) and the upper-half characters in order (right child) — and recurse on each with the corresponding alphabet half. Tree depth is log2(sigma); leaves represent single characters. Total bits stored across all levels is exactly n * ceil(log2(sigma)). The structure exists implicitly — only the bitvectors plus their rank/select index are stored.

What is rank and select on a bitvector?

rank_b(B, i) = number of bits equal to b in B[0..i-1]. select_b(B, k) = position of the k-th bit equal to b. Both are O(1) using a precomputed index of about 0.05n extra bits — superblock counts (every 512 bits), block counts (every 64 bits), and word-level popcount. The classic Jacobson 1989 / Clark 1996 indexes plus Vigna's broadword tricks deliver rank/select on any 64-bit popcount-supporting CPU in 5 to 20 nanoseconds. The wavelet tree's O(log sigma) query bound depends on each level being O(1) — every bitvector in the tree carries its own rank/select index.

How does the FM-index use wavelet trees?

The FM-index (Ferragina, Manzini 2000) is a compressed full-text index built on the Burrows-Wheeler transform L of a string T, plus a count array C and a wavelet tree over L for rank queries. Backward search for a pattern P of length m: start with [0, n] and shrink by repeated rank queries on the BWT — for each character P[i], compute new range [C[c] + rank_c(L, l), C[c] + rank_c(L, r)]. Each step is O(log sigma); total query is O(m log sigma) plus locate. This is a sub-linear query on a compressed text — used in Bowtie, BWA, and SDSL succinct libraries for genomic search.

When do I want this over a hash or array?

When the alphabet is large, the sequence is long, and you need rank/select/quantile/range queries. A hash table can answer membership in O(1) but cannot answer rank_c(s, i) (count of c in prefix) or quantile queries (the k-th smallest in a range) without auxiliary structures. A character-indexed 2D array can answer rank in O(1) but uses sigma * n bits — prohibitive when sigma > 256. Wavelet trees give O(log sigma) for all these queries with only n log sigma bits — within o(1) of the entropy bound. Used in bioinformatics (alphabet 4 to 20), text search (Unicode), graph adjacency lists, and time-series compressed indexes.

What is the difference between a balanced and a Huffman-shaped wavelet tree?

A balanced wavelet tree partitions the alphabet at the median, giving depth ceil(log2(sigma)) and n * ceil(log2(sigma)) bits — O(log sigma) per op. A Huffman-shaped wavelet tree uses character frequencies: frequent characters live near the root, rare ones deep down. Total bits become n * H_0(s), where H_0 is the empirical zero-order entropy — typically much less than log2(sigma). Average query time becomes O(H_0(s) + 1), often 2 to 5x faster on real text. Variants include the wavelet matrix (Claude, Navarro 2012) which packs all level bits into a single big bitvector for cache efficiency.

How is range quantile computed?

Quantile(i, j, k) = the k-th smallest character in s[i..j]. Walk down the wavelet tree: at the root, count zeros in s[i..j] using rank_0(B, j+1) - rank_0(B, i). If k <= zero_count, the answer is in the left subtree (lower alphabet half); recurse left with the projected range. Otherwise, the answer is in the right subtree; subtract zero_count from k and recurse right. Total work is O(log sigma). Variants compute range majority, range top-k frequent characters, and range dominant — all in O(log sigma) with the wavelet tree as the core engine.