Compression

Burrows-Wheeler Transform

A reversible permutation that clusters similar letters — the engine of bzip2

The Burrows-Wheeler Transform sorts all rotations of a string and outputs the last column. Same letters, new order — and the new order is dramatically more compressible. Core of bzip2.

  • Forward transformO(n) with suffix array
  • Inverse transformO(n) via LF mapping
  • Output sizen characters + 1 row index
  • MemoryO(n)
  • ReversibleYes — bijection with the sentinel
  • Used bybzip2, BWA, Bowtie, FM-index

Interactive visualization

Watch all rotations of BANANA$ sort themselves and the last column emerge as a clustered string.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How the Burrows-Wheeler Transform works

BWT is a permutation of the input string. The number of letters doesn't change, only their order. The new order is engineered to make downstream compression cheap.

The classic algorithm is shockingly simple:

  1. Append a sentinel. Pick a character that doesn't appear in the input — conventionally $ — and append it. The sentinel is treated as lexicographically smaller than any other character.
  2. Generate all rotations. For a string of length n, write out the n rotations (cyclic shifts).
  3. Sort them lexicographically.
  4. Output the last column. Read the rightmost character of each sorted rotation, top to bottom. That string is the BWT.

Example with BANANA$:

All rotations:           Sorted lexicographically:    Last column:
  BANANA$                   $BANANA                       A
  ANANA$B                   A$BANAN                       N
  NANA$BA                   ANA$BAN                       N
  ANA$BAN                   ANANA$B                       B
  NA$BANA                   BANANA$                       $
  A$BANAN                   NA$BANA                       A
  $BANANA                   NANA$BA                       A

BWT(BANANA$) = "ANNB$AA"

The original string BANANA$ contains the letters {A,A,A,B,N,N,$}. So does the BWT ANNB$AA. But the BWT clumps the A's and the N's. Compared to the original, run-length encoding suddenly has something to work with.

Why BWT is reversible

Knowing only the BWT string and one integer (which row of the sorted table was the original), you can recover the input. The trick is the LF mapping:

The i-th occurrence of a character in the last column (L) corresponds to the i-th occurrence of that same character in the first column (F).

Why? In a sorted rotation table, the first column is just the BWT sorted. (The L and F columns share the same multiset of characters.) Within any character group, both columns preserve the relative order of the rotations they came from.

So inverse BWT:

  1. Sort BWT(L) to get the first column F.
  2. For each position in L, record the rank of that character among its peers in L (1st A, 2nd A, etc.). Call this LF[i].
  3. Starting at the row marked as the original, walk: emit F[r], then r = LF[r]. Repeat n times.

The walk reconstructs the input one character at a time, in reverse, using only the BWT and a single starting index.

Why the output is compressible

The structural insight: rotations sort by their starting characters. All rotations starting with th land in adjacent rows. The characters preceding those rotations — sitting in the last column — are drawn from a narrow distribution because of how the source language works. In English, th is preceded by space, e, i, comma — a small set. Long runs of identical characters appear naturally.

Concretely, the BWT of a 1 MB English text has runs averaging 2-4 characters per run, versus 1.05 on the original. RLE alone gives a ~3× win on the transformed string that wasn't available on the original. Layered with move-to-front and Huffman, the total bzip2 compression on a typical 1 MB ASCII document hits ~75-80% reduction.

The bzip2 pipeline

StageWhat it doesWhy it helps
1. Block (100-900 KB)Splits input into independent blocksBounds memory; enables parallel decode
2. RLE pre-passCollapses runs of ≥4 identical bytesReduces BWT input size for pathological inputs
3. Burrows-Wheeler TransformSorts rotations, takes last columnClusters similar characters
4. Move-to-front (MTF)Each char → its position in a recency listMaps runs to streams of small integers
5. RLE on MTF outputCompresses runs of 0s (common after MTF)The MTF output is dominated by 0s for clustered inputs
6. HuffmanEntropy-codes the integer streamFinal bit-packing

Each stage transforms the data into something the next stage can compress harder. The BWT itself never compresses anything — it rearranges. The compression happens in stages 4-6.

BWT-based structures

StructureUse case
Plain BWT + MTF + Huffmanbzip2-style general compression
FM-indexCompressed full-text substring search (DNA aligners, search engines)
BWT-LZ hybridModern compressors mixing BWT block-sort with LZ77 references
Run-length BWT (RLBWT)Genomic data with extreme repetition — petabyte-scale indexes
bzip3Modern BWT compressor using arithmetic coding and SAIS for 2-3× speedup over bzip2
bscBlock sorting compressor — beats bzip2 ratio and speed via better context modeling

When to use BWT

  • You need maximum compression on text and don't care about speed. Bzip2 outperforms gzip by 15-25% on natural language and source code. Pay 2-4× the CPU.
  • Genomic search. The FM-index is the dominant data structure for short-read alignment — BWA, Bowtie, and their derivatives all use it.
  • Streaming-friendly block compression. Each bzip2 block is independent — you can seek into the middle of a compressed file with constant-time overhead per block.
  • You want repetition-friendly compression. BWT excels on inputs with repeated substrings (logs, source code, genomes), worse than LZ-based methods on inputs with long literal runs (binary images, pre-compressed data).

Skip BWT for real-time scenarios (gzip and zstd dominate latency-sensitive workloads) and tiny inputs where the per-block overhead dominates.

Pseudo-code

// Forward BWT.

function bwt(s):
    s = s + "$"                          // append sentinel
    rotations = [s[i:] + s[:i] for i in 0..len(s)]
    rotations.sort()
    last_column = ""
    original_index = -1
    for r, rot in enumerate(rotations):
        last_column += rot[-1]
        if rot == s: original_index = r
    return (last_column, original_index)

// Inverse BWT.

function inverse_bwt(last_column, original_index):
    n = len(last_column)
    first_column = sorted(last_column)

    // Build LF mapping.
    rank_in_first = {}
    for i, c in enumerate(first_column):
        rank_in_first.setdefault(c, []).append(i)

    lf = [0] * n
    counts = {}
    for i, c in enumerate(last_column):
        k = counts.get(c, 0)
        lf[i] = rank_in_first[c][k]
        counts[c] = k + 1

    // Walk LF map from original_index, emit characters.
    result = [""] * n
    r = original_index
    for i in range(n - 1, -1, -1):
        result[i] = first_column[r]
        r = lf.index(r)  // efficient version uses inverse-lf
    return "".join(result[:-1])  // drop sentinel

JavaScript implementation

function bwt(input) {
  const s = input + '$';
  const n = s.length;
  // Build rotations as indices into s+s
  const indices = Array.from({ length: n }, (_, i) => i);
  const doubled = s + s;
  indices.sort((a, b) => {
    const ra = doubled.slice(a, a + n);
    const rb = doubled.slice(b, b + n);
    return ra < rb ? -1 : ra > rb ? 1 : 0;
  });
  let last = '', origIdx = -1;
  for (let r = 0; r < n; r++) {
    const i = indices[r];
    last += s[(i + n - 1) % n];
    if (i === 0) origIdx = r;
  }
  return { bwt: last, origIdx };
}

function inverseBWT(last, origIdx) {
  const n = last.length;
  // First column = sorted last column.
  const first = [...last].sort().join('');
  // Build LF mapping.
  const counts = {};
  const occ = [];
  for (const c of last) {
    occ.push(counts[c] || 0);
    counts[c] = (counts[c] || 0) + 1;
  }
  // For each char, where do its occurrences start in first column?
  const firstOf = {};
  for (let i = 0; i < n; i++) {
    if (firstOf[first[i]] === undefined) firstOf[first[i]] = i;
  }
  const lf = new Array(n);
  for (let i = 0; i < n; i++) lf[i] = firstOf[last[i]] + occ[i];
  // Walk: F[r], then r = lf[r], n times.
  let r = origIdx, out = '';
  for (let i = 0; i < n; i++) {
    out = first[r] + out;
    r = lf[r];
  }
  return out.replace(/\$$/, '');  // strip sentinel
}

const { bwt: encoded, origIdx } = bwt('BANANA');
console.log(encoded);             // ANNB$AA
console.log(inverseBWT(encoded, origIdx));  // BANANA

Python implementation

def bwt(text):
    s = text + '$'
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    last_col = ''.join(rot[-1] for rot in rotations)
    orig_idx = rotations.index(s)
    return last_col, orig_idx

def inverse_bwt(last, orig_idx):
    n = len(last)
    first = ''.join(sorted(last))

    # Count rank of each occurrence in last column.
    counts = {}
    occ = []
    for c in last:
        occ.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1

    # First-column starting position for each char.
    first_of = {}
    for i, c in enumerate(first):
        if c not in first_of: first_of[c] = i

    lf = [first_of[last[i]] + occ[i] for i in range(n)]

    r, out = orig_idx, []
    for _ in range(n):
        out.append(first[r])
        r = lf[r]
    return ''.join(reversed(out)).rstrip('$')

encoded, idx = bwt('BANANA')
print(encoded)             # ANNB$AA
print(inverse_bwt(encoded, idx))  # BANANA

Real implementations replace the naive rotations.sort() with a suffix-array construction (SA-IS or DC3) in O(n). The bzip2 reference uses a hand-tuned multi-key quicksort with a fallback to a radix sort.

Common BWT bugs and edge cases

  • Forgetting the sentinel. Without a unique terminator, the rotations aren't uniquely orderable when the input has period repetitions (ABABAB). The sentinel breaks ties and ensures reversibility.
  • Storing the original index wrong. The inverse needs which row in the sorted table was the original. Off-by-one here turns the inverse into garbage. Encode it as a 32-bit integer in the file header.
  • Quadratic memory in the naive version. Building all n rotations explicitly is O(n²) memory — fatal at 100 KB blocks. Always work with suffix-array indices, not literal rotation strings.
  • Inverse LF walk direction. The walk emits characters in reverse order. Forgetting to reverse the result returns the input backwards.
  • Mixing block boundaries. Bzip2's blocks are independent; if you concatenate two compressed streams' BWTs naively you get nonsense. Process block-at-a-time.
  • Sentinel collision. If $ already appears in the input (e.g., shell scripts), you need a different sentinel or an end-of-string marker that's structurally distinct (e.g., byte 0xFF + escape encoding).

Performance in real systems

  • bzip2 -9 (900 KB blocks): ~15-25 MB/s encode, ~30-45 MB/s decode on a modern CPU. ~78-82% reduction on text.
  • bzip3: ~50-100 MB/s encode (uses arithmetic coding + SAIS), similar ratio to bzip2.
  • FM-index size: The compressed human genome (3 Gbp) fits in ~1 GB as an FM-index, supporting microsecond substring queries.
  • BWA alignment: Maps 100M short reads against the human genome in ~30 minutes on 8 cores using a BWT-based index.
  • Block size trade-off: bzip2 -1 uses 100 KB blocks (faster, less RAM, slightly worse ratio); -9 uses 900 KB (~3% better ratio). Decode RAM scales with block size.
  • vs gzip: On a 100 MB English text corpus, gzip produces ~32 MB, bzip2 ~24 MB — a 25% advantage paid for with ~3× the encode time.

BWT is the canonical example of a compression algorithm whose value is in preparation, not compression. It shrinks nothing on its own — it shapes data into a form that simpler stages can shrink dramatically.

Frequently asked questions

Why does the Burrows-Wheeler Transform help compression if it doesn't shrink the string?

BWT is a permutation — the output has exactly the same characters in different order. It doesn't compress on its own. The win is that the output is far more compressible than the input by simple downstream methods. Characters that were preceded by similar contexts in the source cluster into runs in the BWT, so run-length encoding plus move-to-front plus Huffman can squeeze the output much harder than they could squeeze the original.

How is BWT reversible without storing extra information?

Almost — you must store one integer alongside the BWT string: the row index of the original (which rotation in the sorted table was the original string, indicated by the sentinel character $). Given the BWT string and that one row index, the inverse is uniquely determined by the LF mapping — the i-th occurrence of a character in the last column matches the i-th occurrence in the first column. Repeatedly apply LF and you reconstruct the original in O(n).

What's the difference between BWT and a suffix array?

A suffix array is the array of starting indices of all suffixes in lexicographic order — it's the engine that builds the BWT in O(n) time. The BWT itself is the string formed by reading the character just before each sorted suffix. Conceptually: suffix array tells you the order, BWT extracts the characters in that order. The FM-index combines both for sublinear-memory exact substring search.

How does bzip2 use BWT?

Bzip2 works in 100-900 KB blocks. For each block: 1) Apply BWT. 2) Apply move-to-front (MTF) — replace each character with its current position in a list, then move it to the front. 3) Apply RLE on the resulting integer stream. 4) Huffman-code the result. The BWT step makes the MTF stream have many small numbers (often zeros), which RLE+Huffman compress extremely well. Bzip2 beats gzip by ~15-25% on text at the cost of being 2-4× slower.

What's the time and space complexity of BWT?

O(n) time and O(n) memory using a suffix array construction algorithm (DC3 or SA-IS). The naive 'sort all rotations' is O(n² log n) — fine for tutorials, useless for 100 KB blocks. Bzip2's reference implementation uses a custom DC3-style algorithm tuned for short blocks. Inverse BWT is O(n) time and O(n) space via the LF mapping.

Why does BWT cluster similar characters?

Rotations are sorted by their starting characters. All rotations starting with 'th' land adjacent in the sorted table. The characters preceding those rotations — the last column entries — are often the same letter, because words like 'the', 'they', 'them' all have similar preceding letters in natural language. So the last column ends up with long runs of repeated characters. Information theory: BWT exposes the contextual structure of the source.

Is BWT used in DNA analysis?

Heavily. The FM-index — a BWT-based data structure — powers BWA, Bowtie, and Bowtie2, the dominant short-read aligners. Mapping a billion 100-bp DNA reads to a 3-billion-base genome requires sublinear-memory exact and approximate search, which the FM-index delivers in O(query length) time per read. The 3 GB human genome compresses to ~1 GB index, fits in RAM, and supports backwards substring search in microseconds.