Compression

Huffman Coding

Frequent symbols, short codes — optimal prefix coding in O(n log n)

Huffman coding is a lossless compression scheme that assigns variable-length prefix codes to symbols by frequency. Build a min-heap, repeatedly merge the two lowest-frequency nodes. Powers gzip, JPEG, MP3.

  • Tree constructionO(n log n) for n symbols
  • Encode / decodeO(L) input length
  • Compression ratio~43-50% on English text
  • OptimalityOptimal symbol-by-symbol prefix code
  • Code propertyPrefix-free, self-delimiting
  • Used bygzip, DEFLATE, JPEG, MP3, PNG, PKZIP

Interactive visualization

Watch the tree build itself bottom-up by repeatedly fusing the two lowest-frequency nodes.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How Huffman coding works

The motivation in one sentence: in normal English text, the letter e appears about 12% of the time and z about 0.07%. Storing both as 8-bit ASCII wastes bits on common letters and stingily underspends on rare ones. Huffman fixes this by giving common letters short codes (2-3 bits) and rare letters long codes (12+ bits). The expected total cost goes down.

The algorithm is greedy and runs in two passes:

  1. Count frequencies. Walk the input once, build a histogram freq[symbol].
  2. Build the tree. Put every (frequency, symbol) pair into a min-heap. While the heap has more than one node, extract the two smallest, create a parent whose frequency is their sum, push the parent back. The last node is the root.
  3. Assign codes. Walk the tree from the root: going left appends a 0 bit to the current code, going right appends a 1. Every leaf gets a unique bit string whose length equals its depth.
  4. Encode. Replace each input symbol with its code, concatenating bits.
  5. Decode. Walk the tree from the root using each bit, emit the leaf's symbol when one is reached, restart at the root.

Worked example with five symbols and frequencies A=45, B=13, C=12, D=16, E=9, F=5:

Initial heap:  [F:5, E:9, C:12, B:13, D:16, A:45]

Merge F+E → (FE:14):    [C:12, B:13, FE:14, D:16, A:45]
Merge C+B → (CB:25):    [FE:14, D:16, CB:25, A:45]
Merge FE+D → (FED:30):  [CB:25, FED:30, A:45]
Merge CB+FED → (X:55):  [A:45, X:55]
Merge A+X → (root:100): [root:100]

Codes (left=0, right=1):
  A = 0           (1 bit)
  C = 100         (3 bits)
  B = 101         (3 bits)
  F = 1100        (4 bits)
  E = 1101        (4 bits)
  D = 111         (3 bits)

Expected bits per symbol: 45·1 + 13·3 + 12·3 + 16·3 + 9·4 + 5·4 = 224 bits. Fixed-width 3-bit coding (we need ⌈log₂ 6⌉ = 3 bits) would cost 100·3 = 300 bits. Huffman saves 25%.

The prefix property — why decoding is unambiguous

Every Huffman code lives at a leaf of the binary tree, never at an internal node. That means no code is a prefix of another. The decoder can therefore read bits left-to-right and emit a symbol the moment it lands at a leaf, then jump back to the root for the next symbol — no field separators, no length-prefix, no lookahead.

If you wrote codes ad-hoc — say, A=0, B=01, C=10 — you'd have ambiguity. The bits 010 could mean AB or AC or some other reading. The tree structure makes this impossible by construction.

Huffman variants

VariantTrickWhy use it
Static HuffmanBuild tree once over the whole input, transmit the treeBest compression when input is known up front
Adaptive Huffman (FGK, Vitter)Tree evolves during encoding; decoder mirrors updatesStreaming, no separate tree transmission
Canonical HuffmanStore only code lengths, not the tree itselfJPEG and DEFLATE — header overhead is tiny
Length-limited Huffman (Package-Merge)Cap maximum codeword length (e.g., ≤15 bits in DEFLATE)Fixed-size decoding tables, hardware decoders
Range / arithmetic codingTreats whole message as a fractional intervalBeats Huffman's whole-bit constraint; bzip2 / ZSTD use it
ANS (asymmetric numeral systems)State-based coding using a precomputed tableModern: Zstandard, LZFSE — faster than arithmetic, same ratio

For most use cases today, canonical Huffman dominates because the tree representation is compact (just a list of code lengths per symbol) and decoding can be table-driven for cache efficiency. DEFLATE — the algorithm inside gzip and PNG — uses canonical Huffman with a maximum code length of 15 bits.

When to use Huffman coding

  • You need a fast, lossless, license-free compressor. Huffman has been off-patent since the 1950s, runs in linear time after a single histogram pass, and is the bottom layer of most general-purpose compressors.
  • Your data is skewed. Pure Huffman shines when a small number of symbols dominate the histogram — natural language, DNA (4 symbols with biased frequencies), pixel-difference channels in JPEG.
  • You're building a hybrid compressor. Almost every real-world codec layers Huffman on top of something else. LZ77 + Huffman = DEFLATE. RLE + MTF + Huffman = bzip2's tail end. DCT + Huffman = JPEG.
  • Hardware decoding matters. Canonical Huffman tables fit in tiny ROMs and decode in one or two table lookups per symbol; arithmetic and ANS coders need multiplication.

Skip Huffman when your alphabet has <4 symbols (the rounding to whole bits hurts you), when the data is already uniformly random (no symbol is more likely than another, no compression possible), or when you can afford arithmetic coding's slightly better ratio at the cost of CPU.

Huffman vs other compressors

HuffmanArithmeticLZ77ZSTD
Code granularityWhole bits per symbolFractional bitsSubstring referencesLZ77 + ANS
Ratio on English text~50% reduction~55%~60%~70-75%
Encode speed (MB/s)~200-400~50-100~150~400-1000
Decode speed~500-1000 MB/s~100 MB/s~400~1500
LicenseFree since 1952Patent-clear since ~2010FreeFree (Facebook)
Streaming-friendlyYes (adaptive)YesYesYes
Best forFinal bit-packingMaximum ratio per CPU cycleRepetitive dataGeneral-purpose 2020s

The unit of comparison matters. Huffman by itself underperforms because real data has repetition that whole-symbol coding can't exploit. Pair it with LZ77 (which finds and removes repetition) and the combination — DEFLATE — has been the default of the web for thirty years.

Pseudo-code

// Huffman tree construction.

function huffmanTree(frequencies):
    heap = min-heap of (freq, node) pairs
    for symbol, freq in frequencies:
        heap.push((freq, Leaf(symbol)))

    while heap.size > 1:
        (f1, n1) = heap.pop()
        (f2, n2) = heap.pop()
        merged = Internal(left=n1, right=n2)
        heap.push((f1 + f2, merged))

    root = heap.pop().node
    return root

// Walk tree to extract codes.
function buildCodebook(node, prefix="", codes={}):
    if node is Leaf:
        codes[node.symbol] = prefix or "0"  // single-symbol edge case
    else:
        buildCodebook(node.left, prefix + "0", codes)
        buildCodebook(node.right, prefix + "1", codes)
    return codes

// Encode: lookup each symbol in codebook.
function encode(text, codes):
    return join([codes[c] for c in text])

// Decode: walk tree per bit.
function decode(bits, root):
    out = ""
    node = root
    for b in bits:
        node = node.left if b == "0" else node.right
        if node is Leaf:
            out += node.symbol
            node = root
    return out

JavaScript implementation

class MinHeap {
  constructor() { this.h = []; }
  push(item) {
    this.h.push(item);
    let i = this.h.length - 1;
    while (i > 0) {
      const p = (i - 1) >> 1;
      if (this.h[p][0] <= this.h[i][0]) break;
      [this.h[p], this.h[i]] = [this.h[i], this.h[p]];
      i = p;
    }
  }
  pop() {
    const top = this.h[0];
    const last = this.h.pop();
    if (this.h.length) {
      this.h[0] = last;
      let i = 0, n = this.h.length;
      for (;;) {
        const l = 2*i+1, r = 2*i+2; let s = i;
        if (l < n && this.h[l][0] < this.h[s][0]) s = l;
        if (r < n && this.h[r][0] < this.h[s][0]) s = r;
        if (s === i) break;
        [this.h[s], this.h[i]] = [this.h[i], this.h[s]];
        i = s;
      }
    }
    return top;
  }
  size() { return this.h.length; }
}

function buildTree(text) {
  const freq = new Map();
  for (const ch of text) freq.set(ch, (freq.get(ch) || 0) + 1);
  const heap = new MinHeap();
  for (const [ch, f] of freq) heap.push([f, { sym: ch }]);
  while (heap.size() > 1) {
    const [f1, n1] = heap.pop();
    const [f2, n2] = heap.pop();
    heap.push([f1 + f2, { left: n1, right: n2 }]);
  }
  return heap.pop()[1];
}

function buildCodes(node, prefix = '', codes = {}) {
  if (node.sym !== undefined) { codes[node.sym] = prefix || '0'; return codes; }
  buildCodes(node.left, prefix + '0', codes);
  buildCodes(node.right, prefix + '1', codes);
  return codes;
}

function encode(text) {
  const tree = buildTree(text);
  const codes = buildCodes(tree);
  const bits = [...text].map(c => codes[c]).join('');
  return { tree, bits };
}

function decode(bits, tree) {
  let out = '', node = tree;
  for (const b of bits) {
    node = b === '0' ? node.left : node.right;
    if (node.sym !== undefined) { out += node.sym; node = tree; }
  }
  return out;
}

const { tree, bits } = encode('ABRACADABRA');
console.log(bits.length, 'bits vs', 11 * 8, 'fixed-width');
console.log(decode(bits, tree));  // ABRACADABRA

Python implementation

import heapq
from collections import Counter

class Node:
    def __init__(self, sym=None, freq=0, left=None, right=None):
        self.sym, self.freq, self.left, self.right = sym, freq, left, right
    def __lt__(self, other):
        return self.freq < other.freq

def build_tree(text):
    freq = Counter(text)
    heap = [Node(sym, f) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1 = heapq.heappop(heap)
        n2 = heapq.heappop(heap)
        heapq.heappush(heap, Node(freq=n1.freq + n2.freq, left=n1, right=n2))
    return heap[0]

def build_codes(node, prefix='', codes=None):
    if codes is None: codes = {}
    if node.sym is not None:
        codes[node.sym] = prefix or '0'
        return codes
    build_codes(node.left, prefix + '0', codes)
    build_codes(node.right, prefix + '1', codes)
    return codes

def encode(text):
    tree = build_tree(text)
    codes = build_codes(tree)
    return tree, ''.join(codes[c] for c in text)

def decode(bits, tree):
    out, node = [], tree
    for b in bits:
        node = node.left if b == '0' else node.right
        if node.sym is not None:
            out.append(node.sym)
            node = tree
    return ''.join(out)

tree, bits = encode('ABRACADABRA')
print(len(bits), 'bits vs', 11 * 8, 'fixed-width')
print(decode(bits, tree))

For "ABRACADABRA" the Huffman tree assigns A=0, B=10, R=110, C=1110, D=1111. Total length: 23 bits. Fixed-width (3 bits for 5 symbols) would cost 33 bits — Huffman saves 30%.

Common Huffman bugs and edge cases

  • Forgetting the tree. The decoder needs the tree — or the canonical code lengths — to reverse the encoding. Streaming the bits without the table produces garbage. Real-world formats prepend a compact tree serialization.
  • Single-symbol input. If the entire input is one character, the heap starts with one node and never enters the merge loop. Your "tree" is a single leaf with no code. The fix: when only one symbol exists, assign it the code 0 as a special case.
  • Unbounded code lengths. A pathological frequency distribution (Fibonacci-shaped) produces codes as long as n bits. DEFLATE caps lengths at 15 and falls back to a length-limited variant when needed. Fixed-size hardware decoders need this constraint.
  • Tie-breaking order. When two heap entries have equal frequency, the order you pop them affects the resulting tree (not the optimality, but the specific codes). Encoders and decoders must use the same tie-break rule or shipping a tree across systems goes wrong.
  • Static vs adaptive mismatch. Encoder uses static, decoder expects adaptive — or vice versa. Output looks compressed but is garbled. Always serialize the variant in the file header.
  • Naive frequency counting on huge inputs. Walking a 10 GB file twice (once to count, once to encode) is fine for offline tools but wrong for streaming. Use adaptive Huffman or a sample-based static tree.

Performance in real systems

Concrete numbers from production codecs:

  • gzip default (-6): ~50 MB/s encode, ~250 MB/s decode on a modern x86 core. ~65% reduction on English text.
  • JPEG baseline: Huffman tables consume ~400 bytes in the header; decoding is table-driven and adds <5% to total decode time.
  • DEFLATE huffman pass alone: contributes ~15-25% extra reduction on top of LZ77 substring matches.
  • Canonical Huffman decode: One memory lookup per ~6-8 bits with a 9-bit fast table; falls through to a slow path only for the long codes.
  • Entropy gap: Huffman is ~5-10% above Shannon entropy for English; arithmetic coding closes that gap at ~4-8× the CPU cost.

The takeaway: Huffman is rarely the bottleneck. The bottleneck is usually the substring-match pass (LZ77, BWT) that runs before Huffman. Huffman is what you reach for to squeeze the last 10-25% out of any byte-stream where some symbols are common and others are rare.

Frequently asked questions

Why is Huffman coding optimal?

Among all symbol-by-symbol prefix codes — meaning codes where no codeword is a prefix of another and each symbol maps to its own codeword — Huffman produces the minimum expected codeword length. The greedy choice of merging the two lowest-frequency nodes can be proven optimal by an exchange argument: any other pair you merge first can be swapped with Huffman's choice without increasing total cost. Note: arithmetic coding and ANS can compress better because they don't restrict each symbol to a whole number of bits.

What's a prefix code and why does Huffman use one?

A prefix code is a code where no codeword is a prefix of another. This means the decoder can read the bitstream left-to-right and decode each symbol the moment it has enough bits — no separators or fixed-width framing needed. Huffman's tree structure guarantees the prefix property automatically: every symbol lives at a leaf, so its path can't be a prefix of any other path.

How much compression does Huffman achieve on English text?

About 43-50% reduction — roughly 4.7 bits per character versus 8 bits ASCII. The Shannon entropy of English is around 4.1 bits/char, and Huffman gets within ~10% of that limit. Gzip layers Huffman on top of LZ77 (which replaces repeats with back-references) to reach 65-75% reduction on typical text. Pure Huffman on a single 100 KB English document compresses to roughly 56 KB.

Why does gzip use Huffman coding?

Gzip is DEFLATE, which is LZ77 followed by Huffman. LZ77 finds repeated substrings and replaces them with (distance, length) back-references — but those references and literal bytes are themselves drawn from a non-uniform distribution. Huffman compresses that distribution: literal 'e' is frequent, literal 'q' is rare, short distances are frequent, long ones rare. The Huffman pass on DEFLATE's intermediate stream typically saves another 15-25% on top of LZ77's wins.

What's the difference between static and adaptive Huffman?

Static Huffman scans the entire input first to count frequencies, builds the tree, then encodes — the tree must be transmitted alongside the data. Adaptive Huffman (Vitter's algorithm, FGK) maintains the tree as it encodes, updating frequencies on the fly. Decoder mirrors the same updates. Adaptive avoids transmitting the tree and handles streaming data, but each symbol triggers tree rebalancing — slower in practice. Static dominates in production.

What's the time complexity of Huffman tree construction?

O(n log n) for n distinct symbols using a min-heap. Each of the n-1 merge operations does two heap extractions (O(log n) each) and one insertion (O(log n)). If frequencies are already sorted, a two-queue variant achieves O(n). Encoding once the tree is built is O(L) where L is total input length; decoding is also O(L) using a state-machine flattening of the tree.

What if all symbols have the same frequency?

Huffman degenerates into a balanced binary tree, and the code becomes fixed-width — every symbol gets ⌈log₂ n⌉ bits. No compression. Huffman's advantage requires skewed frequencies; the more skewed the distribution, the bigger the win. On uniformly random data, Huffman cannot compress at all (information theory forbids it), and the tree overhead actually makes the output slightly larger.