Machine Learning

Byte Pair Encoding (BPE)

A compression trick from 1994 that became the way GPT reads

Byte Pair Encoding (BPE) builds a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair, turning rare words into reusable pieces — the tokenizer behind GPT, RoBERTa, and most modern LLMs.

Invented1994 (Gage) · 2016 (Sennrich, for NLP)
Training costO(N · V) naive, ~O(N log V) heap
Encoding costO(M log M) per word
Typical GPT vocab~50k (GPT-2) · ~100k+ (GPT-4)
Base alphabet256 bytes (byte-level)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How BPE works

Modern language models can't read raw text. They read integers — token IDs that index a fixed embedding table. The hard question is: what should a token be? Make tokens whole words and your vocabulary explodes (English alone has millions of word forms) while every unseen word collapses into a useless <unk>. Make tokens single characters and sequences get so long that attention, which costs O(sequence²), becomes ruinous. BPE threads the needle by learning subword units — pieces between characters and words, sized by how often they actually occur.

The algorithm started life as a data-compression scheme. Philip Gage published it in 1994 in The C Users Journal under the name "A New Algorithm for Data Compression." Rico Sennrich, Barry Haddow, and Alexandra Birch repurposed it for neural machine translation in their 2016 ACL paper, and it has been the default tokenizer for GPT-style models ever since.

Training a BPE tokenizer is a loop with one idea repeated until the vocabulary is full:

Seed. Split the corpus into words, then split each word into its base symbols (single characters, or single bytes in byte-level BPE). Append a word-boundary marker — GPT-2 prefixes a special space glyph, the original paper used a trailing </w> — so the model can tell "in" inside a word from "in" as a whole word.
Count pairs. Across the whole (word-frequency-weighted) corpus, count every adjacent symbol pair.
Merge the most frequent pair. Take the highest-count pair, say ("e", "s"), and add a new symbol "es" to the vocabulary. Rewrite every occurrence of that pair as the new symbol. Record the merge in an ordered list.
Repeat until the vocabulary reaches the target size (e.g. 50,000).

The output is two artifacts: a vocabulary (every base symbol plus every merged symbol) and an ordered merge list (the rule "merge A and B into AB," ranked by when it was learned). Encoding new text replays that merge list greedily.

The precise mechanism and its complexity

Frequent pairs become tokens; rare strings stay shredded. After training on English, you'll find " the", "ing", and "tion" as single tokens because they appear constantly, while a word like "antidisestablishmentarianism" splits into a handful of common chunks. The vocabulary is, in effect, a frequency-sorted dictionary of the most reusable text fragments.

Training cost. Let N be the number of symbol positions in the corpus and V the number of merges (target vocab minus base alphabet). A naive implementation re-scans the corpus to recount all pairs after every merge: O(N) per merge × V merges = O(N·V). That's why training a 50k-merge tokenizer on a multi-gigabyte corpus the naive way is slow. Production trainers (Hugging Face tokenizers, SentencePiece) keep a priority queue of pair counts and only update the pairs touched by each merge, bringing the practical cost close to O(N log V).

Encoding cost. Encoding is per-word and independent of corpus size. For a word of M symbols, the textbook approach repeatedly finds the lowest-rank applicable merge and applies it — O(M) merges, each an O(M) scan, so O(M²) worst case. With a min-heap keyed on merge rank it drops to O(M log M). Since M (symbols per word) is tiny, encoding is effectively linear in text length.

The encoding rule is the subtle part. You do not merge by frequency in the new text. You apply the learned merges in the order they were learned. At each step, scan the current symbol sequence, find the adjacent pair whose merge has the lowest rank (was learned earliest), apply it, and repeat until no pair in the sequence appears in the merge list. This guarantees that the same merge list always produces the same canonical tokenization.

When to choose BPE

Open-vocabulary language modeling. Any LLM that must handle arbitrary text — code, URLs, emoji, multiple languages — wants a tokenizer with no true OOV. Byte-level BPE delivers exactly that.
You want a fixed, predictable vocabulary size. BPE lets you dial the vocabulary to a precise budget (the embedding table and softmax output layer scale with it), trading sequence length against parameter count.
Multilingual or morphologically rich text. Subwords share pieces across related words ("nation", "national", "nationalize") and across languages with shared scripts, so the model generalizes from fewer examples.
Speed matters. Encoding is a deterministic greedy replay — no neural network in the loop, unlike a learned segmenter.

Reach for something else when your domain is narrow and fixed (a closed protocol grammar wants a hand-written lexer), when you need provably reversible lossless compression (use Huffman coding or an arithmetic coder, not a tokenizer), or when token boundaries must respect linguistics exactly (a morphological analyzer beats statistical merges).

BPE vs other tokenizers and coders

	BPE	WordPiece	Unigram (SentencePiece)	Word-level	Character-level	Huffman / LZW
Merge criterion	Highest raw pair frequency	Highest likelihood gain	Prune to maximize corpus likelihood	none (whole words)	none (single chars)	Code length / dictionary
Out-of-vocabulary	None (byte fallback)	None (byte/char fallback)	None	Everything unseen → <unk>	None	N/A (lossless bytes)
Vocabulary size	Tunable (~30k–100k)	Tunable (~30k)	Tunable	Millions	~256 bytes	Built from input
Tokens per English word	~1.3	~1.3	~1.3	1	~5 (chars)	N/A
Determinism on encode	Deterministic greedy	Greedy longest-match	Probabilistic (or Viterbi)	Lookup	Trivial	Deterministic
Used by	GPT-2/3/4, RoBERTa, LLaMA	BERT, DistilBERT	T5, ALBERT, mBART	early word2vec	some char-RNNs	gzip, ZIP, GIF
Purpose	Model input tokens	Model input tokens	Model input tokens	Model input tokens	Model input tokens	Bit-exact storage

The headline distinction: BPE, WordPiece, and Unigram all produce subword tokens for a model, but they disagree on what makes a good merge. BPE optimizes raw co-occurrence; WordPiece optimizes corpus likelihood; Unigram starts from a big vocabulary and prunes the least useful pieces. Huffman and LZW share BPE's "find frequent patterns" instinct but aim for bit-exact lossless storage, not model-friendly chunks — BPE for NLP doesn't care about decompressing to a smaller byte count, only about a compact, reusable token set.

What the numbers actually say

GPT-2 ships 50,257 tokens. That's 256 base bytes + 50,000 learned merges + 1 special end-of-text token. GPT-3 reuses it; GPT-4's cl100k_base roughly doubles it to ~100,277 to cut token counts on code and non-English text.
Roughly 4 characters per token on English prose — about 0.75 words per token, by OpenAI's own rule of thumb. So a 1,000-word essay is ≈ 1,333 tokens. A larger vocabulary buys fewer tokens per character: cl100k_base encodes the same English ~10-15% more compactly than GPT-2's vocab, and Python source far more compactly still.
Non-Latin scripts pay a tax. Because byte-level BPE falls back to UTF-8 bytes, a single CJK character (3 bytes in UTF-8) that wasn't merged can cost up to 3 tokens. Early GPT-2 spent several tokens per Chinese character; cl100k_base added merges to soften this, but the gap with English remains.
Training is the cheap part. Learning a 50k-merge vocabulary over a few GB of text takes minutes-to-hours on one machine with a heap-based trainer — trivial next to the GPU-months of training the model itself. The tokenizer is frozen before model training starts.

JavaScript implementation

A compact, readable trainer plus encoder. It keeps the merge list ordered (rank = position in the list) so encoding can replay merges by learned priority. Words are space-split and frequency-weighted, exactly as the original paper does.

// ---- TRAIN: learn an ordered list of merges ----
function trainBPE(corpus, numMerges) {
  // word frequency table; split each word into chars + end-of-word marker
  const wordFreq = new Map();
  for (const w of corpus.trim().split(/\s+/)) {
    wordFreq.set(w, (wordFreq.get(w) || 0) + 1);
  }
  // each word becomes an array of symbols, e.g. "low" -> ["l","o","w","</w>"]
  let words = [...wordFreq].map(([w, f]) => ({
    symbols: [...w, "</w>"],
    freq: f,
  }));

  const merges = []; // ordered: merges[i] learned at rank i

  for (let step = 0; step < numMerges; step++) {
    // 1. count adjacent pairs, weighted by word frequency
    const pairs = new Map();
    for (const { symbols, freq } of words) {
      for (let i = 0; i < symbols.length - 1; i++) {
        const key = symbols[i] + " " + symbols[i + 1];
        pairs.set(key, (pairs.get(key) || 0) + freq);
      }
    }
    if (pairs.size === 0) break;

    // 2. pick the most frequent pair (lexicographic tie-break for determinism)
    let best = null, bestCount = -1;
    for (const [key, count] of pairs) {
      if (count > bestCount || (count === bestCount && key < best)) {
        best = key; bestCount = count;
      }
    }
    const [a, b] = best.split(" ");
    merges.push([a, b]);

    // 3. rewrite every word, fusing a+b into the new symbol
    const merged = a + b;
    for (const word of words) {
      const out = [];
      const s = word.symbols;
      for (let i = 0; i < s.length; i++) {
        if (i < s.length - 1 && s[i] === a && s[i + 1] === b) {
          out.push(merged); i++;            // skip the consumed symbol
        } else out.push(s[i]);
      }
      word.symbols = out;
    }
  }
  return merges;
}

// ---- ENCODE: replay merges by learned rank ----
function encode(word, merges) {
  const rank = new Map(merges.map(([a, b], i) => [a + " " + b, i]));
  let symbols = [...word, "</w>"];

  while (symbols.length > 1) {
    // find the adjacent pair with the lowest (earliest) learned rank
    let bestRank = Infinity, bestIdx = -1;
    for (let i = 0; i < symbols.length - 1; i++) {
      const r = rank.get(symbols[i] + " " + symbols[i + 1]);
      if (r !== undefined && r < bestRank) { bestRank = r; bestIdx = i; }
    }
    if (bestIdx === -1) break;              // no learnable pair remains

    symbols = [
      ...symbols.slice(0, bestIdx),
      symbols[bestIdx] + symbols[bestIdx + 1],
      ...symbols.slice(bestIdx + 2),
    ];
  }
  return symbols;
}

const merges = trainBPE("low low low low low lower lower newest newest widest", 8);
console.log(encode("lowest", merges)); // e.g. ["low","est</w>"] depending on merges

Two details mirror real tokenizers. First, training picks by frequency but encoding picks by rank — these are different orderings, and conflating them is the classic bug. Second, the end-of-word marker keeps "est" at the end of a word distinct from "est" in the middle, which is exactly the information a model needs.

Python implementation

The same two phases in Python, plus the canonical "tokens per character" measurement that practitioners actually report.

from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    word_freq = Counter(corpus.split())
    # each word -> tuple of symbols, with end-of-word marker
    words = {tuple(list(w) + ["</w>"]): f for w, f in word_freq.items()}
    merges = []                       # ordered list of (a, b)

    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # most frequent pair; tie-break lexicographically for determinism
        best = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))[0]
        merges.append(best)

        a, b = best
        new_words = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                    out.append(a + b); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_words[tuple(out)] = freq
        words = new_words
    return merges


def encode(word: str, merges):
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word) + ["</w>"]
    while len(symbols) > 1:
        # adjacent pair with the lowest learned rank
        candidates = [
            (rank[(symbols[i], symbols[i + 1])], i)
            for i in range(len(symbols) - 1)
            if (symbols[i], symbols[i + 1]) in rank
        ]
        if not candidates:
            break
        _, i = min(candidates)
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]
    return symbols


corpus = "low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3
merges = train_bpe(corpus, num_merges=10)
print(encode("lowest", merges))

# tokens-per-character — the metric people actually quote
text = "the quick brown fox jumps over the lazy dog"
toks = sum(len(encode(w, merges)) for w in text.split())
print(f"{len(text) / toks:.2f} chars/token")

In production you would not roll your own — use OpenAI's tiktoken (enc = tiktoken.get_encoding("cl100k_base"); enc.encode(text)) or Hugging Face tokenizers, both of which are Rust-backed and add byte-level handling, regex pre-tokenization, and special tokens. The code above exists to make the loop legible, not to be fast.

Variants worth knowing

Byte-level BPE. GPT-2's key change: the base alphabet is the 256 raw bytes, not Unicode characters. Every conceivable input is some sequence of bytes, so there is no OOV and no need for an <unk> token at all. A clever byte-to-printable-character map keeps the bytes representable as visible glyphs during training.

WordPiece. BERT's tokenizer. Instead of merging the most frequent pair, it merges the pair that most increases the training-corpus likelihood under a unigram model — practically, the pair maximizing count(AB) / (count(A)·count(B)). It marks word-internal pieces with a ## prefix and encodes by greedy longest-match.

Unigram language model (SentencePiece). Kudo's 2018 alternative goes top-down: start with a large candidate vocabulary, then iteratively prune the pieces whose removal least hurts corpus likelihood. It yields a probability for each segmentation, enabling subword regularization — sampling different tokenizations of the same text during training to make models more robust. T5 and ALBERT use it.

BPE-dropout. A regularizer for BPE itself: during training, randomly skip some merges so the same word tokenizes differently across epochs, exposing the model to more subword variety without changing the learned vocabulary.

SentencePiece's whitespace handling. Rather than pre-splitting on spaces, SentencePiece treats the input as a raw stream and encodes spaces as a visible marker (▁), making tokenization fully reversible and language-agnostic — important for scripts like Chinese and Japanese that don't delimit words with spaces.

Common bugs and edge cases

Merging by frequency at encode time. Encoding must replay merges by their learned rank, not by frequency in the new text. Get this wrong and you produce non-canonical tokenizations the model never saw in training.
Dropping the word-boundary marker. Without an end-of-word (or leading-space) marker, the tokenizer can't distinguish "ing" as a suffix from "ing" mid-word, and merges leak across word boundaries that should never join.
Unstable tie-breaking. When two pairs tie on frequency, you must break the tie deterministically (lexicographic, or first-seen). Otherwise the merge list — and every downstream model — changes between runs.
Forgetting the byte fallback. A character-level vocabulary that hasn't seen some Unicode character will choke or emit <unk>. Byte-level BPE avoids this by construction; if you implement character-level, you still need a defined fallback.
Naive O(N·V) training on big corpora. Re-counting all pairs after every merge is fine for a tutorial and intolerable for production. Maintain incremental pair counts and only update pairs adjacent to a merge.
Counting token cost as character cost. A 100-character prompt is not 100 tokens — and not a fixed ratio either. Numbers, code, and CJK text blow up the token count far past the ~4-chars-per-token English heuristic, which matters when you're paying per token.

Frequently asked questions

Why does BPE solve the out-of-vocabulary problem?

A word-level vocabulary maps any unseen word to a single <unk> token, throwing away all information. BPE's vocabulary always includes the 256 single bytes (byte-level) or every base character, so any string — a typo, a new product name, an emoji — can be tokenized as a fallback sequence of smaller pieces. Nothing is ever truly unknown.

What is the difference between BPE and WordPiece?

Both build subword vocabularies by merging, but BPE merges the pair with the highest raw frequency, while WordPiece (used by BERT) merges the pair that most increases the likelihood of the training corpus under a unigram language model — roughly, the pair with the highest count(AB) / (count(A)·count(B)). BPE is simpler and slightly faster to train; WordPiece tends to favor linguistically meaningful merges.

How many tokens does GPT use for a typical English word?

On English prose, OpenAI's tiktoken averages about 0.75 words per token — roughly 4 characters per token. Common words are a single token; rare or compound words split into 2-4 pieces. Code, non-Latin scripts, and long numbers cost far more tokens per character because their byte patterns are less frequent in the merge table.

Is BPE training deterministic?

Yes, given a fixed corpus, a fixed target vocabulary size, and a fixed tie-breaking rule. The only non-determinism is how you break ties when two pairs have the same frequency; libraries pin this (e.g. lexicographically) so the same corpus always yields the same ordered merge list. Encoding new text with that merge list is fully deterministic and greedy.

Why is byte-level BPE used instead of character-level?

Character-level BPE needs a base alphabet of every Unicode character it might see — over 140,000 code points — and still breaks on bytes it never trained on. Byte-level BPE (GPT-2 onward) uses the 256 raw UTF-8 bytes as its base alphabet, so the entire universe of possible inputs is covered by a tiny, fixed 256-symbol floor. No input is ever unrepresentable.

Does the order of merges matter when encoding?

Yes — encoding applies merges in the exact order they were learned, not by frequency in the new text. At each step you find the pair in the current sequence whose merge has the lowest learned rank (was learned earliest) and apply it, repeating until no learnable pair remains. Applying merges in the wrong order produces a different, non-canonical tokenization.