Machine Learning

Beam Search

Hedge your bets — keep the top k sequences, not just the best one

Beam search is a heuristic decoding algorithm that keeps the k highest-scoring partial sequences at each step instead of committing greedily to one, trading O(k·V) work per step for far better final sequences.

Per-step costO(k·V)
Total cost (length L)O(L·k·V)
MemoryO(k·L)
Greedy isk = 1
Optimal?No — heuristic

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The greedy trap, and how beam search escapes it

Imagine a translation model decoding one word at a time. At step one the most likely word is "The" with probability 0.6, and "A" with 0.4. Greedy decoding commits to "The" and never looks back. But suppose the only good sentence the model can complete starts with "A" — greedy has already painted itself into a corner. A locally optimal first choice led to a globally bad sentence. This is the central failure mode of greedy decoding: it makes an irreversible decision at every step using only one step of look-ahead.

Beam search hedges. Instead of keeping one running hypothesis, it keeps the k best partial sequences — the beams — alive at every step, where k is the beam width. It expands all k of them by every possible next token, scores the resulting candidates, and prunes back down to the top k. A first choice that looked mediocre survives long enough to prove itself if its continuations score well. Greedy is just beam search with k = 1; exhaustive search is beam search with k = ∞. Beam search lives on the dial between them.

The algorithm has been a workhorse of sequence prediction since the speech-recognition systems of the 1970s, and it powered essentially every neural machine translation and image-captioning system before sampling-based decoding (top-k, nucleus) took over for open-ended generation. It is still the default for translation, summarization, and speech-to-text, where you want the single most plausible output rather than a diverse one.

How beam search works, step by step

Decoding scores a sequence by the sum of per-token log-probabilities (we use logs because the product of many probabilities underflows to zero in floating point). A beam is a (sequence, score) pair. The loop is:

Initialize with a single beam: the empty sequence (or just the start token) with score 0.
Expand. For each of the k active beams, run the model to get a distribution over the vocabulary V, and form k·V candidate continuations by appending every token and adding its log-probability to the beam's score.
Prune. Keep only the top k candidates by score; discard the rest. These become the active beams for the next step.
Retire finished beams. Any beam that just emitted the end-of-sequence token (<eos>) is moved to a finished list and stops being expanded; the search shrinks to fill its slot.
Stop when k finished hypotheses are collected or a maximum length is reached, then return the highest-scoring finished beam (after length normalization).

The key insight is the prune-then-expand discipline. At every step exactly k hypotheses survive, regardless of how the tree branches, so memory stays bounded at O(k·L) for sequences of length L — you never materialize the full V^L tree.

The math and complexity bounds

A sequence y = (y₁, …, y_L) conditioned on input x is scored by its log-likelihood:

score(y) = Σ_{t=1..L} log P(y_t | y_1, …, y_{t-1}, x)

Exact decoding would maximize this over all V^L possible sequences — astronomically infeasible (a vocabulary of 50,000 and length 20 gives 50000²⁰ ≈ 10⁹⁴ sequences). Beam search approximates the argmax with bounded work:

Per step: scoring k beams over vocabulary V costs O(k·V) to enumerate candidates, plus O(k·V·log k) to select the top k (a partial sort or a size-k min-heap). The model forward passes for the k beams are the real bottleneck in practice, batched together.
Total over length L: O(L·k·V) — exactly k times the cost of greedy (which is O(L·V)), and exponentially cheaper than the O(V^L) exhaustive search.
Memory: O(k·L) to store the k sequences (plus the model's per-beam KV cache in a transformer, which dominates in practice).

Because each log P term is negative, total score decreases monotonically with length. Beam search therefore has a structural bias toward short sequences. The standard fix is length normalization, dividing by a length penalty:

lp(y) = (5 + |y|)^α / (5 + 1)^α          # Google GNMT, α ≈ 0.6
normalized_score(y) = score(y) / lp(y)

With α = 0 there's no penalty (short bias intact); α = 1 divides by raw length (a plain per-token average). Most systems tune α between 0.6 and 1.0 on a validation set.

When to use beam search — and when not to

Use it for tasks with one "correct" output: machine translation, speech recognition, grammatical error correction, code generation with a target, and abstractive summarization. You want the most plausible sequence, and a width of 4–10 reliably beats greedy.
Avoid it for open-ended generation — story writing, chit-chat, brainstorming. Beam search produces bland, repetitive, "safe" text because the highest-probability continuations are generic. Use temperature sampling, top-k, or nucleus (top-p) sampling instead, which deliberately inject diversity.
Watch the width. Beyond k ≈ 5–10, returns diminish and translation quality can actually drop (see the costed claims below). Bigger beams are not free wins.
Beware exposure to model bias. Beam search faithfully surfaces whatever the model thinks is most probable. If the model is miscalibrated (e.g. assigns high probability to the empty string, a real bug in early NMT), more search makes the output worse, not better.

Beam search vs other decoding strategies

	Greedy (k=1)	Beam search	Exhaustive / A*	Top-k sampling	Nucleus (top-p)
Time per step	O(V)	O(k·V)	O(V^L) / heuristic	O(V)	O(V)
Memory	O(L)	O(k·L)	exponential / frontier	O(L)	O(L)
Finds global optimum?	No	No (heuristic)	Yes	No (random)	No (random)
Output determinism	Deterministic	Deterministic	Deterministic	Stochastic	Stochastic
Output diversity	None	Low	None	High	High (adaptive)
Best for	Speed, debugging	Translation, ASR, summarization	Provable optimality (small spaces)	Creative text	Open-ended chat

The headline trade-off: beam search buys a large quality gain over greedy for a constant k× compute cost, while remaining deterministic. Sampling methods trade determinism and the "most plausible" guarantee for diversity. A* search would give the true optimum but needs an admissible heuristic that bounds the best achievable future score — rarely available for neural language models, so beam search is the practical compromise.

What the numbers actually say

The search space is the whole point. A vocabulary of 50,000 over length 20 is 50000²⁰ ≈ 10⁹⁴ candidate sequences. Beam search with k = 5 evaluates 5 × 50000 = 250,000 candidates per step, about 5 million model-scored continuations total — a factor of roughly 10⁸⁸ less work than exhaustive search.
Greedy → beam is a real quality jump. Across WMT machine-translation benchmarks, moving from greedy (k=1) to a beam of 4–5 typically gains on the order of 1–2 BLEU points — a difference human evaluators notice — at exactly 4–5× the decode cost.
The beam search curse is real. Koehn and Knowles (2017) showed that for NMT, BLEU peaks around beam width 4–8 and then declines as the beam grows to 100+; the model's truly-most-probable sequences are often degenerately short. Larger beams expose model bias rather than fixing it.
Length normalization matters a lot. Google's GNMT system (Wu et al., 2016) reported that adding length normalization with α = 0.6 plus a coverage penalty improved translation quality measurably over an unnormalized beam — without it, beams collapse to short outputs.
Cost is linear in width. Doubling the beam roughly doubles decode latency and KV-cache memory in a transformer, because you run the model on k sequences per step. There is no sub-linear shortcut.

JavaScript implementation

A self-contained beam search over a toy model. nextLogProbs(seq) returns the model's log-probabilities over the vocabulary for the next token; here it's stubbed, but in production it is a neural network forward pass.

// EOS token id; vocabulary is 0..V-1.
const EOS = 0;

function beamSearch(nextLogProbs, { beamWidth = 4, maxLen = 20, alpha = 0.6 } = {}) {
  // A beam: { seq: number[], score: number (cumulative log-prob), done: bool }
  let beams = [{ seq: [], score: 0, done: false }];
  const finished = [];

  for (let step = 0; step < maxLen; step++) {
    const candidates = [];

    for (const beam of beams) {
      if (beam.done) { finished.push(beam); continue; }
      const logp = nextLogProbs(beam.seq);          // Float array, length V
      for (let token = 0; token < logp.length; token++) {
        candidates.push({
          seq: [...beam.seq, token],
          score: beam.score + logp[token],          // log-probs ADD
          done: token === EOS,
        });
      }
    }
    if (candidates.length === 0) break;

    // Prune: keep the top-k candidates by raw cumulative score.
    candidates.sort((a, b) => b.score - a.score);
    beams = candidates.slice(0, beamWidth);

    // Stop early if every surviving beam has finished (collected once, below).
    if (beams.every(b => b.done)) break;
  }
  // Collect any finished beams still sitting in the active set (counted once).
  finished.push(...beams.filter(b => b.done));

  // Length-normalize, then return the best finished hypothesis.
  const pool = finished.length ? finished : beams;
  const lp = (len) => Math.pow((5 + len) / 6, alpha);
  pool.sort((a, b) => (b.score / lp(b.seq.length)) - (a.score / lp(a.seq.length)));
  return pool[0];
}

Two details worth flagging. First, scores add because we work in log-space — never multiply raw probabilities, or you'll underflow. Second, the final ranking uses the length-normalized score even though pruning during the loop uses the raw score; mixing those up is a classic bug.

Python implementation

The same algorithm in Python, using heapq.nlargest so we never fully sort the k·V candidate list — selecting the top k is O(k·V·log k) instead of O(k·V·log(k·V)).

import heapq
import math
from dataclasses import dataclass, field

EOS = 0

@dataclass(order=True)
class Beam:
    score: float
    seq: tuple = field(compare=False, default=())
    done: bool = field(compare=False, default=False)

def beam_search(next_log_probs, beam_width=4, max_len=20, alpha=0.6):
    beams = [Beam(0.0, (), False)]
    finished = []

    for _ in range(max_len):
        candidates = []
        for beam in beams:
            if beam.done:
                finished.append(beam)
                continue
            logp = next_log_probs(beam.seq)          # sequence of length V
            for token, lp in enumerate(logp):
                candidates.append(Beam(
                    score=beam.score + lp,            # log-probs add
                    seq=beam.seq + (token,),
                    done=(token == EOS),
                ))
        if not candidates:
            break

        # Prune to the top-k by raw cumulative score (O(kV log k)).
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b.score)
        if all(b.done for b in beams):       # all survivors finished — stop early
            break

    # Collect any finished beams still in the active set (counted once).
    finished.extend(b for b in beams if b.done)
    pool = finished or beams

    def norm(b):                                     # length normalization
        lp = ((5 + len(b.seq)) / 6) ** alpha
        return b.score / lp

    return max(pool, key=norm)

Note the asymmetry: we prune by raw score inside the loop (cheap, comparable across equal-length candidates at the same step) but rank the final pool by the length-normalized score so that a long, fluent hypothesis can beat a short, terse one.

Variants worth knowing

Diverse beam search (Vijayakumar et al., 2016). Standard beams clump together — the top 5 often share a prefix and differ by one synonym. Diverse beam search splits the budget into groups and adds a penalty for tokens already chosen by earlier groups, yielding genuinely different outputs. Essential when you want k distinct translations or captions.

Length-normalized and coverage-penalized beam search. Google's GNMT adds, on top of length normalization, a coverage penalty that rewards attending to every source word — discouraging the model from ignoring part of the input. Both penalties are applied to the final ranking.

Constrained beam search. Forces certain tokens or phrases to appear (e.g. a required terminology term in translation, or a forced JSON schema). Grid beam search and disjunctive constraints implement this by tracking which constraints each beam has satisfied, expanding the state to (sequence, constraints-met).

Best-first / A* beam search. Replace the fixed-width frontier with a priority queue ordered by an admissible cost-to-go heuristic. When the heuristic is good this finds the optimum while exploring far fewer nodes; when it's missing (the usual case for neural LMs) it degrades to plain beam search.

Stochastic beam search. Samples without replacement from the sequence model using the Gumbel-top-k trick, giving you k samples that are also a valid set — bridging the gap between deterministic beam search and pure sampling.

Common bugs and edge cases

Working in probability space instead of log space. Multiplying 20 probabilities of ~0.1 underflows float to 0 and every beam ties at score 0. Always sum log-probabilities.
Forgetting length normalization. Without it, beam search almost always returns the shortest legal sequence (often just <eos>), because adding any token only lowers the score.
Expanding finished beams. A beam that emitted <eos> must be retired, not re-expanded; otherwise it keeps "growing" past the sentence end and corrupts the ranking.
Pruning by normalized score mid-loop. Length-normalize only for the final selection. Normalizing during pruning lets very short partial beams dominate and starves longer hypotheses before they can finish.
Stopping too early. Halting at the first finished beam can miss a better hypothesis still being built. Either collect k finished beams, or use the rule: stop only once no active beam can beat the best finished one.
Assuming a bigger beam is always better. Past width ~5–10 you can lose quality (the beam search curse) while paying linearly more compute. Tune k on a validation set, don't crank it.
Duplicate hypotheses. Two different beams can converge to the same sequence after a merge; deduplicate the finished list before ranking, or you'll over-count one candidate.

Frequently asked questions

Is beam search guaranteed to find the most probable sequence?

No. Beam search is a heuristic, not an exact search. It only keeps the top-k partial sequences at each step, so the globally most-probable sequence can be pruned early if its prefix scores poorly. Only an exhaustive search over all V^L sequences — or A* with an admissible heuristic — guarantees the true optimum.

Why does a larger beam width sometimes produce worse text?

This is the beam search curse, documented by Koehn and Knowles in 2017 for machine translation: as beam width grows past about 5 to 10, BLEU often drops. Larger beams find higher-probability sequences, but the model's most probable sequences are frequently short and generic (often the empty string), so more search exposes that bias rather than fixing it.

What is length normalization in beam search?

Because each token's log-probability is negative, longer sequences accumulate lower total scores and beam search systematically favors short outputs. Length normalization divides the cumulative log-probability by length^alpha (Google's GNMT used alpha = 0.6) so that long and short candidates compete fairly.

What is the time complexity of beam search?

Per step, beam search expands k beams over a vocabulary of size V and selects the top-k of the k·V candidates, giving O(k·V) scoring plus O(k·V·log k) for the selection. Over a sequence of length L the total is O(L·k·V) — exactly k times the cost of greedy decoding, and exponentially cheaper than the O(V^L) exhaustive search.

When is greedy decoding the same as beam search?

Beam search with beam width k = 1 is exactly greedy decoding — you keep a single best partial sequence and extend it by its single most likely next token. Greedy is the special case of beam search at the cheapest, most myopic end of the spectrum.

How does beam search decide when to stop?

When a beam emits the end-of-sequence token it is moved to a finished list and no longer expanded. Search continues until k complete hypotheses are collected or a maximum length is hit. A common early-stopping rule halts once the best finished hypothesis can no longer be beaten by any still-active beam.