Algorithms

The Viterbi Algorithm

The one path through the maze that the noise was hiding

The Viterbi algorithm is a dynamic-programming method that finds the single most likely sequence of hidden states in a hidden Markov model, running in O(T·N²) time instead of the O(N^T) cost of brute force.

SolvesHMM decoding (best path)
TimeO(T·N²)
SpaceO(T·N)
Brute force costO(N^T)
PublishedAndrew Viterbi, 1967

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The intuition: one best path, not a fog of guesses

You can't see the weather, but you can see whether your friend went for a walk, shopped, or cleaned indoors each day. Hidden behind those activities is a sequence of hidden states — Sunny or Rainy — that you never observe directly. The Viterbi algorithm answers a precise question: across all the ways the weather could have unfolded, which single sequence of days best explains the activities you saw?

This is the decoding problem for a Markov process whose states are hidden — a hidden Markov model (HMM). An HMM has three pieces: a set of hidden states, transition probabilities between them, and emission probabilities that tie each state to an observable symbol. Given a string of observations, decoding finds the most probable underlying state path.

The naive approach is to enumerate every possible state path, score each by multiplying its transition and emission probabilities, and keep the best. For T observations over N states that's N^T paths — for a 100-symbol sequence over just 5 states, that's 5¹⁰⁰ ≈ 10⁷⁰ paths, more than the number of atoms in the Milky Way. Viterbi gets the exact same answer in a few thousand operations. The trick is the same one behind dynamic programming everywhere: the best path ending in a given state at time t only depends on the best paths ending in each state at time t−1, so you never re-explore a prefix twice.

How it works: the trellis and the recurrence

Lay the computation out as a trellis: T columns (one per observation) and N rows (one per state). Cell δ[t][j] holds the probability of the most likely path that ends in state j after emitting the first t+1 observations. A parallel table ψ[t][j] (psi) stores the back-pointer: which state at t−1 the best path came from.

Three phases:

1. Initialization (t = 0). For each state j, the best path of length one is just starting in j and emitting the first observation:

δ[0][j] = π[j] · b[j][O₀]

where π is the initial-state distribution and b[j][O₀] is the probability state j emits the first symbol.

2. Recursion (t = 1 … T−1). The best path into state j at time t is the best path into some state i at t−1, extended by the i → j transition and the emission of O_t. Take the maximum over all predecessors i:

δ[t][j] = max over i ( δ[t-1][i] · a[i][j] ) · b[j][Oₜ]
ψ[t][j] = argmax over i ( δ[t-1][i] · a[i][j] )

The max is the whole game. The forward algorithm uses sum here instead — that single swap is the only structural difference between computing the best path and computing total observation probability.

3. Termination and backtrace. The probability of the best overall path is max over j of δ[T-1][j]. To recover the path itself, start from that winning final state and follow the ψ back-pointers leftward to time 0. Reverse the collected states and you have the decoded sequence.

Filling the table is O(T·N²): there are T·N cells, and each cell maximizes over N predecessors. The back-pointer table costs O(T·N) space. Because the recurrence is identical to a shortest-path computation on a layered graph, Viterbi is exactly shortest path on the trellis with edge weights of −log(probability) — minimizing summed negative-log weights is the same as maximizing the product.

Why production code lives in log space

Each step multiplies a transition probability by an emission probability, both below 1. Over a 300-symbol speech frame sequence you'd multiply 300 numbers near 0.01 together; the result underflows a 64-bit double (smallest normal ≈ 2.2 × 10⁻³⁰⁸) and silently becomes 0.0. Every path then ties at zero and the back-trace is garbage.

The fix is to work with log-probabilities. Logs turn products into sums, log(a·b) = log(a) + log(b), and because log is monotonic the argmax is unchanged. The recurrence becomes purely additive:

δlog[t][j] = max over i ( δlog[t-1][i] + log a[i][j] ) + log b[j][Oₜ]

A zero probability maps to log(0) = −∞, which correctly poisons any path through an impossible transition. This also makes Viterbi cheaper: additions are faster than multiplications, and you only call log once per parameter at setup, not in the hot loop.

When to reach for Viterbi

You need the single best explanation, not a distribution. Decode the most likely word sequence, tag sequence, or transmitted bitstream — one answer.
Your model is a first-order HMM (or a chain CRF). The future state depends only on the current one; that Markov property is what makes the DP exact.
The state space is small enough that N² is cheap. A few dozen to a few hundred states is comfortable; tens of thousands needs beam search.
You want an exact, deterministic result. Unlike sampling methods, Viterbi returns the provable global optimum for the model you gave it.

Reach for something else when the per-step marginal matters more than the joint path (use forward-backward / posterior decoding), when the state space is enormous (use beam search and accept approximation), or when dependencies reach back more than one step in a way you can't fold into the state.

Viterbi vs the other HMM algorithms

	Viterbi	Forward	Forward-Backward	Beam search	Brute force
Answers	Best single path	P(observations)	Per-step posteriors	Approx. best path	Best single path
Operator over predecessors	max	sum	sum (both ways)	max over kept beam	—
Time	O(T·N²)	O(T·N²)	O(T·N²)	O(T·N·B)	O(N^T)
Space (path recovery)	O(T·N)	O(N)	O(T·N)	O(T·B)	O(T)
Exact?	Yes	Yes	Yes	No (prunes)	Yes
Underflow risk	Yes → use logs	Yes → use scaling	Yes → use scaling	Yes → use logs	Yes
Typical use	Decoding, error-correction	Likelihood, model scoring	Baum-Welch training, posteriors	Large-vocab speech, NMT	Teaching only

Viterbi and forward are structurally twins — same trellis, same O(T·N²), differing only in max vs sum. Forward-backward runs a forward and a backward pass to get the probability that the chain was in each state at each time; it's the engine inside Baum-Welch (EM) for learning HMM parameters, whereas Viterbi assumes the parameters are already known.

What the numbers actually say

5¹⁰⁰ vs 250,000. A 100-observation sequence over 5 states: brute force enumerates 5¹⁰⁰ ≈ 7.9 × 10⁶⁹ paths; Viterbi does T·N² = 100 × 25 = 2,500 cell updates (≈250,000 floating-point ops with the inner max). The speedup is not a constant factor — it's the difference between exponential and polynomial.
Constraint-length-7 convolutional codes. The Viterbi decoder in GSM, 802.11, and satellite links uses a 64-state trellis (2⁶). At 2 bits per cell of back-pointer it still fits comfortably in on-chip SRAM, which is why it's been cast into hardware since the 1970s.
Voyager. NASA's Voyager 1 and 2 probes used convolutional coding with Viterbi decoding to claw bits back from a signal received at roughly 10⁻¹⁸ watts; the algorithm Andrew Viterbi published in 1967 is still decoding telemetry from interstellar space today.
Log space buys exactness for free. Switching from products to sums of logs adds zero asymptotic cost and removes underflow that would otherwise corrupt any sequence longer than a few hundred symbols.

JavaScript implementation

This version works in log space and returns both the decoded state path and its log-probability. states are indices 0…N−1; obs is an array of observation indices.

// logPi[j]      : log initial prob of state j
// logA[i][j]    : log transition prob i -> j
// logB[j][o]    : log emission prob of symbol o from state j
function viterbi(obs, logPi, logA, logB) {
  const T = obs.length, N = logPi.length;
  const delta = Array.from({ length: T }, () => new Float64Array(N));
  const psi   = Array.from({ length: T }, () => new Int32Array(N));

  // 1. Initialization
  for (let j = 0; j < N; j++) {
    delta[0][j] = logPi[j] + logB[j][obs[0]];
    psi[0][j] = 0;
  }

  // 2. Recursion
  for (let t = 1; t < T; t++) {
    for (let j = 0; j < N; j++) {
      let best = -Infinity, arg = 0;
      for (let i = 0; i < N; i++) {
        const score = delta[t - 1][i] + logA[i][j];
        if (score > best) { best = score; arg = i; }
      }
      delta[t][j] = best + logB[j][obs[t]];
      psi[t][j] = arg;
    }
  }

  // 3. Termination — find best final state
  let bestLogP = -Infinity, last = 0;
  for (let j = 0; j < N; j++) {
    if (delta[T - 1][j] > bestLogP) { bestLogP = delta[T - 1][j]; last = j; }
  }

  // 4. Backtrace through the psi pointers
  const path = new Array(T);
  path[T - 1] = last;
  for (let t = T - 2; t >= 0; t--) path[t] = psi[t + 1][path[t + 1]];

  return { path, logProb: bestLogP };
}

Two details worth flagging. The inner loop walks predecessors i and keeps a running max — no Math.max(...arr) spread, because we need the argmax too. And the back-pointer at psi[t+1][...] tells you the state at time t that fed the winner at t+1 — get the off-by-one index wrong here and you decode a path that's one step out of phase.

Python implementation

The same algorithm in idiomatic Python, plus the classic weather/HMM worked example so you can run it and see the decoded path.

import math

def viterbi(obs, log_pi, log_a, log_b):
    T, N = len(obs), len(log_pi)
    delta = [[float('-inf')] * N for _ in range(T)]
    psi   = [[0] * N for _ in range(T)]

    # 1. Initialization
    for j in range(N):
        delta[0][j] = log_pi[j] + log_b[j][obs[0]]

    # 2. Recursion
    for t in range(1, T):
        for j in range(N):
            best, arg = float('-inf'), 0
            for i in range(N):
                score = delta[t - 1][i] + log_a[i][j]
                if score > best:
                    best, arg = score, i
            delta[t][j] = best + log_b[j][obs[t]]
            psi[t][j] = arg

    # 3. Termination
    last = max(range(N), key=lambda j: delta[T - 1][j])
    best_logp = delta[T - 1][last]

    # 4. Backtrace
    path = [0] * T
    path[T - 1] = last
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, best_logp


# --- Worked example: the classic "weather" HMM ---
# states: 0 = Sunny, 1 = Rainy
# observations: 0 = walk, 1 = shop, 2 = clean
lg = math.log
log_pi = [lg(0.6), lg(0.4)]
log_a  = [[lg(0.7), lg(0.3)],    # Sunny -> Sunny / Rainy
          [lg(0.4), lg(0.6)]]    # Rainy -> Sunny / Rainy
log_b  = [[lg(0.6), lg(0.3), lg(0.1)],   # Sunny emits walk/shop/clean
          [lg(0.1), lg(0.4), lg(0.5)]]   # Rainy emits walk/shop/clean

path, logp = viterbi([0, 1, 2], log_pi, log_a, log_b)
print(path, round(math.exp(logp), 5))   # -> [0, 1, 1]  0.01296

Seeing walk, shop, clean, Viterbi decodes Sunny, Rainy, Rainy: the model strongly favours starting sunny (high start probability and a strong "walk" emission), but day two flips to Rainy because "shop" is emitted more readily from Rainy (0.4 vs 0.3) and that edge outweighs the cost of leaving Sunny, and day three stays Rainy on the heavy "clean" emission plus the Rainy→Rainy self-transition. Change the last observation to walk and the whole path flips to all-Sunny — a small input change can reroute the entire decode, which is exactly why the max-over-paths matters.

Variants worth knowing

List Viterbi (the k-best paths). Instead of one best path, return the top k most likely sequences. Used in speech and NLP to feed a downstream re-ranker that can apply richer (non-Markov) features the HMM couldn't.

Beam search. When N is huge (hundreds of thousands of HMM states in large-vocabulary speech), keep only the top B cells per column and prune the rest. This drops cost to O(T·N·B) but loses the exactness guarantee — a globally optimal path can be pruned early if it looks weak mid-sequence.

Soft-output Viterbi (SOVA) and BCJR. Error-correction wants not just the decoded bit but a confidence on it. SOVA augments Viterbi with reliability values; the BCJR / forward-backward algorithm computes exact per-bit posteriors and is the basis of the turbo-decoding that powers 3G/4G.

Online / streaming Viterbi. The back-trace normally needs the whole sequence, blocking real-time use. By keeping a sliding window, you can emit decisions once all surviving paths merge to a common prefix, trading a small latency for bounded memory — standard in hardware decoders.

Common bugs and edge cases

Multiplying raw probabilities. The single most common bug. Any sequence past a few hundred symbols underflows to 0.0 and the back-trace becomes meaningless. Work in log space.
Off-by-one in the back-pointer. The state for time t is psi[t+1][path[t+1]], not psi[t][...]. Getting this wrong yields a path shifted by one step that still looks plausible.
Forgetting the final emission has no successor. The last column gets an emission term but no transition out; the termination step just takes the max of the last δ column.
Confusing Viterbi with posterior decoding. The most likely path is not the sequence of per-step most likely states. Posterior decoding can output a path containing a zero-probability transition; Viterbi never can.
Assuming Viterbi trains the model. Viterbi assumes you already know π, A, and B. Learning those from data is Baum-Welch (forward-backward EM), a separate algorithm.
Ties handled inconsistently. When two predecessors give an equal score, your > vs >= choice silently decides the path. Pick one and document it, or two correct implementations will disagree on tied inputs.

Frequently asked questions

What problem does the Viterbi algorithm solve?

It solves the decoding problem for a hidden Markov model: given a sequence of observations, find the single most likely sequence of hidden states that produced them. It returns one optimal path, not a probability distribution over states.

What is the time complexity of the Viterbi algorithm?

O(T·N²) time and O(T·N) space, where T is the number of observations and N is the number of hidden states. Each of the T columns fills N cells, and each cell maximizes over N predecessors. Brute force over all paths would cost O(N^T).

How is Viterbi different from the forward algorithm?

Both fill the same trellis with the same recurrence shape, but Viterbi takes a max over predecessors while the forward algorithm takes a sum. Viterbi answers "what is the single best path?"; the forward algorithm answers "what is the total probability of the observations across all paths?"

Why does the Viterbi algorithm work in log space?

Multiplying hundreds of probabilities below 1 underflows to zero in floating point. Taking logs turns the products into sums of negative numbers, so the recurrence becomes additive, the max is unchanged, and underflow disappears. Production implementations almost always run in log space.

Does Viterbi give the most likely state at each step?

No. Viterbi gives the most likely whole path. Picking the per-step most likely state (posterior decoding via forward-backward) can produce a different sequence — and that sequence may even include a transition with zero probability, which Viterbi can never output.

Where is the Viterbi algorithm used in the real world?

It decodes convolutional error-correcting codes in nearly every cell phone, modem, and deep-space probe; it powers speech recognition, part-of-speech tagging, and DNA gene-finding. Andrew Viterbi published it in 1967; the same idea underlies GSM, 802.11, and Voyager telemetry.