Machine Learning

Hidden Markov Models

You never see the states — only the symbols they leak

A hidden Markov model (HMM) is a probabilistic model where an unobserved Markov chain of states emits observable symbols; the Viterbi algorithm recovers the most likely state sequence from the observations alone in O(N²·T) time.

  • Decoding (Viterbi)O(N²·T)
  • Likelihood (forward)O(N²·T)
  • Training (Baum-Welch)O(N²·T) per iteration
  • Brute-force decodeO(Nᵀ) — intractable
  • ParametersA, B, π

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a hidden Markov model works

Imagine a friend in another city tells you each day whether they walked, shopped, or cleaned. You never see the weather there, but you suspect it drives their choice — they walk on sunny days and clean on rainy ones. The weather is a hidden state that follows its own day-to-day rhythm; their activity is the observable symbol it leaks. From the string of activities alone, can you reconstruct the weather? That's exactly the problem a hidden Markov model solves.

An HMM has two layers. The bottom layer is an ordinary Markov chain over N hidden states — at each time step the state jumps to another according to a fixed transition matrix, and the next state depends only on the current one (the Markov property). The top layer is the catch: you can't see the states. Each state instead emits one of M observable symbols at random, governed by an emission distribution. You see only the symbols. The state path is hidden.

Formally, an HMM is the triple λ = (A, B, π):

  • A — the N×N transition matrix. A[i][j] = P(state j at t+1 | state i at t). Each row sums to 1.
  • B — the N×M emission matrix. B[j][k] = P(observe symbol k | in state j). Each row sums to 1.
  • π — the initial state distribution, length N, summing to 1.

Two independence assumptions make the whole machine tractable: the Markov assumption (the next state depends only on the current state, not the full history) and the output independence assumption (an emission depends only on the state that produced it, not on past emissions or states). These let every algorithm below factor a length-T joint probability into a product of one-step terms.

The three canonical problems

Rabiner's famous 1989 tutorial framed everything an HMM does as three questions, and almost every textbook still uses this scaffold:

  1. Evaluation — given λ and an observation sequence O, what is P(O | λ)? Solved by the forward algorithm. Used to score which of several trained models best explains new data.
  2. Decoding — given λ and O, what hidden state sequence Q most likely produced O? Solved by the Viterbi algorithm. This is the "infer the states from what you can see" problem.
  3. Learning — given only observation sequences (no labels), find the λ that maximizes P(O | λ). Solved by Baum-Welch, an Expectation-Maximization procedure.

The Viterbi algorithm, precisely

Naively, decoding means trying every possible hidden path. With N states and T time steps there are Nᵀ paths — for a modest 10-state model over a 100-symbol sequence that's 10¹⁰⁰ paths, more than the number of atoms in the observable universe. Brute force is hopeless.

Viterbi (Andrew Viterbi, 1967, originally for decoding convolutional codes) is dynamic programming over a trellis: a grid with one column per time step and one row per state. Define

δ[t][j] = the probability of the single most likely path that
          ends in state j after emitting the first t observations

The recurrence is the heart of it. The best path into state j at time t must pass through some state i at time t−1, so you maximize over that predecessor:

δ[t][j] = max over i of ( δ[t-1][i] · A[i][j] ) · B[j][O[t]]
ψ[t][j] = argmax over i of ( δ[t-1][i] · A[i][j] )    // backpointer

You fill the trellis left to right. The ψ backpointers remember which predecessor won each cell. When you reach the last column, pick the state with the highest δ and follow the backpointers right-to-left to reconstruct the whole path. Each of the N·T cells does an O(N) maximization, giving O(N²·T) time and O(N·T) space for the backpointers. The exponential collapses to a polynomial because at every cell you keep only the single best path into each state and throw the rest away — a dominated partial path can never become optimal later.

When to reach for an HMM

  • You can name the hidden states. Weather, phonemes, genomic regions (exon/intron), market regimes (bull/bear) — domains where a small, interpretable set of states is the right mental model.
  • Sequential data with local dependence. The next observation depends mostly on the current hidden state, not the deep past. HMMs are the natural fit when a first-order Markov assumption is roughly true.
  • Small data, or you need probabilities, not just labels. An HMM gives you calibrated likelihoods and works with hundreds of examples, where a neural net would overfit.
  • Interpretability and guarantees matter. Gene-finders, fault detectors, and signal-quality monitors need to justify their output; the trellis is auditable.

Reach for something else when dependencies span long ranges (use an LSTM or transformer), when you have abundant labeled data and want raw accuracy, or when the states resist a clean discrete interpretation. The first-order Markov assumption is a real limitation — yesterday matters, but the day before yesterday doesn't, which is often only approximately true.

HMM vs other sequence models

HMMMarkov chainCRFRNN / LSTMTransformer
States observable?No — hiddenYesHidden (labels)Latent vectorLatent vector
Probabilistic / generativeGenerativeGenerativeDiscriminativeUsually discriminativeDiscriminative / generative
Dependency range1st-order (current state)1st-orderWhole sequence (features)Long, decayingFull, attention-weighted
Decode costO(N²·T)O(N²·T) ViterbiO(T·d²)O(T²·d)
Trains without labels?Yes — Baum-Welchn/a (count states)No — needs labelsSelf-supervised possibleSelf-supervised pretraining
Data appetiteHundreds of sequencesTinyThousandsTens of thousands+Millions+
Interpretable statesYesYesPartly (weights)NoNo
Typical useGene-finding, POS tagging, speech (classic)PageRank, text genNER, sequence labelingTime series, classic NMTLLMs, modern speech

The defining contrast is HMM vs Markov chain: a Markov chain lets you watch the states directly, so there's nothing to infer. An HMM hides them and forces you to back them out from the emissions. The contrast with a conditional random field (CRF) is subtler — a CRF is the discriminative cousin that models P(states | observations) directly and can use arbitrary overlapping features, whereas an HMM is generative and models the joint P(states, observations).

What the numbers actually say

  • Exponential to polynomial. For N = 10, T = 100, brute-force decoding visits 10¹⁰⁰ paths; Viterbi visits N²·T = 10,000 cell-updates. That's the difference between "never finishes" and "microseconds."
  • Memory is the real ceiling. The backpointer table is N·T entries. A 50-state genomic HMM over a 3-billion-base chromosome would need 1.5×10¹¹ backpointers — so production gene-finders stream the sequence in windows rather than materializing the full trellis.
  • Log space is not optional. Multiplying 100 emission probabilities each around 0.1 gives 10⁻¹⁰⁰, which underflows a 64-bit double (smallest ≈ 2.2×10⁻³⁰⁸ before fully denormalizing, but the products vanish far sooner in practice). Summing log-probabilities sidesteps this entirely.
  • Baum-Welch cost. Each EM iteration is one forward and one backward pass, O(N²·T), times the number of training sequences. Convergence to a local optimum typically takes tens of iterations, so total cost is roughly O(iters · seqs · N²·T).

JavaScript implementation

Here is Viterbi decoding in log space — the version you'd actually ship. The classic "umbrella/weather" toy model is wired in so you can run it directly.

// HMM as {states, symbols, A, B, pi}; A,B,pi are plain probabilities.
function viterbi(hmm, obs) {
  const { states, symbols, A, B, pi } = hmm;
  const N = states.length, T = obs.length;
  const log = x => (x > 0 ? Math.log(x) : -Infinity);
  const sym = s => symbols.indexOf(s);            // map symbol -> column

  // delta[t][j] = log-prob of best path ending in state j at time t
  const delta = Array.from({ length: T }, () => new Array(N).fill(-Infinity));
  const psi   = Array.from({ length: T }, () => new Array(N).fill(0));

  // init at t = 0
  for (let j = 0; j < N; j++) delta[0][j] = log(pi[j]) + log(B[j][sym(obs[0])]);

  // recurrence
  for (let t = 1; t < T; t++) {
    for (let j = 0; j < N; j++) {
      let best = -Infinity, arg = 0;
      for (let i = 0; i < N; i++) {
        const score = delta[t - 1][i] + log(A[i][j]);
        if (score > best) { best = score; arg = i; }
      }
      delta[t][j] = best + log(B[j][sym(obs[t])]);
      psi[t][j] = arg;
    }
  }

  // termination + backtrace
  let last = 0, bestEnd = -Infinity;
  for (let j = 0; j < N; j++) if (delta[T - 1][j] > bestEnd) { bestEnd = delta[T - 1][j]; last = j; }
  const path = new Array(T);
  path[T - 1] = last;
  for (let t = T - 2; t >= 0; t--) path[t] = psi[t + 1][path[t + 1]];

  return { path: path.map(j => states[j]), logProb: bestEnd };
}

const hmm = {
  states:  ['Rainy', 'Sunny'],
  symbols: ['walk', 'shop', 'clean'],
  pi: [0.6, 0.4],
  A:  [[0.7, 0.3], [0.4, 0.6]],          // Rainy/Sunny transitions
  B:  [[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]],// emission of walk/shop/clean
};
console.log(viterbi(hmm, ['walk', 'shop', 'clean']).path);
// => ['Sunny', 'Rainy', 'Rainy']  — inferred weather from activities

Two details worth flagging. First, everything is in log space, so the products become sums and there's no underflow on long sequences. Second, the backpointer write happens in the same inner loop as the max — forgetting to store psi is the single most common Viterbi bug, because the code still runs and returns a probability; it just reconstructs the wrong path.

Python implementation

The same Viterbi decode, plus the forward algorithm for the evaluation problem so you can see how max (Viterbi) and sum (forward) differ by exactly one operator on the same trellis.

import math

def viterbi(states, symbols, A, B, pi, obs):
    N, T = len(states), len(obs)
    log = lambda x: math.log(x) if x > 0 else float('-inf')
    col = {s: k for k, s in enumerate(symbols)}

    delta = [[float('-inf')] * N for _ in range(T)]
    psi   = [[0] * N for _ in range(T)]

    for j in range(N):                                  # init
        delta[0][j] = log(pi[j]) + log(B[j][col[obs[0]]])

    for t in range(1, T):                               # recurrence
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t-1][i] + log(A[i][j]))
            delta[t][j] = delta[t-1][best_i] + log(A[best_i][j]) + log(B[j][col[obs[t]]])
            psi[t][j] = best_i

    last = max(range(N), key=lambda j: delta[T-1][j])    # termination
    path = [0] * T
    path[T-1] = last
    for t in range(T-2, -1, -1):                         # backtrace
        path[t] = psi[t+1][path[t+1]]
    return [states[j] for j in path], delta[T-1][last]

def forward(states, symbols, A, B, pi, obs):
    # P(O | lambda): same trellis, but SUM over predecessors instead of MAX
    N, T = len(states), len(obs)
    col = {s: k for k, s in enumerate(symbols)}
    alpha = [[0.0] * N for _ in range(T)]
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][col[obs[0]]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][col[obs[t]]]
    return sum(alpha[T-1])                               # total likelihood

states  = ['Rainy', 'Sunny']
symbols = ['walk', 'shop', 'clean']
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]]
obs = ['walk', 'shop', 'clean']

print(viterbi(states, symbols, A, B, pi, obs)[0])  # ['Sunny', 'Rainy', 'Rainy']
print(forward(states, symbols, A, B, pi, obs))     # P(O) ~= 0.0334

Notice the only structural difference between viterbi and forward is max versus sum over the predecessors. Viterbi asks "what is the best single explanation?"; forward asks "summing over all explanations, how likely is this data?" The forward version above runs in plain probability space for clarity, but a real implementation rescales each column or works in log space with a log-sum-exp to avoid underflow.

Variants worth knowing

Continuous-emission HMMs (GMM-HMM). When observations are real-valued vectors (audio frames, sensor readings) rather than discrete symbols, the emission matrix B is replaced by a probability density — classically a Gaussian mixture per state. This was the backbone of speech recognition for two decades before deep learning.

The forward-backward algorithm. Pairs the forward pass with a symmetric backward pass to compute γ[t][j] = P(state j at time t | O) — the smoothed posterior over each state. This is the E-step inside Baum-Welch and also gives "posterior decoding," which maximizes per-position accuracy rather than whole-path probability.

Higher-order HMMs. Let the next state depend on the last k states instead of one. More expressive, but the transition table grows as Nᵏ⁺¹ and decoding cost as O(Nᵏ⁺¹·T), so order 2 is usually the practical limit.

Hidden semi-Markov models (HSMM). Plain HMMs implicitly assume geometric state durations (the probability of staying drops off exponentially). HSMMs model an explicit duration distribution per state, which matters for genomics (exon lengths) and human activity recognition.

Conditional random fields. Not an HMM variant but the discriminative alternative for sequence labeling — drop the generative joint model and directly model P(labels | observations) with arbitrary features. CRFs usually beat HMMs on supervised tagging tasks when labeled data is available.

Common bugs and edge cases

  • Numerical underflow. Running Viterbi or forward in raw probability space silently produces 0 on sequences longer than a few dozen symbols. Always use log space (Viterbi) or column rescaling / log-sum-exp (forward).
  • Forgetting the backpointers. Viterbi still returns a valid log-probability without storing ψ, so the bug hides — but the reconstructed path is garbage. Test against a known toy model.
  • Zero probabilities and unseen symbols. If a symbol never appeared in training, its emission probability is 0, which zeroes out otherwise-valid paths (and gives log(0) = −∞). Apply Laplace/add-one smoothing to A and B.
  • Treating Baum-Welch's local optimum as global. EM only guarantees a local maximum of the likelihood. Different random initializations of (A, B, π) can converge to very different models; restart several times and keep the best.
  • Confusing Viterbi decoding with posterior decoding. Viterbi finds the single most likely path; posterior (forward-backward) decoding picks the most likely state at each position independently. The latter can produce a path the model assigns zero probability to (an illegal transition), so pick deliberately.
  • Mis-normalized matrices. Each row of A, each row of B, and π must each sum to 1. An un-normalized row breaks every probability claim downstream and is easy to introduce after smoothing.

Frequently asked questions

What's the difference between a Markov chain and a hidden Markov model?

In a plain Markov chain you observe the states directly. In an HMM the states are hidden — you only see symbols each state emits at random. You must infer the state sequence from the emissions, which adds an emission probability matrix B on top of the chain's transition matrix A.

Why is the Viterbi algorithm O(N²·T) and not exponential?

There are Nᵀ possible state paths, so brute force is exponential. Viterbi is dynamic programming: at each of T time steps it keeps only the single best path ending in each of N states, costing N transitions per state. That's N·N·T = O(N²·T) time and O(N·T) space for the backpointers.

What's the difference between the forward algorithm and Viterbi?

They share the same trellis. Forward sums over all paths to compute P(observations) — the total likelihood. Viterbi takes the max instead of the sum to find the single most likely path. Sum answers "how probable is this data?"; max answers "what states most likely produced it?".

Why does Viterbi work in log space?

Multiplying hundreds of probabilities below 1 underflows to zero in floating point. Taking logarithms turns the products into sums, so log P stays representable, and argmax is preserved because log is monotonic. Production HMM code almost always runs in log space.

What does Baum-Welch actually train?

Baum-Welch is the Expectation-Maximization instance for HMMs. Given only the observation sequences, it iteratively re-estimates the transition matrix A, emission matrix B, and initial distribution π to locally maximize P(observations). It converges to a local optimum, not necessarily the global one, so initialization matters.

Are hidden Markov models still used now that we have RNNs and transformers?

Yes, where you need interpretable states, small data, or hard probabilistic guarantees: gene-finding and protein profiles in bioinformatics, gravitational-wave glitch detection, fault detection, and the classic part-of-speech tagger. Deep models dominate large-scale speech and NLP, but HMMs win when you can name the states.