Probability
Markov Chain
Stochastic process where the future depends only on the present
A Markov chain is a sequence of random states where each transition depends only on the current state — the past doesn't matter (memoryless property). Defined by a transition matrix P where P[i][j] = probability of going from state i to state j. Underlies Google PageRank, Markov Chain Monte Carlo, hidden Markov models in speech recognition, and weather prediction. The simplest model of stochastic dynamics that's still mathematically tractable.
- Markov propertyP(X_{n+1} | X_n) — future depends only on present, not past
- Transition matrix PP[i][j] = probability of going from state i to state j
- Stationary distributionπ such that πP = π — long-run proportions of time in each state
- Convergence conditionIrreducible + aperiodic — converges to unique stationary
- Steady-state computationEigenvector of Pᵀ with eigenvalue 1
- Famous applicationsPageRank, MCMC, weather, board game analysis, NLP
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a Markov chain works
A Markov chain is a sequence of states X₀, X₁, X₂, ... where transitions follow probabilistic rules:
- Finite (or countable) state space S = {1, 2, ..., n}.
- Transition matrix P, where P[i][j] = P(X_{k+1} = j | X_k = i). Each row sums to 1 (you go somewhere from each state).
- Initial distribution π₀ over states.
The Markov property — future state depends only on current state, not on the path taken to reach it. The whole "memory" of the chain is encoded in the current state.
Worked example — weather model
Three states — Sunny (S), Cloudy (C), Rainy (R). Transition matrix:
S C R
S [ 0.7 0.2 0.1 ]
C [ 0.4 0.4 0.2 ]
R [ 0.2 0.3 0.5 ]
If today is Sunny, 70% chance tomorrow is also Sunny; 20% Cloudy; 10% Rainy. Each row sums to 1.
Question — if today is Sunny, what's the chance it's Sunny again 2 days from now? It's the (S, S) entry of P². Compute P² by matrix multiplication:
P² = P · P (numerical multiplication)
The (1, 1) entry is 0.7·0.7 + 0.2·0.4 + 0.1·0.2 = 0.49 + 0.08 + 0.02 = 0.59.
So 59% chance of Sun → Sun in 2 days. More generally, P^k gives k-step transitions.
Stationary distribution
The stationary distribution π is a probability vector satisfying πP = π. It's the long-run proportion of time the chain spends in each state, regardless of starting state (assuming the chain is irreducible and aperiodic).
To find π for our weather example, solve the system:
πP = π
π_S + π_C + π_R = 1
Working out the algebra — π = (0.444, 0.296, 0.260). Long-run, the weather is Sunny 44.4%, Cloudy 29.6%, Rainy 26.0%.
Equivalently — π is the left eigenvector of P with eigenvalue 1, normalized to sum to 1. For large chains, power iteration finds it efficiently.
Convergence — when does the chain settle?
Two conditions are needed for convergence to a unique stationary distribution:
- Irreducibility. Every state is reachable from every other, in some number of steps. No isolated subsets.
- Aperiodicity. The greatest common divisor of return times to any state is 1. Equivalently, the chain doesn't have a periodic structure (cycle length 2 or more) it can't escape.
Most natural chains satisfy both. When both hold, the chain converges to its unique stationary distribution from any starting state — at exponential rate (rate determined by the second-largest eigenvalue of P).
JavaScript — Markov chain implementation
class MarkovChain {
constructor(transitionMatrix, states) {
this.P = transitionMatrix;
this.states = states;
}
step(currentState) {
const idx = this.states.indexOf(currentState);
const probs = this.P[idx];
let r = Math.random();
for (let i = 0; i < probs.length; i++) {
r -= probs[i];
if (r < 0) return this.states[i];
}
return this.states[probs.length - 1];
}
simulate(start, steps) {
let current = start;
const path = [current];
for (let i = 0; i < steps; i++) {
current = this.step(current);
path.push(current);
}
return path;
}
// Power iteration to find stationary distribution
stationary(iters = 1000) {
let pi = new Array(this.states.length).fill(1 / this.states.length);
for (let iter = 0; iter < iters; iter++) {
const next = new Array(pi.length).fill(0);
for (let i = 0; i < pi.length; i++) {
for (let j = 0; j < pi.length; j++) {
next[j] += pi[i] * this.P[i][j];
}
}
pi = next;
}
return pi;
}
}
// Weather example
const weather = new MarkovChain(
[[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.2, 0.3, 0.5]],
['Sunny', 'Cloudy', 'Rainy']
);
console.log(weather.simulate('Sunny', 10));
console.log(weather.stationary()); // ≈ [0.444, 0.296, 0.260]
Famous applications
PageRank — Google's original ranking algorithm
Build a Markov chain over web pages. Transition from page i to page j with probability proportional to links from i to j. Add a "teleport" jump (~15%) to a random page to handle pages with no outgoing links. Compute the stationary distribution — pages with high stationary probability are highly ranked.
Stylized PageRank update — for each page i:
PageRank(i) = (1 − d)/N + d · ∑_{j → i} PageRank(j) / out_degree(j)
where d ≈ 0.85 is the "damping factor." Iterate until convergence. Brin and Page (1998).
MCMC — sampling complex distributions
Want samples from a target distribution π that's hard to sample directly. Construct a Markov chain whose stationary distribution is π. Run the chain long enough; samples become approximately drawn from π.
Two main MCMC algorithms:
- Metropolis-Hastings. Propose moves; accept with probability that detail-balance the chain to π.
- Gibbs sampling. For multivariate distributions, sample one component at a time conditional on others.
Used in Bayesian statistics (sampling posteriors), statistical physics (computing partition functions), and many AI/ML methods.
Hidden Markov Models — speech recognition
States = phonemes (hidden); emissions = audio features (observed). Transitions encode "which phonemes follow which"; emissions encode "which audio features each phoneme produces." Given an audio sequence, infer the most likely phoneme sequence — solve via Viterbi algorithm. Foundation of pre-deep-learning speech recognition.
When Markov chains are the right tool
- Modeling stochastic processes with limited memory. Queueing systems, weather, drug concentrations, board game positions.
- Sampling from complex distributions. MCMC for Bayesian inference, partition functions in physics, optimization.
- Ranking via random walk. PageRank, citation graphs, social network influence.
- Sequence prediction. Text generation, speech recognition, gene sequence analysis.
- Reinforcement learning. Markov Decision Processes (MDPs) — Markov chains with action choices and rewards. Foundation of all RL.
- Computer simulation. Monte Carlo methods, stochastic differential equations, agent-based models.
Skip Markov chains when long-range memory matters — language has long-range dependencies (deep learning beats Markov models for translation). Skip when the state space is too large (curse of dimensionality — exponential in number of variables).
Common mistakes
- Assuming the Markov property holds. Most real systems have memory beyond the current state. Approximate as Markov by including enough history in the state — but this can blow up the state space exponentially.
- Forgetting to check irreducibility. A reducible chain has multiple stationary distributions or absorbing classes. The stationary distribution depends on starting state in this case.
- Periodicity oscillations. A periodic chain doesn't converge — it cycles through states. Some random walk variants are periodic; verify before claiming convergence.
- Using insufficient burn-in for MCMC. Early samples reflect the initial state, not the stationary distribution. Discard the first few thousand samples ("burn-in") before computing statistics.
- Treating samples as independent. MCMC samples are correlated — autocorrelation needs to be accounted for in confidence intervals. Effective sample size is much smaller than total iterations.
- Power iteration not converging. If the chain has multiple eigenvalues with magnitude 1 (e.g., periodicity), power iteration oscillates. Diagnose with eigenvalue analysis; use a more robust algorithm (eigendecomposition).
Frequently asked questions
What's the Markov property exactly?
The future depends only on the present, not on the past. Mathematically — P(X_{n+1} = s | X_n = sₙ, X_{n-1} = sₙ₋₁, ..., X_0 = s₀) = P(X_{n+1} = s | X_n = sₙ). All past history is "summarized" in the current state. Real-world systems often have this approximately — drug concentration, weather, queue lengths, board positions in chess (with finite history).
How is a Markov chain different from a random walk?
A random walk IS a Markov chain — the simplest kind, where steps are i.i.d. (each step is +1 or -1 with equal probability, say). General Markov chains have state-dependent transitions; the random walk is the special case where the transition probabilities are the same from every position. So all random walks are Markov chains, but not all Markov chains are random walks.
How do I find a stationary distribution?
Solve πP = π subject to ∑ πᵢ = 1. This is finding the left eigenvector of P with eigenvalue 1. Equivalently, the right eigenvector of Pᵀ with eigenvalue 1. For small chains, solve algebraically. For large ones, power iteration — start with any positive distribution, multiply by P repeatedly; if the chain is irreducible and aperiodic, you'll converge to π.
When does a Markov chain converge to its stationary distribution?
When it's irreducible (every state reachable from every other) AND aperiodic (gcd of return times to any state is 1). Without these, behavior is more complex — periodic chains oscillate without converging; reducible chains have multiple absorbing components. Most "natural" chains satisfy both conditions; you check explicitly when working with one.
How does PageRank use Markov chains?
Build a Markov chain over web pages. Transition from page i to page j with probability proportional to whether i links to j. Add a "teleport" probability (10-15%) to handle dead ends — this makes the chain irreducible. The PageRank vector is the stationary distribution — pages visited most often by a random surfer. Brin and Page's 1998 paper.
What's MCMC?
Markov Chain Monte Carlo — algorithms that construct a Markov chain whose stationary distribution is some target distribution we want to sample from. Run the chain long enough to converge, then take samples. Used to sample from posterior distributions in Bayesian inference, partition functions in statistical physics, and any complex distribution that's hard to sample directly. Metropolis-Hastings and Gibbs sampling are standard MCMC methods.
What's a hidden Markov model?
A Markov chain whose states aren't directly observed — instead, each state emits an observable signal. The chain is "hidden"; you see only the emissions. Used in speech recognition (states = phonemes; emissions = audio features), part-of-speech tagging, bioinformatics (DNA hidden states). Algorithms (Baum-Welch, Viterbi) infer the hidden states from observations.