Machine Learning
Expectation-Maximization (EM)
Guess the hidden labels, refit the model, repeat — and the likelihood can only go up
Expectation-Maximization (EM) is an iterative algorithm that fits maximum-likelihood parameters when data has hidden variables, alternating an E-step that computes soft assignments with an M-step that re-estimates parameters — and provably never decreases the likelihood.
- TypeIterative MLE / MAP
- Per-iteration cost (GMM)O(n · k · d²)
- ConvergenceLinear, to a local optimum
- LikelihoodMonotonically non-decreasing
- IntroducedDempster, Laird & Rubin, 1977
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The chicken-and-egg problem EM solves
Suppose you have a thousand exam scores and you're convinced they came from two groups — a struggling cohort and a strong cohort — but nobody recorded which student belongs to which. You want to recover each group's mean and spread. There's a maddening circularity here: if you knew the labels, computing the two means would be trivial averaging; and if you knew the two means, you could guess the labels by proximity. But you have neither.
Expectation-Maximization breaks the deadlock by guessing one to get the other, then iterating. Start with arbitrary parameters. Use them to compute, for every point, a soft guess of which group it came from. Then pretend those soft guesses are correct and re-fit the parameters. Repeat. Dempster, Laird, and Rubin unified dozens of scattered special cases of this idea into one framework in their 1977 paper "Maximum Likelihood from Incomplete Data via the EM Algorithm" — one of the most-cited statistics papers ever written.
The hidden labels are the latent variables, usually written z. The observed scores are x. EM is a general recipe for maximizing the likelihood of x whenever marginalizing out z makes the likelihood ugly but knowing z would have made it easy.
The two steps, precisely
Let the parameters be θ. The quantity we actually want to maximize is the marginal log-likelihood of the observed data:
ℓ(θ) = log p(x | θ) = log Σ_z p(x, z | θ)
That log-of-a-sum is the problem — the sum sits inside the log, so the derivative tangles all the components together and there's no closed-form maximizer. EM sidesteps it by introducing the responsibilities, the posterior over the latent variable given the current parameters.
E-step (Expectation). Using the current θ⁽ᵗ⁾, compute for every point i and every component k the responsibility
γ_ik = p(z_i = k | x_i, θ⁽ᵗ⁾) = π_k · N(x_i | μ_k, Σ_k)
────────────────────────────
Σ_j π_j · N(x_i | μ_j, Σ_j)
This is just Bayes' rule: prior weight times likelihood, normalized. γ_ik is the fraction of point i that "belongs" to component k, and the row sums to 1.
M-step (Maximization). Treat the responsibilities as fixed soft labels and maximize the expected complete-data log-likelihood, the function called Q(θ | θ⁽ᵗ⁾) = E_z[ log p(x, z | θ) | x, θ⁽ᵗ⁾ ]. For a Gaussian mixture this is weighted maximum likelihood and has a clean closed form. Let N_k = Σ_i γ_ik be the effective number of points in component k:
π_k ← N_k / n
μ_k ← (1 / N_k) Σ_i γ_ik · x_i
Σ_k ← (1 / N_k) Σ_i γ_ik · (x_i − μ_k)(x_i − μ_k)ᵀ
These are exactly the ordinary mean, mixing proportion, and covariance — but with each point counted by its responsibility instead of by 1. Alternate E and M until the log-likelihood stops climbing.
Why it works. The decomposition ℓ(θ) = ELBO(q, θ) + KL(q ‖ p(z|x,θ)) holds for any distribution q over the latents. The E-step sets q = p(z|x,θ⁽ᵗ⁾), which drives the KL term to zero, so the lower bound (the ELBO, or evidence lower bound) touches the true likelihood at θ⁽ᵗ⁾. The M-step then pushes that bound uphill. Because the bound equaled the likelihood before the M-step and only rose during it, ℓ(θ⁽ᵗ⁺¹⁾) ≥ ℓ(θ⁽ᵗ⁾). The likelihood is monotonically non-decreasing — that's the guarantee Jensen's inequality buys you.
Cost and convergence rate
For a Gaussian mixture with n points, k components, and dimension d:
- E-step: evaluating each Gaussian density requires the inverse covariance and a quadratic form, costing
O(d²)per point per component (after anO(k·d³)Cholesky/inverse per iteration). SoO(n·k·d²)overall. - M-step: the weighted covariance update is also
O(n·k·d²). - Per iteration:
O(n·k·d² + k·d³). For spherical or diagonal covariances thed²collapses tod. - Memory:
O(n·k)for the responsibility matrix plusO(k·d²)for the parameters.
The headline subtlety is the rate. EM converges linearly, not quadratically like Newton's method. The asymptotic rate equals the largest eigenvalue of the "fraction of missing information" matrix — the missing-data Fisher information divided by the complete-data information. Concretely: if your two clusters barely overlap, almost no information is missing and EM lands in a handful of iterations; if they overlap heavily so half your points are ambiguous, that fraction approaches 1 and convergence slows to a crawl. This is the textbook explanation for why EM rockets forward early and then inches toward the optimum.
When to reach for EM
- Mixture models — Gaussian mixtures for soft clustering and density estimation; mixtures of Bernoullis, Poissons, or experts.
- Genuinely missing data — fitting a model when some feature values are absent, treating the missing entries as the latent variables (its original 1977 motivation).
- Hidden-state sequence models — the Baum-Welch algorithm that trains a hidden Markov model is EM specialized to chains.
- Factor analysis and PCA-with-noise — probabilistic PCA estimates loadings via EM when the latent factors are unobserved.
- Censored or truncated data — survival analysis where you only know "the event happened after time T."
Reach for something else when the latent posterior p(z|x,θ) is itself intractable — then you need variational EM or sampling. And if you don't care about a probabilistic model at all and just want hard partitions, plain k-means is simpler and faster.
EM vs related optimizers
| EM (GMM) | k-means | Gradient ascent on ℓ(θ) | Variational EM | Gibbs sampling | |
|---|---|---|---|---|---|
| Assignments | Soft (probabilities) | Hard (0/1) | Soft | Soft (approximate) | Sampled labels |
| Fits covariance / shape | Yes (full Σ) | No (spherical only) | Yes | Yes | Yes |
| Step size to tune | None (closed form) | None | Learning rate | None / a few | None (but burn-in) |
| Guarantee | ℓ never decreases | Distortion never increases | ℓ increases if lr small enough | ELBO never decreases | Asymptotically exact |
| Convergence rate | Linear | Linear (fast in practice) | Linear; superlinear if Newton | Linear | Stochastic, no fixed point |
| Latent posterior | Must be tractable | Trivial (argmin) | Marginalized analytically | Intractable → approximated | Sampled |
| Global optimum | No (local) | No (local) | No (local) | No | Yes (in the limit) |
The clean way to see EM is as coordinate ascent on the ELBO: the E-step maximizes the bound over the distribution q, the M-step maximizes it over θ. K-means is the same coordinate ascent after you force q to be a one-hot indicator and freeze every covariance to σ²I with σ → 0.
What the numbers actually say
- k-means is EM in the σ → 0 limit. As a spherical Gaussian's variance shrinks, each responsibility
γ_iksharpens to 0 or 1 (the softmax becomes an argmax). The E-step becomes nearest-centroid assignment; the M-step becomes averaging. EM with full covariances does strictly more — it can model an elongated, tilted cluster that k-means would wrongly split. - One full-covariance component in d dimensions has d + d(d+1)/2 free parameters. In 10 dimensions that's 65 numbers per component, so a 5-component model has 325 parameters plus 5 mixing weights — which is exactly why EM overfits and collapses without regularization on small data.
- Convergence is governed by separation. Empirically, well-separated 2-D mixtures converge to 6 decimal places of log-likelihood in 10–30 iterations; heavily overlapping ones can take hundreds. The cost per iteration is fixed, so the wall-clock cost scales directly with that iteration count.
- The likelihood is unbounded above for unconstrained GMMs. Park one component on a single point and send its variance to 0: the density at that point → ∞, so ℓ → ∞. There is no global maximizer; the "good" solution you want is a high local maximum, not the supremum.
JavaScript implementation
A complete 1-D, two-component Gaussian mixture fit. The same skeleton generalizes to k components and to vectors by swapping the scalar Gaussian for a multivariate one.
// Univariate Gaussian density
const gauss = (x, mu, sigma) =>
Math.exp(-((x - mu) ** 2) / (2 * sigma * sigma)) /
(Math.sqrt(2 * Math.PI) * sigma);
function emGMM1D(data, k = 2, maxIter = 200, tol = 1e-6) {
const n = data.length;
// --- init: spread means across the data, equal weights, global variance
const lo = Math.min(...data), hi = Math.max(...data);
let mu = Array.from({ length: k }, (_, j) => lo + (hi - lo) * (j + 0.5) / k);
let sigma = Array(k).fill((hi - lo) / (2 * k));
let pi = Array(k).fill(1 / k);
let prevLL = -Infinity;
for (let iter = 0; iter < maxIter; iter++) {
// --- E-step: responsibilities gamma[i][j]
const gamma = data.map(x => {
const w = mu.map((m, j) => pi[j] * gauss(x, m, sigma[j]));
const s = w.reduce((a, b) => a + b, 0) || 1e-300;
return w.map(v => v / s);
});
// --- M-step: weighted re-estimation
for (let j = 0; j < k; j++) {
const Nj = gamma.reduce((a, g) => a + g[j], 0) || 1e-300;
pi[j] = Nj / n;
mu[j] = gamma.reduce((a, g, i) => a + g[j] * data[i], 0) / Nj;
const varj = gamma.reduce((a, g, i) => a + g[j] * (data[i] - mu[j]) ** 2, 0) / Nj;
sigma[j] = Math.sqrt(Math.max(varj, 1e-6)); // variance floor — see Common bugs
}
// --- convergence: total marginal log-likelihood
const ll = data.reduce((a, x) =>
a + Math.log(mu.reduce((s, m, j) => s + pi[j] * gauss(x, m, sigma[j]), 1e-300)), 0);
if (Math.abs(ll - prevLL) < tol) return { pi, mu, sigma, ll, iter };
prevLL = ll;
}
return { pi, mu, sigma, ll: prevLL, iter: maxIter };
}
// Two blobs: ~N(0,1) and ~N(6,1.5)
const sample = (mu, sd) => mu + sd * Math.sqrt(-2 * Math.log(Math.random())) * Math.cos(2 * Math.PI * Math.random());
const data = [...Array(400)].map(() => sample(0, 1)).concat([...Array(600)].map(() => sample(6, 1.5)));
console.log(emGMM1D(data)); // recovers means ~0 and ~6, weights ~0.4 / 0.6
Two details worth flagging. The responsibilities are normalized per point so each row of gamma sums to 1 — that's the E-step's whole job. And the log-likelihood, not the parameters, is what you watch for convergence; it is the quantity EM provably never decreases, so if it ever drops you have a bug.
Python implementation (vectorized, n-dimensional)
The multivariate version, the one you'd actually run, using NumPy. It also shows the log-sum-exp trick that keeps the E-step numerically stable.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp
def em_gmm(X, k=2, max_iter=200, tol=1e-6, reg=1e-6, seed=0):
rng = np.random.default_rng(seed)
n, d = X.shape
# init: random points as means, global cov, equal weights
mu = X[rng.choice(n, k, replace=False)]
Sig = np.stack([np.cov(X.T) + reg * np.eye(d)] * k)
pi = np.full(k, 1 / k)
prev_ll = -np.inf
for it in range(max_iter):
# ---- E-step: log responsibilities (log-sum-exp for stability)
log_r = np.empty((n, k))
for j in range(k):
log_r[:, j] = np.log(pi[j]) + multivariate_normal.logpdf(X, mu[j], Sig[j])
ll = logsumexp(log_r, axis=1).sum() # marginal log-likelihood
log_r -= logsumexp(log_r, axis=1, keepdims=True)
gamma = np.exp(log_r) # n x k, rows sum to 1
# ---- M-step: weighted MLE
Nk = gamma.sum(axis=0) + 1e-300
pi = Nk / n
mu = (gamma.T @ X) / Nk[:, None]
for j in range(k):
diff = X - mu[j]
Sig[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j]
Sig[j] += reg * np.eye(d) # ridge: prevents singular collapse
if abs(ll - prev_ll) < tol:
break
prev_ll = ll
return dict(pi=pi, mu=mu, Sigma=Sig, loglik=ll, iters=it + 1)
# Baum-Welch is EM for HMMs; pPCA, factor analysis, and mixtures of experts
# all reuse this exact E/M skeleton with a different complete-data likelihood.
Note the reg * np.eye(d) ridge added to every covariance: without it a component can collapse onto one point, its covariance goes singular, logpdf blows up to +∞, and the run is ruined. That tiny diagonal is the difference between a robust fit and a NaN.
Variants worth knowing
Generalized EM (GEM). The M-step doesn't have to fully maximize Q — it only has to increase it. A single gradient step on Q still guarantees monotone likelihood. This rescues EM when the M-step has no closed form.
ECM and ECME. Expectation Conditional Maximization replaces a hard joint M-step with a sequence of simpler conditional maximizations over parameter blocks — useful when the full M-step is coupled but each block is easy.
Stochastic / online EM. Run the E-step on a mini-batch and take a damped M-step, à la stochastic gradient. Essential when n is too large to sweep every iteration; underlies online topic models.
Variational EM (VBEM). When p(z|x,θ) is intractable, approximate it with a tractable q and maximize the ELBO instead. This is the bridge from classical EM to modern variational autoencoders, whose encoder is an amortized E-step.
Hard EM (a.k.a. classification EM). Replace the soft responsibilities with their argmax. This is literally k-means for a spherical GMM and is faster, but it sacrifices the monotone-likelihood guarantee on the true objective.
Baum-Welch. EM specialized to hidden Markov models: the E-step is the forward-backward algorithm computing state posteriors, the M-step re-estimates transition and emission probabilities.
Common bugs and edge cases
- Variance / covariance collapse. A component latches onto one outlier, its spread → 0, likelihood → ∞, and you get NaN. Always floor the variance, add a ridge to the covariance, or put an inverse-Wishart prior (MAP-EM) on it.
- Bad initialization. EM only finds a local optimum, so a poor start gives a poor answer. Seed with k-means++ and run several random restarts, keeping the highest final log-likelihood.
- Likelihood that decreases. It mathematically cannot if EM is correct — a drop means you updated parameters before recomputing responsibilities, double-counted a normalization, or hit floating-point underflow. Compute densities in log-space with log-sum-exp.
- Label switching. The components are exchangeable, so two runs may return the same clusters in swapped order. Harmless for the fit, but it breaks naive averaging across runs and confuses convergence diagnostics.
- Choosing k. More components always raise the training likelihood, so you can't pick
kby likelihood alone. Use BIC, AIC, or held-out likelihood instead. - Empty components. If
N_krounds to zero, the M-step divides by zero. Detect dead components and re-seed them on the worst-fit points or drop them. - Stopping on parameters instead of likelihood. Parameters can drift along a flat ridge while the likelihood is already converged. Monitor the marginal log-likelihood (or the ELBO), not the raw parameter delta.
Frequently asked questions
Why is EM guaranteed to converge?
Each iteration maximizes a lower bound (the ELBO) that touches the true log-likelihood at the current parameters. Because the M-step never decreases that bound and the bound equals the likelihood after the E-step, the marginal log-likelihood is monotonically non-decreasing. Since the likelihood of a bounded model is bounded above, the sequence converges — but only to a stationary point, not necessarily the global maximum.
What is the difference between EM and k-means?
K-means makes hard assignments — each point belongs to exactly one cluster — and only fits cluster means. EM for a Gaussian mixture makes soft assignments (a point is 70% cluster A, 30% cluster B) and also fits each cluster's covariance and mixing weight. K-means is the limiting case of EM where every covariance is the same spherical σ²I and σ → 0.
Does EM find the global maximum likelihood?
No. EM only converges to a local maximum or saddle point of the likelihood surface, and the result depends on initialization. The standard fix is to run EM from many random starts (or seed it with k-means++) and keep the run with the highest final log-likelihood.
What are the E-step and M-step actually computing?
The E-step computes the posterior probability that each data point came from each hidden component — the responsibilities γ — using the current parameters. The M-step then treats those responsibilities as soft labels and re-estimates the parameters with weighted maximum likelihood, as if each point were fractionally a member of every component.
Why can EM's likelihood diverge to infinity for Gaussian mixtures?
If a component collapses onto a single data point, its variance shrinks toward zero and the likelihood of that one point spikes to infinity — a degenerate singularity in the likelihood surface. EM can drive a component into this trap. The cures are a variance floor, a Bayesian MAP prior on the covariance (an inverse-Wishart), or re-initializing the collapsing component.
How fast does EM converge?
EM converges linearly, with a rate governed by the fraction of missing information — the ratio of missing-data Fisher information to complete-data information. When components overlap heavily (lots of ambiguous points), that fraction is near 1 and EM crawls; when clusters are well separated it can converge in a dozen iterations. This is why EM is fast early and slow near the optimum, and why methods like Aitken acceleration or quasi-Newton hybrids are sometimes layered on top.