Machine Learning

Gaussian Mixture Models

Clustering that admits doubt — every point gets a probability, not a label

A Gaussian mixture model represents data as a weighted blend of K Gaussian components, fit by the Expectation–Maximization algorithm in O(N·K·D²) per iteration for full covariances, giving soft cluster memberships as probabilities instead of hard labels.

  • ModelΣ πₖ 𝒩(μₖ, Σₖ)
  • Fit byExpectation–Maximization
  • Cost / iterationO(N·K·D²)
  • ConvergenceLocal maximum of likelihood
  • OutputSoft responsibilities γₙₖ

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The intuition: data from several bells

Imagine the heights of everyone in a room. Plot them and you might see two bumps — one centered around adult women, one around adult men — overlapping in the middle where a tall woman and a short man are genuinely ambiguous. A single bell curve can't capture two bumps. A mixture of two bells can, and crucially it can tell you that a 170 cm person is, say, 60% likely from the "women" bell and 40% from the "men" bell.

That is the whole idea of a Gaussian mixture model. You assume the data was generated by first secretly rolling a die to pick one of K "components," then drawing a point from that component's Gaussian (normal) distribution. You never saw the die rolls — those cluster identities are the latent variables. Fitting the model means recovering, from the points alone, the K bells (their centers, shapes, and weights) and, for each point, the probability it came from each bell.

That last part is the headline feature: soft clustering. K-means hands every point a single integer label. A GMM hands every point a probability vector that sums to one. Points deep inside a cluster get near-certain membership; points in the overlap zone get split memberships that honestly reflect the ambiguity.

The model and the EM algorithm

The density a GMM assigns to a point x is a weighted sum of K multivariate Gaussians:

p(x) = Σₖ πₖ · 𝒩(x | μₖ, Σₖ),    with   Σₖ πₖ = 1,  πₖ ≥ 0

𝒩(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) · exp(−½ (x−μ)ᵀ Σ⁻¹ (x−μ))

The parameters are the mixing weights πₖ (how big each cluster's slice of the pie is), the means μₖ (centers), and the covariances Σₖ (the orientation and spread of each ellipsoid). We want the parameters that maximize the log-likelihood of the observed data, Σₙ log p(xₙ). But that log-of-a-sum has no closed-form maximizer, because we don't know which component produced each point.

The Expectation–Maximization (EM) algorithm, introduced by Dempster, Laird, and Rubin in 1977, breaks the chicken-and-egg deadlock by alternating two steps:

E-step (expectation). Pretend the current parameters are correct and compute the responsibility — the posterior probability that component k generated point n:

γₙₖ = πₖ 𝒩(xₙ | μₖ, Σₖ) / Σⱼ πⱼ 𝒩(xₙ | μⱼ, Σⱼ)

This is just Bayes' rule: prior (the weight) times likelihood (the bell's density at the point), normalized over all components. Each γₙₖ lands in [0, 1] and the K responsibilities for a point sum to 1.

M-step (maximization). Now treat the responsibilities as fixed soft counts and re-estimate the parameters in closed form. Let Nₖ = Σₙ γₙₖ be the effective number of points owned by component k:

μₖ  ←  (1/Nₖ) Σₙ γₙₖ xₙ
Σₖ  ←  (1/Nₖ) Σₙ γₙₖ (xₙ − μₖ)(xₙ − μₖ)ᵀ
πₖ  ←  Nₖ / N

Each new mean is a responsibility-weighted average of the points; each covariance is a weighted scatter matrix; each weight is the fraction of mass the component claimed. Alternate E and M until the log-likelihood stops climbing. Every iteration is provably non-decreasing in likelihood — that's the theorem that makes EM trustworthy.

Why EM never goes downhill

EM is not gradient descent and it is not random search. It optimizes a lower bound on the log-likelihood called the evidence lower bound (ELBO). The E-step makes that bound tight at the current parameters by setting the bound's variational distribution equal to the true posterior (the responsibilities). The M-step then maximizes the bound, which can only raise the true likelihood it touches. Because each step pushes the likelihood up or leaves it flat, the sequence is monotonic and bounded above, so it converges — to a stationary point, which is a local maximum or saddle.

The practical consequence: EM cannot oscillate or diverge in likelihood, unlike a poorly tuned gradient step. The trade-off is that "a local maximum" can be a bad one, which is why initialization and restarts matter.

When to reach for a GMM

  • Clusters are elliptical or differently sized. Full covariances let each cluster be a stretched, tilted ellipsoid — something k-means structurally cannot represent.
  • You need calibrated uncertainty. The responsibilities are probabilities you can threshold, rank, or feed downstream (e.g. "flag points whose top membership is below 0.6 for human review").
  • Density estimation, not just clustering. A fitted GMM is a full generative density. You can score the likelihood of new points for novelty/anomaly detection, or sample synthetic data from it.
  • Soft assignment matters. Speaker identification, image segmentation, and background subtraction all benefit from "70% this, 30% that" over a forced single label.

Reach elsewhere when clusters are non-convex (two interlocking moons defeat any ellipse — use DBSCAN or spectral clustering), when D is huge relative to N (covariances become unstable — reduce dimension first), or when you only need a fast, hard partition (plain k-means is simpler and cheaper).

GMM vs other clustering and density methods

GMM (EM)k-meansDBSCANSpectralKDE
AssignmentSoft (probabilities)HardHard + noise labelHardNone (density only)
Cluster shapeElliptical (full Σ)Spherical, equal sizeArbitrary, density-connectedArbitrary (graph)n/a
Needs K up frontYes (or BIC/AIC sweep)YesNo (uses ε, minPts)YesNo
Per-iteration costO(N·K·D²)O(N·K·D)O(N log N) w/ indexO(N³) eigendecompO(N·M·D) per query
Generative modelYes — full densityNoNoNoYes — nonparametric
Handles overlapYes (the whole point)Poorly (sharp boundary)Merges overlappingDepends on graphYes
Outlier behaviorPulls a component or needs ridgePulls a centroidLabels as noiseCan isolateSmooths over

The defining contrast is k-means. A GMM with covariances locked to σ²I and σ → 0 is k-means: the responsibilities harden into 0/1 assignments and the mean update becomes the centroid update. So k-means is the zero-temperature, isotropic special case of a GMM. Everything a GMM adds — covariances, weights, soft memberships — is what you buy for the extra factor.

What the numbers actually say

  • Per-iteration cost is O(N·K·D²) for full covariances. The D² is unavoidable: each E-step evaluation needs a quadratic form (x−μ)ᵀΣ⁻¹(x−μ), and inverting each Σ is O(D³) but done only K times per iteration, so for N ≫ D the N·K·D² term dominates. Switch to diagonal covariances and it falls to O(N·K·D) — for D = 100 that is a 100× speedup per iteration.
  • Parameter count grows fast with full covariances. Each component needs D for its mean plus D(D+1)/2 for a symmetric covariance. With K = 10 components in D = 50 dimensions that is 10·(50 + 1275) + 9 ≈ 13,259 free parameters — easy to overfit with only a few thousand points. Diagonal covariances cut the per-component cost from D(D+1)/2 to just D.
  • Iterations to converge. EM typically needs tens to low hundreds of iterations to settle within a relative log-likelihood tolerance of 1e-3. The early iterations move the means a lot; the tail is slow because EM's convergence near a maximum is only linear, not quadratic like Newton's method.
  • Restarts are cheap insurance. Because a single fit is fast and the answer is initialization-sensitive, libraries default to several restarts (scikit-learn's n_init defaults to 1 but k-means++ init is used) and keep the highest-likelihood run.

JavaScript implementation

A compact 1-D GMM via EM. One dimension keeps the linear algebra to scalars (variance instead of a covariance matrix) so the EM loop is visible without a matrix library.

// Fit a 1-D Gaussian mixture with K components via EM.
// data: number[]; returns { weights, means, vars, logLik }.
function fitGMM(data, K, { maxIter = 200, tol = 1e-6, ridge = 1e-6 } = {}) {
  const N = data.length;
  const gauss = (x, mu, v) =>
    Math.exp(-((x - mu) ** 2) / (2 * v)) / Math.sqrt(2 * Math.PI * v);

  // Init: spread means across the data range, equal weights, global variance.
  const lo = Math.min(...data), hi = Math.max(...data);
  const mean = data.reduce((a, b) => a + b, 0) / N;
  const v0 = data.reduce((a, b) => a + (b - mean) ** 2, 0) / N;
  let w = Array.from({ length: K }, () => 1 / K);
  let mu = Array.from({ length: K }, (_, k) => lo + ((k + 0.5) / K) * (hi - lo));
  let vr = Array.from({ length: K }, () => v0);

  let prevLL = -Infinity;
  for (let iter = 0; iter < maxIter; iter++) {
    // E-step: responsibilities gamma[n][k]
    const gamma = data.map((x) => {
      const px = mu.map((m, k) => w[k] * gauss(x, m, vr[k]));
      const s = px.reduce((a, b) => a + b, 0) || 1e-300;
      return px.map((p) => p / s);
    });

    // M-step
    for (let k = 0; k < K; k++) {
      let Nk = 0, sumX = 0;
      for (let n = 0; n < N; n++) { Nk += gamma[n][k]; sumX += gamma[n][k] * data[n]; }
      Nk = Nk || 1e-300;
      mu[k] = sumX / Nk;
      let sv = 0;
      for (let n = 0; n < N; n++) sv += gamma[n][k] * (data[n] - mu[k]) ** 2;
      vr[k] = sv / Nk + ridge;     // ridge prevents variance collapse
      w[k] = Nk / N;
    }

    // Log-likelihood for the convergence test
    let ll = 0;
    for (const x of data) {
      let px = 0;
      for (let k = 0; k < K; k++) px += w[k] * gauss(x, mu[k], vr[k]);
      ll += Math.log(px + 1e-300);
    }
    const converged = Math.abs(ll - prevLL) < tol;
    prevLL = ll;                 // always keep the latest log-likelihood
    if (converged) break;
  }
  return { weights: w, means: mu, vars: vr, logLik: prevLL };
}

Two details that matter in any GMM code. First, the || 1e-300 guards divide every responsibility sum and effective count so a component that loses all its mass doesn't produce NaN. Second, the ridge added to each variance is the singularity fix — without it, a component can clamp onto one point, drive its variance to zero, and send the likelihood to infinity.

Python implementation

The same algorithm in D dimensions with full covariance matrices, using NumPy. This is essentially what sklearn.mixture.GaussianMixture does under the hood.

import numpy as np
from numpy.linalg import slogdet, solve

def gmm_em(X, K, max_iter=200, tol=1e-6, ridge=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Init means at random points; covariances at the global covariance.
    mu = X[rng.choice(N, K, replace=False)]
    cov = np.stack([np.cov(X.T) + ridge * np.eye(D) for _ in range(K)])
    w = np.full(K, 1.0 / K)

    def log_gauss(X, mu, S):
        S = S + ridge * np.eye(D)
        diff = X - mu                       # (N, D)
        sol = solve(S, diff.T).T            # Σ⁻¹ (x−μ), avoids explicit inverse
        maha = np.einsum('nd,nd->n', diff, sol)
        _, logdet = slogdet(S)
        return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)

    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step in log-space for numerical stability
        log_r = np.array([np.log(w[k]) + log_gauss(X, mu[k], cov[k]) for k in range(K)]).T
        m = log_r.max(axis=1, keepdims=True)
        log_norm = m[:, 0] + np.log(np.exp(log_r - m).sum(axis=1))
        gamma = np.exp(log_r - log_norm[:, None])     # (N, K) responsibilities

        # M-step
        Nk = gamma.sum(axis=0) + 1e-300
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + ridge * np.eye(D)

        ll = log_norm.sum()
        converged = abs(ll - prev_ll) < tol
        prev_ll = ll                 # always keep the latest log-likelihood
        if converged:
            break
    return w, mu, cov, gamma, prev_ll

def bic(X, K, log_lik):
    """Lower is better — penalizes free parameters to pick K."""
    N, D = X.shape
    n_params = K - 1 + K * D + K * D * (D + 1) // 2   # weights + means + cov
    return -2 * log_lik + n_params * np.log(N)

Note the two stability techniques real implementations rely on. The E-step works entirely in log-space and uses the log-sum-exp trick (subtract the row max before exponentiating) so that tiny densities don't underflow to zero. And solve(S, diff.T) computes Σ⁻¹(x−μ) without ever forming the explicit inverse, which is both faster and more numerically stable than np.linalg.inv.

Variants worth knowing

Covariance tying. Estimating a full Σ per component is parameter-hungry. Libraries offer four covariance types: full (each its own ellipsoid), tied (all components share one Σ), diagonal (axis-aligned ellipsoids, no D² blowup), and spherical (one variance scalar per component — closest to k-means). Diagonal is the workhorse in high dimensions.

Bayesian / variational GMM. Put a Dirichlet prior on the weights and a normal-inverse-Wishart prior on each Gaussian, then fit with variational inference. The prior shrinks unneeded components toward zero weight, so you can over-specify K and let the model prune it. This also dodges the variance-collapse singularity automatically.

Dirichlet-process mixture (infinite GMM). Don't fix K at all — a Dirichlet process places a prior over infinitely many components and the data decides how many are active. Useful when you genuinely don't know the cluster count.

Hard-EM / Classification EM. Replace soft responsibilities with a hard argmax in the E-step. This recovers exactly k-means for spherical covariances and runs faster, at the cost of the soft-clustering benefits and worse local optima.

Mixture of factor analyzers / probabilistic PCA mixtures. Model each component's covariance as a low-rank factor structure, giving GMM-like flexibility in high dimensions without the full D² parameters per component.

Common bugs and edge cases

  • Variance / covariance collapse (the singularity). A component grabs one point, its determinant → 0, and the log-likelihood shoots to +∞. Always add a ridge εI to covariances, or detect and reinitialize a collapsing component.
  • Numerical underflow in the E-step. Multiplying many small Gaussian densities underflows to 0, producing 0/0 responsibilities. Compute log-densities and use log-sum-exp for the normalization, as the Python code does.
  • Forgetting that label IDs are arbitrary. Components are unidentifiable up to permutation — "cluster 0" on one run is "cluster 2" on another. Don't compare raw label integers across runs; compare via a matching like the Hungarian algorithm.
  • Maximizing likelihood to pick K. Likelihood only ever rises with more components, so it can't choose K. Use BIC or AIC, which add a complexity penalty, or a held-out validation likelihood.
  • Assuming convergence equals correctness. EM converges to a local maximum that depends on the seed. A single run can land in a poor optimum; run multiple restarts (k-means++ init helps) and keep the best log-likelihood.
  • Using full covariances when D is large. The D(D+1)/2 parameters per component overfit and the matrices become ill-conditioned. Switch to diagonal covariances, reduce dimension with PCA first, or tie the covariances.

Frequently asked questions

How is a Gaussian mixture model different from k-means?

K-means assigns each point to exactly one cluster (a hard label) and only learns cluster centers, so it implicitly assumes spherical, equal-size clusters. A GMM learns a center, a full covariance matrix, and a weight per cluster, and assigns each point a probability of belonging to every cluster. In fact k-means is the limiting case of a GMM where all covariances are σ²I with σ → 0.

Why does EM use soft responsibilities instead of hard assignments?

EM maximizes the data likelihood, and the likelihood is a smooth function of the parameters. Soft responsibilities — the posterior probability that point i came from component k — make every parameter update differentiable and monotonically non-decreasing in likelihood. Hard assignments would reintroduce the discrete jumps that make k-means get stuck more easily.

Does EM always converge to the best solution?

No. EM is guaranteed to never decrease the log-likelihood, so it converges to a local maximum or saddle point, but not necessarily the global one. The result depends on initialization; standard practice is to run EM several times from different seeds (often k-means++ centers) and keep the run with the highest likelihood.

What is the time complexity of fitting a GMM?

Each EM iteration costs O(N·K·D²) for full covariances — N points, K components, D dimensions, with the D² coming from evaluating and inverting the covariance matrices. With diagonal covariances it drops to O(N·K·D). Total cost is that per-iteration figure times the number of iterations to converge, typically tens to low hundreds.

How do you choose the number of components K?

You can't just maximize likelihood — more components always fit the training data better. Instead penalize complexity with the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC), fit GMMs for a range of K, and pick the K that minimizes the score. A Dirichlet-process (infinite) mixture can also learn K automatically.

Why do GMM covariances sometimes collapse to zero?

If one component latches onto a single point, its covariance shrinks toward zero and the density at that point goes to infinity, sending the likelihood to +∞ — a singularity. The fix is to add a small ridge (regularization) to the diagonal of each covariance matrix, or to reset any component whose weight or determinant collapses.