Machine Learning

K-Means Clustering

Two steps, repeated until nothing moves — the workhorse of unsupervised learning

K-means clustering partitions n points into k groups by alternately assigning each point to its nearest centroid and recomputing each centroid as the mean of its points, converging to a local minimum of within-cluster variance.

  • Per-iteration costO(n · k · d)
  • Typical iterations10–50
  • Optimumlocal (NP-hard globally)
  • k chosen byyou, not the data
  • Cluster shapespherical, equal-ish

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How k-means works

You have a cloud of n data points and no labels — just the raw coordinates. You want to discover groups. K-means does it with a loop so simple you can run it by hand: pick k centers, then repeat two steps until the centers stop moving.

  1. Assignment step. Assign every point to the nearest of the k centroids, by Euclidean distance. This carves the space into k Voronoi cells — each cell is the region closer to its centroid than to any other.
  2. Update step. Move each centroid to the mean (the centre of mass) of all points currently assigned to it. The mean is the point that minimizes the sum of squared distances to its members, which is exactly what we're trying to shrink.

That's it. Alternate assignment and update. Each step can only decrease (or hold) the objective, so the loop is guaranteed to terminate. This is Lloyd's algorithm, published by Stuart Lloyd inside Bell Labs in 1957 as a technique for pulse-code modulation, and not declassified into the open literature until 1982 — by which point others had reinvented it several times over.

The objective being minimized is the within-cluster sum of squares, also called inertia:

J = Σ_clusters Σ_(points in cluster) ‖ x − μ_cluster ‖²

Crucially, this is a chicken-and-egg optimization: the best assignment depends on the centroids, and the best centroids depend on the assignment. K-means breaks the deadlock by fixing one and optimizing the other, alternately — a special case of expectation-maximization (EM) with hard assignments instead of soft probabilities.

When to use k-means

  • You already know roughly how many groups to expect. Customer tiers, color quantization to 16 palette entries, k delivery zones — anywhere k is a business constraint rather than a discovery.
  • The clusters are blobby and similar in size. K-means assumes round, convex, comparably-populated groups. That's a strong assumption, but it's true surprisingly often after sensible feature scaling.
  • You need speed at scale. Linear in the number of points per iteration, it clusters millions of rows where hierarchical clustering (O(n²) memory) falls over.
  • You want a vector-quantization codebook. Each centroid becomes a representative code; this is how k-means compresses images and builds the "visual words" in bag-of-features pipelines.

Reach for something else when clusters are elongated or nested (use a density-based method like DBSCAN), when you don't know k and the data has clear hierarchy (agglomerative clustering), or when clusters genuinely overlap and you want membership probabilities (a Gaussian mixture model).

K-means vs other clustering methods

K-meansGaussian mixture (GMM)DBSCANAgglomerativeMean-shift
Need to pick k?YesYesNo (picks ε, minPts)No (cut the dendrogram)No (picks bandwidth)
Cluster shapeSpherical, equal-ishElliptical (full covariance)Arbitrary / non-convexArbitrary (linkage-dependent)Arbitrary blobs
AssignmentHard (one cluster)Soft (probabilities)Hard + noise labelHardHard
Handles outliers?Poorly — they drag meansPoorlyYes — labels them noisePoorlyModerately
Time per fitO(n · k · d · i)O(n · k · d² · i)O(n log n) with indexO(n² log n)O(n² · i)
MemoryO(n + k)O(n + k·d²)O(n)O(n²) distance matrixO(n)
Scales to millions?Yes (mini-batch)MostlyWith a spatial indexNoNo

K-means is the fast, scalable, opinionated baseline. GMM is k-means with soft assignments and an elliptical covariance per cluster — strictly more expressive, strictly slower. DBSCAN trades the "pick k" problem for the "pick ε" problem but earns the ability to find arbitrary shapes and flag noise. Start with k-means; switch only when its spherical-equal-size assumption visibly breaks.

What the numbers actually say

  • One iteration costs n · k · d distance computations. Cluster 1,000,000 points into k=10 groups in d=50 dimensions and that's 5×10⁸ multiply-adds per pass — about 0.5 s on a single modern core, a few milliseconds on a GPU. Run 30 iterations and you're at roughly 15 s single-threaded.
  • Plain random init wastes runs. On the standard benchmark of well-separated Gaussians, uniform-random seeding lands in a bad local minimum a large fraction of the time; k-means++ cuts that dramatically and comes with a proven O(log k) expected-cost guarantee versus the optimum. That's why scikit-learn uses k-means++ as its default initializer (and lets you raise n_init to keep the best of several restarts).
  • Mini-batch k-means is roughly an order of magnitude faster on large datasets, sampling a few thousand points per update instead of all n, at the cost of a slightly higher final inertia (typically a low single-digit percentage).
  • Iteration count is small in practice. Despite a superpolynomial worst case (constructible adversarial inputs exist), real data converges in 10–50 iterations. Smoothed analysis explains why: tiny perturbations to the input bound the expected number of iterations to a polynomial.

JavaScript implementation

A complete, dependency-free Lloyd's algorithm with k-means++ seeding:

function dist2(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) { const d = a[i] - b[i]; s += d * d; }
  return s;                          // squared distance — no need for sqrt
}

// k-means++ seeding: spread the initial centers out
function seedPlusPlus(points, k) {
  const centroids = [points[Math.floor(Math.random() * points.length)]];
  while (centroids.length < k) {
    const d2 = points.map(p => Math.min(...centroids.map(c => dist2(p, c))));
    const total = d2.reduce((a, b) => a + b, 0);
    let r = Math.random() * total, i = 0;
    while ((r -= d2[i]) > 0) i++;     // pick proportional to squared distance
    centroids.push(points[i]);
  }
  return centroids.map(c => c.slice());
}

function kmeans(points, k, maxIter = 100) {
  const dim = points[0].length;
  let centroids = seedPlusPlus(points, k);
  let labels = new Array(points.length).fill(-1);

  for (let iter = 0; iter < maxIter; iter++) {
    let moved = false;

    // assignment step — nearest centroid wins
    for (let p = 0; p < points.length; p++) {
      let best = 0, bestD = Infinity;
      for (let c = 0; c < k; c++) {
        const d = dist2(points[p], centroids[c]);
        if (d < bestD) { bestD = d; best = c; }
      }
      if (labels[p] !== best) { labels[p] = best; moved = true; }
    }

    // update step — each centroid becomes the mean of its members
    const sums = Array.from({ length: k }, () => new Array(dim).fill(0));
    const counts = new Array(k).fill(0);
    for (let p = 0; p < points.length; p++) {
      const c = labels[p]; counts[c]++;
      for (let i = 0; i < dim; i++) sums[c][i] += points[p][i];
    }
    for (let c = 0; c < k; c++) {
      if (counts[c] === 0) continue;  // empty cluster — see "common bugs"
      for (let i = 0; i < dim; i++) centroids[c][i] = sums[c][i] / counts[c];
    }

    if (!moved) break;                // converged: no point changed cluster
  }
  return { centroids, labels };
}

Two things worth noting. First, we compare squared distances throughout — the square root in Euclidean distance is monotonic, so it never changes which centroid is nearest, and skipping it saves a costly Math.sqrt per comparison. Second, convergence is detected by checking whether any label changed; once the partition is stable, the centroids would land in the same place anyway.

Python implementation

The same algorithm vectorized with NumPy, which is how you'd actually write it for performance — the assignment step becomes a single broadcasted distance matrix:

import numpy as np

def seed_plusplus(X, k, rng):
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids, dtype=float)

def kmeans(X, k, max_iter=100, rng=None):
    rng = rng or np.random.default_rng()
    centroids = seed_plusplus(X, k, rng)
    labels = np.full(len(X), -1)

    for _ in range(max_iter):
        # assignment: (n, k) distance matrix via broadcasting, then argmin
        d = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # converged
        labels = new_labels

        # update: mean of each cluster
        for c in range(k):
            members = X[labels == c]
            if len(members):                        # guard empty clusters
                centroids[c] = members.mean(axis=0)

    inertia = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, inertia

# Real code just calls scikit-learn, which does all of the above plus
# k-means++, n_init restarts, and Elkan's triangle-inequality speedup:
#   from sklearn.cluster import KMeans
#   km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
#   km.labels_, km.cluster_centers_, km.inertia_

For real work, use sklearn.cluster.KMeans: it ships k-means++, runs n_init restarts and keeps the best, and uses Elkan's algorithm — caching distance bounds via the triangle inequality to skip the vast majority of distance computations once clusters settle.

Variants worth knowing

K-means++. Not a different algorithm — a better way to choose the starting centroids, shown above. Seed proportional to squared distance from the nearest existing center. This single change is the difference between reliable and flaky results, and it's the default everywhere.

Mini-batch k-means. Each update uses a random sample of points instead of the full dataset, applying a decaying-learning-rate move to the centroids. Roughly 10× faster on large data with a small inertia penalty; the standard choice past a few hundred thousand rows.

K-medoids (PAM). The "centroid" must be an actual data point (a medoid), and you can plug in any distance metric, not just Euclidean. Far more robust to outliers, but O(n²) per iteration — much slower.

K-medians. Minimizes the sum of L1 (Manhattan) distances and uses the per-feature median as the center. The median resists outliers far better than the mean, so it's the robust cousin of plain k-means.

Fuzzy c-means and Gaussian mixtures. Replace hard assignment with soft membership. GMM additionally fits a full covariance per cluster, so it can model elliptical, differently-oriented groups — k-means is the limiting case of a GMM with shared spherical covariance shrunk to zero.

Spherical k-means. Uses cosine similarity instead of Euclidean distance, normalizing points to the unit sphere first. The standard choice for clustering high-dimensional text TF-IDF vectors, where direction matters more than magnitude.

Common bugs and edge cases

  • Forgetting to scale features. The single most common mistake. Euclidean distance lets the largest-range feature dominate; income in dollars will bury age in years. Standardize to zero mean and unit variance first.
  • Empty clusters. If no point is nearest to some centroid, its mean is undefined (a divide-by-zero). Handle it explicitly: re-seed that centroid to the point farthest from any existing center, or to a random point. Silently skipping it (as the guards above do) leaves a dead centroid.
  • Trusting a single run. One random init gives one local minimum, often a bad one. Always run multiple seeds and keep the lowest-inertia result — n_init exists for exactly this.
  • Inertia always drops as k rises. You cannot pick k by minimizing inertia — it monotonically decreases to zero at k=n. Use the elbow, the silhouette score, or the gap statistic instead.
  • Comparing inertia across different scalings. Inertia is in the units of your (scaled) features; it's only meaningful relative to other runs on the same preprocessing.
  • Using it on categorical data. The mean of "red, blue, green" is meaningless. Use k-modes for categorical features, or k-prototypes for mixed numeric-and-categorical data.
  • The curse of dimensionality. In very high dimensions, all pairwise distances converge toward equal, so "nearest centroid" loses meaning. Reduce dimensions first (PCA, UMAP) before clustering.

Frequently asked questions

Does k-means always find the best clustering?

No. K-means converges to a local minimum of within-cluster variance, not the global one. The result depends entirely on where the centroids start, which is why production code runs it 10 or more times with different seeds (n_init) and keeps the lowest-inertia run. Finding the true optimum is NP-hard.

How do you choose the number of clusters k?

There is no k inside the data — you supply it. The elbow method plots inertia against k and looks for the bend where adding clusters stops helping much. The silhouette score (−1 to 1) measures how well-separated clusters are. The gap statistic compares your inertia to that of uniform random data. None of these is definitive; domain knowledge usually wins.

Why does k-means only find round, equal-sized clusters?

Because it assigns points by Euclidean distance to a single mean, the implied decision boundary between two centroids is a straight line (a Voronoi cell), and minimizing squared distance favors spherical, similarly-sized blobs. Elongated, nested, or differently-sized clusters break it. Gaussian mixture models or DBSCAN handle those shapes better.

What is k-means++ and why use it?

K-means++ is a smarter way to pick the initial centroids: choose the first at random, then choose each next center with probability proportional to its squared distance from the nearest already-chosen center. This spreads the seeds out, giving an O(log k) expected approximation guarantee and far fewer bad local minima. It is the default initializer in scikit-learn.

How fast is k-means?

Each iteration of Lloyd's algorithm is O(n · k · d): for every one of n points you compute its distance to all k centroids in d dimensions. The number of iterations is usually 10 to 50 in practice. Worst-case iteration count is superpolynomial, but that essentially never happens on real data.

Do you need to scale features before running k-means?

Almost always. K-means uses Euclidean distance, so a feature measured in dollars (range 0–100000) will completely dominate one measured in years (range 0–80). Standardize each feature to zero mean and unit variance first, or the clustering just reflects whichever column has the largest numeric range.