Machine Learning

UMAP

Squash a million dimensions onto a page — and keep the neighborhoods intact

UMAP projects high-dimensional data to 2D or 3D by building a fuzzy topological graph of each point's nearest neighbors and then laying that graph out so local neighborhoods stay intact — faster than t-SNE and better at preserving global structure.

  • Author / yearMcInnes, Healy & Melville · 2018
  • Full nameUniform Manifold Approximation and Projection
  • k-NN graph build≈ O(n^1.14) (NN-Descent)
  • Layout optimizationO(n · n_epochs) with SGD
  • Key knobsn_neighbors · min_dist

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The intuition: stretch the rubber sheet flat

Imagine a galaxy of 784-dimensional points — each one a 28×28 handwritten digit from MNIST. You can't look at 784 dimensions, but the digits don't actually fill that space. They live on a thin, crumpled sheet — a manifold — embedded inside it. Most of the 784D volume is empty; the real structure is low-dimensional and folded up. UMAP's job is to find that sheet and lay it flat on a page without tearing apart points that were neighbors.

The naive approach — Principal Component Analysis — slices the cloud along its directions of greatest variance. That's a linear cut, and a crumpled sheet doesn't unfold with straight cuts: distant folds collapse on top of each other. UMAP is nonlinear. It throws away absolute coordinates entirely and instead asks one question of every point: who are your nearest neighbors, and how strongly? The answer is a weighted graph. Lay that graph out in 2D so connected points stay close and unconnected points drift apart, and you've unfolded the manifold.

That graph-then-layout structure is the whole trick. UMAP — Uniform Manifold Approximation and Projection, published by Leland McInnes, John Healy and James Melville in 2018 — formalizes "who are your neighbors" as a fuzzy topological object and "lay it out" as a force-directed optimization. Everything else is bookkeeping.

How it works: build a fuzzy graph, then unfold it

UMAP runs in two phases.

Phase 1 — build the fuzzy simplicial set (the graph). For each point, find its n_neighbors nearest neighbors using an approximate algorithm (NN-Descent). Then decide how strongly each point is connected to each neighbor. Two corrections make this "fuzzy":

  • Local connectivity. Every point is guaranteed at least one neighbor at full strength. UMAP subtracts the distance to the nearest neighbor (a per-point offset called ρ) so that even an isolated point isn't stranded — the manifold is assumed locally connected.
  • Adaptive bandwidth. A per-point scale σ is tuned by binary search so that the sum of edge weights equals log₂(n_neighbors). Dense regions get a small σ, sparse regions a large one. This is what "uniform" means: under each point's own ruler, neighbors are spread evenly.

The directed edge weight from point i to neighbor j is a decaying exponential:

w(i → j) = exp( −max(0, d(i, j) − ρᵢ) / σᵢ )

This weight is asymmetric — i may count j as a close neighbor while j barely notices i. UMAP symmetrizes with a probabilistic fuzzy union (a t-conorm): the chance that at least one direction fires.

w(i, j) = w(i → j) + w(j → i) − w(i → j)·w(j → i)

Phase 2 — lay the graph out. Drop the points into 2D (spectral initialization by default, which already roughly unfolds the manifold) and run stochastic gradient descent. In the low-dimensional space, the modeled connection strength between two points uses a Student-t-like curve 1 / (1 + a·‖yᵢ − yⱼ‖^(2b)), where a and b are fit once from min_dist. UMAP minimizes the fuzzy set cross-entropy between the high-D graph and this low-D graph:

  • Attractive forces pull together pairs that share a high-D edge (sampled proportionally to edge weight).
  • Repulsive forces push apart randomly sampled non-neighbors (negative sampling, ~5 per edge by default).

Negative sampling is the speed secret: instead of computing repulsion against all n−1 other points (the O(n²) trap), UMAP repels against a handful of random points per edge per epoch. Cost becomes proportional to the number of edges, not the number of pairs.

The mechanism, with real complexity bounds

The two phases have very different cost profiles, and people routinely blame the wrong one for a slow run.

  • k-NN graph construction via NN-Descent is the empirical bottleneck on small-to-medium data. Its observed scaling is about O(n1.14) in the original paper — not the O(n log n) you might hope for and not the O(n²) of a brute-force search. This phase dominates wall-clock time below ~100k points.
  • Fuzzy graph assembly (the ρ/σ binary searches and symmetrization) is O(n · n_neighbors) — cheap and linear.
  • Layout optimization is O(n · n_epochs) per dataset, since each epoch touches every edge once and edge count is O(n · n_neighbors). Default n_epochs is 200 for large data, 500 for small. This phase dominates on millions of points.

Memory is O(n · n_neighbors) for the sparse graph — far smaller than the O(n²) dense distance matrix that naive embeddings build. With default n_neighbors = 15, a million-point graph holds ~15M edges, a few hundred megabytes, not the four terabytes a dense float32 matrix would need.

When to reach for UMAP

  • Exploratory visualization of embeddings — word2vec vectors, transformer hidden states, single-cell RNA expression. This is UMAP's home turf and where it has displaced t-SNE in most bioinformatics pipelines.
  • A reusable preprocessing layer. Because UMAP supports transform(), you can fit it once on training data and project new points into the same space at inference time — a clustering or visualization step inside a real pipeline.
  • Reducing to a handful of dimensions before clustering. UMAP into 5-20 dimensions (not just 2) and then HDBSCAN is a standard, effective combination; the original authors built both libraries.
  • Large datasets where t-SNE is too slow but you still need nonlinear structure that PCA can't capture.

Reach for something else when you need interpretable axes (PCA), a provably distance-preserving projection (random projection, Johnson-Lindenstrauss), or you only have a few hundred points where t-SNE's polish on tiny local clusters can look nicer. And never feed a raw 2D UMAP layout into a downstream model expecting the coordinates to mean anything quantitative — they don't.

UMAP vs other dimensionality reduction

UMAPt-SNEPCAAutoencoderIsomapRandom projection
Linear?NoNoYesNoNoYes
Preserves local structureStrongStrongWeakMediumStrongWeak
Preserves global structureMediumWeakStrongMediumStrongStrong (distance)
Speed (1M points)MinutesHoursSecondsMinutes (GPU)HoursSeconds
Embeds new points?Yes (transform)NoYesYesAwkwardYes
Interpretable axes?NoNoYesNoNoNo
ObjectiveFuzzy cross-entropyKL divergenceVariance / SVDReconstruction lossGeodesic MDSJL lemma

The headline contrast is with t-SNE, which UMAP is most often compared to because both produce those familiar islands of colored dots. UMAP wins on speed and on global layout — t-SNE's KL objective has almost no repulsive signal between far-apart clusters, so where clusters land relative to each other is essentially random across t-SNE runs. UMAP's cross-entropy keeps a genuine repulsion term, so the macro arrangement is more stable and more meaningful (though still not to be over-read).

What the numbers actually say

  • MNIST (70,000 × 784): UMAP embeds to 2D in roughly 30-90 seconds on a laptop CPU; Barnes-Hut t-SNE on the same data takes several minutes to tens of minutes. The 5-10× speedup is typical, not cherry-picked.
  • The σ binary search converges fast. Solving Σ exp(−(dᵢⱼ − ρᵢ)/σᵢ) = log₂(n_neighbors) per point takes ~64 bisection iterations to float precision — a fixed constant, so the whole phase stays O(n · n_neighbors).
  • Negative sampling rate of 5 means each positive edge is paired with ~5 repulsive samples per epoch. Raising it sharpens cluster boundaries but linearly increases cost; the default is a deliberate quality/speed compromise.
  • Default n_neighbors = 15, min_dist = 0.1. These two knobs explain the overwhelming majority of "my UMAP looks different from yours" complaints. The data didn't change; the layout parameters did.
  • Memory: the sparse k-NN graph for 1M points at n_neighbors=15 is ~15M nonzeros — orders of magnitude below the ~4 TB a dense 1M×1M float32 distance matrix would require.

JavaScript implementation (core layout step)

The full algorithm is large, but the heart of UMAP — the SGD layout with attraction along edges and repulsion via negative sampling — is compact. This is the loop that actually moves the points:

// Low-dimensional connection curve: 1 / (1 + a * d^(2b))
// a, b are fit from min_dist; ~1.577 / 0.895 for min_dist = 0.1.
function optimizeLayout(embedding, edges, weights, {
  a = 1.577, b = 0.895, nEpochs = 200,
  negSamples = 5, dim = 2,
} = {}) {
  const n = embedding.length / dim;          // flat Float32Array, row-major
  const clip = (v) => Math.max(-4, Math.min(4, v));

  for (let epoch = 0; epoch < nEpochs; epoch++) {
    const alpha = 1.0 * (1 - epoch / nEpochs); // learning rate decays to 0

    for (let e = 0; e < edges.length; e++) {
      // Sample this edge with probability proportional to its weight.
      if (Math.random() > weights[e]) continue;
      const [i, j] = edges[e];
      const oi = i * dim, oj = j * dim;

      // ── Attraction: pull neighbors i and j together ──
      let d2 = 0;
      for (let k = 0; k < dim; k++) { const t = embedding[oi+k] - embedding[oj+k]; d2 += t*t; }
      const gradCoef = d2 > 0
        ? (-2 * a * b * Math.pow(d2, b - 1)) / (1 + a * Math.pow(d2, b))
        : 0;
      for (let k = 0; k < dim; k++) {
        const g = clip(gradCoef * (embedding[oi+k] - embedding[oj+k])) * alpha;
        embedding[oi+k] += g;
        embedding[oj+k] -= g;
      }

      // ── Repulsion: push i away from random non-neighbors ──
      for (let s = 0; s < negSamples; s++) {
        const c = (Math.random() * n) | 0;       // random point
        if (c === i) continue;
        const oc = c * dim;
        let rd2 = 0;
        for (let k = 0; k < dim; k++) { const t = embedding[oi+k] - embedding[oc+k]; rd2 += t*t; }
        const repCoef = rd2 > 0
          ? (2 * b) / ((0.001 + rd2) * (1 + a * Math.pow(rd2, b)))
          : 0;
        for (let k = 0; k < dim; k++) {
          const g = (repCoef ? clip(repCoef * (embedding[oi+k] - embedding[oc+k])) : 4) * alpha;
          embedding[oi+k] += g;
        }
      }
    }
  }
  return embedding;
}

Two details mirror the reference implementation. First, the gradient is clipped to ±4; without that clamp the very first epochs send overlapping points flying to infinity. Second, the learning rate alpha decays linearly to zero across epochs — the layout cools like simulated annealing, big moves early and fine adjustments late. Building the edges/weights arrays (the fuzzy graph) is the other half of the algorithm and is what the Python snippet below sketches.

Python: building the fuzzy graph

In practice you call the library — umap-learn — in three lines. But the conceptual core, the fuzzy-graph construction, is worth seeing spelled out:

# Production usage — what you actually write:
import umap
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X)        # X: (n_samples, n_features)
new_coords = reducer.transform(X_new)        # project unseen points

# ── Conceptual core: fuzzy graph weights from a k-NN search ──
import numpy as np

def fuzzy_simplicial_set(knn_dists, knn_idx, n_neighbors):
    """knn_dists, knn_idx: (n, k) arrays from a k-NN search (column 0 is the
    point itself, distance 0), sorted ascending; returns symmetric weights W."""
    n, k = knn_dists.shape
    rho   = knn_dists[:, 1]            # distance to nearest *non-self* neighbor
    sigma = np.zeros(n)

    target = np.log2(n_neighbors)      # desired sum of edge weights per point
    for i in range(n):                 # binary search sigma_i
        lo, hi, mid = 0.0, np.inf, 1.0
        for _ in range(64):
            psum = np.sum(np.exp(-np.maximum(0, knn_dists[i, 1:] - rho[i]) / mid))
            if abs(psum - target) < 1e-5:
                break
            if psum > target:
                hi = mid; mid = (lo + hi) / 2
            else:
                lo = mid; mid = mid * 2 if hi == np.inf else (lo + hi) / 2
        sigma[i] = mid

    # Directed weights, then symmetrize with the fuzzy (probabilistic) union.
    W = np.zeros((n, n))
    for i in range(n):
        for jdx in range(1, k):
            j = knn_idx[i, jdx]        # index of the jdx-th neighbor of i
            W[i, j] = np.exp(-max(0.0, knn_dists[i, jdx] - rho[i]) / sigma[i])
    return W + W.T - W * W.T           # fuzzy union: a + b - a*b

Note rho uses column index 1, not 0 — column 0 is the point's distance to itself (zero). Anchoring each point's strongest edge to its true nearest neighbor is the "local connectivity" assumption that keeps sparse outliers from being orphaned.

Variants worth knowing

DensMAP. Standard UMAP discards density information — a tight cluster and a diffuse one can render the same size. DensMAP adds a density-correlation term so that relative local density is preserved in the embedding, recovering some of the quantitative information UMAP normally throws away.

Parametric UMAP. Replaces the free-coordinate optimization with a trained neural network that maps input vectors to embedding coordinates. The network generalizes to new data instantly (a forward pass) and can be inverted, bridging UMAP with autoencoders.

Supervised and metric UMAP. Feed labels alongside the data and UMAP blends label agreement into the graph weights, sharpening class separation — useful when you want a visualization that respects known categories, or a label-aware feature reducer.

AlignedUMAP. Embeds a sequence of related datasets (time slices, batches) into a shared, comparable space so you can watch structure evolve frame to frame without the layout jumping randomly between runs.

GPU UMAP (RAPIDS cuML). A drop-in GPU implementation that pushes million-point embeddings from minutes to seconds, at the cost of some approximation differences in the k-NN step.

Common bugs and edge cases

  • Over-reading the geometry. The single most common mistake: treating cluster sizes and inter-cluster gaps as real. They are layout artifacts of n_neighbors and min_dist. UMAP preserves who is near whom, not how far apart things are.
  • Not scaling features first. UMAP uses Euclidean distance by default. A column measured in millions will dominate one measured in fractions. Standardize, or pass a metric appropriate to your data (cosine for embeddings, Hamming for binary).
  • Forgetting random_state kills parallelism. Setting a seed for reproducibility silently forces single-threaded execution. A "why is my UMAP suddenly 4× slower" bug report is usually a pinned seed.
  • Spurious mini-clusters at tiny n_neighbors. With n_neighbors below ~5, the graph fragments and UMAP invents tiny disconnected islands that don't correspond to real groups. Raise n_neighbors until the structure stabilizes.
  • Using 2D UMAP output as model features. Two coordinates throw away almost all the manifold. If you want a feature reducer, embed to 10-50 components, not 2 — the 2D version is for human eyes only.
  • Expecting transform() to match fit_transform() exactly. Projecting the training set through transform() gives a slightly different layout than the original fit, because transform freezes the embedding and only optimizes the query points.

Frequently asked questions

Is UMAP faster than t-SNE?

Almost always, and the gap grows with n. Both build a k-nearest-neighbor graph, but UMAP optimizes the layout with stochastic gradient descent and negative sampling, which scales roughly with the number of edges rather than the number of point pairs. On a million-row dataset UMAP typically finishes in a few minutes where Barnes-Hut t-SNE takes hours; on MNIST (70,000 points) UMAP is commonly 5-10× faster.

Can I trust the distances between clusters in a UMAP plot?

Only loosely. UMAP preserves which points are neighbors far better than how far apart distant clusters are. The size of a blob and the gap between two blobs are largely artifacts of the n_neighbors and min_dist settings, not real density or distance. Read adjacency, not absolute geometry.

What do n_neighbors and min_dist actually control?

n_neighbors trades local detail for global structure — small values (5-15) expose fine local manifolds, large values (50-200) emphasize the big picture. min_dist controls how tightly points may pack in the embedding: near 0 gives dense clumps good for clustering, while 0.5-0.99 spreads points out for readable scatter plots. Neither changes the data, only how the same fuzzy graph is laid out.

Can UMAP embed new, unseen points without re-running?

Yes — unlike t-SNE, UMAP supports a transform() step. It freezes the learned embedding, finds each new point's nearest neighbors among the training data, and optimizes only the new point's position. This makes UMAP usable as a reusable preprocessing layer in a production pipeline, not just a one-off plotting tool.

Why does UMAP use cross-entropy instead of t-SNE's KL divergence?

UMAP models the data as two fuzzy simplicial sets — one in high dimensions, one in the embedding — and minimizes their fuzzy set cross-entropy. Because cross-entropy penalizes both missing attractive edges and spurious repulsive ones, it gives UMAP a meaningful repulsive term that pushes unrelated clusters apart, which is why global structure survives better than under t-SNE's KL objective.

Is UMAP deterministic?

No. The k-NN approximation, the random initialization (when not using spectral init), and the negative sampling in SGD are all stochastic. Fix random_state for reproducibility, but be aware that pinning the seed disables UMAP's internal parallelism, so a reproducible run is single-threaded and slower.