Machine Learning

t-SNE

Turn distances into probabilities — and watch the clusters fall out

t-SNE embeds high-dimensional data into 2D or 3D by turning pairwise distances into neighbor probabilities and minimizing the KL divergence between them, revealing clusters that linear methods like PCA hide.

  • Full namet-distributed Stochastic Neighbor Embedding
  • Introducedvan der Maaten & Hinton, 2008
  • Objectiveminimize KL(P ‖ Q)
  • Exact costO(n²)
  • Barnes-HutO(n log n)
  • Key knobperplexity (5–50)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How t-SNE works

You have 60,000 handwritten digits, each a 784-dimensional vector of pixel intensities. You want to see whether the threes cluster apart from the eights. You cannot plot 784 axes, and projecting onto the two highest-variance directions (PCA) smears the classes into an overlapping blob — variance and class separation are not the same thing. t-SNE solves a different problem: it preserves who is whose neighbor, and lets everything else go.

The algorithm runs in two halves, one in the original high-dimensional space and one in the 2D map it is building.

Step 1 — high-dimensional similarities. For each pair of points i, j, define a conditional probability that i would pick j as its neighbor, using a Gaussian centered on i:

p(j|i) = exp(−‖xᵢ − xⱼ‖² / 2σᵢ²) / Σₖ exp(−‖xᵢ − xₖ‖² / 2σᵢ²)

Crucially, each point gets its own bandwidth σᵢ. t-SNE binary-searches each σᵢ so that the entropy of point i's neighbor distribution matches a user-set perplexity — roughly, the effective number of neighbors. A point in a dense region gets a small bandwidth; a point in a sparse region gets a large one. The conditionals are symmetrized into joint probabilities pᵢⱼ = (p(j|i) + p(i|j)) / 2n so no single outlier dominates.

Step 2 — low-dimensional similarities. In the 2D map, similarity uses a Student-t distribution with one degree of freedom (a Cauchy):

qᵢⱼ = (1 + ‖yᵢ − yⱼ‖²)⁻¹ / Σₖ≠ₗ (1 + ‖yₖ − yₗ‖²)⁻¹

Step 3 — match the two distributions. t-SNE places the 2D points yᵢ to make Q look like P, minimizing the Kullback-Leibler divergence by gradient descent:

C = KL(P ‖ Q) = Σᵢⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ)

The gradient has a beautiful physical reading — it behaves like a system of springs. Points that are neighbors in high-D attract; all points repel each other like charged particles. The map settles when attraction and repulsion balance, and the heavy-tailed q lets clusters push far apart without penalty.

The crowding problem and why the t-distribution fixes it

Why a Student-t and not another Gaussian? Because of the crowding problem. In high dimensions there is exponentially more room: the volume of a sphere grows as r^d, so a point can have many neighbors all roughly equidistant. When you squeeze that neighborhood into 2D, there simply isn't enough area to keep every neighbor at its true distance. A Gaussian q would force moderately-distant points to be drawn close, collapsing clusters into a single ball.

The Student-t has heavy tails: its density falls off polynomially, not exponentially. That means a moderate distance in the map produces only a modest drop in q. Points that should be far can be drawn far at little cost, so distinct clusters separate cleanly instead of crowding into the center. This single design choice — KL divergence is the "SNE" part, the Student-t is the "t" — is the entire reason t-SNE plots look so much crisper than the original 2002 SNE.

When to use t-SNE (and when not to)

  • Use it for exploratory visualization. Sanity-checking that learned embeddings (word vectors, image features, single-cell RNA profiles) separate into the classes you expect.
  • Use it on a few thousand to ~50,000 points. Beyond that, runtime and the crowding of the 2D canvas both degrade; switch to UMAP or FIt-SNE.
  • Pre-reduce with PCA first. Running t-SNE on raw 784-D MNIST is slow and noisy; PCA down to ~30–50 dimensions denoises the input and speeds the neighbor search. scikit-learn does this for you when you pass init="pca" only for the map init, but you should still PCA the input yourself.

Do not use t-SNE as a general dimensionality-reduction preprocessing step feeding a downstream classifier — it has no parametric transform for new points and it destroys global geometry. Do not read cluster sizes or inter-cluster gaps as meaningful. And do not use it for clustering itself; run a real clustering algorithm (k-means, HDBSCAN) on the original space and color the t-SNE plot by the result.

t-SNE vs other dimensionality-reduction methods

t-SNEUMAPPCAMDSIsomapAutoencoder
Linear?NoNoYesMetric variantNo (geodesic)No
Preserves local structureExcellentExcellentPoorModerateGoodDepends
Preserves global structurePoorModerateExcellentGoodGoodDepends
Time complexityO(n log n) BHO(n log n)O(n d²)O(n²)–O(n³)O(n² log n)O(epochs · n)
Transforms new points?NoYesYesNoNo (approx)Yes
Distances trustworthy?Only local rankMostly localYesYesGeodesicNo
Main useVisualizationViz + featuresCompression, denoiseDistance mappingManifold unfoldingLearned features

The short version: PCA tells you the directions of greatest variance and is honest about distance; t-SNE and UMAP tell you the neighborhood graph and lie about distance. Pick the tool to match the question. If you need a map you can measure, t-SNE is the wrong instrument.

What the numbers actually say

  • Exact t-SNE is O(n²) in time and memory. The pairwise P matrix alone is floats — 10,000 points is 100M entries (~400 MB at 32-bit), 100,000 points is 10B entries and won't fit in RAM. This is the wall that forced the Barnes-Hut variant.
  • Barnes-Hut t-SNE is O(n log n). It approximates the repulsion sum with a quadtree, treating distant groups as a single point (the same trick as the n-body simulations it's named after). scikit-learn auto-selects it whenever method="barnes_hut" (the default), valid for output dimension ≤ 3.
  • Default iterations: 1,000, with the first 250 in "early exaggeration." During early exaggeration the pᵢⱼ are multiplied by ~12, exaggerating attraction so clusters form tight clumps with wide gaps before the layout relaxes.
  • Perplexity is bounded by data size. It must be less than the number of points; the original paper recommends 5–50. The cost of a wrong choice is real: perplexity 2 on MNIST shatters classes into specks; perplexity 100 merges neighboring digits.
  • FIt-SNE (2019) replaces Barnes-Hut repulsion with an FFT-accelerated interpolation and embeds the full MNIST 70k in roughly a minute on a laptop — about an order of magnitude faster than Barnes-Hut at that scale.

JavaScript implementation

A compact, readable exact t-SNE (O(n²), fine for a few hundred points in the browser). This is the core loop — perplexity calibration, symmetrized P, t-distributed Q, and the KL gradient.

// Pairwise squared Euclidean distances.
function pairwiseSqDist(X) {
  const n = X.length, D = Array.from({length: n}, () => new Float64Array(n));
  for (let i = 0; i < n; i++)
    for (let j = i + 1; j < n; j++) {
      let s = 0;
      for (let k = 0; k < X[i].length; k++) { const d = X[i][k] - X[j][k]; s += d * d; }
      D[i][j] = D[j][i] = s;
    }
  return D;
}

// Binary-search each sigma so neighbor entropy ≈ log2(perplexity).
function computeP(D, perplexity) {
  const n = D.length, P = Array.from({length: n}, () => new Float64Array(n));
  const target = Math.log(perplexity);
  for (let i = 0; i < n; i++) {
    let betaMin = -Infinity, betaMax = Infinity, beta = 1, sum = 1; // beta = 1/(2σ²)
    for (let iter = 0; iter < 50; iter++) {
      // unnormalized affinities for the current bandwidth, plus their sum and entropy
      sum = 0;
      for (let j = 0; j < n; j++) { P[i][j] = i === j ? 0 : Math.exp(-D[i][j] * beta); sum += P[i][j]; }
      let H = 0;
      for (let j = 0; j < n; j++) { const p = P[i][j] / sum; if (p > 1e-12) H -= p * Math.log(p); }
      const diff = H - target;
      if (Math.abs(diff) < 1e-5) break;
      // H falls as beta rises (tighter Gaussian ⇒ fewer effective neighbors)
      if (diff > 0) { betaMin = beta; beta = betaMax === Infinity ? beta * 2 : (beta + betaMax) / 2; }
      else          { betaMax = beta; beta = betaMin === -Infinity ? beta / 2 : (beta + betaMin) / 2; }
    }
    for (let j = 0; j < n; j++) P[i][j] /= sum; // normalize the converged row into p(j|i)
  }
  // Symmetrize: pij = (pj|i + pi|j) / 2n
  for (let i = 0; i < n; i++)
    for (let j = i + 1; j < n; j++) {
      const v = (P[i][j] + P[j][i]) / (2 * n);
      P[i][j] = P[j][i] = Math.max(v, 1e-12);
    }
  return P;
}

// One gradient-descent step of the KL objective in 2D.
function step(Y, P, lr, momentum, prev, exaggeration) {
  const n = Y.length;
  // Student-t affinities Q (and the normalizer Z).
  const num = Array.from({length: n}, () => new Float64Array(n));
  let Z = 0;
  for (let i = 0; i < n; i++)
    for (let j = i + 1; j < n; j++) {
      const dx = Y[i][0] - Y[j][0], dy = Y[i][1] - Y[j][1];
      const t = 1 / (1 + dx * dx + dy * dy);
      num[i][j] = num[j][i] = t; Z += 2 * t;
    }
  for (let i = 0; i < n; i++) {
    let gx = 0, gy = 0;
    for (let j = 0; j < n; j++) {
      if (i === j) continue;
      const q = num[i][j] / Z;
      const mult = (exaggeration * P[i][j] - q) * num[i][j]; // (pij − qij)·(1+d²)⁻¹
      gx += mult * (Y[i][0] - Y[j][0]);
      gy += mult * (Y[i][1] - Y[j][1]);
    }
    // Momentum + learning rate (gradient scaled by 4).
    prev[i][0] = momentum * prev[i][0] - lr * 4 * gx;
    prev[i][1] = momentum * prev[i][1] - lr * 4 * gy;
    Y[i][0] += prev[i][0];
    Y[i][1] += prev[i][1];
  }
}

function tsne(X, { perplexity = 30, iterations = 1000, lr = 200 } = {}) {
  const n = X.length;
  const P = computeP(pairwiseSqDist(X), perplexity);
  const Y = Array.from({length: n}, () => [ (Math.random() - 0.5) * 1e-4, (Math.random() - 0.5) * 1e-4 ]);
  const prev = Array.from({length: n}, () => [0, 0]);
  for (let it = 0; it < iterations; it++) {
    const momentum = it < 250 ? 0.5 : 0.8;
    const exaggeration = it < 250 ? 12 : 1; // early exaggeration
    step(Y, P, lr, momentum, prev, exaggeration);
  }
  return Y;
}

Two details that bite people. The gradient is multiplied by 4 — that factor falls out of differentiating the KL objective and is folded into the learning rate in many writeups. And note that early exaggeration multiplies P, not Q: it temporarily makes the targets larger so clusters pull together hard before the layout is allowed to relax.

Python implementation

In practice you call the library — and the call matters more than people think, because the defaults changed in scikit-learn 1.2.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)        # 1797 samples, 64 dims

# 1) Denoise / speed up: PCA to ~30 dims BEFORE t-SNE.
X30 = PCA(n_components=30, random_state=0).fit_transform(X)

# 2) t-SNE to 2D. Pin the seed and use PCA init for a stable layout.
emb = TSNE(
    n_components=2,
    perplexity=30,        # 5–50; must be < n_samples
    init="pca",           # far more stable than "random"
    learning_rate="auto", # max(n / early_exaggeration / 4, 50); the post-1.2 default
    n_iter=1000,
    random_state=42,
).fit_transform(X30)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=8)
plt.gca().set_aspect("equal"); plt.axis("off")
plt.title("t-SNE of the digits dataset (perplexity=30)")
plt.show()

If you need a from-scratch reference rather than the library, the canonical one is van der Maaten's own tsne.py: it implements exactly the four stages above — pairwise distances, perplexity-calibrated P, t-distributed Q, and gradient descent with momentum and early exaggeration — in about 60 lines of NumPy.

Variants worth knowing

Barnes-Hut t-SNE (2014). The standard scalable version. A quadtree summarizes distant clusters of points so repulsive forces cost O(n log n) instead of O(n²). This is what scikit-learn runs by default.

FIt-SNE (2019). "Fast Fourier Transform-accelerated Interpolation-based t-SNE." Replaces the Barnes-Hut tree with polynomial interpolation on a grid and an FFT convolution, giving a roughly 10× speedup and enabling million-point embeddings. Pairs with approximate-nearest-neighbor search for the input similarities.

openTSNE. A maintained Python package that adds a parametric transform() to embed new points into an existing map — closing one of t-SNE's biggest gaps versus UMAP — plus better initialization and affinity options.

UMAP (2018). Not a t-SNE variant strictly, but its spiritual successor: it optimizes a cross-entropy over a fuzzy topological graph, runs faster, keeps more global structure, and supports out-of-sample transform. For most large modern datasets, UMAP is the first thing to try.

Parametric t-SNE. Train a neural network to output the embedding, so the mapping generalizes to unseen data. Slower to fit but gives you a reusable encoder.

Common bugs and misreadings

  • Reading cluster sizes or gaps as meaningful. The single most common error. t-SNE equalizes density; a tight real cluster and a diffuse one can render the same size, and gaps between clusters carry no distance information.
  • Trusting one perplexity. Different perplexities can show contradictory pictures. Always sweep a few values (e.g. 5, 30, 50) before drawing conclusions, as the classic "How to Use t-SNE Effectively" Distill article showed.
  • Too few iterations. A plot that still looks like a uniform pinched ball usually hasn't converged. Let it run the full 1,000+ iterations; the early-exaggeration phase alone is 250.
  • Skipping the PCA pre-step. Running on raw high-D, noisy input is slow and lets noise dominate the neighbor graph. Pre-reduce to ~30–50 dims.
  • Not seeding it. Without a fixed random_state, every run mirrors/rotates differently and you can't reproduce a figure. Use init="pca" for stability.
  • Using it as a feature transform for a model. t-SNE has no honest out-of-sample map and discards global geometry. For features feeding a classifier, use PCA, UMAP, or a learned autoencoder.
  • Seeing clusters in random data. Even on pure noise, t-SNE produces visually convincing clumps at low perplexity. Always validate apparent structure against the original space.

Frequently asked questions

What does perplexity actually control in t-SNE?

Perplexity is a smooth measure of how many effective neighbors each point has — roughly the number of points t-SNE treats as local. Each point gets its own Gaussian bandwidth tuned so the entropy of its neighbor distribution equals log₂(perplexity). Low perplexity (5) emphasizes tight local structure; high perplexity (50) blends in more global context. Typical values are 5 to 50, and the result can change visibly across that range.

Why does t-SNE use a Student-t distribution in the low-dimensional space?

The Student-t with one degree of freedom (a Cauchy distribution) has much heavier tails than a Gaussian. This lets moderately distant points in the embedding sit far apart without incurring a large probability mismatch, which directly counteracts the crowding problem — the fact that there isn't enough room in 2D to preserve all the volume of a high-dimensional neighborhood. The heavy tail gives clusters room to spread out.

Can I trust the distances between clusters in a t-SNE plot?

No. t-SNE preserves local neighborhoods, not global geometry. Cluster sizes are meaningless — a dense cluster and a sparse one can render at the same diameter — and the gaps between clusters do not reflect true between-cluster distances. Two clusters drawn close together are not necessarily more similar than two drawn far apart. Read t-SNE as 'which points are neighbors,' never as a map with a scale.

What is the difference between t-SNE and UMAP?

Both build a neighbor graph and lay it out in low dimensions, but UMAP optimizes a cross-entropy over fuzzy simplicial sets instead of a KL divergence over Gaussian/t probabilities. In practice UMAP is several times faster, preserves more global structure, and produces a reusable transform for new points. t-SNE often gives crisper local cluster separation. UMAP has largely replaced t-SNE for large single-cell and embedding datasets since around 2018.

Why does t-SNE give a different picture every time I run it?

The embedding is initialized randomly and optimized by gradient descent on a non-convex objective, so different random seeds reach different local minima. Orientation, mirroring, and cluster placement vary run to run even when the cluster membership is stable. Fix the random_state for reproducibility, and prefer PCA initialization, which makes the layout far more stable and preserves some global structure.

How expensive is t-SNE on a large dataset?

Naive t-SNE is O(n²) in both time and memory because it sums over every pair of points each iteration. Barnes-Hut t-SNE approximates the repulsive forces with a quadtree and runs in O(n log n), which is the default in scikit-learn for the exact-vs-barnes_hut switch. Even so, beyond roughly 100,000 points people reach for UMAP, FIt-SNE (which uses an FFT-based interpolation), or GPU implementations.