Machine Learning

Principal Component Analysis (PCA)

Spin the data until its longest shadow lines up with an axis

Principal Component Analysis (PCA) is a linear dimensionality-reduction method that rotates data onto the orthogonal axes of greatest variance — the eigenvectors of the covariance matrix — so you can keep most of the structure in far fewer dimensions.

  • Cost (covariance + eigen)O(n·d² + d³)
  • Cost (truncated SVD, k comps)O(n·d·k)
  • Components areorthogonal & uncorrelated
  • Captureslinear structure only
  • InventedPearson 1901 · Hotelling 1933

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How PCA works

Imagine a cloud of points stretched out like a flying-saucer disc — long in one direction, medium in another, paper-thin in a third. PCA finds those directions in order of how spread out the data is, then lets you throw away the thin ones. The directions are called principal components, and the spread along each one is its variance.

The recipe is four steps, and every PCA implementation is some variation on it:

  1. Center the data. Subtract the mean of each feature so the cloud is centered on the origin. Without this, the first component points at the mean instead of the direction of spread.
  2. Build the covariance matrix. For a centered data matrix X with n rows and d columns, the covariance is C = XᵀX / (n − 1), a d × d symmetric matrix whose entry C[i][j] measures how features i and j vary together.
  3. Eigendecompose it. Find the eigenvectors and eigenvalues of C. The eigenvectors are the principal-component directions; the eigenvalues are the variance captured along each. Sort them from largest eigenvalue to smallest.
  4. Project. Keep the top k eigenvectors as columns of a matrix W, and transform your data with Z = X·W. The result has k columns instead of d, ordered so the first column holds the most information.

The key insight: because C is real and symmetric, the spectral theorem guarantees its eigenvectors are mutually orthogonal. PCA is therefore a pure rotation of the coordinate system (plus a projection). It never warps distances within a plane — it just picks a better-aligned set of axes, then drops the ones along which nothing happens.

Reading the explained variance

Each eigenvalue λᵢ is the variance along component i. The explained variance ratio is λᵢ / Σλ — the fraction of the dataset's total spread that one component accounts for. This single number is how you decide how aggressively to compress.

A classic result: on the MNIST handwritten-digit dataset (784 pixel features), the first 2 components capture only about 17% of the variance — too lossy for classification, but enough for a readable 2-D scatter plot. Climb to roughly 154 components and you recover 95% of the variance, an 80% reduction in dimensions while keeping nearly all the signal. The numbers depend on the dataset, but the shape always holds: the first few components are worth a lot, and a long tail is worth almost nothing.

When to use PCA (and when not to)

  • Compression and denoising — store or transmit data in k dimensions, then reconstruct an approximation. Dropping the low-variance tail often discards mostly noise.
  • Speeding up downstream models — a 784-dimensional input squeezed to 50 components trains a classifier far faster, and frequently no less accurately.
  • Visualization — project to 2 or 3 components to plot a high-dimensional dataset the eye can actually read.
  • Decorrelating features — because components are uncorrelated, PCA is a standard preprocessing step for models that struggle with collinearity.

Skip PCA when interpretability matters — a principal component is a weighted blend of all your original features, so "0.3 × income + 0.5 × age − 0.2 × tenure" is rarely something a stakeholder can act on. Skip it when the structure is nonlinear (a spiral, a Swiss roll), because PCA can only rotate and project, never unbend. And skip it when you have few features already; reducing from 5 dimensions to 4 buys you nothing.

PCA vs other dimensionality-reduction methods

PCAt-SNEUMAPLDAAutoencoderRandom projection
CapturesLinear varianceLocal neighborhoodsLocal + some globalClass separationArbitrary nonlinearDistances (approx.)
Supervised?NoNoNoYes (uses labels)No (usually)No
Deterministic?Yes (up to sign)No (random init)NoYesNoNo (random matrix)
CostO(nd² + d³)O(n²)≈ O(n log n)O(nd² + d³)Training loopO(ndk)
Invertible / reconstructYesNoNoPartlyYes (decoder)Approx.
New points (transform)O(dk), trivialRefit or parametric variantHas transform()O(dk)Forward passO(dk)
Best forCompression, preprocessing2-D cluster plotsLarge-scale embeddingsClassification preprocessingNonlinear featuresCheap huge-d reduction

The headline trade-off: PCA is fast, deterministic, invertible, and trivial to apply to new data — but strictly linear. t-SNE and UMAP make gorgeous cluster plots but distort global distances and can't be inverted; their inter-cluster gaps are meaningless. A common pipeline runs PCA down to ~50 dimensions first (cheap, kills noise) and then t-SNE or UMAP to 2 — combining PCA's speed with the nonlinear method's local fidelity.

What the numbers actually say

  • Full eigendecomposition is O(n·d² + d³). Forming the d × d covariance from n points costs n·d²; decomposing it costs . For 1 million images at d = 784, the n·d² term (≈6 × 10¹¹ ops) dominates the term (≈5 × 10⁸).
  • You rarely need the full decomposition. Randomized / truncated SVD computes only the top k components in roughly O(n·d·k). Asking for 50 components out of 784 is about 16× cheaper than the full work per pass.
  • Reconstruction error equals the discarded variance. Eckart–Young proves the rank-k PCA projection is the best possible rank-k approximation under squared error — no other linear method beats it. The total squared error is exactly the sum of the dropped eigenvalues.
  • Whitening is one extra division. Dividing each projected component by √λᵢ ("whitening") makes the output have unit variance in every direction — useful before algorithms that assume isotropic inputs.

JavaScript implementation

A compact 2-D PCA using the closed-form eigenvectors of a 2×2 symmetric covariance matrix — enough to see every step without a linear-algebra library.

function pca2D(points) {
  const n = points.length;
  // 1. Center
  const mx = points.reduce((s, p) => s + p[0], 0) / n;
  const my = points.reduce((s, p) => s + p[1], 0) / n;
  const C = points.map(p => [p[0] - mx, p[1] - my]);

  // 2. Covariance matrix [[a, b], [b, d]]
  let a = 0, b = 0, d = 0;
  for (const [x, y] of C) { a += x * x; b += x * y; d += y * y; }
  a /= (n - 1); b /= (n - 1); d /= (n - 1);

  // 3. Eigenvalues of a symmetric 2x2 (closed form)
  const tr = a + d, det = a * d - b * b;
  const disc = Math.sqrt(tr * tr / 4 - det);
  const l1 = tr / 2 + disc;   // larger eigenvalue → PC1
  const l2 = tr / 2 - disc;

  // Eigenvector for l1: (b, l1 - a) normalized (fallback if b ≈ 0)
  let ex = b, ey = l1 - a;
  if (Math.abs(b) < 1e-12) { ex = a >= d ? 1 : 0; ey = a >= d ? 0 : 1; }
  const norm = Math.hypot(ex, ey);
  const pc1 = [ex / norm, ey / norm];
  const pc2 = [-pc1[1], pc1[0]];   // orthogonal

  // 4. Project onto PC1 (1-D reduction)
  const projected = C.map(([x, y]) => x * pc1[0] + y * pc1[1]);

  return { pc1, pc2, eigenvalues: [l1, l2], projected,
           explainedVariance: l1 / (l1 + l2) };
}

Two details that bite people. First, the eigenvector formula (b, λ − a) degenerates when the off-diagonal b is zero (already-aligned data) — the fallback handles that. Second, the sign of an eigenvector is arbitrary: pc1 and −pc1 are equally valid, so two libraries can hand you flipped axes from identical data. Never compare raw component signs across runs.

Python implementation

The honest production way is SVD on the centered matrix — never form XᵀX, which squares the condition number and loses precision on near-collinear features.

import numpy as np

def pca_svd(X, k):
    # 1. Center each column
    mean = X.mean(axis=0)
    Xc = X - mean

    # 2. SVD of the centered data:  Xc = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    # 3. Rows of Vt are the principal directions (already sorted by S)
    components = Vt[:k]                       # shape (k, d)

    # 4. Project; equivalently  Z = U[:, :k] * S[:k]
    Z = Xc @ components.T                     # shape (n, k)

    # Variance along each component, and the ratio
    explained = (S ** 2) / (len(X) - 1)
    ratio = explained / explained.sum()
    return Z, components, ratio[:k], mean

def reconstruct(Z, components, mean):
    # Best rank-k approximation (Eckart-Young optimal)
    return Z @ components + mean

# scikit-learn one-liner equivalent:
# from sklearn.decomposition import PCA
# Z = PCA(n_components=k, svd_solver="randomized").fit_transform(X)

Note that np.linalg.svd returns the directions pre-sorted by singular value, so no manual eigenvalue sort is needed — a frequent source of bugs in hand-rolled covariance-based versions. For large n and small k, swap in sklearn's randomized solver or scipy.sparse.linalg.svds to compute only the top components.

Variants worth knowing

Kernel PCA. Apply the kernel trick — implicitly map data into a high-dimensional feature space and run PCA there — so you can capture nonlinear structure. An RBF kernel can unroll the spiral that plain PCA collapses. The cost is an n × n kernel matrix, which is O(n²) memory.

Incremental / online PCA. Update the components as data streams in, processing one mini-batch at a time. Essential when the dataset doesn't fit in memory; scikit-learn's IncrementalPCA trades a tiny accuracy loss for constant memory.

Sparse PCA. Adds an L1 penalty so each component loads on only a few original features. This buys back the interpretability that vanilla PCA destroys — a component might become "just these 6 genes" instead of a blend of 20,000.

Robust PCA. Decomposes the data into a low-rank matrix plus a sparse matrix, isolating gross outliers and corruptions into the sparse part. Used in video background subtraction, where the static background is low-rank and moving objects are sparse.

Randomized PCA. Uses random projections to approximate the top k components without touching the full spectrum. It's the default solver in scikit-learn for large datasets and is typically several times faster than the exact route with negligible error.

Common bugs and edge cases

  • Forgetting to center. Skip the mean subtraction and your first component will point toward the data's mean rather than its direction of spread. Centering is non-negotiable; standardization is situational.
  • Not standardizing mixed-unit features. Variance is unit-dependent, so a feature in dollars swamps one in years. Divide by the standard deviation unless every feature already shares a unit.
  • Fitting PCA on train and test together. Leaks test statistics into the mean and covariance. Fit on the training set, then apply the same transform to test data — fit then transform, never fit_transform on the whole set.
  • Comparing eigenvector signs across runs. Sign is arbitrary; a component can flip between library versions or random seeds. Compare projected variances or absolute loadings, not raw signs.
  • Expecting PCA to find clusters. PCA maximizes variance, not class separation. Two well-separated clusters can collapse onto the same component if the spread within clusters happens to exceed the spread between them. Use LDA if you have labels.
  • Letting outliers steer the basis. A few extreme points can rotate the entire component basis because squared distance rewards them. Clip or remove outliers, or reach for robust PCA.
  • Reading too much into low-variance components. The tail components often encode noise or numerical artifacts, not meaningful structure. Don't over-interpret PC50.

Frequently asked questions

Do I have to standardize my data before PCA?

Almost always yes when your features are in different units. PCA maximizes variance, so a feature measured in dollars (variance in the millions) will dominate a feature measured in years (variance in the tens) purely because of scale. Center every column to mean zero, and divide by the standard deviation unless all features already share the same unit and scale.

How many principal components should I keep?

Pick the smallest number of components whose cumulative explained variance ratio clears your target — 90% or 95% are common. Plot the eigenvalues (a scree plot) and look for the elbow where they flatten, or keep all components with eigenvalue above 1 on standardized data (the Kaiser rule). There is no single correct answer; it is a variance-budget decision.

What is the difference between PCA and SVD?

They compute the same thing from different angles. PCA finds eigenvectors of the covariance matrix; the SVD of the centered data matrix X = UΣVᵀ gives those same directions in V, and each singular value relates to its eigenvalue by σ = √(λ·(n−1)). Production libraries run SVD on X directly because forming XᵀX squares the condition number and loses numerical precision.

Why are principal components always orthogonal?

Because they are eigenvectors of the covariance matrix, which is real and symmetric. The spectral theorem guarantees that a real symmetric matrix has a full set of mutually orthogonal eigenvectors. Orthogonality is what makes the rotated coordinates uncorrelated — each component captures variance the earlier ones missed.

Can PCA capture nonlinear structure like a spiral or a Swiss roll?

No. PCA only finds linear combinations of features, so it can rotate, stretch, and project but never unbend a curve. A spiral or a Swiss roll collapses into a meaningless blob. For nonlinear manifolds use kernel PCA, t-SNE, UMAP, or an autoencoder instead.

Is PCA sensitive to outliers?

Very. Because it maximizes squared variance, a single far-flung point can swing the first component toward itself. A handful of outliers can rotate the whole basis. Remove or clip outliers first, or use robust PCA, which decomposes the data into a low-rank part plus a sparse outlier part.