Principal Component Analysis

Find the directions a dataset stretches along, keep only the longest

Principal component analysis finds the directions of greatest variance in a dataset by computing the eigenvectors of its covariance matrix (or, equivalently, the right singular vectors of the centered data). It is one of the most widely used dimensionality-reduction techniques in statistics and machine learning.

  • Goal: Project data onto axes that maximize variance
  • First step: Center each column to zero mean
  • Components: Eigenvectors of the covariance matrix
  • Variance explained: σᵢ² / Σⱼ σⱼ²
  • Best computed via: SVD of the centered data matrix
  • Used in: Visualization, denoising, feature engineering, eigenfaces, gene-expression analysis

The setup

Stack n observations of d features into an n×d data matrix X. Each row is one data point. The aim of PCA is to find a small set of new axes — k of them, with k ≪ d — such that projecting onto those axes preserves as much variance as possible.

Equivalently, PCA finds the rank-k approximation of the centered data matrix that minimizes the squared (Frobenius-norm) reconstruction error. The Eckart–Young theorem guarantees that the two formulations select the same subspace.
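The equivalence can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic random dataset (the seed, shapes, and variable names are illustrative, not from the article): the sum of squares kept by the top k directions and the squared error of the rank-k reconstruction split the total sum of squares of X̃ between them.

```python
import numpy as np

rng = np.random.default_rng(0)                            # illustrative seed
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
Xc = X - X.mean(axis=0)                                   # centered data matrix

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Vk = Vt[:k].T                     # d x k matrix of the top-k principal directions

scores = Xc @ Vk                  # n x k projected coordinates
X_hat = scores @ Vk.T             # rank-k reconstruction of Xc

kept = np.sum(scores ** 2)        # sum of squares retained by the projection
lost = np.sum((Xc - X_hat) ** 2)  # squared reconstruction error

print(kept, np.sum(s[:k] ** 2))   # these two agree
print(lost, np.sum(s[k:] ** 2))   # and so do these two
print(kept + lost, np.sum(Xc ** 2))
```

Maximizing the first quantity and minimizing the second are therefore the same problem, since their sum is fixed by the data.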

The five-step recipe

  1. Center. Subtract each column's mean. Call the result X̃. This is non-negotiable — without it the first component points toward the centroid, not toward the direction of spread.
  2. (Optional) Scale. If features are in different units, divide each centered column by its standard deviation so each contributes equally.
  3. Decompose. Compute the SVD: X̃ = U·Σ·Vᵀ. The columns of V are the principal directions in feature space; the columns of U·Σ are the projected scores.
  4. Choose k. Look at the singular values. Keep enough components to explain a chosen fraction of total variance (typically 90% or 95%) or use the elbow of a scree plot.
  5. Project. Multiply X̃ · Vₖ, where Vₖ holds the first k columns of V, to get an n×k matrix of low-dimensional coordinates.

That is it. Nearly every PCA in a textbook, library, or paper boils down to these five steps.
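The five steps can be sketched in a few lines of NumPy. This is an illustrative helper under stated assumptions, not a library API: the function name `pca`, its parameters `var_target` and `scale`, and the synthetic demo data are all made up for this example.

```python
import numpy as np

def pca(X, var_target=0.95, scale=False):
    """PCA via SVD, following the five-step recipe (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # 1. Center each column.
    Xc = X - X.mean(axis=0)

    # 2. Optionally scale each column to unit sample standard deviation.
    if scale:
        Xc = Xc / Xc.std(axis=0, ddof=1)

    # 3. Thin SVD: the rows of Vt are the principal directions.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # 4. Smallest k whose cumulative variance fraction reaches the target.
    ratio = s ** 2 / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(ratio), var_target) + 1)
    k = min(k, len(s))

    # 5. Project onto the first k directions.
    Vk = Vt[:k].T
    scores = Xc @ Vk
    variances = s[:k] ** 2 / (n - 1)
    return scores, Vk, variances, ratio[:k]

# Demo on made-up data whose four columns have very different spreads.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 0.5, 0.1])
scores, Vk, variances, ratio = pca(X_demo, var_target=0.95)
print(scores.shape, ratio)
```

Returning the variances alongside the scores keeps the "choose k" decision inspectable after the fact; a production implementation would also need to store the column means so new data can be projected consistently.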

Why SVD instead of eigendecomposition

The covariance matrix is C = X̃ᵀX̃ / (n − 1). Its eigenvectors are exactly the columns of V from the SVD of X̃, and its eigenvalues are σᵢ²/(n − 1). So you can compute PCA two ways:

  1. Form C and run an eigendecomposition. Costs O(d²·n) for the matrix product plus O(d³) for eigen.
  2. Run SVD on X̃ directly. Costs O(max(n, d)·min(n, d)²) and never forms C.

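Path 2 wins on accuracy. Forming C squares the condition number, so in floating point any singular value smaller than roughly √ε·σ₁ (with ε the machine epsilon) is swamped by rounding error. Those small singular values are the noise floor, which is often exactly what you want to study. For high-precision data analysis, always go through SVD.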
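The accuracy gap is easy to reproduce. The sketch below assumes NumPy and a synthetic matrix built with chosen singular values (the values 1, 1e-4, 1e-9, the seed, and all names are illustrative); X stands in for an already-centered data matrix. It recovers the smallest singular value both ways and compares the relative errors.

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed
n, d = 50, 3
true_s = np.array([1.0, 1e-4, 1e-9])           # chosen singular values

# Build X = Q1 * diag(true_s) * Q2^T from random orthonormal factors.
Q1, _ = np.linalg.qr(rng.normal(size=(n, d)))
Q2, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = (Q1 * true_s) @ Q2.T

# Path 2: SVD directly on X.
s_svd = np.linalg.svd(X, compute_uv=False)

# Path 1: eigenvalues of X^T X, then square roots (clipping tiny negatives).
evals = np.linalg.eigvalsh(X.T @ X)[::-1]
s_eig = np.sqrt(np.clip(evals, 0.0, None))

rel_err_svd = abs(s_svd[-1] - true_s[-1]) / true_s[-1]
rel_err_eig = abs(s_eig[-1] - true_s[-1]) / true_s[-1]
print("SVD route, relative error on smallest sigma:", rel_err_svd)
print("eig route, relative error on smallest sigma:", rel_err_eig)
```

In double precision the squared value 1e-18 sits below the rounding level of X̃ᵀX̃, so the eigenvalue route should return mostly noise for the last component while the SVD route stays close to 1e-9.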

Worked example — PCA on a 2-D dataset

Six observations of (x, y):

(2.5, 2.4)  (0.5, 0.7)  (2.2, 2.9)
(1.9, 2.2)  (3.1, 3.0)  (1.7, 1.6)

Step 1: center. Column means are x̄ ≈ 1.983, ȳ ≈ 2.133. Subtract:

X̃ ≈ [  0.517   0.267 ]
     [ −1.483  −1.433 ]
     [  0.217   0.767 ]
     [ −0.083   0.067 ]
     [  1.117   0.867 ]
     [ −0.283  −0.533 ]

Step 2: covariance matrix. C = X̃ᵀX̃ / (n−1) ≈ [[0.7697, 0.7087], [0.7087, 0.7507]].

Step 3: eigenvalues of C. Solve det(C − λI) = 0 → λ² − 1.5203·λ + 0.0756 = 0 → λ₁ ≈ 1.4689, λ₂ ≈ 0.0514.

Step 4: eigenvectors.

  • For λ₁: v₁ ≈ [0.712, 0.702]ᵀ. (Normalized.)
  • For λ₂: v₂ ≈ [−0.702, 0.712]ᵀ. (Orthogonal to v₁.)

Step 5: variance explained.

fraction along PC1 = λ₁ / (λ₁ + λ₂) ≈ 1.4689 / 1.5203 ≈ 96.6%
fraction along PC2 ≈ 3.4%

Almost all of the dataset's variance lives along the diagonal direction v₁ ≈ [0.712, 0.702]ᵀ. Project the centered data onto v₁ and you have a 1-D summary that retains about 97% of the original variance, which makes it good material for plotting or downstream regression.
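The worked example can be recomputed from the six observations with nothing but the Python standard library. A minimal sketch (variable names are illustrative); it solves the 2×2 characteristic polynomial in closed form rather than calling a general eigen-solver, so it only applies to the two-feature case:

```python
import math

points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9),
          (1.9, 2.2), (3.1, 3.0), (1.7, 1.6)]
n = len(points)

# Step 1: center.
x_bar = sum(x for x, _ in points) / n
y_bar = sum(y for _, y in points) / n
centered = [(x - x_bar, y - y_bar) for x, y in points]

# Step 2: sample covariance entries (divide by n - 1).
c_xx = sum(dx * dx for dx, _ in centered) / (n - 1)
c_yy = sum(dy * dy for _, dy in centered) / (n - 1)
c_xy = sum(dx * dy for dx, dy in centered) / (n - 1)

# Step 3: roots of lambda^2 - trace * lambda + det = 0.
trace = c_xx + c_yy
det = c_xx * c_yy - c_xy ** 2
root = math.sqrt(trace ** 2 - 4.0 * det)
lam1 = (trace + root) / 2.0
lam2 = (trace - root) / 2.0

# Step 4: unit eigenvector for lam1, from the first row of (C - lam1 I) v = 0.
vx, vy = c_xy, lam1 - c_xx
norm = math.hypot(vx, vy)
v1 = (vx / norm, vy / norm)
v2 = (-v1[1], v1[0])              # v1 rotated by 90 degrees, so orthogonal to it

# Step 5: fraction of variance along each component.
frac1 = lam1 / (lam1 + lam2)
frac2 = lam2 / (lam1 + lam2)

print("C =", [[round(c_xx, 4), round(c_xy, 4)], [round(c_xy, 4), round(c_yy, 4)]])
print("eigenvalues:", round(lam1, 4), round(lam2, 4))
print("v1 =", tuple(round(c, 3) for c in v1), "v2 =", tuple(round(c, 3) for c in v2))
print("variance fractions:", round(frac1, 3), round(frac2, 3))
```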

PCA via SVD vs covariance eigendecomposition

| | SVD on centered data | Eigendecomposition of covariance |
|---|---|---|
| Input | n×d centered matrix X̃ | d×d symmetric covariance C |
| Cost (n > d) | ≈ 4nd² + 8d³/3 | nd² (form C) + d³ (eigen) |
| Cost (n ≪ d) | ≈ 4n²d + 8n³/3 | nd² + d³ (bad when d is large) |
| Numerical accuracy | Excellent; never squares the condition number | Squares the condition number when forming C |
| Memory | nd | d² (the covariance matrix) |
| Recovers components | Right singular vectors of X̃ | Eigenvectors of C (the same vectors) |
| Recovers variances | σᵢ²/(n−1) | λᵢ directly |
| Streaming/incremental | Randomized and online SVD variants | Welford-style covariance updates |
| Library default | scikit-learn's PCA, MATLAB's pca | Older textbook code; rarely the right choice |
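The "Welford-style covariance updates" entry in the streaming row can be sketched as follows. This is a minimal pure-Python illustration, not a library API: the class name `RunningCovariance` and its methods are hypothetical, and the demo feeds in the six observations from the worked example one row at a time.

```python
class RunningCovariance:
    """Welford-style running mean and covariance for d features (illustrative)."""

    def __init__(self, d):
        self.n = 0
        self.mean = [0.0] * d
        # m2[i][j] accumulates the sum of (x_i - mean_i) * (x_j - mean_j).
        self.m2 = [[0.0] * d for _ in range(d)]

    def update(self, row):
        self.n += 1
        delta_old = [x - m for x, m in zip(row, self.mean)]   # against the old mean
        self.mean = [m + dx / self.n for m, dx in zip(self.mean, delta_old)]
        delta_new = [x - m for x, m in zip(row, self.mean)]   # against the new mean
        for i, di in enumerate(delta_old):
            for j, dj in enumerate(delta_new):
                self.m2[i][j] += di * dj

    def covariance(self):
        """Sample covariance (divides by n - 1); needs at least two rows."""
        return [[v / (self.n - 1) for v in r] for r in self.m2]


rc = RunningCovariance(d=2)
for row in [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9),
            (1.9, 2.2), (3.1, 3.0), (1.7, 1.6)]:
    rc.update(row)          # one pass, one row at a time
cov = rc.covariance()
print(cov)
```

The appeal is that the data never has to fit in memory; the price is that the components then come from an eigendecomposition of this d×d matrix, so the accuracy caveat in the table still applies.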

Where PCA shows up

  • Visualization. Project a high-dimensional dataset onto its top 2 or 3 components and plot. Often surfaces clusters, outliers, and gradients invisible in raw coordinates.
  • Feature engineering. Decorrelated, lower-dimensional features as input to a downstream model — useful when features are highly collinear (multicollinearity in regression, ill-conditioning in neural nets).
  • Image compression and "eigenfaces". Treat images as long vectors and run PCA on the dataset; the top components form an orthonormal basis for low-rank reconstructions.
  • Denoising. Project a noisy dataset onto its top components and reconstruct. Components below the noise floor are dropped, removing the noise that lives in those small directions.
  • Genetics and bioinformatics. PC1 vs PC2 of genome-wide SNP data classically recovers population structure (Cavalli-Sforza et al.). PCA of gene-expression matrices reveals cell types and disease subtypes.
  • Quantitative finance. First PCs of yield-curve changes are interpreted as level, slope, and curvature shifts; first PCs of equity returns capture market, sector, and style factors.
  • Latent-semantic indexing. Truncated SVD of a term-document matrix recovers latent topics — historically the bridge between PCA and modern embedding methods.
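As one concrete instance of the uses listed above, the denoising item can be sketched in NumPy. Everything here is an assumption made for illustration: the data is a synthetic rank-1 signal with added Gaussian noise, k is set to 1 because the signal rank is known in advance, and the names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed
n, d, k = 200, 10, 1

# Synthetic rank-1 signal plus isotropic noise (made-up data).
direction = np.ones(d) / np.sqrt(d)
clean = np.outer(rng.normal(size=n), direction)
noisy = clean + 0.1 * rng.normal(size=(n, d))

# PCA on the noisy data: center, SVD, keep the top-k directions.
mean = noisy.mean(axis=0)
Xc = noisy - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T

# Project onto the top-k components, map back, and restore the mean.
denoised = (Xc @ Vk) @ Vk.T + mean

err_before = np.mean((noisy - clean) ** 2)
err_after = np.mean((denoised - clean) ** 2)
print(err_before, err_after)
```

On real data the clean signal is unknown, so k has to be chosen from the spectrum instead; the comparison against `clean` is only possible because the example is synthetic.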

Common mistakes

  • Skipping the centering step. The result is mathematically meaningful but not what most people call PCA. Without centering, the first component is biased toward the data centroid.
  • Forgetting to scale. Mixing dollars and counts with picojoules and percentages without standardization makes PCA pick whichever feature has the largest raw variance, not the most informative one.
  • Computing PCA from a covariance matrix in single precision. The smallest variances vanish into rounding error. Use SVD, or at least double precision throughout.
  • Treating principal components as causes. PCs are linear combinations chosen for variance — they need not align with any physically meaningful axis. Naming a PC "the personality factor" is the realm of factor analysis, not PCA.
  • Applying PCA to non-linear manifolds. If the data lies on a curved surface (a Swiss roll, a sphere) PCA recovers the bounding box, not the manifold. Use kernel PCA, isomap, t-SNE, or UMAP for non-linear structure.
  • Reporting all components instead of selecting k. "Total variance explained" is 100% with all d components — that is just an orthonormal change of basis. Pick k to actually reduce dimensionality.
  • Forgetting sign indeterminacy. A component v and −v explain the same variance. If you reproduce a PCA plot from a paper and your axes are flipped, that is the reason — and not a bug.
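For the sign indeterminacy in the last item, one way to get reproducible plots is to fix a sign convention after the decomposition. The helper below is a hypothetical sketch assuming NumPy; the name `fix_signs` and the rule "largest-magnitude entry of each component is positive" are one possible convention among several.

```python
import numpy as np

def fix_signs(components):
    """Flip each column so its largest-magnitude entry is positive.

    `components` is a d x k array with one principal direction per column.
    """
    components = np.asarray(components, dtype=float)
    # Row index of the largest |entry| in each column.
    idx = np.argmax(np.abs(components), axis=0)
    signs = np.sign(components[idx, np.arange(components.shape[1])])
    signs[signs == 0] = 1.0          # leave all-zero columns untouched
    return components * signs

V = np.array([[0.6, -0.8],
              [0.8,  0.6]])          # made-up orthonormal directions
print(fix_signs(V))
print(fix_signs(-V))                 # same result: the convention removes the ambiguity
```

If the scores have already been computed, they need the same column-wise flips so that scores and components stay consistent.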

Frequently asked questions

Why does PCA require centering the data?

Variance is measured from the mean. If you skip centering, the first principal component points toward the data centroid instead of the direction of greatest spread, and your downstream variance-explained numbers are wrong. Always subtract column means as the first step. Many libraries do this automatically; check the docs before assuming.
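The effect is easy to see on synthetic data. The sketch below assumes NumPy and a made-up 2-D cloud whose centroid sits at (10, 10) while its spread runs along (1, −1); the helper name `first_direction` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)                       # illustrative seed
n = 500

# Synthetic cloud: centroid at (10, 10), spread mostly along (1, -1).
spread_dir = np.array([1.0, -1.0]) / np.sqrt(2.0)
X = (np.array([10.0, 10.0])
     + np.outer(rng.normal(size=n), spread_dir)
     + 0.05 * rng.normal(size=(n, 2)))

def first_direction(A):
    """First right singular vector of A (unit length, sign arbitrary)."""
    return np.linalg.svd(A, full_matrices=False)[2][0]

v_raw = first_direction(X)                           # centering skipped
v_centered = first_direction(X - X.mean(axis=0))     # proper PCA

centroid_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
print("uncentered, |cos| with centroid direction:", abs(v_raw @ centroid_dir))
print("centered,   |cos| with spread direction:  ", abs(v_centered @ spread_dir))
```

Both cosines should come out close to 1: without centering the leading direction tracks the centroid, and with centering it tracks the spread.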

Should I scale features before PCA?

Yes, when features are on very different scales. If income is in dollars and age is in years, the variance of income dominates and PCA picks income as the first component regardless of structure. Standard practice is to z-score: subtract the mean, divide by the standard deviation. Skip scaling only when features share the same physical units (e.g. pixel intensities).
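A small sketch of the income-and-age case, assuming NumPy. The numbers are invented for illustration (ages around 40 years, incomes around 50,000 dollars with a positive relationship), and `first_direction` is a made-up helper name.

```python
import numpy as np

rng = np.random.default_rng(0)                        # illustrative seed
n = 300

# Made-up data: age in years, income in dollars, positively related.
age = rng.normal(40.0, 10.0, size=n)
income = 50_000.0 + 1_000.0 * (age - 40.0) + rng.normal(0.0, 5_000.0, size=n)
X = np.column_stack([income, age])

def first_direction(A):
    """First principal direction of A after centering (sign arbitrary)."""
    Ac = A - A.mean(axis=0)
    return np.linalg.svd(Ac, full_matrices=False)[2][0]

v_raw = first_direction(X)

# z-score: subtract the mean, divide by the sample standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
v_scaled = first_direction(Z)

print("raw loadings    (income, age):", np.abs(v_raw))
print("scaled loadings (income, age):", np.abs(v_scaled))
```

Unscaled, the first component is essentially the income axis. After z-scoring, two features always end up with equal-magnitude loadings, so the component reflects their correlation rather than their units.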

Should I compute PCA via SVD or via the covariance matrix?

Use SVD on the centered data matrix. Forming the covariance matrix X̃ᵀX̃ / (n − 1) squares the condition number and loses information about small variances. SVD operates directly on X̃ and is numerically far more accurate. Production libraries (scikit-learn's PCA, MATLAB's pca) default to SVD.

How do I choose how many components to keep?

Three common rules: (1) fixed target: keep enough components to explain 90% or 95% of total variance; (2) elbow rule: plot σᵢ² against i and cut at the visible bend; (3) Kaiser criterion: drop components whose variance is below the average (for standardized features, that means eigenvalues of the correlation matrix below 1). The right answer depends on what comes after PCA. For visualization, keep 2 or 3 components; for downstream learning, cross-validate.
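The first and third rules are mechanical enough to sketch in plain Python. The function names and the singular values 3, 2, 1 below are made up for illustration; the elbow rule is left out because it relies on visual judgment.

```python
def explained_variance_ratio(singular_values):
    """Fraction of total variance carried by each component."""
    variances = [s * s for s in singular_values]
    total = sum(variances)
    return [v / total for v in variances]

def k_for_target(singular_values, target=0.95):
    """Smallest k whose cumulative variance fraction reaches `target`."""
    ratios = explained_variance_ratio(singular_values)
    cumulative = 0.0
    for k, r in enumerate(ratios, start=1):
        cumulative += r
        if cumulative >= target:
            return k
    return len(ratios)

def k_kaiser(singular_values):
    """Number of components whose variance is at least the average variance."""
    variances = [s * s for s in singular_values]
    average = sum(variances) / len(variances)
    return sum(1 for v in variances if v >= average)

sigma = [3.0, 2.0, 1.0]            # made-up singular values, sorted descending
print(k_for_target(sigma, 0.90))   # variances 9, 4, 1: two components carry 13/14
print(k_for_target(sigma, 0.95))   # 13/14 is below 0.95, so all three are needed
print(k_kaiser(sigma))             # only 9 exceeds the average variance 14/3
```

The rules can disagree sharply, as they do here, which is why the choice should follow from the downstream task.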

Are principal components always uncorrelated?

Yes, by construction. The principal directions are eigenvectors of the symmetric covariance matrix, so they are orthogonal, and the projected scores have zero pairwise covariance: the covariance matrix of the scores is diagonal, with the variances σᵢ²/(n−1) on the diagonal and zeros elsewhere. That decorrelation is one of PCA's primary appeals as a preprocessing step.
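This can be verified directly. A minimal sketch assuming NumPy and synthetic correlated features (seed, shapes, and names are illustrative): project onto all components, then inspect the sample covariance of the scores.

```python
import numpy as np

rng = np.random.default_rng(0)                            # illustrative seed
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))   # synthetic correlated features
n = X.shape[0]

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                                # scores on all d components

score_cov = np.cov(scores, rowvar=False)          # d x d sample covariance (n - 1 divisor)
off_diag = score_cov - np.diag(np.diag(score_cov))

print(np.max(np.abs(off_diag)))                   # rounding error only
print(np.diag(score_cov))                         # variances of the scores...
print(s ** 2 / (n - 1))                           # ...match sigma_i^2 / (n - 1)
```

The zero covariance holds for the data the components were fitted on; scores of new data projected onto the same directions are only approximately uncorrelated.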

Does PCA assume the data is Gaussian?

Not strictly — PCA only optimizes a variance objective and does not assume a distribution. But the directions PCA finds are most informative when the data clusters along low-dimensional Gaussian-like ellipsoids. Heavy-tailed data, manifolds with curvature, or categorical features often need ICA, t-SNE, UMAP, or autoencoders instead.