Linear Algebra

Cholesky Decomposition

Name: Cholesky Decomposition — 60-second explainer
Uploaded: 2026-05-18T09:00:00Z
Duration: 1 min
Description: A symmetric positive-definite matrix A factors uniquely as A = LL^T where L is lower triangular. Twice as fast as LU, half the memory, no pivoting needed — the workhorse for normal equations, Kalman filters, and Monte Carlo sampling.

The square root of a positive-definite matrix — twice as fast as LU

A symmetric positive-definite matrix A factors uniquely as A = LL^T where L is lower triangular. Twice as fast as LU, half the memory, no pivoting needed — the workhorse for normal equations, Kalman filters, and Monte Carlo sampling.

FactorizationA = L · Lᵀ (L lower triangular, positive diagonal)
RequiresA symmetric positive definite
Factor costn³/3 flops — exactly half of LU's 2n³/3
Solve cost per b≈ n² (two triangular solves)
Memoryn²/2 — only the lower triangle of L
Used inNormal equations, Kalman filter, Monte Carlo (MVN sampling), Gaussian processes, FEM

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The factorization

Cholesky writes a symmetric positive-definite matrix A as

A = L · Lᵀ

where L is lower triangular with strictly positive entries on its diagonal. Think of L as the matrix square root of A — applied twice (once as L, once transposed as Lᵀ) it reproduces A. Because A must be symmetric, only the lower triangle of A is ever read; because L is triangular, only n(n+1)/2 entries are stored.

The factorization is unique once you fix the convention that L's diagonal is positive. No pivoting, no permutation matrix, no row swaps — positive definiteness guarantees every pivot is safe.

How L is built — column by column

Equate the (i, j) entries of A with the (i, j) entries of LLᵀ. For i = j (the diagonal):

Aᵢᵢ = Σ_(k=1..i) Lᵢₖ² ⇒ Lᵢᵢ = sqrt(Aᵢᵢ − Σ_(k<i) Lᵢₖ²)

For i > j (below the diagonal):

Aᵢⱼ = Σ_(k=1..j) Lᵢₖ Lⱼₖ ⇒ Lᵢⱼ = (Aᵢⱼ − Σ_(k<j) Lᵢₖ Lⱼₖ) / Lⱼⱼ

Process column j from j = 1 to n. Compute Lⱼⱼ first (the diagonal), then fill in Lᵢⱼ for i > j. Each entry uses only entries already computed in earlier columns — the algorithm is forward-marching with no look-ahead.

Worked example — a 3×3 SPD matrix

Factor

A = [  4   2   2 ]
    [  2   5   3 ]
    [  2   3   6 ]

Column 1.

L₁₁ = sqrt(A₁₁) = sqrt(4) = 2
L₂₁ = A₂₁ / L₁₁ = 2/2 = 1
L₃₁ = A₃₁ / L₁₁ = 2/2 = 1

Column 2.

L₂₂ = sqrt(A₂₂ − L₂₁²) = sqrt(5 − 1) = sqrt(4) = 2
L₃₂ = (A₃₂ − L₃₁ · L₂₁) / L₂₂ = (3 − 1·1)/2 = 1

Column 3.

L₃₃ = sqrt(A₃₃ − L₃₁² − L₃₂²) = sqrt(6 − 1 − 1) = sqrt(4) = 2

Final factor:

L = [ 2   0   0 ]
    [ 1   2   0 ]
    [ 1   1   2 ]

Verify by multiplying out LLᵀ:

LLᵀ = [ 2   0   0 ] [ 2  1  1 ]   [ 4   2   2 ]
      [ 1   2   0 ] [ 0  2  1 ] = [ 2   5   3 ]   ✓
      [ 1   1   2 ] [ 0  0  2 ]   [ 2   3   6 ]

Variants

LDLᵀ (square-root-free Cholesky). Write A = LDLᵀ where L is unit lower triangular (1s on the diagonal) and D is diagonal. Avoids the n square roots, which is faster on hardware where sqrt is slow and also works for symmetric indefinite matrices (with Bunch–Kaufman pivoting). This is what LAPACK's sytrf implements.
Pivoted Cholesky. P^TAP = LL^T with diagonal pivoting (swap the largest diagonal entry to the top at each step). Produces a rank-revealing factorisation for positive-semidefinite or near-singular matrices; LAPACK's pstrf does this.
Modified Cholesky (Gill–Murray–Wright). When A is not quite positive definite (e.g. a Hessian during nonlinear optimisation), perturb diagonal entries on the fly so the factorisation always succeeds. The output is L for a nearby SPD matrix A + E, with E as small as possible.
Sparse Cholesky. For graph Laplacians and finite-element matrices, the sparsity of L depends on the elimination order. Libraries (CHOLMOD, SuiteSparse) reorder rows/columns first (AMD, METIS) to minimise fill-in.
Block Cholesky. Partition A into 2 × 2 blocks; apply Cholesky to the top-left block, update the rest with the Schur complement, and recurse. Maps perfectly onto GPU matrix-matrix multiply kernels.

Algorithm — Cholesky in place

function cholesky(A):
    n = number of rows of A
    // L overwrites lower triangle of A in place
    for j in 0 .. n-1:
        // Diagonal entry
        s = A[j][j]
        for k in 0 .. j-1:
            s -= A[j][k] * A[j][k]
        if s <= 0: error "matrix not positive definite"
        A[j][j] = sqrt(s)

        // Off-diagonal entries in column j
        for i in j+1 .. n-1:
            s = A[i][j]
            for k in 0 .. j-1:
                s -= A[i][k] * A[j][k]
            A[i][j] = s / A[j][j]
    return A   // lower triangle holds L

Flop count: n³/3 (vs LU's 2n³/3) plus n square roots. For an n = 1000 SPD system the factorisation takes ≈ 0.33 GFLOP — running in milliseconds on a modern CPU. Once factored, every Ax = b solve costs only 2n² flops.

Cholesky vs LU vs QR — when to use which

	Cholesky (LLᵀ)	LU with partial pivot	QR	SVD
Requirement on A	Symmetric positive definite	Square nonsingular	Any m × n (m ≥ n)	Any m × n
Factor cost	n³/3	2n³/3	2mn² − 2n³/3	≈ 22n³ (m = n, full SVD)
Memory	n²/2	n²	mn (in Householder form)	n² + mn
Pivoting?	None needed	Row swaps	None for stability; column for rank	None
Stable for ill-conditioned A?	Yes (κ(A) up to ~10⁸)	Yes	Yes (better for least squares)	Best — reveals rank
Square roots	n	None	None	None on the matrix
Typical use	Covariance, normal equations, FEM, Kalman	General dense systems	Least squares, orthonormalisation	Rank, pseudoinverse, PCA

Why positive-definite — and how to check

A is symmetric positive definite (SPD) if A = Aᵀ and xᵀAx > 0 for every nonzero x. Equivalent characterisations:

All eigenvalues of A are strictly positive.
All leading principal minors of A are positive (Sylvester's criterion).
A admits a Cholesky factorisation A = LLᵀ with L having positive diagonal.

Running Cholesky is itself the cheapest SPD test — if the algorithm produces a non-positive Lⱼⱼ² value, A is not SPD. Many scientific libraries use Cholesky-attempt as the SPD certificate, and fall back to LDLᵀ or LU when it fails.

Where Cholesky shows up

Normal equations in linear regression. Least squares solves AᵀA x = Aᵀb. AᵀA is SPD (when A has full column rank), so Cholesky gives the cheapest path. QR is more numerically stable for ill-conditioned A but slower; Cholesky-on-normal-equations is the textbook recipe and is still standard for well-conditioned problems.
Sampling from a multivariate normal. If Σ = LLᵀ and z ∼ 𝒩(0, I), then x = μ + Lz ∼ 𝒩(μ, Σ). Every Monte Carlo simulator with correlated noise (portfolio risk, particle filters, climate ensembles) leans on this.
Kalman filter — covariance updates. The square-root Kalman filter (SR-KF) propagates the Cholesky factor of the state covariance instead of the covariance itself. Half the work, twice the numerical condition number — the standard in aerospace navigation.
Gaussian processes. Marginal likelihood involves log det(K) and α = K⁻¹y for a covariance matrix K. Cholesky gives both in one O(n³/3) factorisation: log det = 2 Σ log(Lᵢᵢ), and α via two triangular solves.
Finite-element method stiffness matrices. The global stiffness matrix is symmetric positive definite (for well-posed elliptic PDEs). Sparse Cholesky with fill-reducing reorderings (METIS/AMD) is the workhorse direct solver in structural mechanics.
Interior-point methods. The Newton system in primal-dual algorithms reduces to a symmetric positive-definite normal-equation matrix at each iteration — solved by Cholesky for the bulk of optimisation runtime.
Whitening transforms in machine learning. z = L⁻¹(x − μ) where Σ = LLᵀ — produces a standard-normal-distributed variable from a correlated one, a building block in PCA preprocessing and ZCA whitening.

Common mistakes

Trying Cholesky on a non-SPD matrix. If A has even one zero or negative eigenvalue, Cholesky aborts with a non-positive Lⱼⱼ². The fix is either LDLᵀ (Bunch–Kaufman pivoting) or adding a small ridge δ·I to A.
Using normal-equation-with-Cholesky on ill-conditioned least squares. AᵀA squares the condition number of A. If κ(A) ≈ 10⁵, then κ(AᵀA) ≈ 10¹⁰ — pushing past the safe limit (≈ 10⁸ in double precision). Use QR for least squares whenever κ(A) is borderline.
Forgetting that Cholesky is in-place. Most implementations overwrite A's lower triangle with L. The original A is destroyed unless you copied it first — a classic "where did my covariance go" bug.
Mixing up Lᵀ and L⁻¹. Lᵀ is the upper-triangular transpose; L⁻¹ is the inverse. Whitening uses L⁻¹; recoloring uses L. They are almost always different matrices.
Computing Σ⁻¹ explicitly when you only need solves. Once you have L, every solve costs 2n². Forming Σ⁻¹ explicitly costs n³ extra and loses precision. Solve, don't invert.
Ignoring symmetry on input. Most libraries assume A is symmetric and only read the lower (or upper) triangle. Passing a slightly non-symmetric matrix (e.g. from accumulated round-off) silently produces wrong results. Symmetrise via (A + Aᵀ)/2 before calling Cholesky on data you don't trust.

Frequently asked questions

What is Cholesky decomposition?

It is the factorisation of a symmetric positive-definite matrix A as A = LL^T where L is a lower-triangular matrix with positive diagonal entries. Existence and uniqueness both depend on A being symmetric and positive definite — fail either and Cholesky aborts. When applicable, it is the cheapest, most stable factorisation in linear algebra.

Why is Cholesky faster than LU?

Cholesky uses symmetry to skip half the work. It touches only the lower triangle of A and computes only the lower triangle of L. The flop count is n³/3 — exactly half of LU's 2n³/3 (or n³/3 with the symmetric-storage convention, double Cholesky's n³/6 traditional accounting). Memory drops too: n²/2 instead of n².

Why does Cholesky need no pivoting?

Positive definiteness guarantees every leading principal submatrix has positive determinant, so the diagonal entry that becomes the next pivot is always positive. There is never a zero (or destructively small) pivot in finite precision, which makes Cholesky unconditionally backward stable without any row swaps. Numerical failure on a numerically positive-definite matrix is rare and signals you are at the edge of positive-definiteness.

How is Cholesky used for sampling from a multivariate normal?

If z ~ N(0, I_n) is a vector of independent standard normals and Σ = LL^T, then x = μ + Lz is distributed as N(μ, Σ). The Cholesky factor L is the cheap, deterministic square-root of the covariance matrix used by every Monte Carlo simulator: portfolio risk in finance, particle filters, Bayesian posterior sampling, generative models with Gaussian latents.

What if my matrix is positive semidefinite but not strictly positive definite?

Standard Cholesky fails — a zero appears on the diagonal during factorisation. Use pivoted Cholesky (LAPACK's potrf with pivoting) which produces P^T A P = LL^T plus a rank certificate, or fall back to LDL^T which handles rank deficiency more gracefully. Adding a small ridge δ·I to A is the practical hack used everywhere in ML when covariance matrices become numerically singular.

Can I update or downdate a Cholesky factor?

Yes. After a rank-one update A' = A + xx^T (covariance after a new observation), there is an O(n²) algorithm that produces the new L' without redoing the full factorisation. Downdating A' = A − xx^T is also O(n²) but only when A' remains positive definite. Kalman filters, recursive least squares, and online Gaussian processes all rely on these incremental updates.

How do I solve Ax = b once I have the Cholesky factor?

Two triangular solves. Solve Ly = b by forward substitution, then solve L^T x = y by back substitution. Each is O(n²), so the total solve after one O(n³/3) factorisation costs O(n²) per right-hand side — identical structure to LU but with half the constant.