Linear Algebra
Cholesky Decomposition
The square root of a positive-definite matrix — twice as fast as LU
A symmetric positive-definite matrix A factors uniquely as A = LL^T where L is lower triangular. Twice as fast as LU, half the memory, no pivoting needed — the workhorse for normal equations, Kalman filters, and Monte Carlo sampling.
- FactorizationA = L · Lᵀ (L lower triangular, positive diagonal)
- RequiresA symmetric positive definite
- Factor costn³/3 flops — exactly half of LU's 2n³/3
- Solve cost per b≈ n² (two triangular solves)
- Memoryn²/2 — only the lower triangle of L
- Used inNormal equations, Kalman filter, Monte Carlo (MVN sampling), Gaussian processes, FEM
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The factorization
Cholesky writes a symmetric positive-definite matrix A as
A = L · Lᵀ
where L is lower triangular with strictly positive entries on its diagonal. Think of L as the matrix square root of A — applied twice (once as L, once transposed as Lᵀ) it reproduces A. Because A must be symmetric, only the lower triangle of A is ever read; because L is triangular, only n(n+1)/2 entries are stored.
The factorization is unique once you fix the convention that L's diagonal is positive. No pivoting, no permutation matrix, no row swaps — positive definiteness guarantees every pivot is safe.
How L is built — column by column
Equate the (i, j) entries of A with the (i, j) entries of LLᵀ. For i = j (the diagonal):
Aᵢᵢ = Σ_(k=1..i) Lᵢₖ² ⇒ Lᵢᵢ = sqrt(Aᵢᵢ − Σ_(k<i) Lᵢₖ²)
For i > j (below the diagonal):
Aᵢⱼ = Σ_(k=1..j) Lᵢₖ Lⱼₖ ⇒ Lᵢⱼ = (Aᵢⱼ − Σ_(k<j) Lᵢₖ Lⱼₖ) / Lⱼⱼ
Process column j from j = 1 to n. Compute Lⱼⱼ first (the diagonal), then fill in Lᵢⱼ for i > j. Each entry uses only entries already computed in earlier columns — the algorithm is forward-marching with no look-ahead.
Worked example — a 3×3 SPD matrix
Factor
A = [ 4 2 2 ]
[ 2 5 3 ]
[ 2 3 6 ]
Column 1.
L₁₁ = sqrt(A₁₁) = sqrt(4) = 2
L₂₁ = A₂₁ / L₁₁ = 2/2 = 1
L₃₁ = A₃₁ / L₁₁ = 2/2 = 1
Column 2.
L₂₂ = sqrt(A₂₂ − L₂₁²) = sqrt(5 − 1) = sqrt(4) = 2
L₃₂ = (A₃₂ − L₃₁ · L₂₁) / L₂₂ = (3 − 1·1)/2 = 1
Column 3.
L₃₃ = sqrt(A₃₃ − L₃₁² − L₃₂²) = sqrt(6 − 1 − 1) = sqrt(4) = 2
Final factor:
L = [ 2 0 0 ]
[ 1 2 0 ]
[ 1 1 2 ]
Verify by multiplying out LLᵀ:
LLᵀ = [ 2 0 0 ] [ 2 1 1 ] [ 4 2 2 ]
[ 1 2 0 ] [ 0 2 1 ] = [ 2 5 3 ] ✓
[ 1 1 2 ] [ 0 0 2 ] [ 2 3 6 ]
Variants
- LDLᵀ (square-root-free Cholesky). Write A = LDLᵀ where L is unit lower triangular (1s on the diagonal) and D is diagonal. Avoids the n square roots, which is faster on hardware where sqrt is slow and also works for symmetric indefinite matrices (with Bunch–Kaufman pivoting). This is what LAPACK's
sytrfimplements. - Pivoted Cholesky. PTAP = LLT with diagonal pivoting (swap the largest diagonal entry to the top at each step). Produces a rank-revealing factorisation for positive-semidefinite or near-singular matrices; LAPACK's
pstrfdoes this. - Modified Cholesky (Gill–Murray–Wright). When A is not quite positive definite (e.g. a Hessian during nonlinear optimisation), perturb diagonal entries on the fly so the factorisation always succeeds. The output is L for a nearby SPD matrix A + E, with E as small as possible.
- Sparse Cholesky. For graph Laplacians and finite-element matrices, the sparsity of L depends on the elimination order. Libraries (CHOLMOD, SuiteSparse) reorder rows/columns first (AMD, METIS) to minimise fill-in.
- Block Cholesky. Partition A into 2 × 2 blocks; apply Cholesky to the top-left block, update the rest with the Schur complement, and recurse. Maps perfectly onto GPU matrix-matrix multiply kernels.
Algorithm — Cholesky in place
function cholesky(A):
n = number of rows of A
// L overwrites lower triangle of A in place
for j in 0 .. n-1:
// Diagonal entry
s = A[j][j]
for k in 0 .. j-1:
s -= A[j][k] * A[j][k]
if s <= 0: error "matrix not positive definite"
A[j][j] = sqrt(s)
// Off-diagonal entries in column j
for i in j+1 .. n-1:
s = A[i][j]
for k in 0 .. j-1:
s -= A[i][k] * A[j][k]
A[i][j] = s / A[j][j]
return A // lower triangle holds L
Flop count: n³/3 (vs LU's 2n³/3) plus n square roots. For an n = 1000 SPD system the factorisation takes ≈ 0.33 GFLOP — running in milliseconds on a modern CPU. Once factored, every Ax = b solve costs only 2n² flops.
Cholesky vs LU vs QR — when to use which
| Cholesky (LLᵀ) | LU with partial pivot | QR | SVD | |
|---|---|---|---|---|
| Requirement on A | Symmetric positive definite | Square nonsingular | Any m × n (m ≥ n) | Any m × n |
| Factor cost | n³/3 | 2n³/3 | 2mn² − 2n³/3 | ≈ 22n³ (m = n, full SVD) |
| Memory | n²/2 | n² | mn (in Householder form) | n² + mn |
| Pivoting? | None needed | Row swaps | None for stability; column for rank | None |
| Stable for ill-conditioned A? | Yes (κ(A) up to ~10⁸) | Yes | Yes (better for least squares) | Best — reveals rank |
| Square roots | n | None | None | None on the matrix |
| Typical use | Covariance, normal equations, FEM, Kalman | General dense systems | Least squares, orthonormalisation | Rank, pseudoinverse, PCA |
Why positive-definite — and how to check
A is symmetric positive definite (SPD) if A = Aᵀ and xᵀAx > 0 for every nonzero x. Equivalent characterisations:
- All eigenvalues of A are strictly positive.
- All leading principal minors of A are positive (Sylvester's criterion).
- A admits a Cholesky factorisation A = LLᵀ with L having positive diagonal.
Running Cholesky is itself the cheapest SPD test — if the algorithm produces a non-positive Lⱼⱼ² value, A is not SPD. Many scientific libraries use Cholesky-attempt as the SPD certificate, and fall back to LDLᵀ or LU when it fails.
Where Cholesky shows up
- Normal equations in linear regression. Least squares solves AᵀA x = Aᵀb. AᵀA is SPD (when A has full column rank), so Cholesky gives the cheapest path. QR is more numerically stable for ill-conditioned A but slower; Cholesky-on-normal-equations is the textbook recipe and is still standard for well-conditioned problems.
- Sampling from a multivariate normal. If Σ = LLᵀ and z ∼ 𝒩(0, I), then x = μ + Lz ∼ 𝒩(μ, Σ). Every Monte Carlo simulator with correlated noise (portfolio risk, particle filters, climate ensembles) leans on this.
- Kalman filter — covariance updates. The square-root Kalman filter (SR-KF) propagates the Cholesky factor of the state covariance instead of the covariance itself. Half the work, twice the numerical condition number — the standard in aerospace navigation.
- Gaussian processes. Marginal likelihood involves log det(K) and α = K⁻¹y for a covariance matrix K. Cholesky gives both in one O(n³/3) factorisation: log det = 2 Σ log(Lᵢᵢ), and α via two triangular solves.
- Finite-element method stiffness matrices. The global stiffness matrix is symmetric positive definite (for well-posed elliptic PDEs). Sparse Cholesky with fill-reducing reorderings (METIS/AMD) is the workhorse direct solver in structural mechanics.
- Interior-point methods. The Newton system in primal-dual algorithms reduces to a symmetric positive-definite normal-equation matrix at each iteration — solved by Cholesky for the bulk of optimisation runtime.
- Whitening transforms in machine learning. z = L⁻¹(x − μ) where Σ = LLᵀ — produces a standard-normal-distributed variable from a correlated one, a building block in PCA preprocessing and ZCA whitening.
Common mistakes
- Trying Cholesky on a non-SPD matrix. If A has even one zero or negative eigenvalue, Cholesky aborts with a non-positive Lⱼⱼ². The fix is either LDLᵀ (Bunch–Kaufman pivoting) or adding a small ridge δ·I to A.
- Using normal-equation-with-Cholesky on ill-conditioned least squares. AᵀA squares the condition number of A. If κ(A) ≈ 10⁵, then κ(AᵀA) ≈ 10¹⁰ — pushing past the safe limit (≈ 10⁸ in double precision). Use QR for least squares whenever κ(A) is borderline.
- Forgetting that Cholesky is in-place. Most implementations overwrite A's lower triangle with L. The original A is destroyed unless you copied it first — a classic "where did my covariance go" bug.
- Mixing up Lᵀ and L⁻¹. Lᵀ is the upper-triangular transpose; L⁻¹ is the inverse. Whitening uses L⁻¹; recoloring uses L. They are almost always different matrices.
- Computing Σ⁻¹ explicitly when you only need solves. Once you have L, every solve costs 2n². Forming Σ⁻¹ explicitly costs n³ extra and loses precision. Solve, don't invert.
- Ignoring symmetry on input. Most libraries assume A is symmetric and only read the lower (or upper) triangle. Passing a slightly non-symmetric matrix (e.g. from accumulated round-off) silently produces wrong results. Symmetrise via (A + Aᵀ)/2 before calling Cholesky on data you don't trust.
Frequently asked questions
What is Cholesky decomposition?
It is the factorisation of a symmetric positive-definite matrix A as A = LL^T where L is a lower-triangular matrix with positive diagonal entries. Existence and uniqueness both depend on A being symmetric and positive definite — fail either and Cholesky aborts. When applicable, it is the cheapest, most stable factorisation in linear algebra.
Why is Cholesky faster than LU?
Cholesky uses symmetry to skip half the work. It touches only the lower triangle of A and computes only the lower triangle of L. The flop count is n³/3 — exactly half of LU's 2n³/3 (or n³/3 with the symmetric-storage convention, double Cholesky's n³/6 traditional accounting). Memory drops too: n²/2 instead of n².
Why does Cholesky need no pivoting?
Positive definiteness guarantees every leading principal submatrix has positive determinant, so the diagonal entry that becomes the next pivot is always positive. There is never a zero (or destructively small) pivot in finite precision, which makes Cholesky unconditionally backward stable without any row swaps. Numerical failure on a numerically positive-definite matrix is rare and signals you are at the edge of positive-definiteness.
How is Cholesky used for sampling from a multivariate normal?
If z ~ N(0, I_n) is a vector of independent standard normals and Σ = LL^T, then x = μ + Lz is distributed as N(μ, Σ). The Cholesky factor L is the cheap, deterministic square-root of the covariance matrix used by every Monte Carlo simulator: portfolio risk in finance, particle filters, Bayesian posterior sampling, generative models with Gaussian latents.
What if my matrix is positive semidefinite but not strictly positive definite?
Standard Cholesky fails — a zero appears on the diagonal during factorisation. Use pivoted Cholesky (LAPACK's potrf with pivoting) which produces P^T A P = LL^T plus a rank certificate, or fall back to LDL^T which handles rank deficiency more gracefully. Adding a small ridge δ·I to A is the practical hack used everywhere in ML when covariance matrices become numerically singular.
Can I update or downdate a Cholesky factor?
Yes. After a rank-one update A' = A + xx^T (covariance after a new observation), there is an O(n²) algorithm that produces the new L' without redoing the full factorisation. Downdating A' = A − xx^T is also O(n²) but only when A' remains positive definite. Kalman filters, recursive least squares, and online Gaussian processes all rely on these incremental updates.
How do I solve Ax = b once I have the Cholesky factor?
Two triangular solves. Solve Ly = b by forward substitution, then solve L^T x = y by back substitution. Each is O(n²), so the total solve after one O(n³/3) factorisation costs O(n²) per right-hand side — identical structure to LU but with half the constant.