Optimization
Conjugate Gradient
Iteratively solve Ax = b for symmetric positive-definite A in at most N steps using A-orthogonal search directions
Conjugate gradient minimises the quadratic ½ x^T A x − b^T x by picking each new search direction A-orthogonal to all previous ones. Each step kills error in a fresh subspace, so the exact minimum is reached in at most N iterations — and on real well-conditioned problems, a few dozen.
- IntroducedHestenes & Stiefel, 1952
- Problem classAx = b, A symmetric positive-definite
- Exact convergenceN steps (N×N system, exact arithmetic)
- Iterative boundO(√κ(A)) iters for tol ε
- Per-step cost1 SpMV + 2 dot + 3 axpy
- ML scaleUp to N = 10⁹ with preconditioner
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The problem CG solves
Given a symmetric positive-definite matrix A and a vector b, find x with A x = b. Equivalently, minimise the quadratic energy
φ(x) = ½ xᵀ A x − bᵀ x
The gradient of φ is ∇φ = A x − b, which vanishes exactly when A x = b. Because A is positive-definite, φ has a unique global minimum, and that minimiser is the solution to the linear system. Conjugate gradient (CG) is the workhorse iterative method for this problem when A is too large to factor — typical N in modern simulations is 10⁶ to 10⁹, where a direct Cholesky factorisation would need terabytes of storage for the factor, and CG keeps memory to a handful of length-N vectors.
Hestenes, Stiefel, and the original 1952 paper
Magnus Hestenes at the National Bureau of Standards (now NIST) and Eduard Stiefel at the ETH Zürich independently invented essentially the same algorithm in 1951-1952. Their joint paper, "Methods of Conjugate Gradients for Solving Linear Systems," appeared in the Journal of Research of the National Bureau of Standards in December 1952. They presented CG as a direct method that always finishes in N steps — appropriate framing for the small dense problems of the day. It took two more decades, until John Reid's 1971 paper, for the community to recognise CG as an iterative method that gets close to the solution in far fewer than N steps on large sparse problems with moderate condition number. That re-interpretation is what made CG the universal solver it is today.
The algorithm in six lines
x_0 = 0 (or a guess)
r_0 = b - A x_0 ← initial residual
p_0 = r_0 ← initial search direction
for k = 0, 1, 2, …
α_k = (r_kᵀ r_k) / (p_kᵀ A p_k) ← step length
x_{k+1} = x_k + α_k p_k ← update solution
r_{k+1} = r_k - α_k A p_k ← update residual
β_k = (r_{k+1}ᵀ r_{k+1}) / (r_kᵀ r_k) ← orthogonalisation coefficient
p_{k+1} = r_{k+1} + β_k p_k ← new search direction
stop when ‖r_{k+1}‖ < tol
The algorithm uses exactly one sparse matrix-vector product (A p_k), two inner products, and three vector updates per iteration. Memory cost is four length-N vectors. Compare with Gaussian elimination, which would need N²-to-N³ memory for the LU factor — CG is the only reason simulations with billion-dimensional state vectors are tractable.
Why A-conjugate directions kill error one subspace at a time
Two vectors p and q are A-conjugate if pᵀ A q = 0. When A is symmetric positive-definite this is an inner product, and A-conjugacy is the natural notion of orthogonality for the energy norm ‖v‖_A = √(vᵀ A v). The CG construction guarantees that all the search directions p_0, …, p_{N-1} are A-conjugate to each other.
This matters because once CG has performed an exact line search along p_k, the residual r_{k+1} = b − A x_{k+1} is A-conjugate to p_k. That means subsequent steps along p_{k+1}, p_{k+2}, … cannot reintroduce error in the direction p_k. Each iteration zeros error in one A-conjugate direction permanently. After N iterations, all N directions in the A-conjugate basis have been zeroed and x_N must be the exact solution. The construction is direct yet only needs a single line of arithmetic per coefficient update — no Gram-Schmidt loop over all previous directions.
Worked example: a 2×2 system
Take A = [[4, 1], [1, 3]] and b = (1, 2). The exact solution is x = (1/11, 7/11) ≈ (0.0909, 0.6364). Start CG at x_0 = (0, 0):
k=0: r_0 = b - A x_0 = (1, 2)
p_0 = (1, 2)
A p_0 = (4·1 + 1·2, 1·1 + 3·2) = (6, 7)
α_0 = (r_0ᵀ r_0) / (p_0ᵀ A p_0) = (1 + 4) / (1·6 + 2·7) = 5/20 = 0.25
x_1 = (0,0) + 0.25 · (1, 2) = (0.25, 0.50)
r_1 = (1, 2) - 0.25 · (6, 7) = (-0.50, 0.25)
β_0 = (r_1ᵀ r_1) / (r_0ᵀ r_0) = (0.25 + 0.0625) / 5 = 0.0625
p_1 = (-0.50, 0.25) + 0.0625 · (1, 2) = (-0.4375, 0.375)
k=1: A p_1 = (4·-0.4375 + 1·0.375, 1·-0.4375 + 3·0.375) = (-1.375, 0.6875)
α_1 = (r_1ᵀ r_1) / (p_1ᵀ A p_1) = 0.3125 / (-0.4375·-1.375 + 0.375·0.6875)
= 0.3125 / 0.859375 = 0.3636…
x_2 = (0.25, 0.50) + 0.3636 · (-0.4375, 0.375) = (0.0909, 0.6364)
r_2 = (-0.50, 0.25) - 0.3636 · (-1.375, 0.6875) = (0, 0)
In exactly 2 iterations CG hit the exact solution x = (1/11, 7/11) — N steps for an N=2 system, as promised. On a 1000×1000 SPD matrix the same theoretical bound says CG finishes in 1000 iterations exactly; in practice rounding error introduces drift, and CG is run with a tolerance ‖r‖ < ε, hitting it in O(√κ(A)) iterations where κ is the condition number.
CG vs other linear-system solvers
| Conjugate Gradient | Steepest Descent | GMRES | Direct Cholesky | |
|---|---|---|---|---|
| Matrix requirement | SPD | Any (slow on hard ones) | Any non-singular | SPD |
| Convergence iters | O(√κ(A)) | O(κ(A)) | O(√κ(A)) optimal | 1 (exact) |
| Memory | O(N) — 4 vectors | O(N) — 2 vectors | O(kN) — Arnoldi basis | O(N²) — full factor |
| Per-iter cost | 1 SpMV + O(N) | 1 SpMV + O(N) | 1 SpMV + O(kN) | O(N³) once |
| Exact in N steps | Yes (exact arithmetic) | No | Yes (non-symmetric) | Yes |
| Parallelisation | SpMV + dot products | Same as CG | Restarts hurt scaling | Factor is hard to scale |
| Used when | Sparse SPD, N up to 10⁹ | Pedagogy only | Sparse non-symmetric | Dense or small SPD |
The middle column is the punchline of CG. Steepest descent on the same SPD system needs O(κ(A)) iterations rather than O(√κ(A)) — for a moderately ill-conditioned problem with κ = 10⁴, CG converges in ~100 iterations where steepest descent would need ~10,000. The square-root speed-up alone is why no one teaches steepest descent as a serious linear solver any more.
Preconditioning
Real-world A often has a condition number that makes raw CG unacceptably slow. The remedy is preconditioning: pick a matrix M ≈ A that is cheap to apply and run CG on the implicit system M⁻¹A x = M⁻¹b. The new condition number κ(M⁻¹A) is ideally close to 1, so the iteration count collapses.
Standard preconditioners on industrial codes:
- Diagonal (Jacobi): M = diag(A). Trivially cheap, modest reduction in κ. Default in most teaching codes.
- Incomplete Cholesky (ICC): M = L̃ L̃ᵀ where L̃ is a sparsified Cholesky factor. Standard in serial FEM solvers.
- Algebraic multigrid (AMG): Build a hierarchy of coarse problems automatically from A. The state of the art for elliptic PDE-derived matrices — turns iteration counts from thousands to single digits.
- Domain decomposition: Solve local subproblems on each MPI rank, glue with a coarse correction. Standard on massively parallel codes.
For a 3D Poisson problem discretised on a 100×100×100 grid, κ ≈ 10⁴ and unpreconditioned CG needs roughly 300 iterations. AMG-preconditioned CG reaches machine precision in 10. That gap — two orders of magnitude — explains why every modern PDE solver is "Krylov plus a good preconditioner" rather than direct factorisation.
Where CG actually runs
- Finite element analysis. Every modern FEM code (Abaqus, ANSYS, COMSOL, FEniCS, deal.II) ships preconditioned CG (or its MINRES/GMRES cousins for non-SPD) as the default solver for elliptic and parabolic problems. Linear elasticity, heat conduction, electrostatics — all SPD, all CG inside.
- Image deblurring and tomography. Computed tomography reconstructs the patient from a sparse Radon transform; the normal equations AᵀA x = Aᵀb are SPD, so CG (often as CGLS or LSQR) is the standard inner solver. Industry tools like ASTRA Toolbox and TIGRE run CG on millions of voxels per scan.
- Large-scale machine learning. Trust-region Newton methods (TRON inside
liblinearfor L2-regularised logistic regression) solve the Newton step H Δ = −g with CG. Hessian-free deep learning (Martens 2010) uses CG with matrix-free Hessian-vector products. Influence Functions, implicit MAML, and meta-learning all use CG to invert the Hessian without forming it. - Gaussian-process regression. Computing (K + σ² I)⁻¹ y for a kernel matrix K is SPD; the gpytorch library defaults to preconditioned CG with rank-k pivoted-Cholesky preconditioner, scaling GP regression to N ≈ 10⁶.
- Quantum chemistry. Density functional theory's self-consistent field iteration solves an eigenvalue problem; the inner linear systems (Sternheimer equation for response theory) are SPD and run on CG. Quantum ESPRESSO and Gaussian use CG inside DFT.
Variants and cousins
- Preconditioned CG (PCG): CG with an inner preconditioner application M⁻¹ r each iteration. Standard form on every real workload.
- MINRES (1975): Variant for symmetric indefinite A. Minimises the 2-norm of the residual instead of the energy norm. Replaces CG when A has mixed-sign eigenvalues.
- GMRES (Saad & Schultz 1986): Extends the Krylov idea to non-symmetric A. Builds an explicit Arnoldi basis, so memory grows with iteration count — usually restarted every 30-50 steps in practice.
- BiCGStab (Van der Vorst 1992): A short-recurrence (constant-memory) Krylov method for non-symmetric A. Common alternative to GMRES when restarts hurt convergence.
- CGNE / CGLS / LSQR: Apply CG to the normal equations AᵀA x = Aᵀb (or LSQR's stabler equivalent) to solve least squares min ‖A x − b‖. Standard for over-determined systems in tomography and ML.
- Nonlinear CG (Fletcher-Reeves, Polak-Ribière): Generalises CG from quadratic to general smooth f(x). Each step does a line search along a direction built from the gradient and the previous direction. Cheaper per iteration than L-BFGS, used in older neural-net training codes and inside scipy.optimize.
Common pitfalls
- Using CG on non-SPD A. CG can break down (zero division) or wander when A is not positive-definite. Use MINRES for symmetric indefinite, GMRES or BiCGStab for non-symmetric. Many homemade codes silently produce nonsense by ignoring this.
- No preconditioner. Plain CG on a stiff FEM matrix can need millions of iterations. Always wrap with a problem-appropriate preconditioner; the difference between κ = 10⁶ and κ = 10 in iteration count is six orders of magnitude.
- Loss of orthogonality in finite precision. After many iterations rounding corrupts A-conjugacy and convergence stalls. Restart CG every ~50 iterations or switch to an explicit reorthogonalisation variant.
- Wrong stopping criterion. The natural stopping test ‖r_k‖ < ε can be misleading on badly-scaled problems — a small residual does not always mean a small error. Use a relative test ‖r_k‖ < ε ‖b‖ or a normwise backward-error check.
- Forgetting that CG is matrix-free. You only need to compute A · v, not store A. On matrix-free problems (Hessian-free ML, finite-difference PDEs) writing the SpMV as a function call lets CG run on problems that would never fit in memory if A were assembled.
Why CG is still everywhere
The conjugate-gradient method turns 74 years old in 2026 and remains the default iterative solver for symmetric positive-definite systems. Three properties keep it alive: provable O(√κ) convergence, constant memory cost (four vectors regardless of problem size), and full matrix-free operation. Newer methods improve on it for specific subclasses — MINRES for indefinite, GMRES for non-symmetric, AMG for elliptic PDEs — but every one of them is built on the same Krylov-subspace framework Hestenes and Stiefel laid down at NBS in 1952.
Inside every petascale climate model, every commercial FEM solver, every Gaussian-process library, and every Hessian-free deep-learning trainer is a small, ancient, six-line algorithm that does exactly one matrix-vector product per iteration and walks toward the answer in directions no one outside numerical analysis has heard of. That single algorithm is the reason simulation-at-scale exists.
Frequently asked questions
Why does CG converge in N steps?
CG generates N search directions p_0, p_1, …, p_{N-1} that are A-conjugate, meaning p_i^T A p_j = 0 whenever i ≠ j. Because A is symmetric positive-definite, these N vectors form a basis of ℝ^N. The CG step along p_k zeros the error component in the direction p_k, and once that component is gone it stays gone — subsequent steps are A-orthogonal to p_k and cannot disturb it. After N steps every component of the initial error has been eliminated, so x_N is the exact solution. In floating-point, rounding lets old components creep back, so CG is run as an iterative method that converges in O(√κ(A)) iterations rather than waiting for the theoretical N.
Why does CG only work on symmetric positive-definite matrices?
CG is built on the equivalence Ax = b ⇔ minimise φ(x) = ½ x^T A x − b^T x. That quadratic has a unique global minimum only when A is symmetric positive-definite — symmetric so the quadratic is well-defined and the gradient is A x − b, positive-definite so the minimum is a unique finite point. If A is non-symmetric the equivalence fails and you need GMRES or BiCGStab. If A is symmetric but indefinite (eigenvalues of mixed sign) the quadratic is a saddle and CG can break down with a zero step length. MINRES handles symmetric indefinite cases.
What does the condition number do to CG's speed?
The standard CG convergence bound is ‖x_k − x*‖_A ≤ 2 ((√κ − 1)/(√κ + 1))^k ‖x_0 − x*‖_A where κ = λ_max / λ_min is the condition number of A. So to reduce the error by a factor of 10^d you need roughly k ≈ ½√κ · ln(2 · 10^d / ε) iterations — convergence depends on the square root of the condition number, not the condition number itself like steepest descent. A typical FEM stiffness matrix has κ that grows like 1/h^2 in the mesh size h, so doubling the resolution doubles the required iterations — manageable, where steepest descent would quadruple them.
What is a preconditioner and why use one?
A preconditioner is a matrix M ≈ A that is cheap to apply. Preconditioned CG solves M⁻¹ A x = M⁻¹ b implicitly, which has condition number κ(M⁻¹A) ideally much smaller than κ(A). The iteration count drops in proportion. Common choices: incomplete Cholesky (ICC), algebraic multigrid (AMG), block-Jacobi for parallel runs, and domain-decomposition preconditioners for FEM. On a 3D Poisson FEM problem AMG-preconditioned CG can reach machine precision in 10 iterations where plain CG would need thousands. The Conjugate-Gradient half of any modern sparse linear solver is the easy part; almost all the engineering is in the preconditioner.
Why is CG considered a Krylov subspace method?
The k-th iterate of CG (starting from x_0 = 0) lies in the Krylov subspace K_k(A, b) = span{b, Ab, A^2 b, …, A^{k-1} b}. CG picks the specific point in K_k that minimises the energy norm ‖x − x*‖_A. Other Krylov methods pick different optima in the same subspace: MINRES minimises the 2-norm of the residual, GMRES does the same for non-symmetric A. The Krylov framework unifies them: each method is a particular projection from a small Krylov subspace onto an approximate solution, with cost dominated by matrix-vector products A · v rather than by storing or factoring A itself.
What does CG cost per iteration?
One sparse matrix-vector product A · p (the dominant cost), two inner products r·r and p·Ap, and three vector updates of length N. Total roughly 2 · nnz(A) + 5N flops per iteration where nnz is the number of non-zeros. No factorisation, no extra storage beyond four length-N vectors. That O(nnz) per-iteration cost is why CG scales to billion-dimension problems where direct Cholesky would need petabytes of memory for the factor — and why every petascale FEM and PDE-constrained-optimisation code is CG-based with a preconditioner.
Where does CG appear in machine learning?
Inside trust-region Newton methods (e.g. TRON for L2-regularised logistic regression in liblinear), CG solves the Newton step H · Δ = −g where H is the Hessian. Inside Hessian-free optimisation for deep nets (Martens 2010), CG solves the same equation using Hessian-vector products computed via R-operators. CG is also the standard inner solver in Gaussian-process regression where you must compute (K + σ² I)⁻¹ y for kernel matrix K, in implicit differentiation through fixed points (Bai 2020), and in Influence Functions for interpretability (Koh & Liang 2017). Wherever you'd write H⁻¹ v and H is SPD, CG is the practical answer.