Optimization

Steepest Descent

Move in the direction of −∇f at every step — the simplest first-order optimisation and the textbook starting point

Steepest descent — Cauchy's 1847 method — moves along the negative gradient and line-searches to the minimum on that ray. On a circular quadratic it converges in one step. On an ellipsoidal one with condition number κ = 1000, it zigzags for hundreds of iterations per digit of accuracy.

InventedAugustin-Louis Cauchy, 1847
Update rulex_{k+1} = x_k − α_k ∇f(x_k)
ConvergenceLinear, rate (κ−1)/(κ+1)
Bad caseκ = 1000 → ~1150 iters/digit
Memory2 vectors (x, ∇f)
Used asReference baseline; foundation of SGD, Adam, momentum

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The negative-gradient direction

The gradient ∇f(x) of a differentiable scalar function points in the direction of fastest local increase, with magnitude equal to the slope in that direction. Reverse the sign and you get the direction of fastest local decrease — the steepest-descent direction. Steepest descent is the iteration that picks this direction at every step:

x_{k+1} = x_k − α_k · ∇f(x_k)

where α_k is a positive step length, ideally chosen by an exact line search: α_k = argmin_{α > 0} f(x_k − α ∇f(x_k)). Each iteration is a single 1D minimisation along a ray. The first-order optimality of that 1D problem says the new gradient is perpendicular to the old gradient: ∇f(x_{k+1}) · ∇f(x_k) = 0. That orthogonality is the source of every interesting property of the method, good and bad.

Cauchy, 1847

Augustin-Louis Cauchy invented steepest descent in 1847, in a five-page note to the Comptes rendus titled "Méthode générale pour la résolution des systèmes d'équations simultanées." His motivation was to solve nonlinear systems F(x) = 0 by minimising f(x) = ½ ‖F(x)‖². He framed the method as a continuous-time flow dx/dt = −∇f, with the discrete iteration as a numerical approximation. Cauchy's note is one of the founding documents of numerical optimisation; the only earlier ideas in the same family were Newton's mechanics-flavoured fluxion methods and Jacobi's Gauss-Seidel-style sweeps. The method was reborn in the 1950s as the foundation for conjugate gradient and quasi-Newton, and again in the 2010s as the foundation for every neural-network optimiser in use today.

The algorithm

choose x_0
for k = 0, 1, 2, …
    g_k = ∇f(x_k)
    if ‖g_k‖ < tol: stop
    α_k = argmin_α f(x_k − α g_k)        ← 1D line search
    x_{k+1} = x_k − α_k g_k

For a quadratic f(x) = ½ xᵀ A x − bᵀ x the line search has a closed form: α_k = (g_kᵀ g_k) / (g_kᵀ A g_k). For general f the line search is itself an iterative routine — bisection, golden-section, or a Wolfe-condition variant. In practice "steepest descent" without an exact line search means using a backtracking Armijo step instead, which gives most of the convergence guarantees at a fraction of the cost.

The convergence-rate proof

For the quadratic f(x) = ½ xᵀ A x with A symmetric positive-definite, define the energy norm ‖v‖_A = √(vᵀ A v). Then exact-line-search steepest descent satisfies

‖x_{k+1} − x*‖_A² ≤ ((κ − 1)/(κ + 1))² · ‖x_k − x*‖_A²

where κ = λ_max(A) / λ_min(A) is the spectral condition number. The Kantorovich inequality (1952) gives the proof in three lines from the definition. The contraction rate (κ−1)/(κ+1) is tight: there exist starting points where the bound is attained. Concretely:

κ        rate ρ      iters / digit
─        ──────      ──────────────
1        0.000       1
4        0.600       ~5
10       0.818       ~12
100      0.980       ~115
1000     0.998       ~1150
10^4     0.9998      ~11500
10^6     0.999998    ~1.15 million

Each digit of accuracy takes roughly κ/2 iterations. Compare CG on the same problem: O(√κ) iterations per digit. For κ = 1000, CG needs ~35 iterations; steepest descent needs ~1150 — a 33× gap that grows as √κ.

Why the zigzag happens

Consecutive steepest-descent steps are orthogonal: g_{k+1} ⊥ g_k. That's the first-order optimality of the exact line search — if g_{k+1} had any component along g_k, you'd not be at the minimum of f along that ray. On a circular quadratic, "perpendicular at each step" matches the geometry of the level sets and the iterates head straight inward. On an elongated ellipse, perpendicular directions don't point toward the centre. The iterates instead bounce side-to-side down the long valley, making tiny progress along the long axis each step.

The zigzag is visible in any plot of steepest descent on Rosenbrock's banana function or on a quadratic with κ = 100. Moving the minimum diagonally a single unit can take a hundred zigzag iterations.

Worked example: a banana quadratic

Take f(x, y) = 5 x² + y² with starting point (1, 1). The Hessian is diag(10, 2), condition number κ = 5, rate (κ−1)/(κ+1) = 0.667.

k    x_k                f(x_k)        ‖∇f‖
─    ────────            ──────────    ────
0    ( 1.000, 1.000)     6.000         11.18
1    (-0.286, 0.714)     0.918          5.10
2    ( 0.143, 0.510)     0.362          2.04
3    (-0.041, 0.364)     0.140          1.41
4    ( 0.020, 0.260)     0.069          0.62
5    (-0.006, 0.186)     0.035          0.44
6    ( 0.003, 0.133)     0.018          0.27
…
20   (~10⁻⁵, 0.024)      6 × 10⁻⁴      ~0.05

The error in y shrinks by factor (κ−1)/(κ+1) = 2/3 per iteration. Reaching y = 10⁻³ takes about 17 steps; reaching y = 10⁻⁶ takes 34. The x coordinate is killed almost instantly (eigenvalue 10 dominates), but the y direction with the smaller eigenvalue 2 is the bottleneck. On a κ = 1000 problem the same pattern gives 0.998 contraction — over a thousand iterations per digit. The zigzag is the symptom: iterates bouncing in x while drifting slowly in y.

Steepest descent vs everything that beats it

	Steepest Descent	Conjugate Gradient	Nesterov	Newton	L-BFGS
Convergence rate	(κ−1)/(κ+1)	(√κ−1)/(√κ+1)	1 − 1/√κ	Quadratic (local)	Superlinear
Iters / digit (κ=10³)	~1150	~35	~30	~5 (near min)	~15
Per-iter cost	1 gradient	1 gradient	1 gradient	1 Hessian solve	2 gradients + O(m N)
Memory	2 vectors	4 vectors	3 vectors	O(N²) for H	2 m vectors (m = 5-20)
Needs second order	No	No	No	Yes	Implicit (gradient diffs)
Convex non-quadratic	O(1/k)	Nonlinear CG required	O(1/k²)	Local quadratic	Superlinear
Real-world role	Pedagogy + SGD/Adam stem	Sparse SPD systems	Convex ML	Stiff systems	scipy default

The right-hand columns are why nobody runs raw steepest descent on serious problems. Every other line-search-based first-order method has at least an O(√κ) rate; steepest descent is alone at O(κ). It survives because every more sophisticated method either inherits its update rule (SGD, Adam, momentum) or uses it as a fallback when fancier directions misbehave (trust-region globalisation).

Standard fixes

Preconditioning. Replace ∇f with M⁻¹ ∇f for some symmetric positive-definite M ≈ Hessian. Geometrically you've changed coordinates so the level sets are more circular. With perfect preconditioning (M = Hessian) one step finishes.
Momentum (Polyak 1964). Add a fraction of the previous step: x_{k+1} = x_k − α g_k + β (x_k − x_{k−1}). The cross-valley component of consecutive gradients cancels; the down-valley component reinforces. Achieves rate (√κ−1)/(√κ+1) on quadratics with the right β.
Nesterov accelerated gradient (1983). A subtly different momentum that achieves the optimal O(1/k²) rate on convex non-quadratic problems. Foundation of modern deep-learning theory.
Variable step (Barzilai-Borwein 1988). Use α_k = (s_{k−1}ᵀ y_{k−1}) / (y_{k−1}ᵀ y_{k−1}) where s, y are step and gradient differences. Cheap quasi-Newton-flavoured rescaling; surprisingly fast in practice.
Conjugate gradient. A direct fix: use A-conjugate search directions instead of negative gradients. Achieves N-step exact convergence on quadratics. Far better than any version of steepest descent for sparse SPD systems.

Where steepest descent (or its descendants) actually runs

Stochastic gradient descent (SGD) for neural-network training. The mini-batch version of steepest descent, with a fixed or scheduled step size in place of an exact line search. Every neural net trained before 2014 used essentially this; the giant networks of 2023-2026 use Adam/AdamW which is SGD-with-momentum plus per-parameter scaling.
Online learning and recommender systems. When data arrives one example at a time, you cannot afford second-order methods. Single-pass SGD (a stochastic steepest descent) is the only feasible algorithm.
Early phase of trust-region methods. Far from the optimum, a Newton step can over-shoot or land in a region where the quadratic model is invalid. Many production solvers (TRON, CG-Steihaug, dogleg) use a steepest-descent step or a steepest-descent / Newton combination ("dogleg") as the safe inner step.
Inverse problems and image reconstruction. Gradient-descent-style methods (ISTA, FISTA) solve compressed-sensing problems min ‖A x − b‖² + λ ‖x‖_1. The base ISTA iteration is steepest descent on the smooth quadratic plus a proximal operator on the L1 term.
Reference baseline. Every paper introducing a new optimiser compares against steepest descent or vanilla SGD. The contraction rate (κ−1)/(κ+1) is the universal yardstick.

Common pitfalls

Fixed step size that is too large. Without a line search or backtracking, a step longer than 2/λ_max diverges on a quadratic. Always estimate the local Lipschitz constant L of ∇f and use α ≤ 1/L.
Stopping at small gradient on saddle. ∇f = 0 is necessary for a minimum, not sufficient. On non-convex problems steepest descent can stall at a saddle. Check the Hessian or add stochastic noise to escape.
Treating exact line search as free. An exact 1D minimisation per outer iteration can be more expensive than several outer iterations of an Armijo backtracking line search. Backtracking is almost always preferred in practice.
Reporting raw iteration counts on ill-conditioned problems. "It converged in 50,000 iterations" is meaningless without κ — preconditioned CG might finish the same problem in 50.
Forgetting to scale. If inputs to a neural net or features in a regression are on different scales (one in [0, 1], another in [0, 10⁶]), the Hessian has huge κ and steepest descent crawls. Normalising features before training is the cheapest preconditioning.

Why the method still matters

Steepest descent is slow, occasionally pathologically slow, and almost never the right choice on a quadratic linear system or a moderate-size convex optimisation. But every faster first-order method either descends from it or measures itself against it: momentum, Nesterov, conjugate gradient, L-BFGS, Adam, RMSProp, SGD, ISTA, FISTA, AdaGrad. Cauchy's 1847 idea — point in the direction of fastest local decrease — is the load-bearing intuition behind 150 years of optimisation.

The diagnostic value alone is worth knowing the method: if your fancy optimiser is misbehaving, run steepest descent on the same problem and check whether the rate of progress is roughly (κ−1)/(κ+1). If yes, your fancy method is doing nothing more than gradient descent — and you should be asking why. The κ that breaks steepest descent is also what tells you whether to add momentum, preconditioning, or a second-order method on top.

Frequently asked questions

Why does steepest descent zigzag?

Because an exact line search along the gradient direction stops at the point where the next gradient is perpendicular to the current direction — that's the first-order optimality condition for the 1D subproblem. So consecutive search directions are exactly orthogonal in the Euclidean sense. On a nearly-circular quadratic this is fine; on an elongated ellipse, the orthogonal directions don't point toward the centre. The iterates bounce side-to-side down the narrow valley, making microscopic progress along the long axis each step. The zigzag is the visible symptom of orthogonal-direction-tying with elongated geometry.

How bad is the convergence rate?

On a quadratic f(x) = ½ x^T A x − b^T x with condition number κ = λ_max / λ_min of A, the energy-norm error of steepest descent satisfies ‖x_k − x*‖_A ≤ ((κ−1)/(κ+1))^k ‖x_0 − x*‖_A. For κ = 1, the bound is 0 and one iteration finishes. For κ = 10 the rate is 0.818, needing about 12 iterations per digit. For κ = 1000 the rate is 0.998, needing about 1150 iterations per digit. For κ = 10⁶ — typical of unscaled neural-net Hessians — it would take a million iterations per digit. Conjugate gradient on the same problem needs O(√κ) iterations rather than O(κ).

Is gradient descent the same as steepest descent?

They share the search direction −∇f but differ in step length. Steepest descent does an exact line search to find α_k that minimises f along the ray. Plain gradient descent uses a fixed or scheduled α_k — much cheaper per iteration, no inner optimisation, and with the right choice it has the same asymptotic rate (κ−1)/(κ+1). In deep learning 'gradient descent' usually means stochastic gradient descent (SGD) with a tuned learning-rate schedule and no line search. Steepest descent is the textbook reference; SGD is what runs.

How do I fix the zigzag?

Five standard fixes. (1) Preconditioning: solve in transformed coordinates so the level sets are nearly spherical — κ drops, zigzag disappears. (2) Momentum (Polyak 1964): add β · (x_k − x_{k−1}) to each update to cut across the valley. (3) Nesterov acceleration (1983): a smarter momentum that achieves the optimal O(√κ) rate on convex problems. (4) Conjugate gradient: pick A-conjugate directions instead of plain gradients, achieving N-step exact convergence on quadratics. (5) Quasi-Newton (L-BFGS): approximate the inverse Hessian from gradient differences, multiplying the gradient before stepping. Modern ML defaults to Adam, which combines momentum with per-coordinate scaling — essentially a cheap diagonal preconditioner.

When is steepest descent actually a good choice?

When the problem is well-conditioned (κ ≈ 1) and the gradient is dramatically cheaper than any second-order information. Examples: huge stochastic settings where computing the full gradient is already painful, online learning with constantly-changing objectives, or as the first phase of a two-phase optimiser before switching to BFGS. It's also taught as the foundation method because every other first-order optimiser — momentum, Nesterov, Adam, RMSProp — is a tweak on plain gradient descent.

Who invented steepest descent?

Augustin-Louis Cauchy proposed the method in 1847 as a way to solve nonlinear systems of equations, in his paper 'Méthode générale pour la résolution des systèmes d'équations simultanées' published in Comptes rendus de l'Académie des sciences. He framed it as a continuous-time descent flow dx/dt = −∇f, with the discrete iteration as a numerical approximation. The geometric idea — that the gradient is the direction of fastest increase, so its negative is the direction of fastest decrease — is much older, going back implicitly to Newton's mechanics. Cauchy was the first to write the explicit iterative algorithm.

Does steepest descent work on non-convex functions?

It converges to a stationary point — a point where ∇f = 0 — but that point can be a saddle, a local minimum, or only a saddle of a strongly-non-convex function. Steepest descent has no mechanism to escape saddles unless stochasticity is added. In deep learning, stochastic gradient noise empirically helps escape strict saddles, and methods like perturbed gradient descent (Jin et al. 2017) make this rigorous. On the original Cauchy problem of solving F(x) = 0 with f(x) = ½‖F‖², steepest descent converges to a stationary point of f, but that need not be a zero of F if f has spurious minima.