Numerical Optimization

Newton's Method (Optimization)

Use 2nd-order information x_{k+1} = x_k − H⁻¹∇f for quadratic local convergence to a minimum

Newton's method fits a local quadratic to f using the gradient and Hessian, then jumps to that quadratic's minimum. Near a local minimum the error squares each step — five digits become ten, ten become twenty — so production codes hit machine precision in five or six iterations.

Update rulex_{k+1} = x_k − H(x_k)⁻¹ ∇f(x_k)
Local convergenceQuadratic (errors square)
Per-step cost∇f + H + 1 linear solve
MemoryO(n²) for H (dense)
Iters to 50 digits~6 (from 1-digit guess)
Backbone ofInterior point, IRLS, SQP, trust region

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The quadratic-model idea

At any point x_k where f is twice differentiable, Taylor expand to second order:

f(x_k + p) ≈ f(x_k) + ∇f(x_k)ᵀ p + ½ pᵀ H(x_k) p

where H = ∇²f is the n × n Hessian matrix of second partial derivatives. This is the local quadratic model of f. If H is positive-definite the model has a unique minimum, found by setting its gradient to zero:

∇_p q(p) = ∇f(x_k) + H(x_k) p = 0   ⇒   p = − H(x_k)⁻¹ ∇f(x_k)

The next iterate is x_{k+1} = x_k + p. That is the Newton step for optimisation. Geometrically: at every iterate, draw a paraboloid that matches f's gradient and Hessian; jump to the bottom of the paraboloid; repeat. On a true quadratic, one step finishes. On general smooth f, the iteration's error squares each step near a minimum.

Newton-Raphson on the gradient

Newton's method for optimisation is exactly Newton-Raphson for root finding applied to the system ∇f(x) = 0. The Jacobian of ∇f is the Hessian of f, so Newton-Raphson's update x_{k+1} = x_k − J⁻¹ F becomes x_{k+1} = x_k − H⁻¹ ∇f. Everything we know about Newton-Raphson's convergence — quadratic near simple zeros, the basin-of-attraction question, the need for damping when far from the answer — transfers verbatim. The optimisation-specific twist is that we want a minimum, not just a stationary point, so we additionally require H to be positive-definite at the limit. Plain Newton on a saddle's gradient converges to the saddle.

The algorithm

choose x_0
for k = 0, 1, 2, …
    g_k = ∇f(x_k)
    if ‖g_k‖ < tol: stop
    H_k = ∇²f(x_k)
    solve   H_k · p_k = − g_k          ← Hessian-solve
    x_{k+1} = x_k + α_k p_k             ← with line search or trust region

The Hessian-solve is the dominant cost. For dense H of dimension n, factor via Cholesky in O(n³) flops with O(n²) memory. For sparse H, use sparse Cholesky or run CG on H matrix-free. The line-search step length α_k is typically chosen by backtracking-with-Armijo to ensure f decreases; without it, the full Newton step (α = 1) can overshoot and diverge far from the optimum. Near the minimum α = 1 satisfies Armijo and full quadratic convergence kicks in.

Why convergence is quadratic

Assume f is twice differentiable with Lipschitz Hessian (constant L) and H(x*) is positive-definite at the minimum x*. Then for x_k close enough to x* the Newton step satisfies

‖x_{k+1} − x*‖ ≤ (L / 2 λ_min(H(x*))) · ‖x_k − x*‖²

The error squares each step. Starting from ‖x_0 − x*‖ = 0.1, six iterations reach 0.1 → 0.01 → 10⁻⁴ → 10⁻⁸ → 10⁻¹⁶ → 10⁻³² → 10⁻⁶⁴, blowing past double-precision machine epsilon at iteration 4. Real implementations stop at iteration 5 or 6 with relative residual at machine precision. Compare steepest descent on the same problem with κ = 100: same precision takes 1000+ iterations. The ratio is 200×, and it grows as κ grows.

Worked example: minimising the Rosenbrock banana

Rosenbrock's function f(x, y) = 100 (y − x²)² + (1 − x)² is the standard test case: a narrow curved valley, minimum at (1, 1). Plain steepest descent needs thousands of iterations; full Newton with a line search finishes in a handful. Start at (-1.2, 1.0).

k    x_k                   f(x_k)        ‖∇f‖          ‖x_k − x*‖
─    ────────────           ──────────    ──────────    ───────────
0    (-1.200,  1.000)       24.20         232.9          2.200
1    (-1.179, -1.389)       9.92          146.0          3.077 (line search backtracks)
2    (-1.020,  1.020)       4.08            7.6          2.020
3    ( 0.764,  0.582)       0.064           0.74         0.481
4    ( 0.995,  0.989)       3 × 10⁻⁴        0.030        0.011
5    ( 0.99999, 0.99998)    2 × 10⁻⁸        0.0003       1.4 × 10⁻⁴
6    ( 1.0000000, 1.0000000) 8 × 10⁻²⁰     2 × 10⁻¹⁰     ~10⁻⁹

Six steps from the standard starting point land essentially on the minimum. Steepest descent on the same problem needs 5000+ iterations to reach the same accuracy; conjugate gradient ~50; L-BFGS ~30. Full Newton is overkill for n = 2, but it shows the squaring of error: ‖x_3 − x*‖ = 0.481, ‖x_4 − x*‖ = 0.011 (roughly 0.481² · constant), ‖x_5 − x*‖ = 1.4 × 10⁻⁴ (roughly 0.011²). Each step is squaring.

Newton vs first-order vs quasi-Newton

	Steepest Descent	Conjugate Gradient	L-BFGS	Newton
Convergence (local)	Linear, rate (κ−1)/(κ+1)	Linear, rate (√κ−1)/(√κ+1)	Superlinear	Quadratic
Per-iter cost	1 ∇f	1 ∇f	2 ∇f + O(m n)	1 ∇f + 1 H + O(n³)
Memory	O(n)	O(n)	O(m n), m ≈ 10	O(n²)
Needs Hessian	No	No	No (gradient diffs)	Yes
Convex non-quadratic rate	O(1/k)	Nonlinear CG	Superlinear	Quadratic
Globalisation	Easy (line search)	Easy (line search)	Easy (line search)	Trust region / damping
Real-world use	SGD / Adam stems	Sparse SPD systems	Default for large smooth f	IRLS, interior point, SQP

The cost gap matters. Newton's quadratic rate is unbeatable per iteration, but each iteration is O(n³) and O(n²) memory. For n = 10⁴ that's 10¹² flops and 10⁸ memory entries — feasible. For n = 10⁹ (neural nets, large statistical models) it's impossible, and L-BFGS or Hessian-free Newton-CG is the answer.

Trust regions and damping

The pure Newton step assumes the quadratic model is accurate. Far from the minimum it isn't, and full Newton can overshoot, oscillate, or diverge. Two standard globalisation strategies:

Line search. Take a fraction α ∈ (0, 1] of the Newton direction: x_{k+1} = x_k + α p_k. Choose α to satisfy the Armijo condition f(x_k + α p_k) ≤ f(x_k) + c · α · ∇f(x_k)ᵀ p_k. Backtracks until found. Near the minimum α = 1 satisfies the condition and full quadratic rate kicks in.
Trust region. Restrict the step to a ball ‖p‖ ≤ Δ_k where Δ_k is the trust-region radius. Solve the constrained subproblem min q(p) s.t. ‖p‖ ≤ Δ_k by dogleg, CG-Steihaug, or exact methods. Update Δ_k based on the actual-vs-predicted reduction ratio. Robust on indefinite Hessians since the constraint keeps the subproblem well-posed.

Modern industrial codes (IPOPT for interior-point, KNITRO, fmincon's interior-point, scipy.optimize trust-ncg) almost universally use trust regions. They handle indefinite Hessians, ill-conditioned problems, and starting points far from the solution. Line search is simpler and works fine on convex problems with mild conditioning.

Where Newton's method actually runs

Generalised linear models in statistics. R's glm(), Python's statsmodels, Stata's glm, SAS PROC GENMOD all use Iteratively Reweighted Least Squares — a Newton method on the negative log-likelihood. Logistic regression, Poisson regression, Gamma regression are all fitted this way. Converges in 6–10 iterations to machine precision on well-conditioned problems.
Interior-point methods for LP and QP. Karmarkar's 1984 algorithm and its modern primal-dual descendants (Mehrotra 1992) apply Newton's method to a sequence of barrier-augmented problems. Inside CPLEX, Gurobi, MOSEK, HiGHS and OSQP, the inner loop is a Newton solve on the KKT system. Achieves polynomial complexity O(n^3.5 L) for LP — the result that broke the simplex method's monopoly in 1984.
SQP for nonlinear programming. Sequential Quadratic Programming approximates a nonlinear constrained problem by a sequence of QP subproblems, each solved by Newton on the KKT conditions. SNOPT, NPSOL, MATLAB's fmincon-sqp use it. Standard for aircraft trajectory optimisation and robotics inverse kinematics.
PDE-constrained optimisation. Optimal control, shape optimisation, full-waveform inversion. The Hessian comes from solving the adjoint PDE; Newton with a matrix-free CG inner solve (Newton-Krylov) is the standard. Used in seismic imaging at oil companies and in aerodynamic shape optimisation at Boeing and Airbus.
Hessian-free deep learning. Martens (2010) demonstrated Newton on neural-net training via Hessian-vector products computed by R-operators. Modern second-order optimisers (Shampoo, K-FAC) follow the same template — Newton-like updates with structured Hessian approximations.

Variants

Damped Newton. Replace full step with α p_k for α < 1 selected by line search.
Modified Newton. Replace H with H + λ I (Levenberg-Marquardt-style) whenever H is not positive-definite. Bridges Newton (λ = 0) and steepest descent (λ → ∞).
Gauss-Newton. For nonlinear least squares f(x) = ½ ‖r(x)‖², approximate H ≈ JᵀJ where J = ∂r/∂x. Drops the second-derivative term, much cheaper. Standard for nonlinear regression and computer-vision bundle adjustment.
Levenberg-Marquardt. Gauss-Newton with damping: solve (JᵀJ + λI) p = -Jᵀr. Robust default for nonlinear least squares (Marquardt 1963). scipy.optimize.least_squares uses it.
Cubic regularisation (Nesterov-Polyak 2006). Add ⅙ M ‖p‖³ to the quadratic model. Guarantees descent even with indefinite Hessian and gives state-of-the-art global convergence rates.
Newton-CG / Newton-Krylov. Don't form H; solve H p = −g with CG using Hessian-vector products. The standard for very large problems.

Common pitfalls

Skipping globalisation. Plain Newton from a far-away start can diverge. Always wrap with a line search or trust region.
Indefinite Hessian. If H has a negative eigenvalue (non-convex f), the Newton step can move uphill in that direction. Use modified Newton, trust region, or cubic regularisation.
Forming H when n is too large. O(n²) memory and O(n³) factorisation kill any method for n ≥ 10⁵ unless H is sparse. Use Hessian-free / Newton-CG instead.
Wrong-precision Hessian. Finite-difference Hessians need step size h ≈ ε^(1/3) for cubic convergence; too small h causes catastrophic cancellation. Use automatic differentiation when available.
Confusing the iteration count with cost. Six Newton iterations can be more expensive than 600 L-BFGS iterations on a large problem. Iteration counts are not directly comparable across methods.

Why Newton remains the gold standard

For small-to-medium smooth optimisation, no other method comes close to Newton's local convergence rate. Five or six iterations to machine precision is unbeatable, and it falls out of nothing more than fitting a quadratic and jumping to its minimum. The cost is one Hessian and one linear solve per iteration — affordable for n ≤ 10⁴ — and that affordability is why Newton is the inner engine of essentially every interior-point method, every SQP, every IRLS-based GLM fitter, every trust-region nonlinear solver, and every second-order ML optimiser.

The places Newton doesn't run are precisely the places where forming the Hessian is too expensive — deep learning with N = 10⁹ parameters, online streaming problems, real-time control. There quasi-Newton (BFGS, L-BFGS) takes over, paying a memory and per-iteration cost reduction for superlinear-instead-of-quadratic convergence. Both methods descend from the same 1669 Newton-Raphson idea applied here to the gradient. Three and a half centuries later, the curvature-fitting step Newton wrote down is still the fastest local optimisation move we have.

Frequently asked questions

How does Newton's method for optimisation differ from Newton-Raphson for root finding?

Newton-Raphson solves F(x) = 0 by linearising F. Newton-for-optimisation minimises f(x) by setting ∇f(x) = 0 — i.e. it applies Newton-Raphson to the gradient. So the optimisation update is x_{k+1} = x_k − H(x_k)⁻¹ ∇f(x_k) where H = ∇²f is the Hessian, playing the role of the Jacobian of ∇f. The mathematics is identical; the framing differs because optimisation needs a positive-definite Hessian to ensure the stationary point is a minimum rather than a saddle. In practice every modern optimisation Newton method adds a check that the Hessian is positive-definite and modifies it (e.g. add λ I) if not.

What is quadratic convergence in concrete numbers?

Near a local minimum where H is positive-definite and Lipschitz continuous, Newton's method satisfies ‖x_{k+1} − x*‖ ≤ K ‖x_k − x*‖² for some constant K. Concretely: if ‖x_k − x*‖ = 10⁻³ then ‖x_{k+1} − x*‖ is roughly K · 10⁻⁶, three digits become six. From a starting error of 0.1, six Newton iterations typically reach machine precision: 0.1 → 0.01 → 10⁻⁴ → 10⁻⁸ → 10⁻¹⁶. Even on a problem starting at 1.0, six iterations reach 50 digits of accuracy in exact arithmetic. Compare steepest descent on the same problem: 1000+ iterations for the same precision on ill-conditioned cases.

Why is the Hessian expensive?

Three costs. (1) Computation: the Hessian has n² entries; each entry is a second derivative. For a neural net with N = 10⁹ parameters, even storing H is impossible (8 exabytes for doubles). (2) Storage: O(n²) memory. (3) Factorisation: solving H Δ = −∇f by direct Cholesky is O(n³) flops per iteration. The standard fix is matrix-free Newton: don't form H, supply a routine that computes H · v cheaply (via automatic differentiation or finite differences), and solve H Δ = −∇f with CG. This is the basis of Hessian-free deep learning (Martens 2010) and CG-Steihaug trust regions.

What happens at saddle points and indefinite Hessians?

Plain Newton goes to a stationary point — ∇f = 0 — but cannot distinguish minima from saddles or maxima. At a saddle, H has at least one negative eigenvalue and the Newton step can move uphill in the negative-curvature direction. Three standard fixes. (1) Modified Newton: replace H with H + λI where λ is large enough to make H positive-definite. (2) Trust region: restrict the step to a ball where the quadratic model is trusted, and pick the step that minimises the model in that ball (subproblem solved with CG-Steihaug or dogleg). (3) Cubic regularisation (Nesterov-Polyak 2006): add ⅙ M ‖x − x_k‖³ to the model, which makes the subproblem well-posed even with indefinite H.

When should I use Newton's method vs L-BFGS?

Use full Newton when n is small (≤ 10⁴), the Hessian is sparse or has special structure, and quadratic convergence justifies the cost — typical in PDE-constrained optimisation, structural mechanics, and SQP for nonlinear programming. Use L-BFGS when n is large (10⁶ and beyond), gradients are cheap, the Hessian is dense and unstructured, and you can tolerate superlinear (rather than quadratic) convergence. Most machine-learning workloads, image reconstruction, and large MAP-estimation problems use L-BFGS. scipy.optimize.minimize default for unconstrained problems is L-BFGS-B.

What is Iteratively Reweighted Least Squares?

IRLS is Newton's method applied to maximum likelihood estimation of generalised linear models (logistic regression, Poisson regression, etc.). The Hessian of the negative log-likelihood factors as XᵀWX where X is the design matrix and W is a diagonal weight matrix that changes each iteration. The Newton step becomes a weighted-least-squares solve: β_{k+1} = (XᵀW_k X)⁻¹ Xᵀ W_k z_k where z is a working response. R's glm() and Python's statsmodels GLM fitter run this loop. It typically converges in 6–10 iterations to machine precision on well-posed problems — quadratic-convergence Newton inside a statistics package.

Why is Newton's method called second-order?

It uses both first-order information (∇f, the gradient) and second-order information (H = ∇²f, the Hessian). First-order methods (steepest descent, SGD, Adam) use only ∇f. The second-order step is informed by curvature: it knows whether the function is sharply curved upward (large eigenvalues of H, take a short step) or nearly flat (small eigenvalues, take a long step). This curvature-awareness is what gives the O(√κ)→1 conditioning behaviour and the quadratic local rate. Methods that only approximate H (BFGS, L-BFGS, Gauss-Newton, Hessian-free) are called quasi-Newton — they sit between first- and second-order in cost and convergence.