Numerical Analysis

Newton-Raphson Method

Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) and quadratically converge to a root — the method behind every calculator's square-root button

The Newton-Raphson method solves f(x) = 0 by drawing the tangent line at the current guess and using its x-intercept as the next guess. Near a simple root the error squares at every step — three or four iterations is usually enough to hit machine precision, which is why every modern CPU implements division and square-root as Newton iterations in hardware.

  • Newton's draft1669
  • Raphson's iteration1690
  • Convergence order2 (quadratic)
  • Iterations to 16 digits~5 (from 1-digit guess)
  • Per-step cost1 f-eval + 1 f'-eval

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The tangent-line idea

Suppose f is differentiable and you have an estimate x_n that's close to a true root r where f(r) = 0. Linearise f at x_n by drawing its tangent line. The tangent has slope f'(x_n) and value f(x_n) at x = x_n, so its equation is

y = f(x_n) + f'(x_n) · (x - x_n)

Setting y = 0 and solving for x gives the tangent's x-intercept. Call that the next guess:

x_{n+1} = x_n - f(x_n) / f'(x_n)

That is the entire Newton-Raphson iteration. Geometrically it says: instead of solving the curve f(x) = 0, solve the easier linear approximation f(x_n) + f'(x_n)(x − x_n) = 0, then repeat with the new estimate. Because the tangent agrees with f to first order, the new error is quadratic in the old error: the better your guess, the dramatically better the next one.

Why convergence is quadratic

Let r be a simple root, meaning f(r) = 0 and f'(r) ≠ 0. Taylor-expand f around r:

f(x_n) = f'(r) (x_n - r) + ½ f''(r) (x_n - r)² + O((x_n - r)³)
f'(x_n) = f'(r) + f''(r) (x_n - r) + O((x_n - r)²)

Plug into the iteration formula and keep terms up to second order in the error e_n = x_n − r. After algebra,

e_{n+1} = (f''(r) / 2 f'(r)) · e_n²  +  O(e_n³)

The next error is proportional to the square of the current error. Concretely, if e_n = 10⁻⁴, then e_{n+1} ≈ (constant) · 10⁻⁸ — eight digits of accuracy from four. Once you are within a few digits of the root, machine precision (~16 digits in double precision) arrives in two or three more steps. Bisection by contrast halves the error per step (linear convergence), gaining one binary digit per iteration; Newton turns that into a doubling rule.

The proportionality constant K = |f''(r) / 2f'(r)| is the asymptotic error constant. For ill-conditioned problems with large second derivative or small first derivative near the root, K can be big, slowing the early phase of convergence — but the squaring still wins eventually.

Worked example: computing √2

To find √2, solve f(x) = x² − 2 = 0. Then f'(x) = 2x and the iteration becomes

x_{n+1} = x_n - (x_n² - 2) / (2 x_n) = ½ (x_n + 2/x_n)

This is the ancient Babylonian method for square roots — Newton-Raphson rediscovers it as a special case. Start from a deliberately weak guess x₀ = 1:

n   x_n                          digits correct
─   ──────────────────────────   ──────────────
0   1.000000000000000000          0
1   1.500000000000000000          0  (correct value: 1.4142...)
2   1.416666666666666666          2
3   1.414215686274509803          5
4   1.414213562374689910         11
5   1.414213562373095049         16  (machine precision)

From a starting error of 0.41 the algorithm reaches the limit of double-precision arithmetic in five iterations. Each step roughly doubles the number of correct digits: 0 → 2 → 5 → 11 → 16. The digit count doesn't quite double each time near machine precision because rounding noise dominates the residual.

The same trick generalises: to compute √a from any positive starting guess, iterate x_{n+1} = ½(x_n + a/x_n). To compute 1/a (no division needed) iterate x_{n+1} = x_n (2 − a x_n) — a derivation: solve f(x) = 1/x − a = 0 gives x_{n+1} = x_n (2 − a x_n). This is the formula CPUs use to compute reciprocals from a hardware reciprocal-estimate.

Newton vs bisection vs secant

BisectionNewton-RaphsonSecant
Convergence order1 (linear)2 (quadratic)≈ 1.618 (golden ratio)
Per-step work1 f-eval1 f-eval + 1 f'-eval1 f-eval
Needs derivativeNoYesNo
Needs bracketYes (sign change)No, but one good guessTwo starting guesses
Guaranteed to convergeYes (continuous f)Only locally near simple rootOnly locally
Typical iterations to 10⁻¹⁵~50~5~7
Best whenYou need certainty, not speedf' is cheap, guess is goodf' is expensive or unknown

In practice, production root-finders combine the two: Brent's method uses an inverse quadratic interpolation step (similar to secant) most of the time, but falls back to bisection if the quadratic step would step outside the bracket. The result is bisection's guarantee with most of secant's speed. SciPy's scipy.optimize.brentq is the canonical implementation.

Multivariate Newton

The same idea generalises to vector functions. Given F: ℝⁿ → ℝⁿ, solve F(x) = 0 by iterating

x_{n+1} = x_n - J(x_n)⁻¹ · F(x_n)

where J(x) is the Jacobian matrix of partial derivatives ∂F_i / ∂x_j. In code you never form the inverse explicitly — you solve the linear system J(x_n) · Δx = −F(x_n) and update x_{n+1} = x_n + Δx. Each iteration costs O(n³) for dense Jacobians, dropping to roughly O(n) per iteration when J is sparse and you use a sparse LU or Krylov solver.

Multivariate Newton retains quadratic convergence near simple zeros (where J(r) is invertible). Practical industrial codes wrap it in a line search or trust region globalisation to make it converge from far-away starts: shorten Δx if it would increase ‖F‖, or restrict the step to where the linear model is trusted. Without such safeguards, plain Newton on stiff problems can diverge or oscillate.

Inside the CPU

Modern CPUs do not implement floating-point division as a long sequence of subtraction-and-shift the way schoolbook division would. They use a hardware reciprocal-estimate and one or two Newton refinement steps. Intel's RCPSS instruction returns 1/a accurate to about 11 bits in two cycles; one Newton iteration squares that to ~22 bits, and a second to full single-precision (24 bits) or close to double (53 bits) for RCPPD-style microcode. AMD, ARM, and PowerPC use almost identical schemes.

Square-root is the same idea: the hardware returns a low-precision RSQRTSS (approximate 1/√a), then refines via the iteration y_{n+1} = ½ y_n (3 − a y_n²), which converges quadratically to 1/√a. Multiplying by a gives √a. The Quake III "fast inverse square root" trick (a famous bit-cast hack from John Carmack's team in 1999) used exactly one of these Newton steps with a magic-constant initial guess to compute lighting normals about 4× faster than the hardware path of the time.

Where Newton goes wrong

Newton's method is fast but not robust. Its three classic failure modes:

  1. Zero derivative. If f'(x_n) is near zero, the tangent is nearly horizontal and the next iterate is far away — sometimes infinitely far. Solving f(x) = x³ − 2x + 2 from x₀ = 0 gives x₁ = 1, then x₂ = 0 again, then x₃ = 1: a perfect 2-cycle that never converges. Detect cycles and restart, or fall back to bisection.
  2. Multiple roots. At a root of multiplicity k > 1, both f and f' vanish, so the iteration is 0/0. With a careful limit, Newton still converges but only linearly (gaining one digit per step) instead of quadratically. Modified Newton (multiply the step by k) restores quadratic convergence if you know the multiplicity.
  3. Wrong basin. If the function has several roots, the basin of attraction for each root can be intricate — fractal in the complex case. Starting just slightly off your intended root can converge to a different root entirely.

Where Newton-Raphson shows up

  • CPU floating-point division and square-root. Intel x86 since the Pentium, ARM since ARMv7, AMD's Zen architectures, NVIDIA GPUs and Apple Silicon all use Newton iterations on hardware seed estimates. Every floating-point divide on every desktop and phone in the world runs a Newton step inside.
  • Maximum likelihood estimation in statistics. Logistic regression, Poisson regression, and most generalized linear models are fit by Newton's method on the log-likelihood (called Iteratively Reweighted Least Squares). R's glm() and Python's statsmodels default to it.
  • Implicit time integration of stiff ODEs. Backward Euler and BDF methods reduce each time step to a nonlinear system F(y_{n+1}) = 0 solved by multivariate Newton. Used in chemistry kinetics (CHEMKIN), circuit simulation (SPICE), and structural mechanics (Abaqus).
  • Computer graphics ray-surface intersection. Tracing a ray against an implicit surface or NURBS patch reduces to finding the smallest t ≥ 0 with F(t) = 0, solved by Newton's method or its bracketed variants. Pixar's RenderMan and Blender's Cycles renderer both ship Newton root-finders.
  • Internal rate of return in finance. Solving Σ CF_t / (1 + r)^t = 0 for r is a polynomial root problem run by Newton inside Excel's IRR() function and Bloomberg's YIELD().

Variants and extensions

  • Damped Newton. Replace the full step with α · Δx for some 0 < α ≤ 1 chosen by line search to ensure ‖F‖ decreases. Required for global convergence on hard nonlinear systems.
  • Quasi-Newton (BFGS, Broyden). Approximate the Jacobian (or Hessian, for optimisation) by a low-rank update from previous steps. Avoids the O(n²) cost of recomputing J each iteration. Standard in machine learning and large-scale nonlinear least squares.
  • Modified Newton for multiple roots. Replace x_{n+1} = x_n − f/f' with x_{n+1} = x_n − k · f/f' where k is the suspected multiplicity. Restores quadratic convergence at multiple roots.
  • Halley's method. Uses f, f' and f'' to achieve cubic convergence: digits triple per step. Twice the work per step but useful when high-order derivatives are cheap.
  • Newton-Krylov. Solves the linear system J Δx = −F via an iterative Krylov method (GMRES, BiCGStab) rather than direct factorisation. Critical for very large or matrix-free problems where storing J is impossible.

Common pitfalls

  • No convergence test. A real implementation must check that ‖F(x_n)‖ has actually shrunk and not just oscillated. Naïvely stopping at ‖Δx‖ < tol can be fooled at saddle points and near singular Jacobians.
  • Numerical derivative blowups. Approximating f' by (f(x + h) − f(x))/h with too-small h causes catastrophic cancellation and nonsense slopes. h ≈ √(machine epsilon) × |x| is a safer default; use complex-step or automatic differentiation if you need full precision.
  • Forgetting the bracketing fallback. Pure Newton can diverge on hard problems. Always wrap it with bisection or trust-region globalisation if robustness matters more than peak speed.
  • Counting iterations instead of work. Each Newton iteration costs an f-evaluation plus an f'-evaluation (or a Jacobian factorisation). For expensive f, the secant method or a quasi-Newton update can be cheaper overall even though it needs more iterations.
  • Trusting the result without checking. Newton can converge to a different root than intended, or to a saddle point in optimisation contexts. Always verify f(x_∞) is small and that you have the root you wanted.

Why Newton's method endures

Newton-Raphson is more than three centuries old and still the first thing every numerical analyst reaches for. The reason is the squaring of error per step: a 60-iteration bisection becomes a 5-iteration Newton run, and that ratio holds across most of practical scientific computing. Combine it with a robust globalisation (line search or trust region) and a good starting guess, and you have an algorithm that's used inside every CPU's divider, every nonlinear least-squares fitter, every implicit ODE solver, and every solver of the dozens of nonlinear systems hidden in a modern engineering simulation.

Calling something "the" method of numerical analysis is rarely fair, but Newton-Raphson comes closest. Bisection is more robust, the secant method is cheaper per step, but for raw rate of convergence on smooth problems with a derivative, nothing beats the squared-error pattern Newton wrote down in 1669.

Frequently asked questions

What does quadratic convergence actually mean?

Near a simple root r where f'(r) ≠ 0, the error of Newton's method satisfies |x_{n+1} - r| ≤ K · |x_n - r|² for some constant K. So if you currently have d digits of accuracy, the next iterate has roughly 2d digits — the digits double per step. Five iterations starting at a one-digit guess can produce thirty-two-digit accuracy. This is dramatically faster than the linear convergence of bisection, which gains only one bit per step.

When does Newton-Raphson fail?

Three main failure modes. First, if f'(x_n) is zero or near-zero, the tangent is horizontal or nearly so and the next iterate shoots to infinity — the algorithm divides by zero. Second, at multiple roots where f'(r) = 0, convergence drops from quadratic to linear. Third, the iteration can converge to a different root than expected, or oscillate between two values, depending on the starting point — the basin of attraction can be fractal in the complex plane.

How is Newton-Raphson used inside computer hardware?

Most modern CPUs implement floating-point division and square-root as Newton iterations on a hardware-coded initial reciprocal estimate. Intel's pentium-era SRT division algorithm, AMD's reciprocal-estimate instruction RCPSS, and ARM VRECPE all return a low-precision starting guess, and one or two Newton-style refinement steps double the precision. A double-precision divide that completes in 14 cycles is doing about three Newton refinements internally.

How do I generalise Newton's method to several variables?

For a vector function F: ℝⁿ → ℝⁿ the iteration becomes x_{n+1} = x_n - J(x_n)⁻¹ F(x_n) where J is the Jacobian matrix of F. In practice you do not invert J — you solve the linear system J · Δx = -F(x_n) and update x_{n+1} = x_n + Δx. Each step requires computing the n × n Jacobian and solving an n × n linear system, so cost per iteration is O(n³) for dense problems.

Newton vs bisection vs secant — which should I use?

Bisection is the safest: guaranteed convergence on any continuous f with a sign change, but only one bit per step. Newton is the fastest if it converges: quadratic, doubling digits per step, but it requires f'(x) and a good starting guess. The secant method approximates f' by a finite difference between the last two iterates — it converges with order ≈1.618 (the golden ratio), faster than bisection and not needing f'. Modern root-finders like Brent's method combine bisection's safety with secant's speed.

Why is the method named after both Newton and Raphson?

Newton described a similar iteration in his 1669 De analysi for solving polynomial equations, but his version was an algebraic substitution rather than the modern derivative-based formula. Joseph Raphson published a cleaner, iterative version in 1690, recognisably modern. Thomas Simpson generalised it to non-polynomial functions in 1740. The name 'Newton-Raphson' acknowledges both early authors; some texts use just 'Newton's method' since Newton predates Raphson.

Can Newton-Raphson find complex roots?

Yes — the iteration formula extends verbatim to complex inputs. Picking a complex starting point can converge to one of several roots; which one depends sensitively on the start. Plotting the basin of attraction over the complex plane produces the famous Newton fractal, where the boundary between basins has Hausdorff dimension greater than one. This is the source of John Hubbard's famous problem and Plotnick's 1980s computer art.