Numerical Optimization
Quasi-Newton (BFGS)
Approximate the Hessian from gradient differences for superlinear convergence without forming H
BFGS builds an approximate Hessian on the fly from successive gradient differences via a rank-2 update. L-BFGS stores only the last 10 vector pairs and runs a two-loop recursion to apply the implicit inverse — Newton-style scaling at gradient-only cost.
- DiscoveredBroyden, Fletcher, Goldfarb, Shanno (1970)
- Update typeRank-2, symmetric positive-definite
- Secant equationB_{k+1} s_k = y_k
- ConvergenceSuperlinear (Powell 1976)
- L-BFGS memorym ≈ 10 vector pairs, O(m·n)
- Default inscipy.optimize, Stan, TFP, Theano
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The problem quasi-Newton solves
Newton's method for minimising a smooth function f: ℝⁿ → ℝ takes the step p_k = − H(x_k)⁻¹ ∇f(x_k) where H = ∇²f is the Hessian. The Hessian is what makes Newton fast — quadratic local convergence — but also what makes it expensive: O(n²) memory and O(n³) factorisation per iteration. For n ≥ 10⁵ this is infeasible.
Quasi-Newton methods replace the true Hessian with an approximation B_k built up from information that's already available: function values and gradients at previous iterates. The approximation is cheap to maintain, cheap to apply, and good enough that the method still converges superlinearly — slower than Newton's quadratic rate but dramatically faster than steepest descent's linear rate. BFGS is the canonical quasi-Newton update; L-BFGS is its limited-memory variant. Between them they are the default smooth-optimisation method in essentially every modern scientific stack.
Four papers, one year, one method
The BFGS update was discovered four times independently in 1970:
- Charles Broyden, "The Convergence of a Class of Double-rank Minimization Algorithms," IMA Journal of Applied Mathematics, 1970.
- Roger Fletcher, "A New Approach to Variable Metric Algorithms," The Computer Journal, 1970.
- Donald Goldfarb, "A Family of Variable-metric Methods Derived by Variational Means," Mathematics of Computation, 1970.
- David Shanno, "Conditioning of Quasi-Newton Methods for Function Minimization," Mathematics of Computation, 1970.
All four authors derived essentially the same rank-2 symmetric positive-definite update from slightly different starting principles. The combined initials gave the name BFGS. The older Davidon-Fletcher-Powell (DFP) method (1959, 1963) was the first practical quasi-Newton, and BFGS is its dual — derived by swapping the roles of step and gradient-difference in the secant equation. Empirically BFGS is more numerically robust than DFP and won the popularity contest. By the late 1970s BFGS was the standard.
The core idea: secant equation
Suppose we have iterate x_k with gradient g_k = ∇f(x_k), take some step to x_{k+1}, and observe the new gradient g_{k+1}. Define
s_k = x_{k+1} − x_k (step taken)
y_k = g_{k+1} − g_k (gradient difference)
For a quadratic f, the Hessian satisfies H · s_k = y_k exactly. For general smooth f, the relation holds to first order in s_k. The secant equation demands the new Hessian approximation match this finite-difference observation:
B_{k+1} · s_k = y_k
This is n equations in n² unknowns — wildly underdetermined. BFGS picks the unique rank-2 update that satisfies the secant equation, preserves symmetry and positive-definiteness, and minimises a particular weighted distance from B_k. The explicit formula for the inverse-Hessian H_k = B_k⁻¹ (which is what we actually want, since the step is p = −H g) is
H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ
where ρ_k = 1 / (y_kᵀ s_k)
Symmetry of H_k is preserved structurally. Positive-definiteness requires y_kᵀ s_k > 0, which the Wolfe line search enforces automatically. The whole iteration is then: pick p = −H_k g_k, line-search for α_k, set x_{k+1} = x_k + α_k p, update H_k via the formula above. No second derivatives, no matrix factorisations, just vector operations and one rank-2 outer-product matrix update per step.
Full BFGS algorithm
x_0 = starting point; H_0 = I (or a problem-specific diagonal)
for k = 0, 1, 2, …
g_k = ∇f(x_k)
if ‖g_k‖ < tol: stop
p_k = − H_k · g_k ← search direction
line search for α_k satisfying Wolfe conditions
x_{k+1} = x_k + α_k p_k
s_k = x_{k+1} − x_k; y_k = ∇f(x_{k+1}) − g_k
ρ_k = 1 / (y_kᵀ s_k)
H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ
Per-iteration cost is dominated by the O(n²) matrix-vector products in the update. Memory is O(n²) for H_k. For n = 10³ this is fine; for n = 10⁶ it is impossible. That brings us to L-BFGS.
L-BFGS: limited-memory variant
L-BFGS (Liu and Nocedal 1989) never forms H_k. Instead it stores the most recent m pairs (s_i, y_i), i = k−m, …, k−1, and computes H_k · g_k implicitly by a two-loop recursion that does m vector inner products forward, then m vector outer products backward — total O(m n) work. The implicit H_k corresponds to "start from a simple H_0 (e.g. γ I) and apply m BFGS updates with the stored pairs."
L-BFGS two-loop recursion (computes r = H_k g)
q = g
for i = k−1 downto k−m:
α_i = ρ_i s_iᵀ q
q = q − α_i y_i
r = γ q ← γ = (s_{k−1}ᵀ y_{k−1}) / (y_{k−1}ᵀ y_{k−1})
for i = k−m to k−1:
β = ρ_i y_iᵀ r
r = r + (α_i − β) s_i
return r ← r = H_k · g
Memory is O(m n). For typical m = 10 and n = 10⁶, that's 20 length-10⁶ vectors — about 160 MB at double precision. Per-iteration work is 2 m n flops plus the gradient evaluation. The trade-off is that "limited memory" means BFGS only sees the last m search directions; earlier curvature information is forgotten, slowing convergence on problems where the Hessian varies a lot. For typical smooth ML problems with m = 5-20, L-BFGS is empirically nearly indistinguishable from full BFGS on convergence speed, with O(n) memory and O(n) per-iteration work.
Worked example: 2D BFGS
Minimise f(x, y) = (x − 1)² + 10 (y − 2)² (a stretched bowl with min at (1, 2)). Start at (0, 0). Initial H_0 = I (no problem-specific scaling). Wolfe-condition line search.
k x_k f(x_k) p_k = -H_k ∇f ‖x_k − x*‖
─ ────────── ────── ───────────── ──────────
0 ( 0.00, 0.00) 41.00 (2.00, 40.00) 2.236
1 ( 0.10, 1.95) 23.31 (1.80, 0.10) 1.305
2 ( 1.00, 2.00) 0.00 (0.00, 0.00) 0.000 ← exact in 2 steps
On a quadratic, BFGS in n dimensions converges in at most n + 1 steps in exact arithmetic — here n = 2, so 2 steps suffice. After the first step, H_1 has been updated by the rank-2 formula and effectively encodes the diagonal scaling (1, 1/10) that the true Hessian would imply. The second step is along the true Newton direction and lands exactly on the minimum. On non-quadratic f, the same superlinear behaviour kicks in once iterates are close enough to the minimum.
BFGS vs other smooth optimisers
| Steepest Descent | Conjugate Gradient | L-BFGS | Full BFGS | Newton | |
|---|---|---|---|---|---|
| Convergence (local) | Linear, rate (κ−1)/(κ+1) | Linear, rate (√κ−1)/(√κ+1) | Superlinear (m large) | Superlinear | Quadratic |
| Per-iter work | 1 ∇f | 1 ∇f | 2 ∇f + O(m n) | 1 ∇f + O(n²) | 1 ∇f + 1 H + O(n³) |
| Memory | O(n) | O(n) | O(m n), m ≈ 10 | O(n²) | O(n²) |
| Needs Hessian | No | No | No | No | Yes |
| Convex non-quadratic | O(1/k) | Restarts hurt | Robust, default choice | Robust, costly memory | Quadratic if H smooth |
| Stochastic OK? | SGD works fine | Sensitive | Variants exist (SQN) | Sensitive | No |
| Typical use | DL via SGD/Adam | Sparse SPD systems | scipy default; MAP fits | Small to medium smooth | SQP, IRLS, interior point |
The L-BFGS column is the punchline. Superlinear convergence at O(n) memory and O(n) per-iteration cost, with gradient-only access — there is no other deterministic smooth-optimisation method that matches this combination. CG is comparable per iteration but only linear convergence on non-SPD problems; Newton is faster locally but needs the Hessian. Adam and friends are stochastic and slower asymptotically. For batch smooth optimisation of moderate-to-large n, L-BFGS wins on almost every benchmark.
Where BFGS / L-BFGS actually runs
- scipy.optimize.minimize default. The
method='L-BFGS-B'(bound-constrained L-BFGS) is the default for any minimize() call without explicit method. Used in tens of thousands of scientific Python pipelines for parameter fitting, MAP inference, and miscellaneous nonlinear regression. - Stan, PyMC, TensorFlow Probability. All three Bayesian-inference platforms use L-BFGS as the inner optimiser for maximum a-posteriori (MAP) inference and for finding initial values for Hamiltonian Monte Carlo. Posterior mode finding before MCMC is L-BFGS.
- Theano / PyTorch / TensorFlow second-order ops. When users want a true line-search-style optimiser (not SGD), PyTorch's
torch.optim.LBFGSand TensorFlow'stf.contrib.opt.ScipyOptimizerInterfaceboth wrap L-BFGS. Used in style-transfer (Gatys 2015), DeepDream, neural-net pretraining. - Logistic regression and conditional random fields. CRFs (Lafferty 2001) are trained by maximising regularised log-likelihood; the standard inner optimiser is L-BFGS, with billions of features in production NLP systems like Stanford NER and OpenNLP.
- Variational inference. Maximising the ELBO over variational parameters is a smooth nonconvex problem; libraries like Edward and Pyro use L-BFGS or Adam, with L-BFGS preferred when batch sizes are large.
- Atmospheric and oceanographic data assimilation. 4D-Var data assimilation (ECMWF, the Met Office, JMA) cycles L-BFGS on 10⁸–10⁹ control variables every six hours to fit numerical-weather-prediction models to observations.
Variants and extensions
- L-BFGS-B (Byrd-Lu-Nocedal-Zhu 1995). Adds simple box constraints l_i ≤ x_i ≤ u_i to L-BFGS via active-set projection. The scipy default. Handles 90% of real-world bound-constrained problems.
- OWL-QN (Andrew & Gao 2007). Orthant-Wise Limited-memory Quasi-Newton: extends L-BFGS to L1-regularised problems (Lasso, sparse logistic regression). The standard for large sparse classification.
- SR1 (Symmetric Rank 1). Older rank-1 quasi-Newton update. Doesn't preserve positive-definiteness, so usually used inside trust-region methods that can tolerate indefinite Hessians.
- DFP (Davidon-Fletcher-Powell, 1959-1963). The original quasi-Newton update; dual to BFGS. Theoretically equivalent in many ways but empirically less robust.
- SQN (Stochastic Quasi-Newton, Byrd 2016). Subsampled gradient-difference quasi-Newton for stochastic problems. Active research direction for deep learning.
- Anderson Acceleration. A close cousin: extrapolates fixed-point iterates using past residuals via a quasi-Newton-like rank-m update. Standard for accelerating DFT self-consistent field, MCMC, and EM iterations.
Common pitfalls
- Skipping the Wolfe line search. BFGS positive-definiteness depends on y_kᵀ s_k > 0, which is guaranteed by the strong Wolfe condition. Backtracking-only line searches can violate it and corrupt H_k. Use a Wolfe-compliant line search routine, e.g. Moré-Thuente.
- Forgetting to scale H_0. The default H_0 = I gives terrible early steps when the gradient is in unusual units. Use γ = (s₀ᵀ y₀) / (y₀ᵀ y₀) as an initial scaling and restart fresh between phases.
- Using L-BFGS on stochastic gradients. Gradient-difference noise corrupts the secant equation. Use SQN or large-batch L-BFGS; for purely stochastic settings stick with Adam/SGD.
- Treating m too large as free. Beyond m = 20-30, L-BFGS gains nothing and adds memory cost. Empirically m = 5-10 is the sweet spot on most problems.
- Using L-BFGS on non-smooth f. BFGS assumes Lipschitz gradient. On L1-regularised problems with discontinuities, use OWL-QN or proximal gradient methods.
Why L-BFGS is the workhorse
L-BFGS occupies a unique sweet spot in the space of optimisers. It needs only gradients, scales to n = 10⁹, converges superlinearly, uses O(n) memory, requires no problem-specific tuning, and works on essentially any smooth f. Newton is faster locally but unaffordable past n = 10⁴. CG is comparable per step but requires SPD problems and degrades on non-quadratic. Steepest descent and its descendants (SGD, Adam) are cheaper per step but linear convergence. L-BFGS is the most-used scientific optimiser in the world by sheer breadth of application.
Three numerical-analysts in 1970 spotted a rank-2 symmetric positive-definite update from the secant equation, two more added the two-loop recursion in 1989, and a fourth gave it box-constraint support in 1995. Fifty-five years after the original four BFGS papers, every Bayesian-inference platform, every scientific-Python pipeline, every operational weather-forecasting centre runs L-BFGS as its default smooth optimiser. The combination of gradient-only access, O(n) cost, and superlinear convergence has no equal.
Frequently asked questions
What is the secant equation BFGS enforces?
Let s_k = x_{k+1} - x_k and y_k = ∇f(x_{k+1}) - ∇f(x_k). The secant equation requires the new Hessian approximation B_{k+1} to satisfy B_{k+1} s_k = y_k. In one dimension this is just B = y/s — the finite-difference approximation of f''. In higher dimensions n equations in n² unknowns gives a hugely underdetermined problem; BFGS picks the unique rank-2 symmetric positive-definite update with smallest weighted distance to B_k. The same constraint appears in DFP (an older method) and Broyden's update for non-symmetric problems.
Why is BFGS superlinear?
Powell (1976) proved that under standard smoothness assumptions and Wolfe line search, BFGS iterates converge superlinearly: lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖ = 0. The intuition is that the approximate Hessian B_k → ∇²f(x*) along the directions of recent steps, so eventually B_k acts like the true Hessian on the steps BFGS actually takes — and we know full Newton has quadratic convergence. BFGS is slightly slower than full Newton because B_k matches the true Hessian only along the step directions, not globally. In practice this means BFGS reaches 12-digit accuracy in 15-25 iterations on smooth problems, vs 6 for Newton.
What does L-BFGS save vs full BFGS?
Full BFGS stores the n × n inverse-Hessian approximation H_k = B_k⁻¹ explicitly: O(n²) memory. For n = 10⁶ that's 8 TB at double precision, impossible. L-BFGS (Liu and Nocedal 1989) instead stores only the last m ≈ 10 pairs (s_k, y_k) and computes H_k g_k by a two-loop recursion in O(m n) work without ever forming H_k. Memory drops to O(m n) ≈ 80 MB for n = 10⁶ at m = 10, and per-iteration cost drops from O(n²) to O(m n). The trade-off is slower convergence — superlinear becomes approximately linear once you outrun the m most recent directions. Standard m: 5-20.
Why is BFGS the default in scipy.optimize?
scipy.optimize.minimize default is L-BFGS-B (the bound-constrained variant of L-BFGS). The choice reflects what works on the broadest range of problems with the least user tuning. BFGS-style methods need only the gradient (no Hessian), use O(n) memory at L-BFGS scale, achieve superlinear convergence, handle non-quadratic objectives robustly via line search, and are theoretically and empirically the most reliable smooth optimiser. Newton needs Hessians most users can't supply; CG breaks down on non-quadratic non-SPD problems; gradient descent is too slow. L-BFGS is the workhorse compromise.
What is the rank-2 update formula?
For the inverse-Hessian form H_k ≈ B_k⁻¹, the BFGS update is H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ where ρ_k = 1/(y_kᵀ s_k). This is a rank-2 modification: H_{k+1} − H_k has rank at most 2. The form preserves symmetry and (under the Wolfe condition y_kᵀ s_k > 0) positive definiteness. Applied implicitly via the two-loop recursion, this never forms a matrix — the formula is purely in vector operations.
Where does BFGS come from historically?
Broyden (1970), Fletcher (1970), Goldfarb (1970), and Shanno (1970) independently derived the same update in four papers published the same year. They all sought symmetric positive-definite updates that solve the secant equation while minimising distance to the previous approximation in different weighted matrix norms. By a near-miracle, four researchers picked the same answer, and the combined initials gave the name BFGS. The earlier DFP method (Davidon 1959, Fletcher-Powell 1963) was the first practical quasi-Newton method; BFGS is its self-dual cousin and proved to be more robust in practice.
When does BFGS fail?
Three modes. (1) Non-smooth f: gradient differences become noisy and the rank-2 update can lose positive-definiteness. Use sub-gradient methods or smoothing. (2) Inaccurate gradients: BFGS is very sensitive to gradient noise because the update extracts curvature from gradient differences. Stochastic-gradient BFGS variants like SQN (Byrd 2016) exist but are not as robust as plain L-BFGS. (3) Bad starting curvature: the initial H_0 (usually identity) gives poor scaling for the first few iterations. Use an initial diagonal estimate H_0 = γ I with γ from the first (s, y) pair. Practical L-BFGS implementations all do this.