Linear Algebra
Sherman–Morrison Formula
Update a matrix inverse after a rank-1 change in O(n²) instead of O(n³)
When a matrix A is modified by a rank-1 perturbation uvᵀ, the Sherman–Morrison formula expresses (A + uvᵀ)⁻¹ in terms of A⁻¹. The update costs O(n²) flops — dramatically cheaper than the O(n³) cost of re-inverting from scratch. It is the linear-algebra trick behind recursive least squares, the BFGS quasi-Newton optimiser, the Kalman filter, and leave-one-out cross-validation.
- Formula(A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹uvᵀA⁻¹)/(1 + vᵀA⁻¹u)
- CostO(n²) — vs O(n³) for full re-inversion
- Condition1 + vᵀA⁻¹u ≠ 0 (else A + uvᵀ is singular)
- GeneralisationWoodbury identity for rank-k updates
- Used inBFGS, recursive least squares, Kalman filter, LOOCV
- PublishedSherman & Morrison, Annals of Math. Stat., 1950
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The formula
Let A be an invertible n×n matrix and u, v two column vectors in ℝn (or ℂn). The outer product uvᵀ is an n×n matrix of rank 1. If 1 + vᵀA⁻¹u ≠ 0, then A + uvᵀ is invertible and
(A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹ u vᵀ A⁻¹) / (1 + vᵀ A⁻¹ u)
The right side is "the old inverse minus a rank-1 correction". The numerator A⁻¹uvᵀA⁻¹ is an n×n rank-1 matrix; the denominator is a scalar. Every quantity is computable from A⁻¹ and the two vectors — you never touch A + uvᵀ directly.
Why O(n²) is enormous
The naive plan after a rank-1 update is: form the new matrix B = A + uvᵀ, then call inv(B) at cost O(n³). For n = 1000, that is roughly a billion floating-point operations.
Sherman–Morrison only ever multiplies the existing A⁻¹ by vectors, plus one outer product. Specifically:
- Compute
p = A⁻¹ u: one matrix–vector product, O(n²). - Compute
qᵀ = vᵀ A⁻¹: one matrix–vector product (or transposed version), O(n²). - Compute the scalar denominator
1 + vᵀ p = 1 + qᵀ u: dot product, O(n). - Form the rank-1 correction
p qᵀ / (1 + qᵀu): outer product, O(n²). - Subtract from A⁻¹: O(n²).
Total: 4 · O(n²) ≈ O(n²). For n = 1000, that is ~106 flops vs. 109 for full re-inversion — three orders of magnitude faster.
Worked numerical example — rank-1 update of the identity
Take A = I, the n×n identity, and let u = v = 1 = (1, 1, …, 1)T, the all-ones vector. The outer product uvT = J is the n×n all-ones matrix, so A + uvT = I + J.
Apply Sherman–Morrison. Since A⁻¹ = I:
- Step 1. A⁻¹u = u = 1.
- Step 2. vTA⁻¹ = vT = 1T.
- Step 3. Denominator 1 + vTA⁻¹u = 1 + 1T·1 = 1 + n.
- Step 4. Correction (A⁻¹u)(vTA⁻¹)/(1+vTA⁻¹u) = J / (1 + n).
So (I + J)⁻¹ = I − J/(1 + n). Verify for n = 3: (1 + J)(I − J/4) = I − J/4 + J − J²/4. Since J² = 3J (every entry of J² is 1·1 + 1·1 + 1·1 = 3), this becomes I + 3J/4 − 3J/4 = I. ✓
The asymptotics: re-inverting I + J directly is O(n³). Sherman–Morrison gives the closed-form correction in O(n²). For n = 1000, the closed form is essentially free; the direct inversion is ~109 flops.
Worked numerical example — generic 2×2
Take A = I2, u = (1, 0)T, v = (1, 1)T. Then uvT = [[1, 1], [0, 0]] and A + uvT = [[2, 1], [0, 1]].
- Step 1. A⁻¹u = (1, 0)T.
- Step 2. vTA⁻¹ = (1, 1).
- Step 3. Denominator 1 + vTu = 1 + 1 = 2.
- Step 4. Correction (1, 0)T(1, 1) / 2 = [[1, 1], [0, 0]] / 2 = [[1/2, 1/2], [0, 0]].
- Step 5. A⁻¹ − correction = [[1, 0], [0, 1]] − [[1/2, 1/2], [0, 0]] = [[1/2, −1/2], [0, 1]].
Direct check. The inverse of [[2, 1], [0, 1]] has det 2, inverse (1/2)[[1, −1], [0, 2]] = [[1/2, −1/2], [0, 1]]. ✓ Both routes agree.
Derivation
Postulate the answer has the form (A + uvᵀ)⁻¹ = A⁻¹ − α · A⁻¹uvᵀA⁻¹ for some scalar α to be determined. Multiply by (A + uvᵀ) on the right and demand the product equal I:
[A⁻¹ − α A⁻¹ u vᵀ A⁻¹] · [A + u vᵀ]
= A⁻¹A + A⁻¹ u vᵀ − α A⁻¹ u vᵀ A⁻¹ A − α A⁻¹ u vᵀ A⁻¹ u vᵀ
= I + A⁻¹ u vᵀ − α A⁻¹ u vᵀ − α (vᵀ A⁻¹ u) A⁻¹ u vᵀ
= I + A⁻¹ u vᵀ [1 − α − α (vᵀ A⁻¹ u)]
We need the bracket to be zero. Solve: 1 − α (1 + vᵀA⁻¹u) = 0, so α = 1 / (1 + vᵀA⁻¹u). Substituting back gives the Sherman–Morrison formula.
The same derivation works on the left side and the proof of (A + uvᵀ)(A⁻¹ − α A⁻¹uvᵀA⁻¹) = I is symmetric. The formula thus produces both the left and right inverse, and (A + uvᵀ)⁻¹ exists iff α is defined — i.e. iff 1 + vᵀA⁻¹u ≠ 0.
Sherman–Morrison vs alternatives
| Approach | What it updates | Cost | Best when |
|---|---|---|---|
| Full re-inversion | (A + uvᵀ)⁻¹ | O(n³) | One-off problem, no prior A⁻¹ |
| Sherman–Morrison | Rank-1 inverse update | O(n²) | A⁻¹ already known; rank-1 perturbation |
| Woodbury identity | Rank-k inverse update | O(k n² + k³) | k ≪ n; multiple columns added at once |
| Rank-1 LU update | LU factors after rank-1 change | O(n²) | You have LU rather than the explicit inverse |
| Solve A x = b directly | x only, no inverse | O(n²) per solve after one O(n³) LU | You only need x = A⁻¹b, never A⁻¹ itself |
| Iterative (CG / GMRES) | Approximate solve | O(n²) per iter, sparse-friendly | n huge, A sparse or implicit |
Application — BFGS quasi-Newton optimisation
Quasi-Newton methods minimise a function f(x) by Newton-like steps xk+1 = xk − Hk⁻¹ ∇f(xk), but they avoid forming the true Hessian. Instead, they maintain an approximate inverse Hessian Hk−1 that is updated cheaply at every step. The BFGS update is a rank-2 modification of Hk−1, derivable from two applications of Sherman–Morrison. Cost per iteration: O(n²) instead of the O(n³) you would pay re-inverting a true Hessian.
L-BFGS goes further and stores only the last m gradient/step pairs (typically m = 10–20), reconstructing the matrix–vector product Hk−1·g implicitly in O(m n) flops. The hierarchy Newton → BFGS → L-BFGS is exactly an O(n³) → O(n²) → O(n) cascade powered by inverse-update tricks.
Application — recursive least squares
Ordinary least squares solves β̂ = (XTX)⁻¹ XTy. When a new observation (xnew, ynew) arrives, the new design matrix has one extra row, so XTX gets a rank-1 update by xnew xnewT. Sherman–Morrison updates (XTX)⁻¹ in O(p²) (where p = number of features), and the new coefficient β̂ is then a cheap O(p²) correction of the old one. This is the recursive least squares (RLS) algorithm, used in adaptive signal processing, online linear regression, and the analytical engine of the Kalman filter.
NumPy implementation
import numpy as np
def sherman_morrison(A_inv, u, v):
"""Compute (A + u vᵀ)⁻¹ given A⁻¹, in O(n²)."""
Ainv_u = A_inv @ u # O(n²)
vT_Ainv = v @ A_inv # O(n²)
denom = 1.0 + vT_Ainv @ u # O(n) scalar
if abs(denom) < 1e-12:
raise ValueError("Update is singular (denominator near zero)")
correction = np.outer(Ainv_u, vT_Ainv) / denom # O(n²)
return A_inv - correction # O(n²)
# Quick sanity check
n = 50
A = np.random.randn(n, n) + n * np.eye(n) # well-conditioned
A_inv = np.linalg.inv(A)
u = np.random.randn(n); v = np.random.randn(n)
B_inv_fast = sherman_morrison(A_inv, u, v)
B_inv_slow = np.linalg.inv(A + np.outer(u, v))
print(np.allclose(B_inv_fast, B_inv_slow)) # True
For a sequence of updates, store and propagate the inverse — never recompute from scratch.
Common pitfalls
- Forgetting to check the denominator. 1 + vTA⁻¹u = 0 means A + uvT is singular and has no inverse. The formula does not invent one — it blows up.
- Confusing uvT with vTu. The first is an n×n outer product (rank-1 matrix). The second is a 1×1 scalar (inner product). Mixing them up flips everything.
- Numerical drift over many updates. Applying Sherman–Morrison thousands of times accumulates floating-point error. Periodically re-factor A explicitly (every ~n updates) to refresh the inverse.
- Using the formula when A⁻¹ is not actually stored. Sherman–Morrison needs A⁻¹ or at least cheap matrix–vector products with it. If you only have an LU factorisation, use the rank-1 LU update instead.
- Applying it to rank-k perturbations one column at a time without absorbing. You can apply Sherman–Morrison k times, but each iteration must update the running inverse (not the original A⁻¹) — otherwise the next correction is wrong.
- Reading "O(n²)" as "always faster". For very small n (say n ≤ 5), the constant factors swamp the asymptotic win, and re-inverting is fine. Sherman–Morrison shines for large n with many updates.
Frequently asked questions
What does the Sherman–Morrison formula say?
If A is invertible and 1 + vTA⁻¹u ≠ 0, then (A + uvT)⁻¹ = A⁻¹ − (A⁻¹uvTA⁻¹) / (1 + vTA⁻¹u). The right side is the old inverse minus a single rank-1 correction.
Why is the update cheaper than re-inverting?
Every operation on the right is a matrix–vector product (O(n²)), outer product (O(n²)), or scalar dot product (O(n)). Total cost O(n²) versus O(n³) for full re-inversion — three orders of magnitude faster at n = 1000.
When does the formula fail?
When 1 + vTA⁻¹u = 0, the matrix A + uvT is singular. The formula honestly reports this by blowing up. Numerically, a near-zero denominator means you should re-factor instead of trusting the update.
What is the Woodbury identity?
The rank-k generalisation: (A + UCV)⁻¹ = A⁻¹ − A⁻¹U (C⁻¹ + VA⁻¹U)⁻¹ VA⁻¹. Sherman–Morrison is the k = 1 case. The expensive piece is a k×k inverse — cheap when k is small.
Where does Sherman–Morrison show up in machine learning?
Recursive least squares maintains (XTX)⁻¹ as rows of X arrive. BFGS quasi-Newton maintains an approximate inverse Hessian via rank-2 (= two Sherman–Morrison) updates. Kalman filters update posterior covariance the same way. Leave-one-out cross-validation removes one row at a time using the formula.
Does it generalise to rank-k updates without the full Woodbury?
Yes — apply Sherman–Morrison k times in succession, each iteration updating the running inverse. Costs O(k n²) versus one O(n³) inversion. Better than Woodbury when you process updates one at a time (online setting).
Can I use it to add or remove one data point in linear regression?
Yes — XTX has a rank-1 update when a row is added or removed. Sherman–Morrison gives the new inverse in O(n²) and is the engine behind fast leave-one-out cross-validation and online learning.