Does the formula generalise to rank-k updates without the full Woodbury?

Yes — apply Sherman–Morrison k times in succession, each iteration absorbing one column pair of the rank-k update into the running inverse. Costs O(k n²) instead of one O(n³) inversion, which is the win whenever k ≪ n.

Can I use it to add or remove a single data point in linear regression?

Yes. The Gram matrix XᵀX has rank-1 updates whenever you add a row x* (XᵀX → XᵀX + x*x*ᵀ) or remove one (subtract). Sherman–Morrison gives you the new (XᵀX)⁻¹ in O(n²) without redoing the regression. This is the engine behind fast leave-one-out cross-validation and online learning.

Linear Algebra

Sherman–Morrison Formula

Update a matrix inverse after a rank-1 change in O(n²) instead of O(n³)

When a matrix A is modified by a rank-1 perturbation uvᵀ, the Sherman–Morrison formula expresses (A + uvᵀ)⁻¹ in terms of A⁻¹. The update costs O(n²) flops — dramatically cheaper than the O(n³) cost of re-inverting from scratch. It is the linear-algebra trick behind recursive least squares, the BFGS quasi-Newton optimiser, the Kalman filter, and leave-one-out cross-validation.

Formula(A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹uvᵀA⁻¹)/(1 + vᵀA⁻¹u)
CostO(n²) — vs O(n³) for full re-inversion
Condition1 + vᵀA⁻¹u ≠ 0 (else A + uvᵀ is singular)
GeneralisationWoodbury identity for rank-k updates
Used inBFGS, recursive least squares, Kalman filter, LOOCV
PublishedSherman & Morrison, Annals of Math. Stat., 1950

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The formula

Let A be an invertible n×n matrix and u, v two column vectors in ℝⁿ (or ℂⁿ). The outer product uvᵀ is an n×n matrix of rank 1. If 1 + vᵀA⁻¹u ≠ 0, then A + uvᵀ is invertible and

(A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹ u vᵀ A⁻¹) / (1 + vᵀ A⁻¹ u)

The right side is "the old inverse minus a rank-1 correction". The numerator A⁻¹uvᵀA⁻¹ is an n×n rank-1 matrix; the denominator is a scalar. Every quantity is computable from A⁻¹ and the two vectors — you never touch A + uvᵀ directly.

Why O(n²) is enormous

The naive plan after a rank-1 update is: form the new matrix B = A + uvᵀ, then call inv(B) at cost O(n³). For n = 1000, that is roughly a billion floating-point operations.

Sherman–Morrison only ever multiplies the existing A⁻¹ by vectors, plus one outer product. Specifically:

Compute p = A⁻¹ u: one matrix–vector product, O(n²).
Compute qᵀ = vᵀ A⁻¹: one matrix–vector product (or transposed version), O(n²).
Compute the scalar denominator 1 + vᵀ p = 1 + qᵀ u: dot product, O(n).
Form the rank-1 correction p qᵀ / (1 + qᵀu): outer product, O(n²).
Subtract from A⁻¹: O(n²).

Total: 4 · O(n²) ≈ O(n²). For n = 1000, that is ~10⁶ flops vs. 10⁹ for full re-inversion — three orders of magnitude faster.

Worked numerical example — rank-1 update of the identity

Take A = I, the n×n identity, and let u = v = 1 = (1, 1, …, 1)^T, the all-ones vector. The outer product uv^T = J is the n×n all-ones matrix, so A + uv^T = I + J.

Apply Sherman–Morrison. Since A⁻¹ = I:

Step 1. A⁻¹u = u = 1.
Step 2. v^TA⁻¹ = v^T = 1^T.
Step 3. Denominator 1 + v^TA⁻¹u = 1 + 1^T·1 = 1 + n.
Step 4. Correction (A⁻¹u)(v^TA⁻¹)/(1+v^TA⁻¹u) = J / (1 + n).

So (I + J)⁻¹ = I − J/(1 + n). Verify for n = 3: (1 + J)(I − J/4) = I − J/4 + J − J²/4. Since J² = 3J (every entry of J² is 1·1 + 1·1 + 1·1 = 3), this becomes I + 3J/4 − 3J/4 = I. ✓

The asymptotics: re-inverting I + J directly is O(n³). Sherman–Morrison gives the closed-form correction in O(n²). For n = 1000, the closed form is essentially free; the direct inversion is ~10⁹ flops.

Worked numerical example — generic 2×2

Take A = I₂, u = (1, 0)^T, v = (1, 1)^T. Then uv^T = [[1, 1], [0, 0]] and A + uv^T = [[2, 1], [0, 1]].

Step 1. A⁻¹u = (1, 0)^T.
Step 2. v^TA⁻¹ = (1, 1).
Step 3. Denominator 1 + v^Tu = 1 + 1 = 2.
Step 4. Correction (1, 0)^T(1, 1) / 2 = [[1, 1], [0, 0]] / 2 = [[1/2, 1/2], [0, 0]].
Step 5. A⁻¹ − correction = [[1, 0], [0, 1]] − [[1/2, 1/2], [0, 0]] = [[1/2, −1/2], [0, 1]].

Direct check. The inverse of [[2, 1], [0, 1]] has det 2, inverse (1/2)[[1, −1], [0, 2]] = [[1/2, −1/2], [0, 1]]. ✓ Both routes agree.

Derivation

Postulate the answer has the form (A + uvᵀ)⁻¹ = A⁻¹ − α · A⁻¹uvᵀA⁻¹ for some scalar α to be determined. Multiply by (A + uvᵀ) on the right and demand the product equal I:

[A⁻¹ − α A⁻¹ u vᵀ A⁻¹] · [A + u vᵀ]
  = A⁻¹A + A⁻¹ u vᵀ − α A⁻¹ u vᵀ A⁻¹ A − α A⁻¹ u vᵀ A⁻¹ u vᵀ
  = I + A⁻¹ u vᵀ − α A⁻¹ u vᵀ − α (vᵀ A⁻¹ u) A⁻¹ u vᵀ
  = I + A⁻¹ u vᵀ [1 − α − α (vᵀ A⁻¹ u)]

We need the bracket to be zero. Solve: 1 − α (1 + vᵀA⁻¹u) = 0, so α = 1 / (1 + vᵀA⁻¹u). Substituting back gives the Sherman–Morrison formula.

The same derivation works on the left side and the proof of (A + uvᵀ)(A⁻¹ − α A⁻¹uvᵀA⁻¹) = I is symmetric. The formula thus produces both the left and right inverse, and (A + uvᵀ)⁻¹ exists iff α is defined — i.e. iff 1 + vᵀA⁻¹u ≠ 0.

Sherman–Morrison vs alternatives

Approach	What it updates	Cost	Best when
Full re-inversion	(A + uvᵀ)⁻¹	O(n³)	One-off problem, no prior A⁻¹
Sherman–Morrison	Rank-1 inverse update	O(n²)	A⁻¹ already known; rank-1 perturbation
Woodbury identity	Rank-k inverse update	O(k n² + k³)	k ≪ n; multiple columns added at once
Rank-1 LU update	LU factors after rank-1 change	O(n²)	You have LU rather than the explicit inverse
Solve A x = b directly	x only, no inverse	O(n²) per solve after one O(n³) LU	You only need x = A⁻¹b, never A⁻¹ itself
Iterative (CG / GMRES)	Approximate solve	O(n²) per iter, sparse-friendly	n huge, A sparse or implicit

Application — BFGS quasi-Newton optimisation

Quasi-Newton methods minimise a function f(x) by Newton-like steps x_k+1 = x_k − H_k⁻¹ ∇f(x_k), but they avoid forming the true Hessian. Instead, they maintain an approximate inverse Hessian H_k⁻¹ that is updated cheaply at every step. The BFGS update is a rank-2 modification of H_k⁻¹, derivable from two applications of Sherman–Morrison. Cost per iteration: O(n²) instead of the O(n³) you would pay re-inverting a true Hessian.

L-BFGS goes further and stores only the last m gradient/step pairs (typically m = 10–20), reconstructing the matrix–vector product H_k⁻¹·g implicitly in O(m n) flops. The hierarchy Newton → BFGS → L-BFGS is exactly an O(n³) → O(n²) → O(n) cascade powered by inverse-update tricks.

Application — recursive least squares

Ordinary least squares solves β̂ = (X^TX)⁻¹ X^Ty. When a new observation (x_new, y_new) arrives, the new design matrix has one extra row, so X^TX gets a rank-1 update by x_new x_new^T. Sherman–Morrison updates (X^TX)⁻¹ in O(p²) (where p = number of features), and the new coefficient β̂ is then a cheap O(p²) correction of the old one. This is the recursive least squares (RLS) algorithm, used in adaptive signal processing, online linear regression, and the analytical engine of the Kalman filter.

NumPy implementation

import numpy as np

def sherman_morrison(A_inv, u, v):
    """Compute (A + u vᵀ)⁻¹ given A⁻¹, in O(n²)."""
    Ainv_u = A_inv @ u                  # O(n²)
    vT_Ainv = v @ A_inv                 # O(n²)
    denom = 1.0 + vT_Ainv @ u           # O(n) scalar
    if abs(denom) < 1e-12:
        raise ValueError("Update is singular (denominator near zero)")
    correction = np.outer(Ainv_u, vT_Ainv) / denom  # O(n²)
    return A_inv - correction           # O(n²)

# Quick sanity check
n = 50
A = np.random.randn(n, n) + n * np.eye(n)   # well-conditioned
A_inv = np.linalg.inv(A)
u = np.random.randn(n); v = np.random.randn(n)

B_inv_fast = sherman_morrison(A_inv, u, v)
B_inv_slow = np.linalg.inv(A + np.outer(u, v))
print(np.allclose(B_inv_fast, B_inv_slow))   # True

For a sequence of updates, store and propagate the inverse — never recompute from scratch.

Common pitfalls

Forgetting to check the denominator. 1 + v^TA⁻¹u = 0 means A + uv^T is singular and has no inverse. The formula does not invent one — it blows up.
Confusing uv^T with v^Tu. The first is an n×n outer product (rank-1 matrix). The second is a 1×1 scalar (inner product). Mixing them up flips everything.
Numerical drift over many updates. Applying Sherman–Morrison thousands of times accumulates floating-point error. Periodically re-factor A explicitly (every ~n updates) to refresh the inverse.
Using the formula when A⁻¹ is not actually stored. Sherman–Morrison needs A⁻¹ or at least cheap matrix–vector products with it. If you only have an LU factorisation, use the rank-1 LU update instead.
Applying it to rank-k perturbations one column at a time without absorbing. You can apply Sherman–Morrison k times, but each iteration must update the running inverse (not the original A⁻¹) — otherwise the next correction is wrong.
Reading "O(n²)" as "always faster". For very small n (say n ≤ 5), the constant factors swamp the asymptotic win, and re-inverting is fine. Sherman–Morrison shines for large n with many updates.

Frequently asked questions

What does the Sherman–Morrison formula say?

If A is invertible and 1 + v^TA⁻¹u ≠ 0, then (A + uv^T)⁻¹ = A⁻¹ − (A⁻¹uv^TA⁻¹) / (1 + v^TA⁻¹u). The right side is the old inverse minus a single rank-1 correction.

Why is the update cheaper than re-inverting?

Every operation on the right is a matrix–vector product (O(n²)), outer product (O(n²)), or scalar dot product (O(n)). Total cost O(n²) versus O(n³) for full re-inversion — three orders of magnitude faster at n = 1000.

When does the formula fail?

When 1 + v^TA⁻¹u = 0, the matrix A + uv^T is singular. The formula honestly reports this by blowing up. Numerically, a near-zero denominator means you should re-factor instead of trusting the update.

What is the Woodbury identity?

The rank-k generalisation: (A + UCV)⁻¹ = A⁻¹ − A⁻¹U (C⁻¹ + VA⁻¹U)⁻¹ VA⁻¹. Sherman–Morrison is the k = 1 case. The expensive piece is a k×k inverse — cheap when k is small.

Where does Sherman–Morrison show up in machine learning?

Recursive least squares maintains (X^TX)⁻¹ as rows of X arrive. BFGS quasi-Newton maintains an approximate inverse Hessian via rank-2 (= two Sherman–Morrison) updates. Kalman filters update posterior covariance the same way. Leave-one-out cross-validation removes one row at a time using the formula.

Does it generalise to rank-k updates without the full Woodbury?

Yes — apply Sherman–Morrison k times in succession, each iteration updating the running inverse. Costs O(k n²) versus one O(n³) inversion. Better than Woodbury when you process updates one at a time (online setting).

Can I use it to add or remove one data point in linear regression?

Yes — X^TX has a rank-1 update when a row is added or removed. Sherman–Morrison gives the new inverse in O(n²) and is the engine behind fast leave-one-out cross-validation and online learning.