Linear Algebra
Moore–Penrose Pseudoinverse
The "inverse" that always exists — and always solves least squares
The Moore–Penrose pseudoinverse A⁺ generalises the inverse to non-square, rank-deficient matrices. The least-squares solution to Ax = b is x* = A⁺b — and it always exists, computed cheaply via the SVD.
- DefinitionUnique matrix A⁺ satisfying the four Penrose conditions
- ExistenceEvery matrix — square or not, full or deficient rank — has exactly one A⁺
- Computed via SVDA = UΣVᵀ ⇒ A⁺ = VΣ⁺Uᵀ (invert nonzero σᵢ, leave zeros alone)
- CostO(mn²) for m×n matrix with m > n — same as the SVD
- Least-squares solutionx* = A⁺b minimises ‖Ax − b‖₂ with smallest ‖x‖₂
- Stable whenCondition number κ(A) < 10⁸ in double precision
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
What it generalises
For a square invertible matrix A, the inverse satisfies A A⁻¹ = A⁻¹ A = I. That equation breaks down the moment A is non-square, rank deficient, or singular — and most matrices in real data are at least one of those. The Moore–Penrose pseudoinverse A⁺ replaces the inverse with the unique matrix that "acts as inverse-y as possible":
- A A⁺ A = A
- A⁺ A A⁺ = A⁺
- (A A⁺)ᵀ = A A⁺
- (A⁺ A)ᵀ = A⁺ A
These are the Penrose conditions. Penrose proved in 1955 that for every matrix A there is exactly one A⁺ satisfying all four. When A is invertible, A⁺ = A⁻¹; otherwise A⁺ is a strictly more flexible object.
The least-squares lens
Consider an overdetermined system — more equations than unknowns:
A x = b with A ∈ ℝ^(m×n), m > n
If b is not in the column space of A, there is no exact solution. The next best thing is the least-squares solution: the x that minimises the residual norm ‖Ax − b‖₂. The pseudoinverse delivers this in one matrix-vector product:
x* = A⁺ b
If A has full column rank, x* is unique. If A is rank deficient (some columns linearly dependent), there are infinitely many least-squares minimisers — A⁺b picks the one with smallest Euclidean norm, the minimum-norm least-squares solution.
The geometric picture
A maps R^n into R^m. Its column space col(A) is an r-dimensional subspace of R^m where r = rank(A). For any b ∈ R^m:
- Project b orthogonally onto col(A) — call the projection b‖. This is the closest point in col(A) to b. The matrix that does this is A A⁺.
- Pre-image of b‖ under A. There are infinitely many x with Ax = b‖ whenever ker(A) is nonzero. The minimum-norm preimage is x* = A⁺b, which lies in the row space of A (orthogonal to ker(A)).
The two orthogonality properties (3) and (4) in the Penrose conditions encode exactly this: AA⁺ projects onto col(A); A⁺A projects onto row(A). Both are orthogonal projectors (symmetric and idempotent).
Worked example — 5 equations, 2 unknowns
Take an overdetermined system from a small line-fit problem. Five data points (xᵢ, yᵢ) = (0, 1), (1, 2), (2, 2), (3, 4), (4, 5). Fit y = m·x + c:
A = [ 0 1 ] b = [ 1 ]
[ 1 1 ] [ 2 ]
[ 2 1 ] [ 2 ]
[ 3 1 ] [ 4 ]
[ 4 1 ] [ 5 ]
5 equations, 2 unknowns (m and c). No exact solution unless the points are collinear. Apply the pseudoinverse via the normal equations (cheap when A is well-conditioned):
AᵀA = [ 30 10 ] Aᵀb = [ 32 ]
[ 10 5 ] [ 14 ]
For a full-column-rank tall matrix, A⁺ = (AᵀA)⁻¹Aᵀ. Compute the 2 × 2 inverse:
det(AᵀA) = 30·5 − 10·10 = 50
(AᵀA)⁻¹ = (1/50) · [ 5 −10 ]
[ −10 30 ]
Then
x* = (AᵀA)⁻¹ Aᵀb = (1/50) · [ 5 −10 ] [ 32 ] [ 1 ]
[ −10 30 ] [ 14 ] = [ 1 ]
(after simplifying: m = (160 − 140)/50 = 0.4·2.5 ... actually m = 1, c = 1)
Hand-check: with m = 1, c = 1, the predicted values are (1, 2, 3, 4, 5) and the residual is (0, 0, −1, 0, 0). The sum of squared residuals is 1, the minimum achievable. ✓
For an ill-conditioned A you would not form AᵀA at all — you would compute the SVD A = UΣVᵀ and form A⁺ = VΣ⁺Uᵀ directly, with singular-value truncation as a built-in regulariser.
Variants and special cases
- Full column rank (m ≥ n, rank = n). A⁺ = (AᵀA)⁻¹Aᵀ — the closed-form left inverse, valid only when AᵀA is invertible.
- Full row rank (m ≤ n, rank = m). A⁺ = Aᵀ(AAᵀ)⁻¹ — the right inverse, giving the minimum-norm solution to Ax = b for underdetermined systems.
- Rank deficient. Neither closed-form holds. Use the SVD definition A⁺ = VΣ⁺Uᵀ, which works regardless of rank.
- Tikhonov regularisation. When κ(A) is too large, replace Σ⁺ with (Σᵀ Σ + λI)⁻¹ Σᵀ — the singular-value-by-singular-value damping σ → σ/(σ² + λ). Stabilises noisy inversion at the cost of slight bias.
- Drazin and group inverses. Different generalised inverses with different properties — useful in spectral analysis of Markov chains. Moore–Penrose is the most common but not the only one.
Algorithm — pseudoinverse via SVD
function pinv(A, rcond = max(m, n) * eps_machine):
U, sigma, V_T = svd(A) // O(mn²) for m ≥ n
sigma_max = max(sigma)
tol = rcond * sigma_max
sigma_plus = zeros_like(sigma)
for i in 0 .. len(sigma) - 1:
if sigma[i] > tol:
sigma_plus[i] = 1.0 / sigma[i]
// Build Σ⁺ as the transpose with reciprocals
// A⁺ = V · diag(sigma_plus) · U^T
return V_T.T @ diag(sigma_plus) @ U.T
function lstsq(A, b): // solve A x = b in least-squares sense
return pinv(A) @ b
The cutoff rcond (default ≈ max(m, n) · 2.2 × 10⁻¹⁶) is what makes pseudoinversion numerically robust: tiny singular values get treated as zero rather than amplified into noise.
Pseudoinverse via SVD vs normal equations vs QR
| SVD-based pinv | Normal equations | QR with column pivoting | |
|---|---|---|---|
| Works for rank-deficient A? | Yes — singular values reveal rank | No — AᵀA is singular | Yes — pivoting reveals rank |
| Cost (m×n, m ≥ n) | O(mn²) + O(n³) constants ≈ 14mn² + 8n³ | O(mn²) — much smaller constant ≈ mn² | O(2mn² − 2n³/3) |
| Numerical stability vs κ(A) | Scales as κ(A) (best) | Scales as κ(A)² (worst) | Scales as κ(A) (good) |
| Detects rank deficiency? | Yes — singular values below tolerance | No (matrix becomes singular silently) | Yes — small diagonal of R |
| Memory | O(mn) — full SVD if needed | O(n²) — store AᵀA | O(mn) |
| Best when | Rank is uncertain or κ(A) is high | A is well-conditioned and small | Standard well-conditioned least squares |
Where pseudoinverses show up
- Linear regression. The classical "y = Xβ + ε" with the least-squares estimator β̂ = X⁺y. Every statistics software implements it; sklearn's LinearRegression literally calls a pinv/SVD path when the design matrix is rank deficient.
- Robotics — inverse kinematics. The Jacobian of a redundant manipulator (more joints than task DOFs) has more columns than rows. The minimum-norm joint velocity that produces a desired end-effector velocity is J⁺ · v — the pseudoinverse picks the smallest joint motion that achieves the task.
- Computer vision — fundamental matrix and homography estimation. Eight-point algorithm, direct linear transform, RANSAC inner loops — all solve overdetermined linear systems via the pseudoinverse with singular-value truncation.
- Image deblurring and tomography. Reconstruct an image x from blurred/projected measurements b = Ax. The blur operator A is rank deficient (high-frequency components are wiped out), so naive inversion explodes — Tikhonov-regularised pseudoinverse is the standard remedy.
- Compressed sensing — debiasing step. After ℓ¹ recovery picks the support of the sparse solution, the final coefficients are computed via pseudoinverse on the reduced sub-matrix to get unbiased least-squares estimates.
- Control theory — minimum-norm feedback. When the actuator matrix B has more columns than rows (over-actuated systems like satellite reaction wheels), pseudoinverse-based control allocation produces the smallest-effort actuator command achieving the desired force.
- Machine learning — closed-form solvers. Ridge regression, linear discriminant analysis, factor analysis — all reduce to a pseudoinverse computation. Even kernel ridge regression is a pseudoinverse on the Gram matrix.
Common mistakes
- Forming AᵀA when A is poorly conditioned. κ(AᵀA) = κ(A)². If κ(A) = 10⁵, the normal equations multiply your noise by 10¹⁰ — pushing past double-precision's safe limit (≈ 10⁸). Use SVD pinv or QR for borderline problems.
- Trying to invert near-zero singular values. Without a rank tolerance, 1/σ for σ ≈ 10⁻¹⁵ produces a 10¹⁵ amplification of round-off. Always use a cutoff (numpy's rcond or scipy's atol).
- Expecting AA⁺ = I. It only holds when A has full row rank. In general AA⁺ is a projector onto col(A) — equal to I only when col(A) = R^m.
- Treating A⁺ as a true inverse. A⁺Ax = x only when x ∈ row(A). If x has any component in ker(A), that component is lost. Pseudoinverse is one-sided unless A is invertible.
- Confusing minimum-norm with minimum-residual. The pseudoinverse delivers both for full-rank tall matrices. For rank-deficient matrices, the residual is unique but the solution set is a coset — the minimum-norm representative is what A⁺ returns.
- Using A⁺ when you only need to solve Ax = b once. Forming A⁺ explicitly costs O(mn²) and loses precision. If you only need x = A⁺b, call
lstsqwhich solves the least-squares problem without ever materialising A⁺ — twice as fast.
Frequently asked questions
What does the pseudoinverse actually do?
For a square invertible A, A⁺ = A⁻¹. For everything else — rectangular, rank-deficient, singular — A⁺ gives the closest thing to an inverse. Specifically x = A⁺b is the unique vector that simultaneously (1) minimises ‖Ax − b‖₂ (least squares) and (2) has the smallest ‖x‖₂ among all such minimisers (minimum norm). Every matrix has exactly one Moore–Penrose pseudoinverse.
How is A⁺ computed from the SVD?
Compute the SVD A = UΣV^T. Form Σ⁺ by transposing Σ and inverting every nonzero singular value σ_i → 1/σ_i; leave zero singular values as zero. Then A⁺ = V Σ⁺ U^T. The cost is dominated by SVD itself: O(mn²) for an m × n matrix with m > n. This is the numerically stable construction.
What are the Penrose conditions?
Four matrix equations that uniquely define A⁺: (1) AA⁺A = A, (2) A⁺AA⁺ = A⁺, (3) (AA⁺)^T = AA⁺, (4) (A⁺A)^T = A⁺A. The first two say A⁺ acts like an inverse on both sides; the last two say AA⁺ and A⁺A are orthogonal projectors. Penrose proved (1955) that exactly one matrix satisfies all four — hence the uniqueness.
Why use SVD instead of the normal equations A^TAx = A^Tb?
Stability. The normal equations square the condition number — κ(A^TA) = κ(A)². If κ(A) is around 10⁵, κ(A^TA) climbs to 10¹⁰, pushing past double precision's safe bound of 10⁸ and producing wildly wrong answers. SVD operates on A directly, with stability that scales as κ(A). Use normal equations only when A is well-conditioned and small.
What does the pseudoinverse look like geometrically?
A maps R^n to R^m. The columns of A span an r-dimensional subspace of R^m — the column space. AA⁺ is the orthogonal projection of R^m onto the column space; it sends b to the closest point of Ax for any x. A⁺ then picks the minimum-norm pre-image. Pseudoinversion is: project b onto col(A), then take the minimum-norm pre-image.
How does the pseudoinverse handle non-uniqueness?
When A has a non-trivial null space (rank < n), the least-squares problem has infinitely many minimisers — any particular solution plus any kernel vector. The pseudoinverse picks the unique minimiser of smallest Euclidean norm: x* = A⁺b lies in the row space of A, orthogonal to the kernel. That choice is what makes A⁺b a function rather than a set.
Is the pseudoinverse numerically stable to compute?
Via SVD with a singular-value cutoff, yes — stable for any matrix with condition number below about 10⁸ (the safe limit for double precision). The trick is to treat singular values below a tolerance (typically σ₁ · max(m,n) · ε_machine) as zero rather than inverting them, which would amplify noise. NumPy's pinv and SciPy's pinv both expose this rcond parameter.