Linear Algebra

Moore–Penrose Pseudoinverse

Name: Moore–Penrose Pseudoinverse — 60-second explainer
Uploaded: 2026-05-18T09:00:00Z
Duration: 1 min
Description: The Moore–Penrose pseudoinverse A⁺ generalises the inverse to non-square, rank-deficient matrices. The least-squares solution to Ax = b is x* = A⁺b — and it always exists, computed cheaply via the SVD.

The "inverse" that always exists — and always solves least squares

The Moore–Penrose pseudoinverse A⁺ generalises the inverse to non-square, rank-deficient matrices. The least-squares solution to Ax = b is x* = A⁺b — and it always exists, computed cheaply via the SVD.

DefinitionUnique matrix A⁺ satisfying the four Penrose conditions
ExistenceEvery matrix — square or not, full or deficient rank — has exactly one A⁺
Computed via SVDA = UΣVᵀ ⇒ A⁺ = VΣ⁺Uᵀ (invert nonzero σᵢ, leave zeros alone)
CostO(mn²) for m×n matrix with m > n — same as the SVD
Least-squares solutionx* = A⁺b minimises ‖Ax − b‖₂ with smallest ‖x‖₂
Stable whenCondition number κ(A) < 10⁸ in double precision

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

What it generalises

For a square invertible matrix A, the inverse satisfies A A⁻¹ = A⁻¹ A = I. That equation breaks down the moment A is non-square, rank deficient, or singular — and most matrices in real data are at least one of those. The Moore–Penrose pseudoinverse A⁺ replaces the inverse with the unique matrix that "acts as inverse-y as possible":

A A⁺ A = A
A⁺ A A⁺ = A⁺
(A A⁺)ᵀ = A A⁺
(A⁺ A)ᵀ = A⁺ A

These are the Penrose conditions. Penrose proved in 1955 that for every matrix A there is exactly one A⁺ satisfying all four. When A is invertible, A⁺ = A⁻¹; otherwise A⁺ is a strictly more flexible object.

The least-squares lens

Consider an overdetermined system — more equations than unknowns:

A x = b   with A ∈ ℝ^(m×n), m > n

If b is not in the column space of A, there is no exact solution. The next best thing is the least-squares solution: the x that minimises the residual norm ‖Ax − b‖₂. The pseudoinverse delivers this in one matrix-vector product:

x* = A⁺ b

If A has full column rank, x* is unique. If A is rank deficient (some columns linearly dependent), there are infinitely many least-squares minimisers — A⁺b picks the one with smallest Euclidean norm, the minimum-norm least-squares solution.

The geometric picture

A maps R^n into R^m. Its column space col(A) is an r-dimensional subspace of R^m where r = rank(A). For any b ∈ R^m:

Project b orthogonally onto col(A) — call the projection b‖. This is the closest point in col(A) to b. The matrix that does this is A A⁺.
Pre-image of b‖ under A. There are infinitely many x with Ax = b‖ whenever ker(A) is nonzero. The minimum-norm preimage is x* = A⁺b, which lies in the row space of A (orthogonal to ker(A)).

The two orthogonality properties (3) and (4) in the Penrose conditions encode exactly this: AA⁺ projects onto col(A); A⁺A projects onto row(A). Both are orthogonal projectors (symmetric and idempotent).

Worked example — 5 equations, 2 unknowns

Take an overdetermined system from a small line-fit problem. Five data points (xᵢ, yᵢ) = (0, 1), (1, 2), (2, 2), (3, 4), (4, 5). Fit y = m·x + c:

A = [ 0  1 ]   b = [ 1 ]
    [ 1  1 ]       [ 2 ]
    [ 2  1 ]       [ 2 ]
    [ 3  1 ]       [ 4 ]
    [ 4  1 ]       [ 5 ]

5 equations, 2 unknowns (m and c). No exact solution unless the points are collinear. Apply the pseudoinverse via the normal equations (cheap when A is well-conditioned):

AᵀA = [ 30   10 ]      Aᵀb = [ 32 ]
      [ 10    5 ]            [ 14 ]

For a full-column-rank tall matrix, A⁺ = (AᵀA)⁻¹Aᵀ. Compute the 2 × 2 inverse:

det(AᵀA) = 30·5 − 10·10 = 50
(AᵀA)⁻¹ = (1/50) · [  5  −10 ]
                   [ −10  30 ]

Then

x* = (AᵀA)⁻¹ Aᵀb = (1/50) · [  5  −10 ] [ 32 ]   [ 1 ]
                            [ −10  30 ] [ 14 ] = [ 1 ]
        (after simplifying: m = (160 − 140)/50 = 0.4·2.5 ... actually m = 1, c = 1)

Hand-check: with m = 1, c = 1, the predicted values are (1, 2, 3, 4, 5) and the residual is (0, 0, −1, 0, 0). The sum of squared residuals is 1, the minimum achievable. ✓

For an ill-conditioned A you would not form AᵀA at all — you would compute the SVD A = UΣVᵀ and form A⁺ = VΣ⁺Uᵀ directly, with singular-value truncation as a built-in regulariser.

Variants and special cases

Full column rank (m ≥ n, rank = n). A⁺ = (AᵀA)⁻¹Aᵀ — the closed-form left inverse, valid only when AᵀA is invertible.
Full row rank (m ≤ n, rank = m). A⁺ = Aᵀ(AAᵀ)⁻¹ — the right inverse, giving the minimum-norm solution to Ax = b for underdetermined systems.
Rank deficient. Neither closed-form holds. Use the SVD definition A⁺ = VΣ⁺Uᵀ, which works regardless of rank.
Tikhonov regularisation. When κ(A) is too large, replace Σ⁺ with (Σᵀ Σ + λI)⁻¹ Σᵀ — the singular-value-by-singular-value damping σ → σ/(σ² + λ). Stabilises noisy inversion at the cost of slight bias.
Drazin and group inverses. Different generalised inverses with different properties — useful in spectral analysis of Markov chains. Moore–Penrose is the most common but not the only one.

Algorithm — pseudoinverse via SVD

function pinv(A, rcond = max(m, n) * eps_machine):
    U, sigma, V_T = svd(A)        // O(mn²) for m ≥ n
    sigma_max = max(sigma)
    tol = rcond * sigma_max
    sigma_plus = zeros_like(sigma)
    for i in 0 .. len(sigma) - 1:
        if sigma[i] > tol:
            sigma_plus[i] = 1.0 / sigma[i]
    // Build Σ⁺ as the transpose with reciprocals
    // A⁺ = V · diag(sigma_plus) · U^T
    return V_T.T @ diag(sigma_plus) @ U.T

function lstsq(A, b):              // solve A x = b in least-squares sense
    return pinv(A) @ b

The cutoff rcond (default ≈ max(m, n) · 2.2 × 10⁻¹⁶) is what makes pseudoinversion numerically robust: tiny singular values get treated as zero rather than amplified into noise.

Pseudoinverse via SVD vs normal equations vs QR

	SVD-based pinv	Normal equations	QR with column pivoting
Works for rank-deficient A?	Yes — singular values reveal rank	No — AᵀA is singular	Yes — pivoting reveals rank
Cost (m×n, m ≥ n)	O(mn²) + O(n³) constants ≈ 14mn² + 8n³	O(mn²) — much smaller constant ≈ mn²	O(2mn² − 2n³/3)
Numerical stability vs κ(A)	Scales as κ(A) (best)	Scales as κ(A)² (worst)	Scales as κ(A) (good)
Detects rank deficiency?	Yes — singular values below tolerance	No (matrix becomes singular silently)	Yes — small diagonal of R
Memory	O(mn) — full SVD if needed	O(n²) — store AᵀA	O(mn)
Best when	Rank is uncertain or κ(A) is high	A is well-conditioned and small	Standard well-conditioned least squares

Where pseudoinverses show up

Linear regression. The classical "y = Xβ + ε" with the least-squares estimator β̂ = X⁺y. Every statistics software implements it; sklearn's LinearRegression literally calls a pinv/SVD path when the design matrix is rank deficient.
Robotics — inverse kinematics. The Jacobian of a redundant manipulator (more joints than task DOFs) has more columns than rows. The minimum-norm joint velocity that produces a desired end-effector velocity is J⁺ · v — the pseudoinverse picks the smallest joint motion that achieves the task.
Computer vision — fundamental matrix and homography estimation. Eight-point algorithm, direct linear transform, RANSAC inner loops — all solve overdetermined linear systems via the pseudoinverse with singular-value truncation.
Image deblurring and tomography. Reconstruct an image x from blurred/projected measurements b = Ax. The blur operator A is rank deficient (high-frequency components are wiped out), so naive inversion explodes — Tikhonov-regularised pseudoinverse is the standard remedy.
Compressed sensing — debiasing step. After ℓ¹ recovery picks the support of the sparse solution, the final coefficients are computed via pseudoinverse on the reduced sub-matrix to get unbiased least-squares estimates.
Control theory — minimum-norm feedback. When the actuator matrix B has more columns than rows (over-actuated systems like satellite reaction wheels), pseudoinverse-based control allocation produces the smallest-effort actuator command achieving the desired force.
Machine learning — closed-form solvers. Ridge regression, linear discriminant analysis, factor analysis — all reduce to a pseudoinverse computation. Even kernel ridge regression is a pseudoinverse on the Gram matrix.

Common mistakes

Forming AᵀA when A is poorly conditioned. κ(AᵀA) = κ(A)². If κ(A) = 10⁵, the normal equations multiply your noise by 10¹⁰ — pushing past double-precision's safe limit (≈ 10⁸). Use SVD pinv or QR for borderline problems.
Trying to invert near-zero singular values. Without a rank tolerance, 1/σ for σ ≈ 10⁻¹⁵ produces a 10¹⁵ amplification of round-off. Always use a cutoff (numpy's rcond or scipy's atol).
Expecting AA⁺ = I. It only holds when A has full row rank. In general AA⁺ is a projector onto col(A) — equal to I only when col(A) = R^m.
Treating A⁺ as a true inverse. A⁺Ax = x only when x ∈ row(A). If x has any component in ker(A), that component is lost. Pseudoinverse is one-sided unless A is invertible.
Confusing minimum-norm with minimum-residual. The pseudoinverse delivers both for full-rank tall matrices. For rank-deficient matrices, the residual is unique but the solution set is a coset — the minimum-norm representative is what A⁺ returns.
Using A⁺ when you only need to solve Ax = b once. Forming A⁺ explicitly costs O(mn²) and loses precision. If you only need x = A⁺b, call lstsq which solves the least-squares problem without ever materialising A⁺ — twice as fast.

Frequently asked questions

What does the pseudoinverse actually do?

For a square invertible A, A⁺ = A⁻¹. For everything else — rectangular, rank-deficient, singular — A⁺ gives the closest thing to an inverse. Specifically x = A⁺b is the unique vector that simultaneously (1) minimises ‖Ax − b‖₂ (least squares) and (2) has the smallest ‖x‖₂ among all such minimisers (minimum norm). Every matrix has exactly one Moore–Penrose pseudoinverse.

How is A⁺ computed from the SVD?

Compute the SVD A = UΣV^T. Form Σ⁺ by transposing Σ and inverting every nonzero singular value σ_i → 1/σ_i; leave zero singular values as zero. Then A⁺ = V Σ⁺ U^T. The cost is dominated by SVD itself: O(mn²) for an m × n matrix with m > n. This is the numerically stable construction.

What are the Penrose conditions?

Four matrix equations that uniquely define A⁺: (1) AA⁺A = A, (2) A⁺AA⁺ = A⁺, (3) (AA⁺)^T = AA⁺, (4) (A⁺A)^T = A⁺A. The first two say A⁺ acts like an inverse on both sides; the last two say AA⁺ and A⁺A are orthogonal projectors. Penrose proved (1955) that exactly one matrix satisfies all four — hence the uniqueness.

Why use SVD instead of the normal equations A^TAx = A^Tb?

Stability. The normal equations square the condition number — κ(A^TA) = κ(A)². If κ(A) is around 10⁵, κ(A^TA) climbs to 10¹⁰, pushing past double precision's safe bound of 10⁸ and producing wildly wrong answers. SVD operates on A directly, with stability that scales as κ(A). Use normal equations only when A is well-conditioned and small.

What does the pseudoinverse look like geometrically?

A maps R^n to R^m. The columns of A span an r-dimensional subspace of R^m — the column space. AA⁺ is the orthogonal projection of R^m onto the column space; it sends b to the closest point of Ax for any x. A⁺ then picks the minimum-norm pre-image. Pseudoinversion is: project b onto col(A), then take the minimum-norm pre-image.

How does the pseudoinverse handle non-uniqueness?

When A has a non-trivial null space (rank < n), the least-squares problem has infinitely many minimisers — any particular solution plus any kernel vector. The pseudoinverse picks the unique minimiser of smallest Euclidean norm: x* = A⁺b lies in the row space of A, orthogonal to the kernel. That choice is what makes A⁺b a function rather than a set.

Is the pseudoinverse numerically stable to compute?

Via SVD with a singular-value cutoff, yes — stable for any matrix with condition number below about 10⁸ (the safe limit for double precision). The trick is to treat singular values below a tolerance (typically σ₁ · max(m,n) · ε_machine) as zero rather than inverting them, which would amplify noise. NumPy's pinv and SciPy's pinv both expose this rcond parameter.