Algorithms

Strassen's Matrix Multiplication

Seven products where the textbook needs eight — and the cubic barrier falls

Strassen's algorithm multiplies two n×n matrices with seven recursive products instead of eight, dropping the cost from O(n³) to O(n^2.807) and breaking the cubic barrier that stood since antiquity.

  • Time complexityO(n^2.807)
  • Recursive products7 (not 8)
  • RecurrenceT(n)=7T(n/2)+O(n²)
  • DiscoveredVolker Strassen, 1969
  • Practical crossovern ≈ 500–1000

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The trick: trade one multiplication for several additions

Multiply two matrices the way you learned in school and the cost grows as the cube of the dimension. Each of the n² output entries is a dot product of length n, so an n×n by n×n product costs n² · n = n³ scalar multiplications. For a 1000×1000 matrix that's a billion multiplies. People assumed for centuries that this was simply how much arithmetic the problem required — that n³ was a law of nature, not a bound waiting to be beaten.

In 1969 Volker Strassen showed otherwise. His insight lives entirely in the 2×2 case. Split each input into four quadrants:

    A = | A11  A12 |     B = | B11  B12 |     C = A·B = | C11  C12 |
        | A21  A22 |         | B21  B22 |               | C21  C22 |

The block formulas give each output quadrant as a sum of two block products — for example C11 = A11·B11 + A12·B21. Counting them up, the four output blocks need eight block multiplications. Recurse on each, and you've just re-derived O(n³): the recurrence T(n) = 8·T(n/2) + O(n²) solves to n³.

Strassen's move was to compute seven specially crafted products instead, each a single multiplication of one sum of A-blocks with one sum of B-blocks:

M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 − B22)
M4 = A22 · (B21 − B11)
M5 = (A11 + A12) · B22
M6 = (A21 − A11) · (B11 + B12)
M7 = (A12 − A22) · (B21 + B22)

and then reassembles the answer with only additions and subtractions:

C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6

Seven multiplications, eighteen additions. The additions are "cheap" because additions cost O(n²) at every level while the multiplications are the recursive O(n³)-driving operations. Cut the recursive calls from eight to seven and the whole exponent bends.

Why seven turns n³ into n^2.807

The new recurrence is T(n) = 7·T(n/2) + O(n²). The combine step — those eighteen block additions, each over n²/4 entries — is O(n²), and there are seven recursive calls on half-size matrices. Plug into the master theorem: with a = 7, b = 2, the critical exponent is log_b a = log₂ 7 ≈ 2.807. Because n^2.807 dominates the n² combine cost, the work piles up at the leaves of the recursion tree and the total is:

T(n) = Θ(n^log₂7) = Θ(n^2.807...)

Geometrically, the recursion tree has depth log₂ n and a branching factor of 7, so it holds 7^(log₂ n) = n^2.807 leaves where the naive tree, branching by 8, holds n³. That single missing branch, compounded over every level, is the entire savings.

When Strassen pays off — and when it doesn't

  • Large, dense matrices. The asymptotic win only shows up once n is big enough that n^2.807 beats n³ by more than the constant-factor penalty. That crossover is real software, not theory: it lands around n ≈ 500–1000 in tuned implementations.
  • Recursion that bottoms out early. Production Strassen never recurses to 1×1 scalars. It switches to a cache-friendly naive (or BLAS) base case once the submatrices fit comfortably in cache, typically at blocks of 32–128. Recursing all the way down would be slower because the per-call bookkeeping overwhelms the arithmetic.
  • Not for small or sparse work. For n in the dozens, the naive triple loop wins outright. For sparse matrices, specialized formats beat any dense O(n^2.807) scheme.
  • Not where you need tight error bounds. Strassen is numerically weaker (see below); safety-critical numerics often forbid it.

In practice the strongest competition isn't the naive triple loop at all — it's a vendor BLAS dgemm routine that is already blocked, vectorized, and multithreaded to near-peak FLOPS. Strassen has to beat that, which is why most libraries reserve it for genuinely huge matrices, and some skip it entirely.

Strassen vs other matrix-multiply algorithms

Naive (school)StrassenWinograd variantCoppersmith–WinogradBest known (2020s)
Exponent3.0002.8072.8072.376≈ 2.371
Recursive products877
Additions per level4 blocks18 blocks15 blockshugehuge
Practical?Yes (small n)Yes (large n)Yes (large n)No (galactic)No (galactic)
Numerical stabilityStrongWeak (norm-wise)WeakUnknown / poorUnknown / poor
Crossover n≈ 500–1000≈ 500–1000> 10^20 (never)> 10^20 (never)

The "galactic" algorithms in the right two columns have asymptotic exponents below 2.4, but their constant factors are so astronomically large that they only beat Strassen for matrices bigger than any physical computer could hold. Strassen sits on the practical frontier: the only sub-cubic algorithm anyone actually runs.

What the numbers actually say

  • One fewer multiplication per level. 7 instead of 8 is a 12.5% cut per recursion level, but it compounds. At n = 1024 (ten levels) the recursive-product count is 7^10 ≈ 2.8×10^8 versus 8^10 ≈ 1.07×10^9 — Strassen does roughly 3.8× fewer leaf multiplications.
  • The exponent gap widens with size. n³/n^2.807 = n^0.193. At n = 1000 that's a factor of ~3.8; at n = 10⁶ it would be ~14×. The bigger the matrix, the larger the theoretical edge.
  • Eighteen additions per level cost something. Strassen replaces 4 block additions with 18, so the additive overhead more than quadruples. That extra O(n²) work per level is exactly why the crossover sits in the hundreds, not at n = 2.
  • It can't be 2×2-optimal with fewer. A 1971 lower bound (Hopcroft–Kerr, Winograd) proved that 7 multiplications is the minimum for 2×2 matrices over a non-commutative ring — you cannot do 2×2 in six. Strassen is provably optimal at the base case it exploits.
  • Memory cost. A textbook recursive Strassen allocates temporaries for the seven products and their sums, costing extra O(n²) auxiliary memory per level; careful implementations reuse buffers to keep the overhead to a small constant number of n×n scratch arrays.

JavaScript implementation

A teaching implementation that pads to a power of two, recurses with the seven products, and falls back to the naive base case for small blocks:

const add = (X, Y) => X.map((r, i) => r.map((v, j) => v + Y[i][j]));
const sub = (X, Y) => X.map((r, i) => r.map((v, j) => v - Y[i][j]));

function naive(A, B) {
  const n = A.length, m = B[0].length, k = B.length;
  const C = Array.from({ length: n }, () => new Float64Array(m));
  for (let i = 0; i < n; i++)
    for (let p = 0; p < k; p++) {
      const a = A[i][p];
      for (let j = 0; j < m; j++) C[i][j] += a * B[p][j];
    }
  return C.map(r => Array.from(r));
}

function split(M) {                       // four equal quadrants of an even matrix
  const n = M.length, h = n / 2;
  const q = (r0, c0) => M.slice(r0, r0 + h).map(row => row.slice(c0, c0 + h));
  return [q(0, 0), q(0, h), q(h, 0), q(h, h)];
}

function strassen(A, B, cutoff = 64) {
  const n = A.length;
  if (n <= cutoff) return naive(A, B);     // base case: stop recursing

  const [A11, A12, A21, A22] = split(A);
  const [B11, B12, B21, B22] = split(B);

  const M1 = strassen(add(A11, A22), add(B11, B22), cutoff);
  const M2 = strassen(add(A21, A22), B11, cutoff);
  const M3 = strassen(A11, sub(B12, B22), cutoff);
  const M4 = strassen(A22, sub(B21, B11), cutoff);
  const M5 = strassen(add(A11, A12), B22, cutoff);
  const M6 = strassen(sub(A21, A11), add(B11, B12), cutoff);
  const M7 = strassen(sub(A12, A22), add(B21, B22), cutoff);

  const C11 = add(sub(add(M1, M4), M5), M7);
  const C12 = add(M3, M5);
  const C21 = add(M2, M4);
  const C22 = add(sub(add(M1, M3), M2), M6);

  const h = n / 2, C = Array.from({ length: n }, () => new Array(n));
  for (let i = 0; i < h; i++)
    for (let j = 0; j < h; j++) {
      C[i][j]         = C11[i][j];
      C[i][j + h]     = C12[i][j];
      C[i + h][j]     = C21[i][j];
      C[i + h][j + h] = C22[i][j];
    }
  return C;
}

// Caller pads non-power-of-two inputs to the next power of two with zeros,
// runs strassen(), then crops the result back to the original dimensions.

Two details matter for correctness and speed. First, the cutoff is not an optimization afterthought — recursing to 1×1 makes this slower than the naive loop for any realistic input, so the base-case switch is mandatory. Second, every quadrant must be square and even-sized; the caller pads to a power of two so the recursion never hits an odd split.

Python implementation

The NumPy version makes the seven products and the block recombination read almost exactly like the algebra:

import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                       # base case: hand off to BLAS

    h = n // 2
    A11, A12 = A[:h, :h], A[:h, h:]
    A21, A22 = A[h:, :h], A[h:, h:]
    B11, B12 = B[:h, :h], B[:h, h:]
    B21, B22 = B[h:, :h], B[h:, h:]

    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11,       cutoff)
    M3 = strassen(A11,       B12 - B22, cutoff)
    M4 = strassen(A22,       B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22,       cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h], C[:h, h:] = C11, C12
    C[h:, :h], C[h:, h:] = C21, C22
    return C

def multiply(A, B, cutoff=64):
    """Pad to the next power of two, run Strassen, crop back."""
    n = max(A.shape + B.shape)
    m = 1
    while m < n:
        m *= 2
    Ap = np.zeros((m, m), dtype=A.dtype); Ap[:A.shape[0], :A.shape[1]] = A
    Bp = np.zeros((m, m), dtype=B.dtype); Bp[:B.shape[0], :B.shape[1]] = B
    return strassen(Ap, Bp, cutoff)[:A.shape[0], :B.shape[1]]

Note the base case hands off to NumPy's @ operator, which calls an optimized BLAS dgemm. On most hardware that BLAS routine is so fast you will struggle to make this Python Strassen beat a plain A @ B at all — a sobering demonstration of how high the practical bar really is.

Variants worth knowing

Winograd's form (Strassen–Winograd). A rearrangement that computes the same seven products but with only 15 additions instead of 18. It's the version most production codes actually use because additions, while cheap per operation, dominate the constant factor near the crossover.

Karatsuba multiplication. The same "three products instead of four" idea applied to big integers and polynomials: split each number in half and use 3 sub-multiplications, giving O(n^1.585). Strassen is, in spirit, the matrix sibling of Karatsuba — both buy a smaller exponent by spending extra additions.

Coppersmith–Winograd and the laser method. A family of algorithms that push the exponent toward 2.376 and below by analyzing tensor rank of bilinear forms. They are theoretical landmarks but galactic — too large-constant to run.

AlphaTensor's discoveries (2022). DeepMind's reinforcement-learning system rediscovered Strassen-like decompositions and found new low-rank tensor factorizations for specific small sizes (e.g. 4×4 and 5×5 over certain fields), shaving a multiplication here and there — the first machine-discovered improvements to the small-case base tables in decades.

Mixed / hybrid schemes. Real linear-algebra libraries apply one or two levels of Strassen on top of a heavily tuned BLAS kernel, capturing part of the asymptotic win without paying the full recursion-overhead and stability cost of pure Strassen.

Common bugs and edge cases

  • Recursing all the way to scalars. The single most common reason a hand-written Strassen is slower than the naive loop: with no cutoff, the recursion overhead and temporary allocations swamp the arithmetic. Always switch to a naive (or BLAS) base case.
  • Assuming a power-of-two size. Odd dimensions break the four-equal-quadrants split. Pad to the next power of two with zeros (simplest) or peel the odd boundary; forgetting this silently corrupts the result on, say, a 7×7 input.
  • Sign errors in the recombination. The C11 and C22 formulas mix four terms with different signs (M1 + M4 − M5 + M7). Transcribing the table from memory is error-prone — verify against a naive product on random matrices.
  • Ignoring numerical stability. Strassen is only norm-wise weakly stable, not component-wise. For ill-conditioned matrices or applications needing per-entry accuracy, the extra error can matter; the naive algorithm is the safer choice there.
  • Forgetting the multiplications are non-commutative. The seven products multiply blocks, not scalars. You may never reorder a block product (A11·B11 ≠ B11·A11 for non-symmetric blocks), even though the scalar derivation looks commutative.
  • Benchmarking against the wrong baseline. Beating your own naive triple loop is easy; beating a tuned multithreaded BLAS is the real test. Many "my Strassen is faster" results vanish the moment the baseline is a real dgemm.

Frequently asked questions

How does Strassen do 2×2 multiplication with only seven products?

The naive 2×2 product needs eight scalar multiplications, one per entry of the result times the corresponding row-column pair. Strassen forms seven cleverly chosen sums of the input entries, multiplies those seven pairs once each, then recombines them with eighteen additions and subtractions to recover all four output entries. The trade is one multiplication for several extra additions — a win because multiplication, recursively applied, is the expensive part.

Why is Strassen's complexity O(n^2.807) and not O(n³)?

The recurrence is T(n) = 7·T(n/2) + O(n²). By the master theorem the work is dominated by the leaves: there are 7^(log₂ n) = n^(log₂ 7) of them, and log₂ 7 ≈ 2.807. The naive algorithm makes eight recursive calls, giving 8^(log₂ n) = n³. Cutting eight calls to seven is what bends the exponent below three.

At what matrix size does Strassen actually beat the naive algorithm?

It depends on the implementation, but the crossover is typically somewhere between n ≈ 500 and n ≈ 1000 for well-tuned code. Below that, the extra additions, recursion overhead, and worse cache behavior outweigh the saved multiplications, so the cubic algorithm — or a vendor BLAS routine — wins. Practical libraries switch to a naive base case once the submatrices are small.

Is Strassen numerically stable?

It is stable enough for most uses but weaker than the naive algorithm. Strassen is only norm-wise weakly stable: the error bound depends on the norms of the whole matrices rather than the individual entries, so a few large entries can pollute small results. For most floating-point work the difference is negligible, but it disqualifies Strassen from applications that need tight per-entry error guarantees.

What is the theoretically fastest known matrix multiplication?

As of the mid-2020s the best published exponent is about ω ≈ 2.371, from refinements of the Coppersmith–Winograd laser method by Williams, Le Gall, Alman, Duan, and others. Those algorithms are galactic — their constant factors are so enormous they only win for matrices larger than any computer could store. Strassen, at 2.807, remains the only sub-cubic algorithm that is also practical.

Does the input matrix have to be a power of two?

No. The clean recursion assumes even dimensions so each matrix splits into four equal quadrants, but you handle odd sizes by padding with a row and column of zeros (which costs at most a constant factor) or by peeling off the odd boundary and patching it with rank-one corrections. Padding to the next power of two is the simplest and most common approach in teaching implementations.