Algorithms

Strassen's Matrix Multiplication

Seven multiplications where the textbook needs eight — and cubic time cracks

Strassen's algorithm multiplies two n×n matrices using seven recursive multiplications per block instead of eight, dropping the schoolbook O(n³) cost to about O(n^2.807) — the first proof that matrix multiplication is sub-cubic.

  • Time complexityO(nlog₂7) ≈ O(n2.807)
  • SchoolbookO(n³)
  • Multiplications per 2×2 block7 (not 8)
  • Block additions18
  • PublishedVolker Strassen, 1969

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The trick: trade a multiplication for a fistful of additions

Multiply two matrices the way you learned in school and the cost is brutal. Each entry of the n×n product is a dot product of one row and one column — n multiplications. There are n² entries, so the total is n³ scalar multiplications. Double the matrix size and the work goes up eightfold. For a 4096×4096 multiply that is about 69 billion multiplications.

In 1969 Volker Strassen published a result that stunned the field: matrix multiplication is not inherently cubic. His insight starts by splitting each n×n matrix into four (n/2)×(n/2) blocks. Block multiplication looks like ordinary 2×2 multiplication, except the "numbers" are matrices:

A = | A11  A12 |     B = | B11  B12 |     C = A·B = | C11  C12 |
    | A21  A22 |         | B21  B22 |               | C21  C22 |

The obvious way computes each output block as C11 = A11·B11 + A12·B21 and so on — eight block multiplications and four block additions. That is the schoolbook recursion, and it stays O(n³). Strassen found a way to compute the same four output blocks with only seven block multiplications, paying for the saved multiplication with extra additions and subtractions. Since multiplication of large blocks is the expensive recursive part and addition is cheap (O(n²)), trading one multiplication for several additions wins at scale.

The seven products

Strassen defines seven intermediate products, each a single (n/2)×(n/2) multiplication of two sums or differences of blocks:

M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 − B22)
M4 = A22 · (B21 − B11)
M5 = (A11 + A12) · B22
M6 = (A21 − A11) · (B11 + B12)
M7 = (A12 − A22) · (B21 + B22)

Then the four output blocks fall out of additions of those seven products:

C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6

That is the whole algorithm. Count the work at one level: 7 multiplications, plus 10 additions/subtractions to form the operands of M1…M7, plus 8 additions/subtractions to assemble C — 18 block additions in total. Each multiplication recurses on a half-size problem; each addition is O((n/2)²). The recurrence is:

T(n) = 7 · T(n/2) + Θ(n²)

By the master theorem, because 7 > 2² the work is dominated by the leaves of the recursion tree, giving T(n) = Θ(n^log₂7) ≈ Θ(n^2.807). The exponent dropped from 3 to 2.807 — small on paper, enormous as n grows. Verifying the identities is pure algebra: expand M1 + M4 − M5 + M7 and watch every cross term cancel except A11·B11 + A12·B21, which is exactly C11.

When Strassen actually pays off

  • Large dense matrices. The asymptotic win only shows up once n is large enough to overcome the bigger constant factor — in tuned libraries the crossover is roughly n = 500 to 1000.
  • A few levels of recursion, then switch. Production code recurses with Strassen at the top, then drops to a cache-optimized dense kernel once blocks are small (often 64 or 128). One or two Strassen levels capture most of the benefit with the least numerical and overhead cost.
  • Compute-bound, not memory-bound. Strassen needs scratch space for the M-products and operand sums. If you are memory-constrained, the extra allocations can erase the savings.

Skip Strassen for small matrices, sparse matrices (use sparse formats instead), or ill-conditioned numerical work where stability matters more than speed. For graphics 4×4 transforms or tiny kernels, the naive method is faster and exact.

Strassen vs the alternatives

SchoolbookStrassen (1969)Coppersmith–Winograd familyBlock tiling (BLAS)
Time complexityO(n³)O(n^2.807)O(n^2.371) (best known)O(n³)
Multiplications per 2×2 block87fewer, but huge constant8
Hidden constantsmallmoderate (18 adds)astronomical (galactic)small, cache-tuned
Extra memorynoneO(n²) scratchlargetiles only
Cache behaviorgood with tilingworse (recursion, scratch)impracticalexcellent
Numerical stabilitybest (O(n·ε))weaker (cancellation)poor / theoreticalbest
Used in practicesmall nlarge dense n, hybridnever (theory only)everywhere (default)

The headline: Strassen is the only sub-cubic algorithm anyone runs. Everything below 2.807 is a "galactic algorithm" — provably faster asymptotically, but with constants so large no realistic input is big enough to benefit. The current record exponent ω < 2.3715 (Williams–Xu–Xu–Zhou, 2024, building on Duan–Wu–Zhou, Le Gall, and Alman–Williams) lives entirely in proofs.

What the numbers actually say

  • One Strassen level cuts ~12.5% of the multiplications. 7/8 = 0.875, so a single split replaces 8 sub-multiplications with 7. Recurse fully and the multiplication count for n = 1024 drops from 1024³ ≈ 1.07 billion to 7^10 ≈ 282 million — about 3.8× fewer multiplications.
  • Crossover is the catch. Those saved multiplications cost 18 block additions per level. For small blocks the additions and recursion overhead dominate, which is why hybrids switch to dense multiply below ~64–128.
  • Asymptotic gap widens fast. At n = 10⁶, n³ = 10¹⁸ while n^2.807 ≈ 7×10¹⁶ — roughly a 14× reduction in operation count, before constants.
  • Stability cost is a larger error constant. Schoolbook gives a componentwise bound; Strassen gives only a normwise bound with a constant that grows per recursion level. For well-conditioned matrices the difference is rarely visible in double precision.

JavaScript implementation

This version pads to the next power of two, recurses with Strassen down to a small base case, then uses the naive kernel. Helpers add and sub work on square blocks.

const BASE = 64; // switch to naive below this size

function naive(A, B, n) {
  const C = Array.from({ length: n }, () => new Float64Array(n));
  for (let i = 0; i < n; i++)
    for (let k = 0; k < n; k++) {
      const a = A[i][k];
      for (let j = 0; j < n; j++) C[i][j] += a * B[k][j];
    }
  return C;
}

const add = (X, Y, n) => X.map((row, i) => row.map((v, j) => v + Y[i][j]));
const sub = (X, Y, n) => X.map((row, i) => row.map((v, j) => v - Y[i][j]));

function split(M, n) {                       // four (n/2) quadrants
  const h = n / 2;
  const q = (r0, c0) => M.slice(r0, r0 + h).map(row => row.slice(c0, c0 + h));
  return [q(0, 0), q(0, h), q(h, 0), q(h, h)]; // A11, A12, A21, A22
}

function strassen(A, B, n) {
  if (n <= BASE) return naive(A, B, n);
  const h = n / 2;
  const [A11, A12, A21, A22] = split(A, n);
  const [B11, B12, B21, B22] = split(B, n);

  const M1 = strassen(add(A11, A22, h), add(B11, B22, h), h);
  const M2 = strassen(add(A21, A22, h), B11, h);
  const M3 = strassen(A11, sub(B12, B22, h), h);
  const M4 = strassen(A22, sub(B21, B11, h), h);
  const M5 = strassen(add(A11, A12, h), B22, h);
  const M6 = strassen(sub(A21, A11, h), add(B11, B12, h), h);
  const M7 = strassen(sub(A12, A22, h), add(B21, B22, h), h);

  const C11 = add(sub(add(M1, M4, h), M5, h), M7, h);
  const C12 = add(M3, M5, h);
  const C21 = add(M2, M4, h);
  const C22 = add(sub(add(M1, M3, h), M2, h), M6, h);

  const C = Array.from({ length: n }, () => new Float64Array(n));
  for (let i = 0; i < h; i++)
    for (let j = 0; j < h; j++) {
      C[i][j] = C11[i][j];       C[i][j + h] = C12[i][j];
      C[i + h][j] = C21[i][j];   C[i + h][j + h] = C22[i][j];
    }
  return C;
}

Two details that bite. First, the base-case cutoff is not optional — without it you recurse down to 1×1 multiplies and the addition overhead makes this slower than the naive method for every realistic n. Second, if n is not a power of two you must pad both inputs to the next power of two (fill with zeros) before calling strassen, then slice the real result back out.

Python implementation

Same structure with NumPy doing the block arithmetic. NumPy's @ on the base case is itself a tuned BLAS call, so the cutoff should be large here.

import numpy as np

BASE = 128  # below this, let BLAS handle the dense multiply

def strassen(A, B):
    n = A.shape[0]
    if n <= BASE:
        return A @ B
    h = n // 2
    A11, A12 = A[:h, :h], A[:h, h:]
    A21, A22 = A[h:, :h], A[h:, h:]
    B11, B12 = B[:h, :h], B[:h, h:]
    B21, B22 = B[h:, :h], B[h:, h:]

    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11,        B12 - B22)
    M4 = strassen(A22,        B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h], C[:h, h:] = C11, C12
    C[h:, :h], C[h:, h:] = C21, C22
    return C

def strassen_padded(A, B):
    """Pad to the next power of two, multiply, slice back."""
    n = A.shape[0]
    m = 1 << (n - 1).bit_length()          # next power of two >= n
    Ap = np.zeros((m, m)); Ap[:n, :n] = A
    Bp = np.zeros((m, m)); Bp[:n, :n] = B
    return strassen(Ap, Bp)[:n, :n]

The padding helper rounds n up to the next power of two with a bit trick: 1 << (n-1).bit_length() gives the smallest power of two not less than n. The zero rows and columns contribute nothing to the real product, so slicing [:n, :n] recovers the exact answer.

Variants worth knowing

Winograd's form (Strassen–Winograd). A reformulation that needs only 15 additions per level instead of 18 by sharing common subexpressions. Same 7 multiplications, same O(n^2.807), but a smaller constant — this is the variant most high-performance libraries actually implement.

Dynamic peeling. Instead of padding odd-sized matrices to a power of two, peel off the odd row and column, recurse on the even (n−1) part, and patch the result with rank-one and matrix-vector corrections. Avoids wasting work on padded zeros for sizes far from a power of two.

Hybrid / cutoff Strassen. The version everyone uses in production: recurse with Strassen only for the top one or two levels, then call a cache-blocked dense GEMM. Captures the asymptotic win where it matters while keeping the inner loops cache-friendly and numerically clean.

Coppersmith–Winograd and successors. A different family entirely, based on the laser method and tensor rank, pushing the exponent ω toward 2. The 2024 record is ω < 2.371552 (Williams, Xu, Xu, Zhou — refining Duan, Wu, Zhou's 2.371866). These prove matrix multiplication is even cheaper asymptotically but are galactic — never run on real hardware.

Alphatensor and the 4×4 result (2022). DeepMind's reinforcement-learning search rediscovered Strassen-like schemes and found a 4×4 multiplication using 47 multiplications over certain finite fields, briefly beating the recursive Strassen count for that case — a reminder the search for low-rank multiplication schemes is still open.

Common bugs and edge cases

  • Forgetting the base-case cutoff. Recursing all the way to 1×1 makes Strassen far slower than naive — the 18 additions per level swamp the saved multiplication. Always switch to a dense kernel for small blocks.
  • Assuming power-of-two sizes. The clean split needs an even n at every level. Pad to the next power of two or use dynamic peeling; a raw recursion on n = 6 will hit an odd dimension and break.
  • Sign errors in the assembly formulas. C11 = M1 + M4 − M5 + M7 and C22 = M1 − M2 + M3 + M6 are easy to mistype. Validate against a naive multiply on random matrices before trusting the code.
  • Allocating fresh scratch on every call. Each recursion level builds 10 operand sums and 7 products. Naive allocation thrashes the heap; high-performance versions preallocate or reuse buffers.
  • Expecting bit-exact results. Strassen's subtractions cause cancellation, so its floating-point output differs from the naive product in the last few bits. Test with a tolerance, not exact equality.
  • Using it on sparse or tiny matrices. Strassen densifies everything and has a large startup cost. Sparse formats or naive multiply win outside the large-dense regime.

Frequently asked questions

Why is Strassen's algorithm O(n^2.807) and not O(n³)?

The recurrence is T(n) = 7·T(n/2) + O(n²). Schoolbook block multiplication makes 8 recursive calls, giving T(n) = 8·T(n/2) + O(n²) = O(n³). Strassen reduces the 8 multiplications to 7 with clever additions, so the exponent becomes log₂(7) ≈ 2.807 by the master theorem.

What are the seven products in Strassen's algorithm?

M1 = (A11+A22)(B11+B22), M2 = (A21+A22)B11, M3 = A11(B12−B22), M4 = A22(B21−B11), M5 = (A11+A12)B22, M6 = (A21−A11)(B11+B12), M7 = (A12−A22)(B21+B22). The four output blocks are combinations of these: C11 = M1+M4−M5+M7, C12 = M3+M5, C21 = M2+M4, C22 = M1−M2+M3+M6.

Is Strassen's algorithm actually faster in practice?

Only for large matrices. The hidden constant is large (18 block additions versus 4), and recursion plus the loss of cache-friendly memory access mean the schoolbook method wins below a crossover point — typically around n = 500 to 1000 for tuned BLAS implementations. Production libraries switch to dense multiplication once a sub-block is small.

Does Strassen's algorithm only work on power-of-two matrices?

The clean recursion assumes n is even at every level, which is automatic for powers of two. For other sizes you either pad the matrix with zero rows and columns up to the next power of two, or use dynamic peeling — split off an odd row/column, handle it with a rank-one update, and recurse on the even part.

Is Strassen's algorithm numerically stable?

Less stable than the schoolbook method. The subtractions in the seven products cause cancellation, so the error bound grows with a larger constant than the naive O(n·ε) bound. It is stable enough for most floating-point work but not for ill-conditioned problems where you need the tightest possible accuracy.

What is the fastest known matrix multiplication algorithm?

As of 2024 the best published exponent ω is about 2.371, from refinements of the Coppersmith–Winograd laser method (Williams, Le Gall, Alman–Williams, and Duan–Wu–Zhou). These are galactic algorithms — the constants are astronomically large, so Strassen at 2.807 remains the only sub-cubic method anyone actually runs.