Linear Algebra

Matrix Multiplication

Compose linear transformations — and the operation behind every neural network

Matrix multiplication combines two matrices A and B into a new matrix AB whose (i,j) entry is the dot product of A's i-th row and B's j-th column. Geometrically, it composes two linear transformations. Every neural network forward pass, every 3D rotation, every solution to linear systems — all built on this single operation. The world's GPUs are mostly designed to do this fast.

  • Definition(AB)_{ij} = ∑_k A_{ik} · B_{kj}
  • Compatibilitym×n matrix times n×p matrix gives m×p
  • Not commutativeAB ≠ BA in general
  • Associative(AB)C = A(BC)
  • Naive timeO(n³) for n×n matrices
  • Strassen / faster algorithmsO(n^2.81); current best ~O(n^2.371)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How matrix multiplication works

For matrices A (m × n) and B (n × p), the product C = AB is an (m × p) matrix where:

C_{ij} = ∑_{k=1}^n A_{ik} · B_{kj}

The (i, j) entry of C is the dot product of A's i-th row with B's j-th column. Compute this for every (i, j) pair — that's the whole multiplication.

Concrete example. Multiply:

A = [1 2]   B = [5 6]
    [3 4]       [7 8]
C[0,0] = 1·5 + 2·7 = 19
C[0,1] = 1·6 + 2·8 = 22
C[1,0] = 3·5 + 4·7 = 43
C[1,1] = 3·6 + 4·8 = 50

AB = [19 22]
     [43 50]

Why this definition?

Matrix multiplication composes linear transformations. If A and B are matrices representing linear transformations T_A and T_B (each takes a vector to another vector), then the composition T_A(T_B(v)) — apply B first, then A — is itself a linear transformation. The matrix of that composition is exactly AB (computed by the row-by-column formula).

So the formula isn't arbitrary — it's forced by "what matrix represents the composition of two transformations?" The answer is unique, and it's row-by-column.

Algebraic properties

PropertyStatementLike ordinary multiplication?
Associative(AB)C = A(BC)Yes
DistributiveA(B + C) = AB + ACYes
IdentityAI = IA = A (where I is identity matrix)Yes
CommutativeAB ≠ BA in generalNO — major difference from numbers
InversesSome A have A⁻¹ such that AA⁻¹ = ILike numbers, but not all matrices have inverses
ZeroAB = 0 doesn't imply A = 0 or B = 0NO — different from numbers

The non-commutativity (AB ≠ BA) is the deepest difference. Number multiplication is commutative; matrix multiplication isn't because composing transformations in different orders gives different results.

Vector-times-matrix and matrix-times-vector

A matrix A times a column vector v applies the linear transformation A to v:

A v = w  (transforms input v into output w)

Geometrically — A could rotate, scale, shear, project, or some combination. Each column of A is the image of one basis vector under the transformation.

For neural networks — input layer is a vector x; multiply by weight matrix W to get the next layer's pre-activation. Each row of W weights one output neuron's connections to the input.

Worked examples

Example 1 — 2D rotation matrix

The matrix that rotates a 2D vector by angle θ counterclockwise:

R(θ) = [cos θ  −sin θ]
       [sin θ   cos θ]

Multiplying R(θ₁) · R(θ₂) gives R(θ₁ + θ₂) — rotations compose by adding angles. Verify by working out the trigonometric expansion using angle-sum formulas.

Example 2 — projecting onto a line

The matrix that projects 2D vectors onto the x-axis:

P = [1 0]
    [0 0]

P · [a, b]ᵀ = [a, 0]ᵀ — vertical component dropped, horizontal kept. Note P² = P (projection is idempotent — projecting twice is the same as once).

Example 3 — graphics transformation chain

To rotate a 3D model by angle θ around the y-axis, then scale by factor s, then translate by (tx, ty, tz):

Translate · Scale · Rotate · vertex

The combined transformation matrix is M = T · S · R. Multiplying it by each vertex position gives the final screen position. Pre-multiplying T, S, R once and reusing M for all vertices is the standard graphics pipeline optimization.

JavaScript — matrix multiplication

function matmul(A, B) {
  const m = A.length;          // rows of A
  const n = A[0].length;        // cols of A = rows of B
  const p = B[0].length;        // cols of B
  if (B.length !== n) throw new Error('Incompatible dimensions');

  const C = Array.from({ length: m }, () => new Array(p).fill(0));
  for (let i = 0; i < m; i++) {
    for (let j = 0; j < p; j++) {
      let sum = 0;
      for (let k = 0; k < n; k++) {
        sum += A[i][k] * B[k][j];
      }
      C[i][j] = sum;
    }
  }
  return C;
}

// Performance optimization — i, k, j loop order is cache-friendlier
function matmulFast(A, B) {
  const m = A.length, n = A[0].length, p = B[0].length;
  const C = Array.from({ length: m }, () => new Array(p).fill(0));
  for (let i = 0; i < m; i++) {
    for (let k = 0; k < n; k++) {
      const aik = A[i][k];
      for (let j = 0; j < p; j++) {
        C[i][j] += aik * B[k][j];
      }
    }
  }
  return C;
}

const A = [[1,2], [3,4]];
const B = [[5,6], [7,8]];
matmul(A, B);  // [[19, 22], [43, 50]]

For large matrices, real production code uses optimized BLAS implementations — OpenBLAS, Intel MKL, NVIDIA cuBLAS. These exploit cache hierarchy, SIMD instructions, and parallel cores. Naive JavaScript code is 10-100× slower per operation.

Faster algorithms — Strassen and beyond

AlgorithmComplexityYearPractical?
Naive (row-by-column)O(n³)Default for small n
StrassenO(n^2.807)1969Yes for n > ~1000
PanO(n^2.78)1978No (theoretical)
Coppersmith-WinogradO(n^2.376)1990No (galactic)
Le GallO(n^2.3729)2014No (galactic)
WilliamsO(n^2.371552)2024No (galactic)

"Galactic" algorithms have asymptotic improvements but enormous constant factors that make them slower than naive for any matrix size that fits on Earth. Strassen is the only sub-cubic algorithm used in practice; everything beyond is theoretical.

Where matrix multiplication shows up

  • Neural networks. Every layer is matrix-multiply + non-linearity. Forward pass — activations × weights. Backward pass — error × transpose-of-weights. GPU clusters spend most cycles here.
  • Computer graphics. 3D vertex transformation, model-view-projection composition. Every rendered pixel passes through several matrix multiplications.
  • Solving linear systems. Ax = b is solved by computing A⁻¹ (which uses matmul internally) or by Gaussian elimination (also matmul).
  • Statistics — covariance and PCA. Covariance matrices, principal components — both derived from matrix multiplications.
  • Signal processing. Discrete Fourier Transform — matrix multiplication of signal vector with DFT matrix. Filtering, convolution, image transformations.
  • Network analysis. Adjacency matrix powers — A² counts paths of length 2; A^k counts paths of length k. Markov chain transitions iterate by matrix multiplication.

Common mistakes

  • Multiplying in the wrong order. Matrix multiplication isn't commutative. AB ≠ BA. Always check which order makes sense for your transformation.
  • Forgetting dimension compatibility. m×n times p×q is valid only if n = p. Mismatched dimensions are silent in some libraries (NumPy throws; some custom code returns garbage).
  • Confusing matmul with element-wise multiplication. Matrix multiplication is the row-by-column dot-product operation. Element-wise multiplication (A * B with same dimensions) is a different operation. NumPy uses A @ B for matmul, A * B for element-wise.
  • Inverting non-invertible matrices. Singular matrices (det = 0) have no inverse. Trying to compute A⁻¹ for them either errors or returns garbage. Always check with det(A) ≠ 0 or use pseudoinverse.
  • Using naive O(n³) for very large matrices. For n > ~1000, parallel libraries (BLAS, cuBLAS) and Strassen-style algorithms can give 10-100× speedup. Don't roll your own; use optimized linear algebra.
  • Floating-point error accumulation. n× multiplications and additions accumulate roundoff. For ill-conditioned matrices, the result loses precision. Use higher-precision (float64), iterative refinement, or specialized stable algorithms.

Frequently asked questions

Why is matrix multiplication defined this way?

Because it composes linear transformations. If A represents the transformation T_A and B represents T_B, then AB represents the composition T_A ∘ T_B (apply B first, then A). Working out the algebra of composition gives exactly the row-by-column formula. The definition isn't arbitrary; it's forced by the requirement to compose transformations.

Why is AB ≠ BA in general?

Because applying transformations in different orders gives different results. Rotate then translate ≠ translate then rotate. The matrices encode this fundamental property — multiplication order matters. Even when both AB and BA exist (square matrices), they're usually different. Cases where AB = BA — same matrix, identity, scalar matrix, or matrices that share the same eigenvectors.

When is matrix multiplication defined?

When the inner dimensions match — m×n times n×p is valid (the n's must match). The result is m×p. Multiplying a 3×4 and a 5×6 doesn't make sense — the 4 doesn't match the 5. Square matrices of the same size are always compatible.

How fast is matrix multiplication?

Naive — O(n³). For each of the n² output entries, compute a dot product of n terms. Faster algorithms exist — Strassen (1969) — O(n^2.807) using clever recursion. Subsequent improvements bring this to ~O(n^2.371) (Coppersmith-Winograd and successors). In practice, Strassen has high constant factors and only beats naive for n &gt; ~1000. GPU implementations parallelize the O(n³) work massively.

How is matrix multiplication used in machine learning?

Every layer of a neural network is essentially a matrix multiply followed by a non-linearity. Input × weights = pre-activation. With million-parameter networks, this means trillions of multiply-add operations per training example. NVIDIA's tensor cores and Google's TPUs are specialized hardware for matrix multiplication — at FP16 or BF16 precision for speed.

How does matrix multiplication relate to graphics?

3D rotation, scaling, translation are all matrix operations. Composing them — first rotate, then translate — is matrix multiplication of the corresponding matrices. The OpenGL/DirectX pipeline pre-multiplies matrices for view, projection, model — the result is one 4×4 matrix that transforms vertices to screen positions in one step.

What's broadcasting and how does it relate?

Broadcasting (NumPy term) automatically extends arrays of incompatible shapes for element-wise operations. NOT the same as matrix multiplication. Broadcasting handles cases like adding a 1×n vector to an m×n matrix (treats the vector as repeated m times). Matrix multiplication has strict dimensional requirements; element-wise ops broadcast.