Linear Algebra

Matrix Multiplication

Name: Matrix Multiplication — 60-second explainer
Uploaded: 2026-05-13T09:57:36Z
Duration: 1 min
Description: Matrix multiplication combines two matrices A and B into a new matrix AB whose (i,j) entry is the dot product of A's i-th row and B's j-th column. Geometrically, it composes two linear transformations. Every neural network forward pass, every 3D rotation, every solution to linear systems — all built on this single operation. The world's GPUs are mostly designed to do this fast.

Compose linear transformations — and the operation behind every neural network

Matrix multiplication combines two matrices A and B into a new matrix AB whose (i,j) entry is the dot product of A's i-th row and B's j-th column. Geometrically, it composes two linear transformations. Every neural network forward pass, every 3D rotation, every solution to linear systems — all built on this single operation. The world's GPUs are mostly designed to do this fast.

Definition(AB)_{ij} = ∑_k A_{ik} · B_{kj}
Compatibilitym×n matrix times n×p matrix gives m×p
Not commutativeAB ≠ BA in general
Associative(AB)C = A(BC)
Naive timeO(n³) for n×n matrices
Strassen / faster algorithmsO(n^2.81); current best ~O(n^2.371)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How matrix multiplication works

For matrices A (m × n) and B (n × p), the product C = AB is an (m × p) matrix where:

C_{ij} = ∑_{k=1}^n A_{ik} · B_{kj}

The (i, j) entry of C is the dot product of A's i-th row with B's j-th column. Compute this for every (i, j) pair — that's the whole multiplication.

Concrete example. Multiply:

A = [1 2]   B = [5 6]
    [3 4]       [7 8]

C[0,0] = 1·5 + 2·7 = 19
C[0,1] = 1·6 + 2·8 = 22
C[1,0] = 3·5 + 4·7 = 43
C[1,1] = 3·6 + 4·8 = 50

AB = [19 22]
     [43 50]

Why this definition?

Matrix multiplication composes linear transformations. If A and B are matrices representing linear transformations T_A and T_B (each takes a vector to another vector), then the composition T_A(T_B(v)) — apply B first, then A — is itself a linear transformation. The matrix of that composition is exactly AB (computed by the row-by-column formula).

So the formula isn't arbitrary — it's forced by "what matrix represents the composition of two transformations?" The answer is unique, and it's row-by-column.

Algebraic properties

Property	Statement	Like ordinary multiplication?
Associative	(AB)C = A(BC)	Yes
Distributive	A(B + C) = AB + AC	Yes
Identity	AI = IA = A (where I is identity matrix)	Yes
Commutative	AB ≠ BA in general	NO — major difference from numbers
Inverses	Some A have A⁻¹ such that AA⁻¹ = I	Like numbers, but not all matrices have inverses
Zero	AB = 0 doesn't imply A = 0 or B = 0	NO — different from numbers

The non-commutativity (AB ≠ BA) is the deepest difference. Number multiplication is commutative; matrix multiplication isn't because composing transformations in different orders gives different results.

Vector-times-matrix and matrix-times-vector

A matrix A times a column vector v applies the linear transformation A to v:

A v = w  (transforms input v into output w)

Geometrically — A could rotate, scale, shear, project, or some combination. Each column of A is the image of one basis vector under the transformation.

For neural networks — input layer is a vector x; multiply by weight matrix W to get the next layer's pre-activation. Each row of W weights one output neuron's connections to the input.

Worked examples

Example 1 — 2D rotation matrix

The matrix that rotates a 2D vector by angle θ counterclockwise:

R(θ) = [cos θ  −sin θ]
       [sin θ   cos θ]

Multiplying R(θ₁) · R(θ₂) gives R(θ₁ + θ₂) — rotations compose by adding angles. Verify by working out the trigonometric expansion using angle-sum formulas.

Example 2 — projecting onto a line

The matrix that projects 2D vectors onto the x-axis:

P = [1 0]
    [0 0]

P · [a, b]ᵀ = [a, 0]ᵀ — vertical component dropped, horizontal kept. Note P² = P (projection is idempotent — projecting twice is the same as once).

Example 3 — graphics transformation chain

To rotate a 3D model by angle θ around the y-axis, then scale by factor s, then translate by (tx, ty, tz):

Translate · Scale · Rotate · vertex

The combined transformation matrix is M = T · S · R. Multiplying it by each vertex position gives the final screen position. Pre-multiplying T, S, R once and reusing M for all vertices is the standard graphics pipeline optimization.

JavaScript — matrix multiplication

function matmul(A, B) {
  const m = A.length;          // rows of A
  const n = A[0].length;        // cols of A = rows of B
  const p = B[0].length;        // cols of B
  if (B.length !== n) throw new Error('Incompatible dimensions');

  const C = Array.from({ length: m }, () => new Array(p).fill(0));
  for (let i = 0; i < m; i++) {
    for (let j = 0; j < p; j++) {
      let sum = 0;
      for (let k = 0; k < n; k++) {
        sum += A[i][k] * B[k][j];
      }
      C[i][j] = sum;
    }
  }
  return C;
}

// Performance optimization — i, k, j loop order is cache-friendlier
function matmulFast(A, B) {
  const m = A.length, n = A[0].length, p = B[0].length;
  const C = Array.from({ length: m }, () => new Array(p).fill(0));
  for (let i = 0; i < m; i++) {
    for (let k = 0; k < n; k++) {
      const aik = A[i][k];
      for (let j = 0; j < p; j++) {
        C[i][j] += aik * B[k][j];
      }
    }
  }
  return C;
}

const A = [[1,2], [3,4]];
const B = [[5,6], [7,8]];
matmul(A, B);  // [[19, 22], [43, 50]]

For large matrices, real production code uses optimized BLAS implementations — OpenBLAS, Intel MKL, NVIDIA cuBLAS. These exploit cache hierarchy, SIMD instructions, and parallel cores. Naive JavaScript code is 10-100× slower per operation.

Faster algorithms — Strassen and beyond

Algorithm	Complexity	Year	Practical?
Naive (row-by-column)	O(n³)	—	Default for small n
Strassen	O(n^2.807)	1969	Yes for n > ~1000
Pan	O(n^2.78)	1978	No (theoretical)
Coppersmith-Winograd	O(n^2.376)	1990	No (galactic)
Le Gall	O(n^2.3729)	2014	No (galactic)
Williams	O(n^2.371552)	2024	No (galactic)

"Galactic" algorithms have asymptotic improvements but enormous constant factors that make them slower than naive for any matrix size that fits on Earth. Strassen is the only sub-cubic algorithm used in practice; everything beyond is theoretical.

Where matrix multiplication shows up

Neural networks. Every layer is matrix-multiply + non-linearity. Forward pass — activations × weights. Backward pass — error × transpose-of-weights. GPU clusters spend most cycles here.
Computer graphics. 3D vertex transformation, model-view-projection composition. Every rendered pixel passes through several matrix multiplications.
Solving linear systems. Ax = b is solved by computing A⁻¹ (which uses matmul internally) or by Gaussian elimination (also matmul).
Statistics — covariance and PCA. Covariance matrices, principal components — both derived from matrix multiplications.
Signal processing. Discrete Fourier Transform — matrix multiplication of signal vector with DFT matrix. Filtering, convolution, image transformations.
Network analysis. Adjacency matrix powers — A² counts paths of length 2; A^k counts paths of length k. Markov chain transitions iterate by matrix multiplication.

Common mistakes

Multiplying in the wrong order. Matrix multiplication isn't commutative. AB ≠ BA. Always check which order makes sense for your transformation.
Forgetting dimension compatibility. m×n times p×q is valid only if n = p. Mismatched dimensions are silent in some libraries (NumPy throws; some custom code returns garbage).
Confusing matmul with element-wise multiplication. Matrix multiplication is the row-by-column dot-product operation. Element-wise multiplication (A * B with same dimensions) is a different operation. NumPy uses A @ B for matmul, A * B for element-wise.
Inverting non-invertible matrices. Singular matrices (det = 0) have no inverse. Trying to compute A⁻¹ for them either errors or returns garbage. Always check with det(A) ≠ 0 or use pseudoinverse.
Using naive O(n³) for very large matrices. For n > ~1000, parallel libraries (BLAS, cuBLAS) and Strassen-style algorithms can give 10-100× speedup. Don't roll your own; use optimized linear algebra.
Floating-point error accumulation. n× multiplications and additions accumulate roundoff. For ill-conditioned matrices, the result loses precision. Use higher-precision (float64), iterative refinement, or specialized stable algorithms.

Frequently asked questions

Why is matrix multiplication defined this way?

Because it composes linear transformations. If A represents the transformation T_A and B represents T_B, then AB represents the composition T_A ∘ T_B (apply B first, then A). Working out the algebra of composition gives exactly the row-by-column formula. The definition isn't arbitrary; it's forced by the requirement to compose transformations.

Why is AB ≠ BA in general?

Because applying transformations in different orders gives different results. Rotate then translate ≠ translate then rotate. The matrices encode this fundamental property — multiplication order matters. Even when both AB and BA exist (square matrices), they're usually different. Cases where AB = BA — same matrix, identity, scalar matrix, or matrices that share the same eigenvectors.

When is matrix multiplication defined?

When the inner dimensions match — m×n times n×p is valid (the n's must match). The result is m×p. Multiplying a 3×4 and a 5×6 doesn't make sense — the 4 doesn't match the 5. Square matrices of the same size are always compatible.

How fast is matrix multiplication?

Naive — O(n³). For each of the n² output entries, compute a dot product of n terms. Faster algorithms exist — Strassen (1969) — O(n^2.807) using clever recursion. Subsequent improvements bring this to ~O(n^2.371) (Coppersmith-Winograd and successors). In practice, Strassen has high constant factors and only beats naive for n > ~1000. GPU implementations parallelize the O(n³) work massively.

How is matrix multiplication used in machine learning?

Every layer of a neural network is essentially a matrix multiply followed by a non-linearity. Input × weights = pre-activation. With million-parameter networks, this means trillions of multiply-add operations per training example. NVIDIA's tensor cores and Google's TPUs are specialized hardware for matrix multiplication — at FP16 or BF16 precision for speed.

How does matrix multiplication relate to graphics?

3D rotation, scaling, translation are all matrix operations. Composing them — first rotate, then translate — is matrix multiplication of the corresponding matrices. The OpenGL/DirectX pipeline pre-multiplies matrices for view, projection, model — the result is one 4×4 matrix that transforms vertices to screen positions in one step.

What's broadcasting and how does it relate?

Broadcasting (NumPy term) automatically extends arrays of incompatible shapes for element-wise operations. NOT the same as matrix multiplication. Broadcasting handles cases like adding a 1×n vector to an m×n matrix (treats the vector as repeated m times). Matrix multiplication has strict dimensional requirements; element-wise ops broadcast.