Linear Algebra
Matrix Multiplication
Compose linear transformations — and the operation behind every neural network
Matrix multiplication combines two matrices A and B into a new matrix AB whose (i,j) entry is the dot product of A's i-th row and B's j-th column. Geometrically, it composes two linear transformations. Every neural network forward pass, every 3D rotation, every solution to linear systems — all built on this single operation. The world's GPUs are mostly designed to do this fast.
- Definition(AB)_{ij} = ∑_k A_{ik} · B_{kj}
- Compatibilitym×n matrix times n×p matrix gives m×p
- Not commutativeAB ≠ BA in general
- Associative(AB)C = A(BC)
- Naive timeO(n³) for n×n matrices
- Strassen / faster algorithmsO(n^2.81); current best ~O(n^2.371)
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How matrix multiplication works
For matrices A (m × n) and B (n × p), the product C = AB is an (m × p) matrix where:
C_{ij} = ∑_{k=1}^n A_{ik} · B_{kj}
The (i, j) entry of C is the dot product of A's i-th row with B's j-th column. Compute this for every (i, j) pair — that's the whole multiplication.
Concrete example. Multiply:
A = [1 2] B = [5 6]
[3 4] [7 8]
C[0,0] = 1·5 + 2·7 = 19
C[0,1] = 1·6 + 2·8 = 22
C[1,0] = 3·5 + 4·7 = 43
C[1,1] = 3·6 + 4·8 = 50
AB = [19 22]
[43 50]
Why this definition?
Matrix multiplication composes linear transformations. If A and B are matrices representing linear transformations T_A and T_B (each takes a vector to another vector), then the composition T_A(T_B(v)) — apply B first, then A — is itself a linear transformation. The matrix of that composition is exactly AB (computed by the row-by-column formula).
So the formula isn't arbitrary — it's forced by "what matrix represents the composition of two transformations?" The answer is unique, and it's row-by-column.
Algebraic properties
| Property | Statement | Like ordinary multiplication? |
|---|---|---|
| Associative | (AB)C = A(BC) | Yes |
| Distributive | A(B + C) = AB + AC | Yes |
| Identity | AI = IA = A (where I is identity matrix) | Yes |
| Commutative | AB ≠ BA in general | NO — major difference from numbers |
| Inverses | Some A have A⁻¹ such that AA⁻¹ = I | Like numbers, but not all matrices have inverses |
| Zero | AB = 0 doesn't imply A = 0 or B = 0 | NO — different from numbers |
The non-commutativity (AB ≠ BA) is the deepest difference. Number multiplication is commutative; matrix multiplication isn't because composing transformations in different orders gives different results.
Vector-times-matrix and matrix-times-vector
A matrix A times a column vector v applies the linear transformation A to v:
A v = w (transforms input v into output w)
Geometrically — A could rotate, scale, shear, project, or some combination. Each column of A is the image of one basis vector under the transformation.
For neural networks — input layer is a vector x; multiply by weight matrix W to get the next layer's pre-activation. Each row of W weights one output neuron's connections to the input.
Worked examples
Example 1 — 2D rotation matrix
The matrix that rotates a 2D vector by angle θ counterclockwise:
R(θ) = [cos θ −sin θ]
[sin θ cos θ]
Multiplying R(θ₁) · R(θ₂) gives R(θ₁ + θ₂) — rotations compose by adding angles. Verify by working out the trigonometric expansion using angle-sum formulas.
Example 2 — projecting onto a line
The matrix that projects 2D vectors onto the x-axis:
P = [1 0]
[0 0]
P · [a, b]ᵀ = [a, 0]ᵀ — vertical component dropped, horizontal kept. Note P² = P (projection is idempotent — projecting twice is the same as once).
Example 3 — graphics transformation chain
To rotate a 3D model by angle θ around the y-axis, then scale by factor s, then translate by (tx, ty, tz):
Translate · Scale · Rotate · vertex
The combined transformation matrix is M = T · S · R. Multiplying it by each vertex position gives the final screen position. Pre-multiplying T, S, R once and reusing M for all vertices is the standard graphics pipeline optimization.
JavaScript — matrix multiplication
function matmul(A, B) {
const m = A.length; // rows of A
const n = A[0].length; // cols of A = rows of B
const p = B[0].length; // cols of B
if (B.length !== n) throw new Error('Incompatible dimensions');
const C = Array.from({ length: m }, () => new Array(p).fill(0));
for (let i = 0; i < m; i++) {
for (let j = 0; j < p; j++) {
let sum = 0;
for (let k = 0; k < n; k++) {
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
return C;
}
// Performance optimization — i, k, j loop order is cache-friendlier
function matmulFast(A, B) {
const m = A.length, n = A[0].length, p = B[0].length;
const C = Array.from({ length: m }, () => new Array(p).fill(0));
for (let i = 0; i < m; i++) {
for (let k = 0; k < n; k++) {
const aik = A[i][k];
for (let j = 0; j < p; j++) {
C[i][j] += aik * B[k][j];
}
}
}
return C;
}
const A = [[1,2], [3,4]];
const B = [[5,6], [7,8]];
matmul(A, B); // [[19, 22], [43, 50]]
For large matrices, real production code uses optimized BLAS implementations — OpenBLAS, Intel MKL, NVIDIA cuBLAS. These exploit cache hierarchy, SIMD instructions, and parallel cores. Naive JavaScript code is 10-100× slower per operation.
Faster algorithms — Strassen and beyond
| Algorithm | Complexity | Year | Practical? |
|---|---|---|---|
| Naive (row-by-column) | O(n³) | — | Default for small n |
| Strassen | O(n^2.807) | 1969 | Yes for n > ~1000 |
| Pan | O(n^2.78) | 1978 | No (theoretical) |
| Coppersmith-Winograd | O(n^2.376) | 1990 | No (galactic) |
| Le Gall | O(n^2.3729) | 2014 | No (galactic) |
| Williams | O(n^2.371552) | 2024 | No (galactic) |
"Galactic" algorithms have asymptotic improvements but enormous constant factors that make them slower than naive for any matrix size that fits on Earth. Strassen is the only sub-cubic algorithm used in practice; everything beyond is theoretical.
Where matrix multiplication shows up
- Neural networks. Every layer is matrix-multiply + non-linearity. Forward pass — activations × weights. Backward pass — error × transpose-of-weights. GPU clusters spend most cycles here.
- Computer graphics. 3D vertex transformation, model-view-projection composition. Every rendered pixel passes through several matrix multiplications.
- Solving linear systems. Ax = b is solved by computing A⁻¹ (which uses matmul internally) or by Gaussian elimination (also matmul).
- Statistics — covariance and PCA. Covariance matrices, principal components — both derived from matrix multiplications.
- Signal processing. Discrete Fourier Transform — matrix multiplication of signal vector with DFT matrix. Filtering, convolution, image transformations.
- Network analysis. Adjacency matrix powers — A² counts paths of length 2; A^k counts paths of length k. Markov chain transitions iterate by matrix multiplication.
Common mistakes
- Multiplying in the wrong order. Matrix multiplication isn't commutative. AB ≠ BA. Always check which order makes sense for your transformation.
- Forgetting dimension compatibility. m×n times p×q is valid only if n = p. Mismatched dimensions are silent in some libraries (NumPy throws; some custom code returns garbage).
- Confusing matmul with element-wise multiplication. Matrix multiplication is the row-by-column dot-product operation. Element-wise multiplication (A * B with same dimensions) is a different operation. NumPy uses A @ B for matmul, A * B for element-wise.
- Inverting non-invertible matrices. Singular matrices (det = 0) have no inverse. Trying to compute A⁻¹ for them either errors or returns garbage. Always check with det(A) ≠ 0 or use pseudoinverse.
- Using naive O(n³) for very large matrices. For n > ~1000, parallel libraries (BLAS, cuBLAS) and Strassen-style algorithms can give 10-100× speedup. Don't roll your own; use optimized linear algebra.
- Floating-point error accumulation. n× multiplications and additions accumulate roundoff. For ill-conditioned matrices, the result loses precision. Use higher-precision (float64), iterative refinement, or specialized stable algorithms.
Frequently asked questions
Why is matrix multiplication defined this way?
Because it composes linear transformations. If A represents the transformation T_A and B represents T_B, then AB represents the composition T_A ∘ T_B (apply B first, then A). Working out the algebra of composition gives exactly the row-by-column formula. The definition isn't arbitrary; it's forced by the requirement to compose transformations.
Why is AB ≠ BA in general?
Because applying transformations in different orders gives different results. Rotate then translate ≠ translate then rotate. The matrices encode this fundamental property — multiplication order matters. Even when both AB and BA exist (square matrices), they're usually different. Cases where AB = BA — same matrix, identity, scalar matrix, or matrices that share the same eigenvectors.
When is matrix multiplication defined?
When the inner dimensions match — m×n times n×p is valid (the n's must match). The result is m×p. Multiplying a 3×4 and a 5×6 doesn't make sense — the 4 doesn't match the 5. Square matrices of the same size are always compatible.
How fast is matrix multiplication?
Naive — O(n³). For each of the n² output entries, compute a dot product of n terms. Faster algorithms exist — Strassen (1969) — O(n^2.807) using clever recursion. Subsequent improvements bring this to ~O(n^2.371) (Coppersmith-Winograd and successors). In practice, Strassen has high constant factors and only beats naive for n > ~1000. GPU implementations parallelize the O(n³) work massively.
How is matrix multiplication used in machine learning?
Every layer of a neural network is essentially a matrix multiply followed by a non-linearity. Input × weights = pre-activation. With million-parameter networks, this means trillions of multiply-add operations per training example. NVIDIA's tensor cores and Google's TPUs are specialized hardware for matrix multiplication — at FP16 or BF16 precision for speed.
How does matrix multiplication relate to graphics?
3D rotation, scaling, translation are all matrix operations. Composing them — first rotate, then translate — is matrix multiplication of the corresponding matrices. The OpenGL/DirectX pipeline pre-multiplies matrices for view, projection, model — the result is one 4×4 matrix that transforms vertices to screen positions in one step.
What's broadcasting and how does it relate?
Broadcasting (NumPy term) automatically extends arrays of incompatible shapes for element-wise operations. NOT the same as matrix multiplication. Broadcasting handles cases like adding a 1×n vector to an m×n matrix (treats the vector as repeated m times). Matrix multiplication has strict dimensional requirements; element-wise ops broadcast.