Machine Learning
Backpropagation
Reverse-mode automatic differentiation — compute all weight gradients in one backward pass
Backpropagation is the algorithm at the heart of training neural networks. It computes the gradient of a scalar loss with respect to every weight in the network in O(forward-pass-time + backward-pass-time) ≈ O(2 × forward-pass-time), regardless of how many parameters the network has. The technique is reverse-mode automatic differentiation: forward-pass caches activations; backward-pass propagates the loss gradient via the chain rule, multiplying local Jacobians. Independently rediscovered multiple times — Linnainmaa 1970, Werbos 1974, but popularized by Rumelhart, Hinton, Williams in 1986. Without it, GPT, Stable Diffusion, and modern computer vision would be impossible — gradient-based training of >100B-parameter models needs the linear-cost guarantee.
- Cost~2× forward-pass
- TypeReverse-mode autodiff
- PopularizedRumelhart Hinton Williams 1986
- Earlier workLinnainmaa 1970, Werbos 1974
- MemoryStores forward activations
- Modern stackPyTorch autograd, JAX, TF
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
Why backpropagation matters
- Every modern network. Training GPT-4, Llama 3, Stable Diffusion XL, AlphaFold 2, and every large vision model relies on backpropagation. Without the linear-cost gradient, training trillion-parameter models would be impossible.
- Foundation of deep learning. Stochastic gradient descent + backprop is the engine of the deep learning revolution that began in 2012 (AlexNet on ImageNet).
- Generic to graphs of operations. Backpropagation is not specific to neural networks. Modern autodiff frameworks differentiate any composition of differentiable primitives — physics simulations, ray tracing (Mitsuba 3), molecular dynamics (JAX MD), and even discrete optimizers (via straight-through estimators).
- Linear in parameters. Forward-mode would scale with the number of parameters; for a 70B-parameter LLM that is 70 billion forward passes per gradient. Reverse-mode does the whole job in roughly two forward passes.
- Composable with checkpointing, sharding, and offload. Memory-saving tricks like gradient checkpointing, ZeRO-3, and CPU offload are layered on top of vanilla backprop.
- Hardware-friendly. The backward pass is dominated by matrix multiplications — the operation GPUs and TPUs accelerate best. NVIDIA H100 hits 989 TFLOPs of FP16 matmul; backprop spends >90% of training time inside these kernels.
- Differentiable programming. Treating computation as a graph that can be differentiated end-to-end has spread well beyond neural networks — differentiable rendering, differentiable physics, and learning-to-optimize all rely on backprop machinery.
The algorithm in three steps
- Forward pass. Run inputs through the network layer by layer, computing y = f_L(f_{L-1}(... f_1(x))). At each layer, store the activation needed for the backward pass — for matmul that is the input; for ReLU that is the input sign mask; for softmax that is the output. Compute scalar loss L(y, target).
- Initialize gradient. Start from dL/dL = 1.0. This single scalar seeds the entire backward pass.
- Backward pass. Walk the graph in reverse topological order. At each node, multiply the incoming gradient (from above) by the local Jacobian of the node's operation, accumulating into upstream gradients. For a linear layer y = Wx + b: gradient with respect to x is g_out @ W^T; gradient with respect to W is g_out^T @ x; gradient with respect to b is sum over batch of g_out.
A short history
- 1960s. Bryson and Ho describe a multi-stage dynamic-programming algorithm essentially equivalent to backpropagation in optimal control theory.
- 1970. Seppo Linnainmaa, in his Master's thesis at Helsinki, derives reverse-mode automatic differentiation in fully general form (with rounding-error analysis) — arguably the first complete description of backprop.
- 1974. Paul Werbos's PhD thesis applies reverse-mode autodiff to train neural networks — usually credited as the first explicit neural-network application.
- 1986. Rumelhart, Hinton, and Williams publish "Learning representations by back-propagating errors" in Nature. This is the paper that popularized the technique and triggered the 1980s neural-network wave.
- 1989-2006. Multiple AI winters; backprop deemed slow and prone to local minima.
- 2012. AlexNet wins ImageNet by a huge margin using backprop on GPUs — the deep learning revival begins.
- 2015 onwards. Autograd frameworks (Theano, Torch, TensorFlow, PyTorch, JAX) make backprop accessible to anyone who can write a Python forward pass.
How modern autograd works
- Tape recording. PyTorch records operations on tensors with requires_grad=True onto a dynamic graph. Each output tensor stores a reference to the Function that produced it and to its inputs.
- Topological backward. loss.backward() walks the graph in reverse topological order, calling each Function's backward() method. Each backward consumes the upstream gradient and emits gradients for its inputs.
- Gradient accumulation. If a tensor is used multiple times (residual connections, weight tying), gradients from each use are summed automatically.
- Graph freeing. By default the graph is destroyed after backward; pass retain_graph=True if you need a second backward (e.g., higher-order gradients, GAN with multiple losses).
- JAX's pure-functional twist. jax.grad takes a pure function and returns a new function that computes its gradient. The graph is built by tracing rather than tape, enabling jit-compilation, vmap, and pmap on top.
Memory cost — the dominant constraint
- Activation memory. Storing every layer's input/output dominates VRAM during training. For a transformer with L layers, batch B, sequence S, hidden d: roughly O(L · B · S · d) bytes — tens of GB for modern LLMs.
- Gradient checkpointing. Trade compute for memory: store only every k-th activation, recompute the rest. Reduces memory from O(L) to O(L/k) at the cost of one extra forward pass.
- Mixed precision. Cast activations to bf16/fp16 for storage and matmul; keep optimizer state in fp32. Halves activation memory.
- Offload. ZeRO-Offload moves optimizer states and even activations to CPU RAM (or NVMe) between forward and backward.
- FlashAttention. Recomputes attention on the backward pass instead of storing the n × n attention matrix — cuts attention memory from O(n^2) to O(n).
Numerical pitfalls
- Vanishing gradients. If local Jacobians have norm < 1, repeated multiplication shrinks gradients exponentially with depth. Killed RNNs in the 1990s; solved by LSTM gates, ReLU activations, residual connections, and proper initialization (He, Xavier).
- Exploding gradients. Local Jacobians with norm > 1 blow up gradients. Mitigated by gradient clipping (clip norm to 1.0 is standard for transformers) and careful learning-rate scheduling.
- FP16 underflow. Gradients near 1e-8 can round to zero in fp16. Solved by loss scaling: multiply the loss by 1024-65536 before backward; divide gradients by the same factor before the optimizer step. bf16 (Brain Float) avoids this with its larger exponent range.
- Numerical-vs-symbolic. Backprop is numerical autodiff — it computes the gradient at a specific point. It does not produce a symbolic formula, which is what kept symbolic-differentiation systems like SymPy slow on large nets.
Common misconceptions
- "Backprop is only for neural networks." Backpropagation is generic reverse-mode automatic differentiation. JAX, autograd, and PyTorch differentiate physics simulations, fluid solvers, ray tracers, and even agent rollouts.
- "Backprop computes the Hessian." Standard backprop computes only the gradient (first derivative). Second-order information requires another pass — Hessian-vector products via Pearlmutter's trick, or full Hessian via O(n) backward passes.
- "Hinton invented backprop." Linnainmaa 1970 and Werbos 1974 predate Rumelhart-Hinton-Williams 1986. The 1986 paper popularized it for neural networks; the underlying technique is older.
- "Backprop finds the global minimum." It computes a gradient. Whether SGD with that gradient finds a good minimum depends on the loss landscape and is mostly empirical for non-convex deep nets.
- "You need the chain rule by hand." Modern autograd makes hand-derivation unnecessary — write a forward pass with differentiable primitives and backward is automatic. Hand-derivation is still useful for debugging custom kernels.
- "Backprop scales linearly with parameters." Per-step it is linear in operations, not parameters. A larger model with the same number of FLOPs (e.g., MoE) has the same backward cost as a dense one.
Performance and engineering
- Throughput. A single A100 80GB sustains ~300 TFLOPs of bf16 matmul; the backward pass for a 7B-parameter LLM takes ~1.5× the forward time on the same hardware.
- Bandwidth-bound layers. Activation functions, normalization, and elementwise ops are memory-bound; their backward kernels often run at <20% of peak FLOPs. Triton/CUDA fusion (xformers, FlashAttention) reclaims this.
- Distributed backward. Data parallelism averages gradients across GPUs via NCCL all-reduce. ZeRO-3 shards gradients themselves; FSDP follows the same pattern in PyTorch native.
- Pipeline parallelism. Splits layers across devices; backward pass walks layers in reverse, with bubble overhead managed via 1F1B (one-forward-one-backward) scheduling.
Frequently asked questions
What is the chain rule applied to neural networks?
A neural network is a composition of functions: y = f_L(f_{L-1}(... f_1(x))). The chain rule says dL/dx = (dL/dy_L)(dy_L/dy_{L-1})...(dy_1/dx). Backpropagation evaluates this product right to left, starting from dL/dy_L = 1 (loss with respect to itself). At each layer, multiply the incoming gradient by the local Jacobian of that layer's operation. For a linear layer y = Wx + b, the Jacobian with respect to x is W^T; with respect to W it's the outer product of the incoming gradient and the input.
Why is reverse-mode preferred over forward-mode for training?
Forward-mode autodiff costs O(n_inputs) per output; reverse-mode costs O(n_outputs) per input. Neural networks have one scalar loss output and millions to billions of parameter inputs. Forward-mode would require one full pass per parameter — utterly infeasible for a 175B-parameter model. Reverse-mode requires one backward pass total to get all parameter gradients. The cost is independent of parameter count, making large-model training tractable. Forward-mode is preferred when n_inputs < n_outputs (rare in deep learning).
What does autograd do under the hood?
PyTorch autograd builds a dynamic computation graph during the forward pass. Each tensor with requires_grad=True records the operation that produced it (a Function node) and its inputs. When you call loss.backward(), autograd traverses the graph in topological order from loss back to inputs, calling each Function's backward() method to multiply the incoming gradient by the local Jacobian. The graph is freed after backward unless retain_graph=True. JAX and TF use a similar pattern but JAX prefers tracing pure functions for jit/grad transformations.
Why does backprop require O(L) activation memory?
To compute the gradient at layer i, backprop needs the activation a_i that was produced during the forward pass — many local Jacobians depend on the input value. With L layers and batch size B, you store L activation tensors, total memory O(L · B · d). For a 70-billion-parameter LLM with 80 layers, sequence 8192, hidden 8192, batch 4 — activations alone consume >100 GB. This dominates the memory budget of modern training and motivates gradient checkpointing, ZeRO sharding, and CPU offload.
What is gradient checkpointing and why does it matter?
Gradient checkpointing (Chen et al. 2016) trades compute for memory. Instead of storing all L activations, store only sqrt(L) of them; on the backward pass, recompute the missing activations from the nearest checkpoint. Memory drops from O(L) to O(sqrt(L)); compute increases by ~33% (one extra forward pass). For a 100-layer net, store 10 checkpoints instead of 100. Standard in PyTorch via torch.utils.checkpoint, in HuggingFace Trainer, and in DeepSpeed. Critical for fitting large LLMs on limited-VRAM GPUs.
How does backprop handle shared weights (RNN, transformer)?
When a parameter W appears in multiple places — RNN unrolled across timesteps, transformer's same FFN applied to every token — gradients accumulate. Each use contributes its own dL/dW; autograd sums all contributions before the optimizer step. This is the multivariate chain rule: total derivative = sum over all paths. PyTorch handles it automatically: every backward call adds to the .grad attribute. You then call optimizer.zero_grad() before the next forward to clear accumulated gradients.