Machine Learning
Gradient Descent
Iteratively step in the direction opposite the gradient — the workhorse of ML optimization
Gradient descent is the iterative optimization algorithm that minimizes a loss function L(θ) by repeatedly updating parameters θ in the direction of steepest descent: θ ← θ − η ∇L(θ), where η is the learning rate. Batch GD computes the gradient over the entire dataset (slow, accurate). SGD uses a single sample (fast, noisy, escapes local minima). Mini-batch SGD averages over k samples (the workhorse, k = 32-512 typical). Modern variants: Momentum (Polyak 1964) accelerates along consistent directions; AdaGrad (Duchi 2011) scales per-parameter; RMSProp (Hinton 2012); Adam (Kingma 2014, the de-facto default for deep learning, combining momentum + RMSProp); AdamW decouples weight decay. Convergence guaranteed for convex L; hopeful but empirical for non-convex deep networks.
- Update ruleθ ← θ − η ∇L
- VariantsBatch, SGD, mini-batch, Adam, AdamW
- Mini-batch typical32-512
- ConvergenceConvex (proven), non-convex (empirical)
- Adam 2014Kingma & Ba
- AdamW 2017Decoupled weight decay
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
Why gradient descent matters
- Every modern ML training run. LLM pretraining (GPT-4 trained for months on tens of thousands of GPUs), reinforcement learning (PPO, DDPG), diffusion model training (Stable Diffusion, Sora), and even classical models like logistic regression all rely on some flavor of gradient descent.
- Scales to billions of parameters. Each step processes a mini-batch of data and updates all parameters simultaneously. The cost per step grows linearly with parameters; total wall-clock time depends on dataset, batch size, and convergence speed.
- Differentiable simulation. The same algorithm tunes parameters of differentiable physics simulators, fluid solvers, ray tracers, and protein-folding models. If a forward pass is differentiable, gradient descent can optimize it.
- RLHF and alignment. GPT-3.5/4, Claude, and Llama-Chat are aligned via gradient descent on a reward signal from human preferences (PPO-style RLHF, or DPO).
- Provably converges on convex problems. Lipschitz-smooth convex loss with constant step size yields a 1/T convergence rate; with momentum, 1/T^2. Foundations are well understood.
- Hardware-friendly. Each step is a sequence of matmuls and elementwise ops — the operations modern GPUs and TPUs accelerate best.
- Tunable. Learning rate, batch size, schedule, weight decay, beta1/beta2 — the hyperparameters give engineers many knobs to fit specific problems.
Batch, stochastic, and mini-batch
- Batch (full) gradient descent. Computes ∇L over the entire dataset before each update. Exact gradient, but each step costs O(N) data passes. Practical only for small datasets (N < 1e5).
- Stochastic gradient descent (SGD). Computes the gradient on a single random example. Each step is cheap but the gradient estimate is high-variance. Robbins-Monro 1951 proved convergence for decreasing learning rate η_t.
- Mini-batch SGD. Averages the gradient over k examples (k typically 32-512). Variance decreases as 1/k. Hardware-friendly because GPUs are most efficient with batch dimensions of 32+. The default for deep learning since 2012.
- Effective batch size. Distributed training scales batch size by the number of accelerators. Llama 3 uses effective batches of millions of tokens. Linear-scaling rule (Goyal et al. 2017): scale learning rate proportionally with batch size, with warmup, to retain accuracy.
- Gradient accumulation. Simulate large batches on memory-limited hardware: forward+backward on micro-batches, accumulate gradients, then step once. Equivalent to a single large batch.
Momentum and adaptive methods
- Polyak momentum (1964). v ← βv + g; θ ← θ − ηv. With β ≈ 0.9, the update direction is an exponentially weighted average of past gradients — smooths noise and accelerates through ravines.
- Nesterov accelerated gradient (1983). Computes the gradient at the lookahead position θ − ηβv rather than θ. Often slightly faster convergence on convex problems.
- AdaGrad (Duchi et al. 2011). Per-parameter learning rate inversely proportional to the square root of the sum of past squared gradients. Strong for sparse features (NLP); learning rate eventually decays to zero, hurting long-running training.
- RMSProp (Hinton 2012, lecture notes). Replaces AdaGrad's sum with an exponentially weighted average; learning rate no longer monotonically decays.
- Adam (Kingma & Ba 2014). Combines momentum with RMSProp-style adaptive learning rates. Default hyperparameters work for many tasks — one of the most cited papers in ML.
- AdamW (Loshchilov & Hutter 2017). Decouples weight decay from the gradient. Now default for transformer training.
- Lion (Chen et al. 2023). Sign-based optimizer found by symbolic search; uses less memory than Adam, often comparable performance.
- Shampoo / Sophia / SOAP. Approximate second-order methods that scale to deep nets — active research; sometimes outperform AdamW on language modeling.
Learning-rate schedules
- Constant. Simplest; works for short training. Risks divergence early or stagnation late.
- Linear warmup. Ramp from 0 to peak over the first 1-10% of steps. Critical for transformers with large batches — avoids early divergence from un-stable Adam moments.
- Cosine decay. Smoothly reduce from peak to ~10% of peak following a half-cosine. Used in GPT-2, Llama, Stable Diffusion.
- Inverse-sqrt. The original transformer schedule: rate proportional to 1/sqrt(step), often with a warmup multiplier.
- Step decay. Drop by 10× at fixed epochs (30, 60, 90 for ResNet-on-ImageNet). Common in vision.
- 1cycle (Smith 2018). Triangular schedule: warmup, then warm-down, then a final low-LR fine-tune phase. Empirically strong on small image-classification budgets.
- WSD (warmup-stable-decay). Used in Llama 2 and others — warmup, constant peak, cosine decay at the end. Allows mid-run resumption without recomputing the schedule.
Picking hyperparameters
- Learning rate. The single most important hyperparameter. Search log-spaced (1e-5, 1e-4, ..., 1e-1). For Adam, peak rates are typically 1e-4 to 5e-4 for transformer pretraining; 1e-3 to 3e-3 for image classification with SGD.
- Batch size. Largest that fits, then scale learning rate. Large batches give smoother gradients but smaller effective parallelism per token. Critical-batch-size analysis (McCandlish et al. 2018) gives a principled upper bound.
- Weight decay. Regularizer. AdamW with decay 0.1 is common for transformer pretraining; 1e-4 for ResNet.
- Beta1, beta2 for Adam. Defaults 0.9 and 0.999 work for most. Llama uses 0.95 for beta2 to give faster adaptation.
- Gradient clipping. Clip global gradient norm to 1.0 for transformers. Prevents single bad batches from blowing up training.
Convergence theory at a glance
- Convex Lipschitz-smooth. Constant step size 1/L gives O(1/T) convergence in function value. With momentum, O(1/T^2) (Nesterov optimal).
- Strongly convex. Linear convergence: gap shrinks by factor (1 − μ/L) per step.
- Non-convex smooth. Gradient descent finds an ε-stationary point in O(1/ε^2) steps; SGD adds variance terms.
- Deep networks. Mostly empirical. Theoretical results (Neural Tangent Kernel, mean-field) explain limited regimes; in practice, schedule + initialization + architecture matter more than tight convergence bounds.
Common misconceptions
- "Adam is always best." Adam is the default for transformers, but SGD with momentum often outperforms it on vision tasks (ResNet, ConvNeXt) and generalizes better. Choice is empirical.
- "Second-order is faster." Newton's method converges in fewer iterations but each iteration costs O(d^3) for the Hessian solve. For deep nets with d in the billions, full second-order is infeasible. Approximate methods (Shampoo, Sophia) are active research.
- "Smaller learning rate is safer." Divergence from too-small learning rate is rare; non-convergence (or convergence to a bad minimum) is common. Bigger-is-better up to the point of instability is the empirical rule.
- "Momentum is just a smoothing trick." On convex problems, accelerated methods provably reach Nesterov's optimal 1/T^2 rate — not achievable without momentum. It's a fundamental algorithmic improvement.
- "Batch GD always converges to global minimum." Only for convex losses. For non-convex deep nets, batch GD reaches a stationary point — which may be a saddle, a poor local min, or a good one.
- "You need to schedule the learning rate." Schedules help, but constant LR with Adam works surprisingly well on many tasks; schedules matter most when you push to the optimization frontier.
Practical tips
- Always plot the loss. Diverging loss in the first 100 steps usually means LR too high or warmup too short. Slow decrease usually means LR too low.
- Use mixed precision. bf16 for activations and matmul; fp32 for optimizer states. ~2× throughput on modern GPUs.
- Debug with overfitting. If a model can't drive loss to zero on a single batch with no regularization, there's a bug.
- Check gradient norms. If < 1e-6 throughout, gradients are vanishing; if > 100 in the first steps, you need more warmup or clipping.
- Save optimizer state. Adam stores m and v per parameter — a 7B model has 14B optimizer scalars. Resuming training without optimizer state is roughly equivalent to a fresh restart.
Frequently asked questions
Why use mini-batch instead of full-batch?
Full-batch gradient descent computes the exact gradient over all N training examples per update — for N = 1B that is 1 billion forward+backward passes per parameter step. Mini-batch averages over k samples (32-512 typical) producing a noisy but unbiased gradient estimate at 1/k the cost. The noise even helps generalization by preventing overfitting to sharp minima. Full-batch fits in memory only for small datasets; SGD and mini-batch dominate everywhere else.
What is the role of momentum?
Momentum (Polyak 1964, Nesterov 1983) maintains an exponentially weighted average of past gradients: v = beta*v + (1-beta)*grad, then update theta -= eta*v. Beta is typically 0.9. This accelerates progress along directions where gradients agree across steps and damps oscillation in directions where they disagree. Like a ball rolling down a hill — momentum carries it through small bumps. SGD-with-momentum often beats Adam on vision tasks.
Why does Adam combine momentum and RMSProp?
Adam (Kingma & Ba 2014) maintains two exponentially weighted averages: m (first moment, like momentum) and v (second moment, like RMSProp). Update: theta -= eta * m_hat / (sqrt(v_hat) + eps), where m_hat and v_hat are bias-corrected. The first moment provides momentum-style acceleration; dividing by sqrt(second moment) gives per-parameter adaptive learning rates — small for noisy/large gradients, large for consistent/small ones. Default hyperparameters (beta1=0.9, beta2=0.999, eps=1e-8) work well across many tasks.
What is the learning rate schedule (warmup, cosine, step)?
Learning rate often varies during training. Linear warmup ramps from 0 to a peak over the first 1-10% of steps — critical for transformers to avoid early divergence with large effective batch sizes. Cosine decay smoothly reduces from peak to a minimum (often 10% of peak) following a cosine curve — used in GPT-2, Llama, Stable Diffusion. Step decay drops by a factor (e.g., 10x) at fixed epoch boundaries — common for ResNet-on-ImageNet. Schedule choice meaningfully affects final accuracy.
Why doesn't gradient descent always find global minima?
Convex loss functions (logistic regression, SVMs) have a unique global minimum and gradient descent provably converges to it. Deep neural networks have non-convex losses with countless local minima, saddle points, and flat regions. Empirically, SGD with momentum tends to find good local minima that generalize well — possibly because noise helps escape sharp minima and find flat basins. Theoretical guarantees are limited; in practice, careful initialization, learning-rate schedules, and adequate compute matter more than convergence proofs.
What is AdamW vs Adam?
Original Adam (2014) couples weight decay with the gradient: it adds lambda*theta to the gradient, which then gets scaled by the adaptive 1/sqrt(v) factor — so weight decay shrinks more for parameters with small running gradients. Loshchilov & Hutter 2017 showed this is wrong; weight decay should be applied directly to theta independently of the adaptive rate: theta -= eta * (m_hat/(sqrt(v_hat)+eps) + lambda*theta). AdamW is now the default for transformer training (GPT, BERT, Llama), routinely outperforming vanilla Adam by 1-2 points on benchmarks.