Machine Learning

Adam Optimizer

One learning rate per parameter — tuned by the gradient's own history

Adam is a gradient-descent optimizer that gives every parameter its own adaptive learning rate, derived from running estimates of the gradient's mean (first moment) and its uncentered variance (second moment).

IntroducedKingma & Ba, 2015
Default learning rate1e-3
β₁, β₂, ε0.9, 0.999, 1e-8
State per parameter2 values (m, v)
Cost per stepO(d) in d parameters

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The intuition: per-parameter step sizes

Plain gradient descent uses one learning rate for every weight in the model. That is a problem, because a single global step size is always wrong for someone. The weights wired to a rare word in an embedding table see a gradient only occasionally, but a large one when they do — they want big, confident steps. The weights in a saturated layer see tiny, noisy gradients every batch — they want small, careful steps. With one global η you either crawl for the rare features or thrash the noisy ones.

Adam — short for Adaptive Moment Estimation, introduced by Diederik Kingma and Jimmy Ba in 2015 — fixes this by giving every parameter its own effective learning rate, and it does so without any extra hyperparameters to tune per parameter. It maintains two running averages for each weight: the mean of recent gradients (the first moment, like momentum) and the mean of recent squared gradients (the second moment, an estimate of variance). The actual step divides the smoothed gradient by the square root of that variance estimate, so a parameter whose gradients have been large or erratic automatically takes smaller steps, and a parameter whose gradients have been small and consistent automatically takes larger ones.

The mental model: momentum gives the optimizer a velocity, so it rolls through small bumps and noise; the variance term gives it a per-axis brake, so it doesn't overshoot in steep directions while it makes progress in shallow ones. Adam is momentum and RMSProp fused into a single update, with one correction to fix a startup artifact.

The precise update rule

At step t, let g_t = ∇L(θ) be the gradient. Adam computes, element-wise over the whole parameter vector:

m_t = β1 · m_{t-1} + (1 − β1) · g_t          # first moment (mean)
v_t = β2 · v_{t-1} + (1 − β2) · g_t²         # second moment (uncentered variance)

m̂_t = m_t / (1 − β1^t)                       # bias correction
v̂_t = v_t / (1 − β2^t)

θ_t = θ_{t-1} − η · m̂_t / (√v̂_t + ε)        # the update

Three pieces do the work. The moving averages m and v are exponentially weighted: with β₁ = 0.9 the first moment effectively averages roughly the last 10 gradients; with β₂ = 0.999 the second moment averages the last 1000 squared gradients, so it changes slowly and stably.

The bias correction matters more than it looks. Because m and v start at zero, the early averages are pulled toward zero. At t = 1, v_1 = 0.001 · g_1² — a thousandfold underestimate — so without correction √v̂ would be tiny and the first steps would explode. Dividing by (1 − β2^t) rescales it back; the correction is large early and fades to 1 as t grows.

The denominator √v̂ + ε is the adaptive part. Note that the per-step displacement is bounded: ignoring the ε and bias terms, the magnitude of the update is roughly η · |m̂| / √v̂, which is at most about η when the gradient is consistent (because then |m̂| ≈ √v̂). This "trust region" property is why Adam tolerates a wide range of gradient scales without diverging.

Cost and complexity: each step is O(d) time and O(d) extra memory for d parameters — the same asymptotic class as SGD, just with two extra state tensors per weight and a handful more flops. There is no matrix inverse, no second-order Hessian; Adam is a first-order method that only approximates curvature through the variance estimate.

When to reach for Adam

Transformers and language models — sparse, wildly-scaled gradients across attention, embeddings, and layer norms make per-parameter rates essential. AdamW is the default for GPT, Llama, BERT, and friends.
Sparse features and embeddings — rare tokens get the big, occasional steps they need without you tuning a schedule per feature.
GANs and reinforcement learning — non-stationary, noisy objectives where SGD's fixed rate struggles to stay stable.
Fast prototyping — the 1e-3 default works out of the box on most problems, so you spend less time tuning learning rates and more time on the model.

Reach for SGD with momentum instead when you are training a CNN on a vision benchmark with a long, well-tuned schedule — it frequently reaches lower test error. And reach for AdamW (not plain Adam) whenever you want weight decay, which in practice is almost always.

Adam vs other optimizers

	SGD	SGD + momentum	RMSProp	Adam	AdamW
Per-parameter rate	no	no	yes (variance)	yes (variance)	yes (variance)
Uses momentum	no	yes	no	yes	yes
Bias correction	—	—	no	yes	yes
Extra state per weight	0	1 (velocity)	1 (variance)	2 (m, v)	2 (m, v)
Weight decay	L2 ≡ decay	L2 ≡ decay	L2 ≡ decay	L2 (coupled, distorted)	decoupled (correct)
LR sensitivity	high	high	medium	low	low
Typical default LR	1e-1 to 1e-2	1e-1 to 1e-2	1e-3	1e-3	1e-3
Best for	simple, convex	CNN vision benchmarks	RNNs (historically)	general default	transformers, anything with decay

RMSProp and momentum are the two halves of Adam: RMSProp contributes the variance-scaled denominator, momentum contributes the smoothed numerator. Adam is the marriage of the two plus bias correction. AdamW changes only one line — where weight decay is applied — but that line is why it generalizes better and why it, not vanilla Adam, dominates modern training.

What the numbers actually say

Memory: 2 extra tensors per parameter. A 7-billion-parameter model in fp32 needs about 28 GB for the weights and 56 GB for Adam's m and v — optimizer state alone is twice the model. With fp32 master weights plus gradients, full mixed-precision Adam training runs roughly 16 bytes per parameter, ~112 GB for 7B. This is why 8-bit Adam (bitsandbytes) and ZeRO sharding exist.
Compute: ~3× the optimizer flops of SGD per step, but still negligible next to the forward/backward pass — the update is a handful of element-wise ops over the parameter vector, far cheaper than the matrix multiplies that produced the gradient.
Convergence: fewer steps to a target training loss. On many tasks Adam reaches a given loss in noticeably fewer iterations than tuned SGD, especially early; the original paper showed faster convergence on MNIST logistic regression, MLPs, and CNNs.
Robustness: the 1e-3 default just works. Across a broad range of architectures you can start at 1e-3 and get a reasonable run with no learning-rate search — the property that made Adam the field's default optimizer for a decade.

JavaScript implementation

class Adam {
  constructor(numParams, { lr = 1e-3, b1 = 0.9, b2 = 0.999, eps = 1e-8 } = {}) {
    this.lr = lr; this.b1 = b1; this.b2 = b2; this.eps = eps;
    this.m = new Float64Array(numParams);   // first moment
    this.v = new Float64Array(numParams);   // second moment
    this.t = 0;                             // timestep
  }

  // theta: Float64Array of weights, grad: Float64Array of ∂L/∂theta. Updates theta in place.
  step(theta, grad) {
    this.t += 1;
    const { b1, b2, eps, lr } = this;
    const bc1 = 1 - Math.pow(b1, this.t);   // 1 − β1^t
    const bc2 = 1 - Math.pow(b2, this.t);   // 1 − β2^t
    for (let i = 0; i < theta.length; i++) {
      const g = grad[i];
      this.m[i] = b1 * this.m[i] + (1 - b1) * g;
      this.v[i] = b2 * this.v[i] + (1 - b2) * g * g;
      const mHat = this.m[i] / bc1;
      const vHat = this.v[i] / bc2;
      theta[i] -= lr * mHat / (Math.sqrt(vHat) + eps);
    }
  }
}

// AdamW: decouple weight decay — subtract λ·θ AFTER the Adam step, not inside the gradient.
class AdamW extends Adam {
  constructor(n, opts = {}) { super(n, opts); this.wd = opts.weightDecay ?? 0.01; }
  step(theta, grad) {
    const decay = this.lr * this.wd;
    for (let i = 0; i < theta.length; i++) theta[i] -= decay * theta[i]; // decoupled
    super.step(theta, grad);
  }
}

Two details worth flagging. First, m and v are the entire reason Adam costs 2× SGD-with-momentum in memory — they are full-size parameter tensors. Second, the bias-correction factors bc1 and bc2 are scalars recomputed once per step, not per parameter; the common bug is to fold them into the loop and pay for Math.pow millions of times.

Python implementation

import numpy as np

class Adam:
    def __init__(self, shape, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = np.zeros(shape)   # first moment
        self.v = np.zeros(shape)   # second moment
        self.t = 0

    def step(self, theta, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)   # bias correction
        v_hat = self.v / (1 - self.b2 ** self.t)
        theta -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return theta


class AdamW(Adam):
    """Decoupled weight decay (Loshchilov & Hutter, 2019)."""
    def __init__(self, shape, weight_decay=0.01, **kw):
        super().__init__(shape, **kw)
        self.wd = weight_decay

    def step(self, theta, grad):
        theta -= self.lr * self.wd * theta   # decoupled decay, applied to weights
        return super().step(theta, grad)


# Worked example: minimize f(x) = (x - 3)^2, gradient 2(x - 3)
opt = Adam(shape=(1,), lr=0.1)
theta = np.array([0.0])
for _ in range(200):
    grad = 2 * (theta - 3.0)
    theta = opt.step(theta, grad)
print(theta)   # ≈ [3.0]

The PyTorch and TensorFlow built-ins (torch.optim.Adam, torch.optim.AdamW) implement exactly this update, fused into a single CUDA kernel and vectorized across every parameter group. The hand-rolled version above is for understanding, not for training a real model — use the framework's optimizer in production.

Variants worth knowing

AdamW (2019, Loshchilov & Hutter). Decouples weight decay from the gradient so the adaptive denominator doesn't distort regularization. It is now the default optimizer for essentially all transformer training. If you use weight decay at all, use AdamW.

AMSGrad (2018). Replaces v̂ with the running maximum of v, fixing a convergence counterexample where Adam fails on a simple online problem because the effective rate can grow. Rarely changes results in practice, but it patches a real theoretical hole in the original proof.

Nadam. Folds Nesterov momentum into Adam's first moment — looks one step ahead before averaging, sometimes shaving a little off convergence.

RAdam (rectified Adam). Adds a warmup-like rectification term that tames the high variance of the adaptive rate in the first few steps, often removing the need for a manual learning-rate warmup.

Lion and Adafactor. Memory-lean alternatives for large models. Lion (2023) keeps only the sign of a momentum buffer — one state tensor instead of two. Adafactor factorizes the second moment to sub-linear memory, used heavily in training large T5-style models on limited memory.

Common bugs and edge cases

Using Adam when you mean AdamW. Calling Adam(weight_decay=...) applies coupled L2, which the 1/√v term distorts so heavily-updated weights are barely decayed. Switch to AdamW for correct, scale-invariant decay.
Skipping or mis-timing bias correction. Drop it and the first dozen steps either explode or stall, because v is a thousandfold underestimate at t = 1. Recompute 1 − β^t per step, not once.
ε in the wrong place. The correct form is m̂ / (√v̂ + ε). Writing m̂ / √(v̂ + ε) changes the behavior near zero gradients; frameworks default to the former, but the value of ε (1e-8 vs 1e-3 for some transformer recipes) genuinely matters for stability.
No learning-rate warmup on transformers. Adam's early adaptive rate is high-variance; large transformers diverge in the first few hundred steps without linear warmup. Either warm up or use RAdam.
Forgetting optimizer state in checkpoints. Saving only the weights and not m, v, and t means resuming restarts the moment estimates from zero — a visible loss spike on resume.
Expecting Adam to escape every trap. It is still a first-order method. It smooths and rescales gradients but does not see curvature directly; it can settle into sharp minima that generalize worse than the flatter basins SGD tends to find.

Frequently asked questions

What do the first and second moments mean in Adam?

The first moment m is an exponentially-weighted moving average of the gradient — it acts like momentum, smoothing the direction of travel. The second moment v is the same kind of average of the gradient squared, an estimate of its uncentered variance. Dividing the step by √v shrinks updates for parameters with large or noisy gradients and enlarges them for parameters with small, steady gradients.

Why does Adam need bias correction?

m and v start at zero, so for the first few steps they are biased toward zero — badly, because β1 = 0.9 and β2 = 0.999 decay slowly. Dividing by (1 − β1^t) and (1 − β2^t) rescales them to unbiased estimates. Without it, the very first updates would be far too large, especially because v is underestimated — a tiny √v̂ in the denominator inflates the step — destabilizing early training.

What is the difference between Adam and AdamW?

Adam folds L2 weight decay into the gradient, which the adaptive 1/√v term then distorts so that large-gradient weights are barely regularized. AdamW decouples weight decay, subtracting λθ from the weights directly after the Adam step. AdamW generalizes better and is the default in modern transformer training — GPT, Llama, and most vision transformers use it.

What learning rate should I use with Adam?

The paper's default is 1e-3, and it is a remarkably robust starting point — Adam needs far less learning-rate tuning than SGD. For transformers, 1e-4 to 3e-4 with linear warmup and cosine decay is typical; for fine-tuning, 1e-5 to 5e-5. The adaptive denominator does a lot of the per-parameter scaling that you would otherwise tune by hand.

Why does SGD sometimes beat Adam on image classification?

Adam converges faster in training loss but can settle into sharper minima that generalize slightly worse. On well-tuned CNN benchmarks like ImageNet, SGD with momentum often reaches a lower test error given a long enough schedule. Adam wins where gradients are sparse or wildly scaled — transformers, embeddings, GANs, RL — and where you can't afford to hand-tune a learning rate.

How much memory does Adam use compared to SGD?

Adam stores two extra tensors per parameter — m and v — so it roughly triples optimizer memory versus plain SGD, or doubles it versus SGD with momentum. For a 7-billion-parameter model in fp32 that is about 56 GB of optimizer state alone, which is why large-model training leans on 8-bit Adam, sharded optimizers (ZeRO), or fused kernels.