Gradient Clipping

The seatbelt that keeps one bad batch from wrecking the whole run

Gradient clipping caps the size of the gradient before each weight update — by norm or by value — so a single huge step can't blow up training. It's the standard fix for exploding gradients in RNNs and Transformers.

  • Per-step cost: O(n) over parameters
  • Direction (by norm): Preserved exactly
  • Typical threshold: 1.0 (LLMs) · 5–10 (RNNs)
  • Memory overhead: O(1)
  • First described: Pascanu et al., 2013

How gradient clipping works

Training a neural network is a long chain of small steps: compute a loss, backpropagate it into a gradient, and nudge every weight a little in the downhill direction. The trouble is that "a little" is set by the learning rate times the gradient's magnitude — and that magnitude is not bounded. On most steps the gradient is well-behaved, but every so often one pathological batch produces a gradient hundreds or thousands of times larger than usual. Multiply that by the learning rate and the weights jump off a cliff. The loss spikes, often straight to NaN, and the whole run is dead.

Gradient clipping is the seatbelt. Right after backpropagation and right before the optimizer step, you measure how big the gradient is. If it's over a threshold you chose, you shrink it back down to that threshold; if it's under, you leave it alone. The expensive part of the step — the forward and backward passes — is untouched. You're only rescaling numbers that already exist.

The technique was popularized by Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio in their 2013 paper "On the difficulty of training Recurrent Neural Networks." They showed that the exploding-gradient problem in RNNs has a precise geometric cause — and that a brutally simple rescaling step tames it without changing the learning dynamics on normal steps.

The precise mechanism and math

There are two ways to clip, and they behave very differently.

Clip by global norm (the standard). Treat all the network's gradients as one giant vector g, compute its L2 norm, and if that norm exceeds a threshold c, rescale the whole vector by a single shared factor:

‖g‖ = sqrt( Σ_i g_i² )          over every parameter in the model

if ‖g‖ > c:
    g ← g · (c / ‖g‖)           one scalar multiply, every component
else:
    g ← g                       untouched

Because every component is multiplied by the same number, the direction of the update is preserved exactly: you walk the same way downhill, just no further than length c. Writing g' for the clipped gradient, ‖g'‖ = min(‖g‖, c). This is the version in torch.nn.utils.clip_grad_norm_ and tf.clip_by_global_norm.

Clip by value. Forget the norm; just truncate each component independently into a box [-c, c]:

g_i ← max(-c, min(c, g_i))      for every component independently

This is even cheaper but it bends the direction: if one component was huge and the rest were tiny, clamping only the huge one tilts the update vector toward the others. It can't overflow, but it's a blunter instrument and is rarely the recommended default.
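The difference in direction can be checked numerically. The following is a minimal sketch in plain Python; the helper names and the example gradient are made up for illustration, and cosine similarity is used here as one convenient way to measure how far the direction moved.

```python
import math

def clip_by_norm(g, c):
    """Return a copy of g rescaled so its L2 norm is at most c."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= c:
        return list(g)
    return [x * (c / norm) for x in g]

def clip_by_value(g, c):
    """Return a copy of g with each component clamped into [-c, c]."""
    return [max(-c, min(c, x)) for x in g]

def cosine(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

g = [30.0, 0.4, 0.3]            # hypothetical gradient: one huge component, two small
by_norm = clip_by_norm(g, 1.0)
by_value = clip_by_value(g, 1.0)

print(cosine(g, by_norm))       # 1.0 up to rounding: direction unchanged
print(cosine(g, by_value))      # about 0.90: direction tilted toward the small components
```

In this example clip-by-value leaves a vector of norm about 1.12, so it bounds each component but not the overall step length.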

Cost. Both are O(n) in the number of parameters n — one pass to accumulate the squared norm (or to clamp), one pass to rescale. Memory overhead is O(1): you keep a single running sum, not a copy of the gradient. Against the cost of the backward pass, which is itself O(n) but with a far larger constant, clipping is essentially free — typically under 1% of step time.

When to clip — and when not to

  • Recurrent networks (RNN, LSTM, GRU). The original and still strongest use case. Backprop-through-time is where gradients explode, and clipping is close to mandatory for plain RNNs.
  • Transformers and large language models. Nearly every large-scale training recipe clips by global norm at 1.0. GPT-style and BERT-style runs ship with it on by default; it's cheap insurance against a single bad shard of data spiking the loss.
  • Reinforcement learning. Policy-gradient methods like PPO and A3C clip gradients because the reward signal is noisy and occasionally produces enormous updates.
  • Mixed-precision (fp16) training. The smaller dynamic range of fp16 makes overflow likelier, so clipping pairs naturally with loss scaling — clip after unscaling.

When not to reach for it first: if you're clipping on the majority of steps, your learning rate is too high or your initialization is bad. Clipping is a safety net for the rare spike, not a substitute for a sane training setup. On a well-conditioned feedforward CNN with batch norm, you often don't need it at all.
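One way to tell a rare spike from a chronic problem is to track how often the pre-clip norm exceeds the threshold. The sketch below is an illustration only: the class name ClipRateMonitor, the window size, and the example norms are all assumptions, not part of any library.

```python
from collections import deque

class ClipRateMonitor:
    """Tracks the fraction of recent steps whose pre-clip norm exceeded max_norm."""

    def __init__(self, max_norm, window=100):
        self.max_norm = max_norm
        self.hits = deque(maxlen=window)    # 1 if the step was clipped, else 0

    def update(self, total_norm):
        """Record one step's pre-clip global norm."""
        self.hits.append(1 if total_norm > self.max_norm else 0)

    def clip_rate(self):
        """Fraction of steps in the window that were clipped (0.0 if empty)."""
        return sum(self.hits) / len(self.hits) if self.hits else 0.0

monitor = ClipRateMonitor(max_norm=1.0, window=100)
for norm in [0.4, 0.6, 0.5, 12.0, 0.7, 0.5, 0.6, 0.4, 0.5, 0.6]:   # made-up norms
    monitor.update(norm)

print(monitor.clip_rate())   # 0.1: one spike in ten steps
```

In a training loop, the value passed to update would be the pre-clip norm that the clipping call returns. A rate that stays near zero matches the safety-net role described above; a rate above one half suggests the learning rate or initialization deserves attention first.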

Clipping vs. other stability tricks

| | Clip by norm | Clip by value | Lower learning rate | Batch / layer norm | LSTM / GRU gating | Loss scaling (fp16) |
|---|---|---|---|---|---|---|
| Fixes exploding gradients | Yes | Yes | Partially | Indirectly | Indirectly | No (fixes underflow) |
| Fixes vanishing gradients | No | No | No | Helps | Yes (its main job) | No |
| Preserves update direction | Yes | No | Yes | n/a | n/a | Yes |
| Per-step cost | O(n), tiny | O(n), tiny | Free | O(n) + extra params | 3–4× the params | O(n), tiny |
| Slows normal-step learning | No | No | Yes | No | No | No |
| Tuning needed | One threshold | One threshold | The whole LR schedule | Almost none | Architectural | Scale factor (often auto) |
| Used in production LLMs | Universally | Rarely | Always | LayerNorm always | Pre-Transformer era | Universally |

The honest framing: these aren't competitors, they're layers of the same defense. A modern Transformer run uses layer norm and a warmup learning-rate schedule and clip-by-norm at 1.0 and loss scaling, all at once. Clipping is the one that catches the spike that slips past everything else.

What the numbers actually say

  • Why RNN gradients explode exactly. Backprop-through-time over T steps multiplies the recurrent weight matrix W by itself T times. The gradient magnitude scales like the largest singular value σ_max(W) raised to the T. With σ_max = 1.5 over a 50-step sequence, that's 1.5^50 ≈ 6.4 × 10⁸ — a billion-fold amplification from a matrix that's only 50% too large.
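  • Overflow is real, not theoretical. Single-precision floats max out near 3.4 × 10³⁸; fp16 maxes out at just 65,504. With σ_max = 1.5, the growth factor passes the fp16 limit after only 28 steps (1.5^28 ≈ 8.5 × 10⁴), and by 60 steps it is 1.5^60 ≈ 3.7 × 10¹⁰. fp32 holds out longer, but the same growth overflows it to inf, and then NaN, at around 219 steps.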
  • Clipping fires rarely once stable. In a healthy run with threshold 1.0, the global norm sits below the threshold on the large majority of steps after warmup — so clipping is a no-op most of the time and only intervenes on the spikes it exists for.
  • The cost is negligible. Computing a global norm over a 7-billion-parameter model is a single reduction — well under 1% of the per-step wall-clock dominated by the matrix multiplies in the forward and backward passes.
  • Default that almost nobody tunes: max_norm = 1.0. The GPT-3 paper reports clipping the global norm at 1.0, the original BERT training code does the same, and 1.0 is the default max_grad_norm in Hugging Face's Trainer.
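The amplification arithmetic in the bullets above can be checked with a few lines of Python. This is a sketch under the same simplifying assumption the bullets make, namely that the gradient grows like σ_max raised to the number of steps; the function names are made up for illustration.

```python
import math

FP16_MAX = 65504.0            # largest finite IEEE half-precision value
FP32_MAX = 3.4028235e38       # largest finite IEEE single-precision value

def amplification(sigma_max, steps):
    """Worst-case growth factor sigma_max ** steps."""
    return sigma_max ** steps

def steps_to_overflow(sigma_max, limit):
    """Smallest T with sigma_max ** T > limit (assumes sigma_max > 1)."""
    return math.floor(math.log(limit) / math.log(sigma_max)) + 1

print(f"{amplification(1.5, 50):.2e}")      # 6.38e+08
print(steps_to_overflow(1.5, FP16_MAX))     # 28
print(steps_to_overflow(1.5, FP32_MAX))     # 219
```

Under that growth model, fp16 runs out of range after a few dozen timesteps, while fp32 lasts a couple of hundred. Long before either limit, the update is already far too large to be useful.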

JavaScript implementation

Both modes, operating in place on a flat list of parameter-gradient arrays — the shape you'd get from a small autodiff engine.

// Clip the global L2 norm of ALL gradients to maxNorm.
// grads: array of Float32Array (one per parameter tensor), mutated in place.
// Returns the pre-clip total norm — useful to log and to pick a threshold.
function clipGradNorm(grads, maxNorm) {
  let sumSq = 0;
  for (const g of grads) {
    for (let i = 0; i < g.length; i++) sumSq += g[i] * g[i];
  }
  const totalNorm = Math.sqrt(sumSq);

  // +1e-6 guards against divide-by-zero on a dead step.
  const scale = maxNorm / (totalNorm + 1e-6);
  if (scale < 1) {                          // only shrink, never grow
    for (const g of grads) {
      for (let i = 0; i < g.length; i++) g[i] *= scale;
    }
  }
  return totalNorm;
}

// Clip each component independently into [-c, c]. Bends direction.
function clipGradValue(grads, c) {
  for (const g of grads) {
    for (let i = 0; i < g.length; i++) {
      g[i] = Math.max(-c, Math.min(c, g[i]));
    }
  }
}

Two details earn their keep. First, the scale < 1 guard means we never amplify a small gradient; clipping only ever shrinks. Second, the + 1e-6 in the denominator keeps a genuinely zero gradient (a fully masked batch, say) from producing maxNorm / 0 = Infinity. Here the guard would skip that rescale anyway, but an implementation that multiplies unconditionally would compute 0 × Infinity = NaN and silently poison the very thing clipping is meant to protect.

Python implementation (and the famous PyTorch one-liner)

From-scratch NumPy, mirroring the JS, then the line you'll actually write in production.

import numpy as np

def clip_grad_norm(grads, max_norm):
    """grads: list of np.ndarray, mutated in place. Returns pre-clip norm."""
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1.0:                       # only shrink
        for g in grads:
            g *= scale
    return total_norm

def clip_grad_value(grads, c):
    for g in grads:
        np.clip(g, -c, c, out=g)          # in place, bends direction

In real PyTorch you don't write any of that — it's one call, placed precisely between the backward pass and the optimizer step:

import torch

for batch in loader:
    optimizer.zero_grad()
    loss = model(batch).loss
    loss.backward()                       # gradients now populated

    # Clip the global norm across ALL parameters to 1.0.
    # Returns the pre-clip total norm — log it to tune the threshold.
    total_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=1.0
    )

    optimizer.step()                      # safe, bounded step

The ordering is the whole game: backward() → clip_grad_norm_() → step(). Clip after the gradients exist and after any loss averaging or gradient accumulation, but before the weights actually move. With mixed precision there is one more step, unscale first and then clip:

scaler.scale(loss).backward()
scaler.unscale_(optimizer)                # bring grads back to true scale
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()
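The point about gradient accumulation can be made concrete with a small numeric comparison. The sketch below uses plain Python lists and made-up micro-batch gradients; it is an illustration of why the order matters, not a training loop.

```python
import math

def l2_norm(g):
    """L2 norm of a flat gradient vector."""
    return math.sqrt(sum(x * x for x in g))

def clip_by_norm(g, c):
    """Return a copy of g rescaled so that its L2 norm is at most c."""
    norm = l2_norm(g)
    return [x * (c / norm) for x in g] if norm > c else list(g)

def mean(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Made-up micro-batch gradients; the second one is a spike.
micro_grads = [[0.2, 0.1], [40.0, -30.0], [0.1, 0.3], [0.3, 0.2]]

# Order described above: accumulate (here, average) first, then clip once.
clip_after = clip_by_norm(mean(micro_grads), 1.0)

# A different algorithm: clip every micro-batch, then average.
clip_before = mean([clip_by_norm(g, 1.0) for g in micro_grads])

print(l2_norm(clip_after))    # about 1.0: the step sits exactly at the threshold
print(l2_norm(clip_before))   # about 0.35: a shorter step in a different direction
```

The two orders produce different updates from the same gradients, so a threshold tuned for one does not carry over to the other. Clipping once after accumulation is the order that matches a single large batch.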

Variants worth knowing

Adaptive Gradient Clipping (AGC). Introduced with NFNets (Brock et al., 2021) to train large image models without batch norm. Instead of one global threshold on the gradient norm, AGC looks at each unit (for example, each row of a weight matrix) and clips its gradient whenever the ratio of gradient norm to weight norm exceeds a small constant λ. The allowed gradient size therefore scales with the weights it is protecting, which makes the setting far less sensitive to layer scale than a single absolute threshold.
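A minimal sketch of the unit-wise rule, in plain Python on a weight matrix stored as a list of rows. It assumes the usual formulation in which a unit's gradient is rescaled when its norm exceeds λ times the unit's weight norm, with the weight norm floored at a small ε so zero-initialized weights can still receive updates. The function name agc_clip and the values of lam and eps are illustrative, not tuned recommendations.

```python
import math

def agc_clip(weight_rows, grad_rows, lam=0.01, eps=1e-3):
    """Unit-wise adaptive clipping; returns clipped copies of grad_rows.

    weight_rows / grad_rows: lists of rows, one row per output unit.
    """
    clipped = []
    for w, g in zip(weight_rows, grad_rows):
        w_norm = max(math.sqrt(sum(x * x for x in w)), eps)   # floor tiny weights
        g_norm = math.sqrt(sum(x * x for x in g))
        max_g_norm = lam * w_norm             # allowed gradient norm for this unit
        if g_norm > max_g_norm:
            clipped.append([x * (max_g_norm / g_norm) for x in g])
        else:
            clipped.append(list(g))
    return clipped

weights = [[3.0, 4.0], [0.3, 0.4]]            # row norms 5.0 and 0.5
grads = [[0.012, 0.016], [0.012, 0.016]]      # both rows have norm 0.02
out = agc_clip(weights, grads, lam=0.01)

print(out[0])   # unchanged: 0.02 is below the limit 0.01 * 5.0 = 0.05
print(out[1])   # rescaled to norm 0.005, the limit 0.01 * 0.5
```

The same gradient is left alone for the large-weight unit and shrunk for the small-weight one, which is the behavior a single global threshold cannot express.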

Per-layer (per-parameter) norm clipping. Clip each tensor's norm separately rather than one global norm. Useful when different layers live on wildly different gradient scales, but it changes the relative step sizes between layers, so it's less common than global clipping.

Gradient-norm warmup / auto-clipping. Track a running quantile (e.g. the 90th percentile) of recent gradient norms and set the threshold there automatically. This removes the need to guess a constant and adapts as training enters different regimes.
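A sketch of the running-quantile idea in plain Python. The class name QuantileClipThreshold, the window size, and the nearest-rank quantile are all assumptions made for illustration; a real setup might use a different estimator or a longer history.

```python
import math
from collections import deque

class QuantileClipThreshold:
    """Clipping threshold set to a running quantile of recent pre-clip norms."""

    def __init__(self, quantile=0.9, window=1000, initial=1.0):
        self.quantile = quantile
        self.history = deque(maxlen=window)
        self.initial = initial              # used until any norm has been seen

    def update(self, total_norm):
        """Record a pre-clip norm (finite values only) and return the threshold."""
        if math.isfinite(total_norm):
            self.history.append(total_norm)
        return self.threshold()

    def threshold(self):
        if not self.history:
            return self.initial
        ordered = sorted(self.history)
        # Nearest-rank quantile; the small offset absorbs float rounding.
        rank = max(1, math.ceil(self.quantile * len(ordered) - 1e-9))
        return ordered[rank - 1]

tracker = QuantileClipThreshold(quantile=0.9, window=1000)
for norm in [0.5] * 9 + [50.0]:             # made-up norms: nine normal steps, one spike
    threshold = tracker.update(norm)

print(threshold)                            # 0.5: the spike does not move the 90th percentile
```

The returned value would then be passed as the max norm for that step's clipping call. Sorting the window on every step is fine for a sketch, though a long window would call for a cheaper quantile estimate.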

Per-sample clipping for differential privacy. DP-SGD clips the gradient of each individual training example before averaging and adding noise. Same arithmetic, completely different goal: here clipping bounds any one record's influence so a privacy guarantee can be proven, not to stop a NaN.
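The per-example arithmetic can be sketched as follows, in plain Python. This assumes the common DP-SGD formulation: clip each example's gradient to norm C, sum, add Gaussian noise with standard deviation equal to a noise multiplier times C, then divide by the batch size. The function name and parameters are illustrative, and the sketch says nothing about privacy accounting.

```python
import math
import random

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Per-example scale: shrink to clip_norm if larger, otherwise leave alone.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(g):
            total[i] += x * scale
    sigma = noise_multiplier * clip_norm      # noise is calibrated to the clip bound
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]              # made-up per-example gradients, norms 5.0 and 0.5

# With the noise switched off, only the clipping is visible.
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng))
# approximately [0.45, 0.6]: the first example was shrunk to norm 1.0, the second untouched

noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Because every example contributes a vector of norm at most C, the noise scale can be set relative to C, which is what makes the clipping step essential here rather than optional.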

Common bugs and edge cases

  • Clipping in the wrong order. Calling clip_grad_norm_ before backward() clips stale or empty gradients and does nothing; calling it after step() clips gradients the optimizer has already consumed. It must sit strictly between the two.
  • Forgetting to unscale under mixed precision. If you clip fp16 gradients that are still multiplied by the loss-scale factor, your threshold of 1.0 is effectively a threshold of (1.0 ÷ scale) — often thousands of times too tight. Always unscale_ first.
  • Clipping every step (threshold too low). If the norm is over the threshold on most steps you're not catching spikes, you're capping the learning rate. Symptoms: training that's stable but mysteriously slow. Log the unclipped norm and raise the threshold.
  • Divide-by-zero on a dead batch. A fully-masked or empty batch can yield an exactly-zero gradient norm; c / 0 is inf, and an unguarded rescale then computes 0 × inf = NaN, exactly the bug clipping is supposed to prevent. Add an epsilon to the denominator.
  • Confusing by-value with by-norm. Setting clip-by-value to 1.0 on a model whose healthy gradients exceed 1.0 component-wise will quietly cripple learning, because it's truncating normal updates, not just spikes. By-value thresholds are not interchangeable with by-norm thresholds.
  • Treating clipping as a cure. If clipping is the only thing keeping a run alive, the real fix is elsewhere — lower the learning rate, add a warmup, switch a plain RNN to an LSTM, or add layer normalization. Clipping hides the symptom; it doesn't condition the problem.

Frequently asked questions

What's the difference between clipping by norm and clipping by value?

Clip-by-norm rescales the entire gradient vector by one shared factor when its global L2 norm exceeds a threshold, so the direction is preserved exactly. Clip-by-value truncates each component independently to a fixed range like [-1, 1], which is simpler but bends the update direction. By-norm is the standard choice; by-value is mostly used as a crude last resort.

Why do RNNs need gradient clipping more than feedforward networks?

A recurrent net reuses the same weight matrix at every timestep, so backpropagation through time multiplies that matrix by itself dozens of times. If its largest singular value exceeds 1, the gradient grows exponentially in the sequence length and overflows to NaN. Feedforward nets multiply different matrices and rarely hit this geometric blow-up.

Does gradient clipping change where training converges?

On steps where the norm is below the threshold — the vast majority once training stabilizes — clipping does nothing, so the update is identical to plain SGD or Adam. It only intervenes on rare spikes, where it caps the step length while keeping the direction. The result is a biased estimator of the true gradient, but the bias is confined to the few catastrophic steps it exists to tame.

How do I pick the clipping threshold?

Log the unclipped global norm for a few hundred steps and pick a threshold near the top of the normal range — common values are 1.0 for Transformers and 5.0 or 10.0 for RNNs. Too low and you throttle every step, slowing learning; too high and rare spikes still cause NaNs. Many recipes set 1.0 by default and never tune it.

Should I clip before or after dividing by the batch size?

Clip after all gradient scaling — including loss averaging and gradient accumulation — but before the optimizer step. In PyTorch that means calling clip_grad_norm_ after loss.backward() and after unscaling mixed-precision gradients, then optimizer.step(). Clipping a still-scaled gradient compares its norm against the wrong threshold.

Does gradient clipping replace fixing exploding gradients properly?

No — it's a safety net, not a cure. Clipping masks the symptom but leaves the cause: a poorly conditioned recurrence, too-high a learning rate, or missing normalization. Pair it with LSTM/GRU gating, layer normalization, careful initialization, and a warmup schedule. If you're clipping on most steps, your learning rate is too high.