Chain Rule

Differentiate compositions — d/dx[f(g(x))] = f'(g(x)) · g'(x)

The chain rule says that to differentiate a composition f(g(x)), you multiply the derivative of the outer function (evaluated at g(x)) by the derivative of the inner function. It's the workhorse of calculus, the engine of backpropagation in neural networks, and the rule that reduces the derivative of a composite to the derivatives of its parts. Without it, you would have to go back to the limit definition to differentiate sin(x²) or e^(3x).

Formula: d/dx[f(g(x))] = f'(g(x)) · g'(x)
Leibniz form: dy/dx = (dy/du) · (du/dx)
For three functions: d/dx[f(g(h(x)))] = f'(g(h(x))) · g'(h(x)) · h'(x)
Multivariable form: D(f∘g)(x) = Df(g(x)) · Dg(x) (a product of Jacobian matrices)
Used in: Implicit differentiation, related rates, backpropagation, Jacobians
First known use: 1676 (Leibniz)

The formula

For a composite function f(g(x)) — apply g first, then f — the chain rule says:

d/dx [f(g(x))] = f'(g(x)) · g'(x)

In Leibniz notation with y = f(u) and u = g(x):

dy/dx = (dy/du) · (du/dx)

This looks like fraction multiplication, and that's no accident: the chain rule is the result of multiplying difference quotients and taking limits. The notation invites you to "cancel" du, but treat that as a mnemonic. dy/du and du/dx are limits, not fractions, so the identity has to be proved from the limit definition rather than by cancellation.
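
One way to sanity-check the formula is to compare it with a numerical derivative. The sketch below is only an illustration: the helper name numericDerivative, the test point x0 = 1.3, and the step size h are assumptions chosen for the example, and the composite is sin(x²).

```javascript
// Central-difference approximation of the derivative of fn at x.
function numericDerivative(fn, x, h = 1e-5) {
  return (fn(x + h) - fn(x - h)) / (2 * h);
}

// Composite y = f(g(x)) with f(u) = sin(u) and g(x) = x².
const g = x => x * x;
const gPrime = x => 2 * x;
const f = u => Math.sin(u);
const fPrime = u => Math.cos(u);

// Chain rule: outer derivative evaluated at g(x), times inner derivative at x.
const chainRule = x => fPrime(g(x)) * gPrime(x);

const x0 = 1.3; // arbitrary test point
const analytic = chainRule(x0);
const numeric = numericDerivative(x => f(g(x)), x0);

// The two values should agree to several decimal places.
console.log({ analytic, numeric });
```

The agreement is approximate by construction: the finite difference carries truncation and rounding error, so it confirms the formula without replacing the proof.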

Worked examples

Example 1 — Differentiate sin(x²)

Outer — f(u) = sin(u), so f'(u) = cos(u). Inner — g(x) = x², so g'(x) = 2x.

d/dx[sin(x²)] = cos(x²) · 2x = 2x cos(x²)

Example 2 — Differentiate e^(3x)

Outer — f(u) = e^u, f'(u) = e^u. Inner — g(x) = 3x, g'(x) = 3.

d/dx[e^(3x)] = e^(3x) · 3 = 3e^(3x)

Example 3 — Triple composition

Differentiate cos(sin(x²)). Three layers — outer cos, middle sin, inner x².

d/dx[cos(sin(x²))] = −sin(sin(x²)) · cos(x²) · 2x

Each layer contributes one factor — derivative of cos (the −sin), evaluated at sin(x²); times derivative of sin (the cos), evaluated at x²; times derivative of x² (the 2x).
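
The "one factor per layer" pattern can be sketched as a loop. The layer structure below (objects with hypothetical fields f and df, listed innermost first) and the function name chainDerivative are assumptions made for this illustration, not a standard API.

```javascript
// Each layer pairs a function with its derivative, listed innermost first.
const layers = [
  { f: x => x * x, df: x => 2 * x },        // inner: x²
  { f: Math.sin, df: Math.cos },            // middle: sin
  { f: Math.cos, df: u => -Math.sin(u) },   // outer: cos
];

// Walk the layers from the inside out, multiplying in one factor per layer.
function chainDerivative(layerList, x) {
  let value = x;
  let derivative = 1;
  for (const { f, df } of layerList) {
    derivative *= df(value); // this layer's derivative, evaluated at its own input
    value = f(value);        // the output becomes the next layer's input
  }
  return { value, derivative };
}

const result = chainDerivative(layers, 0.7);
console.log(result); // value is cos(sin(0.49)); derivative is the three-factor product
```

Adding a fourth layer to the array adds a fourth factor to the product, with no other change to the loop.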

Example 4 — Implicit differentiation

Take the circle x² + y² = 25 and find dy/dx. Differentiate both sides with respect to x, using the chain rule on y² because y depends on x:

2x + 2y · (dy/dx) = 0
2y · (dy/dx) = −2x
dy/dx = −x/y

So at the point (3, 4) on the circle, the slope is −3/4. We didn't need to solve y = √(25 − x²) explicitly; implicit differentiation handles the curve directly. The formula holds wherever y ≠ 0; at (±5, 0) the tangent line is vertical and dy/dx is undefined.
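
As a rough check of dy/dx = −x/y, the sketch below compares it with a finite-difference slope on the explicit upper and lower halves of the circle. The function names and the step size h are assumptions chosen for the illustration.

```javascript
// Slope from implicit differentiation of x² + y² = 25.
const slopeImplicit = (x, y) => -x / y;

// Explicit branches of the same circle, used only for the comparison.
const upperHalf = x => Math.sqrt(25 - x * x);
const lowerHalf = x => -Math.sqrt(25 - x * x);

// Central-difference slope of a branch at x.
const h = 1e-6;
const slopeNumeric = (branch, x) => (branch(x + h) - branch(x - h)) / (2 * h);

console.log(slopeImplicit(3, 4), slopeNumeric(upperHalf, 3));   // both close to -0.75
console.log(slopeImplicit(3, -4), slopeNumeric(lowerHalf, 3));  // both close to 0.75
```

The single implicit formula covers both branches, while the explicit approach needs a separate function for each.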

Why the chain rule works (proof sketch)

From the limit definition:

(d/dx) f(g(x))
= lim_{h→0} (f(g(x+h)) − f(g(x))) / h

Multiply numerator and denominator by g(x+h) − g(x) (assuming nonzero):

= lim_{h→0} [(f(g(x+h)) − f(g(x))) / (g(x+h) − g(x))] · [(g(x+h) − g(x)) / h]

The first bracket is f's difference quotient at g(x), with input change g(x+h) − g(x). As h → 0, that input change → 0 (g is differentiable at x, hence continuous there), so the bracket approaches f'(g(x)). The second bracket is g's difference quotient, approaching g'(x). The limit of a product is the product of the limits, which gives f'(g(x)) · g'(x).

If g(x+h) = g(x) for values of h arbitrarily close to 0 (e.g., g constant), the first bracket is undefined at those h and a slightly different argument is needed; it arrives at the same answer.
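
The two brackets in the proof sketch can be watched converging numerically. This is a minimal sketch for sin(x²) at x = 1; the function name differenceQuotients and the list of step sizes are assumptions made for the illustration.

```javascript
// Returns the two brackets from the proof sketch for a given step h.
function differenceQuotients(f, g, x, h) {
  const du = g(x + h) - g(x);
  if (du === 0) {
    // The excluded case: the factorization divides by zero here.
    throw new Error("g(x+h) equals g(x); this factorization does not apply");
  }
  return {
    outer: (f(g(x + h)) - f(g(x))) / du, // approaches f'(g(x))
    inner: du / h,                        // approaches g'(x)
  };
}

const f = Math.sin;
const g = x => x * x;

// As h shrinks, outer tends to cos(1), inner tends to 2, and the product to 2 cos(1).
for (const h of [1e-1, 1e-2, 1e-3, 1e-4]) {
  const { outer, inner } = differenceQuotients(f, g, 1, h);
  console.log(h, outer, inner, outer * inner);
}
```

Passing a constant g triggers the error branch, which mirrors the case the proof has to treat separately.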

The multivariable chain rule

For a function f of several variables that themselves depend on t — say f(x(t), y(t)):

df/dt = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)

The total derivative df/dt is the sum of partial derivatives of f, each weighted by how the corresponding variable changes with t. In matrix form, with vector inputs and outputs, the chain rule says — the Jacobian of the composition is the product of the Jacobians.

This is the form used in machine learning. A neural network is a deep chain of operations; the gradient of the loss with respect to weights is computed by multiplying Jacobians backward through the layers — that's backpropagation in matrix-multiply form.
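
The formula df/dt = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt) can be checked on a small example. The specific choices below, f(x, y) = x²y along the path x(t) = cos t, y(t) = sin t, and all function names, are assumptions picked for the illustration.

```javascript
// f(x, y) = x² y and its partial derivatives.
const f = (x, y) => x * x * y;
const fx = (x, y) => 2 * x * y;
const fy = (x, y) => x * x;

// The path (x(t), y(t)) and its derivatives with respect to t.
const xOf = t => Math.cos(t);
const yOf = t => Math.sin(t);
const dxdt = t => -Math.sin(t);
const dydt = t => Math.cos(t);

// Multivariable chain rule: each partial is weighted by how its variable moves.
function totalDerivative(t) {
  const x = xOf(t);
  const y = yOf(t);
  return fx(x, y) * dxdt(t) + fy(x, y) * dydt(t);
}

// Finite-difference comparison along the same path.
const t0 = 0.4;
const h = 1e-5;
const numeric = (f(xOf(t0 + h), yOf(t0 + h)) - f(xOf(t0 - h), yOf(t0 - h))) / (2 * h);
console.log(totalDerivative(t0), numeric); // should agree to several decimal places
```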

JavaScript — automatic differentiation

// Forward-mode automatic differentiation using "dual numbers"
class Dual {
  constructor(real, dual = 0) { this.r = real; this.d = dual; }

  static add(a, b) { return new Dual(a.r + b.r, a.d + b.d); }
  static mul(a, b) { return new Dual(a.r * b.r, a.r * b.d + a.d * b.r); }

  static sin(a) { return new Dual(Math.sin(a.r), Math.cos(a.r) * a.d); }
  static exp(a) { const e = Math.exp(a.r); return new Dual(e, e * a.d); }
  static pow(a, n) { return new Dual(a.r ** n, n * a.r ** (n - 1) * a.d); }
}

// Compute f(x) and f'(x) simultaneously
function derivative(f, x) {
  const result = f(new Dual(x, 1));  // dual part = 1 means "differentiate w.r.t. x"
  return { value: result.r, derivative: result.d };
}

// Chain rule applies automatically — every operation propagates derivatives
const f = x => Dual.sin(Dual.pow(x, 2));  // sin(x²)
derivative(f, 1);  // { value: 0.841, derivative: 1.080 } — matches 2 cos(1)

// Autodiff libraries (autograd, JAX, PyTorch) apply the same idea to arbitrary
// code, including loops and conditionals; for training they mostly use reverse mode.

Where the chain rule is essential

Anywhere you need to differentiate a composition. Almost every derivative beyond the elementary ones uses the chain rule.
Implicit differentiation. Curves defined by equations rather than by an explicit y = f(x). The chain rule handles d/dx of expressions involving y.
Related rates. "How fast is the radius growing if the volume increases at 3 m³/s?" You differentiate the volume formula with respect to time, using the chain rule on radius(time).
Backpropagation in neural networks. The gradient of the loss with respect to weights deep in the network is a product of layer-by-layer derivatives. Gradient descent then uses that gradient to update the weights; this is the math under every modern AI training run.
Multivariable optimization. Differentiating f(x(t), y(t)) with respect to t requires the multivariable chain rule. Lagrange multipliers, Hamiltonian dynamics, and many physics calculations use it.
Integration and differential equations. u-substitution in integrals is the chain rule run in reverse, and a change of variables in a differential equation uses the chain rule to rewrite the derivatives.
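
The related-rates question above can be worked through in a few lines. The text does not name the shape, so this sketch assumes a sphere, V = (4/3)πr³, and an example radius of 2 m; both are assumptions, as are the function names.

```javascript
// Sphere volume and its derivative with respect to the radius (assumed shape).
const volume = r => (4 / 3) * Math.PI * r ** 3;
const dVdr = r => 4 * Math.PI * r * r;

const dVdt = 3; // m³/s, the rate quoted in the text

// Chain rule: dV/dt = (dV/dr) · (dr/dt), so dr/dt = (dV/dt) / (dV/dr).
const drdt = r => dVdt / dVdr(r);

const r0 = 2; // metres, an assumed radius
console.log(drdt(r0)); // about 0.0597 m/s, which is 3 / (16π)
```

The radius grows more slowly as the sphere gets larger, because the same added volume spreads over a surface area that scales with r².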

Common mistakes

Forgetting the inner derivative. The most common chain rule bug — writing d/dx[sin(x²)] = cos(x²) instead of cos(x²) · 2x. Always include the derivative of the inner function.
Wrong evaluation point for the outer derivative. f'(g(x)) means f' evaluated AT g(x), not at x. For sin(x²), it's cos(x²), not cos(x).
Misidentifying the inner and outer. In sin(x²), sin is outer. In (sin x)², the squaring is outer. Pay attention to which operation is applied last.
Stopping at the first level for nested compositions. cos(sin(x²)) is three layers; you need three derivative factors. Each layer of nesting adds one factor.
Treating dy/dx as a true fraction in non-trivial cases. For the first-order chain rule, the fraction-like manipulation works. It fails for higher derivatives: d²y/dx² is not (d²y/du²) · (du/dx)². The correct formula, d²y/dx² = (d²y/du²)(du/dx)² + (dy/du)(d²u/dx²), has an extra term that no cancellation of differentials produces.
Forgetting that y is a function of x in implicit differentiation. When you differentiate y² with respect to x, the answer is 2y · (dy/dx), not 2y. Skipping the chain rule factor breaks the calculation.
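
Several of these mistakes show up immediately when a hand-derived formula is compared with a numerical derivative. The sketch below is an illustration only; the helper numericDerivative, the other names, and the test point 1.5 are assumptions.

```javascript
// Central-difference approximation used as a reference value.
const numericDerivative = (fn, x, h = 1e-5) => (fn(x + h) - fn(x - h)) / (2 * h);

// sin(x²): sin is the outer function, squaring is the inner one.
const sinOfSquare = x => Math.sin(x * x);
const dSinOfSquare = x => Math.cos(x * x) * 2 * x;

// (sin x)²: squaring is the outer function, sin is the inner one.
const squareOfSin = x => Math.sin(x) ** 2;
const dSquareOfSin = x => 2 * Math.sin(x) * Math.cos(x);

// The most common mistake: dropping the inner derivative 2x.
const forgotInner = x => Math.cos(x * x);

const x0 = 1.5; // arbitrary test point
console.log(numericDerivative(sinOfSquare, x0), dSinOfSquare(x0), forgotInner(x0));
console.log(numericDerivative(squareOfSin, x0), dSquareOfSin(x0));
```

In the first line of output the correct formula tracks the numerical value while the version without 2x is visibly off.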

Frequently asked questions

Why does the chain rule work?

From the limit definition. Δy/Δx = (Δy/Δu) · (Δu/Δx) when Δu ≠ 0 (multiply and divide by the same Δu). As Δx → 0, Δu → 0 too (g is continuous), so taking limits gives dy/dx = (dy/du)(du/dx). Leibniz notation is not literally a fraction, but the chain rule looks like fraction multiplication because it is the limit of one. The case Δu = 0 needs a short separate argument.

How does the chain rule relate to backpropagation?

Backpropagation is the chain rule applied to neural networks. A network is a deep composition of simple functions; backpropagation walks backward through the composition, multiplying derivatives at each layer and reusing intermediate results, so the gradient with respect to every parameter costs roughly a constant multiple of one forward pass. That gradient computation is the core of every training step in modern deep learning, at scales from millions to billions of parameters.
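
To make "walks backward, multiplying derivatives" concrete, here is a hand-written backward pass for a tiny scalar network. Everything about the network is an assumption made for illustration: the architecture loss = (w2 · tanh(w1 · x) − target)², the function name forwardBackward, and the sample numbers. Real frameworks automate these steps.

```javascript
// Forward pass, then a backward pass that applies the chain rule layer by layer.
function forwardBackward(w1, w2, x, target) {
  // Forward: keep every intermediate value, since the backward pass needs them.
  const z = w1 * x;
  const a = Math.tanh(z);
  const yHat = w2 * a;
  const loss = (yHat - target) ** 2;

  // Backward: start at the loss and multiply local derivatives going inward.
  const dLoss_dyHat = 2 * (yHat - target);
  const dLoss_dw2 = dLoss_dyHat * a;        // yHat = w2 · a, so d(yHat)/d(w2) = a
  const dLoss_da = dLoss_dyHat * w2;        // d(yHat)/d(a) = w2
  const dLoss_dz = dLoss_da * (1 - a * a);  // d(tanh z)/dz = 1 − tanh²(z)
  const dLoss_dw1 = dLoss_dz * x;           // z = w1 · x, so dz/d(w1) = x

  return { loss, dLoss_dw1, dLoss_dw2 };
}

const out = forwardBackward(0.5, -1.2, 2, 0.3);
console.log(out);
```

Note that dLoss_dyHat is computed once and reused for both weights; that reuse is what keeps the backward pass cheap as networks get deeper.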

How do I apply the chain rule to a triple composition?

Apply it twice. d/dx[f(g(h(x)))] = f'(g(h(x))) · d/dx[g(h(x))] = f'(g(h(x))) · g'(h(x)) · h'(x). Each layer of nesting adds one factor — the derivative of that layer evaluated at its input.

What's implicit differentiation?

Differentiating both sides of an equation that doesn't isolate y. For x² + y² = 25, differentiate both sides with respect to x — 2x + 2y(dy/dx) = 0, so dy/dx = −x/y. The chain rule is essential — d/dx(y²) = 2y · dy/dx (treating y as a function of x). Used to find slopes on curves like circles or ellipses where solving for y is messy.

How do I remember which is the "outer" and "inner" function?

The outer function is what's applied last. In f(g(x)), g is applied first to get g(x); then f is applied to that. So f is "outer," g is "inner." Compute outer's derivative at g(x), then multiply by inner's derivative at x. "Outside in, then multiply by inside derivative."

Does the chain rule work for the multivariable case?

Yes. In general D(f ∘ g)(x) = Df(g(x)) · Dg(x), where D denotes the Jacobian matrix and the product is matrix multiplication. For scalar-valued f and vector-valued g, Df is the gradient of f written as a row vector, so the gradient of the composite is the gradient of f at g(x) times the Jacobian of g at x. This is the form that powers neural network gradients.
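
A small example of the matrix form, written with plain arrays. The functions g(x1, x2) = (x1·x2, x1 + x2) and f(u, v) = u² + sin v, along with every name in the code, are assumptions chosen for the illustration.

```javascript
// Inner map g: R² → R² and its 2×2 Jacobian.
const g = ([x1, x2]) => [x1 * x2, x1 + x2];
const jacobianG = ([x1, x2]) => [
  [x2, x1], // partials of x1·x2
  [1, 1],   // partials of x1 + x2
];

// Outer map f: R² → R and its gradient, treated as a row vector.
const f = ([u, v]) => u * u + Math.sin(v);
const gradF = ([u, v]) => [2 * u, Math.cos(v)];

// Chain rule: (row gradient of f at g(x)) times (Jacobian of g at x).
function gradComposite(x) {
  const row = gradF(g(x));
  const J = jacobianG(x);
  return [
    row[0] * J[0][0] + row[1] * J[1][0],
    row[0] * J[0][1] + row[1] * J[1][1],
  ];
}

console.log(gradComposite([0.8, -0.5]));
```

With more layers the same pattern repeats: one more Jacobian joins the product for each additional function in the composition.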

Why is the chain rule so often the hardest derivative concept for students?

Because it requires recognizing composite structure inside an expression. sin(x²) needs to be parsed as f(u) = sin(u) with u = x²; not everyone sees it as a composition at first glance. Once the parsing becomes automatic, the rule is mechanical. Practice on varied examples is the most reliable way to get there.