Calculus
Gradient
The vector of partial derivatives — points uphill, drives optimization
The gradient ∇f of a multivariable function is the vector of partial derivatives — (∂f/∂x, ∂f/∂y, ...). It points in the direction of steepest ascent at each point, with magnitude equal to the steepest rate of increase. Backprop in neural networks computes gradients; gradient descent moves against them; the gradient is the central object of multivariable optimization.
- Definition∇f = (∂f/∂x, ∂f/∂y, ∂f/∂z, ...)
- DirectionSteepest ascent of f
- MagnitudeRate of f's increase in that direction
- ∇f = 0 atCritical points (max, min, or saddle)
- Used inGradient descent, neural network training, optimization, physics fields
- Notation∇f, grad f
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The gradient
For a function f(x, y, z, ...) of several variables, the gradient is the vector of partial derivatives:
∇f = (∂f/∂x, ∂f/∂y, ∂f/∂z, ...)
Each component is a partial derivative — the rate of change of f with respect to one variable, holding the others constant. The vector ∇f at a point gives all this information at once.
Geometric meaning
The gradient at a point has two key properties:
- Direction — points in the direction of steepest ascent of f.
- Magnitude — equal to the rate of increase in that direction (units of f per unit distance).
If you stand at a point and want to move so f increases as fast as possible, walk in the direction of ∇f. Move against ∇f and f decreases as fast as possible (steepest descent). Move perpendicular to ∇f and f stays constant — that's a level curve.
Topographic maps illustrate this — contour lines are level sets of elevation. The gradient at any point on the map points perpendicular to the local contour, uphill, with magnitude equal to the steepness.
Worked examples
Example 1 — gradient of a quadratic
Let f(x, y) = x² + 3y². Compute the gradient.
∂f/∂x = 2x
∂f/∂y = 6y
∇f = (2x, 6y)
At (1, 2), ∇f = (2, 12). The function increases fastest in the direction (2, 12), with magnitude √(4 + 144) ≈ 12.17 per unit distance. The y-coefficient (6 in 6y) is bigger than x's (2 in 2x), so f is steeper in y — and the gradient at (1, 2) is dominated by the y-component.
Example 2 — find critical points
Find critical points of f(x, y) = x² + 2xy + 3y² + 4x + 5.
∂f/∂x = 2x + 2y + 4
∂f/∂y = 2x + 6y
Set both to zero:
2x + 2y + 4 = 0 ⟹ x + y = −2
2x + 6y = 0 ⟹ x = −3y
Substitute — −3y + y = −2 — y = 1, x = −3. Critical point at (−3, 1).
To classify it as max/min/saddle, look at the Hessian (matrix of second partials):
H = [2 2]
[2 6]
Eigenvalues are both positive (det = 8 > 0, trace = 8 > 0), so the Hessian is positive definite — the critical point is a local minimum.
The directional derivative
The directional derivative of f in unit-direction u is:
D_u f = ∇f · u
The dot product of the gradient with the direction. By Cauchy-Schwarz, |D_u f| ≤ |∇f| · |u| = |∇f|, with equality when u points along ∇f. So the gradient direction is the steepest direction.
Special cases:
- u parallel to ∇f — D_u f = |∇f| (steepest ascent rate).
- u opposite to ∇f — D_u f = −|∇f| (steepest descent).
- u perpendicular to ∇f — D_u f = 0 (along level curve, no change).
Gradient descent — the workhorse algorithm
To minimize a function f, follow the negative gradient:
x_{k+1} = x_k − η · ∇f(x_k)
where η is the learning rate (step size). Repeat until convergence (∇f near zero or x stops changing significantly).
This is the foundation of:
- Neural network training (backpropagation computes ∇f efficiently).
- Logistic regression and most machine-learning model fitting.
- Numerical optimization in physics, engineering, finance.
- Reinforcement learning policy gradients.
Variants — momentum (incorporates previous gradient direction), adaptive (Adam, RMSprop adjust step size per dimension), stochastic (uses random subset for noisy but cheaper gradient estimates).
JavaScript — gradient computation and gradient descent
// Numerical gradient via central differences
function gradient(f, x, h = 1e-6) {
return x.map((_, i) => {
const xPlus = [...x]; xPlus[i] += h;
const xMinus = [...x]; xMinus[i] -= h;
return (f(xPlus) - f(xMinus)) / (2 * h);
});
}
// Gradient descent — minimize f starting from x₀
function gradientDescent(f, x0, lr = 0.01, iters = 1000) {
let x = [...x0];
for (let i = 0; i < iters; i++) {
const grad = gradient(f, x);
x = x.map((xi, j) => xi - lr * grad[j]);
}
return x;
}
// Minimize f(x, y) = (x - 3)² + (y - 4)²
const f = ([x, y]) => (x - 3)**2 + (y - 4)**2;
const result = gradientDescent(f, [0, 0]);
console.log(result); // ≈ [3, 4] — the unique minimum
Gradient as a vector field
For each point in space, ∇f gives a vector. The collection of these vectors is a vector field — a function from points to vectors. Gradient fields have special properties:
- They're "irrotational" — curl(∇f) = 0 (when f is twice continuously differentiable).
- The line integral around any closed loop is zero — gradient fields are conservative.
- Any irrotational field on a simply-connected region IS the gradient of some function (Poincaré lemma).
Physics example — gravitational and electrostatic potentials are scalar fields whose gradients give force fields. The conservation of energy comes from the gradient structure.
Where the gradient appears
- Machine learning. Optimization of loss functions in neural networks via gradient descent. Backpropagation is the chain-rule machinery for computing gradients efficiently.
- Physics — fields. Force = −∇(potential energy). Electric field = −∇(electric potential). Heat flux = −k·∇(temperature) (Fourier's law). Gradients describe how scalar fields drive vector flows.
- Computer graphics. Image gradients used in edge detection, normal computation, lighting. Sobel and Scharr operators approximate ∇ for discrete pixel grids.
- Optimization in operations research. Linear and nonlinear programming use gradients to find extrema subject to constraints.
- Statistics. Maximum likelihood estimation finds parameters where the gradient of log-likelihood is zero. Newton's method uses both gradient and Hessian.
- Computer vision. Optical flow, image registration, structure from motion all involve solving systems where gradients are key.
Common mistakes
- Confusing the gradient with a derivative. The gradient is a vector. Derivative in a single variable is a scalar. ∇f and df/dx mean different things in 1D — though they coincide when written as ∇f = (df/dx).
- Wrong sign in gradient descent. Subtract the gradient (move against it) to minimize. Adding moves toward maximum. Easy to flip; results in the algorithm finding the wrong extremum.
- Not normalizing the direction in directional derivative. D_u f = ∇f · u requires u to be a unit vector. Without normalization, you get the rate scaled by the length of u.
- Forgetting that gradient is local. ∇f at one point doesn't tell you about other points. Following the gradient locally can diverge if step size is too big or the function is non-convex.
- Confusing partial and total derivatives. Partial — change in one variable, others fixed. Total — change considering all dependencies. The chain rule connects them; mixing them up gives wrong answers.
- Numerical gradient with too-small h. Floating-point error dominates when h is below ~10⁻⁸. Central difference (with h ~ 10⁻⁶) is more stable than forward difference. For production, use autodiff libraries (JAX, PyTorch) for exact gradients.
Frequently asked questions
Why does the gradient point in the direction of steepest ascent?
The directional derivative of f in direction u (unit vector) is ∇f · u. By Cauchy-Schwarz, this dot product is maximized when u is aligned with ∇f, with maximum value |∇f|. So among all unit-direction movements, the gradient direction gives the largest rate of increase. Hence "steepest ascent."
What's the difference between gradient and partial derivatives?
Partial derivatives are scalar — ∂f/∂x at a point is one number, the rate of change in the x direction. The gradient is a vector — bundles all partial derivatives into a single object. The gradient encodes complete first-order behavior; individual partials are just its components.
How is the gradient used in machine learning?
Gradient descent. Compute ∇L (gradient of the loss function w.r.t. model parameters); move parameters against the gradient by a small step. Repeat until convergence. Backpropagation is the chain-rule machinery that efficiently computes these gradients in deep networks. Modern AI training is mostly gradient computation at massive scale.
What's a directional derivative?
The rate of change of f when moving in a specific direction u (unit vector). Computed as D_u f = ∇f · u (dot product). The gradient is the special case where you take the derivative in EVERY direction simultaneously and bundle the answers; the directional derivative picks out the rate in one specific direction.
How does the gradient relate to level curves and surfaces?
The gradient is always perpendicular to level curves (in 2D) or level surfaces (in 3D). Imagine a topographic map — the gradient points "straight uphill," perpendicular to the contour lines. This is why steepest descent in optimization moves perpendicular to the level set of the loss function.
What's a saddle point?
A critical point (∇f = 0) that's neither a max nor a min — it goes up in some directions and down in others. The Hessian (matrix of second partials) has both positive and negative eigenvalues. Saddle points are abundant in high-dimensional optimization (most critical points in deep networks are saddles, not local minima).
How does the gradient generalize to vector-valued functions?
For F : ℝⁿ → ℝᵐ, the analog is the Jacobian matrix — m × n matrix of all partial derivatives. Each row is the gradient of one component function. The Jacobian is the matrix of the best linear approximation to F at a point. For scalar f (m = 1), the Jacobian is a row vector — the gradient transposed.