Calculus
Directional Derivative
The slope in any direction — the gradient is the special vector that bundles them all
The directional derivative D_v̂(f) of a multivariable function f at a point gives the rate of change of f as you move in the direction of the unit vector v̂. It equals the dot product of the gradient with the direction: D_v̂(f) = ∇f · v̂. The gradient points in the direction of steepest ascent (by Cauchy–Schwarz); in directions perpendicular to it, f does not change to first order. Directional derivatives describe slopes in any direction, not just along the coordinate axes.
- Notation: D_v(f), D_v̂(f), ∇_v f, ∂f/∂v
- Formula: D_v̂(f) = ∇f · v̂
- Limit definition: lim_{h → 0} [f(p + h v̂) − f(p)] / h
- Maximum value: |∇f| (when v̂ is along ∇f)
- Zero when: v̂ ⊥ ∇f (along level curves)
- Used in: Gradient descent, optimisation, Hessian, optical flow
The directional derivative as a limit
Given a function f : ℝⁿ → ℝ, a point p, and a unit vector v̂, the directional derivative is:
D_v̂(f)(p) = lim_{h → 0} [f(p + h v̂) − f(p)] / h
This is exactly the single-variable derivative, but with the increment h v̂ aimed in a chosen direction rather than along the x-axis. As h → 0, the chord from p to p + h v̂ tilts toward the tangent line of f along the v̂ direction; its slope converges to the directional derivative.
For a unit vector v̂ aligned with the x-axis, D_x̂(f) reduces to ∂f/∂x — the partial derivative. Directional derivatives generalise partials to arbitrary directions.
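The limit can be watched converging numerically. The sketch below is illustrative only: the function f(x, y) = x² + 3y², the point (1, 2), the direction (3, 4) and the helper name `directional_derivative_fd` are all choices made for this example, and a forward difference is just one of several ways to approximate the limit.

```python
import math

def f(x, y):
    # Illustrative function (an assumption for this sketch): f(x, y) = x^2 + 3y^2
    return x**2 + 3 * y**2

def directional_derivative_fd(func, p, v, h):
    """Forward-difference quotient [f(p + h v_hat) - f(p)] / h."""
    norm = math.sqrt(sum(c * c for c in v))
    v_hat = [c / norm for c in v]                      # normalise the direction
    shifted = [pi + h * ci for pi, ci in zip(p, v_hat)]
    return (func(*shifted) - func(*p)) / h

p = (1.0, 2.0)
v = (3.0, 4.0)

# Shrinking h: the quotient settles toward the directional derivative.
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    print(h, directional_derivative_fd(f, p, v, h))

# Along the x-axis the same quotient approximates the partial df/dx.
print(directional_derivative_fd(f, p, (1.0, 0.0), 1e-6))
```

For this particular f the quotient works out to 10.8 + 2.28h, so the printed values approach 10.8 as h shrinks, and the x-axis case approaches ∂f/∂x = 2.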
Proving D_v̂(f) = ∇f · v̂
Define a function of one real variable t:
φ(t) = f(p + t v̂)
This is the restriction of f to the line through p in the direction v̂. By the chain rule for multivariable functions:
φ'(t) = (d/dt) f(p + t v̂) = ∇f(p + t v̂) · v̂
By construction φ'(0) = D_v̂(f)(p): the difference quotient [φ(h) − φ(0)] / h is exactly the quotient in the limit definition. Setting t = 0:
D_v̂(f)(p) = ∇f(p) · v̂
This is the central identity: the directional derivative in direction v̂ is the dot product of the gradient with v̂. Computing the gradient once gives all directional derivatives at that point — every direction's slope is a single dot product away.
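As a sanity check on the identity, the sketch below compares ∇f(p) · v̂ with a central-difference estimate of φ'(0) for a few directions. The function f(x, y) = x² + 3y², its hand-derived gradient, and the helper names `dir_deriv` and `phi_prime_at_zero` are assumptions made for this illustration.

```python
import math

def f(x, y):
    # Illustrative function (an assumption for this sketch)
    return x**2 + 3 * y**2

def grad_f(x, y):
    # Hand-derived gradient of the function above
    return (2 * x, 6 * y)

def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0:
        raise ValueError("direction must be non-zero")
    return tuple(c / n for c in v)

def dir_deriv(grad, p, v):
    """Directional derivative via the identity grad f(p) . v_hat."""
    g = grad(*p)
    v_hat = unit(v)
    return sum(gi * vi for gi, vi in zip(g, v_hat))

def phi_prime_at_zero(func, p, v, h=1e-6):
    """Central-difference estimate of phi'(0), where phi(t) = f(p + t v_hat)."""
    v_hat = unit(v)
    fwd = func(*(pi + h * vi for pi, vi in zip(p, v_hat)))
    bwd = func(*(pi - h * vi for pi, vi in zip(p, v_hat)))
    return (fwd - bwd) / (2 * h)

p = (1.0, 2.0)
directions = [(1, 0), (0, 1), (3, 4), (-1, 2)]
for v in directions:
    print(v, dir_deriv(grad_f, p, v), phi_prime_at_zero(f, p, v))
```

The gradient is evaluated once per point; each additional direction costs only one dot product, whereas the finite-difference route needs two fresh function evaluations per direction.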
Worked example — slope on an elliptic paraboloid
Let f(x, y) = x² + 3y². Compute the directional derivative at p = (1, 2) in direction v = (3, 4).
Step 1 — gradient.
∇f = (2x, 6y) ⟹ ∇f(1, 2) = (2, 12)
Step 2 — normalise direction.
|v| = √(9 + 16) = 5
v̂ = (3/5, 4/5)
Step 3 — dot product.
D_v̂(f)(1, 2) = (2, 12) · (3/5, 4/5)
= 6/5 + 48/5 = 54/5 = 10.8
The function increases at rate 10.8 per unit distance moved in direction (3, 4). The maximum possible rate (in the gradient direction (2, 12)) is |∇f| = √(4 + 144) ≈ 12.17. Our chosen direction (3, 4) achieves 10.8 of that, about 89%: the cosine of the angle θ between (3, 4) and (2, 12) is 10.8 / 12.17 ≈ 0.89, and D_v̂(f) = |∇f| cos θ.
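The three steps translate directly into a few lines of arithmetic. This is a minimal sketch that hard-codes the gradient value (2, 12) computed in Step 1; the variable names are illustrative.

```python
import math

grad = (2.0, 12.0)   # grad f(1, 2) for f(x, y) = x^2 + 3y^2, from Step 1
v = (3.0, 4.0)       # chosen direction, not yet normalised

v_norm = math.hypot(*v)                           # Step 2: |v|
v_hat = tuple(c / v_norm for c in v)              # unit direction
rate = sum(g * c for g, c in zip(grad, v_hat))    # Step 3: dot product

grad_norm = math.hypot(*grad)                     # maximum possible rate |grad f|
cos_theta = rate / grad_norm                      # fraction of the maximum achieved

print(f"rate       = {rate:.4f}")
print(f"|grad f|   = {grad_norm:.4f}")
print(f"cos(theta) = {cos_theta:.4f}")
```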
Steepest ascent — proof via Cauchy–Schwarz
Question: among all unit directions v̂, which gives the largest D_v̂(f)?
The Cauchy–Schwarz inequality says, for any vectors a and b,
|a · b| ≤ |a| · |b|
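with equality if and only if a and b are parallel, that is, b is a scalar multiple of a (or a = 0). Without the absolute value, a · b = |a| · |b| holds exactly when b is a non-negative multiple of a. Applying this with a = ∇f(p) and b = v̂ (so |v̂| = 1):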
|D_v̂(f)| = |∇f · v̂| ≤ |∇f|
The inequality is sharp: equality holds when v̂ = ∇f / |∇f|. So the gradient direction maximises the directional derivative, with maximum value |∇f|. The opposite direction v̂ = −∇f / |∇f| gives the minimum, D_v̂(f) = −|∇f| — steepest descent.
Three special cases worth memorising:
- v̂ aligned with ∇f. D_v̂(f) = +|∇f| (steepest ascent).
- v̂ opposite to ∇f. D_v̂(f) = −|∇f| (steepest descent).
- v̂ perpendicular to ∇f. D_v̂(f) = 0 (along a level curve, no change to first order).
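A brute-force sweep over unit directions shows the three cases numerically. The sketch assumes the gradient (2, 12) from the earlier worked example and samples directions every 0.1 degree; the sampling resolution and variable names are arbitrary choices for illustration.

```python
import math

grad = (2.0, 12.0)  # grad f(1, 2) from the earlier worked example

def rate_at_angle(theta):
    """Directional derivative for the unit vector (cos theta, sin theta)."""
    return grad[0] * math.cos(theta) + grad[1] * math.sin(theta)

n = 3600  # one sample every 0.1 degree
angles = [2 * math.pi * k / n for k in range(n)]
best = max(angles, key=rate_at_angle)    # direction with the largest rate
worst = min(angles, key=rate_at_angle)   # direction with the most negative rate

grad_angle = math.atan2(grad[1], grad[0])
grad_norm = math.hypot(*grad)

print("best angle (deg):    ", math.degrees(best))
print("gradient angle (deg):", math.degrees(grad_angle))
print("max rate:", rate_at_angle(best), "vs |grad f|:", grad_norm)
print("min rate:", rate_at_angle(worst))
print("rate perpendicular to grad:", rate_at_angle(grad_angle + math.pi / 2))
```

Up to the sampling resolution, the best angle coincides with the gradient's angle, the extreme rates are ±|∇f|, and the rate a quarter-turn away from the gradient is zero.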
Why level curves are perpendicular to the gradient
A level curve (or surface, in 3-D) is the set { x : f(x) = c } for some constant c. If you walk along this curve, f stays constant by definition — its derivative along the curve is zero. So if v̂ is the unit tangent to the level curve at p,
0 = D_v̂(f)(p) = ∇f(p) · v̂
which says ∇f(p) ⊥ v̂. The gradient is perpendicular to every level curve. Topographic maps illustrate this beautifully — the gradient at any point on the map points perpendicular to the local contour, "straight uphill."
In 3-D, the gradient is perpendicular to the level surface, so ∇F(p) is a normal vector to the surface F(x, y, z) = c at p. This is why the equation of the tangent plane to F = c at p is ∇F(p) · (x − p) = 0. The construction needs ∇F(p) ≠ 0; at points where ∇F = 0 (critical points of F), the level set may have no well-defined tangent plane.
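The perpendicularity can be checked on a concrete level curve. The sketch below uses an assumed example, f(x, y) = x² + 3y², whose level curve through (1, 2) is the ellipse x² + 3y² = 13; the parametrisation and names are chosen for this illustration.

```python
import math

c = 13.0                                  # level value: f(1, 2) = 1 + 12 = 13
a, b = math.sqrt(c), math.sqrt(c / 3)     # semi-axes of the ellipse x^2 + 3y^2 = c

def curve(t):
    """Parametrisation of the level curve: f(curve(t)) = c for every t."""
    return (a * math.cos(t), b * math.sin(t))

def tangent(t):
    """Derivative of curve(t): a tangent vector to the level curve."""
    return (-a * math.sin(t), b * math.cos(t))

def grad_f(x, y):
    return (2 * x, 6 * y)

t0 = math.atan2(2.0 / b, 1.0 / a)         # parameter value where curve(t0) = (1, 2)
x0, y0 = curve(t0)
g = grad_f(x0, y0)
tau = tangent(t0)
dot = g[0] * tau[0] + g[1] * tau[1]

print("point on curve:", (x0, y0))
print("grad . tangent:", dot)             # zero up to rounding error
```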
Directional derivative vs gradient — what each gives you
| | Directional derivative D_v̂(f) | Gradient ∇f |
|---|---|---|
| Type | Scalar (a number) | Vector |
| Inputs | Function f, point p, direction v̂ | Function f, point p |
| Captures | Slope in one direction | All directional info bundled |
| Max value | \|∇f\| (when v̂ along ∇f) | n/a |
| Relation to partials ∂f/∂x_i | Each partial is the special case where v̂ is a basis vector | The partials are its components: ∇f = (∂f/∂x_1, …, ∂f/∂x_n) |
| Geometric meaning | Slope on a vertical slice | Steepest-ascent direction with magnitude |
| Used for | Slope in any user-chosen direction | Steepest descent in optimisation |
| Sign reflects | Whether f rises or falls in direction v̂ | n/a |
The gradient is "the directional derivative in every direction at once" — it stores all D_v̂(f) values implicitly. Compute ∇f once, and any directional derivative is a dot product away.
Worked example — three dimensions
Let f(x, y, z) = x²y + yz + cos(z). Compute the directional derivative at (1, 2, 0) in direction (1, 1, 1).
∇f = (2xy, x² + z, y − sin z)
∇f(1, 2, 0) = (4, 1, 2)
|v| = √3, so v̂ = (1, 1, 1) / √3
D_v̂(f) = (4, 1, 2) · (1, 1, 1) / √3 = 7 / √3 ≈ 4.04
Same procedure as the 2-D case — three components instead of two.
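The same calculation in code, cross-checked against a central-difference estimate. This is a sketch: the step size h and the variable names are illustrative choices, and the gradient is the hand-derived one from above.

```python
import math

def f(x, y, z):
    return x**2 * y + y * z + math.cos(z)

def grad_f(x, y, z):
    # Hand-derived gradient, as in the text
    return (2 * x * y, x**2 + z, y - math.sin(z))

p = (1.0, 2.0, 0.0)
v = (1.0, 1.0, 1.0)
norm = math.sqrt(sum(c * c for c in v))
v_hat = tuple(c / norm for c in v)

# Gradient route: one dot product.
exact = sum(g * c for g, c in zip(grad_f(*p), v_hat))

# Numerical route: central difference along v_hat.
h = 1e-5
fwd = f(*(pi + h * ci for pi, ci in zip(p, v_hat)))
bwd = f(*(pi - h * ci for pi, ci in zip(p, v_hat)))
numeric = (fwd - bwd) / (2 * h)

print("grad . v_hat:      ", exact)
print("central difference:", numeric)
```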
Connection to gradient descent
The most influential application of the directional derivative is in optimisation. To minimise a function f, repeatedly take a step in the direction of steepest descent:
x_{k+1} = x_k − η · ∇f(x_k)
The directional derivative in direction −∇f / |∇f| equals −|∇f|, the most negative value possible. To first order, no other direction decreases f as fast. This is why gradient descent is the canonical first-order optimisation algorithm: among all directions, it commits to the one of steepest descent.
Variants modify the choice of direction. Conjugate gradient combines the current gradient with the previous search direction so that, for a quadratic objective, successive directions are conjugate with respect to the Hessian. Newton's method moves in direction −H⁻¹ ∇f rather than −∇f, with H the Hessian. Quasi-Newton methods (BFGS, L-BFGS) approximate H⁻¹ from gradient history. All are directional-derivative arguments dressed up.
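A minimal sketch of the plain update rule x_{k+1} = x_k − η ∇f(x_k), not any of the variants. The objective f(x, y) = x² + 3y², the starting point, the step size η = 0.1 and the iteration count are all assumptions chosen so that this particular example converges.

```python
def grad_f(x, y):
    # Gradient of the illustrative objective f(x, y) = x^2 + 3y^2
    return (2 * x, 6 * y)

def gradient_descent(grad, x0, eta=0.1, steps=100):
    """Repeat x <- x - eta * grad(x); note the minus sign (move against the gradient)."""
    x = tuple(x0)
    for _ in range(steps):
        g = grad(*x)
        x = tuple(xi - eta * gi for xi, gi in zip(x, g))
    return x

x_min = gradient_descent(grad_f, (1.0, 2.0))
print(x_min)   # close to the minimiser (0, 0)
```

For this objective each step multiplies x by (1 − 2η) and y by (1 − 6η), so the iteration only converges when η < 1/3; that threshold is specific to this function and not a general rule.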
Where directional derivatives appear
- Optimisation. Gradient descent and its variants — every step is a directional-derivative argument.
- Vector calculus and physics. The component of the force along v̂ on a particle in a potential U is −D_v̂(U). Heat flux at a point is −k ∇T (conductivity times the negative temperature gradient); the rate of heat flow per unit area in direction v̂ is −k D_v̂(T).
- Computer vision. Image gradients (Sobel, Scharr operators) compute partials with respect to pixel x and y; edge detection looks for directions of large directional derivative magnitude. Optical flow estimates how images move by solving for directions with consistent directional derivatives.
- Differential geometry. Lie derivatives, covariant derivatives, geodesic equations all build on directional derivatives along tangent vectors. The Hessian-vector product H v is the directional derivative of ∇f along v.
- Economics. The marginal effect of changing many inputs at once (a policy package) is a directional derivative in the "policy direction" through input space.
- Machine learning. Forward-mode autodiff (Jacobian-vector products) computes directional derivatives of network outputs with respect to inputs. Used for sensitivity analysis, second-order training methods, and adversarial robustness.
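To make the forward-mode idea concrete, here is a toy dual-number implementation. It is a teaching sketch, not how any production autodiff library is built: the `Dual` class, the `cos` wrapper and the `jvp` helper are hypothetical names written for this example, and only addition, multiplication and cosine are supported.

```python
import math

class Dual:
    """Number val + dot*eps with eps^2 = 0; 'dot' carries the derivative part."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule on the derivative part
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def cos(x):
    x = Dual.lift(x)
    return Dual(math.cos(x.val), -math.sin(x.val) * x.dot)

def f(x, y, z):
    # Same function as the 3-D worked example: x^2 y + y z + cos z
    return x * x * y + y * z + cos(z)

def jvp(func, p, v):
    """Evaluate func at p with tangent v; returns (f(p), grad f(p) . v)."""
    out = func(*(Dual(pi, vi) for pi, vi in zip(p, v)))
    return out.val, out.dot

s = 1 / math.sqrt(3)                       # unit direction (1, 1, 1) / sqrt(3)
value, rate = jvp(f, (1.0, 2.0, 0.0), (s, s, s))
print(value, rate)
```

One forward pass yields the function value and the directional derivative together, without ever forming the full gradient; the rate agrees with the 7/√3 obtained in the 3-D worked example.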
Common mistakes
- Forgetting to normalise v. The formula D_v̂(f) = ∇f · v̂ requires the unit vector. If v has length 2, ∇f · v gives twice the per-unit-distance rate. Always divide by |v| first, or be explicit about which convention you are using. This catches even careful students.
- Using the formula when f is not differentiable. The identity D_v̂(f) = ∇f · v̂ requires f to be differentiable at p. If f only has partials (a weaker condition), the directional derivative may exist for some directions but not equal the dot product.
- Confusing the directional derivative with a partial derivative. The partial ∂f/∂x is the directional derivative in the direction (1, 0, 0, ...) — a special case. General directional derivatives sample any direction.
- Sign error in steepest descent. Subtract the gradient (move against it) to minimise. Adding the gradient maximises — the most common bug in hand-rolled optimisation code.
- Forgetting that "perpendicular to ∇f gives zero" is first-order. If you walk along the level curve itself, f stays constant. If you walk in a straight line perpendicular to the gradient, you generally leave the (curved) level set, and over a finite distance f does change, just slowly (at second order in the distance). The instantaneous rate is zero, not the cumulative change.
- Mismatching dimensions. If f : ℝ³ → ℝ, then ∇f and v̂ are 3-vectors; if f : ℝ² → ℝ, both are 2-vectors. Dot products only work with matching dimensions.
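The differentiability caveat in the list above is easy to see on a standard counterexample. The function below is an assumption chosen for illustration (it does not appear elsewhere in this article): f(x, y) = x²y / (x² + y²) with f(0, 0) = 0. Every directional derivative exists at the origin, yet the dot-product formula gives the wrong answer there.

```python
import math

def f(x, y):
    # All directional derivatives exist at the origin,
    # but f is not differentiable there.
    if x == 0 and y == 0:
        return 0.0
    return x**2 * y / (x**2 + y**2)

def limit_quotient(v_hat, h=1e-6):
    """Difference quotient [f(0 + h v_hat) - f(0)] / h at the origin."""
    return (f(h * v_hat[0], h * v_hat[1]) - f(0.0, 0.0)) / h

# f vanishes on both coordinate axes, so both partials at the origin are 0
# and the dot-product formula would predict 0 in every direction.
grad_at_origin = (0.0, 0.0)

for deg in (0, 30, 45, 90):
    t = math.radians(deg)
    v_hat = (math.cos(t), math.sin(t))
    formula = grad_at_origin[0] * v_hat[0] + grad_at_origin[1] * v_hat[1]
    print(deg, limit_quotient(v_hat), formula)
```

Along a unit direction (a, b) the quotient equals a²b for every h, so the limit definition gives a²b while ∇f · v̂ gives 0. The two agree on the coordinate axes and disagree in between.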
Frequently asked questions
Why does D_v̂(f) = ∇f · v̂?
Define φ(t) = f(p + t v̂), the value of f along the ray from p in direction v̂. Compute φ'(0) by the multivariable chain rule: φ'(t) = ∇f(p + t v̂) · v̂. At t = 0 this is ∇f(p) · v̂. By definition D_v̂(f) = φ'(0), so D_v̂(f) = ∇f · v̂. The proof requires only the chain rule and the differentiability of f.
Why must v be a unit vector?
Because we want the rate of change per unit distance, not per unit of some arbitrary vector. If v has length 2, then ∇f · v gives twice the rate per unit distance — you have walked twice as far per unit of t. Normalising v to v̂ = v/|v| restores the per-distance interpretation. Some textbooks distinguish "directional derivative in direction v" (with v not normalised, giving |v| times the unit-direction rate) from the "unit directional derivative." The standard convention is to normalise.
Why is the gradient the direction of steepest ascent?
From D_v̂(f) = ∇f · v̂ and Cauchy–Schwarz, |D_v̂(f)| ≤ |∇f| |v̂| = |∇f|. The upper bound D_v̂(f) = +|∇f| is attained when v̂ is a positive scalar multiple of ∇f, i.e., v̂ = ∇f / |∇f| (assuming ∇f ≠ 0). Among all unit directions, the one aligned with ∇f gives the largest possible rate of increase of f. Likewise, the opposite direction gives the most negative rate, −|∇f|: steepest descent.
What is the directional derivative perpendicular to ∇f?
Zero. The dot product ∇f · v̂ is zero when v̂ ⊥ ∇f. Geometrically, these are the directions tangent to the level set: if you walk perpendicular to the gradient, f does not change to first order. This is why level curves (in 2-D) and level surfaces (in 3-D) are perpendicular to the gradient. The contour lines of a topographic map are level curves of elevation; the gradient at any point is perpendicular to the local contour, pointing uphill.
What is the relationship between directional derivative and partial derivative?
Partial derivatives are the special case where v̂ is a unit basis vector. ∂f/∂x = D_x̂(f) where x̂ = (1, 0, 0); ∂f/∂y = D_ŷ(f); etc. The general directional derivative samples the slope along any unit direction; partials sample only the coordinate axes. The bundle of all partials forms ∇f, which together with v̂ gives D_v̂(f) for any direction.
How is the directional derivative defined formally as a limit?
The fundamental limit definition is D_v̂(f)(p) = lim_{h → 0} [f(p + h v̂) − f(p)] / h. This mirrors the single-variable derivative, but the increment is a step in direction v̂ rather than along the x-axis. When f is differentiable at p, this limit exists for every direction and equals ∇f(p) · v̂. When f is not differentiable at p, all directional derivatives may still exist, but the formula ∇f(p) · v̂ can then fail for directions off the coordinate axes.
How are directional derivatives used in machine learning?
Gradient descent moves against the gradient, the direction of steepest descent. Each step points along the unit direction −∇f / |∇f|, where the directional derivative is −|∇f| (the most negative possible). Conjugate gradient methods use directional information from previous iterations to choose better directions than the immediate gradient, often converging in fewer steps. Hessian-vector products in second-order methods are directional derivatives of the gradient. Modern automatic differentiation libraries (JAX's jax.jvp, PyTorch's torch.autograd.functional.jvp) expose directional derivatives directly.
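The statement that a Hessian-vector product is a directional derivative of the gradient can be illustrated without any autodiff library. The sketch below differences the gradient along v; the objective f(x, y) = x² + 3y², the helper name `hvp_fd` and the step size are assumptions for this example, and real libraries compute the same quantity exactly rather than by finite differences. Following the usual convention for H v, the vector v is not normalised here.

```python
def grad_f(x, y):
    # Gradient of the illustrative objective f(x, y) = x^2 + 3y^2;
    # its Hessian is the constant matrix diag(2, 6).
    return (2 * x, 6 * y)

def hvp_fd(grad, p, v, h=1e-5):
    """Central-difference estimate of H(p) v: the derivative of grad along v."""
    g_fwd = grad(*(pi + h * vi for pi, vi in zip(p, v)))
    g_bwd = grad(*(pi - h * vi for pi, vi in zip(p, v)))
    return tuple((a - b) / (2 * h) for a, b in zip(g_fwd, g_bwd))

hv = hvp_fd(grad_f, (1.0, 2.0), (3.0, 4.0))
print(hv)   # approximately diag(2, 6) applied to (3, 4)
```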