Prerequisites
- Chapter 29: Functions of Several Variables
Learning Objectives
- Apply the multivariable chain rule to compositions of functions using tree diagrams.
- Compute the gradient $\nabla f$ of a function of two or three variables.
- Compute directional derivatives and interpret them geometrically as $\nabla f \cdot \mathbf{u}$.
- Explain why the gradient points in the direction of steepest ascent and is normal to level sets.
- Find tangent planes and normal lines to surfaces.
- Implement gradient descent for numerical optimization and connect it to machine learning.
In This Chapter
- 30.1 The Conceptual Peak
- 30.2 The Multivariable Chain Rule
- 30.3 The Gradient Vector
- 30.4 The Three Superpowers of the Gradient
- 30.5 The Directional Derivative — and the Proof of the Three Superpowers
- 30.6 Tangent Planes and Normal Lines
- 30.7 Linear Approximation in Several Variables
- 30.8 Gradient Descent — The Master Algorithm
- 30.9 Gradient Descent and Machine Learning
- 30.10 The Gradient Across the Sciences
- 30.11 Computing Gradients in Python
- 30.12 Looking Forward: Constrained Optimization
- 30.13 Reflection: Why This Is the Peak
Chapter 30 — Multivariable Chain Rule and Gradient
30.1 The Conceptual Peak
Back in Chapter 1 we promised you a summit. This is it.
Calculus began with a single number — the derivative $f'(x)$, the slope of a tangent line, a measure of how fast one quantity changes as another changes. For twenty-nine chapters that number has done extraordinary work: it found maxima, drove related-rates problems, seeded Taylor series, and (through the Fundamental Theorem of Calculus, Chapter 14) turned out to be the inverse of integration. But $f'(x)$ has one quiet limitation. It describes change along a line. The world is not a line.
When a function depends on several variables — temperature across a room, elevation across a mountain, the loss of a neural network across millions of parameters — "the rate of change" is no longer a single number. You can move in infinitely many directions from a point, and $f$ changes at a different rate in each one. The object that captures all of those rates at once, and organizes them into a single geometric arrow, is the gradient $\nabla f$. It is the true multivariable derivative, and learning to read it is the conceptual peak of single- and multivariable calculus alike.
The Key Insight. In one variable, $f'$ is a single signed number: a rate, with "up" or "down" as the only choice of direction. In several variables, $\nabla f$ is a vector: it has both a magnitude (how fast $f$ is changing) and a direction (the direction of fastest increase). That single upgrade, from number to direction-carrying vector, is what lets calculus optimize anything, from a spacecraft trajectory to a billion-parameter language model.
This chapter has three movements. First the multivariable chain rule, the bookkeeping that lets change propagate through layered dependencies. Then the gradient itself, with its three geometric superpowers: steepest ascent, perpendicularity to level curves, and normality to level surfaces, all three proved in one stroke by the directional derivative. Finally gradient descent, the algorithm that, by stepping in the direction $-\nabla f$ over and over, trains nearly every modern neural network. We have been tracking that algorithm since Chapter 6, when the single-variable derivative first told us "which way to step." Here it reaches its climax.
30.2 The Multivariable Chain Rule
Recall the single-variable chain rule from Chapter 7: if $y = f(g(x))$, then $\frac{dy}{dx} = f'(g(x))\,g'(x)$. The derivative of a composition is the product of derivatives — one factor per link in the chain. In several variables nothing is destroyed; something is added. When a quantity can be reached through more than one intermediate variable, you take the product along each path and then sum the paths.
Case 1: One independent variable, $z = f(x, y)$ with $x = x(t),\ y = y(t)$
Here both $x$ and $y$ ride along a single parameter $t$ — think of a bug walking along a path $(x(t), y(t))$ across a temperature field $f$. The bug's temperature changes for two reasons at once: it is moving in $x$ and moving in $y$. The total rate is the sum of the two contributions:
$$\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}.$$
Each term is "how sensitive $f$ is to that variable" times "how fast that variable is moving." We sum because the influences accumulate.
Worked example (graduated, step 1 — routine). Let $z = x^2 y$ with $x = \cos t$, $y = \sin t$, so the bug walks the unit circle. Compute the pieces:
$$\frac{\partial f}{\partial x} = 2xy, \qquad \frac{\partial f}{\partial y} = x^2, \qquad \frac{dx}{dt} = -\sin t, \qquad \frac{dy}{dt} = \cos t.$$
Assemble:
$$\frac{dz}{dt} = 2xy(-\sin t) + x^2(\cos t) = -2\cos t\,\sin^2 t + \cos^3 t.$$
Sanity check by direct substitution. Substitute the path first: $z = \cos^2 t\,\sin t$. Differentiate with the product rule: $\frac{dz}{dt} = -2\cos t\,\sin t\cdot \sin t + \cos^2 t\cdot \cos t = -2\cos t\,\sin^2 t + \cos^3 t$. Identical. (Symbolically both simplify to $(1 - 3\sin^2 t)\cos t$.) The chain rule and brute substitution must always agree — the chain rule is just the organized way to get there, and it scales to problems where substituting first is hopeless.
Common Pitfall. Many students write $\frac{dz}{dt} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y}$, dropping the inner derivatives $\frac{dx}{dt}$ and $\frac{dy}{dt}$. Every path must carry both factors: the sensitivity of $f$ to the intermediate variable, and the speed of that intermediate variable. Forgetting the inner factor is the multivariable cousin of forgetting the chain-rule factor in Chapter 7 — and it is just as fatal.
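The agreement between the chain rule and direct substitution can also be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `z_along_path` and `chain_rule_rate` and the sample value $t = 0.7$ are illustrative choices. It compares the chain-rule sum against a central-difference derivative of the substituted function.

```python
import math

def z_along_path(t: float) -> float:
    """z = x^2 * y evaluated on the unit-circle path x = cos t, y = sin t."""
    x, y = math.cos(t), math.sin(t)
    return x**2 * y

def chain_rule_rate(t: float) -> float:
    """dz/dt = f_x * dx/dt + f_y * dy/dt: one product per path, then summed."""
    x, y = math.cos(t), math.sin(t)
    f_x, f_y = 2 * x * y, x**2                 # sensitivities of f to x and y
    dx_dt, dy_dt = -math.sin(t), math.cos(t)   # speeds of the intermediates
    return f_x * dx_dt + f_y * dy_dt

t, h = 0.7, 1e-6   # sample parameter value and step, chosen arbitrarily
numeric = (z_along_path(t + h) - z_along_path(t - h)) / (2 * h)  # central difference
print(chain_rule_rate(t), numeric)   # the two values should agree to many digits
```

Dropping `dx_dt` and `dy_dt` from `chain_rule_rate` (the pitfall above) makes the two printed values disagree, which is a quick way to catch the mistake.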
Case 2: Several independent variables, $z = f(x, y)$ with $x = x(s,t),\ y = y(s,t)$
Now the intermediates $x$ and $y$ each depend on two outer variables $s$ and $t$. The result $z$ is therefore a function of $s$ and $t$, and we want partial derivatives. The rule is the same — product along each path, sum the paths — applied once per outer variable, holding the other outer variable fixed:
$$\frac{\partial z}{\partial s} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial s} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial s}, \qquad \frac{\partial z}{\partial t} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial t}.$$
Worked example (step 2 — two layers). Let $z = x^2 + y^2$ with $x = s + t$, $y = s - t$. Then $f_x = 2x$, $f_y = 2y$, and the inner partials are $x_s = 1,\ x_t = 1,\ y_s = 1,\ y_t = -1$. So
$$\frac{\partial z}{\partial s} = 2x(1) + 2y(1) = 2(s+t) + 2(s-t) = 4s,$$ $$\frac{\partial z}{\partial t} = 2x(1) + 2y(-1) = 2(s+t) - 2(s-t) = 4t.$$
Check by substitution: $z = (s+t)^2 + (s-t)^2 = 2s^2 + 2t^2$, so $z_s = 4s$ and $z_t = 4t$. Agreement again — and notice the chain rule got there without ever expanding the square.
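The same check works for Case 2. This sketch assumes an arbitrary sample point $(s, t) = (1.5, -0.5)$ and hypothetical helper names; it evaluates the two chain-rule sums and compares them with central differences of the composite function.

```python
def f(x: float, y: float) -> float:
    return x**2 + y**2

def z(s: float, t: float) -> float:
    """Composite z(s, t) = f(x(s, t), y(s, t)) with x = s + t, y = s - t."""
    return f(s + t, s - t)

def chain_rule_partials(s: float, t: float) -> tuple[float, float]:
    """Product along each path, summed over paths, once per outer variable."""
    x, y = s + t, s - t
    f_x, f_y = 2 * x, 2 * y
    x_s, x_t, y_s, y_t = 1.0, 1.0, 1.0, -1.0   # inner partials
    z_s = f_x * x_s + f_y * y_s
    z_t = f_x * x_t + f_y * y_t
    return z_s, z_t

s0, t0, h = 1.5, -0.5, 1e-6
z_s_numeric = (z(s0 + h, t0) - z(s0 - h, t0)) / (2 * h)   # hold t fixed
z_t_numeric = (z(s0, t0 + h) - z(s0, t0 - h)) / (2 * h)   # hold s fixed
print(chain_rule_partials(s0, t0))   # (6.0, -2.0), i.e. (4s, 4t)
print(z_s_numeric, z_t_numeric)      # should agree closely with the line above
```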
The Tree Diagram
The mechanical bookkeeping is easiest to see in a picture. Draw $z$ at the top, branch to each intermediate variable, then branch from each intermediate to each outer variable it depends on. For Case 2:
                      z
                    /   \
            ∂f/∂x  /     \  ∂f/∂y
                  /       \
                 x           y
               /   \       /   \
       ∂x/∂s  /     \ ∂x/∂t   ... etc.
             s       t   s       t
The rule on a tree: to find $\partial z/\partial s$, trace every path from $z$ down to $s$. Multiply the labels along each path; add the paths together. A variable that appears at two leaves (like $s$ here, reached through both $x$ and $y$) contributes one term per route. This single picture handles any dependency structure — three intermediates, four outers, chains five layers deep. You never have to memorize a new formula; you read it off the tree.
Geometric Intuition. Picture influence flowing downhill through the tree like water through a branching pipe network. A wiggle in $s$ at the bottom sends ripples up every pipe that connects to it. The total disturbance felt at $z$ is the sum of what arrives through each pipe, and each pipe attenuates the signal by the product of its segment derivatives. The chain rule is conservation of influence.
Computational Note. This tree is a computational graph. When you call .backward() in PyTorch or apply jax.grad, the library has recorded exactly such a structure (nodes are operations, edges carry partial derivatives) and applies the multivariable chain rule from the output back to every input. The technique is reverse-mode automatic differentiation (backpropagation), and it is nothing more than this section executed billions of times per second. We return to it in §30.9.
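To make the tree rule concrete, here is a toy reverse-mode sketch. It is an illustration only and not how PyTorch or JAX are implemented: the class `Var` and the function `backward` are hypothetical names invented for this example, and only addition, subtraction, and multiplication are supported. Each node stores its local partial derivatives, and the backward pass multiplies along edges and sums over paths, exactly as the tree rule prescribes.

```python
class Var:
    """A node in a computational graph: a value plus links to its inputs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs (input_node, local partial derivative)
        self.grad = 0.0          # will hold d(output)/d(this node)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(output):
    """Propagate derivatives from the output back to every input."""
    order, seen = [], set()

    def visit(node):   # depth-first search; inputs are listed before consumers
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):   # consumers first, then their inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local   # multiply along an edge, sum over paths

# The Case 2 example: z = x^2 + y^2 with x = s + t, y = s - t.
s, t = Var(1.5), Var(-0.5)
x, y = s + t, s - t
z = x * x + y * y
backward(z)
print(s.grad, t.grad)   # 6.0 -2.0, matching 4s and 4t at (s, t) = (1.5, -0.5)
```

The `+=` in the backward pass is the "add the paths" step: `s` is reached through both `x` and `y`, so it receives one contribution per route.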
Case 3: Chain Rule for Implicit Differentiation
The chain rule gives a clean formula for implicit differentiation, sharpening the technique from Chapter 8. Suppose an equation $F(x, y) = 0$ defines $y$ implicitly as a function of $x$. Differentiate both sides with respect to $x$, treating $y$ as $y(x)$ — this is a Case 1 chain rule with $t = x$:
$$F_x \cdot 1 + F_y \cdot \frac{dy}{dx} = 0 \quad\Longrightarrow\quad \frac{dy}{dx} = -\frac{F_x}{F_y} \quad (\text{provided } F_y \neq 0).$$
Example. For the circle $x^2 + y^2 = 1$, set $F = x^2 + y^2 - 1$. Then $F_x = 2x$, $F_y = 2y$, and $\frac{dy}{dx} = -\frac{2x}{2y} = -\frac{x}{y}$ — the same answer Chapter 8 obtained by laborious term-by-term differentiation, now in two lines. In 3D, if $F(x,y,z) = 0$ defines $z$ implicitly, the same logic gives $\frac{\partial z}{\partial x} = -\frac{F_x}{F_z}$ and $\frac{\partial z}{\partial y} = -\frac{F_y}{F_z}$.
Check Your Understanding. Use the implicit chain-rule formula to find $\frac{dy}{dx}$ for the curve $x^3 + y^3 = 6xy$ (the folium of Descartes) at a general point.
Answer
Set $F = x^3 + y^3 - 6xy$. Then $F_x = 3x^2 - 6y$ and $F_y = 3y^2 - 6x$, so $\dfrac{dy}{dx} = -\dfrac{F_x}{F_y} = -\dfrac{3x^2 - 6y}{3y^2 - 6x} = \dfrac{6y - 3x^2}{3y^2 - 6x} = \dfrac{2y - x^2}{y^2 - 2x}$. The tangent is horizontal where the numerator vanishes ($2y = x^2$) and vertical where the denominator vanishes ($y^2 = 2x$).
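As a sanity check on the folium answer, the sketch below evaluates $-F_x/F_y$ at the point $(3, 3)$, which lies on the curve, and compares it with a slope measured directly from nearby points on the curve. The helper `y_on_curve` (a few Newton steps in $y$) and the choice of point are assumptions made for illustration.

```python
def F(x: float, y: float) -> float:
    return x**3 + y**3 - 6 * x * y

def implicit_slope(x: float, y: float) -> float:
    """dy/dx = -F_x / F_y, valid where F_y != 0."""
    F_x = 3 * x**2 - 6 * y
    F_y = 3 * y**2 - 6 * x
    return -F_x / F_y

def y_on_curve(x: float, y_guess: float, steps: int = 20) -> float:
    """Solve F(x, y) = 0 for y near y_guess using Newton's method in y."""
    y = y_guess
    for _ in range(steps):
        y -= F(x, y) / (3 * y**2 - 6 * x)
    return y

x0, y0, h = 3.0, 3.0, 1e-5
print(F(x0, y0))                # 0.0, so (3, 3) is on the curve
print(implicit_slope(x0, y0))   # -1.0
secant = (y_on_curve(x0 + h, y0) - y_on_curve(x0 - h, y0)) / (2 * h)
print(secant)                   # should be very close to -1
```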
30.3 The Gradient Vector
We are ready for the chapter's central object. The partial derivatives of $f$ are not just a list of numbers to be carried around separately — assembled into a vector, they become a single geometric entity with a life of its own.
For a function $f(x, y, z)$, the gradient is the vector of partial derivatives:
$$\nabla f = \left\langle \frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \frac{\partial f}{\partial z} \right\rangle = \langle f_x,\ f_y,\ f_z \rangle.$$
Read $\nabla f$ as "del $f$" or "grad $f$." For a function of two variables, $\nabla f = \langle f_x, f_y \rangle$ is a 2D vector living in the $xy$-plane. The symbol $\nabla$ ("nabla," after an ancient harp of that shape) is itself a vector differential operator,
$$\nabla = \left\langle \frac{\partial}{\partial x},\ \frac{\partial}{\partial y},\ \frac{\partial}{\partial z} \right\rangle,$$
a thing hungry for a function to differentiate. Feed it a scalar function $f$ and it returns the vector $\nabla f$. In Chapters 34–37 the same operator, fed a vector field via a dot or cross product, will produce divergence ($\nabla \cdot \mathbf{F}$) and curl ($\nabla \times \mathbf{F}$). One operator, three jobs — the gradient is the first and most fundamental.
Example. For $f(x, y) = x^2 + y^2$ we have $\nabla f = \langle 2x, 2y \rangle = 2\langle x, y \rangle$. At the point $(1, 1)$, $\nabla f = \langle 2, 2 \rangle$: a vector of magnitude $\|\nabla f\| = 2\sqrt{2}$ pointing at $45°$, directly away from the origin along the ray from the center of this bowl-shaped paraboloid. Hold that picture — in the next section we will see it is no accident that the gradient points "straight uphill, away from the bottom of the bowl."
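When partial derivatives are awkward to compute by hand, a gradient can be approximated numerically. The following is a minimal sketch using central differences; `numerical_gradient` is a hypothetical helper written for this example, and the step size `h` is an arbitrary small value.

```python
import math

def numerical_gradient(f, point, h=1e-6):
    """Approximate each partial derivative of f at point by a central difference."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h    # nudge only coordinate i up
        backward[i] -= h   # and down
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad

def f(p):
    x, y = p
    return x**2 + y**2

g = numerical_gradient(f, [1.0, 1.0])
norm = math.sqrt(sum(c * c for c in g))
print(g)      # approximately [2, 2]
print(norm)   # approximately 2 * sqrt(2)
```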
Historical Note. The operator $\nabla$ was introduced by William Rowan Hamilton in the 1840s; the name "nabla" was suggested to the physicist Peter Guthrie Tait by William Robertson Smith, after a Hellenistic harp of similar triangular shape. The word "gradient" (from Latin gradus, "step") entered physics through James Clerk Maxwell, who needed a name for the spatial rate of change of his potential functions. The notation has been universal for over a century.
30.4 The Three Superpowers of the Gradient
The gradient earns its central place through three geometric facts. We will state them here, build intuition, then prove all three in §30.5 using the directional derivative. Throughout, picture $z = f(x,y)$ as a landscape — a hill whose height above the point $(x,y)$ is $f(x,y)$.
Superpower 1: $\nabla f$ points in the direction of steepest ascent
At any point, of all the directions you could step, the one in which $f$ increases fastest is the direction of $\nabla f$. And the rate of climb in that best direction is exactly $\|\nabla f\|$. The gradient is the compass needle that always points straight uphill, and its length tells you how steep the climb is.
Superpower 2: $\nabla f$ is perpendicular to level curves
A level curve is a set $f(x,y) = c$ — a contour line of constant height, exactly like the contours on a topographic map. If you walk along a contour, your height never changes, so $f$ is momentarily not changing in that direction. The direction of fastest change must therefore be at right angles to the direction of no change. Hence $\nabla f \perp$ (level curve) at every point. On a contour map, the gradient is the arrow crossing the contours at $90°$, always toward the next-higher contour.
Superpower 3: $\nabla f$ is normal to level surfaces (3D)
The identical statement one dimension up: for $f(x,y,z)$, the level surfaces $f = c$ are surfaces of constant value, and $\nabla f$ is the normal vector to them. This is the fact that will hand us tangent planes in §30.6.
Geometric Intuition. Stand on a hillside in fog with only a contour map. Find the contour line through your feet. The steepest way up is perpendicular to that line — never along it — and the closer together the neighboring contours, the steeper the climb and the longer the gradient arrow. Rain running off the hill flows in the direction of $-\nabla f$: perpendicular to the contours, straight downhill. Every drainage pattern on Earth is a gradient field in action.
Real-World Application — Hiking and terrain analysis (geography/GIS). Geographic Information Systems compute $\nabla(\text{elevation})$ across digital elevation models to produce slope maps ($\|\nabla f\|$) and aspect maps (the compass direction of $\nabla f$). Land planners use them to route trails along gentle gradients, hydrologists use $-\nabla f$ to predict where water pools and floods, and avalanche forecasters flag slopes whose angle, $\arctan\|\nabla f\|$, falls in the avalanche-prone range. The mathematics of this chapter is what lets a hiking app turn elevation data into slope and difficulty ratings.
30.5 The Directional Derivative — and the Proof of the Three Superpowers
The partial derivative $f_x$ measures the rate of change of $f$ as you step in the pure $x$-direction; $f_y$ measures it in the pure $y$-direction. But you can step in any direction. The tool that measures the rate of change in an arbitrary direction — and that proves all three superpowers in one stroke — is the directional derivative.
Definition
Let $\mathbf{u} = \langle u_1, u_2\rangle$ be a unit vector ($\|\mathbf{u}\| = 1$). The directional derivative of $f$ at $(x_0, y_0)$ in the direction $\mathbf{u}$ is
$$D_{\mathbf{u}} f(x_0, y_0) = \lim_{h \to 0} \frac{f(x_0 + h u_1,\ y_0 + h u_2) - f(x_0, y_0)}{h}.$$
This is precisely the slope you would feel walking in direction $\mathbf{u}$: numerator is the change in height, denominator is the (signed) distance walked, and the limit takes the step to zero. Notice that choosing $\mathbf{u} = \langle 1, 0\rangle$ recovers $f_x$, and $\mathbf{u} = \langle 0, 1\rangle$ recovers $f_y$. The directional derivative contains the partials as special cases.
The Master Formula
Computing that limit from scratch for every direction would be miserable. The gradient rescues us. Define $g(h) = f(x_0 + hu_1,\ y_0 + hu_2)$, a single-variable function of $h$ tracing $f$ along the line through $(x_0,y_0)$ in direction $\mathbf{u}$. Then $D_{\mathbf{u}}f(x_0,y_0) = g'(0)$. Apply the Case 1 chain rule from §30.2, with $x(h) = x_0 + hu_1$ and $y(h) = y_0 + hu_2$, so $x'(h) = u_1$ and $y'(h) = u_2$:
$$g'(h) = f_x\,u_1 + f_y\,u_2 \quad\Longrightarrow\quad D_{\mathbf{u}} f = f_x u_1 + f_y u_2 = \nabla f \cdot \mathbf{u}.$$
This is the formula that makes the gradient indispensable:
$$\boxed{\,D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}\,} \qquad (\mathbf{u}\text{ a unit vector}).$$
The rate of change in any direction is just the dot product of the gradient with that direction. Compute one gradient and you instantly know the slope in all infinitely many directions.
Worked example. For $f(x, y) = x^2 + y^2$ at the point $(1, 1)$, in the direction $\mathbf{u} = \langle 3/5, 4/5\rangle$ (check: $\|\mathbf{u}\| = \sqrt{9/25 + 16/25} = 1$ ✓). We have $\nabla f(1,1) = \langle 2, 2\rangle$, so
$$D_{\mathbf{u}} f = \langle 2, 2\rangle \cdot \langle 3/5, 4/5\rangle = \tfrac{6}{5} + \tfrac{8}{5} = \tfrac{14}{5} = 2.8.$$
So $f$ climbs at rate $2.8$ if you step off in direction $\mathbf{u}$ from $(1,1)$.
Common Pitfall. The formula $D_{\mathbf{u}}f = \nabla f \cdot \mathbf{u}$ requires $\mathbf{u}$ to be a unit vector. A frequent error is to use the raw direction vector, say $\langle 3, 4\rangle$, and report $D_{\mathbf{u}}f = \langle 2,2\rangle\cdot\langle 3,4\rangle = 14$. That is five times too large, because $\|\langle 3,4\rangle\| = 5$. Always normalize first: $\mathbf{u} = \frac{1}{5}\langle 3,4\rangle = \langle 3/5, 4/5\rangle$. If you forget, your "rate of change per unit distance" is silently measured in the wrong units of length.
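The worked example and the pitfall can both be reproduced in a few lines. In this sketch (function names are illustrative), `directional_derivative` normalizes the direction before taking the dot product, and the result is compared with the difference quotient from the definition and with the un-normalized dot product.

```python
import math

def f(x: float, y: float) -> float:
    return x**2 + y**2

def grad_f(x: float, y: float) -> tuple[float, float]:
    return (2 * x, 2 * y)

def directional_derivative(grad, direction):
    """Normalize the direction to a unit vector, then dot it with the gradient."""
    length = math.hypot(direction[0], direction[1])
    u = (direction[0] / length, direction[1] / length)
    return grad[0] * u[0] + grad[1] * u[1]

g = grad_f(1.0, 1.0)
rate = directional_derivative(g, (3.0, 4.0))   # raw <3, 4> is normalized inside
raw_dot = g[0] * 3.0 + g[1] * 4.0              # the pitfall: no normalization

# Difference quotient from the definition, with a small but finite h.
h, u = 1e-6, (0.6, 0.8)
quotient = (f(1 + h * u[0], 1 + h * u[1]) - f(1, 1)) / h

print(rate)       # about 2.8
print(quotient)   # about 2.8
print(raw_dot)    # 14.0, five times too large
```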
Now the Superpowers Fall Out
Rewrite the master formula using the geometric meaning of the dot product. If $\theta$ is the angle between $\nabla f$ and $\mathbf{u}$, then since $\|\mathbf{u}\| = 1$,
$$D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} = \|\nabla f\|\,\|\mathbf{u}\|\cos\theta = \|\nabla f\|\cos\theta.$$
Everything in §30.4 is now visible in this one expression. As $\mathbf{u}$ swings around, only $\cos\theta$ varies, between $-1$ and $1$:
- Maximum rate $= \|\nabla f\|$, achieved at $\cos\theta = 1$, i.e. $\theta = 0$: $\mathbf{u}$ points along $\nabla f$. This is Superpower 1: steepest ascent is the gradient's own direction, with rate $\|\nabla f\|$.
- Minimum rate $= -\|\nabla f\|$, at $\theta = \pi$: $\mathbf{u}$ points opposite the gradient, $-\nabla f$. This is steepest descent — and the engine of §30.8.
- Zero rate, at $\theta = \pi/2$: $\mathbf{u} \perp \nabla f$. Moving perpendicular to the gradient leaves $f$ instantaneously unchanged — i.e. you are moving along a level curve. This is Superpower 2: the gradient is perpendicular to level curves. (Superpower 3 is the identical argument in 3D, since the same formula $D_{\mathbf{u}}f = \nabla f\cdot\mathbf{u}$ holds there.)
Three geometric facts, one clean proof, courtesy of the chain rule and the dot product. This is the kind of unification calculus exists to deliver: geometry and algebra are the same statement read two ways.
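A brute-force sweep makes the three cases visible. This sketch, offered as an illustration, takes $\nabla f(1,1) = \langle 2, 2\rangle$ from the running example, evaluates $\nabla f \cdot \mathbf{u}$ for unit vectors at every whole degree, and locates the largest, smallest, and zero rates.

```python
import math

grad = (2.0, 2.0)   # gradient of f = x^2 + y^2 at (1, 1); it points at 45 degrees

# Directional derivative for unit vectors at 0, 1, ..., 359 degrees.
rates = []
for degrees in range(360):
    theta = math.radians(degrees)
    u = (math.cos(theta), math.sin(theta))
    rates.append(grad[0] * u[0] + grad[1] * u[1])

steepest_up = max(range(360), key=lambda k: rates[k])
steepest_down = min(range(360), key=lambda k: rates[k])
print(steepest_up, rates[steepest_up])       # 45 degrees, rate about 2.828
print(steepest_down, rates[steepest_down])   # 225 degrees, rate about -2.828
print(rates[135])                            # essentially 0: along the level curve
```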
Math Major Sidebar — Differentiability vs. existence of directional derivatives. The formula $D_{\mathbf{u}}f = \nabla f\cdot\mathbf{u}$ is not free; it requires $f$ to be differentiable at the point, meaning $f$ admits a genuine linear approximation (a tangent plane), not merely that its partials exist. There are pathological functions whose directional derivatives exist in every direction yet which fail to be differentiable, and for them the dot-product formula breaks. The standard example is $f(x,y) = \frac{x^2 y}{x^4 + y^2}$ (with $f(0,0)=0$): every directional derivative at the origin exists, but $f$ is not even continuous there along the parabola $y = x^2$. The safe theorem: if the partials $f_x, f_y$ are continuous near the point, then $f$ is differentiable there and the formula holds. Continuity of the partials is the hypothesis you should always have in the back of your mind.
30.6 Tangent Planes and Normal Lines
Superpower 3 — $\nabla F$ is normal to the level surface $F = c$ — converts instantly into a formula for tangent planes, the multivariable successor to the tangent line.
Suppose a surface is given as a level set $F(x, y, z) = 0$. At a point $(x_0, y_0, z_0)$ on the surface, the gradient $\nabla F(x_0, y_0, z_0) = \langle F_x, F_y, F_z\rangle$ is a normal vector. A plane through $(x_0,y_0,z_0)$ with this normal is the tangent plane:
$$F_x(x_0,y_0,z_0)(x - x_0) + F_y(x_0,y_0,z_0)(y - y_0) + F_z(x_0,y_0,z_0)(z - z_0) = 0.$$
The line through the point in the direction of $\nabla F$ — perpendicular to the surface — is the normal line.
Example: tangent plane to a sphere. The sphere $x^2 + y^2 + z^2 = 14$ is the level set $F = 0$ with $F = x^2+y^2+z^2-14$. At $(1, 2, 3)$ (check: $1+4+9 = 14$ ✓), $\nabla F = \langle 2x, 2y, 2z\rangle = \langle 2, 4, 6\rangle$. The tangent plane is
$$2(x-1) + 4(y-2) + 6(z-3) = 0 \;\Longrightarrow\; 2x + 4y + 6z = 28 \;\Longrightarrow\; x + 2y + 3z = 14.$$
Notice the normal $\langle 2,4,6\rangle \parallel \langle 1,2,3\rangle$ points radially outward from the sphere's center — exactly as it should, since the radius of a sphere is perpendicular to its surface. The geometry confirms the algebra.
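The sphere example can be checked in code. This sketch (helper names are hypothetical) builds the plane coefficients from $\nabla F$ and then steps a small distance $\varepsilon$ along a vector lying in the plane, here $\langle 2, -1, 0\rangle$, chosen because it is perpendicular to $\langle 1, 2, 3\rangle$. If the plane really is tangent, the step leaves the sphere only at second order in $\varepsilon$.

```python
def F(x: float, y: float, z: float) -> float:
    return x**2 + y**2 + z**2 - 14

def grad_F(x: float, y: float, z: float) -> tuple[float, float, float]:
    return (2 * x, 2 * y, 2 * z)

def tangent_plane(point):
    """Return (a, b, c, d) for the tangent plane a*x + b*y + c*z = d at point."""
    a, b, c = grad_F(*point)
    d = a * point[0] + b * point[1] + c * point[2]
    return a, b, c, d

p = (1.0, 2.0, 3.0)
a, b, c, d = tangent_plane(p)
print(a, b, c, d)   # 2.0 4.0 6.0 28.0, i.e. x + 2y + 3z = 14 after dividing by 2

v = (2.0, -1.0, 0.0)   # lies in the plane: 2*2 + 4*(-1) + 6*0 = 0
eps = 1e-3
q = tuple(pi + eps * vi for pi, vi in zip(p, v))
print(F(*q))           # about 5e-06 = 5 * eps**2, a second-order departure
```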
The Special Case of a Graph $z = f(x, y)$
Chapter 29 introduced tangent planes for graphs; the gradient now re-derives that formula as a special case, tying the two chapters together. Write the graph $z = f(x,y)$ as a level surface by setting $F(x,y,z) = f(x,y) - z = 0$. Then $F_x = f_x$, $F_y = f_y$, $F_z = -1$, and the tangent-plane equation becomes
$$f_x(x_0,y_0)(x - x_0) + f_y(x_0,y_0)(y - y_0) - (z - z_0) = 0,$$
which rearranges to the linearization
$$z = z_0 + f_x(x_0,y_0)(x - x_0) + f_y(x_0,y_0)(y - y_0).$$
Compare the single-variable tangent line $y = y_0 + f'(x_0)(x - x_0)$ from Chapter 11. The structure is identical — base value plus derivative times displacement — only now there are two displacement terms, one per input variable, and the single derivative $f'(x_0)$ has become the gradient $\nabla f = \langle f_x, f_y\rangle$. The gradient is the derivative; the tangent plane is the tangent line; we have simply gained a dimension.
30.7 Linear Approximation in Several Variables
The tangent plane, like the tangent line before it, is the best linear approximation to $f$ near the base point. Reading the formula above as an approximation:
$$f(x, y) \approx f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0).$$
This is the multivariable cousin of $f(x) \approx f(x_0) + f'(x_0)(x - x_0)$ from Chapter 11, and it embodies our recurring theme that approximation is the soul of calculus: replace a curved surface by its flat tangent plane and the local arithmetic becomes trivial.
Worked example. Estimate $f(0.1, 0.05)$ for $f(x, y) = \sqrt{x^2 + y^2 + 1}$ using the base point $(0,0)$. There $f(0,0) = 1$, and
$$f_x = \frac{x}{\sqrt{x^2+y^2+1}}\bigg|_{(0,0)} = 0, \qquad f_y = \frac{y}{\sqrt{x^2+y^2+1}}\bigg|_{(0,0)} = 0,$$
so the linearization is simply $f \approx 1$. The true value is $\sqrt{0.01 + 0.0025 + 1} = \sqrt{1.0125} \approx 1.00623$, an error of about $0.006$. The gradient vanishes at the origin (which is the bottom of this bowl), so the tangent plane is flat and the first-order approximation is just the constant $1$; the small error is entirely second-order curvature. To do better you need the multivariable Taylor expansion, which generalizes the single-variable theory of Chapter 23 (and which the second-derivative/Hessian machinery of Chapter 31 begins).
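The claim that the leftover error is second-order can be tested numerically. In this sketch (the helper `linearize` is a hypothetical name), shrinking the displacement from the base point by a factor of 10 should shrink the error by a factor of roughly 100.

```python
import math

def f(x: float, y: float) -> float:
    return math.sqrt(x**2 + y**2 + 1)

def grad_f(x: float, y: float) -> tuple[float, float]:
    r = f(x, y)
    return (x / r, y / r)

def linearize(x0: float, y0: float):
    """Return the tangent-plane approximation of f based at (x0, y0)."""
    f0 = f(x0, y0)
    fx, fy = grad_f(x0, y0)
    return lambda x, y: f0 + fx * (x - x0) + fy * (y - y0)

L0 = linearize(0.0, 0.0)
err_far = f(0.1, 0.05) - L0(0.1, 0.05)       # the worked example
err_near = f(0.01, 0.005) - L0(0.01, 0.005)  # displacement ten times smaller
print(err_far)              # about 0.00623
print(err_far / err_near)   # close to 100, the signature of a second-order error
```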
Real-World Application — Error propagation (engineering/experimental physics). Suppose a quantity $f(x, y)$ is computed from two measurements $x$ and $y$ with uncertainties $\Delta x$, $\Delta y$. The linear approximation gives the propagated uncertainty $\Delta f \approx |f_x|\,\Delta x + |f_y|\,\Delta y$ (or, for independent errors, $\Delta f \approx \sqrt{f_x^2 \Delta x^2 + f_y^2 \Delta y^2}$). Every lab report that states "$g = 9.79 \pm 0.04\ \text{m/s}^2$" computed that $\pm$ by dotting a gradient against measurement uncertainties. The gradient is, quite literally, the sensitivity of an output to its inputs.
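Here is how the two propagation formulas look in practice, as a sketch under stated assumptions: the quantity is $g = 4\pi^2 L/T^2$ from a simple pendulum, and the measured length, period, and their uncertainties are hypothetical values invented for illustration.

```python
import math

def g_from_pendulum(length: float, period: float) -> float:
    """g = 4 pi^2 L / T^2 for a simple pendulum (small-angle model)."""
    return 4 * math.pi**2 * length / period**2

# Hypothetical measurements, invented for this illustration.
L_meas, dL = 1.000, 0.002   # length in metres and its uncertainty
T_meas, dT = 2.007, 0.004   # period in seconds and its uncertainty

g = g_from_pendulum(L_meas, T_meas)
g_L = 4 * math.pi**2 / T_meas**2             # partial derivative of g in L
g_T = -8 * math.pi**2 * L_meas / T_meas**3   # partial derivative of g in T

worst_case = abs(g_L) * dL + abs(g_T) * dT               # errors add directly
independent = math.sqrt((g_L * dL)**2 + (g_T * dT)**2)   # errors add in quadrature
print(f"g = {g:.3f} m/s^2")
print(f"worst case: {worst_case:.3f}, independent errors: {independent:.3f}")
```

The quadrature estimate is never larger than the worst-case sum, which is why it is preferred when the measurement errors are believed to be independent.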
30.8 Gradient Descent — The Master Algorithm
We now reach the payoff this book has been building toward since Chapter 6, where the single-variable derivative first answered the question "which way should I step to decrease $f$?" The answer there was "step opposite the sign of $f'$." In several variables, Superpower 1 upgrades that answer to its final form: the direction of fastest decrease is $-\nabla f$. Iterate that step and you have the algorithm that trains essentially every modern machine-learning model.
The Algorithm
To minimize a function $f$:
- Start at an initial guess $\mathbf{x}_0$.
- Compute the gradient $\nabla f(\mathbf{x}_k)$.
- Step in the negative gradient direction: $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\,\nabla f(\mathbf{x}_k)$.
- Repeat until $\|\nabla f\|$ is small (you are near a flat spot — a candidate minimum).
The scalar $\eta > 0$ is the learning rate (or step size): how far you move per step. The entire method is one line — step against the gradient — repeated. To maximize instead, flip the sign and step with the gradient ($\mathbf{x}_{k+1} = \mathbf{x}_k + \eta\nabla f$); that variant is called steepest ascent, and it models, for instance, best-response dynamics in economics where agents adjust toward higher payoff.
Why it works. By §30.5, $-\nabla f$ is the direction of steepest descent, with the function decreasing at rate $\|\nabla f\|$. For a small enough step $\eta$, the linear approximation $f(\mathbf{x}_{k+1}) \approx f(\mathbf{x}_k) - \eta\|\nabla f(\mathbf{x}_k)\|^2$ guarantees $f$ goes down (the subtracted term is positive whenever the gradient is nonzero). Repeating, you walk downhill, step by step, until the gradient flattens out.
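The first-order estimate can be compared with the actual change in $f$. The sketch below uses $f(x,y) = x^2 + 10y^2$ and the starting point $(5,5)$ as illustrative choices, and prints the actual and predicted decrease for several step sizes; the two should agree better and better as $\eta$ shrinks.

```python
def f(p):
    return p[0]**2 + 10 * p[1]**2

def grad(p):
    return (2 * p[0], 20 * p[1])

p = (5.0, 5.0)
g = grad(p)
g_norm_sq = g[0]**2 + g[1]**2   # squared gradient length, 10100 here

results = []
for eta in (1e-2, 1e-3, 1e-4):
    q = (p[0] - eta * g[0], p[1] - eta * g[1])   # one gradient-descent step
    actual = f(q) - f(p)                          # true change in f
    predicted = -eta * g_norm_sq                  # linear-approximation estimate
    results.append((eta, actual, predicted))
    print(eta, actual, predicted)
```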
A Worked Trajectory by Hand
Take $f(x, y) = x^2 + 10y^2$, a long narrow valley (much steeper in $y$ than in $x$), with $\nabla f = \langle 2x, 20y\rangle$. Start at $(5, 5)$ with learning rate $\eta = 0.05$:
$$\mathbf{x}_1 = (5,5) - 0.05\cdot(10, 100) = (5 - 0.5,\ 5 - 5) = (4.5,\ 0).$$
The $y$-coordinate jumped straight from $5$ to $0$, landing exactly on the valley floor in $y$ on the first step, by luck of the numbers: each step multiplies $y$ by $1 - 20\eta$, which is exactly $0$ at $\eta = 0.05$. But that knife's-edge behavior is a warning: $f$ is ten times more curved in $y$ than in $x$, so a step size that is reasonable for $x$ is enormous for $y$. Raise $\eta$ above $0.05$ and the $y$-iterate overshoots the valley floor and bounces from side to side; raise it past $0.1$ and the bounces grow in amplitude, so the iteration diverges. This mismatch is called poor conditioning, and basic gradient descent struggles with it.
Warning. The learning rate is the single most important — and most dangerous — knob in gradient descent. Too small, and convergence is glacially slow. Too large, and each step overshoots the minimum, the iterates blow up, and $f \to \infty$. For the quadratic $f = x^2 + 10y^2$, the iteration is stable in $y$ only when $\eta < 2/20 = 0.1$; at $\eta = 0.11$ the $y$-iterate explodes from $5$ toward $\sim 190$ within twenty steps even as the $x$-iterate quietly converges. There is no single "correct" learning rate — choosing and adapting it is a craft, and the cause of a large fraction of failed machine-learning experiments.
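The effect of the learning rate is easy to see by experiment. This minimal sketch (the function `run` and the particular list of rates are illustrative) takes twenty steps on $f = x^2 + 10y^2$ from $(5,5)$ for several values of $\eta$ on either side of the stability threshold $0.1$.

```python
def run(eta: float, steps: int = 20, start=(5.0, 5.0)):
    """Take `steps` gradient-descent steps on f = x^2 + 10 y^2."""
    x, y = start
    for _ in range(steps):
        x, y = x - eta * 2 * x, y - eta * 20 * y   # step against <2x, 20y>
    return x, y

for eta in (0.01, 0.05, 0.09, 0.11):
    x, y = run(eta)
    print(f"eta = {eta}: x = {x:.4f}, y = {y:.4f}")
# Below 0.1 the y-iterate shrinks (oscillating in sign once eta exceeds 0.05);
# at 0.11 it has grown to roughly 190 while x has quietly shrunk.
```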
Python Implementation
Here is gradient descent in full, on the same poorly-conditioned valley, with a safe learning rate:
# Gradient descent minimizing f(x,y) = x^2 + 10 y^2
import numpy as np

def f(p: np.ndarray) -> float:
    return p[0]**2 + 10*p[1]**2

def grad(p: np.ndarray) -> np.ndarray:
    return np.array([2*p[0], 20*p[1]])   # ∇f = <2x, 20y>

x = np.array([5.0, 5.0])                 # starting point
eta = 0.04                               # learning rate (< 0.1 for stability)
for k in range(40):
    x = x - eta * grad(x)                # the one essential line

print("minimum found near:", np.round(x, 4))   # -> [0.178 0.   ]
print("f at minimum:", round(f(x), 5))         # -> 0.03169
# True minimum is (0,0) with f=0; x-coordinate converges slowly because
# the x-direction is gently curved -- classic ill-conditioning.
The $y$-coordinate races to $0$ (steep direction, fast convergence) while the $x$-coordinate crawls (shallow direction, slow convergence). That asymmetry, fast in some directions and slow in others, is exactly the problem that momentum, RMSProp, and Adam were invented to mitigate: momentum by accumulating velocity along persistently downhill directions, RMSProp and Adam by giving each coordinate its own effective step size.
Computational Note. Every production deep-learning optimizer is a variant of the loop above. Stochastic gradient descent (SGD) estimates $\nabla f$ from a small random minibatch of data rather than the whole dataset, trading exactness for speed. Momentum averages recent gradients to power through narrow valleys. Adam (Kingma & Ba, 2014) keeps a per-parameter adaptive learning rate and is the default for training transformers. All of them begin with the line x = x - eta * grad(x); the rest is engineering on top of this chapter's mathematics.
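As one illustration of that engineering, here is a sketch of heavy-ball momentum, one common formulation among several, on the same valley. The momentum coefficient `beta = 0.5` is an assumption chosen for this example rather than a tuned or recommended value, and the function names are hypothetical. The velocity `v` carries a fraction of the previous step into the next one, which speeds up progress along the shallow $x$-direction.

```python
def f(p):
    return p[0]**2 + 10 * p[1]**2

def grad(p):
    return (2 * p[0], 20 * p[1])

def plain_descent(p, eta, steps):
    for _ in range(steps):
        g = grad(p)
        p = (p[0] - eta * g[0], p[1] - eta * g[1])
    return p

def momentum_descent(p, eta, beta, steps):
    """Heavy-ball update: v <- beta*v - eta*grad, then p <- p + v."""
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(p)
        v = (beta * v[0] - eta * g[0], beta * v[1] - eta * g[1])
        p = (p[0] + v[0], p[1] + v[1])
    return p

start, eta, steps = (5.0, 5.0), 0.04, 40
p_plain = plain_descent(start, eta, steps)
p_momentum = momentum_descent(start, eta, beta=0.5, steps=steps)
print("plain:   ", p_plain, f(p_plain))
print("momentum:", p_momentum, f(p_momentum))   # noticeably closer to (0, 0)
```

Momentum is not automatically better: on this problem a coefficient much closer to 1 would actually slow convergence over the same 40 steps, so `beta` is one more knob to choose with care.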
30.9 Gradient Descent and Machine Learning
Why does this one algorithm dominate modern AI? Because training a model is minimizing a function. Reframe the language and the connection is exact.
A machine-learning model has parameters $\boldsymbol{\theta}$: the weights of a neural network, possibly billions of them. A loss function $L(\boldsymbol{\theta})$ measures how badly the model, with those parameters, predicts the training data: large loss means bad predictions, small loss means good ones. The graph of $L$ over the space of all possible parameter settings is the loss surface (or loss landscape), a high-dimensional version of the hill from §30.4 with one axis per parameter. Training means finding the parameters that sit at the bottom of a valley on that surface. And the way you get to the bottom of a valley, by everything in this chapter, is to repeatedly step in the direction $-\nabla L$:
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\,\nabla L(\boldsymbol{\theta}).$$
That is the entire training loop. The gradient $\nabla L$ — one partial derivative per parameter, telling you how the loss responds to nudging each weight — is computed by backpropagation, which is precisely the reverse-mode chain rule of §30.2 applied to the network's computational graph. The multivariable chain rule and the gradient, the two halves of this chapter, are the two halves of how every neural network learns.
To make it concrete, here is gradient descent fitting a line $y = wx + b$ to data by minimizing mean-squared-error loss — the simplest possible "machine learning," with two parameters instead of two billion, but the algorithm is identical:
# Linear regression by gradient descent: fit y = w*x + b
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 50)
y_true = 2.0 * X + 1.0 + 0.1 * rng.standard_normal(50)   # true: w=2, b=1, + noise

w, b, eta = 0.0, 0.0, 0.5          # start at the wrong answer
for step in range(2000):
    y_pred = w * X + b
    err = y_pred - y_true
    grad_w = 2 * np.mean(err * X)  # ∂L/∂w
    grad_b = 2 * np.mean(err)      # ∂L/∂b
    w -= eta * grad_w              # step the parameters downhill
    b -= eta * grad_b

print(f"learned slope w = {w:.3f}, intercept b = {b:.3f}")
# -> learned slope w ~ 2.12, intercept b ~ 0.96 (recovers the true 2.0, 1.0 up to noise)
Starting from the wrong answer $(w,b) = (0,0)$, gradient descent walks the loss surface downhill and recovers the underlying relationship. Scale this from $2$ parameters to billions, replace the line by a deep network, replace the full-data gradient by minibatch estimates, and you have, in outline, how a large language model is trained. This is the most important modern application of calculus, and it rests entirely on the gradient you computed by hand earlier in this chapter.
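Before trusting hand-derived gradient formulas such as those for $\partial L/\partial w$ and $\partial L/\partial b$, it is common practice to compare them with finite differences, a procedure often called a gradient check. The sketch below does this for the mean-squared-error loss; the five data points are made up for illustration, and the helper names are hypothetical.

```python
# Made-up data, for illustration only.
X = [0.0, 0.25, 0.5, 0.75, 1.0]
Y = [1.1, 1.4, 2.1, 2.4, 3.0]

def loss(w: float, b: float) -> float:
    """Mean-squared error of the line y = w*x + b on the data."""
    return sum((w * x + b - y)**2 for x, y in zip(X, Y)) / len(X)

def analytic_grad(w: float, b: float) -> tuple[float, float]:
    """Partial derivatives of the loss in w and b, derived by hand."""
    n = len(X)
    grad_w = 2 * sum((w * x + b - y) * x for x, y in zip(X, Y)) / n
    grad_b = 2 * sum((w * x + b - y) for x, y in zip(X, Y)) / n
    return grad_w, grad_b

def numeric_grad(w: float, b: float, h: float = 1e-6) -> tuple[float, float]:
    """The same partial derivatives, estimated by central differences."""
    grad_w = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
    grad_b = (loss(w, b + h) - loss(w, b - h)) / (2 * h)
    return grad_w, grad_b

w0, b0 = 0.3, -0.2   # an arbitrary test point in parameter space
print(analytic_grad(w0, b0))
print(numeric_grad(w0, b0))   # should match the line above to several digits
```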
Real-World Application — Training neural networks (data science / AI). A modern language model has on the order of $10^{11}$ parameters; its loss surface lives in a space of that dimension. There is no hope of visualizing it, solving $\nabla L = 0$ algebraically, or searching it by trial and error. The only feasible approach is to compute $\nabla L$ by backpropagation and step in $-\nabla L$ — gradient descent — millions of times. Every chatbot, image generator, and recommendation engine you have ever used was trained by iterating the single line
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\,\nabla L(\boldsymbol{\theta}).$$
The gradient is not an academic curiosity; it is the workhorse of the entire AI industry.
Check Your Understanding. For $f(x, y) = x^2 + 10y^2$, you are at the point $(3, 1)$. (a) In what direction should you step to decrease $f$ fastest? (b) What is the rate of fastest decrease? (c) With learning rate $\eta = 0.1$, where does one gradient-descent step take you?
Answer
$\nabla f = \langle 2x, 20y\rangle = \langle 6, 20\rangle$ at $(3,1)$. (a) Fastest decrease is the direction $-\nabla f = \langle -6, -20\rangle$ (normalized, $\frac{1}{\sqrt{436}}\langle -6,-20\rangle$). (b) The fastest rate of decrease is $-\|\nabla f\| = -\sqrt{36 + 400} = -\sqrt{436} \approx -20.88$. (c) $\mathbf{x}_1 = (3,1) - 0.1\langle 6, 20\rangle = (3 - 0.6,\ 1 - 2) = (2.4,\ -1)$. Note that the $y$-step overshot the valley floor $y=0$ and landed just as far away on the other side, a sign that $\eta$ is too large for the steep $y$-direction.
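A short numerical check of this answer, continued for a few more steps, shows what the overshoot leads to. The helper name `grad_f` and the choice of four iterations are illustrative.

```python
# Check the worked answer for f(x, y) = x**2 + 10*y**2 at (3, 1), eta = 0.1.
import numpy as np

def grad_f(p):
    """Gradient of f(x, y) = x**2 + 10*y**2."""
    x, y = p
    return np.array([2 * x, 20 * y])

eta = 0.1
p0 = np.array([3.0, 1.0])
g0 = grad_f(p0)

descent_dir = -g0 / np.linalg.norm(g0)   # (a) unit vector of fastest decrease
rate = -np.linalg.norm(g0)               # (b) directional derivative along it
p1 = p0 - eta * g0                       # (c) one gradient-descent step
print("direction:", descent_dir)
print("rate:", round(rate, 2))
print("after one step:", p1)

# Keep stepping: x shrinks by a factor 0.8 each time, but y only flips sign.
path = [p0]
for _ in range(4):
    path.append(path[-1] - eta * grad_f(path[-1]))
for p in path:
    print(p)
```

Because $1 - \eta \cdot 20 = -1$, the $y$-coordinate alternates between $1$ and $-1$ and never approaches $0$, while $x$ decays steadily. Any $\eta$ below $0.1$ would let both coordinates converge for this particular $f$.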
30.10 The Gradient Across the Sciences
Optimization is only one of the gradient's careers. The same vector $\nabla f$ — "how fast and in which direction does $f$ change" — turns out to be the universal grammar of flow across the physical sciences. Wherever something moves from a "more" state to a "less" state, a gradient sets its direction and magnitude.
Physics: force is the negative gradient of potential
A vast swath of physics organizes itself around scalar potential functions whose gradient produces a force:
- Gravity: the gravitational potential $\Phi$ gives force $\mathbf{F} = -m\nabla\Phi$. For $\Phi = -GM/r$ this reproduces Newton's inverse-square law $\mathbf{F} = -\frac{GMm}{r^2}\hat{\mathbf{r}}$.
- Electrostatics: the electric potential $V$ (volts) gives the electric field $\mathbf{E} = -\nabla V$, and the force on charge $q$ is $\mathbf{F} = q\mathbf{E} = -q\nabla V$.
- Springs: the potential energy $U = \tfrac12 kx^2$ gives the restoring force $\mathbf{F} = -\nabla U = -kx\,\hat{\mathbf{x}}$ (Hooke's law).
The recurring minus sign encodes a single physical principle: systems are pushed toward lower potential energy. The gradient points uphill in potential; the force points downhill. This is the same picture as gradient descent — nature, like a learning algorithm, rolls downhill against the gradient.
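The gravity entry above can be checked symbolically. The sketch below differentiates $\Phi = -GM/r$ in Cartesian coordinates and compares $-m\nabla\Phi$ with the inverse-square law component by component; the symbol names are chosen here for illustration.

```python
# Symbolic check that F = -m * grad(Phi) with Phi = -G*M/r is Newton's law.
import sympy as sp

x, y, z = sp.symbols('x y z', real=True)
G, M, m = sp.symbols('G M m', positive=True)
r = sp.sqrt(x**2 + y**2 + z**2)

Phi = -G * M / r
F = [-m * sp.diff(Phi, v) for v in (x, y, z)]          # F = -m * grad(Phi)

# Inverse-square law: magnitude G*M*m/r**2, direction -r_hat = -(x, y, z)/r.
F_newton = [-G * M * m / r**2 * (v / r) for v in (x, y, z)]

residual = [sp.simplify(a - b) for a, b in zip(F, F_newton)]
print(residual)  # [0, 0, 0]
```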
The four "flux = −(conductivity) × gradient" laws
The deepest unification is that four foundational laws of physics and engineering are the same equation with different names attached to the potential and the flux:
| Law | Potential | Flux | Equation |
|---|---|---|---|
| Fourier (heat conduction) | temperature $T$ | heat flux $\mathbf{q}$ | $\mathbf{q} = -k\,\nabla T$ |
| Fick (diffusion) | concentration $c$ | particle flux $\mathbf{J}$ | $\mathbf{J} = -D\,\nabla c$ |
| Darcy (groundwater) | hydraulic head $h$ | flow velocity $\mathbf{v}$ | $\mathbf{v} = -K\,\nabla h$ |
| Ohm (electric current) | electric potential $V$ | current density $\mathbf{J}$ | $\mathbf{J} = -\sigma\,\nabla V$ |
Heat flows from hot to cold, molecules from concentrated to dilute, water from high head to low, charge from high to low potential. In every case the flux is proportional to the negative gradient of a scalar potential, and the proportionality constant ($k$, $D$, $K$, $\sigma$) is a material property. Learn the gradient once and you have learned the mathematical skeleton of thermodynamics, transport, hydrology, and circuit theory simultaneously — a vivid instance of our theme that calculus appears in every quantitative field.
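As a concrete instance of the pattern, here is a minimal sketch of Fourier's law on a grid. The temperature field $T(x,y) = 300 + 50x - 20y^2$ and the conductivity $k = 2$ are hypothetical values chosen so the result can be checked by hand; `np.gradient` supplies the finite-difference $\nabla T$.

```python
# Fourier's law q = -k * grad(T) on a grid (hypothetical field and conductivity).
import numpy as np

k = 2.0                                    # assumed conductivity, W/(m K)
xs = np.linspace(0.0, 1.0, 101)            # x grid, metres
ys = np.linspace(0.0, 1.0, 101)            # y grid, metres
Xg, Yg = np.meshgrid(xs, ys, indexing="ij")
T = 300.0 + 50.0 * Xg - 20.0 * Yg**2       # hypothetical temperature field, K

dTdx, dTdy = np.gradient(T, xs, ys)        # finite-difference gradient of T
qx, qy = -k * dTdx, -k * dTdy              # heat flux components

i, j = 50, 50                              # grid point (x, y) = (0.5, 0.5)
print(f"q at (0.5, 0.5) = ({qx[i, j]:.1f}, {qy[i, j]:.1f}) W/m^2")
# By hand: grad T = (50, -40*y) = (50, -20) there, so q = (-100, 40).
```

Swapping in a concentration field and $D$, a head field and $K$, or a voltage field and $\sigma$ gives the other three laws with no change to the code beyond the names.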
The Key Insight. The gradient is the mathematics of "downhill." Whenever a quantity flows from more to less — heat from hot to cold, ink from dark to light, parameters from high loss to low loss — the gradient of the governing potential sets the direction and rate of that flow. Master $\nabla f$ and you hold the common key to optimization, physics, transport, and machine learning.
Image processing: the gradient finds edges
A grayscale image is a function $I(x, y)$ giving intensity at each pixel. An edge is where intensity changes abruptly — exactly where $\|\nabla I\|$ is large. Classic edge detectors (Sobel, Canny) approximate $\nabla I$ with finite differences, compute its magnitude, and threshold: high gradient $\Rightarrow$ edge. This is foundational to computer vision, medical imaging, and self-driving cars.
# Edge detection via the image gradient magnitude
import numpy as np
from scipy import ndimage
img = np.zeros((100, 100))
img[30:70, 30:70] = 1.0 # a white square on black
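gx = ndimage.sobel(img, axis=1) # ∂I/∂x (approx.): axis 1 runs along columns, the horizontal direction
gy = ndimage.sobel(img, axis=0) # ∂I/∂y (approx.): axis 0 runs along rows, the vertical direction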
edge_strength = np.sqrt(gx**2 + gy**2) # |∇I| -- large only at the square's border
print("max |∇I| (on edges):", round(edge_strength.max(), 2))
print("min |∇I| (flat regions):", round(edge_strength.min(), 2)) # 0 inside/outside
# The gradient magnitude lights up precisely the four sides of the square.
Real-World Application — Chemotaxis (biology). Single cells navigate without eyes by gradient detection. A bacterium swimming toward food, or an immune cell hunting a pathogen, samples the concentration $c$ of a chemical signal and biases its motion up the gradient $\nabla c$ (chemoattractant) or down it (chemorepellent). The cell is, in effect, running noisy gradient ascent on a chemical landscape — the same algorithm as §30.8, implemented in wetware by evolution a billion years before anyone wrote it down.
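The "noisy gradient ascent" picture can be caricatured in a few lines. This is a sketch under strong assumptions, not a biological model: the Gaussian concentration field, the source location, the speed and noise constants, and the idea that the cell reads $\nabla c$ directly are all hypothetical simplifications.

```python
# Noisy gradient ascent on a chemical landscape (illustrative caricature).
import numpy as np

rng = np.random.default_rng(1)
source = np.array([2.0, 1.0])                 # hypothetical food source

def concentration(p):
    """Assumed Gaussian plume c(p) = exp(-|p - source|^2 / 4)."""
    return np.exp(-np.sum((p - source) ** 2) / 4.0)

def grad_c(p):
    """Analytic gradient of the plume: -(p - source)/2 * c(p)."""
    return -(p - source) / 2.0 * concentration(p)

pos = np.array([0.0, 0.0])
start_distance = np.linalg.norm(pos - source)
speed, noise = 0.5, 0.05                      # assumed step scale and jitter
for _ in range(400):
    # Drift up the gradient plus a random kick.
    pos = pos + speed * grad_c(pos) + noise * rng.standard_normal(2)

end_distance = np.linalg.norm(pos - source)
print(f"distance to source: {start_distance:.2f} -> {end_distance:.2f}")
```

The update is the ascent form of the rule from §30.8, with the sign of the gradient term flipped and a noise term added.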
30.11 Computing Gradients in Python
Our recurring theme — hand computation builds understanding; machine computation builds power — applies with full force to gradients. You should be able to differentiate by hand to understand what $\nabla f$ means; the machine then lets you compute it for functions far beyond hand reach.
Exact symbolic gradients with sympy:
# Symbolic gradient: differentiate exactly
import sympy as sp
x, y, z = sp.symbols('x y z')
f = x**2 + 2*y**2 + 3*z**2
grad_f = [sp.diff(f, v) for v in (x, y, z)]
print(grad_f) # [2*x, 4*y, 6*z]
Approximate numerical gradients with finite differences:
# Numerical gradient via centered finite differences
import numpy as np
from typing import Callable
def numerical_gradient(f: Callable, point: np.ndarray, h: float = 1e-6) -> np.ndarray:
    point = np.asarray(point, dtype=float)  # guard: an integer array would truncate h to 0
    grad = np.zeros_like(point)
    for i in range(len(point)):
        step = np.zeros_like(point)
        step[i] = h  # perturb only coordinate i
        grad[i] = (f(point + step) - f(point - step)) / (2 * h)  # central difference
    return grad
f = lambda p: p[0]**2 + 2*p[1]**2 + 3*p[2]**2
print(numerical_gradient(f, np.array([1.0, 2.0, 3.0]))) # ~ [2. 8. 18.] (= <2x,4y,6z>)
Both agree with the hand answer $\nabla f = \langle 2x, 4y, 6z\rangle = \langle 2, 8, 18\rangle$ at $(1,2,3)$. For training machine-learning models, neither approach is used in practice: symbolic expressions swell as the function grows, and central differences need $2n$ function evaluations for $n$ parameters. Automatic differentiation (PyTorch, TensorFlow, JAX) instead applies the chain rule to the computational graph and returns gradients that are exact up to floating-point rounding. In reverse mode the whole gradient costs only a small constant multiple of one function evaluation, which is what makes billions of parameters feasible. It is the multivariable chain rule of §30.2, industrialized.
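To see what "applying the chain rule to the computational graph" means mechanically, here is a toy reverse-mode differentiator that supports only `+` and `*`. The class name `Var` and its `backward` method are invented for this sketch; they are not the API of PyTorch, TensorFlow, or JAX, which are far more general.

```python
# A toy reverse-mode automatic differentiator (illustrative only).
class Var:
    """A scalar node in a computational graph."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs (parent_node, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__

    def backward(self):
        """Accumulate d(self)/d(node) into node.grad for every node."""
        order, seen = [], set()

        def visit(node):                 # depth-first topological sort
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):     # output first, inputs last
            for parent, local in node.parents:
                parent.grad += node.grad * local   # chain rule

x, y, z = Var(1.0), Var(2.0), Var(3.0)
f = x * x + 2 * y * y + 3 * z * z
f.backward()
print([x.grad, y.grad, z.grad])  # [2.0, 8.0, 18.0]
```

One forward evaluation builds the graph and one backward sweep fills in all three partial derivatives, matching the hand answer $\langle 2, 8, 18\rangle$. The backward sweep's cost grows with the size of the graph, not with the number of inputs.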
30.12 Looking Forward: Constrained Optimization
Gradient descent finds unconstrained minima — the bottom of a valley with no fences. But many real problems carry constraints: maximize utility subject to a budget; minimize material subject to a fixed volume. The key tool, previewed here and developed fully in Chapter 31, is the method of Lagrange multipliers.
The geometric idea is pure gradient reasoning. To optimize $f(x,y)$ subject to a constraint $g(x,y) = 0$, look at where a level curve of $f$ just touches (is tangent to) the constraint curve $g = 0$. At such a point the two curves share a normal direction, so their gradients are parallel:
$$\nabla f = \lambda\,\nabla g \quad\text{for some scalar } \lambda.$$
If the gradients were not parallel, the constraint curve would cross the level curves of $f$ transversally, and you could slide along $g = 0$ to a higher value of $f$ — so you would not yet be at the optimum. The gradient's perpendicularity-to-level-sets (Superpower 2) is exactly what makes this work. Chapter 31 turns this picture into a complete computational method and pairs it with the second-derivative (Hessian) test for classifying the critical points gradient descent finds.
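As a preview, the parallel-gradient condition can be solved directly for a small example. The problem chosen here, maximizing $f = xy$ subject to $x + y = 10$, is a hypothetical illustration and not one taken from Chapter 31.

```python
# Solve grad f = lambda * grad g together with g = 0 (illustrative example).
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x * y              # objective (hypothetical)
g = x + y - 10         # constraint g(x, y) = 0 (hypothetical)

equations = [
    sp.diff(f, x) - lam * sp.diff(g, x),   # f_x = lambda * g_x
    sp.diff(f, y) - lam * sp.diff(g, y),   # f_y = lambda * g_y
    g,                                     # stay on the constraint curve
]
solutions = sp.solve(equations, [x, y, lam], dict=True)
print(solutions)  # one solution: x = 5, y = 5, lambda = 5
```

At $(5,5)$ the gradients are $\nabla f = \langle 5, 5\rangle$ and $\nabla g = \langle 1, 1\rangle$, which are parallel with $\lambda = 5$, exactly the tangency picture described above.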
Add to Your Modeling Portfolio. Add a gradient to your model: the multivariable rate of change that drives optimization and flow.
- Biology: model chemotaxis. A cell at position $\mathbf{x}$ in a chemoattractant field $c(x,y)$ moves with velocity proportional to $\nabla c$; simulate its path toward the source.
- Economics: compute the gradient of a utility function $U(x,y)$ and use steepest ascent to trace a consumer's path toward higher utility; overlay the iso-utility (level) curves.
- Physics: pick a potential $\Phi$ (gravitational $-GM/r$, or a spring $\tfrac12 k r^2$) and compute the force field $\mathbf{F} = -\nabla\Phi$; verify it points "downhill."
- Data Science: implement gradient descent to fit a model (start with the linear regression of §30.9), then visualize the loss surface as a contour plot with the descent trajectory drawn on top. Watch each step leave its contour perpendicularly, and notice that in an elongated valley the path zigzags rather than heading straight for the minimum.
30.13 Reflection: Why This Is the Peak
In Chapter 6 the derivative was a number — a slope, a single rate of change. Everything since has been an elaboration of that one idea, and here it reaches its mature form. The gradient $\nabla f$ is the derivative grown up: it carries not just how fast but which way, fusing the rate-of-change theme of Chapter 1 with the geometry-and-algebra theme that has run through every chapter. Its three superpowers — steepest ascent, perpendicularity to level sets, normality to surfaces — are one fact about the dot product, seen from three angles.
And from that single object flows an astonishing range of consequences: the tangent planes of geometry, the force laws of physics, the conduction and diffusion laws of engineering, the edge detectors of computer vision, and, above all, gradient descent, the algorithm behind the training of virtually every modern neural network. When Chapter 1 called this the conceptual peak, this is what it meant: more applied calculus, in more fields, traces back to the gradient than to any other single idea in the book.
Chapter 31 descends from the peak into the valley of optimization in several variables: critical points, the second-derivative (Hessian) test, and constrained extrema via Lagrange multipliers. The gradient locates the flat spots; the next chapter tells you which are peaks, which are valleys, and which are the treacherous saddle points in between. But the foundation is in place. You have reached the center of multivariable calculus — and, not coincidentally, the mathematical heart of modern artificial intelligence.