Chapter 30 — Key Takeaways
The single big idea: the gradient $\nabla f$ is the multivariable derivative. Where $f'(x)$ was a single signed number (a rate, with only "up" or "down" for direction), $\nabla f$ is a vector (a magnitude and a direction in space). That upgrade is what lets calculus optimize functions of many variables, including the loss functions used to train neural networks.
1. The Multivariable Chain Rule (§30.2)
Change propagates through layered dependencies by multiplying along each path and summing the paths.
- One independent variable ($z = f(x, y)$, $x = x(t)$, $y = y(t)$): $$\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}.$$
- Several independent variables ($x = x(s,t)$, $y = y(s,t)$): one such formula per outer variable, holding the others fixed.
- Tree diagram: branch $z$ to each intermediate, then to each outer variable; trace every path from top to the target, multiply edge labels along a path, add the paths. Handles any dependency structure.
- Implicit form (Case 3): if $F(x,y)=0$, then $\dfrac{dy}{dx} = -\dfrac{F_x}{F_y}$ (when $F_y \neq 0$) — sharpening Chapter 8's implicit differentiation.
- This tree is a computational graph; running it backward is backpropagation.
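As a quick sanity check on the one-variable case, the sketch below compares the chain-rule formula against a central-difference estimate of $dz/dt$. The functions $f(x,y) = x^2 y$, $x(t) = \cos t$, $y(t) = \sin t$ are assumptions chosen only for illustration; they do not come from the chapter.

```python
import math

# Hypothetical example functions, chosen only for illustration.
def f(x, y):
    return x * x * y

def x_of(t):
    return math.cos(t)

def y_of(t):
    return math.sin(t)

def dz_dt_chain(t):
    """Chain rule: f_x * dx/dt + f_y * dy/dt, with hand-computed partials."""
    x, y = x_of(t), y_of(t)
    f_x = 2 * x * y           # partial of x^2 y with respect to x
    f_y = x * x               # partial of x^2 y with respect to y
    dx_dt = -math.sin(t)
    dy_dt = math.cos(t)
    return f_x * dx_dt + f_y * dy_dt

def dz_dt_numeric(t, h=1e-6):
    """Central difference of the composite z(t) = f(x(t), y(t))."""
    def z(s):
        return f(x_of(s), y_of(s))
    return (z(t + h) - z(t - h)) / (2 * h)

t0 = 0.7
print(dz_dt_chain(t0), dz_dt_numeric(t0))  # the two values should agree closely
```

Dropping either inner derivative (`dx_dt` or `dy_dt`) from `dz_dt_chain` makes the two numbers disagree, which is a cheap way to catch the most common chain-rule slip.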
2. The Gradient (§30.3)
$$\nabla f = \left\langle \frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \frac{\partial f}{\partial z}\right\rangle = \langle f_x, f_y, f_z\rangle.$$
The partials, bundled into one vector. Read "del $f$" or "grad $f$." The operator $\nabla$ turns a scalar function into a vector — its first of three jobs (divergence and curl come in Chapters 34–37).
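A minimal sketch of "the partials, bundled into one vector": the helper below (a hypothetical `numerical_gradient`, not a library function) estimates each partial by a central difference and collects them in a list. The test function $f = x^2 y + z^3$ is an assumption for illustration.

```python
def numerical_gradient(f, point, h=1e-6):
    """Estimate the gradient of f at `point` by central differences.

    One partial per coordinate: nudge that coordinate by +/- h,
    hold the others fixed, and take the symmetric difference quotient.
    """
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(*forward) - f(*backward)) / (2 * h))
    return grad

# Hypothetical test function; its analytic gradient is <2xy, x^2, 3z^2>.
def f(x, y, z):
    return x * x * y + z ** 3

g = numerical_gradient(f, [1.0, 2.0, 3.0])
print(g)  # close to the analytic gradient <4, 1, 27> at (1, 2, 3)
```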
3. The Directional Derivative (§30.5)
The rate of change of $f$ as you step in a unit direction $\mathbf u$:
$$\boxed{\,D_{\mathbf u}f = \nabla f\cdot \mathbf u\,}\qquad(\|\mathbf u\| = 1).$$
- Choosing $\mathbf u = \langle 1,0\rangle$ recovers $f_x$; $\langle 0,1\rangle$ recovers $f_y$. The directional derivative contains the partials.
- Always normalize $\mathbf u$ first. Using a raw vector $\mathbf v$ scales the answer by a factor of $\|\mathbf v\|$.
- Geometric form: $D_{\mathbf u}f = \|\nabla f\|\cos\theta$, where $\theta$ is the angle between $\nabla f$ and $\mathbf u$.
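The normalization step is easy to see numerically. In the sketch below, the gradient $\langle 4, 1\rangle$ and the raw direction $\langle 3, 4\rangle$ are assumed values for illustration (for instance, $f = x^2 y$ at $(1, 2)$ has this gradient); `directional_derivative` is a hypothetical helper name.

```python
import math

def directional_derivative(grad, v):
    """Return grad . u, where u is v rescaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("direction vector must be nonzero")
    u = [c / norm for c in v]
    return sum(g * c for g, c in zip(grad, u))

grad = [4.0, 1.0]   # assumed gradient at the point of interest
v = [3.0, 4.0]      # raw direction with length 5, not a unit vector

correct = directional_derivative(grad, v)       # approximately 3.2
raw = sum(g * c for g, c in zip(grad, v))       # 16.0, too large by the factor 5
print(correct, raw)

# The coordinate directions recover the partials.
print(directional_derivative(grad, [1.0, 0.0]))  # f_x = 4.0
print(directional_derivative(grad, [0.0, 1.0]))  # f_y = 1.0
```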
4. Steepest Ascent and the Maximum Rate (§30.4–30.5)
From $D_{\mathbf u}f = \|\nabla f\|\cos\theta$, swinging $\mathbf u$ around varies only $\cos\theta \in [-1, 1]$:
- Fastest increase: along $\nabla f$ itself ($\theta = 0$), at rate $\|\nabla f\|$.
- Fastest decrease: along $-\nabla f$ ($\theta = \pi$), at rate $-\|\nabla f\|$ — steepest descent, the engine of gradient descent.
- Zero change: perpendicular to $\nabla f$ ($\theta = \pi/2$) — you are moving along a level set.
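The three cases above can be checked by brute force: sweep the unit vector $\mathbf u = \langle\cos\theta, \sin\theta\rangle$ around the circle and watch where $\nabla f\cdot\mathbf u$ peaks. The gradient $\langle 3, 4\rangle$ is an assumed value for illustration, and the one-degree sweep is only a rough sketch.

```python
import math

grad = (3.0, 4.0)  # assumed gradient at the point of interest
grad_norm = math.sqrt(grad[0] ** 2 + grad[1] ** 2)  # 5.0

def rate(theta_deg):
    """Directional derivative along the unit vector at angle theta_deg."""
    t = math.radians(theta_deg)
    return grad[0] * math.cos(t) + grad[1] * math.sin(t)

# Sample one direction per degree.
rates = {deg: rate(deg) for deg in range(360)}
best_deg = max(rates, key=rates.get)    # direction of fastest increase
worst_deg = min(rates, key=rates.get)   # direction of fastest decrease

grad_deg = math.degrees(math.atan2(grad[1], grad[0]))  # angle of the gradient itself

print(best_deg, rates[best_deg])    # near grad_deg, rate near +||grad f||
print(worst_deg, rates[worst_deg])  # near grad_deg + 180, rate near -||grad f||
print(rate(grad_deg + 90))          # perpendicular direction: essentially zero
```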
5. The Gradient ⟂ Level Sets, and Tangent Planes (§30.4, §30.6)
- $\nabla f$ is perpendicular (normal) to level curves $f = c$ in 2D and to level surfaces $F = c$ in 3D. (Not tangent — a common error.)
- Tangent plane to a level surface $F(x,y,z) = 0$ at $(x_0,y_0,z_0)$, using the normal $\nabla F$: $$F_x(x - x_0) + F_y(y - y_0) + F_z(z - z_0) = 0.$$
- For a graph $z = f(x, y)$, set $F = f(x,y) - z$ to recover the linearization $$z = z_0 + f_x(x_0,y_0)(x - x_0) + f_y(x_0,y_0)(y - y_0),$$ where $z_0 = f(x_0, y_0)$. This is the tangent-plane formula introduced in Chapter 29 and the multivariable cousin of the Chapter 11 tangent line.
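To see the linearization at work, the sketch below builds the tangent plane to the graph of an assumed example function, $f = x^2 + y^2$ at $(1, 2)$, and compares it with the true value at a nearby point. The function and the base point are illustrative choices, not taken from the chapter.

```python
def f(x, y):
    return x * x + y * y

def linearization(x, y, x0=1.0, y0=2.0):
    """Tangent plane to z = f(x, y) at (x0, y0), evaluated at (x, y)."""
    z0 = f(x0, y0)
    f_x = 2 * x0   # partial of x^2 + y^2 with respect to x, at the base point
    f_y = 2 * y0   # partial of x^2 + y^2 with respect to y, at the base point
    return z0 + f_x * (x - x0) + f_y * (y - y0)

exact = f(1.01, 2.01)
approx = linearization(1.01, 2.01)
print(exact, approx, exact - approx)  # the gap is second order in the step size
```

Here the normal to the level surface $F = f(x,y) - z = 0$ is $\langle f_x, f_y, -1\rangle = \langle 2, 4, -1\rangle$, which is the same information the plane's coefficients carry.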
6. Gradient Descent — The Master Algorithm (§30.8–30.9)
To minimize $f$: repeat $$\mathbf x_{k+1} = \mathbf x_k - \eta\,\nabla f(\mathbf x_k),$$ until $\|\nabla f\|$ is small. To maximize (steepest ascent), step $+\eta\nabla f$.
- $\eta > 0$ is the learning rate / step size, the most important and most dangerous knob. Too small: glacial. Too large: overshoot and divergence. For $f = x^2 + 10y^2$, each step multiplies $y$ by $(1 - 20\eta)$, so stability needs $|1 - 20\eta| < 1$, i.e. $\eta < 0.1$ (set by the steep $y$-direction).
- Why it works: $-\nabla f$ is the steepest-descent direction, and $f(\mathbf x_{k+1}) \approx f(\mathbf x_k) - \eta\|\nabla f\|^2$ goes down for small $\eta$.
- Machine learning: training = minimizing a loss $L(\boldsymbol\theta)$ on the loss surface; the update $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\nabla L$ is the training loop. $\nabla L$ comes from backpropagation (the §30.2 chain rule). SGD, momentum, and Adam are all variants of the one line $$\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\nabla L.$$
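The learning-rate threshold for $f = x^2 + 10y^2$ can be seen directly by running the update rule on either side of $\eta = 0.1$. This is a minimal sketch: the starting point $(1, 1)$, the step count, and the two test values of $\eta$ are assumptions for illustration.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x^2 + 10 y^2."""
    return 2 * x, 20 * y

def gradient_descent(eta, start=(1.0, 1.0), steps=100):
    """Apply x_{k+1} = x_k - eta * grad f(x_k) for a fixed number of steps."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y

stable = gradient_descent(eta=0.09)    # just below the threshold
unstable = gradient_descent(eta=0.11)  # just above the threshold

print(stable)    # both coordinates shrink toward the minimum at (0, 0)
print(unstable)  # x still shrinks, but y grows in magnitude every step
```

With $\eta = 0.11$ the $x$-coordinate still converges; it is the steep $y$-direction alone that blows up, which is why the steepest direction sets the limit.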
7. The Gradient Across the Sciences (§30.10)
The gradient is the mathematics of "downhill" / flow:
- Physics: force is the negative gradient of potential, $\mathbf F = -\nabla\Phi$ (gravity, electrostatics $\mathbf E = -\nabla V$, springs).
- Transport laws all share the form flux $= -(\text{conductivity})\times$ gradient: Fourier ($\mathbf q = -k\nabla T$), Fick ($\mathbf J = -D\nabla c$), Darcy ($\mathbf v = -K\nabla h$), Ohm ($\mathbf J = -\sigma\nabla V$).
- Image processing: edges are where $\|\nabla I\|$ is large (Sobel, Canny).
- Biology: chemotaxis is gradient ascent/descent on a chemical field.
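As one concrete instance from the list above, the sketch below computes $\|\nabla I\|$ on a tiny synthetic image with a vertical edge. It uses plain central differences, a deliberately simplified stand-in rather than the Sobel or Canny operators; the image and the helper name `gradient_magnitude` are assumptions for illustration.

```python
import math

# Synthetic 5x6 grayscale image: dark on the left, bright on the right.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]

def gradient_magnitude(img, r, c):
    """Central-difference estimate of ||grad I|| at an interior pixel (r, c)."""
    i_x = (img[r][c + 1] - img[r][c - 1]) / 2.0   # change across columns
    i_y = (img[r + 1][c] - img[r - 1][c]) / 2.0   # change across rows
    return math.hypot(i_x, i_y)

# Scan the interior pixels of the middle row.
row = [gradient_magnitude(image, 2, c) for c in range(1, 5)]
print(row)  # [0.0, 0.5, 0.5, 0.0]: large only at the two columns straddling the edge
```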
Common Errors to Avoid
- Dropping the inner derivative in the chain rule: $\frac{dz}{dt} = f_x + f_y$ is wrong; you need $f_x\frac{dx}{dt} + f_y\frac{dy}{dt}$.
- Not normalizing $\mathbf u$ before $D_{\mathbf u}f = \nabla f\cdot\mathbf u$ — the formula requires a unit vector.
- Calling $\nabla f$ tangent to a level set — it is normal to it.
- Confusing direction with rate: the steepest-ascent direction is $\nabla f$; the maximum rate is the scalar $\|\nabla f\|$.
- A reckless learning rate: too large an $\eta$ makes gradient descent diverge, not converge.
- Forgetting that the formula needs differentiability: $D_{\mathbf u}f = \nabla f\cdot\mathbf u$ requires $f$ to be differentiable at the point (continuous partials suffice).
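The differentiability requirement is not a technicality. The sketch below uses a standard counterexample, assumed here for illustration: $f(x,y) = \dfrac{x^2 y}{x^2 + y^2}$ with $f(0,0) = 0$. Both partials exist at the origin and equal zero, so $\nabla f\cdot\mathbf u = 0$, yet the actual rate of change along $\mathbf u = \langle a, b\rangle$ is $a^2 b$.

```python
import math

def f(x, y):
    """Counterexample: every directional derivative exists at the origin, but f is not differentiable there."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * x * y / (x * x + y * y)

def difference_quotient(u, h):
    """(f(0 + h u) - f(0)) / h, the rate of change from the origin along u."""
    return (f(h * u[0], h * u[1]) - f(0.0, 0.0)) / h

u = (1 / math.sqrt(2), 1 / math.sqrt(2))       # unit vector <a, b>

true_rate = difference_quotient(u, 1e-6)        # a^2 * b, about 0.3536
f_x0 = difference_quotient((1.0, 0.0), 1e-6)    # partial in x at the origin: 0
f_y0 = difference_quotient((0.0, 1.0), 1e-6)    # partial in y at the origin: 0
formula = f_x0 * u[0] + f_y0 * u[1]             # what grad f . u would predict: 0

print(true_rate, formula)  # the two disagree, so the formula fails here
```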
Connections
- Back to Chapter 7 (single-variable chain rule — the multivariable rule adds a sum over paths) and Chapter 8 (implicit differentiation, now two lines).
- Back to Chapter 6, where the derivative first told us "which way to step": the gradient-descent thread that culminates here.
- Back to Chapter 11 (linear approximation / tangent line) and Chapter 29 (partial derivatives, tangent planes for graphs).
- Forward to Chapter 31: critical points where $\nabla f = \mathbf 0$, the Hessian (second-derivative) test, and Lagrange multipliers for constrained optimization (where $\nabla f = \lambda\nabla g$).
- Forward to Chapters 34–37: the operator $\nabla$ returns as divergence $\nabla\cdot\mathbf F$ and curl $\nabla\times\mathbf F$.
The summit. More applied calculus, in more fields, traces back to the gradient than to any other single idea in this book. You have reached the conceptual peak — and the mathematical heart of modern AI.