Chapter 30 — Key Takeaways

The single big idea: the gradient $\nabla f$ is the multivariable derivative. Where $f'(x)$ was a single signed number (a rate of change along the one available axis), $\nabla f$ is a vector (a magnitude and a direction in the input space). That upgrade is what lets calculus optimize functions of many variables, including the loss functions used to train neural networks.


1. The Multivariable Chain Rule (§30.2)

Change propagates through layered dependencies by multiplying along each path and summing the paths.

  • One independent variable ($z = f(x, y)$, $x = x(t)$, $y = y(t)$): $$\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}.$$
  • Several independent variables ($x = x(s,t)$, $y = y(s,t)$): one such formula per outer variable, holding the others fixed.
  • Tree diagram: branch $z$ to each intermediate variable, then to each outer variable; trace every path from $z$ at the top down to the target variable, multiply edge labels along a path, add the paths. Handles any dependency structure.
  • Implicit form (Case 3): if $F(x,y)=0$, then $\dfrac{dy}{dx} = -\dfrac{F_x}{F_y}$ (when $F_y \neq 0$) — sharpening Chapter 8's implicit differentiation.
  • This tree is a computational graph; running it backward is backpropagation.
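The one-variable chain rule and the implicit form above can be checked numerically. The sketch below is illustrative only: the function $f(x,y) = x^2 y$, the path $x = \cos t$, $y = \sin t$, and the circle $F = x^2 + y^2 - 1$ are assumptions chosen for the example, not taken from the chapter.

```python
import math

def f(x, y):
    # Assumed example function: f(x, y) = x^2 * y
    return x**2 * y

def chain_rule_dz_dt(t):
    # dz/dt = f_x * dx/dt + f_y * dy/dt along x = cos t, y = sin t
    x, y = math.cos(t), math.sin(t)
    fx, fy = 2 * x * y, x**2
    dx_dt, dy_dt = -math.sin(t), math.cos(t)
    return fx * dx_dt + fy * dy_dt

def finite_difference_dz_dt(t, h=1e-6):
    # Central difference on the composed function z(t) = f(x(t), y(t))
    z = lambda s: f(math.cos(s), math.sin(s))
    return (z(t + h) - z(t - h)) / (2 * h)

def implicit_slope(x, y):
    # dy/dx = -F_x / F_y for the assumed curve F(x, y) = x^2 + y^2 - 1 = 0
    Fx, Fy = 2 * x, 2 * y
    if Fy == 0:
        raise ValueError("F_y = 0: the implicit formula does not apply here")
    return -Fx / Fy

t0 = 0.7
print(chain_rule_dz_dt(t0), finite_difference_dz_dt(t0))  # should agree closely

# On the upper unit circle at x = 0.6, y = 0.8 the slope is -x/y = -0.75
print(implicit_slope(0.6, 0.8))
```

The two printed derivatives should match to several decimal places; the finite difference is only a sanity check, not a replacement for the formula.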

2. The Gradient (§30.3)

$$\nabla f = \left\langle \frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \frac{\partial f}{\partial z}\right\rangle = \langle f_x, f_y, f_z\rangle.$$

The partials, bundled into one vector. Read "del $f$" or "grad $f$." The operator $\nabla$ turns a scalar function into a vector — its first of three jobs (divergence and curl come in Chapters 34–37).
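As a rough illustration of "the partials, bundled into one vector," the sketch below compares a hand-computed gradient with a central-difference estimate. The function $f(x,y,z) = x^2 y + z^3$ and the helper name `numerical_gradient` are assumptions made for this example.

```python
def f(x, y, z):
    # Assumed example function
    return x**2 * y + z**3

def grad_f(x, y, z):
    # Analytic gradient <f_x, f_y, f_z>
    return [2 * x * y, x**2, 3 * z**2]

def numerical_gradient(func, point, h=1e-6):
    # Estimate each partial by nudging one coordinate and holding the rest fixed
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((func(*forward) - func(*backward)) / (2 * h))
    return grad

p = (1.0, 2.0, -1.0)
print(grad_f(*p))                 # [4.0, 1.0, 3.0]
print(numerical_gradient(f, p))   # approximately the same three numbers
```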


3. The Directional Derivative (§30.5)

The rate of change of $f$ as you step in a unit direction $\mathbf u$:

$$\boxed{\,D_{\mathbf u}f = \nabla f\cdot \mathbf u\,}\qquad(\|\mathbf u\| = 1).$$

  • Choosing $\mathbf u = \langle 1,0\rangle$ recovers $f_x$; $\langle 0,1\rangle$ recovers $f_y$. The directional derivative contains the partials.
  • Always normalize $\mathbf u$ first. A raw (non-unit) vector $\mathbf v$ scales the answer by a factor of $\|\mathbf v\|$: inflated if $\|\mathbf v\| > 1$, deflated if $\|\mathbf v\| < 1$.
  • Geometric form: $D_{\mathbf u}f = \|\nabla f\|\cos\theta$, where $\theta$ is the angle between $\nabla f$ and $\mathbf u$.
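The normalization warning is easy to see in a small computation. This is a minimal sketch, assuming the example function $f(x,y) = x^2 + 3xy$ at the point $(1,2)$ and the raw direction $\langle 3,4\rangle$; none of these come from the chapter.

```python
import math

def grad_f(x, y):
    # Gradient of the assumed example f(x, y) = x^2 + 3xy
    return (2 * x + 3 * y, 3 * x)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def directional_derivative(grad, v):
    # Normalize v to a unit vector u, then take grad . u
    norm = math.hypot(*v)
    if norm == 0:
        raise ValueError("direction vector must be nonzero")
    u = tuple(c / norm for c in v)
    return dot(grad, u)

g = grad_f(1.0, 2.0)        # (8.0, 3.0)
v = (3.0, 4.0)              # raw direction with length 5

print(directional_derivative(g, v))   # about 7.2, the true rate
print(dot(g, v))                      # 36.0, too large by the factor ||v|| = 5
```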

4. Steepest Ascent and the Maximum Rate (§30.4–30.5)

From $D_{\mathbf u}f = \|\nabla f\|\cos\theta$, swinging $\mathbf u$ around varies only $\cos\theta \in [-1, 1]$:

  • Fastest increase: along $\nabla f$ itself ($\theta = 0$), at rate $\|\nabla f\|$.
  • Fastest decrease: along $-\nabla f$ ($\theta = \pi$), at rate $-\|\nabla f\|$ — steepest descent, the engine of gradient descent.
  • Zero change: perpendicular to $\nabla f$ ($\theta = \pi/2$) — you are moving along a level set.
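The three cases above can be confirmed by brute force: sweep unit directions around the circle and see which gives the largest and smallest rate. The gradient value $\langle 8, 3\rangle$ and the sampling resolution are assumptions for this sketch.

```python
import math

grad = (8.0, 3.0)                 # assumed gradient at some point
grad_norm = math.hypot(*grad)

def rate_along(angle):
    # Directional derivative along the unit vector <cos(angle), sin(angle)>
    u = (math.cos(angle), math.sin(angle))
    return grad[0] * u[0] + grad[1] * u[1]

n = 3600                          # 0.1 degree resolution
angles = [2 * math.pi * k / n for k in range(n)]
best_angle = max(angles, key=rate_along)
worst_angle = min(angles, key=rate_along)
grad_angle = math.atan2(grad[1], grad[0])

print(best_angle, grad_angle)                    # best direction is close to the gradient's
print(rate_along(best_angle), grad_norm)         # maximum rate is close to ||grad f||
print(rate_along(worst_angle), -grad_norm)       # minimum rate is close to -||grad f||
print(rate_along(grad_angle + math.pi / 2))      # perpendicular: essentially zero
```

The agreement is only as fine as the angular grid, so the sampled maximum falls slightly short of $\|\nabla f\|$.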

5. The Gradient ⟂ Level Sets, and Tangent Planes (§30.4, §30.6)

  • $\nabla f$ is perpendicular (normal) to level curves $f = c$ in 2D and to level surfaces $F = c$ in 3D. (Not tangent — a common error.)
  • Tangent plane to a level surface $F(x,y,z) = 0$ at $(x_0,y_0,z_0)$, using the normal $\nabla F$: $$F_x(x - x_0) + F_y(y - y_0) + F_z(z - z_0) = 0.$$
  • For a graph $z = f(x, y)$, set $F = f(x,y) - z$ to recover the linearization $$z = z_0 + f_x(x_0,y_0)(x - x_0) + f_y(x_0,y_0)(y - y_0).$$ This is the multivariable cousin of the Chapter 11 tangent line, and it is the same tangent-plane formula introduced in Chapter 29.
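Both tangent-plane formulas can be tried on small examples. The sketch below assumes the sphere $F = x^2 + y^2 + z^2 - 14$ at $(1,2,3)$ and the graph $f(x,y) = x^2 + y^2$ at $(1,2)$; the function names are hypothetical.

```python
def grad_F(x, y, z):
    # Gradient of the assumed level-surface function F = x^2 + y^2 + z^2 - 14
    return (2 * x, 2 * y, 2 * z)

def plane_residual(normal, p0, p):
    # Left-hand side F_x(x - x0) + F_y(y - y0) + F_z(z - z0); zero means p is on the plane
    return sum(n * (c - c0) for n, c0, c in zip(normal, p0, p))

p0 = (1.0, 2.0, 3.0)
normal = grad_F(*p0)              # (2.0, 4.0, 6.0)

# p0 + <2, -1, 0> lies on the tangent plane, since <2, 4, 6> . <2, -1, 0> = 0
print(plane_residual(normal, p0, (3.0, 1.0, 3.0)))   # 0.0

def f(x, y):
    # Assumed example graph z = f(x, y)
    return x**2 + y**2

def linearization(x, y, x0=1.0, y0=2.0):
    # z = z0 + f_x (x - x0) + f_y (y - y0), with f_x = 2x0 and f_y = 2y0
    fx, fy = 2 * x0, 2 * y0
    return f(x0, y0) + fx * (x - x0) + fy * (y - y0)

# Near (1, 2) the plane tracks the surface: about 4.82 versus about 4.8
print(f(1.1, 1.9), linearization(1.1, 1.9))
```

The gap of roughly 0.02 between the two printed values is the second-order error that the linearization ignores.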

6. Gradient Descent — The Master Algorithm (§30.8–30.9)

To minimize $f$: repeat $$\mathbf x_{k+1} = \mathbf x_k - \eta\,\nabla f(\mathbf x_k),$$ until $\|\nabla f\|$ is small. To maximize (steepest ascent), step $+\eta\nabla f$.

  • $\eta > 0$ is the learning rate / step size, the most important and most dangerous knob. Too small: glacial. Too large: overshoot and divergence. For $f = x^2 + 10y^2$, each step multiplies $x$ by $(1 - 2\eta)$ and $y$ by $(1 - 20\eta)$, so stability needs $|1 - 20\eta| < 1$, i.e. $\eta < 0.1$ (set by the steep $y$-direction).
  • Why it works: $-\nabla f$ is the steepest-descent direction, and to first order $f(\mathbf x_{k+1}) \approx f(\mathbf x_k) - \eta\|\nabla f(\mathbf x_k)\|^2$, which is a decrease for small enough $\eta$ whenever $\nabla f(\mathbf x_k) \neq \mathbf 0$.
  • Machine learning: training = minimizing a loss $L(\boldsymbol\theta)$ on the loss surface; the update $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\nabla L$ is the training loop. $\nabla L$ comes from backpropagation (the §30.2 chain rule). SGD, momentum, and Adam are all variants of the one line $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\nabla L$.
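The update rule and the learning-rate threshold can be seen in a few lines. This is a minimal sketch of plain gradient descent on $f = x^2 + 10y^2$; the start point $(1,1)$, the tolerance, the step cap, and the two trial values of $\eta$ are assumptions chosen to sit on either side of $0.1$.

```python
import math

def grad_f(x, y):
    # Gradient of f(x, y) = x^2 + 10 y^2
    return (2 * x, 20 * y)

def gradient_descent(eta, start=(1.0, 1.0), tol=1e-8, max_steps=100):
    # Repeat x_{k+1} = x_k - eta * grad f(x_k) until ||grad f|| < tol or steps run out
    x, y = start
    for step in range(max_steps):
        gx, gy = grad_f(x, y)
        if math.hypot(gx, gy) < tol:
            return x, y, step
        x, y = x - eta * gx, y - eta * gy
    return x, y, max_steps

print(gradient_descent(0.09))   # converges: both coordinates shrink toward 0
print(gradient_descent(0.11))   # diverges: |y| grows by a factor of 1.2 every step
```

With $\eta = 0.11$ the $x$-coordinate still shrinks; only the steep $y$-direction blows up, which is why that direction sets the limit.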

7. The Gradient Across the Sciences (§30.10)

The gradient is the mathematics of "downhill" / flow:

  • Physics: force is the negative gradient of potential, $\mathbf F = -\nabla\Phi$ (gravity, electrostatics $\mathbf E = -\nabla V$, springs).
  • Transport laws all share the form flux $= -(\text{conductivity})\times$ gradient: Fourier ($\mathbf q = -k\nabla T$), Fick ($\mathbf J = -D\nabla c$), Darcy ($\mathbf v = -K\nabla h$), Ohm ($\mathbf J = -\sigma\nabla V$).
  • Image processing: edges are where $\|\nabla I\|$ is large (Sobel, Canny).
  • Biology: chemotaxis is gradient ascent/descent on a chemical field.
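To make the image-processing bullet concrete, the sketch below computes $\|\nabla I\|$ on a tiny synthetic image with one vertical step edge. It uses plain central differences, which is an assumption for illustration and is not the Sobel or Canny operator; the image values are invented.

```python
import math

# Assumed 5x6 grayscale image: dark on the left, bright on the right
image = [[0, 0, 0, 10, 10, 10] for _ in range(5)]

def gradient_magnitude(img, r, c):
    # Central differences in each direction at an interior pixel (r, c)
    gx = (img[r][c + 1] - img[r][c - 1]) / 2
    gy = (img[r + 1][c] - img[r - 1][c]) / 2
    return math.hypot(gx, gy)

# Scan the interior pixels of the middle row
magnitudes = [gradient_magnitude(image, 2, c) for c in range(1, 5)]
print(magnitudes)   # [0.0, 5.0, 5.0, 0.0]
```

The magnitude is nonzero only in the two columns straddling the jump from 0 to 10, which is where an edge detector would respond.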

Common Errors to Avoid

  • Dropping the inner derivative in the chain rule: $\frac{dz}{dt} = f_x + f_y$ is wrong; you need $f_x\frac{dx}{dt} + f_y\frac{dy}{dt}$.
  • Not normalizing $\mathbf u$ before $D_{\mathbf u}f = \nabla f\cdot\mathbf u$ — the formula requires a unit vector.
  • Calling $\nabla f$ tangent to a level set — it is normal to it.
  • Confusing direction with rate: the steepest-ascent direction is $\nabla f$; the maximum rate is the scalar $\|\nabla f\|$.
  • A reckless learning rate: too large an $\eta$ makes gradient descent diverge, not converge.
  • Forgetting that the formula needs differentiability: $D_{\mathbf u}f = \nabla f\cdot\mathbf u$ requires $f$ to be differentiable at the point (continuous partials suffice).

Connections

  • Back to Chapter 7 (single-variable chain rule; the multivariable rule adds a sum over paths) and Chapter 8 (implicit differentiation, now reduced to the formula $dy/dx = -F_x/F_y$).
  • Back to Chapter 6, where the derivative first told us "which way to step": the gradient-descent thread that culminates here.
  • Back to Chapter 11 (linear approximation / tangent line) and Chapter 29 (partial derivatives, tangent planes for graphs).
  • Forward to Chapter 31: critical points where $\nabla f = \mathbf 0$, the Hessian (second-derivative) test, and Lagrange multipliers for constrained optimization (where $\nabla f = \lambda\nabla g$).
  • Forward to Chapters 34–37: the operator $\nabla$ returns as divergence $\nabla\cdot\mathbf F$ and curl $\nabla\times\mathbf F$.

The summit. More applied calculus, in more fields, traces back to the gradient than to any other single idea in this book. You have reached the conceptual peak — and the mathematical heart of modern AI.