Case Study 2 — Gradient Descent and the Training of a Neural Network

Domain: Machine learning and data science. The advance: Every modern AI system — language models, image generators, recommendation engines — is trained by a single calculus algorithm: gradient descent, the anchor you first met on a parabola in Chapter 6. This case study follows that one idea from a one-variable derivative all the way to a model with billions of parameters.

What "training a model" actually means

When you ask a chatbot a question and it produces a fluent answer, it is running billions of arithmetic operations — multiplications and additions arranged in layers, with smooth nonlinear functions in between. Each multiplication uses a number called a weight, and a large model has billions of them. Before training, those weights are random noise and the model's outputs are gibberish. Training is the process of adjusting the weights until the model's outputs match patterns in a vast body of example data.

The question is: in a space of billions of weights, which way should each one move? This is an optimization problem of staggering dimension, and the tool that solves it is the derivative — specifically the multivariable gradient of Chapter 30. There is no clever closed-form answer, no formula that hands you the best weights. There is only the calculus you already know, run at enormous scale.

The whole idea on a single parabola (Chapter 6)

Start where the anchor started. Suppose you want to minimize $f(w) = (w-3)^2$, a parabola with its lowest point at $w=3$. The derivative is $f'(w) = 2(w-3)$. The derivative's sign tells you which way is downhill: where $f'(w)>0$ the function is rising, so you should move left; where $f'(w)<0$ it is falling, so you should move right. Both cases are captured by stepping opposite the derivative:

$$w_{n+1} = w_n - \eta\, f'(w_n),$$

where $\eta$ (the learning rate) controls the step size. Let $\eta = 0.1$ and start at $w_0 = 0$. The update becomes $w_{n+1} = w_n - 0.1\cdot 2(w_n - 3) = 0.8\,w_n + 0.6$. By hand:

$$w_0 = 0,\quad w_1 = 0.6,\quad w_2 = 1.08,\quad w_3 = 1.464,\quad w_4 = 1.7712,\ \ldots$$

The values march steadily toward $w = 3$, the true minimum. (Check: the fixed point of $w = 0.8w + 0.6$ is $0.2w = 0.6$, i.e. $w=3$.) That is the entire algorithm. Everything that follows is this same loop, in more dimensions.

```python
# Gradient descent on f(w) = (w-3)**2, the Chapter 6 anchor.
# Minimum is at w = 3. Hand-computed first steps: 0, 0.6, 1.08, 1.464, ...
def f_prime(w: float) -> float:
    return 2 * (w - 3)

w, eta = 0.0, 0.1
for _ in range(50):
    w = w - eta * f_prime(w)
print(round(w, 4))   # -> 3.0 (converges to the true minimum)
```
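
A short derivation, using only the update above, shows why fifty steps are plenty. The error symbol $e_n = w_n - 3$ is introduced here just for illustration. Subtracting the fixed point $3$ from both sides of $w_{n+1} = 0.8\,w_n + 0.6$ gives

$$e_{n+1} = 0.8\,e_n \quad\Longrightarrow\quad e_n = (0.8)^n\,e_0 = -3\,(0.8)^n, \qquad w_n = 3 - 3\,(0.8)^n.$$

Each step shrinks the distance to the minimum by the same factor $0.8$. The formula reproduces the hand-computed values (for instance $w_4 = 3 - 3(0.4096) = 1.7712$), and after 50 steps the remaining gap is $3\,(0.8)^{50} \approx 4\times 10^{-5}$, which is why the printed value rounds to $3.0$.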

Many variables: the gradient points downhill (Chapter 30)

A real model does not minimize over one weight but over millions or billions at once. Chapter 30 supplied exactly the needed upgrade. For a function $L(\mathbf{w})$ of many variables — the loss, which measures how wrong the model's predictions are — the gradient $\nabla L$ is the vector of all the partial derivatives, and it points in the direction of steepest increase. So to decrease the loss, step opposite it:

$$\mathbf{w}_{n+1} = \mathbf{w}_n - \eta\,\nabla L(\mathbf{w}_n).$$

This is the same update as the parabola, with the scalar derivative $f'$ replaced by the gradient vector $\nabla L$. The "loss landscape" of a model is a surface in a space of millions of dimensions, full of valleys and ridges no human can picture; training is a ball rolling downhill on that surface, one gradient step at a time.

Make it concrete with the smallest real example — fitting a line $\hat{y} = m x + b$ to data points $(x_i, y_i)$ by minimizing the squared error

$$L(m,b) = \tfrac{1}{2}\sum_i (m x_i + b - y_i)^2.$$

The two partial derivatives (Chapter 29) are

$$\frac{\partial L}{\partial m} = \sum_i (m x_i + b - y_i)\,x_i, \qquad \frac{\partial L}{\partial b} = \sum_i (m x_i + b - y_i).$$

The gradient is the pair $\nabla L = \left(\frac{\partial L}{\partial m},\ \frac{\partial L}{\partial b}\right)$, and the update nudges $m$ and $b$ together, step by step, until the line fits. A neural network is this same picture with billions of parameters instead of two — but the calculus is identical.
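
The two partial derivatives translate almost line for line into code. The sketch below is a minimal illustration, not a production fitting routine: the four data points (chosen to lie exactly on $y = 2x + 1$), the learning rate of 0.05, the step count, and the helper name `gradient` are all assumptions made for the example.

```python
# Gradient descent for the line fit y_hat = m*x + b, using the two partial
# derivatives of L(m, b) = 1/2 * sum((m*x_i + b - y_i)**2).
# The data, learning rate, and step count are illustrative assumptions.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # points on the line y = 2x + 1


def gradient(m: float, b: float) -> tuple[float, float]:
    """Return (dL/dm, dL/db) for the squared-error loss."""
    residuals = [m * x + b - y for x, y in zip(xs, ys)]
    dL_dm = sum(r * x for r, x in zip(residuals, xs))
    dL_db = sum(residuals)
    return dL_dm, dL_db


m, b, eta = 0.0, 0.0, 0.05
for _ in range(1000):
    dL_dm, dL_db = gradient(m, b)
    # Update both parameters together, stepping opposite the gradient.
    m, b = m - eta * dL_dm, b - eta * dL_db

print(round(m, 4), round(b, 4))
```

With this data the loop recovers a slope near 2 and an intercept near 1. Because the gradient here is a sum over all data points, the learning rate has to be small enough for the data at hand; a noticeably larger value would make these particular iterates overshoot.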

The Key Insight. There is no difference in principle between minimizing $(w-3)^2$ by hand in Chapter 6 and training a language model. Both compute a derivative, both step opposite it, both repeat. The derivative you learned on a parabola is, with no change in concept, the engine of artificial intelligence. What changed is only the number of dimensions and the number of steps.

Backpropagation: the chain rule run backward (Chapter 7)

A model with billions of weights needs billions of partial derivatives, recomputed at every step. Doing that naively, by wiggling each weight in turn and re-evaluating the model to measure the effect, would cost a full pass through the model for every single weight and would be hopelessly slow. The trick is backpropagation, and it is nothing but the chain rule of Chapter 7 applied to the model's computational graph.

A network composes simple functions: the loss $L$ depends on the output, which depends on the last layer's weights and on that layer's input, which in turn depends on the previous layer, and so on. The chain rule says the derivative of a composition is the product of the derivatives of its parts. Backpropagation evaluates this product backward through the network, from the loss toward the inputs, reusing each intermediate result so that all billions of partial derivatives come out in a single backward pass:

$$\frac{\partial L}{\partial w_i} = \sum_j \frac{\partial L}{\partial u_j}\cdot\frac{\partial u_j}{\partial w_i},$$

where the $u_j$ are the intermediate quantities through which $w_i$ affects the loss. Each term in the sum is one application of the Chapter 7 chain rule; backpropagation just keeps the books efficiently across the whole graph.
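
To see a backward pass in miniature, here is a sketch on a hypothetical two-weight network, $y = w_2 \tanh(w_1 x)$ with loss $L = \tfrac{1}{2}(y - t)^2$. The network, the numerical values, and every variable name are assumptions chosen for illustration. Each backward line multiplies an already-computed derivative by one local derivative, and a finite-difference check (the slow "wiggle each weight" method) confirms the result.

```python
import math

# Hypothetical tiny network: u = w1*x, h = tanh(u), y = w2*h, L = 0.5*(y - t)**2.
x, t = 1.5, 0.4          # one input and its target (illustrative values)
w1, w2 = 0.8, -0.6       # the two weights


def loss(w1: float, w2: float) -> float:
    return 0.5 * (w2 * math.tanh(w1 * x) - t) ** 2


# Forward pass: compute and keep every intermediate quantity.
u = w1 * x
h = math.tanh(u)
y = w2 * h

# Backward pass: start at the loss and apply the chain rule one link at a time.
dL_dy = y - t                    # L = 0.5*(y - t)**2  =>  dL/dy = y - t
dL_dw2 = dL_dy * h               # y = w2*h            =>  dy/dw2 = h
dL_dh = dL_dy * w2               # y = w2*h            =>  dy/dh = w2
dL_du = dL_dh * (1.0 - h * h)    # h = tanh(u)         =>  dh/du = 1 - tanh(u)**2
dL_dw1 = dL_du * x               # u = w1*x            =>  du/dw1 = x

# Slow check: wiggle each weight and measure the change in the loss.
eps = 1e-6
fd_w1 = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)
fd_w2 = (loss(w1, w2 + eps) - loss(w1, w2 - eps)) / (2 * eps)

print(abs(dL_dw1 - fd_w1) < 1e-6, abs(dL_dw2 - fd_w2) < 1e-6)
```

Note that `dL_dy` is computed once and reused for both weights. That reuse, repeated across every layer, is what lets one backward pass deliver every partial derivative.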

Computational Note. Real frameworks (PyTorch, JAX, TensorFlow) never differentiate by hand. They use automatic differentiation: the code records every elementary operation as it runs, then applies the chain rule mechanically to produce the exact gradient — not a finite-difference approximation, but the genuine derivative to machine precision. Automatic differentiation is the industrial-scale version of the differentiation rules of Chapter 7. The rules did not change; their scale did.
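
The record-then-replay idea can be sketched in a few lines. The `Var` class below is a hypothetical toy written for this case study; it is not how PyTorch, JAX, or TensorFlow are implemented internally, and it supports only addition and multiplication of scalars. It records each operation along with its local derivative, then walks the record backward applying the chain rule.

```python
class Var:
    """A scalar that remembers how it was computed (hypothetical toy class)."""

    def __init__(self, value: float, parents: tuple = ()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent Var, derivative of this value w.r.t. that parent).
        self.parents = parents

    def __add__(self, other: "Var") -> "Var":
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other: "Var") -> "Var":
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self) -> None:
        # Order the recorded graph so every node follows the nodes it depends on.
        order, seen = [], set()

        def visit(node: "Var") -> None:
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back toward the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local


# The parabola f(w) = (w - 3)**2 at w = 0, built from recorded operations.
w = Var(0.0)
diff = w + Var(-3.0)
f = diff * diff
f.backward()
print(f.value, w.grad)   # f(0) = 9 and f'(0) = 2*(0 - 3) = -6
```

The gradient comes out of the recorded arithmetic itself, with no step size and no approximation error from finite differences.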

Why this scaled into the AI era

The mathematics of gradient descent has been understood since the 19th century. What changed in the 2010s was scale, and it arrived from three directions at once: algorithms (architectures like the Transformer that compose into smooth, differentiable functions), data (vast text and image corpora), and compute (GPUs that perform the matrix arithmetic of a gradient step in massive parallel). A frontier model may have on the order of a trillion weights and may take $\sim 10^{25}$ floating-point operations to train. Every one of those operations is, at bottom, a multiply or an add inside a gradient computation. Strip away the engineering and what remains is the loop from the parabola above, run an astronomical number of times.

Real-World Application: protein folding and beyond (computational biology). Starting in 2021, the AlphaFold system predicted the three-dimensional structures of nearly every catalogued protein, a problem that had resisted biologists for fifty years. AlphaFold is a neural network trained by the same gradient-based recipe described above: its many millions of weights were tuned by stepping opposite the gradient of a loss that measured structural error. The same algorithm that fits a line to a handful of points, scaled up, helped redraw the map of molecular biology. This is Theme 5 in its purest form: the calculus does not care which field it is borrowed into.

The honest limits

Section 40.10 warned that calculus has edges, and gradient descent shows them plainly. The loss landscape of a deep network is non-convex — riddled with valleys that are not the deepest valley — so gradient descent finds a good minimum, not provably the best one. The gradient guarantees only local downhill, never global optimality. And calculus says nothing about why the resulting models generalize so well to data they were never trained on; that remains an open research question where the theory has not caught up with the practice. Calculus is necessary for modern AI and entirely insufficient to explain it — a fitting note on which to close a book that has tried, throughout, to be honest about where the tools stop.
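
The local-versus-global point can be seen even in one variable. The sketch below uses a hypothetical non-convex function, $g(w) = w^4 - 3w^2 + w$, invented for this illustration; the learning rate, step count, and starting points are likewise assumptions. The same gradient descent loop, started from two different places, settles in two different valleys.

```python
# Hypothetical non-convex loss with two valleys (chosen for illustration only).
def g(w: float) -> float:
    return w**4 - 3 * w**2 + w


def g_prime(w: float) -> float:
    return 4 * w**3 - 6 * w + 1


def descend(w: float, eta: float = 0.01, steps: int = 2000) -> float:
    """Plain gradient descent on g, starting from w."""
    for _ in range(steps):
        w = w - eta * g_prime(w)
    return w


w_left = descend(-2.0)    # started on the left slope
w_right = descend(2.0)    # started on the right slope
print(round(w_left, 2), round(w_right, 2))
print(g(w_left) < g(w_right))
```

Both runs end where the derivative is essentially zero, but the run started at $w = 2$ stops near $w \approx 1.13$, in the shallower valley, while the run started at $w = -2$ reaches the deeper one near $w \approx -1.30$. Nothing in the gradient at the shallower minimum reveals that a better one exists elsewhere.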

Discussion Questions

In one sentence, what is the difference in principle between minimizing $(w-3)^2$ and training a neural network? (There isn't one — explain why, naming the Chapter 6 and Chapter 30 ideas involved.)
Why is the gradient, not just any derivative, the right object once there are many weights? What does $\nabla L$ point toward, and why do we subtract it?
What does backpropagation actually compute, and which Chapter 7 rule is it built from? Why is running the chain rule backward through the graph efficient?
Gradient descent finds a local minimum, not necessarily the global one. Why does this not stop modern AI from working? What does it say about the limits of calculus from §40.10?
The learning rate $\eta$ controls step size. Using the parabola example, predict what happens if $\eta$ is far too large (say $\eta = 2$). Does the sequence still converge to $w=3$? (Try the update $w_{n+1} = w_n - 2\cdot 2(w_n-3)$ for one or two steps and see.)

A Short Annotated Reading

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press (free online). The standard reference; Chapters 4 and 6 connect optimization and backpropagation directly to the calculus of this book.
3Blue1Brown. "Neural networks" video series (YouTube). A visual walk-through of gradient descent and backpropagation that makes the loss-landscape picture vivid — the best free on-ramp from calculus to deep learning.
Nielsen, M. (2015). Neural Networks and Deep Learning (free online). Derives backpropagation from the chain rule in full, slow detail; the ideal next step after this chapter.
Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS. The Transformer paper behind modern language models — every operation in it is differentiable, which is precisely what lets gradient descent train it.