Case Study 1 — Backpropagation: The Chain Rule Trains the World
Field: Machine learning / data science
Calculus used: Chain rule (Section 7.6), product rule, the sigmoid derivative
Forward reference: Chapter 30 (the multivariable gradient and gradient descent)
The Problem That Built Modern AI
Every large language model, every image classifier, every recommendation engine you have ever used was trained by the same procedure: adjust millions or billions of internal numbers, called weights, until the model's predictions match the data. The question that stalled artificial intelligence for two decades was deceptively simple — in which direction, and by how much, should each weight be nudged?
The answer is a derivative. If $L$ is a loss measuring how wrong the model currently is, and $w$ is one weight, then $\partial L / \partial w$ tells you exactly how the error responds to a tiny change in that weight. Nudge each weight a small step against its derivative — this is gradient descent, the anchor example introduced in Chapter 6 — and the loss goes down. Repeat a few million times and the network learns.
The obstacle is scale. A modern network has hundreds of billions of weights, and the loss depends on each one only indirectly, filtered through layer after layer of intervening computation. Computing one derivative from the limit definition would be hopeless; computing a billion of them, unthinkable. The breakthrough — backpropagation, popularized by Rumelhart, Hinton, and Williams in 1986 — is the realization that this mountain of derivatives is nothing more than the chain rule of Section 7.6, applied systematically and with all the shared work reused. This case study builds the entire idea from a network you can differentiate by hand.
A Network Small Enough to Hold in Your Head
Strip a neural network down to its skeleton: one input $x$, one hidden neuron, one output. The forward pass — how the network turns an input into a prediction — is a short composition of functions:
$$h = w_1 x + b_1 \qquad \text{(a linear combination)}$$ $$a = \sigma(h) \qquad \text{(a nonlinear "activation")}$$ $$\hat{y} = w_2\, a + b_2 \qquad \text{(another linear combination)}$$
Here $\sigma(z) = \dfrac{1}{1 + e^{-z}}$ is the sigmoid, the smooth S-shaped squashing function we met in the Chapter 2 case study. The four numbers $w_1, b_1, w_2, b_2$ are the parameters to be learned: two weights and two biases, all of which this case study calls weights for short. Read the three lines as a single composite function:
$$\hat{y}(x) = w_2\,\sigma(w_1 x + b_1) + b_2.$$
That nesting — a linear function inside $\sigma$ inside another linear function — is exactly the composite structure the chain rule was built for. A deep network just stacks more of these layers; the principle does not change.
Suppose we feed in one training example $(x = 1, \; y = 5)$ — the target output is $5$ — and we measure error by the squared difference:
$$L = (\hat{y} - y)^2 = (\hat{y} - 5)^2.$$
We want $\partial L / \partial w_1$: how the loss changes as we wiggle the first weight, buried deepest in the composition.
Applying the Chain Rule
Trace the dependency chain from $w_1$ all the way out to $L$:
$$w_1 \;\longrightarrow\; h \;\longrightarrow\; a \;\longrightarrow\; \hat{y} \;\longrightarrow\; L.$$
Each arrow is one differentiable step, so the chain rule multiplies their local rates of change:
$$\frac{\partial L}{\partial w_1} = \underbrace{\frac{\partial L}{\partial \hat{y}}}_{\text{loss}} \cdot \underbrace{\frac{\partial \hat{y}}{\partial a}}_{\text{output layer}} \cdot \underbrace{\frac{\partial a}{\partial h}}_{\text{activation}} \cdot \underbrace{\frac{\partial h}{\partial w_1}}_{\text{input layer}}.$$
Now compute each factor — every one is a one-line derivative from this chapter:
| Local derivative | Comes from | Rule used |
|---|---|---|
| $\dfrac{\partial L}{\partial \hat{y}} = 2(\hat{y} - 5)$ | $L = (\hat y - 5)^2$ | power rule with the chain rule |
| $\dfrac{\partial \hat{y}}{\partial a} = w_2$ | $\hat y = w_2 a + b_2$ | derivative of a linear function |
| $\dfrac{\partial a}{\partial h} = \sigma(h)\big(1 - \sigma(h)\big)$ | $a = \sigma(h)$ | chain rule (or quotient rule) |
| $\dfrac{\partial h}{\partial w_1} = x$ | $h = w_1 x + b_1$ | derivative of a linear function |
The third row is worth a pause: the sigmoid's tidy self-referential derivative $\sigma' = \sigma(1-\sigma)$ comes straight from differentiating $\sigma(z) = (1 + e^{-z})^{-1}$ with the chain rule (outer power, inner $e^{-z}$) — the same machinery as everything else in this chapter. Multiplying the four factors:
$$\frac{\partial L}{\partial w_1} = 2(\hat{y} - 5)\cdot w_2 \cdot \sigma(h)\big(1 - \sigma(h)\big)\cdot x.$$
That is backpropagation in one neuron. The total derivative is the product of the local derivatives along the chain, each trivial on its own, assembled by the chain rule into the answer gradient descent needs.
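For readers who want the sigmoid row of the table spelled out, the derivation takes only a few lines. It is a sketch using nothing beyond the power rule and the chain rule, with the outer function $u^{-1}$ and the inner function $u = 1 + e^{-z}$:

$$\sigma'(z) = \frac{d}{dz}\big(1 + e^{-z}\big)^{-1} = -\big(1 + e^{-z}\big)^{-2}\cdot\big(-e^{-z}\big) = \frac{e^{-z}}{\big(1 + e^{-z}\big)^{2}}.$$

Splitting the fraction into two factors and noting that $1 - \sigma(z) = \dfrac{e^{-z}}{1 + e^{-z}}$ gives the self-referential form:

$$\frac{e^{-z}}{\big(1 + e^{-z}\big)^{2}} = \frac{1}{1 + e^{-z}}\cdot\frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\big(1 - \sigma(z)\big).$$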
Running the Numbers Once
Let us make it fully concrete. Initialize $w_1 = 0.5$, $b_1 = 0$, $w_2 = 1$, $b_2 = 0$, and push the example $x = 1$ through:
- $h = (0.5)(1) + 0 = 0.5$
- $a = \sigma(0.5) = \dfrac{1}{1 + e^{-0.5}} \approx 0.6225$
- $\hat{y} = (1)(0.6225) + 0 = 0.6225$
- $L = (0.6225 - 5)^2 \approx 19.16$ — the network is badly wrong, as expected before training.
Now the backward pass:
- $\dfrac{\partial L}{\partial \hat y} = 2(0.6225 - 5) = -8.755$
- $\dfrac{\partial \hat y}{\partial a} = w_2 = 1$
- $\dfrac{\partial a}{\partial h} = 0.6225(1 - 0.6225) \approx 0.2350$
- $\dfrac{\partial h}{\partial w_1} = x = 1$
- $\dfrac{\partial L}{\partial w_1} = (-8.755)\cdot 1 \cdot 0.2350 \cdot 1 \approx -2.06.$
The negative sign says: increasing $w_1$ would decrease the loss, so gradient descent should push $w_1$ up. One step with learning rate $\alpha = 0.1$ gives $w_1 \leftarrow 0.5 - (0.1)(-2.06) = 0.706$. The same backward pass simultaneously yields $\partial L/\partial w_2 \approx -5.45$, $\partial L/\partial b_1 \approx -2.06$, and $\partial L/\partial b_2 \approx -8.76$ — every gradient from a single sweep.
```python
# Backpropagation through one hidden neuron: the chain rule, by hand.
import numpy as np

x, y = 1.0, 5.0                       # one training example
w1, b1, w2, b2 = 0.5, 0.0, 1.0, 0.0   # initial weights

sigmoid = lambda z: 1 / (1 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))  # sigma' = sigma(1-sigma)

# ---- forward pass ----
h = w1 * x + b1
a = sigmoid(h)
y_hat = w2 * a + b2
L = (y_hat - y) ** 2

# ---- backward pass: multiply local derivatives along the chain ----
dL_dyhat = 2 * (y_hat - y)
dL_dw2 = dL_dyhat * a
dL_db2 = dL_dyhat * 1
dL_da = dL_dyhat * w2
dL_dh = dL_da * sigmoid_prime(h)
dL_dw1 = dL_dh * x
dL_db1 = dL_dh * 1

print(f"L = {L:.4f}")
print(f"dL/dw1 = {dL_dw1:+.4f} dL/dw2 = {dL_dw2:+.4f}")
print(f"dL/db1 = {dL_db1:+.4f} dL/db2 = {dL_db2:+.4f}")

# Output:
# L = 19.1629
# dL/dw1 = -2.0575 dL/dw2 = -5.4497
# dL/db1 = -2.0575 dL/db2 = -8.7551
```
Wrap that backward pass in a loop, take a small step against every gradient each time, and the network's loss falls toward zero. You have trained a (very small) neural network using nothing but the chain rule.
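As a minimal sketch of that loop, the code below repeats the forward and backward pass on the single example $(x, y) = (1, 5)$ and steps every parameter against its gradient with the learning rate $\alpha = 0.1$ used above. The step count of 200 and the names `n_steps`, `losses`, and `w1_history` are assumptions chosen for illustration, not part of the original example.

```python
# Training the one-neuron network by gradient descent: a sketch, not a
# production training loop. n_steps = 200 is an illustrative assumption.
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x, y = 1.0, 5.0                        # the single training example
w1, b1, w2, b2 = 0.5, 0.0, 1.0, 0.0    # same initial weights as above
alpha = 0.1                            # learning rate from the worked example
n_steps = 200                          # assumed number of update steps

losses = []       # loss measured before each update
w1_history = []   # value of w1 after each update

for step in range(n_steps):
    # ---- forward pass ----
    h = w1 * x + b1
    a = sigmoid(h)
    y_hat = w2 * a + b2
    losses.append((y_hat - y) ** 2)

    # ---- backward pass: the same chain-rule products as before ----
    dL_dyhat = 2 * (y_hat - y)
    dL_dw2 = dL_dyhat * a
    dL_db2 = dL_dyhat
    dL_dh = dL_dyhat * w2 * a * (1 - a)   # sigma'(h) = a(1 - a)
    dL_dw1 = dL_dh * x
    dL_db1 = dL_dh

    # ---- gradient descent: step every parameter against its gradient ----
    w1 -= alpha * dL_dw1
    b1 -= alpha * dL_db1
    w2 -= alpha * dL_dw2
    b2 -= alpha * dL_db2
    w1_history.append(w1)

print(f"loss before training: {losses[0]:.4f}")
print(f"w1 after one step:    {w1_history[0]:.3f}")
print(f"loss at final step:   {losses[-1]:.2e}")
```

The first update reproduces the hand-computed step $w_1 \leftarrow 0.706$. With only one training example the network can fit it essentially exactly, so this demonstrates the mechanics of training rather than any ability to generalize.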
Why It Scales: Shared Work, One Backward Pass
A network with $N$ layers and $M$ total weights has, in principle, $M$ derivatives to compute. The naive approach — a separate chain-rule calculation for each weight — would be catastrophically slow. The genius of backpropagation is that the chains overlap: the factor $\partial L/\partial a$ feeds the gradient of $w_1$, $b_1$, and every weight beneath it. Compute each intermediate derivative once, cache it, and reuse it on the way down.
Notice it in the worked numbers above: $\partial L/\partial h$ was computed a single time and then multiplied by $x$ to get $\partial L/\partial w_1$ and by $1$ to get $\partial L/\partial b_1$. In a deep network this reuse compounds at every layer. The upshot is that the entire gradient $(\partial L/\partial w_1, \ldots, \partial L/\partial w_M)$ is obtained in one backward sweep whose cost is a small constant multiple of a single forward pass, whatever the number of weights. The naive alternative would need a separate calculation for each of the $M$ weights. That efficiency is the difference between training a billion-parameter model in days and never training it at all.
This is also why the algorithm is called backpropagation: information about the error propagates backward from the output, each layer multiplying in its own local derivative as the chain rule prescribes. Forward to predict, backward to learn — and "backward" is just the chain rule run from the outside in, exactly as in the three-layer example $\sin(\sqrt{x^2+1})$ from Section 7.6.
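To make the cost argument concrete, here is a sketch of the brute-force alternative: estimating each derivative separately with a central difference, a numerical stand-in for the limit definition. It uses the same network, weights, and training example as the worked numbers. The step size `eps = 1e-6` and the names `loss`, `numeric_grad`, and `forward_passes` are illustrative assumptions.

```python
# Finite-difference gradient estimate for the one-neuron network.
# A sketch for comparison with backpropagation; eps is an assumed step size.
import math


def loss(w1, b1, w2, b2, x=1.0, y=5.0):
    """One full forward pass, returning the squared-error loss."""
    h = w1 * x + b1
    a = 1.0 / (1.0 + math.exp(-h))
    y_hat = w2 * a + b2
    return (y_hat - y) ** 2


params = {"w1": 0.5, "b1": 0.0, "w2": 1.0, "b2": 0.0}
eps = 1e-6            # assumed perturbation size
forward_passes = 0    # how many times the network had to be evaluated
numeric_grad = {}

for name in params:
    # Perturb one parameter at a time, leaving the others fixed.
    bumped_up = dict(params, **{name: params[name] + eps})
    bumped_down = dict(params, **{name: params[name] - eps})
    numeric_grad[name] = (loss(**bumped_up) - loss(**bumped_down)) / (2 * eps)
    forward_passes += 2

for name, g in numeric_grad.items():
    print(f"dL/d{name} ~ {g:+.4f}")
print(f"forward passes used: {forward_passes}")
```

The estimates agree with the backpropagated gradients to the printed precision, but they cost two forward passes per parameter: eight evaluations for four parameters here, and a number that grows in proportion to $M$ for a real network. The backward sweep delivers every derivative after a single forward pass.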
Why This Is the Most Consequential Chain-Rule Application Ever
The chain rule is the calculus operation for compositions, and a neural network is the deepest composition humans routinely build — sometimes hundreds of layers, each a function of the last. Strip away the engineering and modern AI is one sentence of this chapter made enormous: rates of change multiply through a chain of dependencies. Drop a single inner-derivative factor and the gradients would be wrong, training would diverge, and the model would never learn. The pitfall this chapter keeps warning you about — forgetting to multiply by the derivative of the inside — is, at industrial scale, the difference between a working AI and a broken one.
We meet this anchor again in Chapter 30, where the single derivatives here become the multivariable gradient $\nabla L$ and the chain rule generalizes to the multivariable chain rule that backpropagation truly uses. The idea, though, is already complete in your hands.
Connections to Other Chapters
- Chain rule (Section 7.6): the entire engine of backpropagation.
- The sigmoid (Chapter 2 case study): the activation function and its self-referential derivative $\sigma' = \sigma(1-\sigma)$.
- Gradient descent (Chapter 6): the weight-update rule $w \leftarrow w - \alpha\,\partial L/\partial w$ that consumes these gradients.
- Multivariable gradient (Chapter 30): the proper vector-valued generalization, where this anchor example culminates.
Discussion Questions
1. Backpropagation is "just the chain rule." Why, then, was it considered a breakthrough worthy of decades of research? What is the algorithmic idea beyond the calculus?
2. The vanishing gradient problem: in a deep network, $\partial L/\partial w_1$ for an early weight is a product of many factors. If each factor has magnitude less than $1$, what happens to the product as depth grows, and how is this a direct consequence of the chain rule's multiplication?
3. The sigmoid derivative $\sigma(1-\sigma)$ has maximum value $0.25$ (at $\sigma = 0.5$). The ReLU activation $\max(0,x)$ has derivative $1$ for $x>0$. Using your answer to Question 2, explain why ReLU helps deep networks train.
4. Modern libraries (PyTorch, JAX, TensorFlow) perform automatic differentiation: they apply the chain rule mechanically to every elementary operation a program executes. How is this different from a symbolic system like `sympy` returning a formula?
5. The text calls the chain rule "the most consequential calculus result of the 21st century." Argue for or against, given how much technology depends on it.
Your Turn — Mini-Project
Write a tiny neural network in pure Python (no PyTorch) that learns the linear function $f(x) = 2x + 3$ from data:
- Use a single linear layer $\hat{y} = w_1 x + b_1$ (no activation — the network is linear).
- Generate 100 pairs $(x, y)$ with $y = 2x + 3 + \text{noise}$.
- Initialize $w_1, b_1$ randomly. For each example, use the chain rule to compute $\partial L/\partial w_1$ and $\partial L/\partial b_1$ from $L = (\hat y - y)^2$, then step against them.
After enough iterations, confirm $w_1 \approx 2$ and $b_1 \approx 3$. You will have watched the chain rule discover a function from data.
Annotated Further Reading
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." Nature, 323, 533–536. The paper that put backpropagation on the map — remarkably readable, and it is the chain rule throughout.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press, Chapter 6. The standard graduate treatment of feedforward networks and backprop; free online at deeplearningbook.org.
- Karpathy, A. "The spelled-out intro to neural networks and backpropagation: building micrograd." YouTube. Builds an automatic-differentiation engine from scratch, one chain-rule step at a time — the best way to see this case study run.
The chain rule is arguably the most consequential calculus result of the 21st century. Every AI you interact with — chatbots, autocomplete, medical-imaging classifiers, self-driving perception — learns by propagating error backward through a composition, multiplying local derivatives exactly as Section 7.6 prescribes.