Case Study 1 — Teaching a Machine to Predict: Gradient Descent in Action

Field: Data science / machine learning. This is the chapter's anchor climax — take the gradient, step downhill, repeat — worked end to end by hand on a model small enough to follow every number.

The Problem

Maya is a junior data scientist at a coffee-subscription startup. Marketing hands her a question that sounds simple: how does the amount we spend on a customer's onboarding emails relate to how many months they stay subscribed? She has cleaned the data down to four representative customers. Onboarding spend $x$ (in tens of dollars) and retention $y$ (in months):

| $x$ (spend, tens of dollars) | $0$ | $1$ | $2$ | $3$ |
|---|---|---|---|---|
| $y$ (retention, months) | $1$ | $3$ | $4$ | $6$ |

She wants the best straight line $\hat y = wx + b$ through these points — a slope $w$ telling her "each extra \$10 of onboarding buys this many months" and an intercept $b$ telling her the baseline. This is the simplest machine-learning model there is: **linear regression with two parameters**. The plan is to *learn* $w$ and $b$ the way a neural network learns its billions of weights — by gradient descent. Two parameters instead of two billion, but, as §30.9 promised, the algorithm is identical.

Setting Up the Loss Surface

"Learning" means "minimizing a loss." Maya measures how wrong a given line is with the mean-squared-error loss — average squared gap between prediction and truth:

$$L(w, b) = \frac{1}{4}\sum_{i=1}^{4}\big(wx_i + b - y_i\big)^2.$$

Every choice of $(w, b)$ gives one number $L$. The graph of $L$ over the $(w, b)$-plane is the loss surface of §30.9 — a bowl-shaped landscape. Training is the search for the bottom of that bowl. And the way to the bottom of any bowl, by everything in this chapter, is to step in the direction $-\nabla L$.
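As a minimal sketch of "every choice of $(w, b)$ gives one number," the loss can be evaluated directly on the four customers. The helper name `mse_loss` and the two sample points are illustrative choices, not part of Maya's workflow.

```python
import numpy as np

# Maya's four customers, copied from the data above.
X = np.array([0.0, 1.0, 2.0, 3.0])   # onboarding spend, tens of dollars
y = np.array([1.0, 3.0, 4.0, 6.0])   # retention, months

def mse_loss(w, b):
    """Mean-squared-error loss L(w, b) for the line y_hat = w*x + b."""
    residuals = (w * X + b) - y
    return float(np.mean(residuals ** 2))

loss_flat = mse_loss(0.0, 0.0)    # the flat line along the x-axis
loss_guess = mse_loss(1.0, 1.0)   # an arbitrary guess, chosen only for illustration
print(loss_flat, loss_guess)      # 15.5 1.5
```

Each pair $(w, b)$ is one point on the surface; the guess $(1, 1)$ already sits much lower in the bowl than the flat line.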

So Maya needs the gradient $\nabla L = \langle \partial L/\partial w,\ \partial L/\partial b\rangle$. She computes it once, symbolically, using the chain rule (§30.2) on each squared term. Writing the residual $r_i = wx_i + b - y_i$:

$$\frac{\partial L}{\partial w} = \frac{1}{4}\sum_{i} 2 r_i\, x_i = \frac{1}{2}\sum_i (wx_i + b - y_i)\,x_i,$$

$$\frac{\partial L}{\partial b} = \frac{1}{4}\sum_{i} 2 r_i\,(1) = \frac{1}{2}\sum_i (wx_i + b - y_i).$$

These two formulas are the entire engine. Each is a sum over the data of "how wrong we are" times "how this parameter influenced the prediction"; each partial derivative measures how much of the error that one parameter is accountable for.
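One way to gain confidence in a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below assumes a central-difference step `h = 1e-6` and an arbitrary test point; both are illustrative choices, as are the helper names.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 6.0])

def loss(w, b):
    """Mean-squared-error loss over the four data points."""
    return float(np.mean(((w * X + b) - y) ** 2))

def grad(w, b):
    """Hand-derived gradient: (1/2)*sum(r*x) and (1/2)*sum(r)."""
    r = (w * X + b) - y
    return float(0.5 * np.sum(r * X)), float(0.5 * np.sum(r))

def numeric_grad(w, b, h=1e-6):
    """Central-difference estimate of the same two partial derivatives."""
    dw = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
    db = (loss(w, b + h) - loss(w, b - h)) / (2 * h)
    return dw, db

w_test, b_test = 0.5, -0.3            # arbitrary test point
analytic = grad(w_test, b_test)
numeric = numeric_grad(w_test, b_test)
print("analytic:", analytic)
print("numeric :", numeric)
```

The two pairs should agree to several decimal places; a mismatch would point to an algebra slip in the derivation.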

Stepping Downhill, By Hand

Maya starts, as one does, at the wrong answer: $w_0 = 0$, $b_0 = 0$. The line is flat on the $x$-axis, predicting $0$ months for everyone. She picks a learning rate $\eta = 0.1$.

Step 0 → 1. With $w = 0, b = 0$, every prediction is $\hat y_i = 0$, so the residuals are $r_i = -y_i = (-1, -3, -4, -6)$ at $x = (0,1,2,3)$.

$$\frac{\partial L}{\partial w} = \tfrac12\big[(-1)(0) + (-3)(1) + (-4)(2) + (-6)(3)\big] = \tfrac12(0 - 3 - 8 - 18) = \tfrac12(-29) = -14.5,$$

$$\frac{\partial L}{\partial b} = \tfrac12\big[(-1) + (-3) + (-4) + (-6)\big] = \tfrac12(-14) = -7.$$

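So $\nabla L = \langle -14.5,\ -7\rangle$. Both components are negative, so the negative gradient $-\nabla L = \langle 14.5,\ 7\rangle$ points toward larger $w$ and larger $b$: stepping downhill on the loss means increasing both, which makes sense because the line is far too low and too flat. The update: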

$$w_1 = 0 - 0.1(-14.5) = 1.45, \qquad b_1 = 0 - 0.1(-7) = 0.7.$$

After one step the line is $\hat y = 1.45x + 0.7$ — already sloping up through the data.

Step 1 → 2. Now predictions are $\hat y_i = 1.45 x_i + 0.7 = (0.7,\ 2.15,\ 3.6,\ 5.05)$ at $x = (0,1,2,3)$. Residuals $r_i = \hat y_i - y_i = (0.7-1,\ 2.15-3,\ 3.6-4,\ 5.05-6) = (-0.3,\ -0.85,\ -0.4,\ -0.95)$.

$$\frac{\partial L}{\partial w} = \tfrac12\big[(-0.3)(0) + (-0.85)(1) + (-0.4)(2) + (-0.95)(3)\big] = \tfrac12(0 - 0.85 - 0.8 - 2.85) = \tfrac12(-4.5) = -2.25,$$

$$\frac{\partial L}{\partial b} = \tfrac12\big[(-0.3) + (-0.85) + (-0.4) + (-0.95)\big] = \tfrac12(-2.5) = -1.25.$$

Update:

$$w_2 = 1.45 - 0.1(-2.25) = 1.675, \qquad b_2 = 0.7 - 0.1(-1.25) = 0.825.$$

Notice the gradient shrank dramatically — from magnitude $\approx 16.1$ to $\approx 2.57$. That is the signature of approaching the valley floor: as the surface flattens, the steps get smaller on their own, exactly as §30.8 described. Maya is watching gradient descent self-throttle.
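The two hand steps can be replayed in a few lines as a check on the arithmetic. This sketch uses the hand-derived formulas (with the factor $\tfrac12$) and $\eta = 0.1$; the names `hand_gradient` and `history` are illustrative.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 6.0])

def hand_gradient(w, b):
    """The two partial derivatives exactly as derived by hand."""
    r = (w * X + b) - y
    return float(0.5 * np.sum(r * X)), float(0.5 * np.sum(r))

w, b, eta = 0.0, 0.0, 0.1
history = []                          # gradient norm measured before each step
for step in range(2):
    gw, gb = hand_gradient(w, b)
    history.append(float(np.hypot(gw, gb)))
    w -= eta * gw                     # step downhill in w
    b -= eta * gb                     # step downhill in b
    print(f"step {step + 1}: grad = ({gw:.3f}, {gb:.3f}), w = {w:.3f}, b = {b:.3f}")

print("gradient norms:", [round(g, 2) for g in history])
```

The printed iterates should match $(1.45, 0.7)$ and $(1.675, 0.825)$, and the norms should fall from about $16.1$ to about $2.57$.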

Where Is the Bottom?

Because this loss is an exact quadratic bowl, Maya can find the true minimum analytically and check that her iterates are heading there. The minimum sits where $\nabla L = \mathbf 0$, i.e. both partials vanish. Using $\sum x_i = 6$, $\sum x_i^2 = 0+1+4+9 = 14$, $\sum y_i = 14$, $\sum x_i y_i = 0+3+8+18 = 29$, and $n = 4$, the normal equations $\nabla L = 0$ are:

$$14w + 6b = 29, \qquad 6w + 4b = 14.$$

Solving: from the second, $b = \tfrac{14 - 6w}{4} = 3.5 - 1.5w$. Substitute into the first: $14w + 6(3.5 - 1.5w) = 29 \Rightarrow 14w + 21 - 9w = 29 \Rightarrow 5w = 8 \Rightarrow w^\star = 1.6$, and $b^\star = 3.5 - 1.5(1.6) = 3.5 - 2.4 = 1.1$.
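The same $2 \times 2$ system can be solved numerically as a cross-check. The sketch below builds the normal equations from the data and solves them with `np.linalg.solve`; the comparison against `np.polyfit` is an extra sanity check added here, not a step in Maya's derivation.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 6.0])

# Normal equations:  (sum x^2) w + (sum x) b = sum x*y
#                    (sum x)   w +    n    b = sum y
A = np.array([[np.sum(X ** 2), np.sum(X)],
              [np.sum(X),      len(X)]])
rhs = np.array([np.sum(X * y), np.sum(y)])

w_star, b_star = np.linalg.solve(A, rhs)
print(round(float(w_star), 3), round(float(b_star), 3))

# Independent check: degree-1 least-squares fit (coefficients come highest degree first).
slope, intercept = np.polyfit(X, y, 1)
print(round(float(slope), 3), round(float(intercept), 3))
```

Both routes should land on the same slope and intercept as the substitution above.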

The exact best-fit line is $\hat y = 1.6x + 1.1$. After just two hand steps Maya is at $(1.675, 0.825)$ — already close in slope, with the intercept still climbing toward $1.1$. Let the loop run and it converges to $(1.6, 1.1)$. Her business answer: each extra \$10 of onboarding is associated with about 1.6 more months of retention, on a baseline of about 1.1 months.

The Same Loop That Trains GPT

Maya wrote the production version in five lines. It is the §30.9 pattern verbatim:

# Linear regression by gradient descent (hand-checked above)
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 6.0])

w, b, eta = 0.0, 0.0, 0.1
for step in range(500):
    r = (w * X + b) - y          # residuals
    grad_w = np.mean(r * X)      # (1/4)·Σ r x, i.e. half the hand-derived ∂L/∂w = (1/2)·Σ r x
    grad_b = np.mean(r)          # (1/4)·Σ r,   i.e. half the hand-derived ∂L/∂b = (1/2)·Σ r
    w -= eta * grad_w            # step downhill — the one essential line
    b -= eta * grad_b

print(round(w, 3), round(b, 3))   # -> 1.6  1.1   (matches the exact minimum)

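(The code's np.mean(r*X) is $\tfrac14\sum r_i x_i$, exactly half of our hand gradient $\tfrac12\sum r_i x_i$, and likewise for $b$. The direction is the same, so the loop follows the same descent with an effective step size of $0.05$ instead of $0.1$ and reaches the same minimum.) Maya did not need to run the loop to understand it; she ran two steps by hand first, and the code merely finished the job at machine speed.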
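To see what the constant factor does in practice, the sketch below runs the descent twice: once with plain means, as in the loop, and once with the means doubled, which is the hand-derived gradient. The `scale` parameter and the `descend` helper are hypothetical names introduced only for this comparison.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 6.0])

def descend(scale, eta=0.1, steps=500):
    """Gradient descent from (0, 0) using scale * mean(r*x) and scale * mean(r)."""
    w, b = 0.0, 0.0
    path = []
    for _ in range(steps):
        r = (w * X + b) - y
        w -= eta * scale * np.mean(r * X)
        b -= eta * scale * np.mean(r)
        path.append((float(w), float(b)))
    return path

loop_path = descend(scale=1.0)   # plain means, as in the loop
hand_path = descend(scale=2.0)   # doubled means, i.e. the hand formulas
print("first step, plain means:", loop_path[0])
print("first step, hand scale :", hand_path[0])
print("after 500 steps        :", loop_path[-1], hand_path[-1])
```

The first iterates differ by a factor of two, roughly $(0.725, 0.35)$ against $(1.45, 0.7)$, while both runs should end near $(1.6, 1.1)$.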

The punchline is the one §30.9 insists on. Replace these 4 data points with millions, replace the line $wx + b$ with a deep neural network, replace the two parameters $(w, b)$ with $10^{11}$ weights, and replace the exact gradient with a minibatch estimate, and you have, in outline, how a large language model is trained. The gradient is still computed by the chain rule (now industrialized as backpropagation, §30.2), and the update is still the single line $\theta \leftarrow \theta - \eta\nabla L$. Maya, taking two steps downhill by hand, has done the same arithmetic that trains every model in the building.
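The phrase "minibatch estimate" can be made concrete even with four points. The sketch below, an illustration rather than a training recipe, computes the gradient on every possible pair of customers and checks that the estimates average to the exact gradient. The batch size of 2 and the helper name `gradient` are assumptions made for this example.

```python
from itertools import combinations
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 6.0])

def gradient(w, b, idx):
    """Hand-scaled gradient, 2*mean(r*x) and 2*mean(r), over the rows in idx."""
    xs, ys = X[idx], y[idx]
    r = (w * xs + b) - ys
    return np.array([2.0 * np.mean(r * xs), 2.0 * np.mean(r)])

w, b = 0.0, 0.0
full = gradient(w, b, np.arange(4))                        # exact gradient, all four points
batches = [np.array(pair) for pair in combinations(range(4), 2)]
estimates = np.array([gradient(w, b, idx) for idx in batches])

print("exact gradient      :", full)
print("minibatch estimates :")
print(estimates)
print("average of estimates:", estimates.mean(axis=0))
```

Individual estimates scatter around the exact gradient, but their average recovers it, which is the sense in which a randomly chosen minibatch gives an unbiased estimate.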

Discussion Questions

  1. Conditioning. Maya's $x$-values are spread out while her $y$-values are modest. If she had instead measured spend in dollars (so $x \in \{0, 10, 20, 30\}$), the $w$-direction of the loss surface would become far steeper than the $b$-direction. Predict what this does to a fixed learning rate $\eta = 0.1$, and connect it to the "poor conditioning" of §30.8. (This is why practitioners standardize their features.)
  2. Learning rate. Re-run the hand steps with $\eta = 1.0$. Does the loss still decrease, or do the iterates overshoot? Use the stability reasoning of §30.8 (the warning about $\eta$) to explain.
  3. From line to network. In the formula $\partial L/\partial w = \tfrac12\sum r_i x_i$, identify which factor is "how wrong the model is" and which is "how the parameter influenced the prediction." Explain how backpropagation generalizes exactly these two factors through many layers via the chain rule.
  4. Stopping. Maya stops when $\|\nabla L\|$ is small. Why is a small gradient — rather than a small loss — the right signal that training has converged? (See the algorithm's step 4 in §30.8.)
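For questions 1 and 2, a small harness like the following can be used to check a prediction after making it. It is a sketch under stated assumptions: the hand-derived gradient, a start at $(0, 0)$, and 20 steps; the `run` helper and its defaults are illustrative.

```python
import numpy as np

y = np.array([1.0, 3.0, 4.0, 6.0])

def run(X, eta, steps=20):
    """Return the loss after each step of hand-scaled gradient descent from (0, 0)."""
    w, b = 0.0, 0.0
    losses = []
    for _ in range(steps):
        r = (w * X + b) - y
        w -= eta * 0.5 * np.sum(r * X)
        b -= eta * 0.5 * np.sum(r)
        losses.append(float(np.mean(((w * X + b) - y) ** 2)))
    return losses

tens = np.array([0.0, 1.0, 2.0, 3.0])   # spend in tens of dollars, as in the case study
dollars = 10.0 * tens                    # spend in dollars (question 1)

baseline = run(tens, eta=0.1)
rescaled = run(dollars, eta=0.1)
big_step = run(tens, eta=1.0)            # question 2
for name, losses in [("baseline", baseline), ("dollars", rescaled), ("eta=1.0", big_step)]:
    print(name, [f"{v:.3g}" for v in losses[:4]])
```

Comparing the first few losses in each run against the baseline shows whether a given combination of feature scale and learning rate is stable.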

A Short Annotated Reading

  • Goodfellow, Bengio & Courville, Deep Learning (2016), Ch. 4 (numerical computation) and §6.5 (back-propagation). The canonical modern reference; §6.5 is precisely the reverse-mode chain rule of §30.2 written for networks. Free online.
  • Stewart, Calculus: Early Transcendentals (9th ed.), §14.6 (directional derivatives & the gradient). The textbook foundation for the gradient Maya computed; pair with §14.5 (chain rule).
  • OpenStax, Calculus Volume 3, §4.6. Free, careful treatment of $D_{\mathbf u}f = \nabla f\cdot\mathbf u$ and steepest descent, with worked examples mirroring this case study.
  • Ng, Machine Learning course notes (Stanford CS229), "Linear Regression and Gradient Descent." Shows the exact two-parameter descent above as the gateway to all of supervised learning.