Case Study 1 — How a Machine Learns a Line: Gradient Descent in Action
Field: Data science / machine learning
Calculus used: the derivative as a function (§6.2), the derivative as a downhill compass (§6.8), higher derivatives for the convexity check (§6.6)
Anchor example: this is the first full episode of the book's gradient descent thread, which culminates in Chapter 30 with the multivariable gradient and neural networks.
In §6.8 we used gradient descent to walk downhill on the toy function $f(x) = (x-3)^2 + 1$, whose minimum we already knew. That was a rehearsal. Here we put the idea to work on the smallest genuine machine-learning task there is: fitting a straight line to data. The story is worth following carefully, because the loop you are about to watch — compute the slope, step against it, repeat — is, with no exaggeration, the same loop that trains the largest neural networks on Earth. Only the number of parameters changes.
A small data problem
A lab has measured five $(x, y)$ pairs and suspects the relationship is roughly proportional, $y \approx mx$, for some unknown slope $m$:
| $x$ | $1$ | $2$ | $3$ | $4$ | $5$ |
|---|---|---|---|---|---|
| $y$ | $2.1$ | $3.9$ | $6.1$ | $8.0$ | $9.9$ |
The points do not lie exactly on a line — real measurements never do — so there is no single $m$ that passes through all five. Instead we want the slope that fits best. To make "best" precise, we need a number that scores any candidate $m$ by how badly it misses, and then we want to make that number as small as possible. That is an optimization problem, and optimization is the natural home of the derivative.
The loss function
The standard score is the mean squared error, the average of the squared vertical gaps between the data and the line $y = mx$:
$$L(m) = \frac{1}{5}\sum_{i=1}^{5}\big(y_i - m\,x_i\big)^2.$$
Read $L$ as a function of one variable, $m$. For each candidate slope it returns a single non-negative number; the smaller $L(m)$, the better the fit. Writing it out for our data,
$$L(m) = \frac{1}{5}\Big[(2.1 - m)^2 + (3.9 - 2m)^2 + (6.1 - 3m)^2 + (8.0 - 4m)^2 + (9.9 - 5m)^2\Big].$$
Each term is a parabola in $m$, and a sum of upward parabolas is itself an upward parabola — so $L$ is a single smooth, bowl-shaped (convex) curve with exactly one lowest point. Our whole job is to find the $m$ at the bottom of that bowl.
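To see the scoring at work, the sketch below evaluates the loss at a few candidate slopes in plain Python. It also compares the sum-of-squares form against the expanded quadratic $L(m) = 11m^2 - 43.88m + 43.768$, which is what multiplying out the five squares gives for this data. The helper names `loss` and `loss_expanded` are chosen here for illustration only.

```python
# Data from the table above.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.1, 8.0, 9.9]

def loss(m):
    """Mean squared error of the line y = m*x against the five points."""
    return sum((y - m * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def loss_expanded(m):
    """The same loss after multiplying out the squares: one parabola in m."""
    return 11 * m**2 - 43.88 * m + 43.768

for m in (1.0, 2.0, 3.0):
    print(f"m = {m:.1f}   L(m) = {loss(m):.4f}   expanded = {loss_expanded(m):.4f}")
# m = 1.0   L(m) = 10.8880   expanded = 10.8880
# m = 2.0   L(m) = 0.0080   expanded = 0.0080
# m = 3.0   L(m) = 11.1280   expanded = 11.1280
```

The loss is large on either side of $m = 2$ and tiny near it, which is the bowl shape described above seen in numbers.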
Calculus finds the bottom of the bowl
Because $L$ is smooth, its minimum sits where the tangent line is flat — where $L'(m) = 0$ (§6.4). Differentiate term by term, using the chain-rule pattern $\frac{d}{dm}(c - am)^2 = -2a(c - am)$:
$$L'(m) = \frac{1}{5}\Big[-2(2.1 - m) - 4(3.9 - 2m) - 6(6.1 - 3m) - 8(8.0 - 4m) - 10(9.9 - 5m)\Big].$$
This is linear in $m$, so setting $L'(m) = 0$ has a tidy closed form. Collecting the structure, $L'(m) = -\frac{2}{5}\sum_i x_i(y_i - m x_i) = 0$ forces
$$m^\star = \frac{\sum_i x_i y_i}{\sum_i x_i^2}.$$
This is exactly the least-squares formula. A quick check that it is a minimum and not a maximum: the second derivative is $L''(m) = \frac{2}{5}\sum_i x_i^2 > 0$, so the curve is concave up everywhere (§6.6) — a true valley.
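For readers who want the skipped algebra, the step from $L'(m) = 0$ to the formula, followed by the arithmetic for our five points, runs as follows:

$$-\frac{2}{5}\sum_i x_i y_i + \frac{2m}{5}\sum_i x_i^2 = 0 \quad\Longrightarrow\quad m\sum_i x_i^2 = \sum_i x_i y_i,$$

$$\sum_i x_i y_i = 2.1 + 7.8 + 18.3 + 32.0 + 49.5 = 109.7, \qquad \sum_i x_i^2 = 1 + 4 + 9 + 16 + 25 = 55,$$

$$m^\star = \frac{109.7}{55} \approx 1.9945, \qquad L''(m) = \frac{2}{5}\cdot 55 = 22 > 0.$$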
```python
import numpy as np

xs = np.array([1, 2, 3, 4, 5])
ys = np.array([2.1, 3.9, 6.1, 8.0, 9.9])

def L(m): return np.mean((ys - m * xs) ** 2)
def Lprime(m): return np.mean(-2 * xs * (ys - m * xs))   # L'(m)

m_star = np.sum(xs * ys) / np.sum(xs ** 2)   # closed-form minimizer
print(f"closed-form m* = {m_star:.6f}")      # 1.994545
print(f"L(m*) = {L(m_star):.6f}")            # 0.007673
```
The exact answer is $m^\star \approx 1.9945$, with a tiny residual loss $L(m^\star) \approx 0.0077$. The slope is close to $2$ — the data were generated near $y = 2x$ — but not exactly $2$, because the noisy measurements pull the best fit slightly off. (A common student slip is to assume the answer is the clean number $2$; the calculus reports what the data actually say, which is $1.9945$.)
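A quick way to gain confidence in the hand-computed derivative is to compare it with a numerical slope estimate. The sketch below does this in plain Python with a symmetric difference quotient; the function names and the step size `h = 1e-5` are illustrative choices, not part of the derivation above.

```python
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.1, 8.0, 9.9]
n = len(xs)

def loss(m):
    """Mean squared error of the line y = m*x."""
    return sum((y - m * x) ** 2 for x, y in zip(xs, ys)) / n

def loss_slope(m):
    """Analytic derivative L'(m) = -(2/n) * sum of x_i * (y_i - m*x_i)."""
    return -2 * sum(x * (y - m * x) for x, y in zip(xs, ys)) / n

def central_difference(f, m, h=1e-5):
    """Numerical slope estimate (f(m+h) - f(m-h)) / (2h)."""
    return (f(m + h) - f(m - h)) / (2 * h)

# Closed-form minimizer, where the slope should vanish.
m_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

for m in (0.0, 1.0, m_star, 3.0):
    print(f"m = {m:.4f}   analytic L' = {loss_slope(m):+.4f}   "
          f"numerical L' = {central_difference(loss, m):+.4f}")
```

The two columns should agree to the printed precision, with a slope of $-43.88$ at $m = 0$, essentially zero at $m^\star$, and $+22.12$ at $m = 3$: negative to the left of the minimum, positive to the right.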
Now do it the machine's way
The closed form was available only because $L$ was a quadratic. Suppose we had no formula and could only evaluate $L$ and its slope $L'$ at any point we choose. Gradient descent (§6.8) thrives in exactly that situation. We pick a starting slope, read the local derivative, and step against it:
$$m_{n+1} = m_n - \alpha\, L'(m_n).$$
```python
# Uses Lprime from the block above.
m = 0.0        # start with a deliberately wrong guess
alpha = 0.01   # learning rate
for step in range(2000):
    m = m - alpha * Lprime(m)   # step against the slope (§6.8)
print(f"gradient-descent m = {m:.6f}")   # 1.994545
```
Starting from the badly wrong guess $m = 0$, the iterates move steadily down the wall of the bowl and settle on the very same $1.9945$. The derivative did all the navigating: at $m = 0$ the slope is strongly negative, $L'(0) = -43.88$ (the bowl falls to the right), so the update adds to $m$ and moves it rightward toward the minimum; as $m$ approaches $1.9945$ the slope flattens toward zero and the steps shrink to nothing. The procedure is self-correcting, exactly as §6.8 promised.
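The shrinking steps are easy to watch directly. This plain-Python sketch prints the first eight iterates with the slope and the step taken at each one, then reports how much of the remaining gap to $m^\star$ survives each step. The variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.1, 8.0, 9.9]

def loss_slope(m):
    """Analytic derivative L'(m) of the mean squared error."""
    return -2 * sum(x * (y - m * x) for x, y in zip(xs, ys)) / len(xs)

m_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

m, alpha = 0.0, 0.01
gaps = [abs(m - m_star)]          # distance to the minimum before each step
print(" n      m_n        L'(m_n)    step")
for n in range(8):
    slope = loss_slope(m)
    step = -alpha * slope         # step against the slope
    print(f"{n:2d}   {m:.6f}   {slope:+10.4f}   {step:+.6f}")
    m += step
    gaps.append(abs(m - m_star))

# Fraction of the gap that remains after each step.
ratios = [after / before for before, after in zip(gaps, gaps[1:])]
print("gap ratio per step:", [round(r, 4) for r in ratios])
```

Every ratio comes out at about $0.78$, which is $1 - \alpha L'' = 1 - 0.01 \cdot 22$: each step removes 22% of whatever distance is left, so the steps shrink geometrically.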
Why bother, when the formula was right there? Two reasons, and they are the reasons gradient descent rules modern machine learning. First, closed forms run out fast. Replace the one-parameter line $y = mx$ with a neural network carrying millions of parameters, and $L'(\,\cdot\,) = 0$ becomes a system of millions of hopelessly tangled nonlinear equations with no algebraic solution. Gradient descent never needs one; it only needs to evaluate the slope and step. Second, the work inside each step parallelizes. The steps themselves must run one after another, but within a step the slopes for different parameters and different data points can be computed side by side, which is the kind of arithmetic GPUs are built for. That is how billion-parameter models are trained at all.
Watching the descent
It helps to see the bowl and the path the algorithm traces down its wall.
```python
import numpy as np
import matplotlib.pyplot as plt

# Reuses L, Lprime, and m_star from the earlier blocks.
ms = np.linspace(-1, 5, 200)
plt.plot(ms, [L(m) for m in ms], 'b-', lw=2, label='loss $L(m)$')

# Record the first 60 gradient-descent iterates.
m, alpha, path = 0.0, 0.01, [0.0]
for _ in range(60):
    m = m - alpha * Lprime(m)
    path.append(m)

plt.plot(path, [L(m) for m in path], 'ro-', ms=4, label='descent path')
plt.axvline(m_star, color='gray', ls='--', label=f'$m^*={m_star:.3f}$')
plt.xlabel('slope $m$')
plt.ylabel('loss')
plt.legend()
plt.title('Gradient descent rolling to the bottom of the loss bowl')
plt.grid(True, alpha=0.3)
plt.show()
```
Figure CS1.1 — The loss $L(m)$ is an upward parabola. Gradient descent, starting at $m = 0$, takes large steps where the wall is steep (large $|L'|$) and ever-smaller steps as it nears the flat bottom, converging on $m^\star \approx 1.9945$.
From one parameter to a billion
The leap to real machine learning is conceptually small. A neural network is a function with parameters $w_1, w_2, \ldots, w_N$ — for a large language model, $N$ runs into the hundreds of billions. Its loss $L(w_1, \ldots, w_N)$ measures how badly the network predicts its training data. There is no formula for the minimizing parameters. But for each parameter $w_i$ we can compute the partial derivative $\partial L / \partial w_i$ — the slope of the loss with respect to that one knob, holding the others fixed (the gradient of Chapter 30) — and nudge it:
$$w_i \leftarrow w_i - \alpha\,\frac{\partial L}{\partial w_i}.$$
That is identical to the update we just ran for $m$, applied to every parameter at once. The vector of all the partials is the gradient $\nabla L$, the multivariable heir to our single $L'$. The slopes themselves are computed by automatic differentiation, a mechanical application of the chain rule (Chapter 7) through every operation in the network — the engine inside PyTorch, TensorFlow, and JAX.
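To make "a mechanical application of the chain rule" concrete, here is a toy sketch of one flavor of automatic differentiation, often called forward mode with dual numbers. Each value carries its derivative along, and every arithmetic operation updates both. The `Dual` class is a hypothetical teaching device written for this example; it is not how PyTorch, TensorFlow, or JAX are implemented, and for networks those libraries mostly apply the chain rule in the reverse direction (backpropagation).

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        """Treat a plain number as a constant (derivative zero)."""
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

    def __truediv__(self, number):
        # Only division by a plain number is needed in this example.
        return Dual(self.value / number, self.deriv / number)


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.1, 8.0, 9.9]

def loss(m):
    """Mean squared error; works on plain floats and on Dual numbers alike."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - m * x
        total = total + residual * residual
    return total / len(xs)

result = loss(Dual(0.0, 1.0))   # seed: the derivative of m with respect to m is 1
print(f"L(0) = {result.value:.4f}, L'(0) = {result.deriv:.4f}")
# L(0) = 43.7680, L'(0) = -43.8800
```

Nobody differentiated `loss` by hand here: the slope $-43.88$ fell out of running the ordinary loss code on numbers that know the sum and product rules.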
```python
# Conceptual sketch of a real training loop (not runnable as written):
for epoch in range(num_epochs):
    for batch in training_data:
        loss = compute_loss(network, batch)
        grad = autodiff(loss, network.parameters)   # all partials at once
        for w in network.parameters:
            w -= learning_rate * grad[w]            # the §6.8 update, N times
```
Run that loop for hours across a cluster of GPUs and the parameters drift into a configuration that makes the loss small. Every chatbot, image generator, and recommendation engine you have ever used was shaped by this single idea: the derivative tells you which way is downhill, so step that way and repeat.
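The same loop can be run in miniature. The sketch below keeps the epoch/batch/parameter structure but swaps in stand-ins chosen for this example: the "network" is the two-parameter line $y = mx + b$, the gradient is written out by hand instead of coming from autodiff, and the training data is a single batch holding all five points so the result is deterministic. Real training would use many small batches and a far larger model.

```python
# Hypothetical stand-ins for the pieces of a training loop.
points = [(1, 2.1), (2, 3.9), (3, 6.1), (4, 8.0), (5, 9.9)]
training_data = [points]             # one full batch; real loops use many small ones
parameters = {"m": 0.0, "b": 0.0}    # the model y = m*x + b
learning_rate = 0.01
num_epochs = 5000

def compute_loss(params, batch):
    """Mean squared error of y = m*x + b on one batch."""
    return sum((y - params["m"] * x - params["b"]) ** 2 for x, y in batch) / len(batch)

def compute_gradient(params, batch):
    """Partial derivatives of the batch loss with respect to m and b."""
    residuals = [(x, y - params["m"] * x - params["b"]) for x, y in batch]
    return {
        "m": -2 * sum(x * r for x, r in residuals) / len(batch),
        "b": -2 * sum(r for _, r in residuals) / len(batch),
    }

for epoch in range(num_epochs):
    for batch in training_data:
        grad = compute_gradient(parameters, batch)
        for name in parameters:
            # The one-variable update, applied to each parameter in turn.
            parameters[name] -= learning_rate * grad[name]

print(f"m = {parameters['m']:.4f}, b = {parameters['b']:.4f}, "
      f"loss = {compute_loss(parameters, points):.4f}")
```

With these settings the parameters settle near $m \approx 1.97$ and $b \approx 0.09$, the least-squares line with an intercept for this data.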
A caution about the learning rate
The one tuning knob, $\alpha$, is delicate. On our convex bowl, each step multiplies the gap to the minimum by the fixed factor $1 - \alpha L'' = 1 - 22\alpha$. Too small an $\alpha$ makes that factor near $1$ and convergence crawls, while too large an $\alpha$ overshoots the bottom and can send the iterates flying outward (§6.8's warning). Try alpha = 0.1 in the code above: the factor becomes $-1.2$, so every step lands on the opposite wall, farther from the bottom than before, and the iterates diverge. Any $\alpha$ above $2/22 \approx 0.09$ does the same. Choosing and adapting $\alpha$ (the job of optimizers like Adam and RMSprop) is one of the central practical arts of machine learning, and it is governed by the local behavior of the derivative: how steep the slope is and how quickly it changes.
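The effect of the learning rate can be checked numerically. The sketch below runs 50 steps from $m = 0$ for four illustrative values of $\alpha$ and prints the per-step factor, computed as $1 - \alpha L''$ with $L'' = 22$ for this data, next to the distance from $m^\star$ that remains. The function name `final_gap` and the particular values of $\alpha$ are choices made for this example.

```python
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.1, 8.0, 9.9]

def loss_slope(m):
    """Analytic derivative L'(m) of the mean squared error."""
    return -2 * sum(x * (y - m * x) for x, y in zip(xs, ys)) / len(xs)

m_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
curvature = 2 * sum(x * x for x in xs) / len(xs)   # L''(m) = 22 for this data

def final_gap(alpha, steps=50, start=0.0):
    """Distance from m* after a fixed number of gradient-descent steps."""
    m = start
    for _ in range(steps):
        m -= alpha * loss_slope(m)
    return abs(m - m_star)

for alpha in (0.001, 0.01, 0.08, 0.1):
    factor = 1 - alpha * curvature
    print(f"alpha = {alpha:<6} factor = {factor:+.3f}   "
          f"gap after 50 steps = {final_gap(alpha):.3e}")
```

The smallest rate is still far from the minimum after 50 steps, the two middle rates converge (the larger one by hopping from wall to wall, since its factor is negative), and $\alpha = 0.1$ ends thousands of units away.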
Discussion Questions
- Our $L(m)$ was convex (one bowl, one bottom). Real neural-network losses are non-convex with many local minima. Why does gradient descent still work well in practice even though it can only find a nearby minimum?
- Production training uses stochastic gradient descent — each step uses the slope estimated from a single batch rather than the whole dataset. What is gained, and what is the cost?
- Verify by hand that $L''(m) > 0$ for our data. Why does this single fact guarantee gradient descent cannot get trapped at a maximum or saddle here?
- The "vanishing gradient problem" stalled early deep networks: in some layers $\partial L/\partial w_i$ shrank to nearly zero, freezing those parameters. Using §6.8, explain why a near-zero derivative halts learning.
- Suppose your measurements were far noisier, so the five points scattered widely. Would $m^\star$ change? Would $L(m^\star)$? Which one reports the quality of the fit?
Your Turn — Mini-Project
Extend the model from $y = mx$ to $y = mx + b$, which has two parameters. The loss $L(m, b)$ is now a function of two variables, so gradient descent needs both partial derivatives, estimated numerically:
```python
import numpy as np

xs = np.array([1, 2, 3, 4, 5])
ys = np.array([2.5, 3.9, 6.1, 8.5, 10.3])

def L(m, b): return np.mean((ys - m * xs - b) ** 2)

def grad(m, b, h=1e-5):
    dm = (L(m + h, b) - L(m - h, b)) / (2 * h)   # symmetric difference (§6.2)
    db = (L(m, b + h) - L(m, b - h)) / (2 * h)
    return dm, db

m, b = 0.0, 0.0
for _ in range(20000):
    dm, db = grad(m, b)
    m -= 0.01 * dm
    b -= 0.01 * db
print(f"m = {m:.4f}, b = {b:.4f}")
```
Compare your $(m, b)$ to the closed-form least-squares solution (any statistics text gives the formulas). You have just performed gradient descent in two dimensions — a baby step toward Chapter 30, where the same idea scales to networks of arbitrary size.
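For the comparison, one standard form of the closed-form solution is $m = S_{xy}/S_{xx}$ and $b = \bar{y} - m\bar{x}$, where $S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum_i (x_i - \bar{x})^2$. A minimal plain-Python sketch, with variable names chosen here for illustration:

```python
xs = [1, 2, 3, 4, 5]
ys = [2.5, 3.9, 6.1, 8.5, 10.3]
n = len(xs)

x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Slope: co-variation of x and y divided by the variation of x.
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)

m_closed = s_xy / s_xx
b_closed = y_mean - m_closed * x_mean   # the fitted line passes through the means

print(f"closed-form m = {m_closed:.4f}, b = {b_closed:.4f}")
# closed-form m = 2.0200, b = 0.2000
```

If the gradient-descent loop has run long enough, its $(m, b)$ should match these values to about four decimal places.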
Further Reading
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press (free at deeplearningbook.org). Chapter 4 motivates numerical optimization; Chapter 8 is the definitive treatment of gradient-based training. Start here after Chapter 30.
- Ruder, S. (2017). "An overview of gradient descent optimization algorithms," arXiv:1609.04747. A readable survey of the learning-rate variants (momentum, Adam, RMSprop) hinted at above.
- Karpathy, A. "Neural Networks: Zero to Hero" (YouTube). Builds gradient descent and backpropagation from scratch in Python — the best way to feel what this case study describes.
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press (free online). The rigorous account of why gradient descent converges cleanly on convex bowls like our $L(m)$.
Gradient descent is arguably the most consequential algorithm of 21st-century applied mathematics, and it is, at heart, one line of high-school algebra applied a great many times in a row. The derivative as a function, the central idea of this chapter, is what makes that line point the right way.