Case Study 2 — Fitting a Model by Gradient Descent, End-to-End (Data Science Track)

A complete worked portfolio for the Data Science track: a model fit to data by gradient descent, validated against the exact solution, then extended to a nonlinear fit. Every tool is traced home. Use it as the template for your own Data-Science-track portfolio.

1. The problem

A logistics analyst has 50 delivery records: for each, the distance driven $x_i$ (in km) and the delivery time $y_i$ (in minutes). She wants a model that predicts time from distance — and, more importantly, she wants to build it the way every modern machine-learning system is built: by gradient descent, the data-science anchor of this book (introduced in Chapter 6 as "the derivative is a direction to step," matured in Chapter 30). No one hands her the coefficients; she will learn them.

We walk the full modeling cycle of §39.2.

2. The assumptions

A linear relationship. Time rises roughly linearly with distance: $\hat y = mx + c$, slope $m$ (minutes per km) and intercept $c$ (fixed handling time). This is the central modeling choice and the first thing to question later.
Additive, roughly symmetric noise. Real times scatter around the line from traffic and chance; we model that scatter as mean-zero noise with roughly constant spread. Under that assumption (and exactly so if the noise is Gaussian), squared error is the natural thing to minimize.
Independent records. Each delivery is its own data point.

As always, these are false in detail and useful in aggregate — exactly what §39.2 demands we state up front.

3. The model and the loss

We want the line that fits best. "Best" means smallest mean squared error (MSE):

$$L(m, c) = \frac{1}{n}\sum_{i=1}^{n}\big(m x_i + c - y_i\big)^2.$$

This loss is a function of two parameters $(m, c)$. Notice it is a sum — an accumulation of error, the integral idea from Chapter 13 applied to discrete data. Squaring makes every error positive and punishes large misses more than small ones; it also makes $L$ a smooth, convex bowl with a single lowest point.

To find that point we go downhill. The direction of steepest descent is the negative gradient (Chapter 30), whose components are partial derivatives (Chapter 29) computed by the chain rule (Chapter 7):

$$\frac{\partial L}{\partial m} = \frac{2}{n}\sum_{i=1}^{n}\big(m x_i + c - y_i\big)x_i, \qquad \frac{\partial L}{\partial c} = \frac{2}{n}\sum_{i=1}^{n}\big(m x_i + c - y_i\big).$$
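The chain-rule step behind these formulas can be written out for a single term. This is a sketch; the residual $r_i$ is shorthand introduced here, not notation from the chapter.

$$r_i = m x_i + c - y_i, \qquad \frac{\partial}{\partial m}\,r_i^2 = 2 r_i\,\frac{\partial r_i}{\partial m} = 2 r_i x_i, \qquad \frac{\partial}{\partial c}\,r_i^2 = 2 r_i\,\frac{\partial r_i}{\partial c} = 2 r_i.$$

Averaging each result over the $n$ records gives the two partial derivatives stated above.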

Gradient descent then steps the parameters against the gradient, scaled by a learning rate $\eta$:

$$m \leftarrow m - \eta\,\frac{\partial L}{\partial m}, \qquad c \leftarrow c - \eta\,\frac{\partial L}{\partial c}.$$

Every piece is calculus: the loss accumulates error (Chapter 13), its gradient is the vector of partials (Chapter 30), and each update is a linear-approximation step — the tangent-line idea from Chapter 11 — that improves the fit a little.
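A quick way to catch a mistake in hand-derived partials is to compare them against finite differences. The sketch below does that for the MSE of a line; the helper names (`mse_loss`, `mse_grad`, `numeric_grad`), the four demo points, and the step size `h` are all assumptions made for illustration.

```python
import numpy as np

def mse_loss(m, c, x, y):
    """Mean squared error of the line y_hat = m*x + c."""
    return np.mean((m * x + c - y) ** 2)

def mse_grad(m, c, x, y):
    """Analytic partials (dL/dm, dL/dc) from the chain rule."""
    err = m * x + c - y
    return 2.0 * np.mean(err * x), 2.0 * np.mean(err)

def numeric_grad(m, c, x, y, h=1e-6):
    """Central finite differences: nudge one parameter at a time."""
    dm = (mse_loss(m + h, c, x, y) - mse_loss(m - h, c, x, y)) / (2 * h)
    dc = (mse_loss(m, c + h, x, y) - mse_loss(m, c - h, x, y)) / (2 * h)
    return dm, dc

# Tiny made-up data set, used only to exercise the two gradient routines.
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([3.1, 4.9, 7.2, 8.8])

analytic = mse_grad(0.5, 0.2, x_demo, y_demo)
numeric = numeric_grad(0.5, 0.2, x_demo, y_demo)
print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two pairs disagree by more than rounding error, the derivation (or its code) has a bug. The same check carries over to any loss whose partials are derived by hand.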

4. Implementing it from scratch

# Fit a line by gradient descent; confirm it matches the exact least-squares solution.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, x.size)   # true slope 2, intercept 1, + noise

m, c, eta, n = 0.0, 0.0, 0.005, x.size
for _ in range(20000):
    err = (m * x + c) - y
    grad_m = (2/n) * np.sum(err * x)   # dL/dm
    grad_c = (2/n) * np.sum(err)       # dL/dc
    m -= eta * grad_m                  # step downhill
    c -= eta * grad_c
print(f"Gradient descent:  m = {m:.4f}, c = {c:.4f}")   # m = 2.174, c = 0.323

Starting from $(m, c) = (0, 0)$, after 20,000 small steps the parameters settle at

$$m \approx 2.174 \ \text{min/km}, \qquad c \approx 0.323 \ \text{min}.$$

The analyst's model: a delivery takes about 2.17 minutes per kilometer plus a third of a minute of fixed overhead. The fitted slope is not exactly the "true" 2.0 because this particular noise sample shifts the optimum slightly — gradient descent correctly finds the minimizer of the loss on the data it has, whatever that is.

5. Validation

Independent cross-check against the exact solution. A linear least-squares fit has a closed form (solve the normal equations with linear algebra). It must give the same answer gradient descent crawled toward:

A = np.vstack([x, np.ones_like(x)]).T
m_ls, c_ls = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"Least squares:     m = {m_ls:.4f}, c = {c_ls:.4f}")  # m = 2.174, c = 0.323

Both give $m \approx 2.174$, $c \approx 0.323$. The iterative learner and the one-shot linear-algebra solution agree to the precision shown. This is genuine validation: the loss is convex with a single minimum, so gradient descent with a small enough learning rate must approach it. Two independent routes lead to the same point, the same standard the SIR final-size relation set in Case Study 1.
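The convexity argument can also be checked numerically. The sketch below regenerates the same synthetic data, computes the textbook closed-form slope and intercept for a line, and confirms two things: the gradient vanishes at that point, and the Hessian of the MSE has strictly positive eigenvalues. The variable names (`m_cf`, `c_cf`, `grad_at_opt`, `H`, `eigs`) are introduced here for illustration.

```python
import numpy as np

# Same synthetic data as the from-scratch fit.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, x.size)
n = x.size

# Closed-form least squares for a line: slope = cov(x, y) / var(x).
x_bar, y_bar = x.mean(), y.mean()
m_cf = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
c_cf = y_bar - m_cf * x_bar

# At the minimizer both partial derivatives should be zero up to rounding.
err = m_cf * x + c_cf - y
grad_at_opt = np.array([(2 / n) * np.sum(err * x), (2 / n) * np.sum(err)])

# Hessian of the MSE; for a line it does not depend on (m, c).
H = (2 / n) * np.array([[np.sum(x ** 2), np.sum(x)],
                        [np.sum(x), n]])
eigs = np.linalg.eigvalsh(H)

print(f"closed form: m = {m_cf:.4f}, c = {c_cf:.4f}")
print("gradient at optimum:", grad_at_opt)
print("Hessian eigenvalues:", eigs)
```

Positive eigenvalues everywhere mean the bowl curves upward in every direction, which is why there is only one place for the descent to end up.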

Cross-validation (honesty). Fitting and testing on the same data flatters the model. We hold out 20% of the records, fit on the remaining 80%, and measure error on the unseen 20%; this single hold-out split is the simplest form of cross-validation. If the held-out error is close to the training error, the model generalizes. A large gap would signal overfitting (§39.8): the model memorized noise instead of learning signal. For a line on roughly linear data the two errors stay close, which is the sign of an honest fit.
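The 80/20 split described above can be sketched as follows. The helper names (`fit_line`, `mse`) and the random shuffle are assumptions for illustration; the fit uses `np.linalg.lstsq` for brevity, since it reaches the same minimizer as the gradient-descent loop.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, x.size)

# Shuffle the record indices, then hold out the first 20% for testing.
idx = rng.permutation(x.size)
n_test = x.size // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

def fit_line(x, y):
    """Least-squares slope and intercept for y_hat = m*x + c."""
    A = np.vstack([x, np.ones_like(x)]).T
    m, c = np.linalg.lstsq(A, y, rcond=None)[0]
    return m, c

def mse(m, c, x, y):
    """Mean squared error of the line on the given records."""
    return np.mean((m * x + c - y) ** 2)

# Fit on the training records only; score on both subsets.
m_tr, c_tr = fit_line(x[train_idx], y[train_idx])
train_mse = mse(m_tr, c_tr, x[train_idx], y[train_idx])
test_mse = mse(m_tr, c_tr, x[test_idx], y[test_idx])
print(f"train MSE = {train_mse:.3f}, held-out MSE = {test_mse:.3f}")
```

With only 10 held-out records the test error is itself noisy, so a modest gap in either direction is not alarming; repeating the split with different shuffles gives a steadier picture.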

Common Pitfall — the learning rate. $\eta$ is the most error-prone knob in gradient descent (§39.6.2). Too small and the loop needs millions of iterations; too large and the steps overshoot, the loss diverges, and the parameters explode to nan. The cure: scale the features (so both partials have comparable magnitude) and reduce $\eta$ until the loss decreases monotonically. There is no universally correct $\eta$.
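The pitfall is easy to reproduce. The sketch below runs the same update with the case study's $\eta = 0.005$, with a rate ten times larger, and then on a standardized feature. The function `run_gd`, the values $\eta = 0.05$ and $\eta = 0.1$, and the step counts are assumptions chosen for illustration.

```python
import numpy as np

def run_gd(x, y, eta, steps):
    """Plain gradient descent on the MSE of a line; returns (m, c, loss history)."""
    m, c, n = 0.0, 0.0, x.size
    losses = np.empty(steps)
    for t in range(steps):
        err = m * x + c - y
        losses[t] = np.mean(err ** 2)          # loss before this step
        m -= eta * (2 / n) * np.sum(err * x)
        c -= eta * (2 / n) * np.sum(err)
    return m, c, losses

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, x.size)

# Learning rate from the case study: the loss should fall at every step.
_, _, loss_small = run_gd(x, y, eta=0.005, steps=50)

# Ten times larger: each step overshoots and the loss grows.
_, _, loss_large = run_gd(x, y, eta=0.05, steps=50)

# Standardize x (zero mean, unit spread) so a much larger eta is safe.
x_mean, x_std = x.mean(), x.std()
x_s = (x - x_mean) / x_std
m_s, c_s, loss_scaled = run_gd(x_s, y, eta=0.1, steps=200)

# Convert the fitted parameters back to the original units of x.
m_back = m_s / x_std
c_back = c_s - m_s * x_mean / x_std

print("eta=0.005 loss falls every step:", bool(np.all(np.diff(loss_small) < 0)))
print("eta=0.05 first/last loss:", loss_large[0], loss_large[-1])
print(f"scaled fit after 200 steps: m = {m_back:.4f}, c = {c_back:.4f}")
```

On the standardized feature the two partials have comparable curvature, so a few hundred steps do the work that takes thousands on the raw distances.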

6. Extension: a nonlinear fit (the same tools, harder loss)

Suppose the analyst's deliveries instead show saturation — beyond some distance, time levels off because long routes use highways. A line cannot capture that; a logistic curve can:

$$\hat y(x) = \frac{A}{1 + e^{-k(x - x_0)}},$$

with ceiling $A$, steepness $k$, and midpoint $x_0$. The loss is still MSE, but now over three parameters $\theta = (A, k, x_0)$, and the loss surface is no longer a convex bowl. The recipe is identical: derive $\partial L/\partial A$, $\partial L/\partial k$, $\partial L/\partial x_0$ by the chain rule (Chapter 7), then run the same update $\theta \leftarrow \theta - \eta\,\nabla L$. The fit returns the saturation level $A$ (the maximum predicted time) and the inflection point at $x = x_0$ (where time rises fastest), each read straight off the fitted parameters: the derivatives of Chapter 6 made concrete. The lesson: the same gradient step solves a problem whose answer has no closed form at all.
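A minimal sketch of that recipe follows. Everything specific in it is an assumption made for illustration: the synthetic data (ceiling 30, steepness 1, midpoint 4, noise of 0.5 minutes), the starting guess, the learning rate, and the step count. The three partials are the chain-rule results for the logistic curve.

```python
import numpy as np

def logistic(x, A, k, x0):
    """Saturating curve A / (1 + exp(-k (x - x0)))."""
    return A / (1.0 + np.exp(-k * (x - x0)))

def loss_and_grad(theta, x, y):
    """MSE of the logistic fit and its gradient with respect to (A, k, x0)."""
    A, k, x0 = theta
    s = 1.0 / (1.0 + np.exp(-k * (x - x0)))    # sigmoid part, between 0 and 1
    r = A * s - y                              # residuals
    ds = s * (1.0 - s)                         # derivative of the sigmoid in its argument
    grad = np.array([
        2.0 * np.mean(r * s),                  # dL/dA
        2.0 * np.mean(r * A * ds * (x - x0)),  # dL/dk
        2.0 * np.mean(r * A * ds * (-k)),      # dL/dx0
    ])
    return np.mean(r ** 2), grad

# Hypothetical saturating delivery data (not from the case study).
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = logistic(x, 30.0, 1.0, 4.0) + rng.normal(0, 0.5, x.size)

# Start from a rough guess: ceiling near the largest observation,
# gentle steepness, midpoint at the middle of the distance range.
theta = np.array([y.max(), 0.5, x.mean()])
eta = 0.001

loss_start, _ = loss_and_grad(theta, x, y)
for _ in range(30000):
    _, grad = loss_and_grad(theta, x, y)
    theta = theta - eta * grad                 # theta <- theta - eta * grad(L)
loss_end, _ = loss_and_grad(theta, x, y)

A_fit, k_fit, x0_fit = theta
print(f"loss: {loss_start:.3f} -> {loss_end:.3f}")
print(f"A = {A_fit:.3f}, k = {k_fit:.3f}, x0 = {x0_fit:.3f}")
```

Because the loss is not convex, the starting guess matters: a start far from the data (for example a midpoint well outside the observed distances) can stall where the sigmoid is flat and the gradient is nearly zero. Trying several starting points and keeping the lowest final loss is a simple safeguard.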

Real-World Application — Training every ML model. This three-line update is how neural networks, logistic regressions, and large language models are trained. In a deep network the loss is non-convex, the gradient is computed by backpropagation (the multivariable chain rule of Chapter 30, applied layer by layer), and the optimizer is a refined gradient descent (momentum, Adam). Strip a billion-parameter model to its core and you find exactly this: parameter $\leftarrow$ parameter $- \eta\,\nabla(\text{loss})$.

7. The tools, traced home

Tool	Where it appears here	Home chapter
Derivative (rate, chain rule)	each $\partial L/\partial(\cdot)$	6, 7
Integral (accumulation)	the loss as a sum of errors	13
Gradient (steepest descent)	the learning step $-\eta\nabla L$	30
Linear approximation	each tangent-line update	11
Optimization (the minimum)	the converged best fit	31
Series / Taylor	smoothness of the logistic / activations	23

8. Limitations (stated plainly)

The linear model assumes the relationship is a line (rarely exactly true — hence the logistic extension); MSE assumes symmetric noise and is sensitive to outliers (one bad GPS reading drags the fit); gradient descent on a non-convex loss can land in a poor local minimum depending on the starting point and $\eta$; and 50 records is a small sample. The fitted slope carries real uncertainty that a single number hides. State the assumptions, validate on held-out data, and report uncertainty — the §39.8 discipline.
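The chapter does not prescribe a method for reporting uncertainty. One common option, swapped in here as an assumption, is a percentile bootstrap: refit the line on many resampled copies of the data and look at the spread of the slopes. The helper `fit_slope`, the 2000 resamples, and the 95% level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, x.size)

def fit_slope(x, y):
    """Least-squares slope of the line y_hat = m*x + c."""
    A = np.vstack([x, np.ones_like(x)]).T
    return np.linalg.lstsq(A, y, rcond=None)[0][0]

# Resample the 50 records with replacement and refit each time.
n_boot = 2000
slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, x.size, size=x.size)
    slopes[b] = fit_slope(x[idx], y[idx])

# The middle 95% of the resampled slopes gives a rough interval.
lo, hi = np.percentile(slopes, [2.5, 97.5])
m_hat = fit_slope(x, y)
print(f"slope = {m_hat:.3f}, 95% bootstrap interval = [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval alongside the point estimate makes the "real uncertainty" visible instead of hiding it behind a single number.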

9. Conclusion (the punchline first)

Delivery time rises about 2.17 minutes per kilometer with a small fixed overhead — a model learned, not given, and confirmed to match the exact least-squares optimum. Behind that sentence sits the entire data-science toolkit: a loss that accumulates error (integral), a gradient that points downhill, a chain rule that computes it, and an optimization that the descent reaches. The same machine, scaled up, trains every model in modern AI. That is the data-science track's proof that you can do calculus, not just recite it.

Discussion Questions

Gradient descent and the closed-form least-squares solution gave the same answer. Why did we bother with the slow iterative method when a one-line formula exists?
The fitted slope was 2.174, not the "true" 2.0. Is the model wrong? Explain what gradient descent actually minimized.
For the logistic extension the loss is non-convex. What can go wrong that cannot go wrong for the linear fit, and how would you guard against it?
Your training error is tiny but held-out error is large. Name the problem and two ways to fix it.

Short Annotated Reading

Boyd & Vandenberghe, Convex Optimization (free PDF). Why convexity guarantees gradient descent reaches the global minimum — the theory behind the linear-fit validation here.
Goodfellow, Bengio & Courville, Deep Learning (free online), Ch. 4 & 6. Gradient-based optimization and backpropagation; connects this case study's three-line update to full neural networks.
James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning (free PDF). Accessible treatment of least squares, the bias–variance tradeoff, and cross-validation: the validation discipline of §5 above.