Case Study 2 — The Loss Surface of a Machine Learning Model

Field: Data science and machine learning

When a neural network "learns," what it is really doing is walking downhill on a surface — a surface that lives in a space of as many dimensions as the model has parameters. That surface is the graph of a function of several variables, its low point is the trained model, and the steps the algorithm takes are computed from partial derivatives. This case study strips the idea down to its smallest honest example, a model with just two parameters, so that the loss surface is a graph we can actually draw and the partial derivatives are quantities we can compute by hand. Everything here scales — unchanged in principle — to the millions of parameters of a modern network. This is the anchor example introduced back in Chapter 6 ("the derivative tells you which way to step"), now reaching its multivariable form.

The model and its loss

Maya, a data scientist, is fitting the simplest possible predictive model: a straight line

$$\hat{y} = w_0 + w_1 x,$$

where $w_0$ is the intercept and $w_1$ the slope. Her training data is two points, $(x, y) = (1, 2)$ and $(x, y) = (2, 2)$. For any choice of the parameters $(w_0, w_1)$ the model makes a prediction at each $x$, and the loss measures how badly those predictions miss the true $y$ values. Maya uses the standard mean-squared-error loss, written here with a convenient factor of $\tfrac12$:

$$L(w_0, w_1) = \tfrac12\Big[(w_0 + w_1 \cdot 1 - 2)^2 + (w_0 + w_1 \cdot 2 - 2)^2\Big].$$

Read this carefully: $L$ is a function of the parameters, not of the data. The data $(1,2)$ and $(2,2)$ are baked-in constants; the inputs that vary are $w_0$ and $w_1$. This is the crucial reframing of Section 29.1 — a function of several variables sends a point $(w_0, w_1)$ of parameter space to a single number, the loss. Its graph is a surface floating above the $(w_0, w_1)$-plane, and "training the model" means finding the lowest point of that surface.
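The reframing can be made concrete in a few lines. The following is a minimal sketch for illustration, not code from the text: the names `DATA` and `loss` are assumptions introduced here. The data are fixed constants, and the only arguments are the two parameters.

```python
# Training data from the case study, held fixed as constants.
DATA = [(1.0, 2.0), (2.0, 2.0)]  # (x, y) pairs


def loss(w0, w1):
    """Half the sum of squared errors of the line y_hat = w0 + w1 * x."""
    return 0.5 * sum((w0 + w1 * x - y) ** 2 for x, y in DATA)


print(loss(0.0, 0.0))  # 4.0: the all-zero model misses both points by 2
print(loss(1.0, 1.0))  # 0.5: y_hat = 1 + x hits the first point, misses the second by 1
```

Each call sends one point of parameter space to one number, which is the height of the loss surface above that point.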

Each squared term on its own is a parabolic trough that opens upward, and the two troughs run in different directions across the $(w_0, w_1)$-plane, so their sum curves upward in every direction. The loss surface is therefore a bowl: a paraboloid, the very first non-trivial surface in the catalog of Section 29.3. Maya knows in advance, from the geometry, that a bowl has exactly one lowest point and no saddles or false minima. That is precisely why mean-squared-error loss with a linear model is the gentlest possible training problem, and why it is the right place to learn the machinery.
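Expanding the two squares makes the bowl shape explicit. The expansion below is a worked check added here for illustration, and it uses nothing beyond the loss already defined:

$$L(w_0, w_1) = w_0^2 + 3 w_0 w_1 + \tfrac52 w_1^2 - 4 w_0 - 6 w_1 + 4.$$

The quadratic part equals $\tfrac12\big[(w_0 + w_1)^2 + (w_0 + 2w_1)^2\big]$. It is positive unless $w_0 + w_1$ and $w_0 + 2w_1$ both vanish, which happens only at $(0, 0)$. That positivity is what makes the surface a bowl and not a trough or a saddle.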

The contour map: a target Maya can read

Before computing anything, Maya plots the level curves $L(w_0, w_1) = c$ (Section 29.4). For a bowl, the level curves $L = c$ are nested closed loops — here, tilted ellipses — shrinking toward the single point at the bottom of the bowl. The contour map looks like a target, and the bullseye is the optimal parameter setting.

The reading rules of Section 29.4 give Maya an immediate diagnostic. Tightly spaced contours mark directions in which the loss changes fast — parameters the model is sensitive to. Widely spaced contours mark sloppy directions where the loss barely moves. The fact that the ellipses are tilted (their axes not aligned with the $w_0$- and $w_1$-axes) tells her the two parameters are correlated in their effect on the loss — a fact that will, in a larger model, govern how fast training converges. A contour map of a loss surface is the single most useful diagnostic picture in applied machine learning, and it is nothing more than the level-curve idea of this chapter.
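The contour-reading rules can be checked numerically without drawing anything. The sketch below is an illustration under stated assumptions: the helper `loss_at_offset` and the constant `CENTER` are names introduced here, and the bullseye is taken to be $(2, 0)$, where the flat line $\hat y = 2$ passes through both data points and the loss is zero. It compares the loss after a unit-length step from the bullseye in four directions.

```python
import math

DATA = [(1.0, 2.0), (2.0, 2.0)]  # (x, y) pairs from the case study


def loss(w0, w1):
    """Half the sum of squared errors of the line y_hat = w0 + w1 * x."""
    return 0.5 * sum((w0 + w1 * x - y) ** 2 for x, y in DATA)


def loss_at_offset(center, direction, distance=1.0):
    """Loss after moving `distance` from `center` along `direction` (normalized here)."""
    norm = math.hypot(*direction)
    w0 = center[0] + distance * direction[0] / norm
    w1 = center[1] + distance * direction[1] / norm
    return loss(w0, w1)


# Assumed bullseye: the flat line y_hat = 2 fits both points, so the loss there is zero.
CENTER = (2.0, 0.0)

for name, direction in [("w0-axis", (1, 0)), ("w1-axis", (0, 1)),
                        ("diagonal (1, 1)", (1, 1)), ("diagonal (1, -1)", (1, -1))]:
    print(f"{name:>17}: loss = {loss_at_offset(CENTER, direction):.3f}")
```

Equal-length steps cost 3.25 along $(1, 1)$ but only 0.25 along $(1, -1)$, while the two axis directions give 1.0 and 2.5. Neither axis direction produces the largest or the smallest value, which is what tilted ellipses look like numerically.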

The partial derivatives: the heart of training

Now the central computation. Gradient descent updates each parameter by stepping against the rate at which the loss increases in that parameter's direction — and that rate is a partial derivative (Section 29.8). Maya needs both.

To find $\partial L / \partial w_0$, she holds $w_1$ fixed and differentiates, using the chain rule on each squared term (the inner derivative of $w_0 + w_1 - 2$ with respect to $w_0$ is $1$, and likewise for the second term):

$$\frac{\partial L}{\partial w_0} = \tfrac12\Big[2(w_0 + w_1 - 2)(1) + 2(w_0 + 2w_1 - 2)(1)\Big] = (w_0 + w_1 - 2) + (w_0 + 2w_1 - 2).$$

Collecting terms,

$$\frac{\partial L}{\partial w_0} = 2w_0 + 3w_1 - 4.$$

For $\partial L / \partial w_1$ she holds $w_0$ fixed; now the inner derivatives are the coefficients of $w_1$, namely $1$ and $2$:

$$\frac{\partial L}{\partial w_1} = \tfrac12\Big[2(w_0 + w_1 - 2)(1) + 2(w_0 + 2w_1 - 2)(2)\Big] = (w_0 + w_1 - 2) + 2(w_0 + 2w_1 - 2),$$

which collects to

$$\frac{\partial L}{\partial w_1} = 3w_0 + 5w_1 - 6.$$

These two partials are the entire engine of training. In modern frameworks they are computed automatically by backpropagation, but backpropagation is just the chain rule applied to compute partial derivatives at scale — exactly the operation Maya just did by hand. Chapter 30 will bundle the pair into the gradient $\nabla L = (L_{w_0}, L_{w_1})$, the single vector pointing in the direction of steepest increase, whose negative is the direction gradient descent travels.
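A quick way to trust hand-computed partials is to compare them with a finite-difference estimate. The sketch below is illustrative only: `grad` and `numeric_grad` are names introduced here, and the central-difference check is a standard numerical technique, not something the text itself uses. The analytic version is written so that the chain-rule structure (residual times inner derivative) stays visible.

```python
DATA = [(1.0, 2.0), (2.0, 2.0)]  # (x, y) pairs from the case study


def loss(w0, w1):
    """Half the sum of squared errors of the line y_hat = w0 + w1 * x."""
    return 0.5 * sum((w0 + w1 * x - y) ** 2 for x, y in DATA)


def grad(w0, w1):
    """Analytic partials via the chain rule: residual times inner derivative
    (the inner derivative is 1 for w0 and x for w1)."""
    residuals = [(w0 + w1 * x - y, x) for x, y in DATA]
    dL_dw0 = sum(r * 1.0 for r, _ in residuals)
    dL_dw1 = sum(r * x for r, x in residuals)
    return dL_dw0, dL_dw1


def numeric_grad(w0, w1, h=1e-6):
    """Central-difference estimate: vary one parameter, hold the other fixed."""
    d0 = (loss(w0 + h, w1) - loss(w0 - h, w1)) / (2 * h)
    d1 = (loss(w0, w1 + h) - loss(w0, w1 - h)) / (2 * h)
    return d0, d1


print(grad(0.0, 0.0))          # (-4.0, -6.0), matching 2*w0 + 3*w1 - 4 and 3*w0 + 5*w1 - 6
print(numeric_grad(0.0, 0.0))  # agrees with the line above up to rounding error
```

The numerical version mirrors the definition of a partial derivative directly: nudge one input, freeze the other, and measure the change in the loss.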

One step of gradient descent

Maya starts from the origin $(w_0, w_1) = (0, 0)$, a common default for a linear model. This is a model that predicts $0$ for everything, which is badly wrong since both true values are $2$. She evaluates the partials there:

$$\frac{\partial L}{\partial w_0}(0, 0) = 2(0) + 3(0) - 4 = -4, \qquad \frac{\partial L}{\partial w_1}(0, 0) = 3(0) + 5(0) - 6 = -6.$$

Both partials are negative, which says: increasing either parameter decreases the loss. So gradient descent — which steps opposite the partials — will increase both $w_0$ and $w_1$. With a learning rate (step size) of $\eta = 0.1$, the update is

$$w_0 \leftarrow 0 - 0.1(-4) = 0.4, \qquad w_1 \leftarrow 0 - 0.1(-6) = 0.6.$$

After one step the parameters have moved from $(0, 0)$ to $(0.4, 0.6)$: downhill on the bowl, with the loss falling from $4$ to $0.58$. Repeat the evaluate-and-step loop and the parameters work their way down to the bottom of the bowl. The geometric picture of Section 29.8 is exact: each partial is a slope of the loss landscape in one axis direction, and the algorithm reads those slopes to decide which way is down.
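The evaluate-and-step loop is short enough to write out. The following is a minimal sketch, assuming the two partials derived above; the function names `grad` and `gradient_descent` and the step counts are choices made here for illustration.

```python
def grad(w0, w1):
    """The two partials derived in the text."""
    return 2 * w0 + 3 * w1 - 4, 3 * w0 + 5 * w1 - 6


def gradient_descent(w0, w1, eta=0.1, steps=1):
    """Repeat the evaluate-and-step loop `steps` times and return the final parameters."""
    for _ in range(steps):
        g0, g1 = grad(w0, w1)                    # evaluate both partials
        w0, w1 = w0 - eta * g0, w1 - eta * g1    # step against them
    return w0, w1


# One step reproduces the hand computation, up to floating-point rounding.
print(gradient_descent(0.0, 0.0, steps=1))

for n in (10, 100, 1000):
    w0, w1 = gradient_descent(0.0, 0.0, steps=n)
    print(f"after {n:>4} steps: w0 = {w0:.4f}, w1 = {w1:.4f}")
```

Increasing `steps` moves the parameters steadily closer to the bottom of the bowl, quickly at first and then slowly along its shallow direction.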

Where the bottom is, and Clairaut along the way

The lowest point of the bowl is where the surface is level in both axis directions — where both partials vanish (the multivariable echo of $f'=0$, Section 29.14). Maya sets

$$2w_0 + 3w_1 - 4 = 0, \qquad 3w_0 + 5w_1 - 6 = 0,$$

and solves. From the first, $w_0 = (4 - 3w_1)/2$; substituting into the second gives $3(4 - 3w_1)/2 + 5w_1 - 6 = 0$, i.e. $6 - \tfrac{9}{2}w_1 + 5w_1 - 6 = 0$, so $\tfrac12 w_1 = 0$ and $w_1 = 0$, whence $w_0 = 2$. The optimum is $(w_0, w_1) = (2, 0)$: the best straight-line fit is the flat line $\hat y = 2$. That is exactly right — both data points sit at $y = 2$, so the constant function $2$ passes through both and drives the loss to zero. The calculus and the common sense agree.
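The same $2 \times 2$ system can be solved mechanically. The sketch below uses Cramer's rule in place of the substitution shown above, with exact rational arithmetic from Python's standard `fractions` module; the helper name `solve_2x2` is an assumption introduced for illustration.

```python
from fractions import Fraction


def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve [[a11, a12], [a21, a22]] @ (u, v) = (b1, b2) exactly by Cramer's rule."""
    det = Fraction(a11 * a22 - a12 * a21)
    if det == 0:
        raise ValueError("singular system: no unique solution")
    u = Fraction(b1 * a22 - a12 * b2) / det
    v = Fraction(a11 * b2 - b1 * a21) / det
    return u, v


# Setting both partials to zero gives 2*w0 + 3*w1 = 4 and 3*w0 + 5*w1 = 6.
w0, w1 = solve_2x2(2, 3, 3, 5, 4, 6)
print(w0, w1)  # 2 0
```

The determinant here is $2 \cdot 5 - 3 \cdot 3 = 1$, which is nonzero, so the system has exactly one solution and the bowl has exactly one bottom.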

One last check ties the chapter together. Maya computes the second partials to confirm the surface really is a clean bowl. Differentiating the first partials again:

$$L_{w_0 w_0} = 2, \quad L_{w_1 w_1} = 5, \quad L_{w_0 w_1} = 3, \quad L_{w_1 w_0} = 3.$$

The mixed partials are equal, $L_{w_0 w_1} = L_{w_1 w_0} = 3$, just as Clairaut's theorem (Section 29.9) guarantees for this smooth polynomial loss. This is not a curiosity: those four numbers form the Hessian matrix, whose symmetry (supplied by Clairaut) is what makes the second-derivative test of Chapter 31 well behaved. Here $L_{w_0 w_0} = 2 > 0$ and the determinant $L_{w_0 w_0} L_{w_1 w_1} - L_{w_0 w_1}^2 = 2 \cdot 5 - 3^2 = 1 > 0$, so the curvature is positive in every direction and $(2, 0)$ is a genuine minimum and not a saddle. The equality of mixed partials, which looked like an abstract theorem in Section 29.9, is part of the structure that lets Maya certify that her optimizer is descending toward a true bottom.
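The curvature check can also be done numerically. The sketch below is an illustration, not the text's own procedure: it arranges the four second partials as a $2 \times 2$ Hessian, and the variable names are assumptions introduced here. It confirms symmetry and computes the determinant and the two eigenvalues from the quadratic formula.

```python
import math

# Second partials from the text, arranged as the Hessian [[L00, L01], [L10, L11]].
L00, L01, L10, L11 = 2.0, 3.0, 3.0, 5.0

det = L00 * L11 - L01 * L10
trace = L00 + L11

# Eigenvalues of a symmetric 2x2 matrix via the quadratic formula.
disc = math.sqrt(trace ** 2 - 4 * det)
lam_small, lam_large = (trace - disc) / 2, (trace + disc) / 2

print("symmetric:", L01 == L10)                              # True
print("determinant:", det)                                   # 1.0
print(f"eigenvalues: {lam_small:.4f}, {lam_large:.4f}")      # 0.1459, 6.8541
print("minimum (det > 0 and L00 > 0):", det > 0 and L00 > 0)  # True
```

Both eigenvalues are positive, so the surface curves upward in every direction. Their very different sizes correspond to the steep and shallow directions of the bowl.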

Discussion Questions

1. The loss surface here was a bowl with a single minimum. Real deep-network loss surfaces have many saddle points (Section 29.3). Why does the presence of saddles make training harder, and what does $L_{w_0} = L_{w_1} = 0$ fail to distinguish?
2. Maya's contour ellipses were tilted, signaling correlated parameters. How would the contour map look if the two parameters affected the loss independently, and what would that imply about the off-diagonal mixed partial $L_{w_0 w_1}$?
3. Backpropagation is described as "the chain rule for computing partial derivatives at scale." Using Maya's by-hand computation of $\partial L/\partial w_1$, identify exactly where the chain rule appeared.
4. The learning rate was $\eta = 0.1$. Argue from the linearization idea of Section 29.10 why too large a step can overshoot the bottom of the bowl, even though each partial correctly points downhill.

Annotated Reading

Stewart, Calculus: Early Transcendentals, §14.3 (Partial Derivatives) and §14.4 (Tangent Planes and Linear Approximations). The exact tools Maya uses; §14.4's linearization is the basis for the learning-rate discussion in Question 4.
OpenStax Calculus Volume 3, §4.3 (Partial Derivatives). Free, parallel coverage, with a clean treatment of higher-order and mixed partials underlying the Clairaut check.
Goodfellow, Bengio & Courville, Deep Learning (MIT Press, free online), Ch. 4 (Numerical Computation) and §6.5 (Back-Propagation). Where the two-parameter toy here grows into million-parameter reality; §6.5 makes explicit that backpropagation computes the partial derivatives of the loss with respect to every weight.
3Blue1Brown, "Gradient descent, how neural networks learn" (video). A visual companion that animates the loss-surface descent described in this case study.