Case Study 28.2 — Why Your Model Trains Slowly: The Loss Bowl and Its Condition Number

Field: machine learning & optimization. Anchor tie-in: this is the optimization bowl of §28.6 doing real work; the elongated-ellipse picture is Figure 28.1, and the story continues in optimization.

The problem

An engineer is training a model with gradient descent, and it is crawling. The loss barely drops for hundreds of iterations, then suddenly plummets. A colleague suggests "normalize your features" and, almost magically, training that took thousands of steps now finishes in dozens. What happened? The answer is entirely a story about a positive definite matrix and its eigenvalues. The loss function, near its minimum, is a quadratic form whose matrix is the Hessian; that Hessian is positive definite at a good minimum (§28.6), so the loss surface is a bowl; and the shape of that bowl — round versus long-and-thin — is set by the ratio of the Hessian's largest to smallest eigenvalue, the condition number. A round bowl is easy to roll a marble down; a long thin valley makes the marble zig-zag agonizingly. Feature normalization works because it reshapes the bowl to be rounder. This case study makes that precise and watches it happen in code.

The loss surface is a quadratic form

Consider the simplest learning problem with a quadratic loss: minimizing $f(\mathbf{x}) = \tfrac12\mathbf{x}^{\mathsf{T}}A\mathbf{x}$, where $A$ is symmetric positive definite and $\mathbf{x}$ is the parameter vector. (This is exactly the local picture of any smooth loss near a minimum, by the Taylor expansion of §28.6, with $A$ the Hessian. It is also literally the loss of linear least squares, whose Hessian is the positive semidefinite matrix $A^{\mathsf{T}}A$ from Chapter 17, where $A$ denotes the data matrix rather than the Hessian; that Hessian is positive definite when the columns of the data matrix are linearly independent.) The gradient is $\nabla f = A\mathbf{x}$, and gradient descent takes steps $$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\,A\mathbf{x}_k = (I - \eta A)\,\mathbf{x}_k,$$ for a step size (learning rate) $\eta$. The unique minimum is at $\mathbf{x} = \mathbf{0}$ (where $\nabla f = \mathbf{0}$), and positive definiteness guarantees this is a genuine bottom, not a saddle the marble could roll past.
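The update rule can be checked numerically. The sketch below is illustrative only: the matrix `A`, the step size `eta`, and the starting point `x0` are hypothetical values chosen for the demonstration, not taken from the text. It iterates the step explicitly and compares the result with the closed form $\mathbf{x}_k = (I - \eta A)^k\,\mathbf{x}_0$.

```python
import numpy as np

# Hypothetical 2x2 symmetric positive definite matrix, chosen only for illustration.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
eta = 0.1                      # assumed step size, small enough to converge
x0 = np.array([1.0, -2.0])     # arbitrary starting point

def f(x):
    """Quadratic loss f(x) = 1/2 x^T A x."""
    return 0.5 * x @ A @ x

# Take 25 explicit gradient-descent steps, recording the loss after each one.
x = x0.copy()
losses = [f(x)]
for _ in range(25):
    x = x - eta * (A @ x)      # the gradient of f is A x
    losses.append(f(x))

# The same iterate in closed form: x_k = (I - eta*A)^k x_0.
M = np.eye(2) - eta * A
x_closed = np.linalg.matrix_power(M, 25) @ x0

print("iterated   :", x)
print("closed form:", x_closed)
print("loss decreased every step:", all(b < a for a, b in zip(losses, losses[1:])))
```

Because the whole iteration is multiplication by the fixed matrix $I - \eta A$, everything about convergence can be read off the eigenvalues of that matrix.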

How fast do we get there? Rotate into the eigenvector coordinates of $A$ (the spectral theorem, §28.3), where $A$ becomes diagonal with the eigenvalues $\lambda_1, \dots, \lambda_n$ on the diagonal. In those coordinates the descent decouples into independent one-dimensional problems, one per eigen-direction, and along the $i$-th direction the error shrinks by the factor $(1 - \eta\lambda_i)$ every step. Convergence requires $|1 - \eta\lambda_i| < 1$ for every eigenvalue, which forces the step size to be small enough for the largest eigenvalue: $\eta < 2/\lambda_{\max}$. But a step size held small by $\lambda_{\max}$ may be far too timid for the direction with the smallest eigenvalue $\lambda_{\min}$, where the error then shrinks only by the sluggish factor $(1 - \eta\lambda_{\min}) \approx 1$. The marble races down the steep walls of the valley and barely creeps along its shallow floor.

The slowest direction is the shallowest one (smallest eigenvalue, the long axis of the ellipse), and the step size is capped by the steepest one (largest eigenvalue, the short axis). The mismatch between these two is the condition number, and it is the whole story of slow training.
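The decoupling can be observed directly. In this minimal sketch (the matrix, starting point, and the choice of step size at 90% of the stability limit are assumptions made for illustration), the iterate is expressed in the eigenvector basis after each step, and the ratio of successive coordinates is compared with the predicted factor $1 - \eta\lambda_i$.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # hypothetical SPD matrix, for illustration only
lam, Q = np.linalg.eigh(A)          # A = Q diag(lam) Q^T, columns of Q are orthonormal
eta = 0.9 * 2.0 / lam.max()         # assumed step, safely inside eta < 2 / lambda_max

x = np.array([1.0, -2.0])           # arbitrary starting point
c_prev = Q.T @ x                    # coordinates of x in the eigenvector basis
ratios = []
for k in range(1, 4):
    x = x - eta * (A @ x)           # ordinary gradient step in the original coordinates
    c = Q.T @ x
    ratios.append(c / c_prev)       # per-direction shrink factor actually observed
    print(f"step {k}: observed = {ratios[-1]}, predicted = {1 - eta * lam}")
    c_prev = c
```

Each eigen-direction shrinks by its own constant factor, independent of the others. The factor for the largest eigenvalue is negative here, which is the sign flip behind the zig-zag across the valley.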

Watching the condition number control the speed

The standard analysis makes this exact: with the optimal fixed step size $\eta = 2/(\lambda_{\max} + \lambda_{\min})$, the number of iterations to reach a fixed accuracy grows in proportion to the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$. Let us see it directly. We run gradient descent on three positive definite bowls of increasing elongation (condition numbers 1, 10, and 50) from the same starting point, each with its own optimal step size, and count the iterations until the gradient norm falls below $10^{-6}$.

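# Iterations of gradient descent on a PD quadratic scale with the condition number.
import numpy as np

def gd(A, x0, lr, tol=1e-6, maxit=1_000_000):
    """Run gradient descent on f(x) = (1/2) x^T A x; return the iteration at which ||grad|| < tol."""
    x = x0.copy()
    for k in range(1, maxit + 1):
        g = A @ x                       # gradient of (1/2) x^T A x
        if np.linalg.norm(g) < tol:     # converged: the gradient at the current point is tiny
            return k
        x = x - lr * g                  # fixed-step gradient descent update
    return maxit                        # tolerance not reached within maxit iterations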

x0 = np.array([1.0, 1.0])
for A, name in [(np.array([[1.0, 0.0], [0.0, 1.0]]),  "round bowl"),
                (np.array([[10.0, 0.0], [0.0, 1.0]]), "elongated"),
                (np.array([[50.0, 0.0], [0.0, 1.0]]), "very elongated")]:
    ev = np.linalg.eigvalsh(A)
    lr_opt = 2.0 / (ev.min() + ev.max())   # optimal step size for a PD quadratic
    print(f"{name:14s}: cond = {np.linalg.cond(A):4.0f},  iterations = {gd(A, x0, lr_opt)}")

round bowl    : cond =    1,  iterations = 2
elongated     : cond =   10,  iterations = 82
very elongated: cond =   50,  iterations = 445

The numbers tell the story. The perfectly round bowl (condition number 1) is reported as 2 iterations: when all eigenvalues are equal, the negative gradient points straight at the minimum, the optimal step $\eta = 1/\lambda$ lands on it exactly in the first step, and the second iteration only confirms that the gradient is now zero. Stretch the bowl to condition number 10 and it takes 82 iterations; stretch it to 50 and it takes 445. A fivefold increase in the condition number costs about 5.4 times as many iterations, roughly the proportional growth the theory predicts. A loss surface that is a long thin valley, a covariance-like Hessian with one large and one tiny eigenvalue, is intrinsically slow for gradient descent, no matter how cleverly you tune the single step size, because one number cannot simultaneously suit a steep wall and a shallow floor.
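The proportionality between iteration count and condition number follows from a short calculation, sketched here in the notation already in use; the target accuracy $\varepsilon$ is a symbol introduced only for this sketch. With step size $\eta$, the error along the $i$-th eigen-direction is multiplied by $1 - \eta\lambda_i$ each step, so the worst-case contraction factor is

$$\rho(\eta) = \max_i |1 - \eta\lambda_i| = \max\bigl(|1 - \eta\lambda_{\min}|,\ |1 - \eta\lambda_{\max}|\bigr).$$

This is smallest when the two extremes balance, $1 - \eta\lambda_{\min} = -(1 - \eta\lambda_{\max})$, which gives

$$\eta^\star = \frac{2}{\lambda_{\max} + \lambda_{\min}}, \qquad \rho^\star = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \frac{\kappa - 1}{\kappa + 1}.$$

Shrinking the error by a factor $\varepsilon$ requires $(\rho^\star)^k \le \varepsilon$, that is,

$$k \;\ge\; \frac{\ln(1/\varepsilon)}{\ln\frac{\kappa + 1}{\kappa - 1}} \;\approx\; \frac{\kappa}{2}\,\ln\frac{1}{\varepsilon} \quad (\kappa \gg 1),$$

because $\ln\frac{\kappa+1}{\kappa-1} \approx 2/\kappa$ for large $\kappa$. As a check against the run with $\kappa = 10$: the starting gradient is $(10, 1)$ with norm $\sqrt{101}$, and both components shrink in magnitude by $\rho^\star = 9/11$ per step, so the gradient norm drops below $10^{-6}$ once $(9/11)^m\sqrt{101} < 10^{-6}$, i.e. $m > 16.12/0.2007 \approx 80.3$. That means 81 updates, with convergence detected at the 82nd gradient evaluation, matching the reported count.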

Why feature normalization fixes it

Now the practical punchline. Where does an elongated Hessian come from? Most often from features measured on wildly different scales: a feature in dollars (ranging into the thousands) and a feature that is a fraction (ranging from $0$ to $1$) produce a loss that is extremely steep along one parameter and extremely shallow along another, which is exactly a large condition number. Feature normalization, rescaling each feature to comparable variance, equalizes the diagonal of the Hessian, which pulls the eigenvalues toward each other and the condition number toward 1 as long as the features are not strongly correlated. Geometrically, it reshapes the loss bowl from a long thin valley into a near-circular one, and a near-circular bowl is the case gradient descent solves in a handful of steps. The colleague's advice "normalize your features" is, underneath, "make the Hessian better-conditioned so the bowl is round."
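The effect of scaling on the Hessian is easy to measure. The following sketch uses synthetic data: the two features, their ranges, the sample size, and the random seed are all hypothetical. It assumes a least-squares loss, whose Hessian is $X^{\mathsf{T}}X/n$ for a data matrix $X$, and it standardizes by subtracting the column mean and dividing by the column standard deviation, which presumes the intercept is handled separately.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable
n = 500
dollars = rng.uniform(0.0, 5000.0, n)   # hypothetical feature on a scale of thousands
fraction = rng.uniform(0.0, 1.0, n)     # hypothetical feature between 0 and 1
X = np.column_stack([dollars, fraction])

def hessian_cond(X):
    """Condition number of the least-squares Hessian X^T X / n."""
    H = X.T @ X / len(X)
    return np.linalg.cond(H)

# Standardize: subtract each column's mean and divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = hessian_cond(X)
cond_std = hessian_cond(X_std)
print(f"raw features         : cond = {cond_raw:.3g}")
print(f"standardized features: cond = {cond_std:.3g}")
```

With independent features, as here, the standardized Hessian is close to the identity. Standardizing would not remove ill-conditioning caused by strongly correlated features, since it only rescales each axis.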

This is also why more sophisticated optimizers exist. Preconditioning multiplies the gradient by an approximate inverse Hessian, which is precisely a change of variables that rounds out the bowl — the ideal preconditioner is $A^{-1}$ itself, which turns the problem into the perfectly-conditioned identity and lets descent finish in one step (that limit is Newton's method). Momentum methods and adaptive optimizers (Adam, RMSProp) are cheaper, approximate ways to cope with a poorly-conditioned bowl without forming the Hessian explicitly. Every one of these techniques is, at bottom, a strategy for dealing with the eigenvalue spread of a positive definite matrix — the same spread that sets the aspect ratio of the contour ellipse in Figure 28.1.
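A small sketch of both ends of the preconditioning spectrum, under stated assumptions: the matrix `A` is a hypothetical poorly conditioned Hessian, and the diagonal (Jacobi) rescaling is used here as one simple example of a cheap preconditioner, not as a description of what Adam or RMSProp compute.

```python
import numpy as np

A = np.array([[50.0, 2.0],
              [2.0,  1.0]])                # hypothetical SPD Hessian with a wide eigenvalue spread
x0 = np.array([1.0, 1.0])

# Ideal preconditioner P = A^{-1}: a single step x - P @ grad lands on the minimum.
grad = A @ x0
x_newton = x0 - np.linalg.solve(A, grad)   # solve() applies A^{-1} without forming the inverse

# Cheap diagonal (Jacobi) preconditioner: rescale each variable by 1/sqrt(diag(A)).
d = 1.0 / np.sqrt(np.diag(A))
A_jacobi = d[:, None] * A * d[None, :]     # D^{-1/2} A D^{-1/2}

cond_plain = np.linalg.cond(A)
cond_jacobi = np.linalg.cond(A_jacobi)
print("after one Newton step:", x_newton)
print(f"cond(A) = {cond_plain:.1f}, cond after diagonal rescaling = {cond_jacobi:.2f}")
```

The exact inverse solves the problem in one step but costs a linear solve; the diagonal rescaling costs almost nothing and still shrinks the condition number substantially whenever the spread comes mostly from mismatched scales.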

A subtle but important caveat. Real deep-learning loss surfaces are not globally convex — they have many critical points, and the Hessian is positive definite only locally, near a good minimum. Far from any minimum the Hessian can be indefinite (a saddle, §28.6.2), and indeed saddle points, not bad local minima, are now believed to be the main obstacle in high-dimensional non-convex training. But the local geometry near any minimum the optimizer settles into is still a positive definite bowl, and its condition number still governs the final convergence rate — so the lesson transfers, with the honest qualification that the global picture is a landscape of many bowls and saddles rather than one grand bowl.
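For contrast with the bowl, here is a minimal sketch of the same update rule at a saddle. The Hessian, starting point, and step size are hypothetical, chosen so that the quadratic $f(\mathbf{x}) = \tfrac12(x_1^2 - x_2^2)$ has one positive and one negative eigenvalue.

```python
import numpy as np

H = np.array([[1.0, 0.0],
              [0.0, -1.0]])      # hypothetical indefinite Hessian of f(x) = 1/2 (x1^2 - x2^2)
eigs = np.linalg.eigvalsh(H)
print("eigenvalues:", eigs, "-> indefinite:", bool(eigs.min() < 0 < eigs.max()))

x = np.array([1.0, 1e-3])        # start almost exactly on the positive-curvature axis
eta = 0.1                        # assumed step size
for _ in range(50):
    x = x - eta * (H @ x)        # same gradient step as for the bowl
print("after 50 steps:", x)
```

Along the positive-curvature direction the iterate contracts toward the critical point as before, while along the negative-curvature direction the factor $1 - \eta\lambda$ exceeds 1 and a tiny initial offset grows. The iterate lingers near the saddle before escaping, which is one way a loss curve can look flat for a long time.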

Connecting back to the chapter

Every concept in this case study is a concept from the chapter wearing applied clothing. The loss-is-a-bowl claim is the positive definiteness of the Hessian (§28.6). The decoupling into independent directions is the spectral theorem's sum-of-squares (§28.3). The slow-versus-fast directions are the long-versus-short axes of the contour ellipse (§28.5), with the smallest eigenvalue giving the longest, slowest axis. The condition number is the eigenvalue ratio that §28.6's Real-World Application flagged. And the saddle caveat is the indefinite Hessian of §28.6.2. The marble in the salad bowl from the opening of the chapter is, quite literally, a model training by gradient descent — and how round that bowl is decides how long you wait.

Takeaways

Near a minimum, a loss function is a positive definite quadratic form whose matrix is the Hessian; gradient descent is a marble rolling down that bowl.
The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ of the Hessian controls convergence speed: iterations grow in proportion to $\kappa$. A round bowl ($\kappa \approx 1$) converges in a few steps; a long thin valley ($\kappa$ large) is intrinsically slow, because one step size cannot suit both the steep walls and the shallow floor.
Feature normalization and preconditioning work by reshaping the bowl to be rounder — pulling the Hessian's eigenvalues together and shrinking the condition number — which is why they accelerate training.
Real non-convex losses are positive definite only locally; far from a minimum the Hessian can be indefinite (a saddle), which is the dominant obstacle in high-dimensional optimization — but the local convergence rate is still set by the positive definite bowl's condition number.