Case Study 2 — Least Squares and Maximum Likelihood
Field: Statistics, machine learning, data science
Calculus used: Unconstrained multivariable optimization (Section 31.4), critical points via $\nabla S = \mathbf 0$, the Hessian as a guarantee of a global minimum (Section 31.11), connection to maximum likelihood
The problem
Priya is a data analyst at a regional utility, and she has four months of data relating the average daily temperature drop (in tens of degrees below comfort, call it $x$) to household heating cost (in hundreds of dollars, call it $y$):
| month | $x$ | $y$ |
|---|---|---|
| Jan | 1 | 2 |
| Feb | 2 | 2 |
| Mar | 3 | 4 |
| Apr | 4 | 5 |
Plotted, the four points rise roughly along a line, but not perfectly — no single line passes through all four. Priya wants the best straight-line fit $y = mx + c$, the one a utility forecaster could use to predict next winter's costs. But "best" needs a precise meaning, and that meaning turns the messy art of curve-fitting into a clean exercise in the multivariable optimization of Section 31.4.
Defining "best": the sum of squared residuals
For any candidate line with slope $m$ and intercept $c$, the residual at the $i$-th point is the vertical miss $y_i - (mx_i + c)$ — how far the line's prediction lands from the actual value. Some residuals are positive (line too low), some negative (line too high). To measure total badness-of-fit we cannot just add them, because positives and negatives would cancel. Instead we square each residual and sum:
$$S(m, c) = \sum_{i=1}^{4}\big(y_i - mx_i - c\big)^2.$$
This is the sum of squared residuals, and minimizing it defines the method of least squares. Squaring does three jobs at once: it kills the sign cancellation, it penalizes big misses disproportionately (a residual of $2$ costs four times one of $1$), and — crucially for us — it makes $S$ a smooth, differentiable function of the two parameters $m$ and $c$. So the question "what is the best line?" becomes "where is $S(m,c)$ smallest?" — an unconstrained optimization in two variables, exactly the setting of Sections 31.2–31.4. There is no constraint here; $m$ and $c$ may be anything, and the structure of $S$ alone pins down the answer.
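To make the definition concrete, here is a minimal sketch that evaluates the residuals and $S(m, c)$ for two candidate lines on Priya's data. The helper names `residuals` and `S` and the two candidate lines are illustrative choices, not part of the case study. The flat line $y = 3.25$ shows why raw residuals cannot be summed: they cancel to zero even though the fit is poor.

```python
import numpy as np

# Priya's data from the table above.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.0, 4.0, 5.0])

def residuals(m, c):
    """Vertical misses y_i - (m*x_i + c) for the candidate line."""
    return y - (m * x + c)

def S(m, c):
    """Sum of squared residuals for the candidate line."""
    return float(np.sum(residuals(m, c) ** 2))

# Flat line through the mean of y: the raw residuals cancel exactly,
# yet the fit is poor, which only the squared sum reveals.
flat_sum = float(np.sum(residuals(0.0, 3.25)))
flat_S = S(0.0, 3.25)

# A rising candidate line does much better.
rising_S = S(1.0, 1.0)

print(flat_sum, flat_S, rising_S)
# 0.0 6.75 1.0
```

The summed residuals of the flat line are zero, but its squared sum (6.75) is far worse than that of the rising line (1.0), which is the ranking we want a measure of fit to produce.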
Finding the critical point
Following the recipe of Section 31.2, we set the gradient to zero. Differentiating $S$ with respect to each parameter (chain rule on each squared term):
$$\frac{\partial S}{\partial m} = \sum_i 2\big(y_i - mx_i - c\big)(-x_i) = 0,$$ $$\frac{\partial S}{\partial c} = \sum_i 2\big(y_i - mx_i - c\big)(-1) = 0.$$
Dividing by $-2$ and expanding the sums gives the celebrated normal equations (Section 31.11):
$$m\sum x_i^2 + c\sum x_i = \sum x_i y_i, \qquad m\sum x_i + c\,n = \sum y_i.$$
These are linear in $m$ and $c$ — a $2\times 2$ system — which is the quiet miracle of least squares: a nonlinear-looking fitting problem collapses to high-school algebra. Now tabulate the sums from Priya's data ($n = 4$):
$$\sum x_i = 1+2+3+4 = 10,\quad \sum y_i = 2+2+4+5 = 13,$$ $$\sum x_i^2 = 1+4+9+16 = 30,\quad \sum x_i y_i = 2+4+12+20 = 38.$$
The normal equations become
$$30m + 10c = 38, \qquad 10m + 4c = 13.$$
From the second equation, $c = \dfrac{13 - 10m}{4}$. Substitute into the first (multiply through by $4$ to clear fractions):
$$120m + 10(13 - 10m) = 152 \ \Longrightarrow\ 120m + 130 - 100m = 152 \ \Longrightarrow\ 20m = 22 \ \Longrightarrow\ m = 1.1.$$
Then $c = \dfrac{13 - 10(1.1)}{4} = \dfrac{13 - 11}{4} = 0.5.$ The best-fit line is
$$\boxed{\,y = 1.1\,x + 0.5.\,}$$
A heating-cost forecaster reading this would say: each additional unit of temperature drop (ten degrees) adds about \$110 to the monthly bill, on a base of \$50.
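The arithmetic above can be replayed in exact rational arithmetic. The sketch below tabulates the four sums and solves the $2\times 2$ normal equations by Cramer's rule, using Python's standard `fractions` module so that no rounding enters. The variable names (`Sx`, `Sxx`, and so on) are illustrative, and Cramer's rule is swapped in for the substitution used in the text; both give the same solution.

```python
from fractions import Fraction

xs = [1, 2, 3, 4]
ys = [2, 2, 4, 5]
n = len(xs)

# Sums that appear in the normal equations.
Sx = sum(xs)
Sy = sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

# Cramer's rule on the system
#   Sxx*m + Sx*c = Sxy
#   Sx*m  + n*c  = Sy
det = Sxx * n - Sx * Sx
m = Fraction(Sxy * n - Sx * Sy, det)
c = Fraction(Sxx * Sy - Sx * Sxy, det)

print(Sx, Sy, Sxx, Sxy)  # 10 13 30 38
print(m, c)              # 11/10 1/2
```

The exact values $m = 11/10$ and $c = 1/2$ agree with the decimals $1.1$ and $0.5$ found by hand.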
Why this critical point is the answer — the Hessian guarantees it
Setting $\nabla S = \mathbf 0$ only finds a critical point. As the whole thrust of Section 31.4 warns, a critical point could be a maximum, a minimum, or a saddle. We must classify it — and here the news is uniformly good. The second partials are constant (because $S$ is quadratic in $m, c$):
$$S_{mm} = 2\sum x_i^2 = 60, \qquad S_{cc} = 2n = 8, \qquad S_{mc} = 2\sum x_i = 20.$$
The discriminant is
$$D = S_{mm}S_{cc} - S_{mc}^2 = (60)(8) - (20)^2 = 480 - 400 = 80 > 0,$$
and $S_{mm} = 60 > 0$, so by the second-derivative test the critical point $(m, c) = (1.1, 0.5)$ is a local minimum. But it is more than local. Because $S$ is quadratic in $m$ and $c$, its Hessian is the same constant matrix at every point, and that matrix is positive-definite: both eigenvalues are positive, since their sum is the trace $60 + 8 = 68 > 0$ and their product is $D = 80 > 0$. (Being a sum of squares alone guarantees only a positive-semidefinite Hessian; definiteness here comes from the $x_i$ not all being equal, which is what makes $D > 0$.) That is exactly the convexity condition of Section 31.12: a function with a positive-definite Hessian everywhere is strictly convex, has no saddle points and no spurious local minima, and any critical point it has is its unique global minimum.
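The eigenvalue argument can be checked numerically. This minimal sketch builds the Hessian from the second partials above, in the variable order $(m, c)$, and computes its eigenvalues with `np.linalg.eigvalsh`, NumPy's routine for symmetric matrices. The names `H` and `eigenvalues` are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x)

# Hessian of S in the variable order (m, c), built from the second partials:
# S_mm = 2*sum(x^2), S_mc = 2*sum(x), S_cc = 2*n.
H = np.array([
    [2 * np.sum(x**2), 2 * np.sum(x)],
    [2 * np.sum(x),    2 * n],
])

eigenvalues = np.linalg.eigvalsh(H)  # symmetric input, ascending order
print(eigenvalues)
print(eigenvalues.sum(), eigenvalues.prod())
```

Both eigenvalues should come out positive (roughly 1.2 and 66.8), with sum close to the trace 68 and product close to $D = 80$.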
This is the deep reason least squares is so trustworthy. Unlike neural-network training (Section 31.6), whose loss landscape is riddled with saddle points, the least-squares surface is a perfect bowl. There is one lowest point, gradient descent with a suitably small step size finds it from any starting guess, and the normal equations hand it to you in closed form. Every linear regression you will ever run is this one guaranteed-global critical-point computation.
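As a rough illustration of the "perfect bowl", the sketch below runs plain fixed-step gradient descent on $S$ from three arbitrary starting guesses. The step size `0.02`, the iteration count, and the starting points are assumptions chosen for this example. For a quadratic like $S$ the step must stay below $2/\lambda_{\max}$, which is about $0.03$ here, or the iteration diverges.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.0, 4.0, 5.0])

def grad_S(m, c):
    """Gradient of S(m, c), from the two partial derivatives derived earlier."""
    r = y - (m * x + c)
    return np.array([-2.0 * np.sum(r * x), -2.0 * np.sum(r)])

def descend(start, step=0.02, iterations=5000):
    """Fixed-step gradient descent; step and iterations are illustrative."""
    p = np.array(start, dtype=float)
    for _ in range(iterations):
        p = p - step * grad_S(p[0], p[1])
    return p

# Three arbitrary starting guesses (m, c).
starts = [(0.0, 0.0), (10.0, -10.0), (-5.0, 20.0)]
results = [descend(s) for s in starts]
for s, p in zip(starts, results):
    print(s, "->", np.round(p, 4))
```

Each run should settle at approximately $(m, c) = (1.1, 0.5)$, whichever corner of the parameter plane it starts from.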
The maximum-likelihood connection
Why squared residuals, and not, say, absolute residuals? The answer is a beautiful bridge to probability. Suppose each observation is the true line plus independent Gaussian noise $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. The probability density of seeing $y_i$ given the line is proportional to $\exp\!\big(-\tfrac{1}{2\sigma^2}(y_i - mx_i - c)^2\big)$. The likelihood of the whole dataset is the product over $i$, and its logarithm — the log-likelihood — is
$$\ell(m, c) = \text{const} - \frac{1}{2\sigma^2}\sum_i\big(y_i - mx_i - c\big)^2 = \text{const} - \frac{1}{2\sigma^2}\,S(m, c).$$
Maximizing $\ell$ is therefore identical to minimizing $S$. Least squares is maximum likelihood under Gaussian noise. The squaring was never arbitrary; it is what the normal distribution dictates. This is the unconstrained face of the same coin whose constrained face — maximizing a log-likelihood subject to probabilities summing to one — produced the $\hat p_i = n_i/n$ estimator via Lagrange multipliers in Section 31.11. Whether your parameters roam free or live on a constraint, statistical estimation is multivariable optimization in disguise.
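The identity $\ell = \text{const} - S/(2\sigma^2)$ can be checked numerically. The sketch below sums the Gaussian log-densities directly and confirms that $\ell + S/(2\sigma^2)$ is the same number for several candidate lines. The noise level `sigma = 0.5` and the list of candidates are assumptions made for illustration; the case study does not fix $\sigma$.

```python
import math
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.0, 4.0, 5.0])
sigma = 0.5  # assumed noise standard deviation, for illustration only

def S(m, c):
    """Sum of squared residuals."""
    return float(np.sum((y - m * x - c) ** 2))

def log_likelihood(m, c):
    """Sum of the Gaussian log-densities of each y_i around m*x_i + c."""
    r = y - m * x - c
    per_point = -0.5 * math.log(2 * math.pi * sigma**2) - r**2 / (2 * sigma**2)
    return float(np.sum(per_point))

candidates = [(1.1, 0.5), (1.0, 1.0), (0.0, 3.25), (2.0, -1.0)]

# If ell = const - S / (2 sigma^2), this quantity is identical for every line.
constants = [log_likelihood(m, c) + S(m, c) / (2 * sigma**2) for m, c in candidates]
print(constants)

# The candidate with the highest log-likelihood is the one with the smallest S.
best = max(candidates, key=lambda p: log_likelihood(*p))
print(best)
```

The printed constants should agree to rounding error, and among these candidates the least-squares line $(1.1, 0.5)$ should have the highest log-likelihood.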
Verifying with the machine
The hand answer deserves a numerical confirmation. The output below is hand-computed, not executed here.
```python
# Least-squares line for Priya's data. np.linalg.lstsq minimizes the same
# sum of squared residuals that the normal equations solve.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.0, 4.0, 5.0])
A = np.vstack([x, np.ones_like(x)]).T  # design matrix [x | 1]
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
# slope m = 1.100, intercept c = 0.500
```
NumPy's least-squares solver should return $m = 1.1$, $c = 0.5$ (up to floating-point rounding), matching the hand-solved normal equations. This is the cross-check our three-tier teaching pattern demands.
Discussion questions
- Add a fifth, badly-mismeasured point $(5, 1)$ to the dataset (an outlier). Recompute the normal equations and observe how far the slope moves. Why does squaring make least squares sensitive to outliers, and what does the maximum-likelihood story say about when that sensitivity is justified?
- Show algebraically that the best-fit line always passes through the centroid $(\bar x, \bar y)$. (Hint: the second normal equation, divided by $n$, is this statement.)
- The Hessian of $S$ does not depend on the $y$-values at all — only on the $x$'s. Explain why, and what this means about the shape of the error bowl versus its location.
- Suppose you wanted to fit a line through the origin ($c = 0$ forced). Is this an unconstrained or a constrained problem? Set it up and solve, then compare the slope to the free-intercept fit.
Annotated reading
- Stewart, Calculus: Early Transcendentals, Section 14.7. Presents the least-squares derivation as the marquee application of the second-derivative test, including the Hessian check that the critical point is a minimum. The cleanest calculus-first treatment.
- OpenStax, Calculus Volume 3, Section 4.7. Free, with a worked least-squares example and exercises paralleling this case; good for additional practice.
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (free PDF), Chapter 3. The modern statistical-learning view: normal equations in matrix form, the geometry of projection, and the bridge to regularized regression. Read it to see how the two-variable problem here scales to thousands of predictors.
- Bishop, Pattern Recognition and Machine Learning, Chapter 1. Develops the least-squares-equals-Gaussian-maximum-likelihood equivalence in full, the cleanest exposition of the connection sketched above.