Case Study 1 — Calibrating a Sensor: Least Squares as Projection
Field: data science / instrumentation. This case study makes the chapter's central claim — least squares is orthogonal projection — concrete on a problem every experimentalist faces: turning a raw sensor reading into a trustworthy measurement.
The problem: a sensor that lies, but lies predictably
You have bought a cheap temperature sensor for a greenhouse-automation project. Out of the box it reports a number, but that number is not the temperature — it is a raw voltage-derived reading $r$ that drifts from the truth in some systematic way. To make it useful you run a calibration: you place the sensor in a series of controlled environments whose true temperatures $y$ you know (from a laboratory reference thermometer) and record what the cheap sensor reports. The goal is a conversion rule that maps any future raw reading $r$ to a best-estimate true temperature.
The simplest useful model is a straight line: $$y \approx c_0 + c_1\, r,$$ where $c_1$ corrects the sensor's scale (how many degrees per unit of raw reading) and $c_0$ corrects its offset (a constant bias). If the sensor were perfect, you would find $c_0 = 0$ and $c_1 = 1$. Real sensors deviate, and calibration measures the deviation.
Here are five calibration points — raw readings $r$ against known reference temperatures $y$ (in suitable units):
| Raw reading $r$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Reference $y$ | 1.2 | 1.9 | 3.2 | 3.9 | 5.3 |
If you plot these five points, they fall almost on a line, but not exactly — every real measurement carries noise. There is no line passing through all five points, so the system
$$\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix}\begin{bmatrix} c_0 \\ c_1 \end{bmatrix} = \begin{bmatrix} 1.2 \\ 1.9 \\ 3.2 \\ 3.9 \\ 5.3 \end{bmatrix}, \qquad\text{i.e.}\qquad A\mathbf{c} = \mathbf{y},$$
has no exact solution. The data vector $\mathbf{y}$ lies off the column space of $A$. This is exactly the overdetermined situation of Chapter 13 — five equations, two unknowns — and it is the natural home of least squares.
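One way to confirm numerically that the system has no exact solution is to compare ranks: $A\mathbf{c} = \mathbf{y}$ is solvable exactly when appending $\mathbf{y}$ to $A$ as an extra column does not raise the rank. The sketch below is a minimal check under that criterion; the variable names are illustrative choices.

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # raw sensor readings
y = np.array([1.2, 1.9, 3.2, 3.9, 5.3])    # known reference temperatures
A = np.column_stack([np.ones_like(r), r])  # design matrix: [1 | r]

rank_A = np.linalg.matrix_rank(A)
rank_aug = np.linalg.matrix_rank(np.column_stack([A, y]))  # augmented [A | y]

# y lies in C(A) only if the augmented matrix has the same rank as A.
print("rank(A)       =", rank_A)
print("rank([A | y]) =", rank_aug)
print("exactly solvable:", rank_A == rank_aug)
```

A larger rank for the augmented matrix means $\mathbf{y}$ contributes a direction that the two columns of $A$ cannot produce, which is the algebraic form of "the data vector lies off the plane."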
The geometry: project the data onto the model
Chapter 19's reframing is the whole point. The design matrix $A$ has two columns: a column of ones (encoding the offset) and the column of raw readings $r = (1,2,3,4,5)$ (encoding the scale). As $\mathbf{c} = (c_0, c_1)$ ranges over all possible coefficient pairs, the prediction vector $A\mathbf{c}$ ranges over the entire column space $C(A)$ — a two-dimensional plane sitting inside $\mathbb{R}^5$. The observed data $\mathbf{y}$ is a single point in $\mathbb{R}^5$ that, because of noise, does not lie on that plane.
"Find the best-fitting line" therefore means "find the point of the plane $C(A)$ closest to $\mathbf{y}$." And that, by everything in this chapter, is the orthogonal projection of $\mathbf{y}$ onto $C(A)$. The closest-point theorem (§19.10) guarantees this projection is the unique nearest point, so the least-squares line is not one arbitrary good fit among many — it is the best fit in the precise sense of minimizing total squared error. The fitted coefficients $\hat{\mathbf{c}}$ are the coordinates of that projection in the columns of $A$, found by solving the normal equations $A^{\mathsf{T}}A\hat{\mathbf{c}} = A^{\mathsf{T}}\mathbf{y}$.
Solving it
We form the small $2\times 2$ system and solve. The Gram matrix and right-hand side work out to
$$A^{\mathsf{T}}A = \begin{bmatrix} 5 & 15 \\ 15 & 55 \end{bmatrix}, \qquad A^{\mathsf{T}}\mathbf{y} = \begin{bmatrix} 15.5 \\ 56.7 \end{bmatrix},$$
where the $(1,1)$ entry is the number of points, the off-diagonal $15 = \sum r_i$, and $55 = \sum r_i^2$; on the right-hand side, $15.5 = \sum y_i$ and $56.7 = \sum r_i y_i$. Solving $A^{\mathsf{T}}A\hat{\mathbf{c}} = A^{\mathsf{T}}\mathbf{y}$ gives the fitted offset and scale.
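For readers who want to see the arithmetic, the $2\times 2$ system can be solved by hand using the explicit inverse of a $2\times 2$ matrix (one route among several; elimination works equally well). The determinant is
$$\det(A^{\mathsf{T}}A) = 5\cdot 55 - 15\cdot 15 = 50,$$
so
$$\hat{\mathbf{c}} = \frac{1}{50}\begin{bmatrix} 55 & -15 \\ -15 & 5 \end{bmatrix}\begin{bmatrix} 15.5 \\ 56.7 \end{bmatrix} = \frac{1}{50}\begin{bmatrix} 852.5 - 850.5 \\ -232.5 + 283.5 \end{bmatrix} = \begin{bmatrix} 0.04 \\ 1.02 \end{bmatrix}.$$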
# Sensor calibration as a least-squares projection. (Chapter 19, Case Study 1)
import numpy as np
r = np.array([1., 2., 3., 4., 5.]) # raw sensor readings
y = np.array([1.2, 1.9, 3.2, 3.9, 5.3]) # known reference temperatures
A = np.column_stack([np.ones_like(r), r]) # design matrix: [1 | r]
# Solve the normal equations (the projection of y onto C(A)).
c_hat = np.linalg.solve(A.T @ A, A.T @ y)
fitted = A @ c_hat # p = projection = fitted values
resid = y - fitted # e = error = residuals
print("c_hat (offset, scale) =", np.round(c_hat, 4)) # [0.04 1.02]
print("fitted values p =", np.round(fitted, 4)) # [1.06 2.08 3.1 4.12 5.14]
print("residuals e =", np.round(resid, 4)) # [0.14 -0.18 0.1 -0.22 0.16]
print("A^T e (should be ~0) =", np.round(A.T @ resid, 6)) # [-0. -0.]
print("||e|| =", round(float(np.linalg.norm(resid)), 4)) # 0.3688
# Coefficient of determination R^2 = 1 - SSE / TSS
sse = resid @ resid
tss = (y - y.mean()) @ (y - y.mean())
print("R^2 =", round(1 - sse / tss, 4)) # 0.9871
The output reads c_hat = [0.04 1.02], fitted values [1.06 2.08 3.1 4.12 5.14], residuals [0.14 -0.18 0.1 -0.22 0.16], A^T e = [-0. -0.], ||e|| = 0.3688, and R^2 = 0.9871. The calibration line is
$$y \approx 0.04 + 1.02\, r.$$
Reading the result
Every number tells a story, and each is a sentence in the language of projection.
The coefficients. The fitted scale $c_1 = 1.02$ says the cheap sensor under-reports by about $2\%$ — each unit of raw reading corresponds to $1.02$ true degrees, so you multiply readings up slightly. The offset $c_0 = 0.04$ is nearly zero, so there is almost no constant bias. The conversion rule for any future reading $r$ is simply $\hat y = 0.04 + 1.02\,r$. That is what calibration delivers: not a fit to these five points, but a rule for the next ten thousand.
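In practice the conversion rule would usually be wrapped in a small function. The sketch below is one hypothetical way to package it; the function name `calibrate`, the constant names, and the sample readings are illustrative assumptions, not part of the case study's data.

```python
import numpy as np

# Fitted offset and scale from the calibration line y = 0.04 + 1.02 r.
C0, C1 = 0.04, 1.02

def calibrate(raw):
    """Map raw sensor reading(s) to an estimated true temperature: C0 + C1 * raw."""
    return C0 + C1 * np.asarray(raw, dtype=float)

# Hypothetical future readings (not from the calibration set).
new_readings = [2.5, 3.7, 4.4]
print(calibrate(new_readings))
```

The calibration points cover raw readings from 1 to 5, so applying the rule well outside that range is extrapolation and should be treated with more caution than the fit statistics alone suggest.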
The orthogonality check. The line A^T e = [-0. -0.] is the chapter's signature, holding to machine precision (the minus sign on the printed zeros is an artifact of floating-point rounding, not a meaningful value). The residual vector $\mathbf{e}$ is orthogonal to both columns of $A$: orthogonal to the column of ones (so the residuals sum to zero) and orthogonal to the column of readings (so the residuals are uncorrelated with $r$). This is not something we imposed afterward; it is the defining property of the projection, and it is why a statistician can assert "the residuals have mean zero and are uncorrelated with the predictor" without checking: it follows from the geometry. The fitted values $\mathbf{p}$ live in the plane $C(A)$; the residual $\mathbf{e}$ stands perpendicular to it in the left null space $N(A^{\mathsf{T}})$.
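The two statistical consequences named above can be checked directly. The sketch below recomputes the residuals and reports their sum and their dot product with the centered readings (the numerator of the sample covariance between $\mathbf{e}$ and $r$); both should vanish up to floating-point rounding. Using `np.linalg.lstsq` for the fit is an implementation choice made here for brevity.

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # raw sensor readings
y = np.array([1.2, 1.9, 3.2, 3.9, 5.3])    # known reference temperatures
A = np.column_stack([np.ones_like(r), r])  # design matrix: [1 | r]

c_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
e = y - A @ c_hat                           # residual vector

sum_e = e.sum()                  # e dotted with the column of ones
cov_num = e @ (r - r.mean())     # numerator of the sample covariance of e and r

print("sum of residuals  =", sum_e)
print("e . (r - mean(r)) =", cov_num)
```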
The fit quality. The residual norm $\lVert\mathbf{e}\rVert \approx 0.37$ is the straight-line distance from the data $\mathbf{y}$ to the model plane $C(A)$, the smallest such distance achievable, by the closest-point theorem. The coefficient of determination $R^2 = 0.987$ says the line explains $98.7\%$ of the variation in the reference temperatures; in projection language, $R^2$ is the ratio of squared lengths $\lVert\mathbf{p}_{\text{centered}}\rVert^2 / \lVert\mathbf{y}_{\text{centered}}\rVert^2$, where centering means subtracting the mean $\bar y$ from every entry. It measures how much of the centered data vector's squared length is captured by its shadow on the model. A high $R^2$ means the centered data vector points almost straight into the column space; the error component is small.
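The two descriptions of $R^2$ (one minus the error fraction, and the ratio of centered squared lengths) can be compared numerically. The sketch below computes both; the variable names are illustrative, and the fitted values are centered by subtracting the mean of $y$, which is also their own mean because the residuals sum to zero.

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # raw sensor readings
y = np.array([1.2, 1.9, 3.2, 3.9, 5.3])    # known reference temperatures
A = np.column_stack([np.ones_like(r), r])  # design matrix: [1 | r]

c_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
p = A @ c_hat            # fitted values (projection)
e = y - p                # residuals

y_c = y - y.mean()       # centered data
p_c = p - y.mean()       # centered fitted values

r2_from_error = 1 - (e @ e) / (y_c @ y_c)    # 1 - SSE / TSS
r2_from_ratio = (p_c @ p_c) / (y_c @ y_c)    # ||p_c||^2 / ||y_c||^2

print("R^2 via 1 - SSE/TSS     =", round(r2_from_error, 4))
print("R^2 via length ratio    =", round(r2_from_ratio, 4))
```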
The individual residuals. Look at the residual vector $\mathbf{e} = (0.14, -0.18, 0.10, -0.22, 0.16)$ entry by entry: at each calibration point it is the vertical gap between the observed reference and the fitted line. The signs alternate without an obvious trend, which is the visual hallmark of a model that fits well: there is no systematic curve the line is missing, just measurement scatter. (If, instead, the residuals were all positive in the middle and negative at the ends, that pattern would itself be a direction in $\mathbb{R}^5$ orthogonal to $C(A)$ but structured, a hint that a quadratic term, i.e. a third column $r^2$, belongs in the model. Reading residuals is reading the part of the data the current subspace cannot reach.) The largest single residual, $-0.22$ at $r = 4$, marks the calibration point that sits farthest from the line and is plausibly the noisiest measurement; the projection does not chase it, because chasing one point would pull the whole line away from the others and increase the total squared distance, which the closest-point theorem rules out.
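The parenthetical suggestion of adding an $r^2$ column can be tried on this data. The sketch below is exploratory: it fits the straight-line model and a hypothetical quadratic model and compares residual norms. Because the larger column space contains the smaller one, the quadratic residual can never be longer; the question is whether it is meaningfully shorter.

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # raw sensor readings
y = np.array([1.2, 1.9, 3.2, 3.9, 5.3])    # known reference temperatures

A_lin = np.column_stack([np.ones_like(r), r])          # [1 | r]
A_quad = np.column_stack([np.ones_like(r), r, r**2])   # [1 | r | r^2]

c_lin, *_ = np.linalg.lstsq(A_lin, y, rcond=None)
c_quad, *_ = np.linalg.lstsq(A_quad, y, rcond=None)

norm_lin = np.linalg.norm(y - A_lin @ c_lin)
norm_quad = np.linalg.norm(y - A_quad @ c_quad)

print("linear    ||e|| =", round(float(norm_lin), 4))
print("quadratic ||e|| =", round(float(norm_quad), 4))
print("quadratic coefficient on r^2 =", round(float(c_quad[2]), 4))
```

With only five calibration points, a modest drop in residual norm is weak evidence for curvature; the comparison is meant to illustrate the mechanics of enlarging the subspace, not to recommend the larger model.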
Why projection is the right idea, not just a recipe
It would be possible to "fit a line" by countless other rules — minimize the largest error, minimize the sum of absolute errors, eyeball it. Least squares is special because it is the projection, and projection has the unique closest-point guarantee plus the clean orthogonal decomposition $\mathbf{y} = \mathbf{p} + \mathbf{e}$ with $\lVert\mathbf{y}\rVert^2 = \lVert\mathbf{p}\rVert^2 + \lVert\mathbf{e}\rVert^2$. That decomposition is what lets us cleanly split "signal we modeled" from "noise we did not," and it is why least squares dovetails with the statistics of variance.
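Both properties claimed for the projection can be probed numerically. The sketch below forms the projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ explicitly (reasonable for a five-point illustration, not for large problems), checks the Pythagorean identity, and compares the distance from $\mathbf{y}$ to its projection against the distance to $A\mathbf{c}$ for randomly drawn coefficient pairs. The number of trials and the random seed are arbitrary choices.

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # raw sensor readings
y = np.array([1.2, 1.9, 3.2, 3.9, 5.3])    # known reference temperatures
A = np.column_stack([np.ones_like(r), r])  # design matrix: [1 | r]

# Projection matrix onto C(A); formed explicitly only because the example is tiny.
P = A @ np.linalg.inv(A.T @ A) @ A.T
p = P @ y          # projection (fitted values)
e = y - p          # orthogonal error (residuals)

# Orthogonal decomposition: ||y||^2 = ||p||^2 + ||e||^2.
print("||y||^2           =", y @ y)
print("||p||^2 + ||e||^2 =", p @ p + e @ e)

# Closest point: no candidate A c drawn at random should beat the projection.
rng = np.random.default_rng(0)                   # seed chosen arbitrarily
candidates = rng.normal(size=(1000, 2)) @ A.T    # each row is a point A c in C(A)
dists = np.linalg.norm(y - candidates, axis=1)
print("distance to projection   =", np.linalg.norm(e))
print("best of 1000 random fits =", dists.min())
```

A random search proves nothing by itself, of course; it only fails to find a counterexample, which is what the closest-point theorem predicts.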
A caution the chapter insisted on applies directly here. We solved the normal equations because they make the projection transparent, but in production calibration code (especially with many predictors or readings spanning a huge range) you should call np.linalg.lstsq(A, y) instead. NumPy's lstsq is built on an SVD-based LAPACK routine rather than QR, but like the QR factorization of Chapter 20 it works on $A$ directly and never forms $A^{\mathsf{T}}A$, whose condition number is the square of that of $A$. And the full-column-rank condition matters: if you accidentally included two readings columns that were scalar multiples of each other (say raw readings in two units), $A^{\mathsf{T}}A$ would be singular and the coefficients would no longer be unique, even though the projection $\mathbf{p}$ onto $C(A)$ still exists and is unique. Independent predictors are the price of unique coefficients.
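Both cautions can be illustrated in a few lines. The sketch below calls `np.linalg.lstsq` on the original design matrix, then builds a deliberately rank-deficient matrix by appending a rescaled copy of the readings column (the factor 1.8 is an arbitrary stand-in for a change of units). It is a demonstration under those assumptions, not production calibration code.

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # raw sensor readings
y = np.array([1.2, 1.9, 3.2, 3.9, 5.3])    # known reference temperatures
A = np.column_stack([np.ones_like(r), r])  # design matrix: [1 | r]

# lstsq returns (solution, residual sum of squares, rank, singular values).
c_hat, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print("lstsq coefficients:", np.round(c_hat, 4), " rank:", rank)

# Append a scalar multiple of the readings column: the columns are now dependent.
A_bad = np.column_stack([A, 1.8 * r])
c_bad, _, rank_bad, _ = np.linalg.lstsq(A_bad, y, rcond=None)
print("rank of A_bad:", rank_bad, "of", A_bad.shape[1], "columns")
print("cond(A_bad^T A_bad):", np.linalg.cond(A_bad.T @ A_bad))

# lstsq picks the minimum-norm coefficients; the fitted values are unchanged.
print("coefficients for A_bad:", np.round(c_bad, 4))
print("same projection:", np.allclose(A @ c_hat, A_bad @ c_bad))
```

The coefficients reported for `A_bad` are just one of infinitely many equally good choices, so they carry no calibration meaning, while the fitted values agree with the two-column fit.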
Takeaway
Calibrating a sensor is projecting a noisy data vector onto the plane of all possible linear models. The fitted coefficients are the projection's coordinates; the residuals are the orthogonal error; the orthogonality $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$ is the normal equation; and "best fit" means "closest point," guaranteed unique by the closest-point theorem. The same projection that found the foot of a perpendicular in §19.5 just turned an unreliable $\$5$ sensor into a trustworthy instrument, and it is the identical machinery behind every linear regression in statistics and every calibration curve in a laboratory.