Application: Linear Regression as Projection onto the Column Space

DataField.Dev

46 min read

> Learning paths. Math majors — read everything, especially the orthogonality derivation of the normal equations in §17.5 and the Math-Major Sidebar on uniqueness and the pseudoinverse in §17.9; the full-column-rank condition is the hinge the whole...

Prerequisites

chapter-13-column-space-and-null-space
chapter-06-subspaces-span-independence

Learning Objectives

Explain why a tall, overdetermined system Ax = b usually has NO exact solution, and recast the goal as minimizing the residual norm ‖Ax − b‖.
State and justify the geometric heart of least squares: the best fit is the orthogonal projection of b onto the column space C(A), so the residual is orthogonal to C(A).
Derive the normal equations A^T A x̂ = A^T b from the orthogonality condition, and state the full-column-rank condition under which they have a unique solution.
Fit a straight line and a polynomial to data by building the design matrix, forming the normal equations, and solving them by hand and with numpy.
Compute and interpret the coefficient of determination R^2 as the fraction of variance the fit explains.
Implement least_squares(A, b) from scratch via the normal equations and verify it against np.linalg.lstsq.

In This Chapter

17.1 Why does the most useful equation in data science have no solution?
17.2 What is an overdetermined system, and why is it the rule, not the exception?
17.3 What does it mean for the best fit to be a projection?
17.4 How do we package "fit a line" as a matrix problem? (the design matrix)
17.5 What are the normal equations, and why do they hold? (the algebra of orthogonality)
17.6 A full worked example by hand: the best line through three points
17.7 How do you fit a curve? Polynomial regression is still linear
17.8 How good is the fit? The coefficient of determination $R^2$
17.9 When are the normal equations a bad idea? (conditioning and alternatives)
17.10 Build it from scratch: least_squares(A, b)
17.11 What should you carry forward from this chapter?

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Application: Linear Regression as Projection onto the Column Space

Learning paths. Math majors — read everything, especially the orthogonality derivation of the normal equations in §17.5 and the Math-Major Sidebar on uniqueness and the pseudoinverse in §17.9; the full-column-rank condition is the hinge the whole chapter turns on. CS / Data Science — focus on the Geometric Intuition callouts, the design-matrix recipe in §17.4 and §17.7, the numpy verifications, the $R^2$ discussion in §17.8, and both case studies; this is the chapter where the four-subspaces machinery starts paying your rent. Physics / Engineering — focus on the projection picture, the calibration case study, and the Common Pitfall on ill-conditioning (when not to use the normal equations). This chapter is the application that closes Part III; it assumes the column space $C(A)$ of Chapter 13 and the span/independence ideas of Chapter 6. It needs only an intuitive notion of "perpendicular" — the full geometry of projection arrives in Chapter 19.

17.1 Why does the most useful equation in data science have no solution?

Here is a situation you will meet in every quantitative field. You collect data — hours studied versus exam score, square footage versus house price, voltage versus temperature — and you want to summarize it with a simple rule: a straight line, say, $y = c_0 + c_1 x$. You have five data points but only two unknowns, $c_0$ and $c_1$. Each data point demands that the line pass exactly through it, which is one equation; five points give five equations in two unknowns. That is an $A\mathbf{x} = \mathbf{b}$ with more equations than unknowns — and unless the points happen to lie perfectly on a single line, there is no $(c_0, c_1)$ that satisfies all five at once. The system has no solution.

This is not a defect in the data or a failure of your algebra. It is the normal state of affairs whenever you have more measurements than parameters, which is exactly the regime real data lives in. Measurements carry noise; the world is messier than any straight line. So the honest mathematical question is not "which line passes through all the points?" — none does — but "which line comes closest?" That single shift, from solving exactly to fitting as well as possible, is the entire subject of this chapter, and it turns out to be one of the most beautiful payoffs of everything Part III has built.

The branch of statistics that does this is linear regression, and you have surely seen its output: a scatter of dots with a straight line drawn through their middle. What is rarely explained — and what this chapter insists on — is that the line is not chosen by some statistical sleight of hand. It is chosen by geometry. The data vector $\mathbf{b}$ lives in a high-dimensional space, the achievable fits form a flat subspace inside that space (the column space $C(A)$ of Chapter 13), and the best fit is the point of that subspace closest to $\mathbf{b}$ — the foot of the perpendicular dropped from $\mathbf{b}$ onto the subspace. Least squares is orthogonal projection onto the column space. Hold onto that sentence; everything else is commentary on it.

Geometric Intuition — Picture the data vector $\mathbf{b}$ as a single point hanging somewhere in $\mathbb{R}^m$ (one coordinate per data point — for five data points, a point in $\mathbb{R}^5$). The set of all outputs your model can produce, as you sweep the parameters over every possible value, is the column space $C(A)$ — a flat subspace through the origin, of low dimension (a line, a plane). The model can land anywhere in $C(A)$ but nowhere outside it. Since $\mathbf{b}$ almost never lies in $C(A)$, you cannot reach it. The closest you can come is to drop straight down onto the flat — the orthogonal projection of $\mathbf{b}$ onto $C(A)$. That foot of the perpendicular is the best fit $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$; the little arrow from $\hat{\mathbf{b}}$ up to $\mathbf{b}$ is the residual, the part of the data the model cannot explain.

Let me name the two viewpoints we will hold simultaneously, because the magic of this chapter is that they are the same picture seen from two sides. In data space (the everyday view) you see a 2D scatter plot: dots, and a line wandering among them, with little vertical gaps between each dot and the line. In column space (the linear-algebra view) you see $\mathbb{R}^m$ with one point $\mathbf{b}$, one flat subspace $C(A)$, and one perpendicular dropped onto it. The vertical gaps in the first picture are the components of the residual vector in the second. Minimizing the total squared gap (the statistician's goal) is identical to minimizing the length of the residual vector (the geometer's goal), and the geometer's answer — drop a perpendicular — solves the statistician's problem in one stroke.

Why should you trust that this one idea is worth a whole chapter? Because it is, by a wide margin, the most-used algorithm built on linear algebra. An economist fitting GDP growth against interest rates, a biologist relating drug dose to response, a sports analyst predicting wins from payroll, a machine-learning engineer training the final layer of a network — all of them are solving an overdetermined $A\mathbf{x} = \mathbf{b}$ by least squares, whether or not they say so. The spreadsheet function LINEST, the lm() in R, scikit-learn's LinearRegression, the trendline in your charting tool: every one is forming a column space and projecting onto it. Learn the geometry once and you understand all of them at the root, not as black boxes but as a perpendicular dropped onto a subspace.

This chapter is the capstone of Part III for a reason. Chapters 13 through 16 taught you to see the column space, to read its dimension off a rank, to recognize when $\mathbf{b}$ is reachable and when it is not. Linear regression is the moment all of that abstraction earns its keep: it answers the most practiced computation in all of applied mathematics — fit a model to data — using nothing but the column space and the idea of "closest point." We will lead, as always, with the geometry; the algebra of the normal equations comes only after the picture is clear. By the end you will set up the design matrix for any linear model, derive the normal equations from a single orthogonality condition, fit lines and curves by hand and in numpy, judge a fit with $R^2$, and — just as importantly — know when the normal equations are the wrong tool and what to reach for instead.

17.2 What is an overdetermined system, and why is it the rule, not the exception?

An overdetermined system is a linear system $A\mathbf{x} = \mathbf{b}$ with more equations than unknowns — an $m \times n$ matrix $A$ where $m > n$, a tall matrix. Geometrically, you are asking a small number of knobs ($n$ of them) to satisfy a large number of demands ($m$ of them) simultaneously. Chapter 13 already told you what to expect: the column space $C(A)$ has dimension at most $n$, but it lives in the much larger output space $\mathbb{R}^m$. So $C(A)$ is a thin flat inside a fat space, and a generic target $\mathbf{b}$ — picked from the fat space — will not lie on the thin flat. No solution.

Contrast the three regimes from Chapter 13. A square invertible system has exactly one solution: $C(A)$ fills all of $\mathbb{R}^n$, so every $\mathbf{b}$ is reachable. A wide system ($m < n$, more unknowns than equations) typically has infinitely many solutions: too many knobs. The tall system ($m > n$) is the opposite: too few knobs to hit an arbitrary target, so typically no exact solution. Regression lives entirely in this tall regime, and the question "which line fits best?" is precisely "what do we do when $A\mathbf{x} = \mathbf{b}$ is tall and inconsistent?"

Let me make this concrete with the smallest example that shows the phenomenon. Suppose we want a line $y = c_0 + c_1 x$ through three points $(0, 1)$, $(1, 3)$, $(2, 2)$. Each point gives one equation: $$ \begin{aligned} c_0 + c_1\cdot 0 &= 1\\ c_0 + c_1\cdot 1 &= 3\\ c_0 + c_1\cdot 2 &= 2 \end{aligned} \qquad\Longleftrightarrow\qquad \underbrace{\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} c_0 \\ c_1 \end{bmatrix}}_{\mathbf{x}} = \underbrace{\begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}}_{\mathbf{b}}. $$ Three equations, two unknowns: a $3 \times 2$ tall system. The first two equations alone force $c_0 = 1$ and $c_1 = 2$, giving the line $y = 1 + 2x$ — but then the third equation demands $1 + 2\cdot 2 = 5 = 2$, which is false. No line passes through all three points; the data is not collinear. This three-point system is small enough to hold in your head, and we will solve it exactly as a least-squares problem by hand in §17.6.

The Key Insight — A tall system $A\mathbf{x} = \mathbf{b}$ has an exact solution if and only if $\mathbf{b} \in C(A)$ (Chapter 13's solvability criterion, unchanged). For real data, $\mathbf{b}$ almost never lands in the low-dimensional column space, so there is no exact solution. Least squares does not pretend otherwise: it abandons the demand "$A\mathbf{x} = \mathbf{b}$ exactly" and replaces it with "make $A\mathbf{x}$ as close to $\mathbf{b}$ as possible." The genius is in measuring "close" the right way — by the length of the residual — so that the answer becomes a projection.

17.2.1 Why measure error with the squared length?

When we say "as close as possible," we must say in what sense. Define the residual vector $\mathbf{r} = \mathbf{b} - A\mathbf{x}$: it records, component by component, how far each equation is from being satisfied. Component $r_i$ is the gap between the $i$-th data value $b_i$ and the model's prediction $(A\mathbf{x})_i$. We want all the gaps small at once, so we minimize a single number built from the whole residual vector: its length, or equivalently its squared length, $$ \lVert \mathbf{r} \rVert^2 = \lVert \mathbf{b} - A\mathbf{x} \rVert^2 = \sum_{i=1}^{m} (b_i - (A\mathbf{x})_i)^2 = r_1^2 + r_2^2 + \cdots + r_m^2. $$ This is the sum of squared residuals, and minimizing it is what puts the "least squares" in least squares. Why squared, and not, say, the sum of absolute values $\sum |r_i|$? Three reasons, in increasing depth. First, the squared length is exactly the Euclidean distance $\lVert \mathbf{b} - A\mathbf{x}\rVert$ — the honest geometric distance in $\mathbb{R}^m$ — so minimizing it is finding the closest point, our geometric goal. Second, it is smooth and differentiable, so calculus and linear algebra both bite cleanly (absolute values have corners). Third — and this is the deep one — minimizing the squared length corresponds, under the assumption of independent Gaussian noise, to the most likely parameter values; this is the statistical justification, developed in any statistics course and in the cross-link below. For us, the geometric reason is enough: squared length is distance, and we want the closest point.

Common Pitfall — "Least squares minimizes the perpendicular distance from each point to the line." No — it minimizes the vertical distance (the gap in $y$, holding $x$ fixed), because the residual is $\mathbf{b} - A\mathbf{x}$, an error in the output variable only. The perpendicular-distance fit is a different procedure (total least squares / orthogonal regression), useful when both variables carry error. Ordinary least squares treats the inputs $x_i$ as exact and only the outputs $y_i$ as noisy. The "orthogonality" in this chapter is in $\mathbb{R}^m$ between the residual vector and the column space — not between points and the line in the 2D plot. Keep the two pictures separate.

17.3 What does it mean for the best fit to be a projection?

Now the heart of the chapter. We want the vector $\hat{\mathbf{x}}$ that makes $A\hat{\mathbf{x}}$ as close to $\mathbf{b}$ as possible. The achievable outputs $A\mathbf{x}$, as $\mathbf{x}$ ranges over all of $\mathbb{R}^n$, are exactly the column space $C(A)$ — that was Chapter 13's definition. So our problem is purely geometric:

Among all vectors $\mathbf{p}$ in the subspace $C(A)$, find the one closest to the fixed point $\mathbf{b}$.

You already know the answer to this from everyday geometry. The closest point on a plane to a point floating above it is the foot of the perpendicular — drop straight down. The shortest path from a point to a line, or a plane, or any flat, is always along the perpendicular. The closest point $\mathbf{p}$ is the orthogonal projection of $\mathbf{b}$ onto $C(A)$, and the vector from $\mathbf{p}$ to $\mathbf{b}$ — the residual $\mathbf{r} = \mathbf{b} - \mathbf{p}$ — is perpendicular to the entire subspace $C(A)$.

The Key Insight — The least-squares fit is the orthogonal projection of $\mathbf{b}$ onto the column space $C(A)$. Write $\hat{\mathbf{b}} = A\hat{\mathbf{x}} = \operatorname{proj}_{C(A)}\mathbf{b}$. The defining property — the one we will turn into an equation — is that the residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ is orthogonal to $C(A)$. Everything else in this chapter, including the normal equations, is just this orthogonality written in symbols.

Why is the perpendicular the closest? Here is the one-line geometric proof, the same Pythagorean argument that justifies "shortest distance is perpendicular" in school geometry, now in $\mathbb{R}^m$. Let $\mathbf{p}$ be the foot of the perpendicular and let $\mathbf{q}$ be any other point in $C(A)$. The three points $\mathbf{b}$, $\mathbf{p}$, $\mathbf{q}$ form a right triangle: the leg $\mathbf{b} - \mathbf{p}$ (the residual) is perpendicular to the subspace, hence perpendicular to the leg $\mathbf{p} - \mathbf{q}$ (which lies in the subspace). By the Pythagorean theorem, $$ \lVert \mathbf{b} - \mathbf{q}\rVert^2 = \lVert \mathbf{b} - \mathbf{p}\rVert^2 + \lVert \mathbf{p} - \mathbf{q}\rVert^2 \;\ge\; \lVert \mathbf{b} - \mathbf{p}\rVert^2, $$ with equality only when $\mathbf{q} = \mathbf{p}$. So $\mathbf{p}$ is strictly closer to $\mathbf{b}$ than any other point of the subspace. The perpendicular foot is the unique minimizer. (We will sharpen "unique" with a rank condition in §17.5, and Chapter 19 develops the projection operator in full generality.)

Geometric Intuition — Imagine the column space as the floor of a room and $\mathbf{b}$ as a lightbulb hanging from the ceiling. The least-squares fit $\hat{\mathbf{b}}$ is the shadow the bulb casts straight down onto the floor — the projection. The string from the bulb to its shadow, hanging straight down, is the residual: it is vertical, i.e. perpendicular to the floor, which is why it is perpendicular to every direction lying in the floor. Any other point on the floor is farther from the bulb (walk along the floor away from the shadow and the slanted string to the bulb only gets longer). The shadow is the closest the floor can get to the bulb — and that shadow is exactly $A\hat{\mathbf{x}}$, the best the model can do.

This is the recurring theme of the whole book made vivid one more time: geometry and algebra are two views of one object. "Least squares" sounds like a numerical optimization — minimize a sum of squares. But the answer is a geometric object — a projection — and the route to computing it will be a single matrix equation. The statistician's optimization, the geometer's perpendicular, and the algebraist's equation are three faces of the same thing. By the end of this chapter you should be unable to hear "regression" without seeing a perpendicular dropped onto a subspace.

17.4 How do we package "fit a line" as a matrix problem? (the design matrix)

Before we can project, we must get the data into the form $A\mathbf{x} = \mathbf{b}$. The matrix $A$ here has a name: the design matrix (statisticians) or model matrix. Each row is one data point; each column is one feature the model uses. The unknown $\mathbf{x}$ is the vector of model parameters (the coefficients), and $\mathbf{b}$ is the vector of observed outputs. Setting up $A$ correctly is half the battle, and it is mechanical once you see the pattern.

Take our anchor: fitting a line $y = c_0 + c_1 x$ to five data points. Suppose the data is $$ (x, y) = (1, 2),\ (2, 1),\ (3, 4),\ (4, 3),\ (5, 5). $$ The model says each $y_i$ should equal $c_0 + c_1 x_i$. There are two parameters, $c_0$ and $c_1$, so $A$ has two columns. The first column multiplies $c_0$ — and since $c_0$ is multiplied by $1$ in every equation, that column is all ones. The second column multiplies $c_1$, so it holds the $x$-values. The system is $$ \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ 4 \\ 3 \\ 5 \end{bmatrix}, \qquad\text{i.e.}\qquad A\mathbf{x} = \mathbf{b}. $$ Five equations, two unknowns — overdetermined, as promised. The all-ones column is what lets the line have a nonzero intercept $c_0$; drop it and you force the line through the origin. The column space $C(A)$ here is a 2-dimensional plane sitting inside $\mathbb{R}^5$ — spanned by the all-ones vector and the $x$-vector. The data $\mathbf{b} = (2,1,4,3,5)$ is a point in $\mathbb{R}^5$ that almost certainly does not lie on that plane (the points are not collinear), so no line fits exactly, and least squares will project $\mathbf{b}$ onto the plane.

The Key Insight — The columns of the design matrix span the space of all fits the model can express. For a line, that space is 2-dimensional (intercept direction + slope direction); for a quadratic, 3-dimensional; and so on. Fitting is choosing the point of this span closest to the data. Adding a feature adds a column, which enlarges $C(A)$ and can only reduce the residual — a fact with a sharp edge we return to in §17.8 (more features always fit the training data better, which is not the same as predicting better).

Check Your Understanding — You want to fit the model $y = c_0 + c_1 x + c_2 z$ (two predictors $x$ and $z$, plus an intercept) to $8$ data points. What are the dimensions of the design matrix $A$, and what is the largest the dimension of $C(A)$ could be?

Answer

$A$ is $8 \times 3$ — eight rows (one per data point), three columns (the all-ones intercept column, the $x$ column, the $z$ column). The column space $C(A)$ lives in $\mathbb{R}^8$ and has dimension at most $3$ (it cannot exceed the number of columns). It equals $3$ exactly when the three columns are linearly independent — i.e. when the predictors are not collinear and not constant. This is the full column rank condition that §17.5 shows is exactly what makes the fit unique.

17.5 What are the normal equations, and why do they hold? (the algebra of orthogonality)

Now we convert the geometry — the residual is orthogonal to $C(A)$ — into an equation we can solve. This is the central derivation of the chapter, and it is short because the geometry did the work.

The residual is $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$. "Orthogonal to the column space $C(A)$" means orthogonal to every vector in $C(A)$, and since $C(A)$ is spanned by the columns of $A$, it is enough for $\mathbf{r}$ to be orthogonal to each column of $A$. Orthogonality is a zero dot product, so for every column $\mathbf{a}_j$ we require $\mathbf{a}_j \cdot \mathbf{r} = 0$. Stacking these $n$ dot products into one statement: each $\mathbf{a}_j \cdot \mathbf{r}$ is exactly the $j$-th entry of $A^{\mathsf{T}}\mathbf{r}$ (the rows of $A^{\mathsf{T}}$ are the columns of $A$). So "orthogonal to all columns" is the single matrix equation $$ A^{\mathsf{T}}\mathbf{r} = \mathbf{0} \qquad\Longleftrightarrow\qquad A^{\mathsf{T}}(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0}. $$ Distribute and move one term across, and you have the famous result.

The normal equations. The least-squares solution $\hat{\mathbf{x}}$ — the coefficients of the best fit — satisfies $$ > A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}. > $$ These are $n$ equations in the $n$ unknowns $\hat{\mathbf{x}}$ (a square system, even though the original $A\mathbf{x} = \mathbf{b}$ was tall). Condition: if $A$ has full column rank (its columns are linearly independent), then $A^{\mathsf{T}}A$ is invertible and the solution is unique: $$ > \hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}. > $$

Look at what just happened. The tall, unsolvable system $A\mathbf{x} = \mathbf{b}$ ($m$ equations) has been replaced by the square, solvable system $A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ ($n$ equations). We multiplied both sides by $A^{\mathsf{T}}$ — but this is not the same as "solving" the original system (you cannot just multiply an inconsistent system by a matrix and conjure a solution). What multiplying by $A^{\mathsf{T}}$ does is geometric: $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$ is the precise algebraic encoding of "the residual lies in the left null space $N(A^{\mathsf{T}})$" — orthogonal to every column, hence orthogonal to $C(A)$. The normal equations are the four-fundamental-subspaces picture (Chapter 14) doing exactly what it was built for: $C(A)$ and $N(A^{\mathsf{T}})$ are orthogonal complements, and least squares splits $\mathbf{b}$ into its $C(A)$ part (the fit) and its $N(A^{\mathsf{T}})$ part (the residual).

Why $A^{\mathsf{T}}A$ is invertible exactly when $A$ has full column rank

This is the load-bearing condition, so it deserves a proof in the book's four-part shape (§10 of the style guide).

Theorem (invertibility of the Gram matrix). For a real $m \times n$ matrix $A$, the $n \times n$ matrix $A^{\mathsf{T}}A$ is invertible if and only if $A$ has linearly independent columns (full column rank, $\operatorname{rank}(A) = n$).

Why we care. This is the exact condition under which the normal equations have a unique solution, hence under which "the best-fit line" is well defined. When it fails — collinear features — there is no unique answer, and a naive solver will choke or lie.

Key idea. $A^{\mathsf{T}}A$ and $A$ have the same null space. So $A^{\mathsf{T}}A$ kills only $\mathbf{0}$ exactly when $A$ does, which is exactly independence of columns.

Proof. It suffices to show $N(A^{\mathsf{T}}A) = N(A)$; an $n \times n$ matrix is invertible iff its null space is $\{\mathbf{0}\}$ (Chapter 9), so the two matrices are invertible together, and $N(A) = \{\mathbf{0}\}$ is precisely "columns independent" (Chapter 13). The easy inclusion: if $A\mathbf{x} = \mathbf{0}$ then $A^{\mathsf{T}}A\mathbf{x} = A^{\mathsf{T}}\mathbf{0} = \mathbf{0}$, so $N(A) \subseteq N(A^{\mathsf{T}}A)$. The reverse: suppose $A^{\mathsf{T}}A\mathbf{x} = \mathbf{0}$. Take the dot product of both sides with $\mathbf{x}$: $$ > 0 = \mathbf{x}^{\mathsf{T}}A^{\mathsf{T}}A\mathbf{x} = (A\mathbf{x})^{\mathsf{T}}(A\mathbf{x}) = \lVert A\mathbf{x}\rVert^2. > $$ A squared length is zero only for the zero vector, so $A\mathbf{x} = \mathbf{0}$, giving $N(A^{\mathsf{T}}A) \subseteq N(A)$. The two null spaces are equal. $\blacksquare$

What this means. Geometrically: $A^{\mathsf{T}}A$ is invertible exactly when the columns of $A$ are genuinely independent directions, so that $C(A)$ really has dimension $n$ and "project onto it" pins down a unique set of coordinates $\hat{\mathbf{x}}$. If two columns are parallel (two redundant features), $A^{\mathsf{T}}A$ is singular, and infinitely many coefficient vectors $\hat{\mathbf{x}}$ produce the same projected point — the fit $\hat{\mathbf{b}}$ is still unique (the projection always exists), but its coordinates are not.

The matrix $A^{\mathsf{T}}A$ has a name worth knowing: it is the Gram matrix of the columns of $A$. Its entry $(i,j)$ is the dot product $\mathbf{a}_i \cdot \mathbf{a}_j$ of column $i$ with column $j$. It is always symmetric ($(A^{\mathsf{T}}A)^{\mathsf{T}} = A^{\mathsf{T}}A$) and, when $A$ has full column rank, positive definite — a property that becomes central in Chapter 28. For now, the symmetry alone is a gift: it makes $A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ a small, well-behaved square system you can solve with the tools of Chapter 4 (Gaussian elimination) or Chapter 9 (the inverse), or — in the from-scratch toolkit — the LU decomposition of Chapter 10.

Warning

— The neat formula $\hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$ is for deriving and understanding, not for computing. Forming $A^{\mathsf{T}}A$ and inverting it is numerically dangerous: it squares the condition number of $A$ (we demonstrate this in §17.9), so a mildly ill-conditioned $A$ becomes a severely ill-conditioned $A^{\mathsf{T}}A$, and floating-point error explodes. Real software never inverts $A^{\mathsf{T}}A$; it solves the least-squares problem through the QR factorization of Chapter 20 or the SVD of Chapter 30, which work on $A$ directly and avoid the squaring. Use the normal equations to think; use QR/SVD to compute at scale. We flag exactly when this bites in the Common Pitfall of §17.9.

17.6 A full worked example by hand: the best line through three points

Let's earn the formula with pencil on the small system of §17.2 — fitting a line to $(0,1)$, $(1,3)$, $(2,2)$ — so that every number is checkable. The design matrix and data are $$ A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}. $$

Step 1 — Form $A^{\mathsf{T}}A$. Multiply the transpose by $A$: $$ A^{\mathsf{T}}A = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}. $$ Read the entries as dot products: top-left is (column 1)·(column 1) $= 1+1+1 = 3$ (the count of data points); top-right is (column 1)·(column 2) $= 0+1+2 = 3$ (the sum of the $x$'s); bottom-right is (column 2)·(column 2) $= 0+1+4 = 5$ (the sum of $x^2$). Symmetric, as promised.

Step 2 — Form $A^{\mathsf{T}}\mathbf{b}$. $$ A^{\mathsf{T}}\mathbf{b} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 1+3+2 \\ 0+3+4 \end{bmatrix} = \begin{bmatrix} 6 \\ 7 \end{bmatrix}. $$ The top entry is the sum of the $y$'s; the bottom is the sum of $x_i y_i$.

Step 3 — Solve the $2\times 2$ normal equations $\begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}\hat{\mathbf{x}} = \begin{bmatrix} 6 \\ 7 \end{bmatrix}$. The determinant is $3\cdot 5 - 3\cdot 3 = 6 \neq 0$ (so $A$ has full column rank — the two columns are independent). By the $2\times 2$ inverse rule of Chapter 9, $$ \hat{\mathbf{x}} = \frac{1}{6}\begin{bmatrix} 5 & -3 \\ -3 & 3 \end{bmatrix}\begin{bmatrix} 6 \\ 7 \end{bmatrix} = \frac{1}{6}\begin{bmatrix} 30 - 21 \\ -18 + 21 \end{bmatrix} = \frac{1}{6}\begin{bmatrix} 9 \\ 3 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 0.5 \end{bmatrix}. $$ So the best-fit line is $\hat{y} = 1.5 + 0.5\,x$: intercept $1.5$, slope $0.5$.

Step 4 — Find the fit and the residual, and check orthogonality. The fitted values are $$ \hat{\mathbf{b}} = A\hat{\mathbf{x}} = \begin{bmatrix} 1.5 + 0 \\ 1.5 + 0.5 \\ 1.5 + 1 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 2.0 \\ 2.5 \end{bmatrix}, \qquad \mathbf{r} = \mathbf{b} - \hat{\mathbf{b}} = \begin{bmatrix} 1 - 1.5 \\ 3 - 2 \\ 2 - 2.5 \end{bmatrix} = \begin{bmatrix} -0.5 \\ 1.0 \\ -0.5 \end{bmatrix}. $$ Now the moment of truth — is the residual orthogonal to the column space? Dot it with each column of $A$. With the all-ones column: $(-0.5) + 1.0 + (-0.5) = 0$. With the $x$-column $(0,1,2)$: $0\cdot(-0.5) + 1\cdot 1.0 + 2\cdot(-0.5) = 1 - 1 = 0$. Both zero, exactly as the geometry demanded: $\mathbf{r} \perp C(A)$. The fit is the projection; we did not assume it, we derived coefficients and the orthogonality fell out. The squared residual length is $\lVert\mathbf{r}\rVert^2 = 0.25 + 1 + 0.25 = 1.5$ — the smallest achievable by any line.

Check Your Understanding — In Step 4 we verified $\mathbf{r}\perp$ both columns of $A$. Why does checking orthogonality against the columns guarantee $\mathbf{r}$ is orthogonal to the whole column space $C(A)$, not just those two vectors?

Answer

Every vector in $C(A)$ is a linear combination of the columns, $\mathbf{v} = \alpha\mathbf{a}_1 + \beta\mathbf{a}_2$. The dot product is linear, so $\mathbf{r}\cdot\mathbf{v} = \alpha(\mathbf{r}\cdot\mathbf{a}_1) + \beta(\mathbf{r}\cdot\mathbf{a}_2) = \alpha\cdot 0 + \beta\cdot 0 = 0$. Orthogonality to a spanning set propagates to every linear combination, hence to the entire subspace. This is exactly why the normal equations need only $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$ (orthogonality to the columns) rather than some infinite list of conditions.

Now the anchor itself — the five-point line of §17.4 — solved the same way, this time confirming by numpy. The design matrix and data give (you can grind these by hand exactly as above) $$ A^{\mathsf{T}}A = \begin{bmatrix} 5 & 15 \\ 15 & 55 \end{bmatrix}, \qquad A^{\mathsf{T}}\mathbf{b} = \begin{bmatrix} 15 \\ 53 \end{bmatrix}, $$ whose solution is $\hat{\mathbf{x}} = (0.6,\, 0.8)$: the best-fit line is $\hat{y} = 0.6 + 0.8\,x$.

# Least squares by the normal equations, then checked against numpy's solver.
import numpy as np
x = np.array([1., 2., 3., 4., 5.])
y = np.array([2., 1., 4., 3., 5.])
A = np.column_stack([np.ones_like(x), x])      # design matrix: [1, x]
b = y

AtA, Atb = A.T @ A, A.T @ b                      # form the normal equations
x_hat = np.linalg.solve(AtA, Atb)                # solve the 2x2 system
print("normal equations ->", x_hat)             # [0.6 0.8]

x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # library least squares (QR-based)
print("np.linalg.lstsq   ->", x_lstsq)           # [0.6 0.8]   (matches!)

r = b - A @ x_hat                                # residual
print("A^T r (should be ~0):", A.T @ r)          # [-0.  0.]
print("||r||^2 (min SSE):", float(r @ r))        # 3.6

The two methods agree to machine precision — $\hat{\mathbf{x}} = (0.6, 0.8)$ — and the residual is orthogonal to both columns (A.T @ r prints essentially $\mathbf{0}$). The minimum sum of squared residuals is $3.6$; no line on Earth fits these five points with a smaller total squared vertical gap. Note the two indexing worlds from the style guide: the math writes the intercept as the first coordinate $\hat x_1 = c_0$, while numpy stores it at x_hat[0].

17.6.1 Seeing the fit: the scatter plot and the fitted line

A picture closes the loop between the column-space view and the everyday view. The code below produces Figure 17.1 — the anchor scatter with its least-squares line and the vertical residual segments whose squared lengths we just minimized.

# Figure 17.1 — scatter, least-squares line, and the residual segments.
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1., 2., 3., 4., 5.]); y = np.array([2., 1., 4., 3., 5.])
A = np.column_stack([np.ones_like(x), x])
c0, c1 = np.linalg.solve(A.T @ A, A.T @ y)       # (0.6, 0.8)
xs = np.linspace(0.5, 5.5, 100)
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(x, y, color="C0", zorder=3, label="data $(x_i, y_i)$")
ax.plot(xs, c0 + c1 * xs, "C1-", lw=2, label=f"fit $\\hat y = {c0:.1f} + {c1:.1f}x$")
for xi, yi in zip(x, y):                          # vertical residual segments
    ax.plot([xi, xi], [yi, c0 + c1 * xi], "C3--", lw=1)
ax.set_xlabel("x"); ax.set_ylabel("y")
ax.legend(loc="upper left"); ax.grid(True, alpha=0.3)
ax.set_title("Least-squares line: minimizing the squared vertical gaps")
plt.show()

Figure 17.1 — A scatter of five points $(1,2), (2,1), (3,4), (4,3), (5,5)$ with the least-squares line $\hat{y} = 0.6 + 0.8x$ drawn through their middle. Five short dashed vertical segments connect each data point to the line; these are the residuals $r_i$. Least squares chooses the line that makes the sum of the squares of these vertical segment lengths as small as possible — here, $\sum r_i^2 = 3.6$. Alt-text: a 2D scatter plot of five dots rising left-to-right, an orange straight line fitted through them, and red dashed vertical ticks marking the gap from each dot to the line.

The dashed segments in Figure 17.1 are the components of the residual vector $\mathbf{r}$ from the column-space picture — one segment per data point, one component per coordinate of $\mathbb{R}^5$. The everyday "vertical gaps" and the abstract "residual orthogonal to $C(A)$" are the very same object, viewed in 2D data space versus 5D column space. That equivalence is the whole chapter in one figure.

Real-World Application — Calibration in experimental science. Every measuring instrument — a thermometer, a pressure gauge, a mass spectrometer — outputs a raw signal (a voltage, a count) that must be converted to a physical quantity. You record the instrument's signal at several known reference values, then fit a line (or low-degree polynomial) by least squares: the slope and intercept are the calibration constants. The residuals tell you the instrument's noise floor. We work a full thermistor calibration in Case Study 2, where a six-point linear fit nails the sensor's response to $R^2 = 0.9999$.

17.6.2 The projection matrix: seeing the column-space picture in numbers

We have been saying "the fit $\hat{\mathbf{b}}$ is the projection of $\mathbf{b}$ onto $C(A)$." Let's make that literal with a single matrix that performs the projection. Chaining the steps — solve for $\hat{\mathbf{x}}$, then form $A\hat{\mathbf{x}}$ — gives $$ \hat{\mathbf{b}} = A\hat{\mathbf{x}} = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\,\mathbf{b} = P\mathbf{b}, \qquad\text{where}\qquad P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}. $$ The matrix $P$ is the projection matrix onto the column space $C(A)$: feed it any data vector $\mathbf{b}$ and it returns the closest point in $C(A)$. It is built once from $A$ alone and works for every $\mathbf{b}$. (Chapter 19 develops $P$ in full; we meet it here because it makes the geometry tangible.) For the anchor's $5\times 2$ matrix, $P$ is a $5\times 5$ matrix with two properties that encode the word "projection," and numpy confirms both.

# The projection matrix P onto C(A): symmetric, idempotent, trace = rank.
import numpy as np
x = np.array([1., 2., 3., 4., 5.]); y = np.array([2., 1., 4., 3., 5.])
A = np.column_stack([np.ones_like(x), x])
P = A @ np.linalg.inv(A.T @ A) @ A.T               # 5x5 projection matrix
print("symmetric? ", np.allclose(P, P.T))          # True
print("idempotent? P@P == P:", np.allclose(P @ P, P))  # True
print("trace(P) =", round(np.trace(P), 6))         # 2.0  == rank(A)
print("P b =", P @ y)                              # [1.4 2.2 3.  3.8 4.6] == A x_hat

Both checks pass, and each one means something. Symmetry ($P = P^{\mathsf{T}}$) is the algebraic signature of an orthogonal projection — the perpendicular kind, as opposed to a slanted projection. Idempotence ($P^2 = P$) says projecting twice does nothing new: once you are on the floor, the shadow of the shadow is where you already stand. And the trace equals the rank ($\operatorname{tr}(P) = 2 = \dim C(A)$) because a projection matrix has eigenvalue $1$ along the subspace it projects onto and $0$ perpendicular to it (a fact we prove in Chapter 23), so its trace counts the dimension of $C(A)$. Three abstract phrases — orthogonal, idempotent, dimension — and numpy displays all three in one $5\times 5$ matrix.

The same picture yields a Pythagorean accounting that is worth seeing once. The data splits cleanly into its in-subspace part and its orthogonal part, $\mathbf{b} = \hat{\mathbf{b}} + \mathbf{r}$, with $\hat{\mathbf{b}} \perp \mathbf{r}$, so the squared lengths add: $$ \lVert\mathbf{b}\rVert^2 = \lVert\hat{\mathbf{b}}\rVert^2 + \lVert\mathbf{r}\rVert^2. $$

# Pythagoras in R^5: data^2 = fit^2 + residual^2 (the orthogonal split).
b = y; b_hat = P @ b; r = b - b_hat
print("||b||^2          =", float(b @ b))          # 55.0
print("||b_hat||^2 + ||r||^2 =", float(b_hat @ b_hat + r @ r))  # 55.0
print("b_hat . r =", round(float(b_hat @ r), 12))  # ~0  (orthogonal)

The identity $55 = 51.4 + 3.6$ holds to machine precision, and $\hat{\mathbf{b}}\cdot\mathbf{r} \approx 0$ confirms the right angle. This is the same right triangle as the $R^2$ accounting of §17.8, and it is the engine behind why "least squares" and "projection" are not two facts but one. Whenever you see a sum-of-squares decomposition in statistics — total $=$ explained $+$ residual — you are looking at the Pythagorean theorem in a column space, no more and no less.

Check Your Understanding — A projection matrix $P$ satisfies $P^2 = P$. What does this idempotence say about applying the fit twice — i.e., projecting $\hat{\mathbf{b}}$ onto $C(A)$ again? And what would the residual of that second projection be?

Answer

Projecting twice changes nothing: $P\hat{\mathbf{b}} = P(P\mathbf{b}) = P^2\mathbf{b} = P\mathbf{b} = \hat{\mathbf{b}}$. The fit is already in $C(A)$, so it is its own closest point — its projection is itself. The residual of the second projection is therefore $\hat{\mathbf{b}} - \hat{\mathbf{b}} = \mathbf{0}$. Geometrically: a point already on the floor casts itself as its own shadow. This is exactly why idempotence is the defining algebraic property of a projection.

17.6.3 More than one input: fitting a plane (multiple regression)

So far one input $x$ predicts one output $y$, and the fit is a line in the 2D scatter plot. But the column-space machinery never cared that there was only one predictor — it only cared about the columns of $A$. Add a second predictor $z$ and the model becomes $y = c_0 + c_1 x + c_2 z$, a plane in $(x, z, y)$-space. The design matrix simply grows a column: $$ A = \begin{bmatrix} 1 & x_1 & z_1 \\ 1 & x_2 & z_2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & z_m \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}, $$ and the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ are now $3\times 3$. With $k$ predictors plus an intercept, $A$ has $k+1$ columns, $C(A)$ is a $(k+1)$-dimensional flat inside $\mathbb{R}^m$, and the fit is the projection of $\mathbf{b}$ onto it — a hyperplane in feature space. This is multiple regression, the everyday tool of econometrics, epidemiology, and machine learning, and it is the identical projection. Each coefficient $c_j$ now reads as the effect of feature $j$ holding the others fixed — the marginal value of one more bedroom, say, with square footage and age held constant. Case Study 1 fits exactly such a model to house prices with three features and recovers interpretable per-feature dollar effects.

The Key Insight — There is really only one linear-regression algorithm, and you now own it. Line, polynomial, plane, hyperplane, multi-feature model — they differ only in which columns you put in the design matrix $A$. Build the columns, form $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$, solve. The geometry is always "project $\mathbf{b}$ onto $C(A)$." This is the leverage Chapter 13 promised: master one subspace idea, and an entire industry's worth of curve-fitting reduces to choosing columns.

17.6.4 A second worked example: estimating an economic trend by hand

The three-point fit of §17.6 was deliberately tiny, so let me work one more case fully by pencil — this time in economics, where least squares earns its keep estimating trends. An analyst at a small firm has four years of annual revenue (in millions of dollars), recorded at years $t = 0, 1, 2, 3$: the data is $\mathbf{b} = (10,\, 14,\, 15,\, 21)$. Revenue is climbing, but not on a perfectly straight track, and the analyst wants the best-fit trend line $\hat{y} = c_0 + c_1 t$ — the level $c_0$ at year zero and, more importantly, the average growth $c_1$ in millions per year. This is the same line-fitting model as before; only the field, and the story the slope tells, have changed. The design matrix pairs an all-ones intercept column with the year column: $$ A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 10 \\ 14 \\ 15 \\ 21 \end{bmatrix}. $$ Four equations, two unknowns — tall and (because the points are not collinear) inconsistent, so we project. Form the two sides of the normal equations exactly as before. The Gram matrix reads off as dot products of the columns: its top-left entry is the count of points ($4$), the off-diagonal is the sum of the years ($0+1+2+3 = 6$), and the bottom-right is the sum of squared years ($0+1+4+9 = 14$). The right-hand side $A^{\mathsf{T}}\mathbf{b}$ holds the sum of the revenues and the sum of $t_i\, b_i$: $$ A^{\mathsf{T}}A = \begin{bmatrix} 4 & 6 \\ 6 & 14 \end{bmatrix}, \qquad A^{\mathsf{T}}\mathbf{b} = \begin{bmatrix} 10+14+15+21 \\ 0\cdot10 + 1\cdot14 + 2\cdot15 + 3\cdot21 \end{bmatrix} = \begin{bmatrix} 60 \\ 107 \end{bmatrix}. $$ The determinant of $A^{\mathsf{T}}A$ is $4\cdot 14 - 6\cdot 6 = 20 \neq 0$, so the two columns are linearly independent — $A$ has full column rank, and §17.5 guarantees a unique least-squares solution. Apply the $2\times 2$ inverse rule of Chapter 9: $$ \hat{\mathbf{x}} = \frac{1}{20}\begin{bmatrix} 14 & -6 \\ -6 & 4 \end{bmatrix}\begin{bmatrix} 60 \\ 107 \end{bmatrix} = \frac{1}{20}\begin{bmatrix} 840 - 642 \\ -360 + 428 \end{bmatrix} = \frac{1}{20}\begin{bmatrix} 198 \\ 68 \end{bmatrix} = \begin{bmatrix} 9.9 \\ 3.4 \end{bmatrix}. $$ So the trend line is $\hat{y} = 9.9 + 3.4\,t$: the model puts year-zero revenue at \$9.9M and, the headline number, estimates the firm growing at an average of \$3.4M per year. That slope is the whole point of the exercise — a single coefficient extracted from four noisy observations by dropping a perpendicular onto a plane in $\mathbb{R}^4$.

Now confirm the projection is genuine by checking orthogonality, the defining property of §17.3. The fitted values and residual are $$ \hat{\mathbf{b}} = A\hat{\mathbf{x}} = \begin{bmatrix} 9.9 \\ 13.3 \\ 16.7 \\ 20.1 \end{bmatrix}, \qquad \mathbf{r} = \mathbf{b} - \hat{\mathbf{b}} = \begin{bmatrix} 0.1 \\ 0.7 \\ -1.7 \\ 0.9 \end{bmatrix}. $$ Dot the residual with the all-ones column: $0.1 + 0.7 - 1.7 + 0.9 = 0$. Dot it with the year column $(0,1,2,3)$: $0(0.1) + 1(0.7) + 2(-1.7) + 3(0.9) = 0.7 - 3.4 + 2.7 = 0$. Both zero — the residual is orthogonal to $C(A)$, so $\hat{\mathbf{b}}$ really is the foot of the perpendicular from $\mathbf{b}$ onto the column space, the closest a straight trend can come to this revenue history. The squared residual length is $\lVert\mathbf{r}\rVert^2 = 0.01 + 0.49 + 2.89 + 0.81 = 4.2$, and numpy agrees with every figure:

# Economic trend estimation by the normal equations, checked against numpy.
import numpy as np
t = np.array([0., 1., 2., 3.])                   # years
b = np.array([10., 14., 15., 21.])               # revenue ($M)
A = np.column_stack([np.ones_like(t), t])        # design matrix [1, t]
x_hat = np.linalg.solve(A.T @ A, A.T @ b)         # solve normal equations
print("trend (level, growth) ->", x_hat)         # [9.9 3.4]
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print("np.linalg.lstsq       ->", x_lstsq)        # [9.9 3.4]  (matches)
r = b - A @ x_hat
print("A^T r (should be ~0):", A.T @ r)           # [ 0. -0.]
print("R^2 =", 1 - (r @ r) / np.sum((b - b.mean())**2))  # 0.93226

The trend explains $R^2 = 0.932$ of the year-to-year variation in revenue (mean revenue \$15M, total spread $\mathrm{SS}{\text{tot}} = 62$, residual $\mathrm{SS} = 4.2$, so $R^2 = 1 - 4.2/62$) — a strong straight-line trend with a little wiggle the line cannot capture. Notice we changed *nothing* about the method between the study-hours line, the calibration curve, and this revenue trend: same design matrix recipe, same normal equations, same projection onto $C(A)$. Only the columns and the interpretation of the slope differ. That sameness is the entire lesson of the chapter, now demonstrated across three different fields.}

Real-World Application — Trend estimation in economics. Fitting a line to a time series — revenue, GDP, unemployment, prices against the year or quarter — is the most basic tool of empirical economics, and its slope is the estimated growth rate (per year, per quarter). Economists rarely stop at one predictor: they regress an outcome on several variables at once (the multiple regression of §17.6.3), reading each coefficient as that factor's effect with the others held fixed, exactly as Chapter 13's column space promised. The same projection that drew our four-point trend underlies demand curves, Phillips curves, and the regressions reported in every central-bank forecast — see linear regression in statistics for the inference layer (is a slope of $3.4$ significantly above zero?) and the economics of measurement for what these aggregates mean.

Check Your Understanding — In the revenue fit, the residual at year $t = 2$ was the largest in magnitude ($r_3 = -1.7$). Does a single large residual mean we made an arithmetic error, and what does that component being negative tell you about year 2?

Answer

No error — a large residual is perfectly consistent with least squares; the method minimizes the sum of squared residuals, not each one, so individual gaps can be sizeable as long as no other line does better overall. We verified the fit is correct by the orthogonality check ($A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$), which is the true certificate of optimality. The negative sign, $r_3 = b_3 - \hat b_3 = 15 - 16.7 = -1.7$, means year 2's actual revenue (\$15M) fell *below* the trend's prediction (\$16.7M) — that year underperformed the straight-line growth. Reading the signs of the residuals is how analysts spot which years beat or missed the trend.

17.7 How do you fit a curve? Polynomial regression is still linear

A surprise that delights every student: fitting a parabola, or any polynomial, to data is still a linear least-squares problem. "But a parabola is not a straight line!" True — yet the word "linear" in linear regression refers to linearity in the coefficients, not in the variable $x$. The model $y = c_0 + c_1 x + c_2 x^2$ is a linear combination of the known functions $1$, $x$, $x^2$ with unknown weights $c_0, c_1, c_2$. Linear combinations of fixed columns are exactly what the column space is made of. So we change only the design matrix; the projection machinery is identical.

To fit a quadratic to data points $(x_i, y_i)$, build the design matrix with three columns — the all-ones column, the $x$ column, and the $x^2$ column: $$ A = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & x_m^2 \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}. $$ This $A$ is a Vandermonde matrix (named for its powers-of-$x$ structure). Its column space is the 3-dimensional space of all quadratics evaluated at the data points, and least squares projects $\mathbf{b}$ onto it — same normal equations $A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$, now $3\times 3$. A degree-$d$ polynomial uses $d+1$ columns ($1, x, x^2, \dots, x^d$). Watch it on five points that bend:

# Polynomial regression: fit y = c0 + c1 x + c2 x^2 by the SAME normal equations.
import numpy as np
x = np.array([-2., -1., 0., 1., 2.])
y = np.array([4.5, 1.0, 0.5, 1.5, 5.0])          # roughly a parabola
A = np.column_stack([np.ones_like(x), x, x**2])  # Vandermonde columns: 1, x, x^2
coeffs = np.linalg.solve(A.T @ A, A.T @ y)        # normal equations, now 3x3
print("normal equations ->", coeffs)             # [0.285714 0.15  1.107143]
print("np.polyfit deg 2 ->", np.polyfit(x, y, 2)[::-1])  # same, reordered
r = y - A @ coeffs
print("R^2 =", 1 - (r @ r) / np.sum((y - y.mean())**2))  # 0.9935

The fitted parabola is $\hat{y} = 0.2857 + 0.15\,x + 1.1071\,x^2$, explaining $99.3\%$ of the variation (the $R^2$ of the next section). np.polyfit(x, y, 2) returns the same coefficients (it reports highest power first, hence the [::-1] to match our order). Nothing about the method changed — only the columns. This is the leverage the whole book promised: learn the projection once, and line-fitting, curve-fitting, plane-fitting, and multi-feature regression are all the same act.

Common Pitfall — "A higher-degree polynomial fits better, so use the highest degree you can." A degree-$d$ polynomial through $m$ points has $d+1$ coefficients; push $d+1 = m$ and the curve passes through every point exactly ($R^2 = 1$, zero residual), because the column space has grown to fill all of $\mathbb{R}^m$. That is not a triumph — it is overfitting: the curve has memorized the noise and wiggles wildly between points, predicting new data terribly. The Vandermonde matrix also becomes badly ill-conditioned as $d$ grows, so $A^{\mathsf{T}}A$ is numerically hopeless. More columns always reduce the training residual; they do not improve prediction. Choosing the right model size is a core theme of statistics and machine learning (the bias–variance tradeoff), beyond this chapter but flagged for you here.

17.8 How good is the fit? The coefficient of determination $R^2$

We have a best line, but is it any good? A perfect fit has residual $\mathbf{r} = \mathbf{0}$; a useless fit has a residual as big as the data's own spread. The standard one-number summary is the coefficient of determination, written $R^2$, and it has a clean linear-algebra reading in terms of the lengths we already computed.

Start from the total spread of the data around its mean $\bar{y}$. Define three sums of squares: the total $\mathrm{SS}_{\text{tot}} = \sum_i (y_i - \bar{y})^2$ (how much $y$ varies overall), the residual $\mathrm{SS}_{\text{res}} = \sum_i (y_i - \hat{y}_i)^2 = \lVert\mathbf{r}\rVert^2$ (what the fit leaves unexplained), and the explained $\mathrm{SS}_{\text{reg}} = \sum_i (\hat{y}_i - \bar{y})^2$ (what the fit captures). Then $$ R^2 = 1 - \frac{\mathrm{SS}_{\text{res}}}{\mathrm{SS}_{\text{tot}}} = \frac{\mathrm{SS}_{\text{reg}}}{\mathrm{SS}_{\text{tot}}}. $$ $R^2 = 1$ means zero residual (perfect fit); $R^2 = 0$ means the fit is no better than just predicting the mean for everyone. For the anchor line, $\mathrm{SS}_{\text{res}} = 3.6$ and $\mathrm{SS}_{\text{tot}} = 10$, so $R^2 = 1 - 3.6/10 = 0.64$: the line explains $64\%$ of the variation in the scores.

# R^2 from the residual we already have — three sums of squares.
import numpy as np
x = np.array([1., 2., 3., 4., 5.]); y = np.array([2., 1., 4., 3., 5.])
A = np.column_stack([np.ones_like(x), x])
y_hat = A @ np.linalg.solve(A.T @ A, A.T @ y)     # fitted values
ss_res = np.sum((y - y_hat)**2)                   # 3.6
ss_tot = np.sum((y - y.mean())**2)                # 10.0
print("R^2 =", 1 - ss_res / ss_tot)              # 0.64

There is a satisfying geometric reason $R^2$ never exceeds $1$ and never goes below $0$ for a model that includes an intercept: it is the Pythagorean theorem in the column space again. Once the all-ones column is in $A$, the mean $\bar{y}$ is itself the projection of $\mathbf{b}$ onto that single column, and the three sums of squares are the squared sides of a right triangle, $\mathrm{SS}_{\text{tot}} = \mathrm{SS}_{\text{reg}} + \mathrm{SS}_{\text{res}}$ — the same $\lVert\mathbf{b}-\bar{\mathbf{y}}\rVert^2 = \lVert\hat{\mathbf{b}}-\bar{\mathbf{y}}\rVert^2 + \lVert\mathbf{r}\rVert^2$ split we proved in §17.3. So $R^2$ is the squared cosine of the angle between the centered data and its fit. For the simple line, it even equals the square of the familiar correlation coefficient $r$ (here $r = 0.8$, $r^2 = 0.64$) — the statistics-class formula and the linear-algebra projection are, once more, the same object.

Real-World Application — Predicting house prices (data science). Multiple regression on features like square footage, number of bedrooms, and age fits a hyperplane through a cloud of houses in feature space; the coefficients tell you the marginal dollar value of each feature, and $R^2$ tells you how much of the price variation those features explain. Case Study 1 builds exactly this model on eight houses and reaches $R^2 \approx 0.99$. The same least-squares-as-projection idea scales to thousands of features in production pricing and demand models — see linear regression for the data-science framing and regression in statistics for the inferential side (confidence intervals, hypothesis tests on the coefficients).

17.9 When are the normal equations a bad idea? (conditioning and alternatives)

We end the theory honestly. The normal equations are the clearest route to least squares — they make the geometry into one square system — but they are often the worst route numerically, and a professional needs to know exactly when to abandon them. The villain is the conditioning of $A^{\mathsf{T}}A$.

Recall (and we develop fully in Chapter 38) that the condition number $\kappa(A)$ measures how much a matrix amplifies relative error: solving a system with condition number $\kappa$ can lose about $\log_{10}\kappa$ digits of accuracy. The fatal fact is that forming $A^{\mathsf{T}}A$ squares the condition number: $\kappa(A^{\mathsf{T}}A) = \kappa(A)^2$. So if $A$ has $\kappa(A) = 10^4$ (mildly ill-conditioned, common with nearly-collinear features or high-degree polynomials), then $\kappa(A^{\mathsf{T}}A) = 10^8$, and you have thrown away eight digits before you even start solving. In double precision (about 16 digits) you may have nothing left.

# The normal equations SQUARE the condition number — see it directly.
import numpy as np
x = np.array([1., 2., 3., 4., 5.])
A = np.column_stack([np.ones_like(x), x, x + 1e-3*np.array([0,1,0,-1,0])])  # near-collinear col
print("cond(A)     =", np.linalg.cond(A))         # ~1.19e4
print("cond(A^T A) =", np.linalg.cond(A.T @ A))   # ~1.43e8  (the square)

The output confirms it: $\kappa(A) \approx 1.2\times 10^4$ but $\kappa(A^{\mathsf{T}}A) \approx 1.4\times 10^8 \approx \kappa(A)^2$. The near-collinear third column (almost a copy of the $x$ column) makes $A^{\mathsf{T}}A$ nearly singular, and inverting it is asking for trouble.

Common Pitfall — "To fit a model, just compute $\hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$." This is the single most common numerical mistake in applied least squares. Forming and inverting $A^{\mathsf{T}}A$ squares the condition number and can destroy accuracy when columns are nearly collinear or the design matrix is poorly scaled (high-degree polynomial fits are notorious). Two cures, both covered later in this book: the QR factorization (Chapter 20) solves least squares using $A = QR$ without ever forming $A^{\mathsf{T}}A$, and the singular value decomposition (Chapter 30) handles even rank-deficient problems gracefully via the pseudoinverse. This is why np.linalg.lstsq uses SVD internally rather than the normal equations — and why you should too, for anything beyond a well-conditioned textbook problem. The normal equations remain the right tool for understanding least squares and for small, well-behaved fits; just do not reach for the explicit inverse on real, messy data.

Math-Major Sidebar — When $A$ does not have full column rank, $A^{\mathsf{T}}A$ is singular and $\hat{\mathbf{x}}$ is not unique — but the projection $\hat{\mathbf{b}} = \operatorname{proj}_{C(A)}\mathbf{b}$ still exists and is unique (the closest point in a subspace is always unique). What fails is the coordinates: infinitely many $\hat{\mathbf{x}}$ give the same $A\hat{\mathbf{x}}$, differing by any vector in $N(A)$. The canonical choice is the minimum-norm least-squares solution $\hat{\mathbf{x}} = A^{+}\mathbf{b}$, where $A^{+}$ is the Moore–Penrose pseudoinverse, the object that the SVD of Chapter 30 constructs for every matrix. When $A$ has full column rank, $A^{+} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ and we recover the normal-equation formula; the pseudoinverse is the honest generalization that survives rank deficiency. This is the bridge from this chapter to the deepest factorization in the book.

Historical Note — The method of least squares was published by Adrien-Marie Legendre in 1805, but Carl Friedrich Gauss claimed to have used it since around 1795 and famously employed it in 1801 to predict where the newly-lost asteroid Ceres would reappear — a triumph that helped establish the method [verify]. The priority dispute between Gauss and Legendre is one of the famous quarrels in the history of mathematics [verify]. The recognition that least squares is orthogonal projection — the geometric viewpoint this chapter leads with — crystallized later, as the language of vector spaces and inner products matured in the late 19th and early 20th centuries. Gauss also connected the method to the normal distribution of errors, which is why the bell curve is called "Gaussian."

17.10 Build it from scratch: `least_squares(A, b)`

You now know the entire algorithm: form the normal equations and solve them with the linear-system tools you built earlier in the book. Time to add it to the toolkit. This function seeds toolkit/projection.py; Chapter 19 will deepen it with a general project_onto operator and the projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$.

Build Your Toolkit — Implement least_squares(A, b) in toolkit/projection.py, reusing your earlier solvers (no numpy inside the implementation):

Form the Gram matrix $G = A^{\mathsf{T}}A$ and the right-hand side $\mathbf{c} = A^{\mathsf{T}}\mathbf{b}$ — use your transpose and matmul from Chapter 8.

Solve the square system $G\hat{\mathbf{x}} = \mathbf{c}$ with gaussian_elimination / back_substitute from Chapter 4 (or solve_lu from Chapter 10). Return $\hat{\mathbf{x}}$.

Guard the condition: if $G$ is singular (your elimination hits a zero pivot), raise an informative error — $A$ lacks full column rank, so the coefficients are not unique (§17.5). Do not silently return garbage.

Verify (using numpy only to check, never inside the function): on the anchor data of §17.6, least_squares(A, b) returns [0.6, 0.8]; compare with np.linalg.lstsq(A, b, rcond=None)[0] — they must agree to machine precision. Also confirm your residual is orthogonal to the columns: np.allclose(A.T @ (b - A @ x_hat), 0). As a stress test, build the three-point matrix of §17.6 and confirm you recover [1.5, 0.5].

A sketch of the core, to show how thin it is once the earlier toolkit exists (implement it yourself before peeking):

# Sketch: least squares via the normal equations. Reuses Ch.8 + Ch.4 toolkit.
def least_squares(A, b):
    At = transpose(A)                  # Chapter 8
    G  = matmul(At, A)                 # Gram matrix A^T A  (Chapter 8)
    c  = matvec(At, b)                 # A^T b
    try:
        x_hat = gaussian_elimination(G, c)   # solve G x = c  (Chapter 4)
    except ZeroPivotError:
        raise ValueError("A^T A is singular: A lacks full column rank (see 17.5)")
    return x_hat
# On the anchor data this returns [0.6, 0.8]  (matches np.linalg.lstsq).

Notice that least_squares adds no new numerical machinery — it is a five-line wrapper around the solvers you already wrote, plus one guard for the rank condition. That is the payoff of the progressive toolkit and of this chapter's central message: once you recognize regression as "project $\mathbf{b}$ onto $C(A)$, which means solve $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$," the implementation is something you already know how to do. The reference module is assembled separately with the QR-based version teased in §17.9 for the ill-conditioned case; your job here is the normal-equations version, the one that makes the geometry literal.

Computational Note — np.linalg.lstsq(A, b, rcond=None) returns a 4-tuple: the solution, the residual sum of squares (an array — exactly our $\mathrm{SS}_{\text{res}} = \lVert\mathbf{r}\rVert^2$, here $3.6$), the rank of $A$, and the singular values of $A$. It is built on the SVD, so it is more robust than your normal-equations function on ill-conditioned data — which is precisely the §17.9 lesson. Use your from-scratch version to confirm you understand the mechanics; use lstsq when accuracy on real data matters. The residual array is empty when $A$ is rank-deficient or wide, a numpy quirk worth knowing before you index into it.

17.11 What should you carry forward from this chapter?

Linear regression is where the four-fundamental-subspaces picture of Part III stops being beautiful theory and starts being the workhorse of quantitative life. The single sentence to engrave: the least-squares fit is the orthogonal projection of the data vector $\mathbf{b}$ onto the column space $C(A)$, and the residual is whatever is left over, orthogonal to $C(A)$. From that geometric fact, everything followed — the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ are just "the residual is orthogonal to every column," and the full-column-rank condition is just "the columns are independent, so the coordinates are pinned down."

You also met the limits. The normal equations are perfect for understanding and for small, well-conditioned problems, but they square the condition number, so for real data you reach instead for the QR factorization (Chapter 20) or the SVD (Chapter 30) — the same SVD that np.linalg.lstsq uses under the hood, and the same decomposition that will let you handle rank-deficient fits with the pseudoinverse. That forward arrow is not incidental: orthogonality, the very idea that powered this chapter, is the entire subject of Part IV. We leaned on an intuitive notion of "perpendicular" here; Chapter 18 makes the dot product, norm, and angle precise, and Chapter 19 builds the projection operator we used informally into a full theory, with the projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ at its center.

Most of all, carry forward the habit of seeing both pictures at once — the scatter-plot-with-a-line of data space and the point-projected-onto-a-subspace of column space — and knowing they are the same act. That double vision is what this book has been training since Chapter 1, and regression is its first big dividend. Next, Part IV gives us the right angles and projections to make "closest" exact, powering not just least squares but QR factorization, Fourier series, and the rotations of quantum mechanics.

Prerequisites

Learning Objectives

In This Chapter

Application: Linear Regression as Projection onto the Column Space

17.1 Why does the most useful equation in data science have no solution?

17.2 What is an overdetermined system, and why is it the rule, not the exception?

17.2.1 Why measure error with the squared length?

17.3 What does it mean for the best fit to be a projection?

17.4 How do we package "fit a line" as a matrix problem? (the design matrix)

17.5 What are the normal equations, and why do they hold? (the algebra of orthogonality)

Why $A^{\mathsf{T}}A$ is invertible exactly when $A$ has full column rank

Warning

17.6 A full worked example by hand: the best line through three points

17.6.1 Seeing the fit: the scatter plot and the fitted line

17.6.2 The projection matrix: seeing the column-space picture in numbers

17.6.3 More than one input: fitting a plane (multiple regression)

17.6.4 A second worked example: estimating an economic trend by hand

17.7 How do you fit a curve? Polynomial regression is still linear

17.8 How good is the fit? The coefficient of determination $R^2$

17.9 When are the normal equations a bad idea? (conditioning and alternatives)

17.10 Build it from scratch: least_squares(A, b)

17.11 What should you carry forward from this chapter?

17.10 Build it from scratch: `least_squares(A, b)`