Chapter 17 — Key Takeaways
The big ideas
- Linear regression IS orthogonal projection onto the column space. When the data vector $\mathbf{b}$ does not lie in $C(A)$ — the normal state of affairs for real, noisy data — there is no exact solution to $A\mathbf{x} = \mathbf{b}$. The best you can do is drop a perpendicular from $\mathbf{b}$ onto the subspace $C(A)$; the foot of that perpendicular is the fit $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$, and the leftover is the residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$, orthogonal to $C(A)$. (Theme: geometry and algebra are two views of one object — "minimize squared error" and "find the closest point in a subspace" are the same act.)
- Overdetermined ($m > n$, tall) systems are the regression regime. $C(A)$ is a thin, low-dimensional flat inside a fat space $\mathbb{R}^m$, so a generic $\mathbf{b}$ misses it. Least squares abandons "solve exactly" for "get as close as possible," measuring closeness by the residual norm $\lVert A\mathbf{x} - \mathbf{b}\rVert$ — the honest Euclidean distance, which is why minimizing it is projection.
- The normal equations are just orthogonality, written down. Requiring $\mathbf{r} \perp C(A)$ means $\mathbf{r}$ is orthogonal to every column of $A$, i.e. $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$, which rearranges to $A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$. The tall, unsolvable $m$-equation system becomes a square, solvable $n$-equation one.
- Uniqueness needs full column rank. $A^{\mathsf{T}}A$ is invertible if and only if the columns of $A$ are linearly independent ($\operatorname{rank}(A) = n$), because $N(A^{\mathsf{T}}A) = N(A)$. Then $\hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$ is unique. When the condition fails (collinear features), the fit $\hat{\mathbf{b}}$ is still unique but its coordinates $\hat{\mathbf{x}}$ are not.
- One algorithm, many models — just change the columns. Fitting a line, a polynomial, a plane, or a multi-feature hyperplane all use the same normal equations; they differ only in the design matrix $A$. Polynomial regression is still linear least squares ("linear in the coefficients"), built on a Vandermonde matrix. Projecting onto the all-ones column alone is just averaging.
- $R^2$ is the Pythagorean theorem in column space. $R^2 = 1 - \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}}$ measures the fraction of variance the fit explains. With an intercept in the model, $\mathrm{SS}_{\text{tot}} = \mathrm{SS}_{\text{reg}} + \mathrm{SS}_{\text{res}}$ is the right-triangle split $\lVert\mathbf{b}-\bar{\mathbf{y}}\rVert^2 = \lVert\hat{\mathbf{b}}-\bar{\mathbf{y}}\rVert^2 + \lVert\mathbf{r}\rVert^2$.
- The normal equations are for understanding, not industrial computing. Forming $A^{\mathsf{T}}A$ squares the condition number ($\kappa(A^{\mathsf{T}}A) = \kappa(A)^2$), so for ill-conditioned or large problems you lose accuracy. Real software (and
np.linalg.lstsq) uses the QR factorization (Chapter 20) or the SVD (Chapter 30) on $A$ directly. (Theme: computation validates theory — every number here was checked againstnumpy, and you builtleast_squaresfrom scratch.)
Skills you gained
- Recognizing an overdetermined $A\mathbf{x} = \mathbf{b}$ and recasting "fit a model" as "project $\mathbf{b}$ onto $C(A)$."
- Building the design matrix for a line, a polynomial (Vandermonde), and a multi-feature model.
- Forming and solving the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ by hand for small systems.
- Verifying a fit two ways: checking $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$ (residual orthogonal to the columns), and matching
np.linalg.lstsq. - Computing and interpreting $R^2$, and recognizing the explained/residual/total sum-of-squares split as Pythagoras.
- Stating the full-column-rank condition and explaining why collinear features break coefficient uniqueness.
- Diagnosing when the normal equations are numerically unsafe (ill-conditioning, high-degree polynomials) and naming the fixes (QR, SVD, pseudoinverse).
- Implementing
least_squares(A, b)from scratch on the earlier toolkit, with a singular-Gram-matrix guard, and verifying against numpy.
Terms to know
linear regression, least squares, overdetermined system, residual ($\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$), design (model) matrix, normal equations ($A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$), Gram matrix ($A^{\mathsf{T}}A$), orthogonal projection onto $C(A)$, projection matrix ($P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$), fitted values ($\hat{\mathbf{b}} = A\hat{\mathbf{x}}$), full column rank, coefficient of determination ($R^2$), polynomial / multiple regression, Vandermonde matrix, ill-conditioning (squared condition number), pseudoinverse ($A^{+}$).
How this connects to the rest of the book
- Backward. This chapter is the payoff of Part III: the column space $C(A)$ (Chapter 13) is the set of achievable fits; the full-column-rank condition rests on independence and the null space (Chapters 6, 13); the $2\times 2$ and $4\times 4$ normal systems are solved with Gaussian elimination (Chapter 4), the inverse (Chapter 9), and LU (Chapter 10); and $\mathbf{b} \in C(A)$ is the solvability criterion of Chapter 13, now read as "is the fit exact?"
- Forward. Orthogonality was used intuitively here and becomes the whole of Part IV. Chapter 18 makes the dot product, norm, and angle precise; Chapter 19 builds the orthogonal-projection operator and the projection matrix $P$ in full (deepening this chapter's
toolkit/projection.py); Chapter 20 gives the QR factorization, the numerically sound way to solve least squares without forming $A^{\mathsf{T}}A$; and Chapter 30's SVD delivers the pseudoinverse that solves rank-deficient least squares and is whatnp.linalg.lstsquses under the hood. The Gram matrix $A^{\mathsf{T}}A$, symmetric and positive (semi)definite, returns in Chapter 28. - The throughline. Least squares is the cleanest evidence for the book's recurring claim that the four fundamental subspaces are the organizing framework for all of linear algebra: regression splits $\mathbf{b}$ into its $C(A)$ part (the fit) and its $N(A^{\mathsf{T}})$ part (the residual), the two orthogonal-complement subspaces of the output space. Learn the projection once, and an entire industry's worth of data fitting reduces to a perpendicular dropped onto a column space.