Chapter 17 — Key Takeaways

DataField.Dev

Chapter 17 — Key Takeaways

The big ideas

Linear regression IS orthogonal projection onto the column space. When the data vector $\mathbf{b}$ does not lie in $C(A)$ — the normal state of affairs for real, noisy data — there is no exact solution to $A\mathbf{x} = \mathbf{b}$. The best you can do is drop a perpendicular from $\mathbf{b}$ onto the subspace $C(A)$; the foot of that perpendicular is the fit $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$, and the leftover is the residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$, orthogonal to $C(A)$. (Theme: geometry and algebra are two views of one object — "minimize squared error" and "find the closest point in a subspace" are the same act.)
Overdetermined ($m > n$, tall) systems are the regression regime. $C(A)$ is a thin, low-dimensional flat inside a fat space $\mathbb{R}^m$, so a generic $\mathbf{b}$ misses it. Least squares abandons "solve exactly" for "get as close as possible," measuring closeness by the residual norm $\lVert A\mathbf{x} - \mathbf{b}\rVert$ — the honest Euclidean distance, which is why minimizing it is projection.
The normal equations are just orthogonality, written down. Requiring $\mathbf{r} \perp C(A)$ means $\mathbf{r}$ is orthogonal to every column of $A$, i.e. $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$, which rearranges to $A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$. The tall, unsolvable $m$-equation system becomes a square, solvable $n$-equation one.
Uniqueness needs full column rank. $A^{\mathsf{T}}A$ is invertible if and only if the columns of $A$ are linearly independent ($\operatorname{rank}(A) = n$), because $N(A^{\mathsf{T}}A) = N(A)$. Then $\hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$ is unique. When the condition fails (collinear features), the fit $\hat{\mathbf{b}}$ is still unique but its coordinates $\hat{\mathbf{x}}$ are not.
One algorithm, many models — just change the columns. Fitting a line, a polynomial, a plane, or a multi-feature hyperplane all use the same normal equations; they differ only in the design matrix $A$. Polynomial regression is still linear least squares ("linear in the coefficients"), built on a Vandermonde matrix. Projecting onto the all-ones column alone is just averaging.
$R^2$ is the Pythagorean theorem in column space. $R^2 = 1 - \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}}$ measures the fraction of variance the fit explains. With an intercept in the model, $\mathrm{SS}_{\text{tot}} = \mathrm{SS}_{\text{reg}} + \mathrm{SS}_{\text{res}}$ is the right-triangle split $\lVert\mathbf{b}-\bar{\mathbf{y}}\rVert^2 = \lVert\hat{\mathbf{b}}-\bar{\mathbf{y}}\rVert^2 + \lVert\mathbf{r}\rVert^2$.
The normal equations are for understanding, not industrial computing. Forming $A^{\mathsf{T}}A$ squares the condition number ($\kappa(A^{\mathsf{T}}A) = \kappa(A)^2$), so for ill-conditioned or large problems you lose accuracy. Real software (and np.linalg.lstsq) uses the QR factorization (Chapter 20) or the SVD (Chapter 30) on $A$ directly. (Theme: computation validates theory — every number here was checked against numpy, and you built least_squares from scratch.)

Skills you gained

Recognizing an overdetermined $A\mathbf{x} = \mathbf{b}$ and recasting "fit a model" as "project $\mathbf{b}$ onto $C(A)$."
Building the design matrix for a line, a polynomial (Vandermonde), and a multi-feature model.
Forming and solving the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ by hand for small systems.
Verifying a fit two ways: checking $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$ (residual orthogonal to the columns), and matching np.linalg.lstsq.
Computing and interpreting $R^2$, and recognizing the explained/residual/total sum-of-squares split as Pythagoras.
Stating the full-column-rank condition and explaining why collinear features break coefficient uniqueness.
Diagnosing when the normal equations are numerically unsafe (ill-conditioning, high-degree polynomials) and naming the fixes (QR, SVD, pseudoinverse).
Implementing least_squares(A, b) from scratch on the earlier toolkit, with a singular-Gram-matrix guard, and verifying against numpy.

Terms to know

linear regression, least squares, overdetermined system, residual ($\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$), design (model) matrix, normal equations ($A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$), Gram matrix ($A^{\mathsf{T}}A$), orthogonal projection onto $C(A)$, projection matrix ($P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$), fitted values ($\hat{\mathbf{b}} = A\hat{\mathbf{x}}$), full column rank, coefficient of determination ($R^2$), polynomial / multiple regression, Vandermonde matrix, ill-conditioning (squared condition number), pseudoinverse ($A^{+}$).

How this connects to the rest of the book

Backward. This chapter is the payoff of Part III: the column space $C(A)$ (Chapter 13) is the set of achievable fits; the full-column-rank condition rests on independence and the null space (Chapters 6, 13); the $2\times 2$ and $4\times 4$ normal systems are solved with Gaussian elimination (Chapter 4), the inverse (Chapter 9), and LU (Chapter 10); and $\mathbf{b} \in C(A)$ is the solvability criterion of Chapter 13, now read as "is the fit exact?"
Forward. Orthogonality was used intuitively here and becomes the whole of Part IV. Chapter 18 makes the dot product, norm, and angle precise; Chapter 19 builds the orthogonal-projection operator and the projection matrix $P$ in full (deepening this chapter's toolkit/projection.py); Chapter 20 gives the QR factorization, the numerically sound way to solve least squares without forming $A^{\mathsf{T}}A$; and Chapter 30's SVD delivers the pseudoinverse that solves rank-deficient least squares and is what np.linalg.lstsq uses under the hood. The Gram matrix $A^{\mathsf{T}}A$, symmetric and positive (semi)definite, returns in Chapter 28.
The throughline. Least squares is the cleanest evidence for the book's recurring claim that the four fundamental subspaces are the organizing framework for all of linear algebra: regression splits $\mathbf{b}$ into its $C(A)$ part (the fit) and its $N(A^{\mathsf{T}})$ part (the residual), the two orthogonal-complement subspaces of the output space. Learn the projection once, and an entire industry's worth of data fitting reduces to a perpendicular dropped onto a column space.