Chapter 19 — Key Takeaways

DataField.Dev

The big ideas

Projection is "drop a perpendicular," made exact in any dimension. The closest point in a subspace $S$ to a vector $\mathbf{b}$ is its orthogonal projection $\mathbf{p}$ — the unique point whose error $\mathbf{e} = \mathbf{b} - \mathbf{p}$ is perpendicular to all of $S$. "Closest point" and "perpendicular error" are the same fact.
Projecting onto a line: $\;\mathbf{p} = \dfrac{\mathbf{a}\cdot\mathbf{b}}{\mathbf{a}\cdot\mathbf{a}}\,\mathbf{a}$. The scalar $\hat c = (\mathbf{a}\cdot\mathbf{b})/(\mathbf{a}\cdot\mathbf{a})$ is forced by demanding $\mathbf{a}\cdot\mathbf{e} = 0$. Since $\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert\lVert\mathbf{b}\rVert\cos\theta$, it equals $\lVert\mathbf{b}\rVert\cos\theta / \lVert\mathbf{a}\rVert$: the signed scalar projection $\lVert\mathbf{b}\rVert\cos\theta$, measured in units of $\lVert\mathbf{a}\rVert$. Projection and the cosine of Chapter 18 measure the same thing.
Projecting onto a subspace $C(A)$ reduces to one linear system, the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$, which encode "the error is orthogonal to every column." The projection is $\mathbf{p} = A\hat{\mathbf{x}}$.
The projection matrix is $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$, so $\mathbf{p} = P\mathbf{b}$ for every $\mathbf{b}$. Its two defining properties are idempotence $P^2 = P$ (projecting twice changes nothing) and symmetry $P^{\mathsf{T}} = P$ (what makes the projection orthogonal rather than oblique). Its eigenvalues are all $0$ or $1$, and $\operatorname{tr}(P) = \dim C(A)$.
Least squares IS orthogonal projection. Fitting a model by minimizing $\lVert A\mathbf{x} - \mathbf{b}\rVert$ means finding the point of $C(A)$ closest to the data $\mathbf{b}$, and that point is the projection. The fitted values are $\hat{\mathbf{b}} = P\mathbf{b}$ (which is why $P$ is called the "hat matrix"); the residuals are the orthogonal error $\mathbf{e} = \mathbf{b} - \hat{\mathbf{b}}$; "residuals uncorrelated with predictors" is exactly $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$. This is the rigorous grounding of Chapter 17's regression.
Orthonormal bases make projection trivial. If the columns of $Q$ form an orthonormal set $\mathbf{q}_1,\dots,\mathbf{q}_n$ (so $Q^{\mathsf{T}}Q = I$), the factor $(Q^{\mathsf{T}}Q)^{-1} = I$ drops out of the formula: $P = QQ^{\mathsf{T}}$ and $\mathbf{p} = \sum_i(\mathbf{q}_i\cdot\mathbf{b})\mathbf{q}_i$. Just dot, scale, add, with no system to solve. This is the motivation for Gram–Schmidt in Chapter 20.
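The ideas above can be checked numerically in a few lines. The following is a minimal sketch with numpy; the matrix `A` and vector `b` are illustrative data chosen for this example, not values taken from the chapter, and the explicit inverse is used only because the example is tiny.

```python
import numpy as np

# Illustrative data (an assumption, not taken from the chapter).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Line projection onto a = first column of A: c_hat = (a.b)/(a.a).
a = A[:, 0]
c_hat = (a @ b) / (a @ a)
p_line = c_hat * a

# Subspace projection: solve the normal equations A^T A x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # approximately [5, -3]
p = A @ x_hat                               # approximately [5, 2, -1]
e = b - p                                   # approximately [1, -2, 1]

# Projection matrix (explicit inverse, acceptable only at this small size).
P = A @ np.linalg.inv(A.T @ A) @ A.T

# Orthonormal route: reduced QR gives Q with Q^T Q = I and the same column space.
Q, _ = np.linalg.qr(A)
P_from_Q = Q @ Q.T

print("error orthogonal to columns:", np.allclose(A.T @ e, 0.0))
print("idempotent:", np.allclose(P @ P, P))
print("symmetric:", np.allclose(P, P.T))
print("trace equals dim C(A):", np.isclose(np.trace(P), np.linalg.matrix_rank(A)))
print("eigenvalues:", np.round(np.linalg.eigvalsh(P), 10))
print("Q Q^T matches P:", np.allclose(P_from_Q, P))
```

Each printed check corresponds to one claim in the list: the normal equations make the error orthogonal to every column, $P$ is idempotent and symmetric with eigenvalues $0$ or $1$, and an orthonormal basis reproduces the same $P$ without any inverse.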

Skills you gained

Project a vector onto a line and onto a general subspace, by hand and with numpy.
Build the projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ and verify $P^2 = P$ and $P^{\mathsf{T}} = P$.
Solve the normal equations to obtain the least-squares solution, and interpret the residual as an orthogonal error.
Decompose a vector orthogonally as $\mathbf{b} = \mathbf{p} + \mathbf{e}$ and use the complementary projector $I - P$ to remove a known component.
Project onto an orthonormal basis with the simplified dot-product formula.
State and apply the full-column-rank condition that $(A^{\mathsf{T}}A)^{-1}$ requires.
(Toolkit) Implement project_onto and projection_matrix from scratch in toolkit/projection.py, verified against numpy.
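As a reminder of what the toolkit skill involves, here is one possible sketch of the two functions. Only the names `project_onto` and `projection_matrix` come from the chapter; the signatures, the rank check, and the choice to solve a linear system rather than form the inverse are assumptions made for this illustration and may differ from the version in toolkit/projection.py.

```python
import numpy as np

def projection_matrix(A):
    """Return P = A (A^T A)^{-1} A^T for a matrix A with full column rank.

    Hypothetical signature: a 1-D input is treated as a single column.
    """
    A = np.asarray(A, dtype=float)
    if A.ndim == 1:
        A = A.reshape(-1, 1)
    # (A^T A)^{-1} exists only when the columns of A are independent.
    if np.linalg.matrix_rank(A) < A.shape[1]:
        raise ValueError("A must have full column rank")
    # Solve (A^T A) X = A^T instead of forming the inverse explicitly.
    return A @ np.linalg.solve(A.T @ A, A.T)

def project_onto(A, b):
    """Return the orthogonal projection of b onto the column space of A."""
    return projection_matrix(A) @ np.asarray(b, dtype=float)

# Usage with illustrative data (not from the chapter).
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

P = projection_matrix(A)
p = project_onto(A, b)
e = (np.eye(3) - P) @ b          # complementary projector I - P removes the C(A) part

# Verify against numpy's own least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print("matches lstsq:", np.allclose(p, A @ x_ref))
print("b = p + e:", np.allclose(p + e, b))
print("p orthogonal to e:", np.isclose(p @ e, 0.0))
```

Raising an error on rank-deficient input keeps the full-column-rank condition visible instead of letting the solver fail with a less informative message.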

The one proof to remember

The closest-point theorem (§19.10): for any point $\mathbf{y}$ of the subspace, $$\lVert\mathbf{b} - \mathbf{y}\rVert^2 = \lVert\mathbf{b} - \mathbf{p}\rVert^2 + \lVert\mathbf{p} - \mathbf{y}\rVert^2 \ge \lVert\mathbf{b} - \mathbf{p}\rVert^2,$$ with equality only at $\mathbf{y} = \mathbf{p}$. The engine is the Pythagorean theorem: the perpendicular error $\mathbf{e} = \mathbf{b} - \mathbf{p}$ and the within-subspace detour $\mathbf{p} - \mathbf{y}$ are orthogonal legs of a right triangle whose hypotenuse is $\mathbf{b} - \mathbf{y}$, so the distance from $\mathbf{y}$ to $\mathbf{b}$ is minimized by taking no detour at all. This is why "orthogonal error" forces "closest point."
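For reference, the identity comes from a one-line expansion. This is a sketch of the standard argument and may be laid out differently in §19.10. Write $\mathbf{b} - \mathbf{y} = \mathbf{e} + (\mathbf{p} - \mathbf{y})$ with $\mathbf{e} = \mathbf{b} - \mathbf{p}$, and expand the squared norm:

$$\lVert\mathbf{b} - \mathbf{y}\rVert^2 = \lVert\mathbf{e}\rVert^2 + 2\,\mathbf{e}\cdot(\mathbf{p} - \mathbf{y}) + \lVert\mathbf{p} - \mathbf{y}\rVert^2 = \lVert\mathbf{b} - \mathbf{p}\rVert^2 + \lVert\mathbf{p} - \mathbf{y}\rVert^2.$$

The cross term vanishes because $\mathbf{p}$ and $\mathbf{y}$ both lie in the subspace, so $\mathbf{p} - \mathbf{y}$ does too, and $\mathbf{e}$ is perpendicular to everything in it. Equality in the inequality requires $\lVert\mathbf{p} - \mathbf{y}\rVert = 0$, that is, $\mathbf{y} = \mathbf{p}$.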

Terms to know

orthogonal projection · projection onto a line · projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ · normal equations · least squares · closest-point property · residual / error · orthogonality of the error · idempotent · symmetric projector · full column rank · orthogonal complement · orthonormal basis · complementary projector $I - P$ · oblique projection · hat matrix

How this connects to the rest of the book

Backward — the four fundamental subspaces (Chapters 13–14). The projection $\mathbf{p}$ lives in the column space $C(A)$; the error $\mathbf{e}$ lives in the left null space $N(A^{\mathsf{T}})$ (since $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$). These are orthogonal complements that fill $\mathbb{R}^m$, which is precisely why the decomposition $\mathbf{b} = \mathbf{p} + \mathbf{e}$ exists and is unique. Orthogonality is the structural glue between the subspaces.
Backward — dot products and norms (Chapter 18). Everything here is built on $\mathbf{a}\cdot\mathbf{b} = \mathbf{a}^{\mathsf{T}}\mathbf{b}$, the norm $\lVert\cdot\rVert$, and the Pythagorean theorem. Projection is the dot product put to work.
Backward — linear regression (Chapter 17). We promised in Chapter 17 that least squares was a projection; this chapter delivered the rigorous proof and the projection matrix that makes it exact.
Forward — Gram–Schmidt and QR (Chapter 20). §19.9 showed that orthonormal bases make projection effortless ($P = QQ^{\mathsf{T}}$, no inverse). The very next chapter manufactures such bases from any starting basis via the Gram–Schmidt process, and packages the result as the QR factorization $A = QR$ — the numerically sound way to compute every projection and least-squares fit, avoiding the fragile $A^{\mathsf{T}}A$.
Forward — Fourier series (Chapter 22). A Fourier coefficient is a projection onto an orthogonal basis of sines and cosines — the same $\mathbf{q}_i\cdot\mathbf{b}$ formula, lifted to infinite-dimensional function space. Case Study 2's interference removal was finite-dimensional Fourier analysis in disguise.
Forward — SVD and PCA (Chapters 30–32). The closest-point principle scales up: PCA finds the subspace whose projection loses the least squared distance, and the best low-rank approximation of a matrix is its projection onto the top singular directions — closest-point problems where the subspace itself is the unknown.
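The remark about the fragile $A^{\mathsf{T}}A$ can be made concrete with a small experiment. The sketch below compares the normal-equations route with the QR route on illustrative data (a cubic polynomial fit to samples of a sine curve, chosen for this example and not taken from the book) and shows that forming $A^{\mathsf{T}}A$ squares the condition number.

```python
import numpy as np

# Illustrative data (an assumption): cubic polynomial design matrix on [0, 1].
t = np.linspace(0.0, 1.0, 50)
A = np.column_stack([np.ones_like(t), t, t**2, t**3])
b = np.sin(2.0 * np.pi * t)

# Route 1: normal equations A^T A x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Route 2: reduced QR, A = QR with Q^T Q = I, so R x = Q^T b.
Q, R = np.linalg.qr(A)
# R is upper triangular; a general solver is used here for simplicity,
# though a dedicated triangular solve would be the usual choice.
x_qr = np.linalg.solve(R, Q.T @ b)

# Projection via the orthonormal basis: p = Q Q^T b, no inverse needed.
p_qr = Q @ (Q.T @ b)
e_qr = b - p_qr

cond_A = np.linalg.cond(A)
cond_AtA = np.linalg.cond(A.T @ A)

print("solutions agree:", np.allclose(x_normal, x_qr, rtol=1e-6))
print("residual orthogonal to columns:", np.allclose(A.T @ e_qr, 0.0, atol=1e-10))
print("cond(A)      =", cond_A)
print("cond(A^T A)  =", cond_AtA)
print("cond(A)^2    =", cond_A**2)
```

On this mild example both routes agree, but the squared condition number indicates how quickly the normal equations would lose accuracy as the columns become more nearly dependent.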

Recurring themes touched

Geometry and algebra are two views of one object: "drop a perpendicular" (geometry) ⇔ $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$ (algebra) ⇔ $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ (computation).
Linear algebra is the most applied branch of pure mathematics: the same projection fits regression lines, removes signal interference, builds factor models in finance, casts shadows in graphics, and (next) extracts Fourier coefficients. Learn it once, use it everywhere.
Computation validates theory and theory guides computation: the normal equations expose the geometry, but the QR route of Chapter 20 is the right algorithm — understanding and computing are different skills, both necessary.