Chapter 19 — Key Takeaways

The big ideas

  • Projection is "drop a perpendicular," made exact in any dimension. The closest point in a subspace $S$ to a vector $\mathbf{b}$ is its orthogonal projection $\mathbf{p}$ — the unique point whose error $\mathbf{e} = \mathbf{b} - \mathbf{p}$ is perpendicular to all of $S$. "Closest point" and "perpendicular error" are the same fact.

  • Projecting onto a line: $\;\mathbf{p} = \dfrac{\mathbf{a}\cdot\mathbf{b}}{\mathbf{a}\cdot\mathbf{a}}\,\mathbf{a}$. The scalar $\hat c = (\mathbf{a}\cdot\mathbf{b})/(\mathbf{a}\cdot\mathbf{a})$ is forced by demanding $\mathbf{a}\cdot\mathbf{e} = 0$. Since $\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert\lVert\mathbf{b}\rVert\cos\theta$, it equals the signed scalar projection $\lVert\mathbf{b}\rVert\cos\theta$ divided by $\lVert\mathbf{a}\rVert$, so projection and the cosine of Chapter 18 measure the same thing.

  • Projecting onto a subspace $C(A)$ reduces to one linear system, the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$, which encode "the error is orthogonal to every column." The projection is $\mathbf{p} = A\hat{\mathbf{x}}$.

  • When $A$ has full column rank, the projection matrix is $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$, so $\mathbf{p} = P\mathbf{b}$ for every $\mathbf{b}$. Its two defining properties are idempotence $P^2 = P$ (projecting twice changes nothing) and symmetry $P^{\mathsf{T}} = P$ (what makes the projection orthogonal rather than oblique). Its eigenvalues are all $0$ or $1$, and $\operatorname{tr}(P) = \dim C(A)$.

  • Least squares IS orthogonal projection. Fitting a model by minimizing $\lVert A\mathbf{x} - \mathbf{b}\rVert$ means finding the point of $C(A)$ closest to the data $\mathbf{b}$ — the projection. The fitted values are $\hat{\mathbf{b}} = P\mathbf{b}$ (the "hat matrix"); the residuals are the orthogonal error; "residuals uncorrelated with predictors" is exactly $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$. This is the rigorous grounding of Chapter 17's regression.

  • Orthonormal bases make projection trivial. If the columns form an orthonormal set $\mathbf{q}_1,\dots,\mathbf{q}_n$ (so $Q^{\mathsf{T}}Q = I$), the inverse drops out because $(Q^{\mathsf{T}}Q)^{-1} = I$: $P = QQ^{\mathsf{T}}$ and $\mathbf{p} = \sum_i(\mathbf{q}_i\cdot\mathbf{b})\mathbf{q}_i$. Just dot, scale, and add, with no system to solve. This is the motivation for Gram–Schmidt in Chapter 20.
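
The formulas above can be checked numerically. The following is a minimal numpy sketch, not the chapter's own code: the matrix `A` and vector `b` are made-up example data (an intercept column and a slope column), and all variable names are illustrative.

```python
import numpy as np

# Hypothetical example data: 4 observations, columns = [intercept, slope].
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# Projection onto a line, using the second column of A as the direction a.
a = A[:, 1]
c_hat = (a @ b) / (a @ a)        # scalar forced by a . e = 0
p_line = c_hat * a
e_line = b - p_line              # should be perpendicular to a

# Projection onto C(A): solve the normal equations A^T A x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # approximately [0.9, 0.9]
p = A @ x_hat
e = b - p                        # should satisfy A^T e = 0

# Projection matrix (formed explicitly only to inspect its properties).
P = A @ np.linalg.inv(A.T @ A) @ A.T
eigs = np.linalg.eigvalsh(P)     # P is symmetric, so eigvalsh applies

# Least squares gives the same coefficients as the normal equations.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print("a . e_line      :", a @ e_line)
print("A^T e           :", A.T @ e)
print("P^2 == P        :", np.allclose(P @ P, P))
print("P^T == P        :", np.allclose(P.T, P))
print("eigenvalues of P:", np.round(eigs, 10))
print("trace of P      :", np.trace(P))
print("matches lstsq   :", np.allclose(x_hat, x_lstsq))
```

Every printed dot product should be zero up to rounding, and the trace should equal the number of independent columns of `A`, here 2.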

Skills you gained

  • Project a vector onto a line and onto a general subspace, by hand and with numpy.
  • Build the projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ and verify $P^2 = P$ and $P^{\mathsf{T}} = P$.
  • Solve the normal equations to obtain the least-squares solution, and interpret the residual as an orthogonal error.
  • Decompose a vector orthogonally as $\mathbf{b} = \mathbf{p} + \mathbf{e}$ and use the complementary projector $I - P$ to remove a known component.
  • Project onto an orthonormal basis with the simplified dot-product formula.
  • State and apply the full-column-rank condition that $(A^{\mathsf{T}}A)^{-1}$ requires.
  • (Toolkit) Implement project_onto and projection_matrix from scratch in toolkit/projection.py, verified against numpy.
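
As a rough illustration of the toolkit skill, here is one way `project_onto` and `projection_matrix` might be written without numpy and then checked against it. The function names come from the list above, but the signatures, the helper `_solve`, the list-of-lists representation, and the singularity tolerance are all assumptions of this sketch, not the contents of toolkit/projection.py.

```python
import numpy as np  # used only to verify the from-scratch results

def _solve(M, rhs, tol=1e-12):
    """Solve M x = rhs by Gauss-Jordan elimination with partial pivoting.

    Hypothetical helper; tol is an assumed absolute pivot threshold.
    """
    n = len(M)
    aug = [[float(v) for v in M[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:
            raise ValueError("A^T A is singular: A lacks full column rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def project_onto(A, b):
    """Project b onto the column space of A via the normal equations."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
           for i in range(n)]
    Atb = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    x = _solve(AtA, Atb)
    return [sum(A[k][j] * x[j] for j in range(n)) for k in range(m)]

def projection_matrix(A):
    """Build P column by column: column j of P is the projection of e_j."""
    m = len(A)
    cols = [project_onto(A, [1.0 if i == j else 0.0 for i in range(m)])
            for j in range(m)]
    return [[cols[j][i] for j in range(m)] for i in range(m)]

# Made-up example: project b onto the span of two columns in R^3.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [6.0, 0.0, 0.0]
p = project_onto(A, b)                       # expected [5, 2, -1]
P = np.array(projection_matrix(A))
e = (np.eye(3) - P) @ np.array(b)            # complementary projector I - P

# Verify against numpy's least-squares solver.
A_np, b_np = np.array(A), np.array(b)
p_np = A_np @ np.linalg.lstsq(A_np, b_np, rcond=None)[0]
print("matches numpy:", np.allclose(p, p_np))
print("P^2 == P     :", np.allclose(P @ P, P))
print("P^T == P     :", np.allclose(P.T, P))

# The full-column-rank condition: dependent columns make A^T A singular.
try:
    project_onto([[1.0, 2.0], [2.0, 4.0]], [1.0, 1.0])
    rank_check_fired = False
except ValueError:
    rank_check_fired = True
print("rank check fired:", rank_check_fired)
```

Building `P` by projecting each standard basis vector avoids writing a separate matrix inverse, at the cost of solving the same small system several times. For a teaching toolkit that trade-off seems reasonable.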

The one proof to remember

The closest-point theorem (§19.10): for any other point $\mathbf{y}$ of the subspace, $$\lVert\mathbf{b} - \mathbf{y}\rVert^2 = \lVert\mathbf{b} - \mathbf{p}\rVert^2 + \lVert\mathbf{p} - \mathbf{y}\rVert^2 \ge \lVert\mathbf{b} - \mathbf{p}\rVert^2,$$ with equality only at $\mathbf{y} = \mathbf{p}$. The engine is the Pythagorean theorem: the perpendicular error $\mathbf{e}$ and the within-subspace detour $\mathbf{p} - \mathbf{y}$ are orthogonal legs of a right triangle, so the distance to $\mathbf{b}$ (the hypotenuse) is minimized by taking no detour at all. This is why "orthogonal error" forces "closest point."
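
The identity can also be checked numerically. The sketch below uses a randomly generated subspace and random competitors $\mathbf{y} = A\mathbf{z}$. The dimensions, seed, and variable names are arbitrary choices for illustration, and a numerical check is a sanity test, not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))   # columns span a 2-D subspace S of R^5
b = rng.standard_normal(5)

# Orthogonal projection of b onto S and the perpendicular error.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
p = A @ x_hat
e = b - p

max_gap = 0.0        # largest violation of the Pythagorean identity
closest_ok = True    # does every competitor sit at least as far from b as p?
for _ in range(1000):
    y = A @ rng.standard_normal(2)             # another point of S
    lhs = np.sum((b - y) ** 2)                 # ||b - y||^2
    rhs = np.sum(e ** 2) + np.sum((p - y) ** 2)
    max_gap = max(max_gap, abs(lhs - rhs))
    closest_ok = closest_ok and lhs >= np.sum(e ** 2) - 1e-12

print("largest identity gap:", max_gap)
print("p is never beaten   :", closest_ok)
```

The gap should be at rounding-error level, because the cross term $2\,\mathbf{e}\cdot(\mathbf{p} - \mathbf{y})$ vanishes for every $\mathbf{y}$ in the subspace.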

Terms to know

orthogonal projection · projection onto a line · projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ · normal equations · least squares · closest-point property · residual / error · orthogonality of the error · idempotent · symmetric projector · full column rank · orthogonal complement · orthonormal basis · complementary projector $I - P$ · oblique projection · hat matrix

How this connects to the rest of the book

  • Backward — the four fundamental subspaces (Chapters 13–14). The projection $\mathbf{p}$ lives in the column space $C(A)$; the error $\mathbf{e}$ lives in the left null space $N(A^{\mathsf{T}})$ (since $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$). These are orthogonal complements that fill $\mathbb{R}^m$, which is precisely why the decomposition $\mathbf{b} = \mathbf{p} + \mathbf{e}$ exists and is unique. Orthogonality is the structural glue between the subspaces.

  • Backward — dot products and norms (Chapter 18). Everything here is built on $\mathbf{a}\cdot\mathbf{b} = \mathbf{a}^{\mathsf{T}}\mathbf{b}$, the norm $\lVert\cdot\rVert$, and the Pythagorean theorem. Projection is the dot product put to work.

  • Backward — linear regression (Chapter 17). We promised in Chapter 17 that least squares was a projection; this chapter delivered the rigorous proof and the projection matrix that makes it exact.

  • Forward — Gram–Schmidt and QR (Chapter 20). §19.9 showed that orthonormal bases make projection effortless ($P = QQ^{\mathsf{T}}$, no inverse). The very next chapter manufactures such bases from any starting basis via the Gram–Schmidt process, and packages the result as the QR factorization $A = QR$ — the numerically sound way to compute every projection and least-squares fit, avoiding the fragile $A^{\mathsf{T}}A$.

  • Forward — Fourier series (Chapter 22). A Fourier coefficient is a projection onto an orthogonal basis of sines and cosines — the same $\mathbf{q}_i\cdot\mathbf{b}$ formula, lifted to infinite-dimensional function space. Case Study 2's interference removal was finite-dimensional Fourier analysis in disguise.

  • Forward — SVD and PCA (Chapters 30–32). The closest-point principle scales up: PCA finds the subspace whose projection loses the least squared distance, and the best low-rank approximation of a matrix is its projection onto the top singular directions — closest-point problems where the subspace itself is the unknown.
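
As a preview of the QR route mentioned above, the following sketch uses numpy's built-in `np.linalg.qr` in place of a hand-written Gram–Schmidt. The data are a made-up example. It shows the "dot, scale, add" projection and compares the QR least-squares solution with the normal equations.

```python
import numpy as np

# Made-up example data.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Reduced QR (numpy's default mode): Q is 3x2 with orthonormal columns, R is 2x2.
Q, R = np.linalg.qr(A)

# Projection with an orthonormal basis: dot, scale, add. No system to solve.
p = sum((Q[:, i] @ b) * Q[:, i] for i in range(Q.shape[1]))
P = Q @ Q.T                              # same projector as A (A^T A)^{-1} A^T

# Least squares via QR: solve R x = Q^T b (R is upper triangular).
x_qr = np.linalg.solve(R, Q.T @ b)

# Same answer from the normal equations, which square the condition number.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
cond_A = np.linalg.cond(A)
cond_AtA = np.linalg.cond(A.T @ A)

print("Q^T Q == I       :", np.allclose(Q.T @ Q, np.eye(2)))
print("p                :", p)
print("x via QR         :", x_qr)
print("x via normal eqs :", x_ne)
print("cond(A)^2, cond(A^T A):", cond_A ** 2, cond_AtA)
```

On a small, well-conditioned matrix like this one the two routes agree. The squared condition number is what makes $A^{\mathsf{T}}A$ fragile when the columns of $A$ are nearly dependent.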

Recurring themes touched

  • Geometry and algebra are two views of one object: "drop a perpendicular" (geometry) ⇔ $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$ (algebra) ⇔ $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ (computation).
  • Linear algebra is the most applied branch of pure mathematics: the same projection fits regression lines, removes signal interference, builds factor models in finance, casts shadows in graphics, and (next) extracts Fourier coefficients. Learn it once, use it everywhere.
  • Computation validates theory and theory guides computation: the normal equations expose the geometry, but the QR route of Chapter 20 is the right algorithm — understanding and computing are different skills, both necessary.