Orthogonal Projection: The Closest Point and the Least Squares Solution

DataField.Dev

49 min read

> Learning paths. Math majors — read everything, especially the closest-point proof in §19.10 and the Math-Major Sidebar on the orthogonal complement; the projection theorem you prove here is the finite-dimensional ancestor of the projection theorem...

Prerequisites

chapter-18-dot-products-and-norms
chapter-17-application-linear-regression

Learning Objectives

Project a vector orthogonally onto a line and explain why the projection minimizes distance to that line.
Project a vector onto a general subspace C(A) and derive the projection from the orthogonality of the error.
Build the projection matrix P = A(AᵀA)⁻¹Aᵀ and prove its two defining properties, idempotence (P² = P) and symmetry (Pᵀ = P).
Prove the closest-point theorem: the orthogonal projection is the unique nearest point of the subspace, with error orthogonal to it.
State the full-column-rank condition that makes (AᵀA)⁻¹ exist, and explain rigorously why least squares solves the normal equations.
Project onto an orthonormal basis with the simpler formula p = Σ(qᵢ·b)qᵢ, motivating Gram–Schmidt in Chapter 20.

In This Chapter

19.1 What is the closest point in a subspace, and why is the answer a right angle?
19.2 How do you project a vector onto a line?
19.3 What is the projection matrix for a line?
19.4 How do you project a vector onto a general subspace?
19.5 How do you project onto a plane, start to finish? A full worked example
19.6 What is the projection matrix P = A(AᵀA)⁻¹Aᵀ?
19.7 Why does P² = P and Pᵀ = P? (the algebra of a projector)
19.8 Why is least squares exactly an orthogonal projection?
19.9 How does projecting onto an orthonormal basis make everything simpler?
19.10 The closest-point theorem: a proof that the projection is genuinely nearest
19.11 What does the projection look like for the whole-space and trivial cases?
19.12 A wider view: where projection sends you next

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Orthogonal Projection: The Closest Point and the Least Squares Solution

Learning paths. Math majors — read everything, especially the closest-point proof in §19.10 and the Math-Major Sidebar on the orthogonal complement; the projection theorem you prove here is the finite-dimensional ancestor of the projection theorem in any Hilbert space (Chapter 34). CS / Data Science — focus on the Geometric Intuition callouts, the projection matrix in §19.6, the numpy verifications, and the two case studies; this is the rigorous engine under the regression you met in Chapter 17 and under linear regression everywhere in practice. Physics / Engineering — focus on the geometry of "drop a perpendicular," the closest-point picture, and the signal-denoising case study, where projecting out a known component is a daily tool. This chapter assumes the dot product, norm, and orthogonality of Chapter 18 and the column-space view of regression from Chapter 17.

19.1 What is the closest point in a subspace, and why is the answer a right angle?

Stand a few steps off a straight road and you already know, without computing anything, how to reach it fastest: walk straight at it, along the path that meets the road at a right angle. Any other route is a detour — it covers ground along the road that does no good in getting you to the road. That childhood instinct, named in the Part IV introduction, is the entire content of this chapter, made exact and lifted out of the plane into any number of dimensions. The closest point in a flat to a vector off it is the foot of the perpendicular, and the line you walk — the error — meets the flat at ninety degrees.

This chapter takes that picture and turns it into one of the most useful machines in applied mathematics. We start with the simplest case: the closest point on a line through the origin. Then we lift it to a general subspace — a plane, a hyperplane, the column space of any matrix. The closest point is the orthogonal projection of the vector onto the subspace, and the recipe that produces it from a matrix $A$ is a single object we will build and name: the projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$. By the end you will be able to drop a perpendicular onto any subspace, and you will see — finally, rigorously — why the least-squares regression of Chapter 17 is the right thing to do.

Geometric Intuition — Picture a vector $\mathbf{b}$ floating in space and a subspace $S$ — say a plane through the origin — sitting below it like a tabletop. Shine a light straight down (perpendicular to the table). The shadow $\mathbf{b}$ casts on the table is its orthogonal projection $\mathbf{p}$. Two facts about that shadow run the whole chapter: (1) $\mathbf{p}$ is the point of the table closest to $\mathbf{b}$ — no other point of the plane is nearer; and (2) the little vector $\mathbf{e} = \mathbf{b} - \mathbf{p}$ joining the shadow back up to $\mathbf{b}$ is perpendicular to the table. "Closest" and "perpendicular error" are not two coincidences; they are one fact seen from two sides, and §19.10 proves they are equivalent.

There is a reason this single idea deserves a whole chapter rather than a formula on a card. In Chapter 17 you met least squares as a prescription: when $A\mathbf{x} = \mathbf{b}$ has no solution because $\mathbf{b}$ lies off the column space $C(A)$, replace it with the nearest reachable target and solve that instead. We motivated the prescription with a picture and a promise. This chapter delivers the promise. "Nearest reachable target" means the orthogonal projection of $\mathbf{b}$ onto $C(A)$; the condition that the error be perpendicular to $C(A)$ is the normal equation $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$; and the proof that this point is genuinely closest is the closest-point theorem. Regression is not a heuristic. It is geometry.

We will keep faith with the book's order throughout: the picture first, then the algebra, then a proof of the part that matters, then a hand computation, then numpy that confirms every number, then applications from two different fields. Let us begin where the geometry is cleanest — projecting onto a single line.

19.2 How do you project a vector onto a line?

The simplest subspace is a line through the origin, the set of all multiples of one nonzero direction vector $\mathbf{a}$: the line is $\{\,c\,\mathbf{a} : c \in \mathbb{R}\,\}$. Given some other vector $\mathbf{b}$ pointing off that line, we want the point on the line nearest to $\mathbf{b}$. Because every point of the line is some scalar multiple $c\mathbf{a}$, the question is really: which scalar $c$ gives the closest multiple? Call the answer $\hat c$, and the closest point $\mathbf{p} = \hat c\,\mathbf{a}$.

The geometric instinct tells us the error $\mathbf{e} = \mathbf{b} - \mathbf{p}$ must be perpendicular to the line — perpendicular, that is, to the direction $\mathbf{a}$. In Chapter 18 you learned that perpendicular means dot product zero. So the condition pinning down $\hat c$ is $$\mathbf{a} \cdot \mathbf{e} = \mathbf{a} \cdot (\mathbf{b} - \hat c\,\mathbf{a}) = 0.$$ Distribute the dot product (it is linear in each slot, Chapter 18): $\mathbf{a}\cdot\mathbf{b} - \hat c\,(\mathbf{a}\cdot\mathbf{a}) = 0$. Solve for the one unknown $\hat c$: $$\boxed{\;\hat c = \frac{\mathbf{a}\cdot\mathbf{b}}{\mathbf{a}\cdot\mathbf{a}}\;}\qquad\text{and so}\qquad \mathbf{p} = \frac{\mathbf{a}\cdot\mathbf{b}}{\mathbf{a}\cdot\mathbf{a}}\,\mathbf{a}.$$ That is the projection of $\mathbf{b}$ onto the line through $\mathbf{a}$. Notice the denominator $\mathbf{a}\cdot\mathbf{a} = \lVert\mathbf{a}\rVert^2$ is never zero as long as $\mathbf{a}\ne\mathbf{0}$ — a line needs a genuine direction. The whole formula came from one demand: the error is orthogonal to the direction.

The Key Insight — The scalar $\hat c = (\mathbf{a}\cdot\mathbf{b})/(\mathbf{a}\cdot\mathbf{a})$ is forced, not chosen. It is the unique multiple of $\mathbf{a}$ whose leftover $\mathbf{b} - \hat c\mathbf{a}$ has no component along $\mathbf{a}$. Every other multiple leaves some of $\mathbf{b}$'s along-$\mathbf{a}$ content unaccounted for, and — as §19.10 will prove — sits strictly farther from $\mathbf{b}$. One orthogonality equation, one solution.

It is worth connecting $\hat c$ to the angle between vectors from Chapter 18, because the link reveals what projection is really measuring. Recall $\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert\cos\theta$, where $\theta$ is the angle between $\mathbf{a}$ and $\mathbf{b}$. Substitute into the coefficient: $\hat c = \lVert\mathbf{b}\rVert\cos\theta / \lVert\mathbf{a}\rVert$, so the projection has length $$\lVert\mathbf{p}\rVert = |\hat c|\,\lVert\mathbf{a}\rVert = \lVert\mathbf{b}\rVert\,|\cos\theta|.$$ The quantity $\lVert\mathbf{b}\rVert\cos\theta$ is the signed scalar projection of $\mathbf{b}$ onto $\mathbf{a}$ — exactly the "shadow length" you would measure if you shone a light perpendicular to $\mathbf{a}$. When $\theta$ is acute the shadow falls in the $+\mathbf{a}$ direction ($\hat c > 0$); when obtuse it falls the other way ($\hat c < 0$); and when $\mathbf{b}\perp\mathbf{a}$ the shadow vanishes ($\hat c = 0$), because a vector perpendicular to the line projects to the origin. Projection and the cosine of Chapter 18 are measuring the same thing — how much of one vector points along another — which is why cosine similarity and projection are two faces of one idea throughout data science.

19.2.1 A worked line projection by hand

Let's drop a perpendicular in the plane. Take the direction $\mathbf{a} = (2, 1)$ and the vector $\mathbf{b} = (3, 4)$. The line through $\mathbf{a}$ is the set of multiples $\{c(2,1)\}$; we want the point on it nearest to $(3,4)$. Compute the two dot products: $$\mathbf{a}\cdot\mathbf{b} = 2\cdot 3 + 1\cdot 4 = 10, \qquad \mathbf{a}\cdot\mathbf{a} = 2^2 + 1^2 = 5.$$ So $\hat c = 10/5 = 2$, and the projection is $$\mathbf{p} = 2\,(2, 1) = (4, 2).$$ The error is $\mathbf{e} = \mathbf{b} - \mathbf{p} = (3,4) - (4,2) = (-1, 2)$. Check the orthogonality that built the formula: $\mathbf{a}\cdot\mathbf{e} = 2\cdot(-1) + 1\cdot 2 = 0$. The error is perpendicular to the line, exactly as the geometry demanded. And the distance from $\mathbf{b}$ to the line is the length of that error, $\lVert\mathbf{e}\rVert = \sqrt{(-1)^2 + 2^2} = \sqrt{5} \approx 2.236$.

# Project b onto the line through a, the 1-D case.  (Chapter 19)
import numpy as np
a = np.array([2., 1.])
b = np.array([3., 4.])
c_hat = (a @ b) / (a @ a)        # the scalar:  (a.b)/(a.a) = 10/5 = 2.0
p = c_hat * a                    # the projection p = c_hat * a
e = b - p                        # the error (residual)
print("c_hat =", c_hat)          # 2.0
print("p     =", p)              # [4. 2.]
print("e     =", e)              # [-1.  2.]
print("a . e =", a @ e)          # 0.0   <- error is orthogonal to the line
print("dist  =", np.linalg.norm(e))   # 2.2360679...  = sqrt(5)

Running it prints c_hat = 2.0, p = [4. 2.], e = [-1. 2.], a . e = 0.0, and dist = 2.23606.... Every hand number is confirmed: the foot of the perpendicular is $(4,2)$, and the perpendicular itself is $(-1,2)$, genuinely at a right angle to $(2,1)$.

Common Pitfall — Confusing the projection (a vector) with the coefficient (a scalar). The number $\hat c = 2$ is not the projection; it is the amount of $\mathbf{a}$ you take. The projection is the vector $\mathbf{p} = \hat c\,\mathbf{a} = (4,2)$. A second, subtler version of the same slip: writing $\mathbf{p} = (\mathbf{a}\cdot\mathbf{b})\,\mathbf{a}$ and forgetting to divide by $\mathbf{a}\cdot\mathbf{a}$. That dropped denominator is the normalization; it is only absent when $\mathbf{a}$ is already a unit vector ($\mathbf{a}\cdot\mathbf{a} = 1$), a convenience we will exploit hard in §19.9.

Check Your Understanding — Project $\mathbf{b} = (4, 0)$ onto the line through $\mathbf{a} = (1, 1)$ (the line $y = x$). What are $\mathbf{p}$ and the error $\mathbf{e}$, and is $\mathbf{e}$ perpendicular to the line?

Answer

$\mathbf{a}\cdot\mathbf{b} = 4$, $\mathbf{a}\cdot\mathbf{a} = 2$, so $\hat c = 2$ and $\mathbf{p} = 2(1,1) = (2,2)$ — the closest point on $y=x$ to $(4,0)$. The error is $\mathbf{e} = (4,0)-(2,2) = (2,-2)$, and $\mathbf{a}\cdot\mathbf{e} = 1\cdot 2 + 1\cdot(-2) = 0$: yes, perpendicular. Geometrically, $(4,0)$ sits below the line $y=x$, and the nearest point on that line is $(2,2)$, reached by sliding along the perpendicular direction $(1,-1)$.

19.3 What is the projection matrix for a line?

We just computed a projection as a recipe: dot, divide, scale. But there is a deeper object hiding here. The map $\mathbf{b} \mapsto \mathbf{p}$ that sends any vector to its projection onto a fixed line is itself a linear transformation — and every linear transformation is a matrix (Chapter 7). Let us find that matrix.

Start from $\mathbf{p} = \dfrac{\mathbf{a}\cdot\mathbf{b}}{\mathbf{a}\cdot\mathbf{a}}\,\mathbf{a}$ and rewrite the dot product as a matrix product. Recall from Chapter 18 that $\mathbf{a}\cdot\mathbf{b} = \mathbf{a}^{\mathsf{T}}\mathbf{b}$ (a row times a column, giving a $1\times 1$ scalar). Then $$\mathbf{p} = \mathbf{a}\,\frac{\mathbf{a}^{\mathsf{T}}\mathbf{b}}{\mathbf{a}^{\mathsf{T}}\mathbf{a}} = \frac{\mathbf{a}\,\mathbf{a}^{\mathsf{T}}}{\mathbf{a}^{\mathsf{T}}\mathbf{a}}\,\mathbf{b}.$$ The key move is regrouping: $\mathbf{a}(\mathbf{a}^{\mathsf{T}}\mathbf{b}) = (\mathbf{a}\mathbf{a}^{\mathsf{T}})\mathbf{b}$, valid because matrix multiplication is associative (Chapter 8). The object $\mathbf{a}\mathbf{a}^{\mathsf{T}}$ is an outer product — a column times a row — and it is an $n\times n$ matrix, not a scalar. Divide it by the scalar $\mathbf{a}^{\mathsf{T}}\mathbf{a}$ and you have the projection matrix for the line: $$P = \frac{\mathbf{a}\,\mathbf{a}^{\mathsf{T}}}{\mathbf{a}^{\mathsf{T}}\mathbf{a}}.$$ Now $\mathbf{p} = P\mathbf{b}$ for every $\mathbf{b}$: one fixed matrix projects all of space onto the line.

For our example $\mathbf{a} = (2,1)$, the outer product is $$\mathbf{a}\mathbf{a}^{\mathsf{T}} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}\begin{bmatrix} 2 & 1 \end{bmatrix} = \begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix}, \qquad P = \frac{1}{5}\begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 0.8 & 0.4 \\ 0.4 & 0.2 \end{bmatrix}.$$ Test it on $\mathbf{b} = (3,4)$: $P\mathbf{b} = (0.8\cdot 3 + 0.4\cdot 4,\ 0.4\cdot 3 + 0.2\cdot 4) = (4, 2)$ — the same projection we found by hand. The matrix encodes the entire projection operation.

# The projection MATRIX onto a line:  P = (a a^T) / (a^T a).   (Chapter 19)
import numpy as np
a = np.array([2., 1.])
P = np.outer(a, a) / (a @ a)     # outer product a a^T, divided by the scalar a.a
print("P =\n", P)                # [[0.8 0.4]
                                 #  [0.4 0.2]]
b = np.array([3., 4.])
print("P @ b   =", P @ b)        # [4. 2.]   same projection as before
print("P @ P close P?", np.allclose(P @ P, P))   # True  -> idempotent
print("P^T == P?      ", np.allclose(P, P.T))    # True  -> symmetric
print("eigenvalues:", np.round(np.linalg.eigvalsh(P), 6))   # [0. 1.]

The output confirms $P\mathbf{b} = (4,2)$, that $P^2 = P$, that $P^{\mathsf{T}} = P$, and that the eigenvalues are $0$ and $1$. Hold on to those last three facts — they are the defining properties of every orthogonal projection matrix, and we will prove them in general in §19.7.

Geometric Intuition — Why are the eigenvalues exactly $0$ and $1$? Because a projection does only two things to space (Chapter 23 will make "eigenvalue" precise). Vectors already on the line are left alone — $P$ scales them by $1$, so the line direction is an eigenvector with eigenvalue $1$. Vectors perpendicular to the line are flattened to zero — $P$ scales them by $0$, so the perpendicular direction is an eigenvector with eigenvalue $0$. A projection keeps the part of $\mathbf{b}$ inside the subspace and discards the part outside; "keep" is eigenvalue $1$, "discard" is eigenvalue $0$.

19.3.1 Seeing the line projection with the 2D visualizer

The recurring visualizer from Chapter 1 makes the flattening visible. It draws the unit square and its image under a $2\times 2$ matrix; feed it our projection matrix $P = \frac{1}{5}\begin{bmatrix}4&2\\2&1\end{bmatrix}$ and watch the entire square collapse onto the line $\{c(2,1)\}$ — a one-dimensional segment of zero area. That collapse is the geometric signature of a projection: a whole dimension (everything perpendicular to the line) is crushed to nothing, while the line itself survives untouched.

# Using the visualizer from Chapter 1: a projection flattens the square onto a line.
import numpy as np
from toolkit.visualizer import visualize_2d        # the FROZEN 2D visualizer (Ch.1)
P = np.outer([2., 1.], [2., 1.]) / 5.0             # projection onto the line through (2,1)
print("det(P) =", round(float(np.linalg.det(P)), 6))   # 0.0  -> area destroyed
ax = visualize_2d(P, title="Projection onto a line: square -> segment")
# plt.show()

Figure 19.1 — A projection matrix flattens the unit square onto a line. The blue dashed unit square is squashed by $P$ onto the orange segment along the direction $(2,1)$ — the line we project onto. The determinant is $0$ (area is destroyed: a full dimension collapses), matching the eigenvalues $\{0,1\}$ of §19.3. Alt-text: a unit square mapped by a 2-by-2 projection matrix onto a single line segment through the origin, both transformed basis vectors landing on that line. Compare this to the singular-matrix collapse you saw in Chapter 13 — a projection is one specific kind of singular matrix, the kind whose surviving directions are kept at full scale (eigenvalue $1$) and whose collapsed directions are sent exactly to zero.

The visualizer also shows why the projection of $(3,4)$ is $(4,2)$ and not something else. Drop $(3,4)$ straight down onto the orange line, perpendicular to it, and you land at $(4,2)$ — the only point of the line for which the connecting segment meets the line at a right angle. The picture and the formula $\hat c = (\mathbf{a}\cdot\mathbf{b})/(\mathbf{a}\cdot\mathbf{a})$ are the same statement.

19.4 How do you project a vector onto a general subspace?

A line is one-dimensional, spanned by a single direction. Real problems live in bigger subspaces — a plane in $\mathbb{R}^3$, a $50$-dimensional column space inside $\mathbb{R}^{1000}$. The leap from line to subspace is the heart of the chapter, and the geometry that guided us does not change one bit. The closest point in the subspace is still the one whose error is perpendicular to the subspace. All that grows is the bookkeeping, because "perpendicular to the subspace" now means perpendicular to every direction in it at once.

Set it up cleanly. Let the subspace be $S = C(A)$, the column space of an $m\times n$ matrix $A$ whose columns $\mathbf{a}_1, \dots, \mathbf{a}_n$ span $S$. (Any subspace can be written this way — just stack a spanning set as columns.) Every vector in $S$ is a combination of the columns, i.e. $A\mathbf{x}$ for some coefficient vector $\mathbf{x}\in\mathbb{R}^n$ (this is the column-space reading from Chapter 13). So the closest point in $S$ to a given $\mathbf{b}$ is $\mathbf{p} = A\hat{\mathbf{x}}$ for the right coefficients $\hat{\mathbf{x}}$ — and finding those coefficients is the whole game.

The condition is the same right-angle demand, now imposed against every column. The error $\mathbf{e} = \mathbf{b} - A\hat{\mathbf{x}}$ must be orthogonal to the subspace, which means orthogonal to each spanning direction: $$\mathbf{a}_1\cdot\mathbf{e} = 0,\quad \mathbf{a}_2\cdot\mathbf{e} = 0,\quad\dots,\quad \mathbf{a}_n\cdot\mathbf{e} = 0.$$ Here is the beautiful compression. Stacking these $n$ dot-product equations is exactly what $A^{\mathsf{T}}\mathbf{e}$ computes: the $i$-th entry of $A^{\mathsf{T}}\mathbf{e}$ is the $i$-th row of $A^{\mathsf{T}}$ — that is, the $i$-th column $\mathbf{a}_i$ of $A$ — dotted with $\mathbf{e}$. So all $n$ orthogonality conditions collapse into a single matrix equation: $$A^{\mathsf{T}}\mathbf{e} = \mathbf{0}, \qquad\text{i.e.}\qquad A^{\mathsf{T}}(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0}.$$ Distribute and rearrange — push the $A^{\mathsf{T}}\mathbf{b}$ to one side — and you reach the central equation of the chapter: $$\boxed{\;A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}\;}$$ These are the normal equations. The name is geometric: they assert that the residual is normal (perpendicular) to the column space. Solve them for $\hat{\mathbf{x}}$, and the projection is $\mathbf{p} = A\hat{\mathbf{x}}$.

The Key Insight — Projecting onto a subspace reduces to one linear system, $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$, no matter how big the subspace is. The single requirement "error orthogonal to the subspace" — $n$ separate perpendicularity conditions — is packaged by the transpose into the one matrix equation $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$. The transpose is the machine that turns "perpendicular to every column" into a single, solvable system.

Notice what we have not assumed: nothing about $\mathbf{b}$ being reachable, nothing about the dimension. This works for any $\mathbf{b}$ and any subspace. When $\mathbf{b}$ happens to already lie in $S$, the projection is $\mathbf{b}$ itself and the error is zero (the system $A\mathbf{x}=\mathbf{b}$ had an exact solution all along). When $\mathbf{b}$ sticks out of $S$ — the usual case — the projection is the honest closest approximation.

19.4.1 When is there a unique answer? The full-column-rank condition

To solve $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ for a unique $\hat{\mathbf{x}}$, we need the $n\times n$ matrix $A^{\mathsf{T}}A$ to be invertible. When is it? Here is the precise condition, and it deserves to be stated as a theorem because everything downstream — the projection formula, least squares, regression — silently depends on it.

Theorem (invertibility of $A^{\mathsf{T}}A$). For an $m\times n$ matrix $A$, the matrix $A^{\mathsf{T}}A$ is invertible if and only if $A$ has full column rank — that is, its columns are linearly independent ($\operatorname{rank}(A) = n$).

The reason is a clean two-line argument that also previews the proof style of §19.10. The matrices $A$ and $A^{\mathsf{T}}A$ have the same null space. Indeed, if $A\mathbf{x} = \mathbf{0}$ then certainly $A^{\mathsf{T}}A\mathbf{x} = A^{\mathsf{T}}\mathbf{0} = \mathbf{0}$. Conversely, if $A^{\mathsf{T}}A\mathbf{x} = \mathbf{0}$, multiply on the left by $\mathbf{x}^{\mathsf{T}}$: then $\mathbf{x}^{\mathsf{T}}A^{\mathsf{T}}A\mathbf{x} = 0$, which is $(A\mathbf{x})^{\mathsf{T}}(A\mathbf{x}) = \lVert A\mathbf{x}\rVert^2 = 0$, forcing $A\mathbf{x} = \mathbf{0}$ (only the zero vector has zero length, Chapter 18). So $N(A^{\mathsf{T}}A) = N(A)$. Now, $A^{\mathsf{T}}A$ (an $n\times n$ matrix) is invertible exactly when its null space is trivial, which happens exactly when $N(A) = \{\mathbf{0}\}$ — i.e. exactly when $A$'s columns are independent. $\blacksquare$

Warning

— Every formula in the rest of this chapter that inverts $A^{\mathsf{T}}A$ — including the projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ — requires $A$ to have full column rank. If two columns of $A$ are equal, or one is a combination of the others (collinear features, in regression terms — Chapter 6), then $A^{\mathsf{T}}A$ is singular, the inverse does not exist, and the normal equations have infinitely many solutions $\hat{\mathbf{x}}$. The projection $\mathbf{p}$ is still unique (the closest point never stops existing), but the coefficients are not. This is exactly the multicollinearity that destabilizes regression coefficients in linear regression, and the cure is to build an independent (orthonormal) set of columns — the Gram–Schmidt process of Chapter 20.

19.5 How do you project onto a plane, start to finish? A full worked example

Let's project onto a genuine two-dimensional subspace and confirm every number. Take the $3\times 2$ matrix $$A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix},$$ whose two columns $\mathbf{a}_1 = (1,1,1)$ and $\mathbf{a}_2 = (0,1,2)$ are not parallel, so they span a plane through the origin in $\mathbb{R}^3$. (This is no accident of choice: it is precisely the design matrix for fitting a straight line to the three data points $(0,\cdot),(1,\cdot),(2,\cdot)$ — column of ones plus column of $x$-values — which is why this example doubles as the regression connection in §19.8.) We will project the vector $\mathbf{b} = (6, 0, 0)$, which does not lie on the plane, onto it.

Step 1 — Form $A^{\mathsf{T}}A$ and $A^{\mathsf{T}}\mathbf{b}$. The Gram matrix is $$A^{\mathsf{T}}A = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix},$$ where the $(1,1)$ entry is $\mathbf{a}_1\cdot\mathbf{a}_1 = 3$, the off-diagonals are $\mathbf{a}_1\cdot\mathbf{a}_2 = 3$, and the $(2,2)$ entry is $\mathbf{a}_2\cdot\mathbf{a}_2 = 5$. Its determinant is $3\cdot 5 - 3\cdot 3 = 6 \ne 0$, so $A^{\mathsf{T}}A$ is invertible — confirming the columns are independent. The right-hand side is $$A^{\mathsf{T}}\mathbf{b} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix}\begin{bmatrix} 6 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 6 \\ 0 \end{bmatrix}.$$

Step 2 — Solve the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$. That is $$\begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}\begin{bmatrix} \hat x_1 \\ \hat x_2 \end{bmatrix} = \begin{bmatrix} 6 \\ 0 \end{bmatrix}.$$ Subtracting the first equation from the second gives $2\hat x_2 = -6$, so $\hat x_2 = -3$; back-substituting, $3\hat x_1 + 3(-3) = 6$ gives $\hat x_1 = 5$. So $\hat{\mathbf{x}} = (5, -3)$.

Step 3 — Form the projection and the error. The projection is $$\mathbf{p} = A\hat{\mathbf{x}} = 5\begin{bmatrix} 1\\1\\1 \end{bmatrix} - 3\begin{bmatrix} 0\\1\\2 \end{bmatrix} = \begin{bmatrix} 5 \\ 2 \\ -1 \end{bmatrix},$$ and the error is $\mathbf{e} = \mathbf{b} - \mathbf{p} = (6,0,0) - (5,2,-1) = (1, -2, 1)$. The acid test: is $\mathbf{e}$ perpendicular to both columns? $\mathbf{a}_1\cdot\mathbf{e} = 1 - 2 + 1 = 0$ and $\mathbf{a}_2\cdot\mathbf{e} = 0 - 2 + 2 = 0$. Both zero — the error stands at a right angle to the entire plane, as the geometry promised. The distance from $\mathbf{b}$ to the plane is $\lVert\mathbf{e}\rVert = \sqrt{1 + 4 + 1} = \sqrt{6} \approx 2.449$.

# Project b onto the plane C(A) = span of A's columns.  (Chapter 19)
import numpy as np
A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
b = np.array([6., 0., 0.])
ATA = A.T @ A                    # Gram matrix  [[3,3],[3,5]]
ATb = A.T @ b                    # right-hand side  [6, 0]
x_hat = np.linalg.solve(ATA, ATb)   # solve the normal equations
p = A @ x_hat                    # the projection p = A x_hat
e = b - p                        # the error (residual)
print("x_hat =", x_hat)          # [ 5. -3.]
print("p     =", p)              # [ 5.  2. -1.]
print("e     =", e)              # [ 1. -2.  1.]
print("A^T e =", A.T @ e)        # [0. 0.]   <- error orthogonal to BOTH columns
print("dist  =", np.linalg.norm(e))   # 2.449489...  = sqrt(6)

The printout reads x_hat = [5. -3.], p = [5. 2. -1.], e = [1. -2. 1.], A^T e = [0. 0.], and dist = 2.449.... Every hand computation matches, and the line A^T e = [0. 0.] is the normal equation holding to the letter: the residual is orthogonal to the column space.

Computational Note — We wrote np.linalg.solve(ATA, ATb) rather than np.linalg.inv(ATA) @ ATb. They give the same answer here, but forming an explicit inverse is slower and numerically worse — and for least squares you should not even form $A^{\mathsf{T}}A$ in serious code, because squaring $A$ squares its condition number (Chapter 38) and can wreck precision. The professional tool is np.linalg.lstsq(A, b), which solves the least-squares problem via the QR factorization of Chapter 20 without ever forming $A^{\mathsf{T}}A$. We use the normal equations here because they make the geometry — orthogonality of the error — completely transparent. Right idea for understanding; not always the right algorithm for computing.

Check Your Understanding — In the worked example, what is $\mathbf{p} + \mathbf{e}$, and what does that tell you about how a projection splits a vector?

Answer

$\mathbf{p} + \mathbf{e} = (5,2,-1) + (1,-2,1) = (6,0,0) = \mathbf{b}$ — they sum back to the original vector. Every vector $\mathbf{b}$ splits uniquely into two perpendicular pieces: $\mathbf{p}$, the part lying in the subspace, and $\mathbf{e}$, the part orthogonal to it. This is the orthogonal decomposition $\mathbf{b} = \mathbf{p} + \mathbf{e}$ with $\mathbf{p}\in C(A)$ and $\mathbf{e}\perp C(A)$. The error $\mathbf{e}$ lives in the orthogonal complement of the subspace — the left null space $N(A^{\mathsf{T}})$ from Chapter 14, since $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$.

19.5.1 The orthogonal decomposition: splitting a vector in two

That Check-Your-Understanding result is important enough to state on its own, because it is the structural reason projection works at all. Every vector $\mathbf{b}\in\mathbb{R}^m$ splits into exactly two pieces relative to a subspace $S$: $$\mathbf{b} = \underbrace{\mathbf{p}}_{\in\,S} + \underbrace{\mathbf{e}}_{\perp\,S}, \qquad \mathbf{p}\cdot\mathbf{e} = 0.$$ The first piece $\mathbf{p}$ is the part of $\mathbf{b}$ that lives in $S$ — its shadow. The second piece $\mathbf{e}$ is the part that escapes $S$, pointing in a direction perpendicular to everything in $S$. This is the orthogonal decomposition of $\mathbf{b}$, and it is unique: there is one and only one way to write $\mathbf{b}$ as (something in $S$) plus (something perpendicular to $S$). Uniqueness follows from the closest-point theorem of §19.10 — if there were two such splits, their difference would be a nonzero vector both in $S$ and perpendicular to $S$, hence perpendicular to itself, forcing its length to zero, a contradiction.

This decomposition is the geometric completion of Strang's four-subspaces picture from Chapter 14. The shadow $\mathbf{p}$ sits in the column space $C(A)$; the escape $\mathbf{e}$ sits in its orthogonal complement, the left null space $N(A^{\mathsf{T}})$. Those two subspaces are perpendicular and together account for all of $\mathbb{R}^m$ — every output-space vector decomposes into a $C(A)$-part and an $N(A^{\mathsf{T}})$-part with nothing left over. In our example, $\mathbf{p} = (5,2,-1)$ is the closest reachable target and $\mathbf{e} = (1,-2,1)$ is the unavoidable miss, perpendicular to both columns. You can even read off the Pythagorean accounting: $\lVert\mathbf{b}\rVert^2 = \lVert\mathbf{p}\rVert^2 + \lVert\mathbf{e}\rVert^2$, here $36 = 30 + 6$, the squared length of $\mathbf{b}$ partitioned cleanly between what the subspace captures and what it cannot.

# The orthogonal decomposition b = p + e, and its Pythagorean accounting.
import numpy as np
A = np.array([[1., 0.], [1., 1.], [1., 2.]])
b = np.array([6., 0., 0.])
p = A @ np.linalg.solve(A.T @ A, A.T @ b)    # the shadow (in C(A))
e = b - p                                     # the escape (perpendicular to C(A))
print("p . e =", round(float(p @ e), 6))      # 0.0   -> the two pieces are orthogonal
print("||b||^2, ||p||^2 + ||e||^2 =",
      round(float(b @ b), 4), round(float(p @ p + e @ e), 4))   # 36.0  36.0

The output reads p . e = 0.0 and ||b||^2, ||p||^2 + ||e||^2 = 36.0 36.0: the shadow and the escape are perpendicular, and their squared lengths add up to the squared length of $\mathbf{b}$. That additivity — energy split cleanly between the kept and discarded parts — is exactly the property that makes projection the right tool for separating signal from noise, as the second case study exploits.

19.5.2 A second worked projection: onto a different plane

Practice on a fresh subspace cements the recipe, and varying the example (rather than reusing one toy matrix) is the book's standing rule. Project $\mathbf{b} = (1, 2, 7)$ onto the plane spanned by the columns of $$B = \begin{bmatrix} 1 & 1 \\ 1 & -1 \\ 1 & 0 \end{bmatrix},$$ whose columns $(1,1,1)$ and $(1,-1,0)$ are independent (not parallel), so they span a plane through the origin. The Gram matrix is $B^{\mathsf{T}}B = \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix}$ — and here is a pleasant surprise: it is diagonal, because the two columns happen to be orthogonal to each other ($(1,1,1)\cdot(1,-1,0) = 0$). The right-hand side is $B^{\mathsf{T}}\mathbf{b} = (1+2+7,\ 1-2+0) = (10, -1)$. Solving the (now trivial, diagonal) normal equations gives $\hat x_1 = 10/3$ and $\hat x_2 = -1/2$, so $$\mathbf{p} = \tfrac{10}{3}(1,1,1) - \tfrac{1}{2}(1,-1,0) = \big(\tfrac{17}{6}, \tfrac{23}{6}, \tfrac{10}{3}\big),$$ and the error $\mathbf{e} = \mathbf{b} - \mathbf{p} = \big(-\tfrac{11}{6}, -\tfrac{11}{6}, \tfrac{11}{3}\big)$ should be perpendicular to both columns.

# A second projection, onto a different plane (orthogonal columns -> diagonal Gram).
import numpy as np
B = np.array([[1., 1.], [1., -1.], [1., 0.]])
b = np.array([1., 2., 7.])
print("B^T B =\n", (B.T @ B).astype(int))     # [[3 0] [0 2]]  -> diagonal!
x_hat = np.linalg.solve(B.T @ B, B.T @ b)
p = B @ x_hat
e = b - p
print("x_hat =", np.round(x_hat, 6))          # [ 3.333333 -0.5    ]
print("p     =", np.round(p, 6))              # [2.833333 3.833333 3.333333]
print("B^T e =", np.round(B.T @ e, 6))        # [0. 0.]  -> error orthogonal to the plane

It prints the diagonal B^T B, then x_hat = [3.333333 -0.5], p = [2.833333 3.833333 3.333333] (which is $(\tfrac{17}{6}, \tfrac{23}{6}, \tfrac{10}{3})$), and B^T e = [0. 0.]. Two lessons land at once. First, the recipe is mechanical and identical regardless of the subspace. Second — and this foreshadows §19.9 — when the columns are orthogonal, the Gram matrix is diagonal and the normal equations decouple into independent one-variable equations, no real "solving" required. Orthogonality is already paying off; orthonormality (§19.9) will pay off even more.

19.6 What is the projection matrix P = A(AᵀA)⁻¹Aᵀ?

In §19.3 we found that projecting onto a line is multiplication by a fixed matrix $P$. The same is true for a subspace, and we can write the matrix down in closed form. Watch the projection assemble itself from the normal equations. We have $\hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$ (solving the normal equations, using full column rank so the inverse exists), and the projection is $\mathbf{p} = A\hat{\mathbf{x}}$. Substitute: $$\mathbf{p} = A\hat{\mathbf{x}} = A\,(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\,\mathbf{b}.$$ The whole chain from $\mathbf{b}$ to $\mathbf{p}$ is multiplication by one matrix, sitting in the middle: $$\boxed{\;P = A\,(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\;}$$ so that $\mathbf{p} = P\mathbf{b}$ for every $\mathbf{b}$. This is the projection matrix onto the column space $C(A)$. It is an $m\times m$ matrix — it takes a vector in $\mathbb{R}^m$ and returns its shadow in $C(A)\subseteq\mathbb{R}^m$.

Let us sanity-check that this general formula collapses to the line case. If $A$ has a single column $\mathbf{a}$ (so $n=1$), then $A^{\mathsf{T}}A = \mathbf{a}^{\mathsf{T}}\mathbf{a}$ is a $1\times 1$ scalar, its inverse is $1/(\mathbf{a}^{\mathsf{T}}\mathbf{a})$, and $$P = \mathbf{a}\,\frac{1}{\mathbf{a}^{\mathsf{T}}\mathbf{a}}\,\mathbf{a}^{\mathsf{T}} = \frac{\mathbf{a}\mathbf{a}^{\mathsf{T}}}{\mathbf{a}^{\mathsf{T}}\mathbf{a}},$$ exactly the line formula of §19.3. The general projection matrix contains the line projection as its one-column special case. Good formulas do that — they degrade gracefully to the simple cases you already trust.

For our worked plane, compute $P$ explicitly: $$P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}} = \frac{1}{6}\begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}.$$ Check: $P\mathbf{b} = \frac{1}{6}(5\cdot 6,\ 2\cdot 6,\ -1\cdot 6) = (5, 2, -1) = \mathbf{p}$, the same projection as before. The matrix $P$ does in one multiplication what the three-step recipe did by hand.

# The projection MATRIX onto a subspace:  P = A (A^T A)^{-1} A^T.   (Chapter 19)
import numpy as np
A = np.array([[1., 0.], [1., 1.], [1., 2.]])
P = A @ np.linalg.inv(A.T @ A) @ A.T
print("6 * P =\n", np.round(6 * P, 6))   # [[ 5.  2. -1.]
                                         #  [ 2.  2.  2.]
                                         #  [-1.  2.  5.]]
b = np.array([6., 0., 0.])
print("P @ b =", P @ b)                  # [ 5.  2. -1.]   same projection
print("P @ P close P?", np.allclose(P @ P, P))   # True   <- idempotent
print("P^T == P?      ", np.allclose(P, P.T))    # True   <- symmetric
print("trace(P) =", round(float(np.trace(P)), 6), " rank(P) =", np.linalg.matrix_rank(P))

The output gives 6 * P as the integer matrix above, confirms P @ b = [5. 2. -1.], and reports P @ P close P? True, P^T == P? True, trace(P) = 2.0, rank(P) = 2. Two structural facts jump out and they are not coincidences. First, $P^2 = P$ and $P^{\mathsf{T}} = P$ — the defining algebraic signature we prove next. Second, the trace of $P$ equals the dimension of the subspace ($\operatorname{tr}(P) = 2 = \dim C(A)$); since a projection's eigenvalues are all $0$ or $1$ (§19.3), the trace simply counts the $1$'s, which is the number of independent directions kept. Both facts will recur in PCA and the SVD (Chapters 30–32).

19.7 Why does P² = P and Pᵀ = P? (the algebra of a projector)

The projection matrix has two properties so characteristic that they are often taken as the definition of an orthogonal projection. Both are short to prove and both are deeply geometric. We assume throughout that $A$ has full column rank, so $(A^{\mathsf{T}}A)^{-1}$ exists.

Idempotence: $P^2 = P$. Geometrically obvious before any algebra: projecting a vector onto the subspace lands it in the subspace; projecting an already-in-subspace vector does nothing, because it is already its own closest point. So applying $P$ twice is the same as applying it once. Algebraically, expand $P^2$ and watch a cancellation: $$P^2 = \big[A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\big]\big[A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\big] = A(A^{\mathsf{T}}A)^{-1}\underbrace{(A^{\mathsf{T}}A)(A^{\mathsf{T}}A)^{-1}}_{=\,I}A^{\mathsf{T}} = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}} = P.$$ The inner $(A^{\mathsf{T}}A)(A^{\mathsf{T}}A)^{-1}$ collapses to the identity, and what is left is $P$ again. $\blacksquare$

Symmetry: $P^{\mathsf{T}} = P$. This is the property that makes the projection orthogonal — that the error comes off at a right angle, rather than at some slant. Take the transpose, using the rules $(BC)^{\mathsf{T}} = C^{\mathsf{T}}B^{\mathsf{T}}$ and $(B^{-1})^{\mathsf{T}} = (B^{\mathsf{T}})^{-1}$ from Chapter 8: $$P^{\mathsf{T}} = \big[A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\big]^{\mathsf{T}} = (A^{\mathsf{T}})^{\mathsf{T}}\big[(A^{\mathsf{T}}A)^{-1}\big]^{\mathsf{T}}A^{\mathsf{T}} = A\big[(A^{\mathsf{T}}A)^{\mathsf{T}}\big]^{-1}A^{\mathsf{T}}.$$ Now $(A^{\mathsf{T}}A)^{\mathsf{T}} = A^{\mathsf{T}}(A^{\mathsf{T}})^{\mathsf{T}} = A^{\mathsf{T}}A$ — the Gram matrix is symmetric, so its inverse is too, and the bracket is unchanged. Therefore $P^{\mathsf{T}} = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}} = P$. $\blacksquare$

Warning

— Idempotence alone does not make a matrix an orthogonal projection. The matrix $M = \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix}$ satisfies $M^2 = M$, but $M^{\mathsf{T}} \ne M$ — it is an oblique projection. It does send everything onto the $x$-axis, but along the slanted direction $(1,-1)$ rather than straight down, so the "error" it produces is not perpendicular to the $x$-axis and the result is not the closest point. The symmetry condition $P^{\mathsf{T}} = P$ is exactly the extra requirement that forces the projection to be orthogonal — to drop a true perpendicular. Idempotent = projection of some kind; idempotent and symmetric = orthogonal projection.

Common Pitfall — Trying to invert the projection matrix. A projection $P$ (onto anything smaller than the whole space) is singular — $\det(P) = 0$ — so $P^{-1}$ does not exist. This is forced by idempotence: if $P$ were invertible, then $P^2 = P$ would give $P = P^{-1}P^2 = P^{-1}P\cdot P = P$... more cleanly, multiply $P^2 = P$ by $P^{-1}$ to get $P = I$, so the only invertible projection is the identity (which "projects" onto the whole space and changes nothing). A projection throws information away — the entire orthogonal complement collapses to zero — and thrown-away information cannot be recovered, which is precisely what non-invertibility means. You can project; you cannot un-project.

Math-Major Sidebar. The full converse holds and is worth knowing: a real matrix $P$ is the orthogonal projection onto some subspace if and only if $P^2 = P$ and $P^{\mathsf{T}} = P$. The subspace it projects onto is its column space $C(P)$, and the eigenvalues are all $0$ or $1$ (the $1$-eigenspace is $C(P)$, the $0$-eigenspace is $N(P) = C(P)^{\perp}$). One direction is what we proved; the converse is a clean exercise once you have the spectral theorem of Chapter 27, since a symmetric idempotent is orthogonally diagonalizable with $\{0,1\}$ on the diagonal. This characterization is the right definition of "orthogonal projection" in an abstract inner-product space (Chapter 34), where there may be no matrix $A$ to build $P$ from in the first place.

19.8 Why is least squares exactly an orthogonal projection?

Now we collect the payoff promised since Chapter 17. The least-squares problem is: given a tall, overdetermined system $A\mathbf{x} = \mathbf{b}$ that has no exact solution (because $\mathbf{b}$ lies off the column space $C(A)$, the typical situation for more equations than unknowns — Chapter 13), find the $\hat{\mathbf{x}}$ that comes closest, minimizing the squared error $\lVert A\mathbf{x} - \mathbf{b}\rVert^2$. In Chapter 17 we asserted that the answer solves the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$. Now we see why, and the reason is the entire chapter in one sentence.

Here is the logic, laid bare. As $\mathbf{x}$ ranges over all of $\mathbb{R}^n$, the vector $A\mathbf{x}$ ranges over all of the column space $C(A)$ — that is the definition of the column space (Chapter 13). So minimizing $\lVert A\mathbf{x} - \mathbf{b}\rVert$ over all $\mathbf{x}$ is identical to asking: which point $\mathbf{p} = A\mathbf{x}$ of the subspace $C(A)$ is closest to $\mathbf{b}$? And that question we have now answered completely: the closest point is the orthogonal projection of $\mathbf{b}$ onto $C(A)$, characterized by the error being perpendicular to $C(A)$, which is the normal equation $A^{\mathsf{T}}(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0}$. Least squares is orthogonal projection. The "least-squares solution" $\hat{\mathbf{x}}$ is just the coordinates, in the columns of $A$, of the projection $\mathbf{p}$.

The Key Insight — Regression does not approximate a projection or resemble one; it is one. Fitting a model by least squares = projecting the data vector $\mathbf{b}$ onto the column space of the design matrix $A$. The fitted values $\hat{\mathbf{b}} = A\hat{\mathbf{x}} = P\mathbf{b}$ are literally the projection $\mathbf{p}$; the residuals $\mathbf{e} = \mathbf{b} - \hat{\mathbf{b}}$ are literally the orthogonal error. "The residuals are uncorrelated with the predictors" — the statistician's mantra — is just $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$, the error being perpendicular to every column. The geometry of Chapter 18 and the statistics of regression are the same picture.

This is why the design matrix in our §19.5 example was $A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}$ — a column of ones (for the intercept) and a column of $x$-values $0,1,2$. Fitting a line $y = c_0 + c_1 x$ to data points means choosing $(c_0, c_1)$ so that the predictions $A\,(c_0,c_1)^{\mathsf{T}}$ land as close as possible to the observed responses $\mathbf{b}$. The projection finds that closest landing, and $\hat{\mathbf{x}} = (\hat c_0, \hat c_1)$ are the fitted intercept and slope. The matrix $P$ even has a name in statistics — the hat matrix — because $\hat{\mathbf{b}} = P\mathbf{b}$: it "puts the hat on $\mathbf{b}$," turning observations into fitted values. The first case study works a full sensor-calibration regression this way, watching the projection in action on real-feeling data.

Real-World Application — Computer graphics: shadows and contact. When a rendering engine drops a shadow of an object onto a flat floor under a directional light, it is applying a projection matrix to every vertex. A light pointing straight down gives the orthogonal projection onto the floor plane (our $P$); a light at an angle gives the oblique projection of §19.7's warning — same flattening, slanted along the light rays. The same projection logic snaps a dragged object to the nearest point on a surface (the closest-point property), and the orthogonal decomposition $\mathbf{b} = \mathbf{p} + \mathbf{e}$ separates a velocity into a component along a wall (which survives a collision) and a component into it (which is cancelled). Projection is how virtual objects make contact with virtual surfaces.

Real-World Application — Economics: factor models and explained variance. An asset's return over time is a vector $\mathbf{b}$; a handful of risk factors (the market return, a size factor, a value factor) are columns of a design matrix $A$. A factor model regresses the asset on the factors — that is, it projects $\mathbf{b}$ onto $C(A)$. The projection $\mathbf{p} = P\mathbf{b}$ is the part of the return explained by the common factors (the systematic risk), and the error $\mathbf{e}$ is the idiosyncratic return, orthogonal to the factors — the part specific to the company. The squared-length split $\lVert\mathbf{b}\rVert^2 = \lVert\mathbf{p}\rVert^2 + \lVert\mathbf{e}\rVert^2$ from §19.5.1 is exactly the decomposition of variance into systematic and idiosyncratic components, and the famous $R^2$ ("fraction of variance explained") is the ratio $\lVert\mathbf{p}\rVert^2/\lVert\mathbf{b}\rVert^2$ once the data is centered. Diversification, in this language, is the observation that idiosyncratic errors are nearly orthogonal across assets and so largely cancel in a portfolio — a projection statement at heart.

19.9 How does projecting onto an orthonormal basis make everything simpler?

The projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ carries an awkward passenger: that inverse $(A^{\mathsf{T}}A)^{-1}$. Inverting a matrix is work, and (as the Computational Note warned) it is numerically delicate. But watch what happens to the entire formula when the columns of $A$ are not merely independent but orthonormal — mutually perpendicular and each of unit length. This is the observation that motivates the whole next chapter.

Suppose the spanning vectors are an orthonormal set $\mathbf{q}_1, \dots, \mathbf{q}_n$, and assemble them as the columns of a matrix $Q$. Orthonormality says exactly that $$\mathbf{q}_i\cdot\mathbf{q}_j = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases},$$ which is the statement that the Gram matrix is the identity: $Q^{\mathsf{T}}Q = I$. (Entry $(i,j)$ of $Q^{\mathsf{T}}Q$ is $\mathbf{q}_i\cdot\mathbf{q}_j$.) The troublesome inverse $(Q^{\mathsf{T}}Q)^{-1} = I^{-1} = I$ vanishes, and the projection matrix collapses to $$P = Q(Q^{\mathsf{T}}Q)^{-1}Q^{\mathsf{T}} = QQ^{\mathsf{T}}.$$ No inverse, no system to solve — just two matrix multiplications. Even better, the projection of a vector $\mathbf{b}$ has a transparent component-by-component form. Writing it out, $\mathbf{p} = QQ^{\mathsf{T}}\mathbf{b}$ becomes $$\boxed{\;\mathbf{p} = (\mathbf{q}_1\cdot\mathbf{b})\,\mathbf{q}_1 + (\mathbf{q}_2\cdot\mathbf{b})\,\mathbf{q}_2 + \dots + (\mathbf{q}_n\cdot\mathbf{b})\,\mathbf{q}_n.\;}$$ This is the formula to remember. To project onto an orthonormal basis, project onto each basis vector separately and add the results. Each term $(\mathbf{q}_i\cdot\mathbf{b})\mathbf{q}_i$ is precisely the line-projection of §19.2 onto $\mathbf{q}_i$ (with denominator $1$ because $\mathbf{q}_i$ is a unit vector). The orthogonality is what lets the pieces not interfere: because the directions are mutually perpendicular, the amount of $\mathbf{b}$ along $\mathbf{q}_1$ is computed without any reference to $\mathbf{q}_2$, and the separate one-dimensional projections simply sum.

The Key Insight — Orthonormality turns a coupled $n$-dimensional projection into $n$ independent one-dimensional projections. With a general basis, the columns "lean on" each other and you must solve $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ to untangle them; with an orthonormal basis, the coefficients are just the dot products $\mathbf{q}_i\cdot\mathbf{b}$, read off directly. This is the reason orthonormal bases are worth manufacturing — and Chapter 20's Gram–Schmidt process is the manufacturing.

Let's confirm this on the same plane as §19.5, now equipped with an orthonormal basis for it. One orthonormal basis is $\mathbf{q}_1 = \tfrac{1}{\sqrt 3}(1,1,1)$ and $\mathbf{q}_2 = \tfrac{1}{\sqrt 2}(-1,0,1)$ (you can check both are unit length, perpendicular to each other, and lie in the plane). Projecting $\mathbf{b} = (6,0,0)$: $$\mathbf{q}_1\cdot\mathbf{b} = \tfrac{6}{\sqrt 3} = 2\sqrt 3 \approx 3.464, \qquad \mathbf{q}_2\cdot\mathbf{b} = \tfrac{-6}{\sqrt 2} = -3\sqrt 2 \approx -4.243,$$ and $\mathbf{p} = (2\sqrt 3)\mathbf{q}_1 + (-3\sqrt 2)\mathbf{q}_2$, which works out to $(5, 2, -1)$ — exactly the same projection as before, as it must be (the closest point in a subspace does not depend on which basis you describe the subspace with).

# Projecting onto an ORTHONORMAL basis: just dot, scale, add.  (Chapter 19 -> 20)
import numpy as np
q1 = np.array([1., 1., 1.]) / np.sqrt(3)     # unit vector along (1,1,1)
q2 = np.array([-1., 0., 1.]) / np.sqrt(2)    # unit, perpendicular to q1, in the plane
b  = np.array([6., 0., 0.])
print("q1.q1, q2.q2, q1.q2 =", round(q1@q1, 6), round(q2@q2, 6), round(q1@q2, 6))  # 1 1 0
c1, c2 = q1 @ b, q2 @ b
print("coefficients:", round(c1, 6), round(c2, 6))   # 3.464102  -4.242641
p = c1 * q1 + c2 * q2
print("p =", np.round(p, 6))                 # [ 5.  2. -1.]   same point as in 19.5
Q = np.column_stack([q1, q2])
print("Q^T Q =\n", np.round(Q.T @ Q, 6))     # identity  -> no inverse needed
print("Q Q^T @ b =", np.round(Q @ Q.T @ b, 6))   # [ 5.  2. -1.]   P = Q Q^T

The output confirms q1.q1, q2.q2, q1.q2 = 1.0 1.0 0.0 (orthonormal), coefficients 3.464102 -4.242641, the projection p = [5. 2. -1.], that Q^T Q is the identity, and that Q Q^T @ b gives the same [5. 2. -1.]. Same plane, same closest point, but the computation needed no inverse — only dot products. That simplification is the destination of Chapter 20: given any basis, Gram–Schmidt manufactures an orthonormal one, after which projection (and least squares, and much else) becomes effortless.

Build Your Toolkit — Implement two functions in toolkit/projection.py, pure Python (numpy only to verify). First, project_onto(b, A) returns the orthogonal projection of vector b onto the column space of A: form the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ (reuse matmul/transpose from toolkit/matrices.py), solve them with your gaussian_elimination from Chapter 4, and return $A\hat{\mathbf{x}}$. Second, projection_matrix(A) builds $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ explicitly (get each column of $(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ by solving $A^{\mathsf{T}}A\,\mathbf{x} = (\text{a column of }A^{\mathsf{T}})$). Verify that project_onto([6,0,0], [[1,0],[1,1],[1,2]]) returns [5, 2, -1], that your P matches A @ np.linalg.inv(A.T @ A) @ A.T, and — the chapter's signature checks — that P @ P == P (idempotent) and P.T == P (symmetric). State the precondition in your docstring: $A$ must have full column rank, or $A^{\mathsf{T}}A$ is singular and the solve fails.

19.10 The closest-point theorem: a proof that the projection is genuinely nearest

We have leaned on a claim from the first page: that the orthogonal projection $\mathbf{p}$ — the point whose error is perpendicular to the subspace — is the unique closest point of the subspace to $\mathbf{b}$. It is intuitively obvious from the shadow picture, but intuition is not proof, and this theorem is the foundation under all of least squares. So we prove it, in the book's four-part shape.

1. Why we care. Everything in this chapter — and the rigor of regression, signal denoising, and approximation theory — rests on the equivalence of two descriptions of $\mathbf{p}$: the algebraic one ("the error is orthogonal to the subspace," giving the normal equations) and the geometric one ("$\mathbf{p}$ minimizes the distance to $\mathbf{b}$"). We have been using "orthogonal error" to compute and "closest point" to interpret. The theorem certifies they are the same point, so that solving the normal equations really does deliver the nearest approximation, and uniquely.

2. Key idea. Compare $\mathbf{p}$ to any other point $\mathbf{y}$ of the subspace. The difference $\mathbf{p} - \mathbf{y}$ lies in the subspace, while the error $\mathbf{e} = \mathbf{b} - \mathbf{p}$ is perpendicular to it — so these two vectors are orthogonal, and the Pythagorean theorem (Chapter 18) splits the distance from $\mathbf{y}$ to $\mathbf{b}$ into the distance from $\mathbf{p}$ to $\mathbf{b}$ plus a strictly positive surplus whenever $\mathbf{y}\ne\mathbf{p}$.

3. Proof. Let $S$ be the subspace and let $\mathbf{p}\in S$ be the orthogonal projection of $\mathbf{b}$, meaning the error $\mathbf{e} = \mathbf{b} - \mathbf{p}$ is orthogonal to every vector of $S$ (this is the normal-equation condition $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$, restated coordinate-free). Take any other point $\mathbf{y}\in S$. We compare the squared distances $\lVert\mathbf{b}-\mathbf{y}\rVert^2$ and $\lVert\mathbf{b}-\mathbf{p}\rVert^2$. Write the first by inserting and removing $\mathbf{p}$: $$\mathbf{b} - \mathbf{y} = (\mathbf{b} - \mathbf{p}) + (\mathbf{p} - \mathbf{y}) = \mathbf{e} + (\mathbf{p}-\mathbf{y}).$$ Now the crucial observation: $\mathbf{p}-\mathbf{y}$ lies in $S$ (it is a difference of two vectors of the subspace, and $S$ is closed under subtraction — it is a subspace, Chapter 6), while $\mathbf{e}$ is orthogonal to everything in $S$. Hence $\mathbf{e}\perp(\mathbf{p}-\mathbf{y})$, and their dot product is zero. Take squared norms and expand using $\lVert\mathbf{u}+\mathbf{v}\rVert^2 = \lVert\mathbf{u}\rVert^2 + 2\,\mathbf{u}\cdot\mathbf{v} + \lVert\mathbf{v}\rVert^2$ (Chapter 18): $$\lVert\mathbf{b}-\mathbf{y}\rVert^2 = \lVert\mathbf{e}\rVert^2 + 2\underbrace{\,\mathbf{e}\cdot(\mathbf{p}-\mathbf{y})\,}_{=\,0} + \lVert\mathbf{p}-\mathbf{y}\rVert^2 = \lVert\mathbf{e}\rVert^2 + \lVert\mathbf{p}-\mathbf{y}\rVert^2.$$ The cross term dies because of orthogonality — this is the Pythagorean theorem in action, with $\mathbf{e}$ and $\mathbf{p}-\mathbf{y}$ as the two legs. Since $\lVert\mathbf{e}\rVert^2 = \lVert\mathbf{b}-\mathbf{p}\rVert^2$, we have shown $$\lVert\mathbf{b}-\mathbf{y}\rVert^2 = \lVert\mathbf{b}-\mathbf{p}\rVert^2 + \lVert\mathbf{p}-\mathbf{y}\rVert^2 \;\ge\; \lVert\mathbf{b}-\mathbf{p}\rVert^2,$$ with equality if and only if $\lVert\mathbf{p}-\mathbf{y}\rVert^2 = 0$, i.e. $\mathbf{y} = \mathbf{p}$. So every other point of $S$ is strictly farther from $\mathbf{b}$ than $\mathbf{p}$ is, and $\mathbf{p}$ is the unique closest point. $\blacksquare$

4. What this means. The proof is the shadow picture made rigorous, and the engine is purely the Pythagorean theorem: the distance from any subspace point $\mathbf{y}$ to $\mathbf{b}$ is the hypotenuse of a right triangle whose legs are the fixed error $\mathbf{e}$ (from $\mathbf{p}$ straight up to $\mathbf{b}$) and the within-subspace detour $\mathbf{p}-\mathbf{y}$. You cannot shorten the hypotenuse below the leg $\lVert\mathbf{e}\rVert$; the best you can do is drive the detour to zero by choosing $\mathbf{y} = \mathbf{p}$. That is why "orthogonal error" forces "closest point": the perpendicular leg is the irreducible part of the distance, and any wandering within the subspace only adds to it. The theorem also hands us a bonus — the minimum distance from $\mathbf{b}$ to the subspace is exactly $\lVert\mathbf{e}\rVert$, the length of the error — which is what least squares is minimizing.

Geometric Intuition — The right triangle is the whole story. Pin one leg as the perpendicular drop $\mathbf{e}$ from the subspace up to $\mathbf{b}$; that leg's length is fixed the moment you fix $\mathbf{b}$ and the subspace. The other leg is your freedom to slide around within the subspace, from $\mathbf{p}$ over to some other $\mathbf{y}$. The distance to $\mathbf{b}$ is always the hypotenuse, and a hypotenuse is never shorter than either leg — so you cannot beat $\lVert\mathbf{e}\rVert$, and you tie it only by not sliding at all. Closest means "stop directly below the target."

# Empirical check of the closest-point theorem: random subspace points are FARTHER.
import numpy as np
A = np.array([[1., 0.], [1., 1.], [1., 2.]])
b = np.array([6., 0., 0.])
P = A @ np.linalg.inv(A.T @ A) @ A.T
p = P @ b
d_proj = np.linalg.norm(b - p)               # distance to the projection
print("distance to projection p =", round(d_proj, 6))   # 2.449490  = sqrt(6)
worse = []
for _ in range(5):                           # five random OTHER points of C(A)
    x = np.random.randn(2)
    y = A @ x                                # y = A x is some point in the subspace
    worse.append(np.linalg.norm(b - y) >= d_proj)
print("every random subspace point at least as far?", all(worse))   # True

It prints distance to projection p = 2.449490 and every random subspace point at least as far? True: across random points of the plane, none beats the projection — the closest-point theorem, witnessed numerically.

Real-World Application — Data science: the best low-rank approximation. The same minimize-the-orthogonal-distance principle, scaled up, is the engine of dimensionality reduction. Principal Component Analysis (Chapter 32) finds the subspace onto which projecting your data loses the least squared distance — it is a closest-point problem where the unknown is the subspace itself. The Eckart–Young theorem (Chapter 31) says the best rank-$k$ approximation of a matrix is its projection onto the top $k$ singular directions, again minimizing orthogonal error. Every time someone "compresses to the most informative dimensions," they are invoking the closest-point theorem you just proved, one projection at a time.

19.11 What does the projection look like for the whole-space and trivial cases?

Two boundary cases sharpen the picture and guard against errors. They are the projection-matrix analogues of "multiply by $1$" and "multiply by $0$."

Projecting onto the whole space. If the subspace $S$ is all of $\mathbb{R}^m$ — for instance when $A$ is a square, invertible $m\times m$ matrix, so its columns span everything (Chapter 9) — then every vector is already in $S$, nothing sticks out, and the projection of $\mathbf{b}$ is $\mathbf{b}$ itself. The formula agrees: with $A$ square and invertible, $$P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}} = A\,A^{-1}(A^{\mathsf{T}})^{-1}A^{\mathsf{T}} = I,$$ using $(A^{\mathsf{T}}A)^{-1} = A^{-1}(A^{\mathsf{T}})^{-1}$ (the inverse of a product reverses, Chapter 9). The projection matrix is the identity; the error is always zero. Projecting onto everything changes nothing — eigenvalues all $1$.

Projecting onto the zero subspace. At the other extreme, projecting onto $S = \{\mathbf{0}\}$ sends every vector to $\mathbf{0}$ (the only available point), so $P = 0$ (the zero matrix) and the error equals $\mathbf{b}$ in full. Eigenvalues all $0$: everything is discarded. Every genuine projection lives between these poles, keeping a $\operatorname{rank}(P)$-dimensional slice (eigenvalue $1$) and discarding the rest (eigenvalue $0$), which is why $\operatorname{tr}(P) = \dim S$ as we saw in §19.6.

The complementary projector $I - P$. If $P$ projects onto $S$, then $I - P$ projects onto the orthogonal complement $S^{\perp}$ — the subspace of all vectors perpendicular to $S$. Indeed $(I-P)\mathbf{b} = \mathbf{b} - \mathbf{p} = \mathbf{e}$, which is exactly the error, and the error lives in $S^{\perp}$. You can check $I - P$ is itself an orthogonal projection: it is symmetric ($(I-P)^{\mathsf{T}} = I - P^{\mathsf{T}} = I - P$) and idempotent ($(I-P)^2 = I - 2P + P^2 = I - 2P + P = I - P$). So $P$ and $I-P$ are a matched pair that split every vector into its in-$S$ shadow and its perpendicular error: $\mathbf{b} = P\mathbf{b} + (I-P)\mathbf{b}$. This $I - P$ is the workhorse of the denoising case study, where the point is to throw away a known component by projecting onto its complement.

Common Pitfall — Thinking $I - P$ "undoes" $P$. It does not — $P$ is not invertible, so nothing undoes it. Instead, $I-P$ is the complementary projection: $P$ keeps the part of $\mathbf{b}$ in $S$, and $I-P$ keeps the part of $\mathbf{b}$ in $S^{\perp}$, and these are different, perpendicular pieces. Together they reconstruct $\mathbf{b}$ ($P + (I-P) = I$), but neither recovers what the other discarded. They partition $\mathbf{b}$; they do not reverse each other. (Check: $P(I-P) = P - P^2 = P - P = 0$ — applying both in succession annihilates everything, the hallmark of complementary projectors, not inverses.)

19.11.1 How do projection and the four fundamental subspaces fit together?

Projection is where the four-subspaces picture of Part III finally snaps into a single geometric diagram, and it is worth drawing the connection explicitly because it reorganizes everything you have learned. Recall the four subspaces of an $m\times n$ matrix $A$ (Chapter 14): the column space $C(A)$ and left null space $N(A^{\mathsf{T}})$ live in the output space $\mathbb{R}^m$, while the row space $C(A^{\mathsf{T}})$ and null space $N(A)$ live in the input space $\mathbb{R}^n$. Chapter 18 added the missing ingredient — orthogonality — and the punchline is that the subspaces pair up into orthogonal complements: in the output space, $C(A)$ and $N(A^{\mathsf{T}})$ are perpendicular and together fill $\mathbb{R}^m$; in the input space, $C(A^{\mathsf{T}})$ and $N(A)$ are perpendicular and together fill $\mathbb{R}^n$.

Orthogonal projection is what operationalizes those pairings. When we project $\mathbf{b}\in\mathbb{R}^m$ onto $C(A)$, the projection matrix $P$ does the splitting: $P$ maps $\mathbf{b}$ to its $C(A)$-component, and $I-P$ maps it to its $N(A^{\mathsf{T}})$-component. The reason the error satisfies $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$ — the normal equation — is precisely that $\mathbf{e}$ lands in the left null space, the orthogonal complement of the column space. So the cryptic-sounding "left null space" from Chapter 14 turns out to be the concrete home of every regression residual you will ever compute. The orthogonality of the four subspaces is not an abstract curiosity; it is the guarantee that the decomposition $\mathbf{b} = \mathbf{p} + \mathbf{e}$ exists and is unique.

Geometric Intuition — Think of $\mathbb{R}^m$ as cleanly partitioned by $A$ into two perpendicular rooms: the column space (everything $A$ can reach) and the left null space (everything $A$ misses). Any vector $\mathbf{b}$ has a unique address in this two-room building — so much of it in the reachable room ($\mathbf{p}$), the rest in the unreachable room ($\mathbf{e}$). Projection is the act of reading off that address. The same building exists in the input space $\mathbb{R}^n$, partitioned into the row space and the null space, which is why the SVD of Chapter 30 can describe a transformation entirely by how it maps one building's rooms to the other's. The four subspaces are not four separate facts; they are two orthogonal splittings, and projection is how you compute them.

Check Your Understanding — A matrix $A$ is $5\times 2$ with independent columns. You project a vector $\mathbf{b}\in\mathbb{R}^5$ onto $C(A)$. What are the dimensions of the subspace the projection $\mathbf{p}$ lives in and the subspace the error $\mathbf{e}$ lives in, and what is $\operatorname{tr}(P)$?

Answer

Since the two columns are independent, $\dim C(A) = 2$, so $\mathbf{p}$ lives in a 2-dimensional column space inside $\mathbb{R}^5$. The error lives in the orthogonal complement $N(A^{\mathsf{T}})$, of dimension $5 - 2 = 3$. The trace of the projection matrix equals the dimension of the subspace it projects onto (§19.6), so $\operatorname{tr}(P) = 2$. The two subspaces' dimensions sum to $2 + 3 = 5 = m$, confirming they fill $\mathbb{R}^5$ — the orthogonal-complement pairing in action.

19.11.2 Removing a known component: projection as subtraction

So far we have used projection to keep the part of a vector inside a subspace. But the complementary projector $I - P$ flips the goal: it lets you remove a known component cleanly, and that turns out to be one of the most practical uses of everything in this chapter. The picture is the same right angle as always, read from the other side. If $\mathbf{p} = P\mathbf{b}$ is the shadow of $\mathbf{b}$ on a subspace $S$, then $\mathbf{e} = (I-P)\mathbf{b}$ is what is left after that shadow is subtracted away — the part of $\mathbf{b}$ that points in no direction of $S$ at all. To strip a component out of a vector, project it out.

Here is the canonical instance, and it is small enough to follow by hand. Suppose you record a signal at four equally spaced moments and read off $\mathbf{b} = (3, 5, 4, 8)$, but you suspect a constant baseline offset — a sensor drift, a DC bias, a steady background — has been added to whatever true, fluctuating signal you care about. The offending component lives along the constant direction $\mathbf{a} = (1,1,1,1)$: a pure baseline is the same number in every slot, hence a multiple of $\mathbf{a}$. So the subspace to remove is the line through $\mathbf{a}$, and the tool is its complementary projector $I - P$.

Build $P$ with the line formula of §19.3. The denominator is $\mathbf{a}\cdot\mathbf{a} = 4$ (four ones) and the outer product $\mathbf{a}\mathbf{a}^{\mathsf{T}}$ is the $4\times 4$ all-ones matrix, so $$P = \frac{\mathbf{a}\mathbf{a}^{\mathsf{T}}}{\mathbf{a}\cdot\mathbf{a}} = \frac{1}{4}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}.$$ Watch what this $P$ does: multiplying any vector by $\frac{1}{4}\mathbf{a}\mathbf{a}^{\mathsf{T}}$ replaces every entry with the average of all four entries. The projection onto the constant line is precisely the operation "take the mean and broadcast it." For our signal, $\mathbf{a}\cdot\mathbf{b} = 3+5+4+8 = 20$, so $\hat c = 20/4 = 5$ and the captured baseline is $\mathbf{p} = 5\,\mathbf{a} = (5,5,5,5)$ — the mean value, sitting in every coordinate. That is no coincidence; it is the geometric meaning of an average. The mean of a list of numbers is the orthogonal projection of that list onto the constant direction, the single number closest (in squared distance) to all of them at once.

Now apply the complementary projector to remove that baseline: $$\mathbf{e} = (I - P)\mathbf{b} = \mathbf{b} - \mathbf{p} = (3,5,4,8) - (5,5,5,5) = (-2,\,0,\,-1,\,3).$$ This $\mathbf{e}$ is the de-drifted signal — the genuine fluctuation with the constant background subtracted out. It carries two signatures worth noticing. First, its entries sum to zero ($-2+0-1+3 = 0$): subtracting the mean always centers a signal, and "summing to zero" is exactly the statement $\mathbf{a}\cdot\mathbf{e} = 0$, the orthogonality that says $\mathbf{e}$ has no constant part left. Second, the squared lengths split by Pythagoras exactly as §19.5.1 promised: $\lVert\mathbf{b}\rVert^2 = 114$ partitions into $\lVert\mathbf{p}\rVert^2 = 100$ (the energy in the baseline) plus $\lVert\mathbf{e}\rVert^2 = 14$ (the energy in the fluctuation). The projection cleanly accounts for where every bit of the signal's size went.

# Remove a constant baseline (DC drift) by projecting it OUT with I - P.  (Chapter 19)
import numpy as np
a = np.array([1., 1., 1., 1.])           # the known component: the constant direction
b = np.array([3., 5., 4., 8.])           # measured signal = true fluctuation + baseline
P = np.outer(a, a) / (a @ a)             # projection onto the constant line; entries all 0.25
p = P @ b                                # the baseline captured = the mean, broadcast
e = (np.eye(4) - P) @ b                  # I - P removes it: the centered (de-drifted) signal
print("p (baseline) =", p)               # [5. 5. 5. 5.]   = mean(b) in every slot
print("e (centered) =", e)               # [-2.  0. -1.  3.]
print("sum(e), a.e  =", round(e.sum(), 6), round(a @ e, 6))   # 0.0 0.0  -> no constant left
print("||b||^2 = ||p||^2 + ||e||^2:",
      round(b @ b, 1), "=", round(p @ p + e @ e, 1))          # 114.0 = 114.0

The output prints p = [5. 5. 5. 5.], e = [-2. 0. -1. 3.], confirms sum(e), a.e = 0.0 0.0, and reports the energy split 114.0 = 114.0. Every hand number checks. The same move scales without changing one idea: to remove a known waveform rather than a constant — a 60 Hz hum from a recording, a market-wide trend from a stock's return, a slow temperature ramp from an instrument reading — you make that waveform (or a few of them) the columns of $A$, build $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$, and keep the residual $(I-P)\mathbf{b}$. You are projecting onto the complement of the nuisance subspace, discarding exactly the part you can explain and retaining the part you cannot.

Real-World Application — Signals and statistics: detrending and the geometry of the mean. "Center your data before computing covariance" (Chapter 32) and "remove the trend before analyzing residuals" are the same operation: project onto a nuisance subspace (the constant direction, or a low-order polynomial in time) and keep $(I-P)\mathbf{b}$. The reason centering is mandatory for PCA is geometric — the principal directions should describe how the data varies, not where its mean happens to sit, so you first project the mean out. Even the humble sample mean is a projection: it is the closest constant to a set of measurements, the foot of the perpendicular onto the all-ones line. Whenever an analyst "subtracts off" a baseline, a seasonal cycle, or a market factor, they are computing $(I-P)\mathbf{b}$ — using the closest-point theorem to define precisely what "the part we cannot explain" means.

19.12 A wider view: where projection sends you next

Step back and see how far one childhood instinct has carried us. "The shortest path to a line is a perpendicular" became, in order: a formula for projecting onto a line; a projection matrix $\mathbf{a}\mathbf{a}^{\mathsf{T}}/(\mathbf{a}^{\mathsf{T}}\mathbf{a})$; the same idea for any subspace via the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$; the closed-form projector $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ with its signature properties $P^2 = P$ and $P^{\mathsf{T}} = P$; a rigorous proof that the projection is the unique closest point; the realization that least squares is this projection; and the orthonormal-basis shortcut $\mathbf{p} = \sum(\mathbf{q}_i\cdot\mathbf{b})\mathbf{q}_i$ that begs for the next chapter. That is the book's recurring theme made vivid: geometry and algebra are two views of one object, and the most applied corner of mathematics grows from a single picture.

The chapter also quietly completed the four-subspaces story of Part III. The projection lives in the column space $C(A)$; the error lives in the left null space $N(A^{\mathsf{T}})$ (that is what $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$ says); and these two subspaces are orthogonal complements that together fill $\mathbb{R}^m$ — which is why every $\mathbf{b}$ splits uniquely as $\mathbf{p} + \mathbf{e}$. Orthogonality, introduced in Chapter 18 as an angle being ninety degrees, turns out to be the structural glue between the fundamental subspaces.

Three roads lead out. Chapter 20 answers the plea of §19.9: given any basis, the Gram–Schmidt process manufactures an orthonormal one, and packaging its output yields the QR factorization $A = QR$ — the numerically sound way to compute every projection and least-squares fit in this chapter, sidestepping the fragile $A^{\mathsf{T}}A$. Chapter 22 makes the boldest leap: treat functions as vectors and integrals as dot products, and a Fourier coefficient becomes nothing but a projection onto an orthogonal basis of sines and cosines — the same formula $\mathbf{q}_i\cdot\mathbf{b}$, now in infinite dimensions. And Chapters 30–32 scale projection into the singular value decomposition and PCA, where you project onto the subspace that best fits your data. The right angle you have just made exact is about to organize regression, signals, and the geometry of data itself.