Chapter 17 Quiz — Linear Regression as Projection
Twelve quick checks on least squares, the normal equations, and the projection picture. Try each before opening the answer. Most are conceptual; a couple need a little arithmetic. Notation is locked: $A^{\mathsf{T}}A$, $\lVert\cdot\rVert$, $C(A)$, $\hat{\mathbf{x}}$, residual $\mathbf{r}$.
Q1. In one sentence, what is the single geometric idea behind linear regression?
Answer
The least-squares fit is the **orthogonal projection of the data vector $\mathbf{b}$ onto the column space $C(A)$**, i.e. the closest point in $C(A)$ to $\mathbf{b}$, so the residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ is perpendicular to $C(A)$.
Q2. Why does an overdetermined system $A\mathbf{x} = \mathbf{b}$ (tall $A$, $m > n$) usually have no exact solution?
Answer
A solution exists iff $\mathbf{b} \in C(A)$. For a tall matrix, $C(A)$ has dimension at most $n$ but sits inside the much larger $\mathbb{R}^m$, so it is a *thin* flat in a *fat* space. A generic data vector $\mathbf{b}$, especially noisy real data, does not lie on that thin flat, so no exact solution exists. We project instead.
Q3. Write the normal equations and state the condition for a unique solution.
Answer
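The answer that follows can be checked numerically. Here is a minimal NumPy sketch, assuming five made-up data points (hypothetical, not taken from the chapter): it tests for full column rank, solves the normal equations with `np.linalg.solve`, and compares the result with `np.linalg.lstsq`, which works on $A$ directly.

```python
import numpy as np

# Hypothetical data for illustration only (an assumption, not the chapter's data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Design matrix for a line: a column of ones and a column of x-values.
A = np.column_stack([np.ones_like(x), x])

# Full column rank <=> A^T A invertible <=> unique least-squares solution.
has_full_column_rank = np.linalg.matrix_rank(A) == A.shape[1]

# Solve the normal equations A^T A x_hat = A^T b (acceptable for a tiny example).
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from a solver that never forms A^T A.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

print(has_full_column_rank)
print(x_hat)   # intercept and slope, approximately 0.6 and 0.8
```
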
$A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$. The solution is unique iff $A$ has **full column rank** (linearly independent columns), which is exactly when $A^{\mathsf{T}}A$ is invertible. Then $\hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$ (a formula for understanding, not for computing; see Q11).
Q4. Where does the equation $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ come from? (Not "memorize it": what is the reason?)
Answer
From orthogonality. The residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ must be perpendicular to $C(A)$, i.e. orthogonal to every column of $A$. "Orthogonal to every column" is the single matrix statement $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$, i.e. $A^{\mathsf{T}}(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0}$, which rearranges to the normal equations.
Q5. What is the design matrix for fitting a line $y = c_0 + c_1 x$ to data points $(x_i, y_i)$, and what is its first column?
Answer
$A$ has one row per data point and two columns: the first column is **all ones** (it multiplies the intercept $c_0$), the second column holds the $x_i$ values (it multiplies the slope $c_1$). The all-ones column is what allows a nonzero intercept; drop it and you force the line through the origin.
Q6. True or false: fitting a quadratic $y = c_0 + c_1 x + c_2 x^2$ to data is a nonlinear problem.
Answer
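To see that the answer below describes ordinary linear least squares, here is a minimal NumPy sketch. The data are hypothetical, generated from an exact quadratic so that the coefficients to recover are known in advance; `np.vander` is used only as a convenient way to build the columns $1, x, x^2$.

```python
import numpy as np

# Hypothetical data: exact samples of y = 1 - 2x + 0.5 x^2 (an assumption for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 - 2.0 * x + 0.5 * x**2

# Design matrix with columns 1, x, x^2. The model is linear in (c0, c1, c2).
A = np.vander(x, N=3, increasing=True)

# Same normal equations as for a line, now 3 x 3.
AtA = A.T @ A
coeffs = np.linalg.solve(AtA, A.T @ y)

print(AtA.shape)   # (3, 3)
print(coeffs)      # approximately [1.0, -2.0, 0.5]
```
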
**False.** It is still *linear* least squares. "Linear" refers to linearity in the unknown *coefficients* $c_0, c_1, c_2$, not in $x$. The model is a linear combination of the fixed columns $1, x, x^2$, so we just add an $x^2$ column to the design matrix and solve the same normal equations (now $3\times 3$).
Q7. The residual is orthogonal to something. Is it orthogonal to the data vector $\mathbf{b}$, or to the column space $C(A)$?
Answer
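The orthogonality and Pythagoras claims in the answer below are easy to test. Here is a minimal NumPy sketch on five hypothetical data points (not taken from the chapter); it also shows that $\mathbf{r}$ is not orthogonal to $\mathbf{b}$, since their dot product comes out equal to $\lVert\mathbf{r}\rVert^2$.

```python
import numpy as np

# Hypothetical data for illustration only (an assumption, not the chapter's data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
A = np.column_stack([np.ones_like(x), x])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ x_hat            # the fit, a point in C(A)
r = b - b_hat                # the residual

fit_dot_r = b_hat @ r        # approximately 0: r is orthogonal to the fit
b_dot_r = b @ r              # approximately ||r||^2, so not 0 unless the fit is exact
lhs = b @ b                  # ||b||^2
rhs = b_hat @ b_hat + r @ r  # ||b_hat||^2 + ||r||^2

print(np.isclose(fit_dot_r, 0.0), np.isclose(b_dot_r, r @ r), np.isclose(lhs, rhs))
```
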
To the **column space $C(A)$**, and therefore in particular to the fit $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$, which lies in $C(A)$. It is generally *not* orthogonal to $\mathbf{b}$ itself: $\mathbf{r}^{\mathsf{T}}\mathbf{b} = \lVert\mathbf{r}\rVert^2$, which is zero only for a perfect fit. In fact $\mathbf{b} = \hat{\mathbf{b}} + \mathbf{r}$ with $\hat{\mathbf{b}} \perp \mathbf{r}$, so $\lVert\mathbf{b}\rVert^2 = \lVert\hat{\mathbf{b}}\rVert^2 + \lVert\mathbf{r}\rVert^2$ (Pythagoras).
Q8. For the anchor data the minimum sum of squared residuals is $\lVert\mathbf{r}\rVert^2 = 3.6$ and $\mathrm{SS}_{\text{tot}} = 10$. What is $R^2$, and what does it mean?
Answer
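The arithmetic in the answer below can be reproduced in a few lines of plain Python. The helper `r_squared` is a hypothetical name, and the sample values are assumptions chosen so that $\mathrm{SS}_{\text{res}} = 3.6$ and $\mathrm{SS}_{\text{tot}} = 10$ as in the question; they are not claimed to be the chapter's anchor data.

```python
import math

def r_squared(y, y_fit):
    """R^2 = 1 - SS_res / SS_tot for observed values y and fitted values y_fit."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Directly from the two numbers given in the question.
r2_from_question = 1.0 - 3.6 / 10.0

# Hypothetical observations and fitted values with SS_res = 3.6 and SS_tot = 10.
y = [2.0, 1.0, 4.0, 3.0, 5.0]
y_fit = [1.4, 2.2, 3.0, 3.8, 4.6]
r2_from_data = r_squared(y, y_fit)

print(math.isclose(r2_from_question, 0.64), math.isclose(r2_from_data, 0.64))
```

Predicting the mean for every point gives `r_squared(y, [3.0] * 5) == 0.0`, and a perfect fit gives `r_squared(y, y) == 1.0`, matching the two reference values of $R^2$.
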
$R^2 = 1 - \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}} = 1 - 3.6/10 = 0.64$. The fitted line explains **64%** of the variation in the $y$-values; the remaining 36% is unexplained scatter (the residual). $R^2 = 1$ would be a perfect fit; $R^2 = 0$ would be no better than predicting the mean.
Q9. Fit a horizontal line $y = c_0$ (one parameter) to data $\mathbf{b}$. What is the best $c_0$?
Answer
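The one-column normal equation in the answer below reduces to a single division. Here is a minimal plain-Python sketch, with hypothetical helper names (`best_constant`, `sse`) and made-up data, which also checks that nudging $c_0$ away from the solution only increases the squared error.

```python
def best_constant(b):
    """Solve (a^T a) c0 = a^T b for the all-ones column a."""
    m = len(b)           # a^T a = m
    return sum(b) / m    # a^T b = sum of the entries

def sse(b, c):
    """Sum of squared residuals for the constant fit y = c."""
    return sum((bi - c) ** 2 for bi in b)

# Hypothetical data for illustration only.
b = [2.0, 1.0, 4.0, 3.0, 5.0]
c0 = best_constant(b)

# Residuals sum to zero: this is a^T r = 0 for the all-ones column.
residual_sum = sum(bi - c0 for bi in b)

print(c0)   # 3.0
print(sse(b, c0) < sse(b, c0 + 0.1), sse(b, c0) < sse(b, c0 - 0.1))   # True True
```
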
The **mean** $\bar{y}$. The design matrix is the single all-ones column $\mathbf{a} = (1,\dots,1)$; the normal equation is $\mathbf{a}^{\mathsf{T}}\mathbf{a}\,c_0 = \mathbf{a}^{\mathsf{T}}\mathbf{b}$, i.e. $m\,c_0 = \sum b_i$, so $c_0 = \bar{y}$. Projecting onto the all-ones direction *is* averaging, a clean special case of regression-as-projection.
Q10. What does it mean, geometrically, when $A^{\mathsf{T}}A$ is singular? Is the fit still well defined?
Answer
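The distinction drawn in the answer below, ambiguous coefficients but a unique fit, shows up directly in a small experiment. Here is a minimal NumPy sketch, assuming a hypothetical design matrix whose third column is exactly twice the second; the explicit tolerances (`tol`, `rcond`) are choices made for this example.

```python
import numpy as np

# Hypothetical data; the third column duplicates the second up to a factor of 2.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
A = np.column_stack([np.ones_like(x), x, 2.0 * x])

rank = np.linalg.matrix_rank(A, tol=1e-10)     # 2, not 3: A^T A is singular

# lstsq still returns a solution (the minimum-norm one) for rank-deficient A.
x1, *_ = np.linalg.lstsq(A, b, rcond=1e-10)

# A null-space vector: A @ n = 2x - 2x = 0.
n = np.array([0.0, 2.0, -1.0])
x2 = x1 + 3.0 * n                              # a different coefficient vector

fit1 = A @ x1
fit2 = A @ x2

print(rank)
print(np.allclose(x1, x2), np.allclose(fit1, fit2))   # False True
```
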
$A^{\mathsf{T}}A$ singular means $A$ lacks full column rank: its columns are linearly dependent, so at least one column is a linear combination of the others (e.g. collinear features). Geometrically, the columns span a subspace $C(A)$ of dimension less than $n$. The coefficient vector $\hat{\mathbf{x}}$ is then **not unique**: infinitely many $\hat{\mathbf{x}}$ give the same fit, differing by vectors in the null space $N(A)$. But the **fit $\hat{\mathbf{b}} = $ projection of $\mathbf{b}$ onto $C(A)$ is still unique**: the closest point in a subspace always exists and is unique. Only its *coordinates* are ambiguous.
Q11. Why is computing $\hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$ a bad idea for real, large, or messy data?
Answer
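The conditioning claim in the answer below can be observed directly. Here is a minimal NumPy sketch, assuming a hypothetical degree-5 polynomial design matrix on 20 points in $[0, 1]$ and made-up coefficients; it compares $\kappa(A^{\mathsf{T}}A)$ with $\kappa(A)^2$ and then solves the least-squares problem through a QR factorization of $A$ instead of forming $A^{\mathsf{T}}A$.

```python
import numpy as np

# Hypothetical, mildly ill-conditioned design matrix: columns 1, x, ..., x^5.
x = np.linspace(0.0, 1.0, 20)
A = np.vander(x, N=6, increasing=True)

kappa_A = np.linalg.cond(A)            # 2-norm condition number of A
kappa_AtA = np.linalg.cond(A.T @ A)    # approximately kappa_A ** 2

# Made-up coefficients, so the exact least-squares solution is known.
true_coeffs = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0])
b = A @ true_coeffs

# QR route: A = QR, then solve R x = Q^T b without ever forming A^T A.
Q, R = np.linalg.qr(A)                 # reduced QR: Q is 20 x 6, R is 6 x 6
x_qr = np.linalg.solve(R, Q.T @ b)     # a triangular solver would also work here

print(kappa_A, kappa_AtA)
print(np.allclose(x_qr, true_coeffs))
```
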
Forming $A^{\mathsf{T}}A$ **squares the condition number**: $\kappa(A^{\mathsf{T}}A) = \kappa(A)^2$ (in the 2-norm). A mildly ill-conditioned $A$ becomes a severely ill-conditioned $A^{\mathsf{T}}A$, and solving with it can lose roughly twice as many digits to floating-point error as working with $A$ directly. Real software uses the **QR factorization** (Chapter 20) or the **SVD** (Chapter 30) on $A$ itself; NumPy's `np.linalg.lstsq`, for example, is built on an SVD-based LAPACK routine. The normal equations are for *understanding*, not industrial computation.
Q12. A degree-5 polynomial fit to 6 data points achieves $R^2 = 1$ (zero residual). Is this a great model?