Chapter 17 Quiz — Linear Regression as Projection

DataField.Dev


Twelve quick checks on least squares, the normal equations, and the projection picture. Try each before opening the answer. Most are conceptual; a couple need a little arithmetic. Notation is locked: $A^{\mathsf{T}}A$, $\lVert\cdot\rVert$, $C(A)$, $\hat{\mathbf{x}}$, residual $\mathbf{r}$.

Q1. In one sentence, what is the single geometric idea behind linear regression?

Answer

The least-squares fit is the **orthogonal projection of the data vector $\mathbf{b}$ onto the column space $C(A)$** — the closest point in $C(A)$ to $\mathbf{b}$ — so the residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ is perpendicular to $C(A)$.

Q2. Why does an overdetermined system $A\mathbf{x} = \mathbf{b}$ (tall $A$, $m > n$) usually have no exact solution?

Answer

A solution exists iff $\mathbf{b} \in C(A)$. For a tall matrix, $C(A)$ has dimension at most $n$ but sits inside the much larger $\mathbb{R}^m$, so it is a *thin* flat in a *fat* space. A generic data vector $\mathbf{b}$ — especially noisy real data — does not lie on that thin flat, so no exact solution exists. We project instead.
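The solvability test $\mathbf{b} \in C(A)$ can be checked numerically with a rank comparison. The sketch below is illustrative only: the matrix, both right-hand sides, and the helper name `in_column_space` are assumptions made up for this example, not data from the chapter.

```python
import numpy as np

# Hypothetical tall system: 4 equations, 2 unknowns (a line through 4 points).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

b_noisy = np.array([1.0, 2.0, 2.0, 4.0])   # points that are not collinear
b_exact = A @ np.array([1.0, 0.5])         # constructed to lie in C(A)


def in_column_space(A, b):
    """Return True when appending b as a column does not raise the rank,
    which is the case exactly when b lies in C(A) (up to rank tolerance)."""
    augmented = np.column_stack([A, b])
    return np.linalg.matrix_rank(augmented) == np.linalg.matrix_rank(A)


noisy_solvable = in_column_space(A, b_noisy)
exact_solvable = in_column_space(A, b_exact)
print("noisy b in C(A):", noisy_solvable)
print("exact b in C(A):", exact_solvable)
```

Here $C(A)$ is a 2-dimensional plane inside $\mathbb{R}^4$, so only a right-hand side built from the columns of $A$ passes the test.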

Q3. Write the normal equations and state the condition for a unique solution.

Answer

$A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$. The solution is unique iff $A$ has **full column rank** (linearly independent columns), which is exactly when $A^{\mathsf{T}}A$ is invertible. Then $\hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$ (a formula for understanding, not for computing — see Q11).

Q4. Where does the equation $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ come from? (Not "memorize it" — what is the reason?)

Answer

From orthogonality. The residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ must be perpendicular to $C(A)$, i.e. orthogonal to every column of $A$. "Orthogonal to every column" is the single matrix statement $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$, i.e. $A^{\mathsf{T}}(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0}$, which rearranges to the normal equations.
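As a small numerical check of this argument, the sketch below solves the normal equations for a made-up four-point data set and confirms that $A^{\mathsf{T}}\mathbf{r}$ vanishes up to rounding. The data values and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical data: four points that do not lie on one line.
x = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])   # columns: 1, x

# Normal equations A^T A x_hat = A^T b, solved as a linear system
# (no explicit inverse is formed).
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

r = b - A @ x_hat      # residual
orth = A.T @ r         # should be the zero vector, up to rounding

print("x_hat =", x_hat)   # approximately [0.9, 0.9]
print("A^T r =", orth)
```

For this tiny, well-conditioned example a direct solve of the normal equations is harmless; it is used here only because it mirrors the derivation step by step.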

Q5. What is the design matrix for fitting a line $y = c_0 + c_1 x$ to data points $(x_i, y_i)$, and what is its first column?

Answer

$A$ has one row per data point and two columns: the first column is **all ones** (it multiplies the intercept $c_0$), the second column holds the $x_i$ values (it multiplies the slope $c_1$). The all-ones column is what allows a nonzero intercept; drop it and you force the line through the origin.
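A minimal sketch of building this design matrix, and of what happens when the all-ones column is dropped, might look as follows. The data points are hypothetical, and `np.linalg.lstsq` is used only as a convenient solver.

```python
import numpy as np

# Hypothetical data points (x_i, y_i).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Design matrix for y = c0 + c1 * x: a column of ones, then the x values.
A = np.column_stack([np.ones_like(x), x])
c0, c1 = np.linalg.lstsq(A, y, rcond=None)[0]

# Without the ones column the model is y = c1 * x, a line through the origin.
A_origin = x.reshape(-1, 1)
(c1_origin,) = np.linalg.lstsq(A_origin, y, rcond=None)[0]

print("with intercept:    c0 =", c0, " c1 =", c1)
print("through origin:    c1 =", c1_origin)
```

The two slopes differ because the second model has no freedom to shift the line up or down.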

Q6. True or false: fitting a quadratic $y = c_0 + c_1 x + c_2 x^2$ to data is a nonlinear problem.

Answer

**False.** It is still *linear* least squares. "Linear" refers to linearity in the unknown *coefficients* $c_0, c_1, c_2$, not in $x$. The model is a linear combination of the fixed columns $1, x, x^2$, so we just add an $x^2$ column to the design matrix and solve the same normal equations (now $3\times 3$).
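To make the point concrete, the sketch below generates noise-free samples from a hypothetical quadratic and recovers its coefficients with the same normal equations, now $3\times 3$. The sample points and the coefficients in `true_c` are assumptions for this example.

```python
import numpy as np

# Hypothetical noise-free samples of y = 1 + 2x - 0.5x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
true_c = np.array([1.0, 2.0, -0.5])
y = true_c[0] + true_c[1] * x + true_c[2] * x**2

# The model is linear in (c0, c1, c2): the columns 1, x, x^2 are fixed data.
A = np.column_stack([np.ones_like(x), x, x**2])

# Same normal equations as for a line, just with a 3x3 matrix A^T A.
c_hat = np.linalg.solve(A.T @ A, A.T @ y)

print("recovered coefficients:", c_hat)
```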

Q7. The residual is orthogonal to something. Is it orthogonal to the data vector $\mathbf{b}$, or to the column space $C(A)$?

Answer

To the **column space $C(A)$**, and therefore to every vector in it, including the fit $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$. It is generally *not* orthogonal to $\mathbf{b}$ itself: $\mathbf{r}^{\mathsf{T}}\mathbf{b} = \lVert\mathbf{r}\rVert^2$, which is zero only for a perfect fit. In fact $\mathbf{b} = \hat{\mathbf{b}} + \mathbf{r}$ with $\hat{\mathbf{b}} \perp \mathbf{r}$, so $\lVert\mathbf{b}\rVert^2 = \lVert\hat{\mathbf{b}}\rVert^2 + \lVert\mathbf{r}\rVert^2$ (Pythagoras).
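The decomposition can be verified numerically. The following sketch uses a hypothetical four-point data set (not the chapter's anchor data) and checks three things: the fit is orthogonal to the residual, the residual is not orthogonal to $\mathbf{b}$, and the squared lengths add up.

```python
import numpy as np

# Hypothetical data for a line fit.
x = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
b_hat = A @ x_hat          # projection of b onto C(A)
r = b - b_hat              # residual

fit_dot_r = b_hat @ r      # zero up to rounding: the fit is orthogonal to r
b_dot_r = b @ r            # equals ||r||^2, so it is nonzero here

lhs = b @ b                      # ||b||^2
rhs = b_hat @ b_hat + r @ r      # ||b_hat||^2 + ||r||^2

print("b_hat . r =", fit_dot_r)
print("b . r     =", b_dot_r, " (||r||^2 =", r @ r, ")")
print("Pythagoras:", lhs, "vs", rhs)
```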

Q8. For the anchor data the minimum sum of squared residuals is $\lVert\mathbf{r}\rVert^2 = 3.6$ and $\mathrm{SS}_{\text{tot}} = 10$. What is $R^2$, and what does it mean?

Answer

$R^2 = 1 - \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}} = 1 - 3.6/10 = 0.64$. The fitted line explains **64%** of the variation in the $y$-values; the remaining 36% is unexplained scatter (the residual). $R^2 = 1$ would be a perfect fit, $R^2 = 0$ no better than predicting the mean.
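The arithmetic, together with a small helper that computes $R^2$ from data, can be sketched as follows. The function name `r_squared` and the demo data are assumptions for illustration; only the values 3.6 and 10 come from the question.

```python
import numpy as np


def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot for observed y and fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)        # squared residual length
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # variation about the mean
    return 1.0 - ss_res / ss_tot


# The numbers given in the question.
r2_quiz = 1.0 - 3.6 / 10.0

# A hypothetical data set, fitted with a line that includes an intercept.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])
coeffs = np.linalg.lstsq(A, y, rcond=None)[0]
r2_demo = r_squared(y, A @ coeffs)

print("R^2 from the quiz numbers:", r2_quiz)
print("R^2 for the demo data:    ", r2_demo)
```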

Q9. Fit a horizontal line $y = c_0$ (one parameter) to data $\mathbf{b}$. What is the best $c_0$?

Answer

The **mean** $\bar{y}$. The design matrix is the single all-ones column $\mathbf{a} = (1,\dots,1)$; the normal equation is $\mathbf{a}^{\mathsf{T}}\mathbf{a}\,c_0 = \mathbf{a}^{\mathsf{T}}\mathbf{b}$, i.e. $m\,c_0 = \sum b_i$, so $c_0 = \bar{y}$. Projecting onto the all-ones direction *is* averaging — a clean special case of regression-as-projection.
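A short sketch of this special case, using a hypothetical data vector, shows the one-column normal equation reducing to an average:

```python
import numpy as np

# Hypothetical data vector.
b = np.array([1.0, 2.0, 2.0, 4.0])
a = np.ones_like(b)               # the single all-ones column

# One-column normal equation: (a^T a) c0 = a^T b, i.e. m * c0 = sum(b).
c0 = (a @ b) / (a @ a)

projection = c0 * a               # projection of b onto the all-ones direction
r = b - projection                # residual: deviations from the mean

print("c0 =", c0, " mean =", b.mean())
print("a . r =", a @ r)           # deviations from the mean sum to zero
```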

Q10. What does it mean, geometrically, when $A^{\mathsf{T}}A$ is singular? Is the fit still well defined?

Answer

$A^{\mathsf{T}}A$ singular means $A$ lacks full column rank: its columns are linearly dependent, so at least one column is a combination of the others (e.g. collinear features). The coefficient vector $\hat{\mathbf{x}}$ is then **not unique**: infinitely many $\hat{\mathbf{x}}$ give the same fit, differing by vectors in $N(A)$. But the **fit $\hat{\mathbf{b}}$, the projection of $\mathbf{b}$ onto $C(A)$, is still unique**, because the closest point in a subspace always exists and is unique. Only its *coordinates* are ambiguous.
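The sketch below illustrates this with a deliberately redundant design matrix whose third column is twice the second. The data and the null-space vector `n` are assumptions constructed for the example. Two different coefficient vectors produce the same fit.

```python
import numpy as np

# Hypothetical data; the third column duplicates the second (scaled by 2).
x = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x, 2.0 * x])

rank = np.linalg.matrix_rank(A)     # 2, not 3: A^T A is singular

# lstsq still returns a least-squares solution (the minimum-norm one).
x_min = np.linalg.lstsq(A, b, rcond=None)[0]

# n lies in N(A): A @ n = 2x - 2x = 0, so adding it changes nothing in the fit.
n = np.array([0.0, 2.0, -1.0])
x_other = x_min + n

fit_min = A @ x_min
fit_other = A @ x_other

print("rank of A:", rank)
print("coefficients differ:", not np.allclose(x_min, x_other))
print("fits agree:         ", np.allclose(fit_min, fit_other))
```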

Q11. Why is computing $\hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$ a bad idea for real, large, or messy data?

Answer

Forming $A^{\mathsf{T}}A$ **squares the condition number**: $\kappa(A^{\mathsf{T}}A) = \kappa(A)^2$ in the 2-norm. A mildly ill-conditioned $A$ becomes a severely ill-conditioned $A^{\mathsf{T}}A$, and solving with it (let alone inverting it) loses roughly twice as many digits to floating-point error. Real software uses the **QR factorization** (Chapter 20) or the **SVD** (Chapter 30) on $A$ directly; `np.linalg.lstsq`, for example, uses an SVD-based LAPACK routine. The normal equations are for *understanding*, not industrial computation.
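A rough numerical illustration, assuming a small cubic design matrix on made-up sample points: the sketch compares $\kappa(A)$ with $\kappa(A^{\mathsf{T}}A)$ and then solves the least-squares problem through a QR factorization of $A$ instead of the normal equations. The right-hand side is an arbitrary smooth function chosen for the demo.

```python
import numpy as np

# Hypothetical design matrix: columns 1, t, t^2, t^3 on 20 points in [0, 1].
t = np.linspace(0.0, 1.0, 20)
A = np.vander(t, 4, increasing=True)
b = np.cos(3.0 * t)                      # arbitrary right-hand side

kappa_A = np.linalg.cond(A)              # 2-norm condition number of A
kappa_AtA = np.linalg.cond(A.T @ A)      # about kappa_A squared

# QR route: A = QR with orthonormal Q, so the problem becomes R x = Q^T b.
# np.linalg.solve is used for brevity; it does not exploit that R is triangular.
Q, R = np.linalg.qr(A)                   # reduced QR: Q is 20x4, R is 4x4
x_qr = np.linalg.solve(R, Q.T @ b)

x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print("cond(A)     =", kappa_A)
print("cond(A^T A) =", kappa_AtA)
print("QR and lstsq agree:", np.allclose(x_qr, x_lstsq))
```

The QR route works with $A$ itself, so its accuracy is governed by $\kappa(A)$ rather than $\kappa(A)^2$.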

Q12. A degree-5 polynomial fit to 6 data points achieves $R^2 = 1$ (zero residual). Is this a great model?

Answer

No. It is **overfitting**. With 6 coefficients for 6 points (at distinct $x$-values), the design matrix is square and invertible, so the column space fills all of $\mathbb{R}^6$, $\mathbf{b}$ is reachable exactly, and the residual is zero by construction. The curve has memorized the noise and typically wiggles between points, predicting *new* data badly. Low training residual is not the same as good prediction; choosing model size (the bias–variance tradeoff) is the next problem, beyond pure linear algebra.
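The "zero residual by construction" claim can be seen in a small sketch. The six data points below are hypothetical: a straight-line trend plus fixed perturbations standing in for noise. A line leaves a nonzero residual, while the degree-5 polynomial interpolates every point.

```python
import numpy as np

# Hypothetical data: a linear trend plus fixed "noise" values.
x = np.linspace(0.0, 1.0, 6)
noise = np.array([0.10, -0.20, 0.15, -0.10, 0.20, -0.15])
y = 2.0 * x + noise

A_line = np.vander(x, 2, increasing=True)   # columns 1, x        (6 x 2)
A_deg5 = np.vander(x, 6, increasing=True)   # columns 1, ..., x^5 (6 x 6)

c_line = np.linalg.lstsq(A_line, y, rcond=None)[0]
c_deg5 = np.linalg.solve(A_deg5, y)         # square, invertible: exact solve

res_line = np.linalg.norm(y - A_line @ c_line)
res_deg5 = np.linalg.norm(y - A_deg5 @ c_deg5)

print("rank of degree-5 design matrix:", np.linalg.matrix_rank(A_deg5))
print("line residual norm:    ", res_line)
print("degree-5 residual norm:", res_deg5)   # zero up to rounding
```

The zero residual says nothing about how the degree-5 curve behaves away from the six sample points; checking that would need held-out data, which this sketch does not include.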