Chapter 17 Exercises — Linear Regression as Projection
How to use these. Work the ⭐ problems first to lock in the geometry (no heavy computation). The ⭐⭐ problems are by-hand least squares: build the design matrix, form the normal equations, solve, check orthogonality. The ⭐⭐⭐ problems split into proofs (the A track) and coding with numpy (the C track); do the ones that match your path, though the strongest students do both. The ⭐⭐⭐⭐ problems are applied: find the regression hiding in a real situation.
Tags: [hand] = pencil only, [code] = needs numpy, [proof] = rigorous argument, [essay] = written explanation. Notation is locked: $A^{\mathsf{T}}A$ for the Gram matrix, $\lVert\cdot\rVert$ for the norm, $C(A)$ for the column space, $\hat{\mathbf{x}}$ for the least-squares solution, $\mathbf{r}$ for the residual.
Tier ⭐ — Conceptual (what is / why)
17.1 [hand] In one sentence each, state (a) what makes a system $A\mathbf{x} = \mathbf{b}$ overdetermined, and (b) why such a system usually has no exact solution. Use the words "column space."
17.2 [hand] Complete and explain: "The least-squares fit $A\hat{\mathbf{x}}$ is the ____ of $\mathbf{b}$ onto ____, and the residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ is ____ to it." Fill the three blanks and give the one-line geometric reason.
17.3 [hand] Write down the normal equations for a general $A\mathbf{x} = \mathbf{b}$, and state the condition on $A$ under which they have a unique solution. What goes wrong (geometrically) when that condition fails?
17.4 [essay] Explain in your own words why "least squares minimizes squared error" and "least squares finds the closest point in a subspace" are the same statement, not two different facts.
17.5 [hand] True or false, one-line reason each: (a) The residual is orthogonal to the data vector $\mathbf{b}$. (b) The residual is orthogonal to every column of $A$. (c) If $\mathbf{b} \in C(A)$, the residual is $\mathbf{0}$ and the fit is exact. (d) Fitting a parabola to data is a nonlinear least-squares problem.
17.6 [hand] A design matrix $A$ for fitting a line $y = c_0 + c_1 x$ has two columns. What is the first column, and why? What changes about the fitted line if you delete it?
17.7 [essay] Why does ordinary least squares minimize the vertical gaps between points and the line, rather than the perpendicular distances? In what situation would you want the perpendicular-distance fit instead?
17.8 [hand] State what $R^2 = 1$ and $R^2 = 0$ each mean about a fit. Can $R^2$ be negative for a least-squares fit whose model includes an intercept, when $R^2$ is computed on the same data used for the fit? Why or why not?
Tier ⭐⭐ — Computation by hand
17.9 [hand] Fit a line to the three points $(0, 1)$, $(1, 0)$, $(2, 2)$. (a) Write the design matrix $A$ and vector $\mathbf{b}$. (b) Form $A^{\mathsf{T}}A$ and $A^{\mathsf{T}}\mathbf{b}$. (c) Solve the $2\times 2$ normal equations for $\hat{\mathbf{x}} = (c_0, c_1)$. (d) Compute the residual and verify $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$.
17.10 [hand] For some line fit, you are given $A^{\mathsf{T}}A = \begin{bmatrix} 4 & 8 \\ 8 & 20 \end{bmatrix}$ and $A^{\mathsf{T}}\mathbf{b} = \begin{bmatrix} 12 \\ 28 \end{bmatrix}$. Find $\hat{\mathbf{x}}$ by solving the normal equations. (Check: the determinant of $A^{\mathsf{T}}A$ is $16$.)
17.11 [hand] For the anchor data $(1,2),(2,1),(3,4),(4,3),(5,5)$, the normal equations are $\begin{bmatrix} 5 & 15 \\ 15 & 55\end{bmatrix}\hat{\mathbf{x}} = \begin{bmatrix} 15 \\ 53 \end{bmatrix}$. Solve them by hand and confirm you get $\hat{\mathbf{x}} = (0.6, 0.8)$. Then compute the fitted value $\hat{y}$ at $x = 3$.
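Optional numerical check for 17.11, to run only after the hand computation. This is a minimal sketch, assuming numpy is available; the variable names (gram, rhs, x_hat) are illustrative and not taken from the chapter. It rebuilds the normal equations from the anchor data so you can compare them with the matrix and vector stated in the exercise.

```python
import numpy as np

# Anchor data from Exercise 17.11
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Design matrix: a column of ones (intercept) next to the x-values
A = np.column_stack([np.ones_like(x), x])

gram = A.T @ A   # compare with [[5, 15], [15, 55]]
rhs = A.T @ y    # compare with [15, 53]

# Solve the 2x2 normal equations for (c0, c1)
x_hat = np.linalg.solve(gram, rhs)

print(gram)
print(rhs)
print(x_hat)
```

The fitted value at $x = 3$ is left for you to compute by hand from the printed coefficients.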
17.12 [hand] Fit a horizontal line $y = c_0$ (one parameter, no slope) to the four values $\mathbf{b} = (3, 7, 5, 9)$. (a) What is the design matrix $A$? (b) Solve the (now $1\times 1$) normal equation. (c) Show that the best constant fit is exactly the mean $\bar{y}$, and explain why projection onto the all-ones column is averaging.
17.13 [hand] For the fit in Exercise 17.11, the residual is $\mathbf{r} = (0.6, -1.2, 1.0, -0.8, 0.4)$. Compute $\lVert\mathbf{r}\rVert^2$ and verify it equals $3.6$. Then compute $\mathrm{SS}_{\text{tot}} = \sum(y_i - \bar{y})^2$ and use it to find $R^2$.
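Optional numerical check for 17.13. A minimal sketch, assuming numpy and the coefficients $(0.6, 0.8)$ stated in 17.11; the names rss and ortho are illustrative. It recomputes the residual, its squared norm, and $A^{\mathsf{T}}\mathbf{r}$, and leaves $\mathrm{SS}_{\text{tot}}$ and $R^2$ to you.

```python
import numpy as np

# Anchor data and the coefficients stated in Exercise 17.11
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
x_hat = np.array([0.6, 0.8])

A = np.column_stack([np.ones_like(x), x])

r = y - A @ x_hat   # residual vector b - A x_hat
rss = r @ r         # squared norm of the residual
ortho = A.T @ r     # should vanish up to rounding

print(r)
print(rss)
print(ortho)
```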
17.14 [hand] Set up (do not fully solve) the normal equations to fit a quadratic $y = c_0 + c_1 x + c_2 x^2$ to the three points $(-1, 2), (0, 0), (1, 2)$. Write the $3\times 3$ matrix $A^{\mathsf{T}}A$ and the vector $A^{\mathsf{T}}\mathbf{b}$. (Note: with 3 points and 3 parameters, the fit will be exact — explain why in one line.)
17.15 [hand] Two columns of a design matrix are $\mathbf{a}_1 = (1,1,1)$ and $\mathbf{a}_2 = (2,2,2)$. Show that $A^{\mathsf{T}}A$ is singular, and explain what this says about the uniqueness of the least-squares coefficients. Is the fit $\hat{\mathbf{b}}$ still unique?
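Optional numerical check for 17.15. A minimal sketch, assuming numpy; it only confirms that the Gram matrix is singular and does not answer the uniqueness questions. The names gram, rank, and det are illustrative.

```python
import numpy as np

# Columns from Exercise 17.15: the second is exactly twice the first
a1 = np.array([1.0, 1.0, 1.0])
a2 = np.array([2.0, 2.0, 2.0])
A = np.column_stack([a1, a2])

gram = A.T @ A                       # 2x2 Gram matrix
rank = np.linalg.matrix_rank(gram)   # rank below 2 means singular
det = np.linalg.det(gram)

print(gram)
print("rank:", rank, "det:", det)
```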
17.16 [hand] Given the least-squares line $\hat{y} = 1.5 + 0.5x$ fitted to the points $(0,1),(1,3),(2,2)$ (worked in §17.6), pick a competing line, say $\hat{y} = 1 + x$, and verify directly that it gives a strictly larger sum of squared residuals (SSE). Compute both SSE values.
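Optional numerical check for 17.16. A minimal sketch, assuming numpy; the helper sse is a hypothetical name introduced here, not a function from the chapter. Run it after computing both SSE values by hand and compare.

```python
import numpy as np

def sse(c0, c1, x, y):
    """Sum of squared vertical residuals for the line y = c0 + c1 * x."""
    r = y - (c0 + c1 * x)
    return float(r @ r)

# Points from Exercise 17.16
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 2.0])

sse_fit = sse(1.5, 0.5, x, y)     # the least-squares line
sse_other = sse(1.0, 1.0, x, y)   # the competing line y = 1 + x

print(sse_fit, sse_other)
```

You can pass any other pair (c0, c1) to sse to try further competitors.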
Tier ⭐⭐⭐ — Proof (A track) and Coding (C track)
17.17 [proof] Prove that the residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ of the least-squares solution is orthogonal to $C(A)$, starting from the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$. (Show $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$, then argue why that means $\mathbf{r}\perp\mathbf{v}$ for every $\mathbf{v}\in C(A)$.)
17.18 [proof] Derive the normal equations from the geometry (the reverse direction of 17.17): assume $\mathbf{r}$ is orthogonal to every column of $A$ and deduce $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$.
17.19 [proof] Prove the Gram-matrix theorem of §17.5: for real $A$, $N(A^{\mathsf{T}}A) = N(A)$, and hence $A^{\mathsf{T}}A$ is invertible iff $A$ has full column rank. (The crux: if $A^{\mathsf{T}}A\mathbf{x} = \mathbf{0}$, dot with $\mathbf{x}$ to get $\lVert A\mathbf{x}\rVert^2 = 0$.)
17.20 [proof] Prove the Pythagorean optimality of the projection: if $\mathbf{r} = \mathbf{b} - \mathbf{p}$ is orthogonal to $C(A)$ and $\mathbf{q}$ is any other vector in $C(A)$, then $\lVert\mathbf{b} - \mathbf{q}\rVert^2 = \lVert\mathbf{b} - \mathbf{p}\rVert^2 + \lVert\mathbf{p} - \mathbf{q}\rVert^2 \ge \lVert\mathbf{b}-\mathbf{p}\rVert^2$. Conclude $\mathbf{p}$ is the unique minimizer.
17.21 [proof] Show that the projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$ (full column rank $A$) is symmetric ($P^{\mathsf{T}} = P$) and idempotent ($P^2 = P$). What is $P\mathbf{v}$ for a vector $\mathbf{v}$ already in $C(A)$, and why?
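A numerical sanity check can accompany the proof in 17.21, though it does not replace it. The sketch below assumes numpy and uses the anchor design matrix ($x = 1,\dots,5$) as an arbitrary full-column-rank example; the coefficient vector used to build v is made up for illustration.

```python
import numpy as np

# Example full-column-rank design matrix (intercept column plus x = 1..5)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
A = np.column_stack([np.ones_like(x), x])

# Projection matrix onto C(A); explicit inverse is fine for a tiny demo
P = A @ np.linalg.inv(A.T @ A) @ A.T

is_symmetric = np.allclose(P, P.T)
is_idempotent = np.allclose(P @ P, P)
print("symmetric:", is_symmetric, "idempotent:", is_idempotent)

# A vector already in C(A): some combination of the columns (made-up weights)
v = A @ np.array([2.0, -1.0])
Pv = P @ v
print(v)
print(Pv)   # compare with v and explain what you see
```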
17.22 [code] Write a function fit_line(x, y) that builds the design matrix [1, x], forms and solves the normal equations with np.linalg.solve, and returns (c0, c1). Test it on the anchor data and confirm (0.6, 0.8). Then verify against np.linalg.lstsq and np.polyfit(x, y, 1).
17.23 [code] Write r_squared(x, y, coeffs) for a polynomial fit (coeffs in increasing-power order). Use it to compute $R^2$ for a linear and a quadratic fit to the points $x = [-2,-1,0,1,2]$, $y = [4.5, 1.0, 0.5, 1.5, 5.0]$. Confirm the quadratic gives $R^2 \approx 0.993$ and the linear is much worse.
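One possible shape for 17.23, offered as a sketch and not as the reference solution. It assumes numpy and uses numpy.polynomial.polynomial, whose polyfit and polyval both work with coefficients in increasing-power order, matching the convention in the exercise. The names r2_lin and r2_quad are illustrative.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def r_squared(x, y, coeffs):
    """R^2 of a polynomial fit; coeffs are in increasing-power order."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = P.polyval(x, coeffs)            # fitted values
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares about the mean
    return 1.0 - ss_res / ss_tot

# Data from Exercise 17.23
x = [-2, -1, 0, 1, 2]
y = [4.5, 1.0, 0.5, 1.5, 5.0]

r2_lin = r_squared(x, y, P.polyfit(x, y, 1))
r2_quad = r_squared(x, y, P.polyfit(x, y, 2))

print(r2_lin, r2_quad)
```

Note that np.polyfit (used in 17.22) returns coefficients in decreasing-power order, so reverse them before passing them to a function written this way.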
17.24 [code] Demonstrate the §17.9 conditioning lesson numerically: build a design matrix with two nearly-collinear columns, print np.linalg.cond(A) and np.linalg.cond(A.T @ A), and confirm the second is approximately the square of the first. In one sentence, state the practical consequence.
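A starting point for 17.24, as a sketch under stated assumptions: numpy is available, and the particular columns (a ones column and a slightly perturbed copy, with eps = 1e-3) are made up here for illustration. Any nearly-collinear pair would do.

```python
import numpy as np

# Two nearly-collinear columns: ones, and ones plus a tiny multiple of t
eps = 1e-3
t = np.linspace(0.0, 1.0, 20)
A = np.column_stack([np.ones_like(t), np.ones_like(t) + eps * t])

cond_A = np.linalg.cond(A)            # 2-norm condition number of A
cond_gram = np.linalg.cond(A.T @ A)   # condition number of the Gram matrix

print("cond(A)       =", cond_A)
print("cond(A^T A)   =", cond_gram)
print("cond(A)^2     =", cond_A ** 2)
```

Try shrinking eps and watch how quickly the Gram matrix's condition number grows relative to that of A. The one-sentence practical consequence is left to you.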
17.25 [code] Implement least_squares(A, b) from scratch (the toolkit task): form $A^{\mathsf{T}}A$ and $A^{\mathsf{T}}\mathbf{b}$ using your own transpose/matmul (or nested loops), solve the square system with your Chapter 4 Gaussian elimination, and raise an error if the pivot is zero (singular Gram matrix). Verify against np.linalg.lstsq on both the 3-point and 5-point examples of §17.6.
Tier ⭐⭐⭐⭐ — Application
17.26 [code] House prices. Using the eight-house dataset of Case Study 1 (features: intercept, size in 1000 sqft, bedrooms, age in years), fit the multiple-regression model with np.linalg.lstsq. Report the four coefficients and $R^2$, interpret the sign of the age coefficient in plain English, and predict the price of a 1.6k-sqft, 3-bedroom, 12-year-old house.
17.27 [code] Sensor calibration. A thermistor outputs voltages $V = [0.50, 0.95, 1.55, 2.05, 2.60, 3.10]$ at known temperatures $T = [10.2, 20.5, 35.1, 47.8, 61.0, 73.5]$ °C. Fit a linear calibration $T = c_0 + c_1 V$ by least squares, report the slope (°C per volt) and $R^2$, and use the fit to convert a new reading of $V = 1.80$ V to a temperature. Then fit a quadratic and say whether the extra term is worth it.
17.28 [essay] Moore's-law-style trend. You have yearly data on a quantity that grows roughly exponentially (e.g., transistors per chip). Explain why fitting a straight line to $(\text{year}, \log y)$ — rather than to $(\text{year}, y)$ — turns an exponential trend into a linear least-squares problem. What do the resulting slope and intercept mean back in the original units?
17.29 [code] Overfitting demo. Take any 6 data points with distinct $x$-values and fit polynomials of degree 1, 2, 3, 4, and 5 by least squares. Tabulate the residual sum of squares $\lVert\mathbf{r}\rVert^2$ for each degree. Confirm that it never increases as the degree grows and hits (essentially) zero at degree 5, then explain in two sentences why "lowest training residual" is the wrong criterion for choosing the model.
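A scaffold for 17.29, as a sketch and not the intended solution. It assumes numpy; the six data points are made up for illustration and should be replaced with your own. It builds each design matrix with np.vander and fits with np.linalg.lstsq.

```python
import numpy as np

# Six made-up points with distinct x-values (replace with your own data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.7, 2.1, 4.2, 3.8, 6.1])

rss = {}
for deg in range(1, 6):
    # Columns 1, x, x^2, ..., x^deg
    A = np.vander(x, deg + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coeffs
    rss[deg] = float(r @ r)   # residual sum of squares for this degree

for deg, value in rss.items():
    print(deg, value)
```

The table it prints covers the computational part; the two-sentence explanation is the part of the exercise that matters.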
17.30 [essay] Pick one field you care about (sports, medicine, finance, climate, music) and describe a real prediction problem in it as a least-squares projection: name $\mathbf{b}$, name the columns of $A$, say what $C(A)$ represents, and say what the residual would mean. Two paragraphs.