Chapter 17 — Further Reading
Annotated pointers for going deeper on least squares, the normal equations, and the geometry of regression as projection. The three "anchor" textbooks below are referenced throughout this book; we map each chapter to the relevant sections so you can read in parallel. Section numbers follow the most widely circulated editions and may shift slightly between printings.
The three anchor textbooks
- Gilbert Strang, Introduction to Linear Algebra (5th ed.), §4.3 (Least Squares Approximations) and §4.2 (Projections) — the essential companion to this chapter. Strang develops least squares exactly in this chapter's spirit: he draws the projection of $\mathbf{b}$ onto the column space first, derives the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ from orthogonality, and works the line-fit and parabola-fit by hand. §4.2 (projection onto a subspace, the projection matrix $P = A(A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}$) is the formal version of our §17.6.2 and is completed in our Chapter 19. §4.4 then gives Gram–Schmidt and QR — the numerically sound solver our §17.9 points you toward (our Chapter 20). If you read one outside source for this material, read §4.2–4.3 of Strang; his order is this chapter's order.
- Stephen Boyd & Lieven Vandenberghe, Introduction to Applied Linear Algebra (VMLS), Chapters 12–13 (Least Squares, Least Squares Data Fitting). The applied, data-first treatment, and the closest match to this chapter's case studies. Chapter 12 sets up the least-squares problem and the normal equations; Chapter 13 is entirely about fitting models to data — straight-line fit, polynomial fit, feature engineering, and $R^2$ — with the same design-matrix viewpoint we use. Best matched to the CS/data-science learning path. Freely and legally downloadable as a PDF from the authors, with companion Python notebooks (see below).
- Sheldon Axler, Linear Algebra Done Right (4th ed.), §6C (Orthogonal Complements and Minimization Problems). The rigorous, proof-first complement. Axler proves the projection-minimizes-distance theorem (our §17.3 Pythagorean argument) abstractly in any inner-product space, and frames the best approximation as the orthogonal projection onto a subspace, which is the coordinate-free heart of this chapter. Math majors should read §6C alongside our §17.5 derivation and the Math-Major Sidebar on the pseudoinverse; it shows why "regression is projection" holds far beyond $\mathbb{R}^n$ (it powers Fourier series in Chapter 22, where the same projection picks out the best trigonometric approximation).
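The route the anchor texts share (project $\mathbf{b}$ onto the column space, then read the normal equations off the orthogonality of the error) can be checked numerically. The following is a minimal NumPy sketch, not code from any of the three books; the three data points and the variable names are illustrative choices, picked so the arithmetic is easy to verify by hand.

```python
import numpy as np

# Illustrative data (an assumption for this sketch): fit b ≈ c + d*t.
t = np.array([0.0, 1.0, 2.0])
b = np.array([6.0, 0.0, 0.0])

# Design matrix: a column of ones and a column of t values.
A = np.column_stack([np.ones_like(t), t])

# Normal equations: A^T A x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Projection matrix P = A (A^T A)^{-1} A^T, built with solve() rather than inv().
P = A @ np.linalg.solve(A.T @ A, A.T)

p = P @ b   # projection of b onto the column space of A
e = b - p   # error vector, perpendicular to every column of A

print("x_hat =", x_hat)
print("p     =", p)
print("A^T e =", A.T @ e)
```

Working the same system by hand gives $\hat{\mathbf{x}} = (5, -3)$, $\mathbf{p} = (5, 2, -1)$, and $\mathbf{e} = (1, -2, 1)$, so $A^{\mathsf{T}}\mathbf{e} = \mathbf{0}$ up to rounding. Forming $A^{\mathsf{T}}A$ explicitly is fine at this size; for larger or ill-conditioned problems a QR-based solver is the safer choice.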
Free online resources
- MIT OpenCourseWare, 18.06 Linear Algebra (Gilbert Strang), Lectures 15–16. The definitive video companion. Lecture 15 ("Projections onto Subspaces") builds the projection matrix and the geometry; Lecture 16 ("Projection Matrices and Least Squares") derives the normal equations and fits a line to data on the blackboard, narrating exactly our §17.3–17.6. Strang's hand-drawn picture of $\mathbf{b}$, its projection $\mathbf{p}$ in the column space, and the perpendicular error $\mathbf{e} = \mathbf{b} - \mathbf{p}$ is the picture this chapter is built around. Full video, transcripts, and problem sets, free.
- 3Blue1Brown, Essence of Linear Algebra. Grant Sanderson's series has no dedicated least-squares episode, but the "Dot products and duality" episode and the projection segments supply the geometric foundation (what it means to drop a perpendicular onto a line or plane) that makes "least squares is projection" click. Watch them before re-reading §17.3 if the perpendicular-is-closest idea has not yet become visual.
- Boyd & Vandenberghe, VMLS free PDF and Python companion. The full textbook and its numpy-friendly companion notebooks are posted by the authors at no cost. The data-fitting notebooks show straight-line and polynomial fits, $R^2$, and the design-matrix construction directly in code, reinforcing the C-track exercises and the toolkit function of §17.10.
- scikit-learn documentation, "Linear Models" (user guide) and numpy.linalg.lstsq reference. For the working data scientist: sklearn.linear_model.LinearRegression is exactly this chapter's least squares with a friendly interface, and its docs explain the SVD-based solver (the §17.9 lesson). The np.linalg.lstsq reference documents the 4-tuple return (solution, residual sum of squares, rank, singular values) used in our §17.10 Computational Note.
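To make the 4-tuple return of np.linalg.lstsq concrete, here is a small sketch on synthetic data. The line $y = 2 + 3t$, the noise level, and the seed are assumptions made for illustration only; they do not come from the chapter's case studies.

```python
import numpy as np

# Synthetic data (illustrative assumption): noisy samples of y = 2 + 3 t.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
y = 2.0 + 3.0 * t + 0.1 * rng.standard_normal(t.size)

# Design matrix with an intercept column.
A = np.column_stack([np.ones_like(t), t])

# lstsq returns (solution, residual sum of squares, rank, singular values).
x_hat, rss, rank, sing_vals = np.linalg.lstsq(A, y, rcond=None)

# rss is a length-1 array here because y is 1-D and A is tall with full column rank.
manual_rss = np.sum((y - A @ x_hat) ** 2)

print("coefficients   :", x_hat)
print("RSS (lstsq)    :", rss[0])
print("RSS (by hand)  :", manual_rss)
print("rank           :", rank)
print("singular values:", sing_vals)
```

Passing rcond=None selects NumPy's current default cutoff for small singular values. Note that the residual array comes back empty when A is rank-deficient or not tall, so code that reads rss[0] should check the rank first.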
On the applications in this chapter
- Predicting house prices (Case Study 1). The hedonic-pricing model, which treats price as a linear function of attributes, is a standard topic in econometrics and real-estate analytics; search "hedonic regression" or see any applied-regression text. Production automated-valuation models add hundreds of features, regularization, and careful train/test evaluation: the data-science layer on top of linear regression. Ridge regularization modifies the normal equations directly by adding $\lambda I$ to $A^{\mathsf{T}}A$; LASSO instead adds an $\ell_1$ penalty and has no closed-form normal equations. The multicollinearity hazard (when feature columns are nearly dependent and $A^{\mathsf{T}}A$ is near-singular) is the practical face of our full-column-rank condition; see any regression text's treatment of the variance inflation factor, and Chapter 38 on the condition number.
- Sensor calibration (Case Study 2). Least-squares calibration of instruments is covered in any measurement-and-instrumentation or experimental-physics methods text; look for "calibration curve" and "polynomial calibration." The thermistor and thermocouple examples are classic — thermocouples in particular use high-order polynomial calibrations (NIST publishes the standard coefficients), a real case where the Vandermonde ill-conditioning of §17.9 forces QR or orthogonal-polynomial fitting. Curve-fitting in the physical sciences, including weighted least squares (when measurement errors differ point to point), is treated in Bevington & Robinson, Data Reduction and Error Analysis for the Physical Sciences.
- The statistics behind the geometry. This chapter is deliberately the linear-algebra view. The inferential view (standard errors on the coefficients, confidence and prediction intervals, hypothesis tests, and the Gauss–Markov theorem, which explains why least squares is the best linear unbiased estimator under standard noise assumptions) lives in any regression or mathematical-statistics text. The bridge is the assumption that the errors are independent, zero-mean Gaussian noise with a common variance, under which the least-squares estimate coincides with the maximum-likelihood estimate.
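The conditioning hazard that runs through the house-price and calibration items above can be seen in a few lines. This is a hedged sketch, not the chapter's toolkit code: the degree-7 polynomial, the 30 sample points, and the all-ones coefficient vector are arbitrary choices made for illustration.

```python
import numpy as np

# Illustrative setup (an assumption): degree-7 polynomial fit on 30 points in [0, 1].
t = np.linspace(0.0, 1.0, 30)
A = np.vander(t, N=8, increasing=True)   # columns 1, t, t^2, ..., t^7

# Forming A^T A squares the 2-norm condition number.
cond_A = np.linalg.cond(A)
cond_AtA = np.linalg.cond(A.T @ A)

# Noise-free targets from known coefficients, so the exact answer is available.
x_true = np.ones(8)
y = A @ x_true

# QR route: A = Q R, then solve R x = Q^T y without ever forming A^T A.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ y)

print(f"cond(A)     = {cond_A:.3e}")
print(f"cond(A^T A) = {cond_AtA:.3e}")
print("max coefficient error via QR:", np.max(np.abs(x_qr - x_true)))
```

Because the QR route works with a matrix whose condition number matches that of $A$ rather than its square, it loses roughly half as many digits as the normal equations on the same problem. Raising the degree makes the gap grow quickly.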
A note on where this is going
The projection idea you used informally here is formalized in Chapter 19 (the orthogonal-projection operator and the matrix $P$), made computationally safe in Chapter 20 (QR), and generalized to every matrix in Chapter 30 (the SVD and the Moore–Penrose pseudoinverse, which solve least squares even when $A^{\mathsf{T}}A$ is singular). The same "best approximation in a subspace" reappears in Chapter 22 as Fourier series — projecting a function onto a basis of sines and cosines — and underlies the matrix-factorization recommenders and final-layer regressions of Chapter 33. Least squares is the first and most important place the four-subspaces picture turns into a tool you will use for the rest of your quantitative life.
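As a preview of the pseudoinverse case, the sketch below builds a design matrix whose third column is the sum of the first two, so $A^{\mathsf{T}}A$ is singular and the normal equations have no unique solution. The matrix and right-hand side are made up for illustration; np.linalg.pinv and np.linalg.lstsq are the real NumPy routines, both SVD-based.

```python
import numpy as np

# Illustrative rank-deficient design (an assumption): column 3 = column 1 + column 2.
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

rank = np.linalg.matrix_rank(A)          # 2, not 3

# Moore-Penrose pseudoinverse: the minimum-norm least-squares solution.
x_pinv = np.linalg.pinv(A) @ b

# lstsq handles rank deficiency the same way and returns the same vector.
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

# The residual is still orthogonal to the column space.
residual = b - A @ x_pinv

# n spans the null space of A; the minimum-norm solution has no component along it.
n = np.array([1.0, 1.0, -1.0])

print("rank          :", rank)
print("x (pinv)      :", x_pinv)
print("x (lstsq)     :", x_lstsq)
print("A^T residual  :", A.T @ residual)
print("x_pinv . n    :", x_pinv @ n)
```

Infinitely many coefficient vectors give the same fitted values here; the pseudoinverse picks the shortest one, which is why its dot product with the null-space vector is zero up to rounding.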