Chapter 32 — Further Reading
Annotated pointers for going deeper on principal component analysis, mapped to the three standard texts this book tracks (Strang, Axler, Boyd–Vandenberghe) plus free online resources. PCA sits at the intersection of linear algebra, statistics, and data science, so the references span all three.
In the standard linear-algebra texts
- Gilbert Strang, Introduction to Linear Algebra (and Linear Algebra and Learning from Data). Strang treats PCA as the headline application of the SVD, which matches this chapter's view exactly. Learning from Data (2019) is the better fit here: its opening chapters build PCA directly from the SVD of the centered data matrix, with Strang's characteristic emphasis on the geometry of the four subspaces and the "data ellipse." Read it after this chapter to see the same two-routes story (covariance eigenvectors = right singular vectors) developed at book length, with real datasets. Strang's MIT OpenCourseWare lectures on the SVD and PCA are the canonical free video treatment.
- Sheldon Axler, Linear Algebra Done Right (4th ed.). Axler is the place to go for the theory underneath PCA: the spectral theorem for self-adjoint operators (his Chapter 7) is the abstract heart of §32.4. Axler proves the spectral theorem via the operator-theoretic, determinant-free route and develops the singular value decomposition and the minimax/variational characterization of eigenvalues that justifies the "next-best perpendicular direction" description of principal components. Axler does not cover PCA as an application (the book is resolutely pure), but it is the rigorous foundation for why the maximum-variance direction is the top eigenvector.
- Stephen Boyd & Lieven Vandenberghe, Introduction to Applied Linear Algebra (VMLS). The most application-minded of the three, and freely available online. Boyd–Vandenberghe frame PCA through least squares, low-rank approximation, and clustering, and their treatment of the data matrix, k-means, and the geometry of fitting is the perfect companion to this chapter's "PCA is the SVD of the centered data" message. Their notation (data points as rows, the emphasis on the practical workflow) closely matches a working data scientist's, and the book's exercises are computational and concrete.
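The two-routes identity mentioned in the Strang entry (covariance eigenvectors = right singular vectors of the centered data) is easy to check numerically. The following is a minimal sketch, assuming NumPy; the synthetic dataset, the seed, and all variable names are illustrative and not taken from any of the three texts.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
# Synthetic data: 200 samples (rows), 3 correlated features (columns)
mix = np.array([[3.0, 0.0, 0.0],
                [1.0, 2.0, 0.0],
                [0.5, 0.5, 0.3]])
X = rng.normal(size=(200, 3)) @ mix
Xc = X - X.mean(axis=0)         # center each column
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix
C = Xc.T @ Xc / (n - 1)
evals, evecs = np.linalg.eigh(C)            # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending

# Route 2: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues equal squared singular values divided by (n - 1)
same_variances = np.allclose(evals, s**2 / (n - 1))
# Eigenvectors match right singular vectors up to a sign per direction
same_directions = np.allclose(np.abs(evecs.T @ Vt.T), np.eye(3), atol=1e-8)
print(same_variances, same_directions)
```

The comparison uses absolute values because each principal direction is only determined up to sign, so the two routes may return opposite-pointing vectors for the same axis.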
On the SVD and the spectral theorem (the foundations)
- Trefethen & Bau, Numerical Linear Algebra. The definitive treatment of why the SVD route is numerically preferred (§32.8.1). Lectures 4–5 develop the SVD and its geometry; the discussion of conditioning and the squaring of the condition number when forming $A^{\mathsf{T}}A$ is exactly the warning behind "never form the covariance matrix." This is the reference for the Chapter 38 material that this chapter leans on.
- Strang's "Four Fundamental Subspaces" and SVD lectures (MIT OCW 18.06). Free, and the clearest visual explanation of the SVD geometry that PCA inherits.
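The condition-number squaring that Trefethen & Bau warn about can be seen directly. This is a small sketch, assuming NumPy; the matrix is built with hand-picked singular values (an illustrative choice) so that its condition number is known in advance.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed
# Orthonormal factors from QR of random matrices
U, _ = np.linalg.qr(rng.normal(size=(50, 4)))   # 50 x 4, orthonormal columns
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))    # 4 x 4 orthogonal

# Prescribed singular values, so cond(A) = 1 / 1e-3 = 1e3 by construction
sigma = np.array([1.0, 1e-1, 1e-2, 1e-3])
A = U @ np.diag(sigma) @ V.T

cond_A = np.linalg.cond(A)            # 2-norm condition number of A
cond_gram = np.linalg.cond(A.T @ A)   # condition number of the Gram matrix
print(f"cond(A)     = {cond_A:.3e}")
print(f"cond(A^T A) = {cond_gram:.3e}")
```

Forming $A^{\mathsf{T}}A$ squares every singular value, so the ratio of largest to smallest is squared as well. With double precision offering roughly 16 significant digits, that squaring is what costs accuracy in the smallest components.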
On PCA in statistics and data science
- Trevor Hastie, Robert Tibshirani & Jerome Friedman, The Elements of Statistical Learning (ESL), §14.5. The standard graduate-level treatment of PCA as unsupervised learning: principal components, the connection to the SVD, principal-component regression, and the relationship to clustering. Freely available from the authors. ESL also covers the nonlinear extensions (kernel PCA, principal curves) that address PCA's linearity limitation (§32.10).
- Gareth James et al., An Introduction to Statistical Learning (ISL), §12.2. The gentler, undergraduate sibling of ESL, also free. Its PCA chapter is the most accessible introduction to choosing the number of components, the scree plot, and the proportion-of-variance-explained — the §32.7 material — with worked examples in R and Python.
- I. T. Jolliffe, Principal Component Analysis (2nd ed.). The encyclopedic monograph devoted entirely to PCA: every variant, every diagnostic, every application, including the Pearson-versus-Hotelling history (the [verify] claims in this chapter's Historical Note trace back here). The reference of last resort when you need to know whether a PCA question has a known answer.
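The proportion-of-variance-explained calculation that the ISL entry refers to can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the helper names `explained_variance_ratio` and `components_needed`, the 90% threshold, and the synthetic data are all hypothetical choices, not taken from ISL or ESL.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)                   # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    var = s**2                                # proportional to component variances
    return var / var.sum()

def components_needed(ratio, threshold=0.90):
    """Smallest k whose cumulative explained variance reaches the threshold."""
    k = int(np.searchsorted(np.cumsum(ratio), threshold)) + 1
    return min(k, len(ratio))                 # guard against rounding at the top end

rng = np.random.default_rng(2)  # illustrative seed
# Four independent features with standard deviations 3, 2, 1, 0.1
X = rng.normal(size=(2000, 4)) * np.array([3.0, 2.0, 1.0, 0.1])
ratio = explained_variance_ratio(X)
k = components_needed(ratio)
print(np.round(ratio, 3), k)
```

Plotting `ratio` against the component index gives the scree plot; the cumulative sum is what a fixed threshold rule reads off.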
The original papers and key applications
- Karl Pearson, "On lines and planes of closest fit to systems of points in space," Philosophical Magazine (1901). The founding paper, framing PCA as the best-fit line/plane minimizing perpendicular distances — the reconstruction-error view of §32.7.2. Short and readable.
- Harold Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology (1933). The paper that named the method and developed the maximum-variance formulation. The two papers together are the duality this chapter describes.
- Matthew Turk & Alex Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience (1991). The landmark application of PCA to computer vision (the §32.7.1 case and the Real-World Application callout). Shows PCA turning face recognition from intractable to practical.
- John Novembre et al., "Genes mirror geography within Europe," Nature (2008). The striking population-genetics result behind Case Study 1: the top two principal components of European genotypes reproduce a map of Europe. A vivid demonstration that PCA finds real structure.
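The duality between the Pearson and Hotelling papers comes down to one application of the Pythagorean theorem. As a short sketch, with notation assumed here rather than taken from either paper: let $x_1, \dots, x_n$ be the centered data points and $w$ a unit vector. Splitting each point into its component along $w$ and the perpendicular remainder gives

$$
\sum_{i=1}^{n} \|x_i\|^2 \;=\; \sum_{i=1}^{n} \left(w^{\mathsf{T}} x_i\right)^2 \;+\; \sum_{i=1}^{n} \left\| x_i - \left(w^{\mathsf{T}} x_i\right) w \right\|^2 .
$$

The left side does not depend on $w$. Maximizing the first term on the right (the projected variance, up to a constant factor, as in Hotelling's formulation) is therefore the same problem as minimizing the second (the total squared perpendicular distance to the line, as in Pearson's).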
Free online resources
- scikit-learn documentation, sklearn.decomposition.PCA. The reference implementation this chapter's toolkit verifies against. The user guide explains why scikit-learn computes PCA via the SVD of the centered data rather than by eigendecomposing the covariance matrix, confirming §32.8.1 in production code (newer releases also offer a covariance-based solver through the svd_solver parameter, so check which solver your version selects), and documents the sign convention, the explained_variance_ratio_ attribute, and the whitening option.
- 3Blue1Brown, Essence of Linear Algebra (YouTube). For rebuilding the geometric intuition: the change-of-basis and eigenvector videos are the visual foundation for "PCA is a rotation to the natural axes of the data."
- StatQuest with Josh Starmer, "PCA, Step-by-Step" (YouTube). The most popular gentle walkthrough of PCA for a data-science audience, covering centering, the scree plot, and loadings in plain language.
Where to go next in this book
- Chapter 33 (Application: Machine Learning) — PCA as preprocessing, embeddings, and the matrix-factorization recommenders that are the same low-rank idea.
- Chapter 38 (Numerical Linear Algebra) — the full story of conditioning and why squaring the data (forming the covariance) is numerically dangerous.
- Chapter 34 (Inner Product Spaces) — the infinite-dimensional generalization, where PCA becomes the Karhunen–Loève expansion and the spectral theorem lives in Hilbert space.
- Chapters 27 and 30 — the spectral theorem and the SVD, the two pillars this entire chapter rests on. If any step here felt shaky, the foundation is there.