Chapter 32 — Key Takeaways

The one idea

A cloud of data has the shape of an ellipsoid, and PCA finds its axes by diagonalizing the covariance matrix. Center the data, build the covariance, and its eigenvectors are the perpendicular directions of maximum variance (the principal components), while its eigenvalues are the variances along them. Dimensionality reduction is keeping the longest axes of the cloud and discarding the shortest.
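
The recipe in that paragraph can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's own code: the toy data, the mixing matrix `A`, and all variable names are assumptions made up for the example.

```python
import numpy as np

# Hypothetical toy data (not from the chapter): 200 samples, 3 correlated
# features, deliberately shifted away from the origin.
rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.2, 0.1]])
X = rng.normal(size=(200, 3)) @ A + np.array([10.0, -5.0, 2.0])

# 1. Center: subtract each column's mean so the cloud sits at the origin.
X_centered = X - X.mean(axis=0)

# 2. Build the covariance matrix C = X~^T X~ / (n - 1).
n = X_centered.shape[0]
C = X_centered.T @ X_centered / (n - 1)

# 3. Diagonalize. eigh is meant for symmetric matrices and returns the
#    eigenvalues in ascending order, so reverse to put the longest axis first.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
variances = eigvals[order]        # variance along each principal component
components = eigvecs[:, order]    # columns are the principal components

# Sanity check: the variance of the data projected onto PC1 is the top eigenvalue.
scores_pc1 = X_centered @ components[:, 0]
var_pc1 = scores_pc1.var(ddof=1)
print(variances, var_pc1)
```

Keeping the first few columns of `components` is the "keep the longest axes" step.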

The big ideas

  • Center first, always. Subtract the column means so the cloud sits at the origin. Variance is spread about the mean, and the covariance formula only measures shape once location is removed. Forgetting to center is the most common PCA mistake: the dominant "component" then points from the origin toward the cloud's center of mass instead of along the cloud's spread.

  • The covariance matrix is the hinge. $C = \tfrac{1}{n-1}\tilde X^{\mathsf{T}}\tilde X$ (for centered $\tilde X$) is symmetric (so the spectral theorem of Chapter 27 applies — real eigenvalues, perpendicular eigenvectors) and positive semidefinite (so the eigenvalues, which are variances, are $\ge 0$, Chapter 28). Its quadratic form $\mathbf{w}^{\mathsf{T}}C\mathbf{w}$ is the variance of the data projected onto $\mathbf{w}$.

  • PC1 is the direction of maximum variance, and the eigenvalue is that variance. Maximizing $\mathbf{w}^{\mathsf{T}}C\mathbf{w}$ over unit directions is a Rayleigh-quotient problem whose answer is the top eigenvector, with maximum value the top eigenvalue. The components, in order, are the best perpendicular directions one after another — the spectral theorem read as a sequence of variance maximizations.

  • Two routes, one answer. The principal components are both the eigenvectors of the covariance matrix and the right singular vectors of the centered data matrix ($\tilde X = U\Sigma V^{\mathsf{T}}$), with eigenvalues $\lambda_i = \sigma_i^2/(n-1)$. They coincide because substituting the SVD gives $\tilde X^{\mathsf{T}}\tilde X = V\Sigma^2 V^{\mathsf{T}}$, so $C = V\bigl(\tfrac{1}{n-1}\Sigma^2\bigr)V^{\mathsf{T}}$ is the spectral decomposition of $C$. The SVD route is numerically preferred: it never forms the covariance matrix, so it never squares the condition number (Chapters 20, 38).

  • Explained variance chooses $k$. Component $j$ explains a fraction $\lambda_j / \sum_i \lambda_i$ of the total variance (the trace of $C$). Keep the smallest $k$ whose cumulative explained variance clears your threshold (often 90–95%), or look for the scree-plot elbow. Projecting onto the top $k$ components reduces dimension; the total squared reconstruction error, summed over all $n$ samples, equals $(n-1)$ times the sum of the discarded eigenvalues, $(n-1)\sum_{i>k}\lambda_i$.

  • PCA has limits. It finds linear structure (flat subspaces, not curved manifolds), it is sensitive to feature scaling (standardize features on different scales, or use the correlation matrix), and its components are not always interpretable (they optimize variance, not meaning).
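
Several of the claims above can be checked numerically: that the eigh and SVD routes agree, that cumulative explained variance picks $k$, and that the squared reconstruction error matches the discarded variance. The sketch below is one way to do that, under stated assumptions: the synthetic data (two latent directions plus noise), the 95% threshold, and every variable name are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 300, 5

# Hypothetical data: two strong latent directions plus small isotropic noise.
latent = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d)) * 3.0
X = latent + 0.3 * rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)                      # center first

# Route 1: eigendecomposition of the covariance matrix (reversed to descending).
C = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of the centered data; the covariance matrix is never formed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (n - 1)                    # lambda_i = sigma_i^2 / (n - 1)

# The two routes agree; eigenvectors are only defined up to sign, so compare
# the absolute cosine between matched directions.
same_values = np.allclose(eigvals, svd_vals)
alignment = np.abs(np.sum(eigvecs * Vt.T, axis=0))

# Explained variance: smallest k whose cumulative ratio reaches the threshold.
ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(ratio)
k = int(np.searchsorted(cumulative, 0.95) + 1)

# Project onto the top k components, reconstruct, and measure the error.
W = Vt[:k].T                                 # d x k matrix of components
scores = Xc @ W                              # n x k reduced representation
X_hat = scores @ W.T                         # back in feature space (still centered)
sq_error = np.sum((Xc - X_hat) ** 2)
discarded = (n - 1) * eigvals[k:].sum()      # (n - 1) * sum of dropped eigenvalues

print(same_values, k, sq_error, discarded)
```

To recover data in the original coordinates, add the column means back to `X_hat`.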

Skills you gained

  • Center a data matrix and build its covariance matrix by hand and in numpy.
  • Find principal components two ways — eigh of the covariance and svd of the centered data — and verify they agree.
  • Compute explained-variance ratios, read a scree plot, and choose the number of components $k$.
  • Project data onto the top components for dimensionality reduction, reconstruct it, and quantify the reconstruction error as discarded variance.
  • Diagnose and fix the two classic failures: forgetting to center, and running PCA on unscaled features.
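
The two classic failures in the last item are easy to reproduce on synthetic data. The following is a hedged sketch, not a diagnostic procedure from the chapter: the helper `top_direction`, the data, and the feature scales are all invented for the demonstration.

```python
import numpy as np

def top_direction(M):
    """Unit right singular vector of M for its largest singular value (hypothetical helper)."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(2)
n = 500

# Failure 1: forgetting to center.
# The cloud spreads mostly along the second axis, but its mean lies far out
# along the first axis.
X = rng.normal(size=(n, 2)) * np.array([0.5, 2.0]) + np.array([50.0, 0.0])
mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
pc1_uncentered = top_direction(X)                   # chases the mean
pc1_centered = top_direction(X - X.mean(axis=0))    # follows the spread

# Failure 2: unscaled features.
# Features a and b are strongly correlated; feature c is independent of them
# but measured on a scale about 1000 times larger.
t = rng.normal(size=n)
a = t + 0.1 * rng.normal(size=n)
b = t + 0.1 * rng.normal(size=n)
c = 1000.0 * rng.normal(size=n)
F = np.column_stack([a, b, c])
Fc = F - F.mean(axis=0)
pc1_raw = top_direction(Fc)                         # dominated by c's units
F_std = Fc / Fc.std(axis=0, ddof=1)                 # standardize each column
pc1_std = top_direction(F_std)                      # PCA on the correlation matrix

print(abs(pc1_uncentered @ mean_dir), np.abs(pc1_centered))
print(np.abs(pc1_raw), np.abs(pc1_std))
```

Dividing the centered columns by their standard deviations is equivalent to running PCA on the correlation matrix, which is the fix the chapter recommends when features live on different scales.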

Terms to know

principal component analysis (PCA), centering, covariance matrix, principal component, score, loading, explained-variance ratio, scree plot, dimensionality reduction, variance maximization, Rayleigh quotient, right singular vector, whitening, reconstruction error, eigen-image / eigenface.

How this ties to the recurring themes

  • Eigenvalues and eigenvectors reveal what a matrix really does (Theme 6): the covariance matrix's eigenvectors are the data's natural axes, and its eigenvalues are the spread along them. PCA is the most consequential everyday use of this theme.
  • Geometry and algebra are two views of one object (Theme 2): "covariance matrix," "data ellipse," and "directions of maximum variance" are the same thing in three languages. The Chapter 27 stretch-along-perpendicular-axes picture, the Chapter 28 elliptical level set, and this chapter's data cloud are one geometry.
  • Linear algebra is the most applied branch of pure mathematics (Theme 4): the same spectral theorem / SVD that compresses images (Chapter 31) and solves least squares (Chapter 19) finds the principal components of data. One decomposition, used everywhere.
  • The four fundamental subspaces (Theme 5): the top $k$ principal components span the row space of the centered data's best rank-$k$ approximation; dropping small components projects each sample onto a lower-dimensional subspace of feature space.

Where this leads

Chapter 33 (Application: Machine Learning) is the climax of Part VI, and PCA is one of its load-bearing pillars. Dimensionality reduction is standard preprocessing before training a model — feeding a learner the top few components instead of hundreds of correlated raw features speeds training and fights overfitting. More deeply, the embeddings at the heart of modern machine learning are dimensionality reductions in spirit, and the matrix-factorization recommenders behind streaming and shopping are the same low-rank idea as PCA and Chapter 31's image compression — approximate a giant data matrix by the few directions that capture most of its structure. The covariance matrix's eigenvectors you learned to find here are the front door to the systems reshaping the world.

The thread to hold onto: PCA is not a new technique to memorize — it is the spectral theorem (Chapter 27) and the SVD (Chapter 30) applied to a covariance matrix. The spectral theorem supplies the meaning (perpendicular directions of maximum variance); the SVD supplies the computation (never square your data). Covariance, curvature, and stretch are one geometry. Rotate to the natural axes of the data, and keep the long ones.