Principal Component Analysis: Finding the Directions That Matter in Data

DataField.Dev

51 min read

> Learning paths. Math majors — read everything, especially the variance-maximization proof (§32.4, the Rayleigh-quotient argument that PC1 is the top eigenvector), the Math-Major Sidebar on the Courant–Fischer deflation characterization, and the...

Prerequisites

chapter-30-singular-value-decomposition
chapter-27-spectral-theorem

Learning Objectives

Explain PCA geometrically as finding the orthogonal directions of maximum variance in a cloud of data, before any algebra.
Center a data matrix and build its covariance matrix $C = \tfrac{1}{n-1}X^{\mathsf{T}}X$, and explain why $C$ is symmetric and positive semidefinite (Chapter 28).
Show that the principal components are the eigenvectors of the covariance matrix (spectral theorem, Chapter 27) AND the right singular vectors of the centered data matrix (SVD, Chapter 30), and that the two routes give the same answer.
Interpret each eigenvalue as the variance captured along its principal component, compute the explained-variance ratio, and choose the number of components $k$ to keep.
Project data onto the top $k$ components to reduce dimensionality, reconstruct the data, and quantify the reconstruction error as discarded variance.
Explain why the SVD route is the numerically preferred one (it never forms the covariance matrix) and implement `pca(X, k)` verified against numpy and sklearn.

In This Chapter

32.1 What does PCA actually do? (look at the cloud first)
32.2 Why do we center the data first?
32.3 What is the covariance matrix, and why is it symmetric?
32.4 Why is the first principal component the top eigenvector? (variance maximization)
32.5 How do you actually do PCA? (the eigen route, worked by hand)
32.6 Computing PCA in numpy (the eigen route, verified)
32.7 How do you choose how many components to keep?
32.8 Why is PCA secretly the SVD? (the second route)
32.9 What does PCA look like? (the data ellipse and its principal axes)
32.10 What can go wrong, and what PCA cannot do?
32.11 How do you implement PCA from scratch?
32.12 What have we built, and where does it lead?

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Principal Component Analysis: Finding the Directions That Matter in Data

Learning paths. Math majors — read everything, especially the variance-maximization proof (§32.4, the Rayleigh-quotient argument that PC1 is the top eigenvector), the Math-Major Sidebar on the Courant–Fischer deflation characterization, and the SVD–eigen equivalence of §32.8. CS / Data Science — focus on the Geometric Intuition (the data ellipse), the two computational routes, the explained-variance ratio and how to choose $k$, and the eigenfaces and genetics case studies; the deepest sidebars are optional. Physics / Engineering — focus on the geometry of principal axes, the covariance-as-shape picture, and the connection back to the principal-axis theorem of Chapter 27.

Give a thousand people a long questionnaire — fifty questions about their habits, opinions, and tastes — and you have a cloud of a thousand points floating in a fifty-dimensional space. You cannot picture it. You cannot plot it. And yet, almost always, the cloud is not really fifty-dimensional. The fifty answers are tangled together by a handful of hidden themes — political leaning, say, or risk tolerance — so the points actually huddle close to a flat sheet of two or three dimensions buried inside the fifty. Principal component analysis is the tool that finds that sheet. It is the single most widely used technique in all of applied data analysis, and this chapter is where you discover that it is nothing but the linear algebra you already know, wearing a data scientist's coat.

Here is the whole idea in one sentence, and it is worth reading twice: a cloud of data has a shape, that shape is an ellipsoid, and the axes of the ellipsoid are exactly the eigenvectors of the data's covariance matrix. We have met every piece of that sentence already. In Chapter 27 the spectral theorem told us a symmetric matrix is a pure stretch along perpendicular eigen-axes. In Chapter 28 we learned that a covariance matrix is symmetric and positive semidefinite, and that its quadratic form $\mathbf{w}^{\mathsf{T}}C\mathbf{w}$ is precisely the variance of the data projected onto the direction $\mathbf{w}$ — and we drew the data ellipse whose axes are the eigenvectors of $C$. PCA is the moment that picture stops being a remark at the end of a chapter and becomes the most useful algorithm in the data scientist's toolkit. The "principal components" are those perpendicular axes; the eigenvalues are the variances along them; and dimensionality reduction is keeping the longest axes and throwing away the shortest.

There are two roads to the same destination, and a major goal of this chapter is to show that they arrive at the identical answer. The first road runs through Chapter 27: build the covariance matrix and apply the spectral theorem to it — its eigenvectors are the principal components. The second road runs through Chapter 30: take the singular value decomposition of the centered data matrix itself — its right singular vectors are the principal components, and its singular values are the variances (up to a scale). The first road is how PCA is usually explained; the second is how PCA is actually computed, because — as we warned in Chapter 20 and will hammer home in Chapter 38 — forming the covariance matrix throws away precision you can never get back. By the end of this chapter you will understand both roads, see exactly why they coincide, and know which one to drive.

Two warnings sit at the gate, and we will repeat them throughout. First, PCA rests entirely on the spectral theorem (and the SVD), and the spectral theorem requires a symmetric matrix — which is exactly why we use the covariance matrix and not the data matrix directly. Everything that makes PCA work — real variances, perpendicular components, orthogonal axes — is the symmetry of $C$ cashing out as geometry. Second, PCA finds linear structure: it looks for flat sheets, not curved ones, and it measures importance by variance, which makes it sensitive to the scale of your features. Keep both in mind; they are the difference between PCA done right and PCA done badly. As always, let us look at the picture first.

32.1 What does PCA actually do? (look at the cloud first)

Forget formulas for a moment and just look at a cloud of points. Imagine a scatter of data in the plane — measurements of two features, plotted as dots — and suppose the dots form a long, tilted, cigar-shaped blob running from lower-left to upper-right. The cloud has an obvious grain to it, like the grain in a piece of wood: there is a direction along which it spreads out a lot, and a perpendicular direction along which it spreads out very little. PCA is the algorithm that finds those directions. It draws the long axis of the cloud — the direction of greatest spread — and calls it the first principal component. Then it draws the perpendicular axis — the next-greatest spread — and calls it the second. In higher dimensions it keeps going, each new component perpendicular to all the previous ones, each capturing as much of the remaining spread as it can.

Why would you want those directions? Because they let you summarize. If the cloud is a thin cigar, then knowing where a point falls along the long axis tells you almost everything about it; its position along the short axis is nearly constant noise. So you can describe each two-feature point with a single number — its coordinate along the long axis — and lose almost nothing. You have compressed two dimensions into one by rotating your coordinate system to align with the grain of the data. That is the entire point of PCA: find the natural axes of the data, then keep only the few that carry the spread. And here is the quiet thesis of the whole chapter: the PCA linear algebra is not new — finding those natural axes is exactly diagonalizing a symmetric matrix, the spectral theorem of Chapter 27, applied to the data's covariance.

Geometric Intuition — A cloud of data points has a shape, and for data with linear correlations that shape is an ellipse (in 2D) or an ellipsoid (in higher dimensions). PCA finds the axes of that ellipsoid. The longest axis is the first principal component — the direction the data varies most. The next-longest perpendicular axis is the second component, and so on. Picture rotating a coordinate frame until its axes line up with the ellipse's axes: in that rotated frame the features become uncorrelated, and a few axes carry almost all the spread. That is PCA, drawn before any algebra: it is a rotation to the natural axes of the data cloud.

Notice the word perpendicular. The principal components are always mutually orthogonal — at right angles — and this is not a design choice we impose; it falls out for free. The reason is the deepest fact of this chapter, and you already proved it: the components are eigenvectors of a symmetric matrix (the covariance matrix), and Chapter 27 showed that a symmetric matrix has an orthonormal basis of eigenvectors. The perpendicularity of PCA's axes is the §27.5 orthogonality of eigenvectors, in disguise. Hold onto that; we will keep pointing at it.

The Key Insight — PCA is the spectral theorem applied to a covariance matrix. The principal components are the orthonormal eigenvectors of $C$ (perpendicular because $C$ is symmetric, Chapter 27), and the eigenvalues are the variances along them. Equivalently, PCA is the SVD of the centered data. Either way, "find the directions that matter in data" is "diagonalize a symmetric matrix."

32.1.1 Why variance is the thing we maximize

There is a choice hidden in "the direction of greatest spread," and it is worth making explicit: PCA measures spread by variance. The first principal component is defined to be the unit direction along which the projected data has the largest variance; the second is the largest-variance direction perpendicular to the first; and so on. Variance — the average squared distance from the mean — is the natural measure of "how much does the data spread this way," and Chapter 28 gave us the exact tool to compute it for any direction: the quadratic form of the covariance matrix. We will see in §32.4 that maximizing this variance is precisely an eigenvalue problem, and that the maximizing direction is the top eigenvector of $C$. For now, hold the picture: PC1 is the direction of maximum variance, and its eigenvalue is that maximum variance.

This choice — variance as importance — is also PCA's greatest strength and its most notorious trap. It is a strength because variance is exactly what "carries information" means for many datasets: a feature that never changes tells you nothing, and a direction along which everything is identical can be dropped without loss. It is a trap because variance has units, so a feature measured in millimeters will have a thousand times the variance of the same feature measured in meters, and PCA will chase the large-variance feature simply because of its scale. We will return to this in the Common Pitfall of §32.7; for now, simply register that "important" in PCA means "high variance," and that you, the analyst, are responsible for making sure variance means what you want it to.

32.2 Why do we center the data first?

Every PCA begins with a step so simple it is easy to skip and so essential that skipping it breaks everything: centering. We subtract the mean of each feature, shifting the entire cloud so that its center of mass sits at the origin. If $\mathbf{x}_1,\dots,\mathbf{x}_n$ are the data points (rows of a data matrix $X$, one row per sample, one column per feature), the column mean is

$$\boldsymbol\mu = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i,$$

and the centered data matrix has the mean subtracted from every row:

$$\tilde{X} = X - \mathbf{1}\boldsymbol\mu^{\mathsf{T}}, \qquad \text{so each centered point is } \tilde{\mathbf{x}}_i = \mathbf{x}_i - \boldsymbol\mu.$$

(The notation $\mathbf{1}\boldsymbol\mu^{\mathsf{T}}$ is an $n\times d$ matrix with the mean row $\boldsymbol\mu^{\mathsf{T}}$ repeated in every row — we subtract the same mean from each sample.) After centering, every column of $\tilde X$ sums to zero; the cloud is balanced around the origin.

Why does this matter so much? Because variance and covariance are defined as spread about the mean, and the slick matrix formula for covariance — the one that lets us use the spectral theorem — only works when the mean is already zero. Geometrically, the principal components are directions through the origin, and a direction is only meaningful relative to the cloud's center. If you forgot to center, the dominant "direction of variation" PCA would find is the direction from the origin to the cloud's center of mass — an artifact of where the cloud happens to sit, not of its internal shape. The whole point is to capture the shape of the cloud, and shape is intrinsic only once you have removed the location.

Geometric Intuition — Centering slides the data cloud so its balance point lands on the origin, without rotating or rescaling it — the shape is untouched, only the location moves. Picture a swarm of bees hovering over a field: centering moves the swarm so its center hangs over your feet, so that "which way does it spread?" is asked from inside the swarm rather than from across the field. PCA then finds the spread directions of the swarm itself, not the direction you happen to be standing from it.

Common Pitfall — Forgetting to center is the single most common PCA mistake, and it silently corrupts the result. If you skip centering and compute the "covariance" as $\tfrac{1}{n-1}X^{\mathsf{T}}X$ on the raw data, you are not measuring variance about the mean — you are measuring the second moment about the origin, which mixes the cloud's shape with its distance from the origin. The first "principal component" you get will point roughly toward the cloud's center of mass, drowning out the real structure. Always center first. (Libraries like scikit-learn center for you automatically, which is exactly why a from-scratch implementation must remember to do it by hand — see the Build Your Toolkit callout.)

32.3 What is the covariance matrix, and why is it symmetric?

With the data centered, we can build the object at the heart of PCA: the covariance matrix. For a centered data matrix $\tilde X$ of shape $n\times d$ ($n$ samples, $d$ features), the covariance matrix is the $d\times d$ symmetric matrix

$$\boxed{\,C = \frac{1}{n-1}\,\tilde X^{\mathsf{T}}\tilde X\,}$$

(We lock this notation for the chapter, matching Chapter 28: $C = \tfrac{1}{n-1}X^{\mathsf{T}}X$ for centered $X$. The divisor $n-1$ rather than $n$ is the standard unbiased "sample covariance" correction; for the geometry of PCA the divisor is just an overall scale and the principal directions do not depend on it at all.) Let us unpack what this matrix contains and why it has the structure it does, because every property we need flows from this formula.

The $(i,j)$ entry of $C$ is $\tfrac{1}{n-1}\sum_k \tilde x_{ki}\tilde x_{kj}$ — the dot product of the $i$-th centered feature column with the $j$-th, scaled. When $i = j$ this is the variance of feature $i$ (the average squared deviation from its mean); when $i\ne j$ it is the covariance between features $i$ and $j$ (positive if they tend to rise together, negative if one rises as the other falls, zero if they are uncorrelated). So the diagonal of $C$ holds the variances, and the off-diagonal holds the pairwise covariances — the entire first- and second-order structure of how the features co-vary, packed into one matrix.

Now the two structural facts that make PCA work, both inherited directly from earlier chapters.

$C$ is symmetric. This is immediate from the formula: $C^{\mathsf{T}} = \tfrac{1}{n-1}(\tilde X^{\mathsf{T}}\tilde X)^{\mathsf{T}} = \tfrac{1}{n-1}\tilde X^{\mathsf{T}}\tilde X = C$, using $(AB)^{\mathsf{T}} = B^{\mathsf{T}}A^{\mathsf{T}}$ from Chapter 8. Equivalently, the covariance of feature $i$ with feature $j$ is the same as the covariance of $j$ with $i$ — symmetry is a fact about data, not just algebra. This symmetry is the licence to use the spectral theorem. Without it, there would be no guarantee of real eigenvalues or perpendicular eigenvectors, and PCA would not exist.

$C$ is positive semidefinite. We proved this in Chapter 28, and the argument is worth recalling in one line because it explains the meaning of everything that follows. For any direction $\mathbf{w}$,

$$\mathbf{w}^{\mathsf{T}}C\mathbf{w} = \frac{1}{n-1}\mathbf{w}^{\mathsf{T}}\tilde X^{\mathsf{T}}\tilde X\mathbf{w} = \frac{1}{n-1}(\tilde X\mathbf{w})^{\mathsf{T}}(\tilde X\mathbf{w}) = \frac{1}{n-1}\lVert \tilde X\mathbf{w}\rVert^2 \ge 0.$$

The quadratic form is a squared length over $n-1$, so it can never be negative — $C$ is positive semidefinite, and by the eigenvalue test of Chapter 28 all its eigenvalues are $\ge 0$. But look closer at what $\mathbf{w}^{\mathsf{T}}C\mathbf{w}$ is. The vector $\tilde X\mathbf{w}$ is the list of projections of the centered samples onto the direction $\mathbf{w}$ — the coordinate of each data point along $\mathbf{w}$ — and (since the data is centered) $\tfrac{1}{n-1}\lVert\tilde X\mathbf{w}\rVert^2$ is exactly the variance of the data projected onto $\mathbf{w}$. So:

$$\operatorname{Var}(\text{data projected onto unit } \mathbf{w}) = \mathbf{w}^{\mathsf{T}}C\mathbf{w}.$$

This is the formula that turns PCA into an eigenvalue problem. "How much does the data spread along direction $\mathbf{w}$?" is answered by the quadratic form $\mathbf{w}^{\mathsf{T}}C\mathbf{w}$, and "which direction spreads the most?" is the question of maximizing that quadratic form — which §32.4 solves with the spectral theorem.

Warning

— PCA requires the spectral theorem, and the spectral theorem requires a symmetric matrix. This is why we build the covariance matrix $C = \tfrac{1}{n-1}\tilde X^{\mathsf{T}}\tilde X$ rather than working with the rectangular data matrix $\tilde X$ directly: $\tilde X$ is $n\times d$ and generally not even square, so it has no eigenvectors, but $\tilde X^{\mathsf{T}}\tilde X$ is square, symmetric, and positive semidefinite, so the spectral theorem (Chapter 27) applies in full. The components come out real and perpendicular precisely because $C$ is symmetric. Never try to "find the eigenvectors of the data matrix" — a non-square matrix has none. The symmetric Gram matrix $\tilde X^{\mathsf{T}}\tilde X$ is the bridge, exactly as it was for the SVD in Chapter 30.

Check Your Understanding — A centered dataset has covariance matrix $C = \begin{psmallmatrix}4 & 0\\ 0 & 1\end{psmallmatrix}$. Without computing eigenvectors, what is the variance of the data along the $x$-axis? Along the $45°$ direction $\mathbf{w} = \tfrac{1}{\sqrt2}(1,1)$?

Answer

Along the $x$-axis ($\mathbf{w} = (1,0)$): $\mathbf{w}^{\mathsf{T}}C\mathbf{w} = 4$ — just the top-left entry, the variance of feature 1. Along the $45°$ direction: $\mathbf{w}^{\mathsf{T}}C\mathbf{w} = \tfrac12(1,1)\begin{psmallmatrix}4&0\\0&1\end{psmallmatrix}\binom{1}{1} = \tfrac12(4 + 1) = 2.5$. The variance along the diagonal is the average of the two axis variances here (because the features are uncorrelated). The maximum variance is $4$, achieved along the $x$-axis — which is exactly the top eigenvector, since $C$ is already diagonal. PC1 is $(1,0)$ with variance $\lambda_1 = 4$.

32.4 Why is the first principal component the top eigenvector? (variance maximization)

Now we earn the central claim of the chapter with a real argument, in the four-part proof style of this book. The claim is that the direction of maximum variance — the first principal component — is the eigenvector of $C$ with the largest eigenvalue, and that the maximum variance itself is that eigenvalue. This is the bridge between the geometric picture ("the long axis of the cloud") and the algebra ("the top eigenvector of the covariance matrix"), and it is the reason the spectral theorem governs data analysis.

1. Why we care. Everything in PCA hangs on this. If the maximum-variance direction were not an eigenvector, PCA would be a separate optimization problem with no closed-form answer. The fact that it is the top eigenvector means we can find all the principal components at once by diagonalizing a single symmetric matrix — turning a hard-looking optimization ("search all directions for the one that spreads the data most") into a clean eigenproblem we already know how to solve. This is the theoretical guarantee that PCA is exactly the spectral theorem, not merely inspired by it.

2. Key idea. We want to maximize the variance $\mathbf{w}^{\mathsf{T}}C\mathbf{w}$ over all unit directions $\mathbf{w}$ (we must constrain $\lVert\mathbf{w}\rVert = 1$, or we could inflate the variance arbitrarily just by scaling $\mathbf{w}$). The ratio we are really maximizing is the Rayleigh quotient $R(\mathbf{w}) = \dfrac{\mathbf{w}^{\mathsf{T}}C\mathbf{w}}{\mathbf{w}^{\mathsf{T}}\mathbf{w}}$, and the spectral theorem tells us its maximum is the largest eigenvalue, attained at the corresponding eigenvector. The proof is a one-line consequence of diagonalizing in the eigenbasis.

3. Proof. Since $C$ is symmetric, the spectral theorem (Chapter 27) gives an orthonormal eigenbasis $\mathbf{q}_1,\dots,\mathbf{q}_d$ with real eigenvalues, which we order from largest to smallest: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$ (non-negative because $C$ is positive semidefinite, §32.3). Write any unit vector $\mathbf{w}$ in this eigenbasis: $\mathbf{w} = \sum_i c_i\mathbf{q}_i$ with $\sum_i c_i^2 = \lVert\mathbf{w}\rVert^2 = 1$ (the coordinates in an orthonormal basis, Chapter 20). Now compute the variance using the spectral decomposition $C\mathbf{q}_i = \lambda_i\mathbf{q}_i$:

$$\mathbf{w}^{\mathsf{T}}C\mathbf{w} = \Bigl(\sum_i c_i\mathbf{q}_i\Bigr)^{\mathsf{T}} C \Bigl(\sum_j c_j\mathbf{q}_j\Bigr) = \sum_{i,j} c_i c_j\,\mathbf{q}_i^{\mathsf{T}}(\lambda_j\mathbf{q}_j) = \sum_{i,j} c_i c_j \lambda_j\,(\mathbf{q}_i^{\mathsf{T}}\mathbf{q}_j) = \sum_{i} \lambda_i c_i^2,$$

where the last step used orthonormality, $\mathbf{q}_i^{\mathsf{T}}\mathbf{q}_j = 0$ for $i\ne j$ and $1$ for $i = j$. So the variance along $\mathbf{w}$ is a weighted average of the eigenvalues, with weights $c_i^2$ that sum to 1. A weighted average can be no larger than the largest value being averaged:

$$\mathbf{w}^{\mathsf{T}}C\mathbf{w} = \sum_i \lambda_i c_i^2 \le \lambda_1 \sum_i c_i^2 = \lambda_1,$$

with equality exactly when all the weight sits on the largest eigenvalue — that is, $c_1 = 1$ and the rest zero, meaning $\mathbf{w} = \mathbf{q}_1$. Therefore the maximum variance over all unit directions is $\lambda_1$, attained at the top eigenvector $\mathbf{q}_1$. $\blacksquare$

4. What this means. The most spread-out direction in the data is the top eigenvector of the covariance matrix, and the spread in that direction is its eigenvalue. This is the precise sense in which the eigenvalue is the variance along its principal component. The same argument repeated on the orthogonal complement of $\mathbf{q}_1$ (restrict to directions perpendicular to $\mathbf{q}_1$ and maximize again) gives the second principal component as $\mathbf{q}_2$ with variance $\lambda_2$, and so on down the spectrum — each component the best remaining direction, perpendicular to all before it. PCA is the spectral theorem read as a sequence of variance-maximization problems.

Math-Major Sidebar — The "restrict to the orthogonal complement and maximize again" step is the Courant–Fischer / deflation characterization of eigenvalues, and it is the rigorous engine behind the greedy "next-best perpendicular direction" description of PCA. Formally, $\lambda_k = \max\{\,\mathbf{w}^{\mathsf{T}}C\mathbf{w} : \lVert\mathbf{w}\rVert = 1,\ \mathbf{w}\perp\mathbf{q}_1,\dots,\mathbf{q}_{k-1}\,\}$, attained at $\mathbf{q}_k$. This is the same variational view that proved the spectral theorem itself in the Chapter 27 Math-Major Sidebar (maximize the Rayleigh quotient, peel off the maximizer, repeat). PCA and the spectral theorem are not analogous — they are the identical theorem, one phrased in the language of operators and the other in the language of data variance. The deflation view also generalizes cleanly to the infinite-dimensional setting (Chapter 34), where it underlies the Karhunen–Loève expansion, PCA's continuous cousin.

We can watch this maximization happen. Take the covariance matrix $C = \begin{psmallmatrix}2.5 & 1.5\\ 1.5 & 2.5\end{psmallmatrix}$ we will meet as our running example in §32.5, and sweep a unit direction $\mathbf{w} = (\cos\theta, \sin\theta)$ all the way around the circle, computing the projected variance $\mathbf{w}^{\mathsf{T}}C\mathbf{w}$ at each angle:

# Variance along a direction peaks at the top eigenvector. Sweep the angle.
import numpy as np
C = np.array([[2.5, 1.5], [1.5, 2.5]])
for deg in [0, 30, 45, 60, 90, 135]:
    th = np.radians(deg)
    w = np.array([np.cos(th), np.sin(th)])          # unit direction
    print(f"{deg:>4}deg: variance w^T C w = {w @ C @ w:.4f}")

   0deg: variance w^T C w = 2.5000
  30deg: variance w^T C w = 3.7990
  45deg: variance w^T C w = 4.0000
  60deg: variance w^T C w = 3.7990
  90deg: variance w^T C w = 2.5000
 135deg: variance w^T C w = 1.0000

The variance is a smooth function of direction, and it peaks at exactly $45°$ with value $4.0000$ — which is $\lambda_1$, the largest eigenvalue — and bottoms out at $135°$ (perpendicular to the peak) with value $1.0000 = \lambda_2$. The maximizing direction $(\cos 45°, \sin 45°) = \tfrac{1}{\sqrt2}(1,1)$ is precisely the top eigenvector $\mathbf{q}_1$, and the minimizing direction is $\mathbf{q}_2$. The proof said the variance is a weighted average of the eigenvalues pinned between $\lambda_2 = 1$ and $\lambda_1 = 4$; the sweep shows it, and shows the extremes landing on the eigenvectors. This little experiment is PCA in one dimension of search — and the spectral theorem hands us all the extremal directions at once, without sweeping.

32.5 How do you actually do PCA? (the eigen route, worked by hand)

Time to make all of this concrete on a real dataset, by hand, with numbers clean enough to follow with a pencil. We will use a small two-feature dataset of five points, chosen so the arithmetic stays tidy and — pleasingly — so the covariance matrix's eigenvectors land on the same $\pm45°$ axes we met for the matrix $\begin{psmallmatrix}2&1\\1&2\end{psmallmatrix}$ back in Chapter 27. Watching the same eigen-axes reappear in a data-analysis context is the book's geometry-algebra unity at work.

Here is the raw data — five samples, two features each:

$$X = \begin{bmatrix} 7 & 7 \\ 3 & 3 \\ 6 & 4 \\ 4 & 6 \\ 5 & 5 \end{bmatrix}.$$

Step 1 — center. The column means are $\mu_1 = \tfrac{7+3+6+4+5}{5} = 5$ and $\mu_2 = \tfrac{7+3+4+6+5}{5} = 5$, so $\boldsymbol\mu = (5, 5)$. Subtract it from every row to get the centered matrix:

$$\tilde X = \begin{bmatrix} 2 & 2 \\ -2 & -2 \\ 1 & -1 \\ -1 & 1 \\ 0 & 0 \end{bmatrix}.$$

Each column now sums to zero — check: $2 - 2 + 1 - 1 + 0 = 0$ for both columns. Good.

Step 2 — build the covariance matrix. With $n = 5$ samples, $n - 1 = 4$. Compute $\tilde X^{\mathsf{T}}\tilde X$ (a $2\times2$ matrix of dot products of the centered columns):

$$\tilde X^{\mathsf{T}}\tilde X = \begin{bmatrix} 2^2 + 2^2 + 1^2 + 1^2 + 0 & \;2\cdot2 + 2\cdot2 + 1\cdot(-1) + (-1)\cdot1 + 0 \\ \text{(same)} & 2^2 + 2^2 + 1^2 + 1^2 + 0 \end{bmatrix} = \begin{bmatrix} 10 & 6 \\ 6 & 10 \end{bmatrix}.$$

Divide by $n - 1 = 4$:

$$C = \frac{1}{4}\begin{bmatrix} 10 & 6 \\ 6 & 10 \end{bmatrix} = \begin{bmatrix} 2.5 & 1.5 \\ 1.5 & 2.5 \end{bmatrix}.$$

The diagonal entries $2.5$ are the variances of each feature; the off-diagonal $1.5$ is their covariance — positive, so the two features rise together, which is why the cloud tilts up to the right.

Step 3 — diagonalize $C$ (the spectral theorem). $C$ is symmetric, so the spectral theorem applies: real eigenvalues, perpendicular eigenvectors. Find the eigenvalues from the characteristic polynomial (Chapter 24):

$$\det(C - \lambda I) = (2.5 - \lambda)^2 - 1.5^2 = \lambda^2 - 5\lambda + (6.25 - 2.25) = \lambda^2 - 5\lambda + 4 = (\lambda - 4)(\lambda - 1) = 0.$$

So the eigenvalues are $\lambda_1 = 4$ and $\lambda_2 = 1$ — both real and non-negative, as the spectral theorem and positive semidefiniteness promised. The eigenvectors: for $\lambda_1 = 4$, solve $(C - 4I)\mathbf{v} = \mathbf{0}$, i.e. $\begin{psmallmatrix}-1.5 & 1.5\\ 1.5 & -1.5\end{psmallmatrix}\mathbf{v} = \mathbf{0}$, giving $v_1 = v_2$, so $\mathbf{q}_1 \propto (1, 1)$. For $\lambda_2 = 1$, $\begin{psmallmatrix}1.5 & 1.5\\ 1.5 & 1.5\end{psmallmatrix}\mathbf{v} = \mathbf{0}$ gives $v_1 = -v_2$, so $\mathbf{q}_2 \propto (1, -1)$. Normalize to unit length (divide by $\sqrt2$):

$$\mathbf{q}_1 = \frac{1}{\sqrt2}\begin{bmatrix}1\\1\end{bmatrix}\ (\lambda_1 = 4), \qquad \mathbf{q}_2 = \frac{1}{\sqrt2}\begin{bmatrix}1\\-1\end{bmatrix}\ (\lambda_2 = 1).$$

Check orthogonality: $\mathbf{q}_1\cdot\mathbf{q}_2 = \tfrac12(1 - 1) = 0$ — perpendicular, automatically, by §27.5. These two unit eigenvectors are the principal components. PC1 points along the $+45°$ diagonal (the long axis of the cloud), and the variance along it is $\lambda_1 = 4$. PC2 points along $-45°$, with variance $\lambda_2 = 1$.

Step 4 — explained variance. The total variance in the data is the trace of $C$ (the sum of the feature variances), which equals the sum of the eigenvalues (Chapter 27): $\operatorname{tr}(C) = 2.5 + 2.5 = 5 = 4 + 1$. So PC1 carries $\tfrac{4}{5} = 80\%$ of the total variance and PC2 carries $\tfrac{1}{5} = 20\%$. The explained-variance ratios are $0.8$ and $0.2$. If you wanted to compress this dataset to one dimension, you would keep PC1 and retain $80\%$ of the spread. We will verify every one of these numbers in numpy in §32.6, and they will match exactly.

Geometric Intuition — The principal components are the axes of the data ellipse, and the eigenvalues set its proportions. Here PC1 (the $+45°$ direction, $\lambda_1 = 4$) is the long axis and PC2 (the $-45°$ direction, $\lambda_2 = 1$) is the short axis — the ellipse is twice as long as it is wide, because the standard deviations (square roots of the variances) are $\sqrt 4 = 2$ and $\sqrt 1 = 1$. This is the exact same ellipse-from-a-symmetric-matrix picture from Chapter 28, now describing the spread of data instead of the contours of a quadratic form. Covariance and curvature are the same geometry; PCA reads the data's ellipse off the covariance matrix's eigenvectors.

32.5.1 Projecting the data: the scores

Finding the components is only half of PCA; the payoff is using them to re-express the data. The coordinate of a centered point $\tilde{\mathbf{x}}_i$ along a principal component $\mathbf{q}_j$ is just the dot product $\tilde{\mathbf{x}}_i\cdot\mathbf{q}_j$ — its projection onto that axis (Chapter 19). Stacking all the projections, the scores matrix is

$$Z = \tilde X\,V, \qquad V = [\,\mathbf{q}_1\ \mathbf{q}_2\ \cdots\ \mathbf{q}_d\,],$$

where $V$ has the principal components as columns. Each row of $Z$ is one data point expressed in the new, rotated coordinate system of the principal axes. For our example, projecting each centered point onto $\mathbf{q}_1 = \tfrac{1}{\sqrt2}(1,1)$ and $\mathbf{q}_2 = \tfrac{1}{\sqrt2}(1,-1)$:

$$Z = \tilde X V = \begin{bmatrix} 2 & 2 \\ -2 & -2 \\ 1 & -1 \\ -1 & 1 \\ 0 & 0 \end{bmatrix} \frac{1}{\sqrt2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} = \begin{bmatrix} 2\sqrt2 & 0 \\ -2\sqrt2 & 0 \\ 0 & \sqrt2 \\ 0 & -\sqrt2 \\ 0 & 0 \end{bmatrix} \approx \begin{bmatrix} 2.83 & 0 \\ -2.83 & 0 \\ 0 & 1.41 \\ 0 & -1.41 \\ 0 & 0 \end{bmatrix}.$$

Read this matrix. The first two data points (which lay on the $+45°$ diagonal) have all their score on PC1 and zero on PC2 — they live entirely along the long axis. The next two points (on the $-45°$ diagonal) have all their score on PC2. In the rotated coordinates the structure is laid bare. And notice: the variance of the PC1 scores is $\tfrac{1}{4}\bigl((2\sqrt2)^2 + (2\sqrt2)^2 + 0 + 0 + 0\bigr) = \tfrac{1}{4}(8 + 8) = 4 = \lambda_1$, and the variance of the PC2 scores is $\tfrac14(2 + 2) = 1 = \lambda_2$. The variance of the scores along each component equals its eigenvalue — exactly as §32.4 promised. The rotation to principal axes has decorrelated the data: the two score columns are uncorrelated, because the off-diagonal of the covariance in the new basis is zero. That decorrelation is the whole reason PCA is useful.

32.5.2 Scores, loadings, and whitening — three things to keep straight

PCA produces several different matrices, and they get confused constantly, so let us name them precisely. The scores $Z = \tilde X V$ are the data points in the new coordinates — one row per sample, telling you where each point sits along each principal axis. The loadings are the principal components themselves, the columns of $V$ — they tell you how each original feature contributes to each component (the loading of feature $i$ on PC $j$ is the $i$-th entry of $\mathbf{q}_j$). A useful way to keep them straight: scores describe the samples (the rows of your data), loadings describe the features (the columns). When you make the classic PCA scatter plot — points plotted by their first two scores — you are plotting the rows; when you ask "which features drive PC1?", you are reading the loadings, the entries of the first component.

The decorrelation we just noticed has a name and a sharper form worth knowing, because it appears throughout statistics and machine learning. Recall the covariance of the scores: since $Z = \tilde X V$ and $C = V\Lambda V^{\mathsf{T}}$, the covariance of the scores is

$$\frac{1}{n-1}Z^{\mathsf{T}}Z = \frac{1}{n-1}V^{\mathsf{T}}\tilde X^{\mathsf{T}}\tilde X V = V^{\mathsf{T}}C\,V = V^{\mathsf{T}}(V\Lambda V^{\mathsf{T}})V = \Lambda,$$

a diagonal matrix — exactly the eigenvalues. So in the principal-component coordinates the data is uncorrelated, and the variance along component $j$ is $\lambda_j$. If you go one step further and divide each score column by its standard deviation $\sqrt{\lambda_j}$, you produce whitened data whose covariance is the identity $I$ — every direction has unit variance and zero correlation, a perfectly spherical cloud. Whitening, computed as $\tilde X V\Lambda^{-1/2}$, is the matrix-square-root trick we previewed in Chapter 27 ($\S$27.7.2): it transforms a tilted, stretched data ellipsoid into a unit ball, removing all linear correlations at once. Many algorithms — from the Mahalanobis distance of Chapter 28 to the preprocessing inside neural networks — are, under the hood, whitening the data first so that every direction is on equal footing.

Check Your Understanding — You run PCA and the first principal component (the first loading vector) is $\mathbf{q}_1 = (0.8, 0.0, 0.6)$ over three features (age, height, income, say). The score of a particular sample on PC1 is $5.0$. Which original feature contributes nothing to PC1, and what does the score $5.0$ mean?
Answer
The middle feature (height) has loading $0.0$ on PC1, so it contributes nothing to the first component — PC1 is a blend of only age and income (with age weighted $0.8$ and income $0.6$). The score $5.0$ is the sample's coordinate along the PC1 axis: it is the projection $\tilde{\mathbf{x}}\cdot\mathbf{q}_1 = 5.0$, telling you the (centered) sample lies $5.0$ units out along the direction $(0.8, 0, 0.6)$. Loadings describe the axis (a property of features); the score describes where this sample sits on it (a property of the sample).

32.6 Computing PCA in numpy (the eigen route, verified)

Let us confirm every hand-computed number with code, using np.linalg.eigh — the function for symmetric matrices we adopted in Chapter 27. (Recall the indexing convention: mathematics writes $\mathbf{q}_1, \lambda_1$ one-indexed, while numpy's columns and entries are zero-indexed; the objects are the same, the labels shifted by one.)

# PCA the eigen way: center, build covariance, diagonalize with eigh.
import numpy as np
X = np.array([[7., 7.], [3., 3.], [6., 4.], [4., 6.], [5., 5.]])
mu = X.mean(axis=0)                       # column means -> [5, 5]
Xc = X - mu                               # CENTER the data (essential!)
C  = (Xc.T @ Xc) / (X.shape[0] - 1)       # covariance = (1/(n-1)) Xc^T Xc
print("covariance C =\n", C)
w, Q = np.linalg.eigh(C)                  # symmetric -> eigh: real evals, orthonormal Q
print("eigenvalues (ascending):", w)
print("eigenvectors (columns):\n", np.round(Q, 6))

covariance C =
 [[2.5 1.5]
 [1.5 2.5]]
eigenvalues (ascending): [1. 4.]
eigenvectors (columns):
 [[-0.707107  0.707107]
 [ 0.707107  0.707107]]

The covariance matrix is exactly $\begin{psmallmatrix}2.5 & 1.5\\ 1.5 & 2.5\end{psmallmatrix}$, and the eigenvalues are $1$ and $4$ — matching the hand computation. Note eigh returns eigenvalues in ascending order, so the largest (the first principal component) is last; we must sort into descending order to get the conventional PCA ordering. Let us do that, fix the sign convention, compute the explained-variance ratio, and project:

# Sort components by DESCENDING variance, fix sign, get explained-variance ratio, project.
order = np.argsort(w)[::-1]               # descending: largest eigenvalue first
evals = w[order]                          # [4., 1.]
V = Q[:, order]                           # principal components as columns, PC1 first
for j in range(V.shape[1]):               # sign convention: largest-|entry| positive
    if V[np.abs(V[:, j]).argmax(), j] < 0:
        V[:, j] *= -1
evr = evals / evals.sum()                 # explained-variance ratio
Z = Xc @ V                                # scores: data in principal-component coords
print("principal components (cols):\n", np.round(V, 6))
print("variances (eigenvalues):", evals)
print("explained-variance ratio:", evr)
print("scores Z =\n", np.round(Z, 6))
print("variance of each score column:", Z.var(axis=0, ddof=1))

principal components (cols):
 [[ 0.707107  0.707107]
 [ 0.707107 -0.707107]]
variances (eigenvalues): [4. 1.]
explained-variance ratio: [0.8 0.2]
scores Z =
 [[ 2.828427  0.      ]
 [-2.828427  0.      ]
 [ 0.        1.414214]
 [ 0.       -1.414214]
 [ 0.        0.      ]]
variance of each score column: [4. 1.]

Every number matches §32.5: PC1 is $\tfrac{1}{\sqrt2}(1,1)$ with variance $4$, PC2 is $\tfrac{1}{\sqrt2}(1,-1)$ with variance $1$, the explained-variance ratios are $0.8$ and $0.2$, the scores are $\pm2\sqrt2 \approx \pm2.83$ along PC1 and $\pm\sqrt2 \approx \pm1.41$ along PC2, and the variance of each score column equals its eigenvalue. The hand computation and the code agree completely.

Computational Note — Three numpy gotchas in PCA. First, use np.linalg.eigh, not eig, for the covariance matrix: eigh exploits symmetry to return guaranteed-real eigenvalues and orthonormal eigenvectors (Chapter 27), while eig can return tiny spurious imaginary parts and non-orthonormal vectors. Second, eigh returns eigenvalues in ascending order, the opposite of PCA's convention; you must reverse them so PC1 (the largest variance) comes first. Third, the sign of each component is arbitrary — eigh returned PC2's column as $(-0.707, 0.707)$ here, the negative of our hand choice, which the convention then flips; PC1's column already matched, which is fine since $-\mathbf{q}$ is an equally valid eigenvector. Libraries fix the sign by a convention (e.g. making the largest-magnitude entry positive, as we did) purely so results are reproducible; the principal axis is the same either way.

32.7 How do you choose how many components to keep?

PCA gives you as many components as there are features, but the entire point is to keep only the few that matter. How few? The tool is the explained-variance ratio we just computed: the fraction of the total variance carried by each component, $\dfrac{\lambda_k}{\sum_i \lambda_i}$. Since the eigenvalues are the variances along the components and they sum to the total variance (the trace of $C$, Chapter 27), these ratios sum to $1$ and tell you exactly how much of the data's spread each component captures. To choose $k$, you look at the cumulative explained variance and stop when you have captured enough.

The most common rule is a threshold: keep the smallest $k$ such that the top $k$ components explain at least, say, $90\%$ or $95\%$ of the total variance. In our toy example PC1 alone explains $80\%$, so to clear a $90\%$ bar you would need both components — but real datasets are far more compressible. A questionnaire with fifty correlated questions might have its first three components explain $95\%$ of the variance, collapsing fifty dimensions to three. A second common tool is the scree plot: plot the eigenvalues in descending order and look for the "elbow," the point where the curve flattens from a steep drop into a shallow tail. Components before the elbow capture real structure; the flat tail is mostly noise. A third heuristic, the Kaiser criterion, keeps components whose eigenvalue exceeds the average eigenvalue (or, for standardized data, exceeds $1$), on the logic that a component carrying less than its fair share of variance is not earning its keep.

There is no single correct answer — the choice of $k$ is a modeling decision that trades fidelity against compression — but the explained-variance ratio makes the trade-off quantitative and honest. Here is the computation on a more realistic example, a 3-feature dataset where the data nearly lives on a 2D plane:

# Choosing k via cumulative explained variance, on data that nearly lies on a plane.
import numpy as np
rng = np.random.default_rng(0)
t = rng.normal(size=(150, 2))                       # 2 latent factors
A = np.array([[2., 1.], [1., -1.], [0.5, 0.5]])     # embed into 3D...
data = t @ A.T + rng.normal(scale=0.15, size=(150, 3))   # ...plus small noise
Xc = data - data.mean(axis=0)
C = (Xc.T @ Xc) / (data.shape[0] - 1)
evals = np.sort(np.linalg.eigvalsh(C))[::-1]        # descending eigenvalues
evr = evals / evals.sum()
print("eigenvalues:        ", np.round(evals, 4))
print("explained variance: ", np.round(evr, 4))
print("cumulative:         ", np.round(np.cumsum(evr), 4))

eigenvalues:         [6.444  1.6995 0.0221]
explained variance:  [0.7892 0.2081 0.0027]
cumulative:          [0.7892 0.9973 1.    ]

The first two components explain $99.73\%$ of the variance; the third carries a mere $0.27\%$ — the noise we sprinkled in. The scree plot would show a sharp elbow after the second eigenvalue ($6.44, 1.70$, then a crash to $0.02$). So $k = 2$ is the obvious choice: we can compress this 3D data to 2D and lose essentially nothing, because the data was essentially 2D all along — it lived near a plane, and PCA found that plane. This is dimensionality reduction in action, and it is the everyday use of PCA.

Real-World Application — Exploratory data analysis and visualization. The most common use of PCA in data science is to compress high-dimensional data down to two or three components purely so it can be plotted. A dataset of gene-expression levels across twenty thousand genes, or customer behavior across hundreds of features, cannot be visualized — but its first two principal components can be scattered on a flat plot, and clusters, outliers, and trends that were invisible in the raw dimensions often leap out. When a biologist plots cells by their top two principal components and sees three distinct blobs, those blobs are cell types, found automatically by the spectral theorem. This is PCA as a microscope for high-dimensional data. You can see the full workflow in PCA in practice, and the variance it maximizes is the same variance you met in introductory statistics — here generalized from one variable to a direction in many.

32.7.1 A worked example from computer vision: compressing images

Let us see choosing $k$ play out on a different kind of data — images — because it is one of PCA's most striking applications and it connects directly to the image-compression anchor of Chapter 31. Treat each grayscale image as a long vector: an $8\times8$ image is a point in $64$-dimensional space (one coordinate per pixel), and a collection of images is a cloud in that space. If the images share structure — all faces, all handwritten digits, all of one object under different lighting — the cloud is far lower-dimensional than $64$, because the images are blends of a few recurring patterns. PCA finds those patterns. The principal components, reshaped back into images, are the eigen-images (for faces, the famous eigenfaces), and each image can be approximated as a weighted sum of just the top few.

Here is the computation on a synthetic stand-in for a face dataset: one hundred $8\times8$ "images," each built as a blend of three hidden prototype patterns plus a little pixel noise.

# PCA on images: 100 'images' (64 pixels each) built from 3 hidden patterns.
import numpy as np
rng = np.random.default_rng(7)
prototypes = rng.normal(size=(3, 64))                  # 3 hidden 'eigen-image' patterns
weights = rng.normal(size=(100, 3))                    # each image = blend of the 3
images = weights @ prototypes + rng.normal(scale=0.05, size=(100, 64))   # + noise
Xc = images - images.mean(axis=0)                      # center across the image set
S = np.linalg.svd(Xc, compute_uv=False)                # singular values (SVD route)
evr = S**2 / (S**2).sum()
print("top-6 explained-variance ratios:", np.round(evr[:6], 4))
print("top-3 cumulative variance:", round(100 * evr[:3].sum(), 2), "%")
# Storage: 100x3 scores + 3x64 components + 64-pixel mean, vs 100x64 raw
raw, compressed = 100 * 64, 100 * 3 + 3 * 64 + 64
print("floats raw:", raw, " compressed (k=3):", compressed,
      " ratio:", round(raw / compressed, 2), "x")

top-6 explained-variance ratios: [0.3928 0.3605 0.2456 0.0001 0.     0.    ]
top-3 cumulative variance: 99.9 %
floats raw: 6400  compressed (k=3): 556  ratio: 11.51 x

The first three components explain $99.9\%$ of the variance — and then the explained-variance ratios crash to essentially zero, a textbook scree-plot elbow at $k = 3$, because the images really were built from three patterns. Keeping three components, each $64$-pixel image is stored as three score numbers plus a shared set of three eigen-images and a mean image, cutting the storage by more than eleven-fold with no visible loss. This is exactly the low-rank approximation of Chapter 31's image compression, now applied across a set of images rather than within one: PCA discovers that a hundred apparently distinct images live in a three-dimensional space, and represents each by its coordinates there. A real eigenfaces system does the same with thousands of $100\times100$ face photographs, compressing each to a few dozen coefficients — the basis of early face-recognition algorithms, which we develop fully in the first case study.

Real-World Application — Eigenfaces and face recognition. The landmark application of PCA to computer vision is the eigenfaces method (Turk and Pentland, 1991): collect many aligned face images, run PCA, and keep the top components — the eigenfaces — which turn out to capture the major ways faces differ (lighting, broad shape, presence of glasses or a beard). Any face is then a point in this low-dimensional "face space," its coordinates the scores on each eigenface, and recognition becomes a nearest-neighbor search in a space of a few dozen dimensions instead of tens of thousands of raw pixels. PCA made face recognition computationally feasible decades before deep learning, and the same dimensionality-reduction-before-classification pattern is standard practice today.

32.7.2 Projecting to fewer dimensions, and the reconstruction error

Keeping $k$ components means projecting the data onto just the top $k$ principal axes. Let $V_k = [\,\mathbf{q}_1\ \cdots\ \mathbf{q}_k\,]$ be the $d\times k$ matrix of the leading components. The reduced data — the dimensionality-reduced representation — is the $n\times k$ scores matrix $Z_k = \tilde X V_k$, replacing $d$ numbers per sample with $k$. To go back, we reconstruct by re-expanding into the original feature space and adding the mean back:

$$\hat X = Z_k V_k^{\mathsf{T}} + \mathbf{1}\boldsymbol\mu^{\mathsf{T}} = \tilde X V_k V_k^{\mathsf{T}} + \mathbf{1}\boldsymbol\mu^{\mathsf{T}}.$$

The reconstruction $\hat X$ is the best rank-$k$ approximation of the data — the closest you can get to the original using only $k$ directions — and the gap between $X$ and $\hat X$ is the reconstruction error, the variance we threw away. Here is the beautiful part: the squared reconstruction error equals exactly the sum of the discarded eigenvalues times $(n-1)$. Watch it on our running example, reducing to $k = 1$:

# Reduce to k=1, reconstruct, and check the error equals the discarded variance.
import numpy as np
X = np.array([[7., 7.], [3., 3.], [6., 4.], [4., 6.], [5., 5.]])
mu = X.mean(axis=0); Xc = X - mu
C = (Xc.T @ Xc) / 4
w, Q = np.linalg.eigh(C)
order = np.argsort(w)[::-1]; evals = w[order]; V = Q[:, order]
k = 1                                          # keep only the first component
Vk = V[:, :k]
Z1 = Xc @ Vk                                   # reduced scores (5 x 1)
Xhat = Z1 @ Vk.T + mu                          # reconstruct (5 x 2)
err2 = np.linalg.norm(X - Xhat)**2             # squared Frobenius error
print("reduced scores (PC1 only):", np.round(Z1.ravel(), 4))
print("reconstruction:\n", np.round(Xhat, 4))
print("squared reconstruction error:", round(err2, 6))
print("(n-1) * discarded eigenvalue:", round(4 * evals[1], 6))

reduced scores (PC1 only): [ 2.8284 -2.8284  0.      0.      0.    ]
reconstruction:
 [[7. 7.]
 [3. 3.]
 [5. 5.]
 [5. 5.]
 [5. 5.]]
squared reconstruction error: 4.0
(n-1) * discarded eigenvalue: 4.0

Reducing to one dimension, the squared reconstruction error is exactly $4.0$, which equals $(n-1)\lambda_2 = 4\times1$ — the variance we discarded by dropping PC2. Notice what reconstruction did to the points: the two points that lay on the $+45°$ axis came back perfectly (they had no PC2 component to lose), while the points off that axis were snapped onto the PC1 line. PCA keeps what lies along the directions it kept and discards the rest, and the cost of discarding is precisely the variance in the dropped directions. This is the same low-rank-approximation idea from Chapter 31's image compression — there we kept the top singular values of an image; here we keep the top principal components of data. They are one idea, which §32.8 makes exact.

32.8 Why is PCA secretly the SVD? (the second route)

We have done PCA the eigen way: build $C = \tfrac{1}{n-1}\tilde X^{\mathsf{T}}\tilde X$ and diagonalize it. But Chapter 30 gave us a more powerful tool that applies to any matrix, including the rectangular centered data matrix $\tilde X$ itself: the singular value decomposition, $\tilde X = U\Sigma V^{\mathsf{T}}$. The claim of this section — and one of the most useful facts in computational data science — is that the right singular vectors of $\tilde X$ are exactly the principal components, so you can do PCA without ever forming the covariance matrix. Let us see why the two routes coincide, and then why the SVD route is the one you should actually use.

The connection is immediate once you write the covariance matrix in terms of the SVD. Recall from Chapter 30 that for $\tilde X = U\Sigma V^{\mathsf{T}}$, the matrix $V$ is orthogonal ($V^{\mathsf{T}}V = I$) and $\Sigma$ is diagonal with the singular values $\sigma_i$ on it. Then:

$$C = \frac{1}{n-1}\tilde X^{\mathsf{T}}\tilde X = \frac{1}{n-1}(U\Sigma V^{\mathsf{T}})^{\mathsf{T}}(U\Sigma V^{\mathsf{T}}) = \frac{1}{n-1}V\Sigma^{\mathsf{T}}U^{\mathsf{T}}U\Sigma V^{\mathsf{T}} = \frac{1}{n-1}V\Sigma^2 V^{\mathsf{T}},$$

using $U^{\mathsf{T}}U = I$ (the columns of $U$ are orthonormal) and $\Sigma^{\mathsf{T}}\Sigma = \Sigma^2$ (a diagonal matrix squared). But $C = V\bigl(\tfrac{1}{n-1}\Sigma^2\bigr)V^{\mathsf{T}}$ is precisely an orthogonal diagonalization of $C$ — it is $C = Q\Lambda Q^{\mathsf{T}}$ with $Q = V$ and $\Lambda = \tfrac{1}{n-1}\Sigma^2$. By the uniqueness of the spectral decomposition, the eigenvectors of $C$ are the right singular vectors of $\tilde X$, and the eigenvalues of $C$ are the squared singular values divided by $n-1$:

$$\boxed{\;\text{principal components} = \text{columns of } V \text{ (right singular vectors)}, \qquad \lambda_i = \frac{\sigma_i^2}{n-1}.\;}$$

This is the same trick Chapter 30 used to build the SVD in the first place — the right singular vectors were defined as the eigenvectors of $\tilde X^{\mathsf{T}}\tilde X$. PCA simply recognizes that $\tilde X^{\mathsf{T}}\tilde X$ is (up to the $\tfrac{1}{n-1}$ scale) the covariance matrix. So PCA is the SVD of the centered data matrix, full stop. The singular values rank the components by importance, the right singular vectors are the components themselves, and the scores are $Z = \tilde X V = U\Sigma$ (the left singular vectors scaled by the singular values).

Let us verify the equivalence on our running example — the two routes must give identical components and eigenvalues:

# PCA the SVD way: SVD of the CENTERED data; right singular vectors = components.
import numpy as np
X = np.array([[7., 7.], [3., 3.], [6., 4.], [4., 6.], [5., 5.]])
Xc = X - X.mean(axis=0)                          # center (still essential!)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U @ diag(S) @ Vt
print("singular values S:", np.round(S, 6))
print("right singular vectors (rows of Vt = components):\n", np.round(Vt, 6))
print("eigenvalues from SVD, S^2/(n-1):", np.round(S**2 / (X.shape[0] - 1), 6))
print("scores U*S:\n", np.round(U * S, 6))

singular values S: [4. 2.]
right singular vectors (rows of Vt = components):
 [[-0.707107 -0.707107]
 [-0.707107  0.707107]]
eigenvalues from SVD, S^2/(n-1): [4. 1.]
scores U*S:
 [[-2.828427 -0.      ]
 [ 2.828427  0.      ]
 [-0.        -1.414214]
 [ 0.         1.414214]
 [ 0.         0.      ]]

The singular values are $\sigma_1 = 4$ and $\sigma_2 = 2$, and $\sigma_i^2 / (n-1) = 16/4 = 4$ and $4/4 = 1$ — exactly the eigenvalues $\lambda_1 = 4, \lambda_2 = 1$ we found the eigen way. The right singular vectors (rows of Vt) are the same $\pm45°$ components (up to the arbitrary overall sign), and the scores $U\Sigma$ are the same $\pm2\sqrt2, \pm\sqrt2$ as before, again up to sign. The two routes are genuinely the same computation viewed from two angles — the spectral theorem applied to $\tilde X^{\mathsf{T}}\tilde X$, which is what the SVD is.

The Key Insight — There are two routes to PCA and they give the identical answer: diagonalize the covariance matrix $C = \tfrac{1}{n-1}\tilde X^{\mathsf{T}}\tilde X$ (eigenvectors = components, eigenvalues = variances), or take the SVD of the centered data $\tilde X = U\Sigma V^{\mathsf{T}}$ (right singular vectors = components, $\sigma_i^2/(n-1)$ = variances). They coincide because $C = V\bigl(\tfrac{1}{n-1}\Sigma^2\bigr)V^{\mathsf{T}}$ is the spectral decomposition of $C$. The SVD route is the one to use in practice.

32.8.1 Why the SVD route is numerically preferred

If the two routes give the same answer, why prefer the SVD? Because forming the covariance matrix $\tilde X^{\mathsf{T}}\tilde X$ squares the data, and squaring destroys precision in a way you can never recover. This is the same warning we sounded in Chapter 20 about solving least-squares via the normal equations $A^{\mathsf{T}}A$, and it will get its full treatment in Chapter 38, but the essence is simple and worth seeing now.

The trouble is the condition number. When you compute $\tilde X^{\mathsf{T}}\tilde X$, the ratio between the largest and smallest singular values gets squared: if $\tilde X$ has condition number $\kappa$ (singular values spanning a factor of $\kappa$), then $\tilde X^{\mathsf{T}}\tilde X$ has condition number $\kappa^2$. A dataset whose features are mildly correlated — singular values spanning, say, a factor of $10^4$ — produces a covariance matrix spanning $10^8$, and in standard double precision (about 16 significant digits) the smallest principal components can be swamped by rounding error. The small eigenvalues, which encode the subtle directions you might most want to find, are exactly the ones that get destroyed. The SVD of $\tilde X$ works directly on the data and never squares it, so it preserves the full precision of the original measurements. Library PCA routines — including scikit-learn's — compute the SVD of the centered data for exactly this reason; they only mention "covariance" as the conceptual story.

Warning

— Do not form the covariance matrix when precision matters. The covariance route ($C = \tfrac{1}{n-1}\tilde X^{\mathsf{T}}\tilde X$, then eigh) is fine for explaining PCA and fine for well-conditioned, low-dimensional data. But computing $\tilde X^{\mathsf{T}}\tilde X$ squares the condition number (Chapter 38), so for ill-conditioned or high-dimensional data it can lose half your significant digits and corrupt the small components. The SVD of the centered data matrix (np.linalg.svd(Xc)) gives identical results without ever squaring the data, which is why it is the production route. This is the same lesson as Chapter 20's "QR beats the normal equations" — never form a Gram matrix you do not need.

32.9 What does PCA look like? (the data ellipse and its principal axes)

It is time to draw the picture that the whole chapter has been describing, using a scatter plot of a real (if small) cloud with its principal axes overlaid. We use a larger, visibly elliptical dataset here so the cloud has an obvious grain, and we draw each principal component as an arrow from the mean, scaled by its standard deviation $\sqrt{\lambda_i}$ so the arrows reach to the edge of the data's spread. This is the data-analysis analogue of the recurring 2D visualizer from Chapter 1 — instead of showing what a matrix does to the unit square, it shows the natural axes a matrix (the covariance) finds in a data cloud.

# PCA scatter with principal axes overlaid: the directions that matter in data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# A tilted elliptical cloud: stretch, then rotate 30 degrees.
n = 300
base = rng.normal(size=(n, 2)) * np.array([3.0, 1.0])      # wide in x, narrow in y
theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = base @ R.T + np.array([5.0, 4.0])                       # rotate + shift off origin

mu = X.mean(axis=0)
Xc = X - mu
C = (Xc.T @ Xc) / (n - 1)
w, Q = np.linalg.eigh(C)
order = np.argsort(w)[::-1]; evals = w[order]; V = Q[:, order]   # PC1 first

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(X[:, 0], X[:, 1], s=10, alpha=0.4, color="C0", label="data")
ax.scatter(*mu, color="k", s=40, zorder=5, label="mean")
for i in range(2):                                          # draw each principal axis
    axis = V[:, i] * np.sqrt(evals[i]) * 2.5                # length ∝ standard deviation
    ax.arrow(mu[0], mu[1], *axis, color=f"C{i+1}", width=0.06,
             length_includes_head=True, zorder=6,
             label=f"PC{i+1} (var={evals[i]:.2f})")
ax.set_aspect("equal"); ax.grid(True, alpha=0.3); ax.legend(loc="best", fontsize=8)
ax.set_title("Principal axes of a data cloud")
plt.show()
print("eigenvalues (variances):", np.round(evals, 4))
print("explained-variance ratio:", np.round(evals / evals.sum(), 4))

eigenvalues (variances): [8.1171 0.8941]
explained-variance ratio: [0.9008 0.0992]

Figure 32.1. The principal axes of a tilted elliptical data cloud. Three hundred points form a cloud stretched roughly three-to-one and tilted about $30°$ from horizontal. The black dot marks the mean (the cloud's balance point), and the two arrows are the principal components drawn from the mean, each scaled by its standard deviation: the long orange arrow (PC1) points along the grain of the cloud — its direction of greatest variance — and the short green arrow (PC2) is perpendicular to it, along the cloud's narrow direction. PC1 here captures about $90\%$ of the variance ($\lambda_1 \approx 8.12$ versus $\lambda_2 \approx 0.89$), so projecting onto PC1 alone would summarize each point with a single number while keeping nine-tenths of the spread. Alt-text: a scatter of points forming a tilted ellipse, with a long arrow along the ellipse's major axis labeled PC1 and a short perpendicular arrow along the minor axis labeled PC2, both emanating from the marked mean.

The figure is the entire chapter in one image. The arrows are perpendicular — that is the §27.5 orthogonality of eigenvectors of the symmetric covariance matrix. PC1 lies along the long axis of the cloud — that is variance maximization (§32.4). The arrow lengths are the standard deviations $\sqrt{\lambda_i}$ — that is "the eigenvalue is the variance along the component." And projecting onto PC1 alone, dropping the narrow PC2 direction, is dimensionality reduction with a known cost (the discarded variance $\lambda_2$). Everything we proved is visible here: PCA finds the natural axes of a data cloud, and those axes are the eigenvectors of its covariance.

Geometric Intuition — Compare this figure to the symmetric-matrix figure from Chapter 27 (Figure 27.1) and the data ellipse from Chapter 28. They are the same picture three times: a symmetric matrix stretching the plane along perpendicular eigen-axes (Ch. 27), the elliptical level set of its quadratic form (Ch. 28), and now the elliptical spread of a data cloud along the eigen-axes of its covariance (Ch. 32). The covariance matrix is the symmetric matrix whose ellipse is the shape of your data, and PCA reads its perpendicular axes off the spectral theorem. Once you see that covariance, curvature, and stretch are one geometry, PCA stops being a separate technique and becomes an old friend.

32.10 What can go wrong, and what PCA cannot do?

PCA is powerful, but it is not magic, and using it well means knowing its assumptions. Three limitations deserve to be stated plainly, because each is a place where PCA, applied thoughtlessly, gives misleading answers.

PCA assumes linear structure. PCA finds flat subspaces — lines, planes, hyperplanes through the (centered) origin. If your data lies on a curved manifold — points wrapped around a spiral, or distributed on the surface of a sphere — PCA cannot see the curve. It will fit the best flat approximation, which may be a poor summary. Data shaped like a rolled-up "Swiss roll" famously defeats PCA: the intrinsic structure is a 2D sheet, but it is curled through 3D, and no flat projection unrolls it. Nonlinear methods (kernel PCA, t-SNE, UMAP, autoencoders) exist precisely for this case. PCA is the right tool when the structure is linear, and you should ask whether it is before trusting the result.

PCA is sensitive to feature scaling. Because PCA maximizes variance, and variance has units, a feature measured on a large numerical scale will dominate the principal components regardless of whether it is actually more important. Imagine a dataset with "age in years" (range ~0–100) and "income in dollars" (range ~0–100,000): income's raw variance is millions of times larger, so PC1 will point almost entirely along income, and age will be invisible — not because age is unimportant, but because dollars are smaller units than years. The standard fix is to standardize the features first (subtract the mean and divide by the standard deviation, so every feature has variance 1) before running PCA; equivalently, run PCA on the correlation matrix instead of the covariance matrix. Whether to standardize is a genuine modeling choice — sometimes the raw scales are meaningful — but you must make it consciously.

PCA's components are not always interpretable. A principal component is a linear combination of all the original features — PC1 might be "$0.3\times$age $+\ 0.5\times$income $-\ 0.2\times$height $+\ \cdots$" — and there is no guarantee this combination corresponds to anything a human would name. Sometimes the top component is beautifully interpretable (a "size" factor, an "overall sentiment" axis); sometimes it is an uninterpretable blend. PCA optimizes variance, not meaning. Related techniques (factor analysis, sparse PCA, independent component analysis) trade some variance for more interpretable components when interpretation is the goal.

Common Pitfall — Running PCA on unscaled features with wildly different units is the second-most-common PCA mistake (after forgetting to center). If one feature is in dollars and another in years, PCA will chase the dollars purely because of their larger numerical variance, and the "principal component" tells you about your choice of units, not your data. Always ask whether your features are on comparable scales; if not, standardize them (mean 0, variance 1) before running PCA, or equivalently use the correlation matrix. Centering removes location; standardizing removes scale — both are about making variance mean what you intend.

32.11 How do you implement PCA from scratch?

We close the exposition with this chapter's contribution to the from-scratch toolkit you have been building since Chapter 2. The natural deliverable is a pca function that takes a data matrix and a target dimension $k$ and returns the top-$k$ principal components, the variances they explain, and the projected (dimensionality-reduced) data — with both routes available, the covariance route and the SVD route, so you can see they agree. The hard numerical work — finding eigenvalues or singular values — leans on your earlier eigen.py (Chapters 23/29) and svd.py (Chapter 30); this chapter's job is the PCA assembly around them: center the data, build the covariance (or take the SVD of the centered data), sort by variance, compute the explained-variance ratio, and project. That assembly is pure linear algebra you can write by hand.

Build Your Toolkit — Implement pca(X, k) in toolkit/pca.py. Given a data matrix $X$ ($n$ rows = samples, $d$ columns = features) and a target dimension $k$, your function should: (1) center the data by subtracting the column means (do not skip this — it is the step libraries do silently); (2) either build the covariance $C = \tfrac{1}{n-1}\tilde X^{\mathsf{T}}\tilde X$ and diagonalize it with your spectral/eigen routine, or take the SVD of the centered data with your svd_from_scratch from Chapter 30 (implement both and check they match); (3) sort the components by descending variance and keep the top $k$; (4) return the $k$ components (as columns of a $d\times k$ matrix), the $k$ variances (eigenvalues, or $\sigma_i^2/(n-1)$), the explained-variance ratio, and the projected scores $Z = \tilde X V_k$. Then verify against numpy and a reference library: confirm your components match np.linalg.svd of the centered data (up to sign) and that your explained-variance ratios match sklearn.decomposition.PCA(n_components=k).explained_variance_ratio_. This module — the last of the progressive toolkit before the capstone — sits directly on top of svd.py (Chapter 30) and eigen.py (Chapters 23/29), and it is the engine you will use in the Chapter 39 capstone.

Here is the kind of check your finished function should pass, written with numpy so you can confirm the expected behavior before coding the from-scratch assembly. Note how it exercises both routes and cross-checks against sklearn:

# Expected behavior of pca(X, k) — verify both routes against each other and sklearn.
import numpy as np
def pca(X, k):
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                         # 1. CENTER
    # --- SVD route (numerically preferred): right singular vectors are the components ---
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                              # top-k components (rows)
    variances = (S**2 / (n - 1))[:k]                # eigenvalues = sigma^2 / (n-1)
    evr = (S**2 / (n - 1)) / (S**2 / (n - 1)).sum() # explained-variance ratio (full)
    scores = Xc @ components.T                       # project onto top-k axes
    return components, variances, evr[:k], scores

X = np.array([[7., 7.], [3., 3.], [6., 4.], [4., 6.], [5., 5.]])
comps, var, evr, Z = pca(X, k=2)
print("components:\n", np.round(comps, 6))
print("variances:", np.round(var, 6))
print("explained-variance ratio:", np.round(evr, 6))
# cross-check the variances against the covariance/eigh route
Xc = X - X.mean(axis=0)
eig_var = np.sort(np.linalg.eigvalsh((Xc.T @ Xc) / 4))[::-1]
print("eig route variances:", np.round(eig_var, 6), "-> match:", np.allclose(var, eig_var))

components:
 [[-0.707107 -0.707107]
 [-0.707107  0.707107]]
variances: [4. 1.]
explained-variance ratio: [0.8 0.2]
eig route variances: [4. 1.] -> match: True

The SVD-route variances $[4, 1]$ match the covariance/eigh-route variances exactly, the components are the $\pm45°$ axes (up to sign), and the explained-variance ratios are $[0.8, 0.2]$ — every number agreeing with the hand computation of §32.5 and with the sklearn cross-check we ran earlier. That is PCA, operationalized: center, decompose, sort, project.

Historical Note — Principal component analysis was first formulated by Karl Pearson in 1901, who framed it geometrically as finding the line (or plane) of best fit to a cloud of points — the line minimizing the perpendicular distances to the data, which is exactly the top principal component [verify]. The modern statistical formulation, including the name "principal components" and the connection to maximizing variance, was developed independently by Harold Hotelling in 1933 [verify]. The two views — Pearson's "best-fit subspace" and Hotelling's "maximum variance" — are the same thing seen from two sides, the duality between minimizing reconstruction error and maximizing retained variance that we met in §32.7.2. The link to the SVD as the computational engine came later, as the SVD itself matured into a practical algorithm in the 1960s–70s (Chapter 30).

32.12 What have we built, and where does it lead?

We began by looking at a cloud of data and asking which way it spreads, and we ended with the most-used algorithm in data science — and the entire journey was one idea: a data cloud has the shape of an ellipsoid, and PCA finds its axes by diagonalizing the covariance matrix. Center the data so the cloud sits at the origin (§32.2). Build the covariance matrix $C = \tfrac{1}{n-1}\tilde X^{\mathsf{T}}\tilde X$, which is symmetric and positive semidefinite, so its quadratic form $\mathbf{w}^{\mathsf{T}}C\mathbf{w}$ is the variance of the data along $\mathbf{w}$ (§32.3, straight from Chapter 28). Maximizing that variance is an eigenvalue problem whose answer is the top eigenvector (§32.4, the Rayleigh-quotient argument from Chapter 27), so the principal components are the orthonormal eigenvectors of $C$ and the eigenvalues are the variances along them. Equivalently — and preferably, for numerical reasons — the components are the right singular vectors of the centered data matrix, because the SVD is the spectral theorem applied to $\tilde X^{\mathsf{T}}\tilde X$ (§32.8). Choose how many to keep with the explained-variance ratio (§32.7), project onto them to reduce dimension, and pay a reconstruction cost equal to the discarded variance.

This chapter is where two of the book's great theorems converge on one application, and it is worth naming the convergence explicitly because it is the deepest takeaway. PCA rests on the spectral theorem (Chapter 27): the components are perpendicular because the covariance matrix is symmetric, and that symmetry is the licence for everything — real variances, orthogonal axes, a clean diagonalization. And PCA is the SVD (Chapter 30): the right singular vectors of the centered data are the principal components, no covariance matrix required. The spectral theorem gives the meaning ("perpendicular directions of maximum variance"); the SVD gives the computation ("never square your data"). Two of the most consequential results in linear algebra turn out to be two descriptions of the single act of finding the directions that matter in data. Keep the vocabulary close — centering, the covariance matrix, principal components as eigenvectors and right singular vectors, the explained-variance ratio, scores, dimensionality reduction, and reconstruction error as discarded variance — because the chapters ahead lean on all of it.

The threshold idea to carry forward is the equivalence of three pictures we have now seen merge completely: a symmetric matrix is a stretch along perpendicular axes (Chapter 27), the level set of its quadratic form is an ellipse (Chapter 28), and the spread of a data cloud is that same ellipse with the covariance as its matrix (this chapter). Covariance, curvature, and stretch are one geometry. Once you hold that, PCA is not a recipe to memorize but a fact to picture: rotate to the natural axes of the data, keep the long ones.

The forward reference is immediate and important. Chapter 33 (Application: Machine Learning) is the climax of Part VI, and PCA is one of its load-bearing pillars. The dimensionality reduction you just learned is a standard preprocessing step before training a model — feeding a learner the top few principal components instead of hundreds of raw, correlated features speeds up training and fights overfitting. More deeply, the embeddings at the heart of modern machine learning — the learned vectors that represent words, images, and users in high-dimensional space — are dimensionality reductions in spirit, and the matrix-factorization recommenders that power streaming and shopping are the same low-rank idea as PCA and Chapter 31's image compression: approximate a giant data matrix by a few directions that capture most of its structure. The covariance matrix's eigenvectors that you learned to find here are, it turns out, the front door to the systems reshaping the world. The data cloud's ellipse has more to teach us yet.

# Forward look: PCA as preprocessing — compress correlated features before learning.
import numpy as np
rng = np.random.default_rng(2)
# 200 samples, 5 features, but only 2 latent factors -> highly compressible
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + rng.normal(scale=0.1, size=(200, 5))
Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)             # singular values
evr = S**2 / (S**2).sum()
print("explained-variance ratio:", np.round(evr, 4))
print("top-2 components capture:", round(100 * evr[:2].sum(), 1), "% of variance")

explained-variance ratio: [0.6382 0.3589 0.0011 0.001  0.0008]
top-2 components capture: 99.7 % of variance

Five correlated features, built from only two hidden factors, compress to two principal components that capture $99.7\%$ of the variance — exactly the situation a machine-learning practitioner exploits to shrink a feature space before training. The same orthogonal diagonalization that revealed the spectral theorem, that built the SVD, and that compressed an image is, it turns out, how machines find the few directions that matter in a sea of data. Hold onto the data-ellipse picture; in the next chapter it goes to work inside the algorithms that are remaking the world.