Case Study 1 — A Better Basis for Data: The Idea Behind PCA

DataField.Dev

Field: data science / machine learning. This case study ties directly to the chapter anchor — re-gridding a problem into a coordinate system where its structure becomes obvious — and previews Principal Component Analysis (Chapter 32).

The problem: coordinates that fight the data

Imagine a fitness study that records, for each of a few thousand people, two numbers: a strength score and an endurance score, both mean-centered and oriented so that larger means fitter. You plot the cloud of points and immediately see something the raw coordinates do not say cleanly: the cloud is a long, tilted ellipse. People who score high on strength tend to score high on endurance. The two measurements are correlated, and the cloud stretches along a diagonal running up and to the right.

Here is the friction. Your data lives in the standard basis — the strength axis and the endurance axis — but the data's actual structure does not line up with those axes. The interesting variation runs diagonally, across both axes at once, and a second, smaller variation runs perpendicular to it (the spread "thickness" of the ellipse). The standard coordinates are fighting the geometry: every meaningful direction is a mixture of strength and endurance, so neither coordinate alone tells the real story, and the two coordinates carry overlapping, redundant information.

This is exactly the situation of §16.4 in the chapter, where the transformation $\begin{bmatrix}2&1\\1&2\end{bmatrix}$ stretched space along the diagonal $(1,1)$ and looked complicated only because we insisted on the standard grid. The fix there was to re-grid into a basis aligned with the action. The fix here is identical: re-grid the data into a basis aligned with its spread. That re-gridding is Principal Component Analysis, and it is, at its mathematical core, nothing more than a change of basis chosen to diagonalize a particular matrix.
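As a quick reminder of that earlier re-gridding, here is a minimal sketch. The variable names are ours, and the second basis vector $(1,-1)$ is an assumption: the text above names only the diagonal $(1,1)$, and we take its perpendicular as the other aligned direction.

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])
# Assumed aligned basis: the diagonal (1, 1) and the anti-diagonal (1, -1), as columns.
B = np.array([[1.,  1.],
              [1., -1.]])
A_new = np.linalg.inv(B) @ A @ B     # the same transformation, described on the new grid
print(A_new)                         # diagonal: stretch by 3 along (1, 1), by 1 along (1, -1)
```

On the new grid the matrix has no off-diagonal entries left, which is the property we will ask of the covariance matrix below.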

The covariance matrix: spread, written as a matrix

To make "spread" precise we use the covariance matrix $C$. For two mean-centered variables, $C$ is the symmetric $2\times 2$ matrix whose diagonal entries are the variances of each variable and whose two off-diagonal entries both equal their covariance, a number measuring how much they vary together. For our fitness data, suppose it works out to $$C = \begin{bmatrix} 5 & 2 \\ 2 & 2 \end{bmatrix}.$$ Read it: strength has variance $5$, endurance has variance $2$, and the off-diagonal $2$ says they are positively correlated (when one is above average, the other tends to be too). The scores are centered but not rescaled to unit variance; if they were, both diagonal entries would be $1$. That nonzero off-diagonal entry is the algebraic fingerprint of the tilt you saw in the scatter plot. A diagonal covariance matrix would mean uncorrelated, axis-aligned variation: a cloud whose ellipse points straight along the axes. Our off-diagonal $2$ is precisely what makes the standard basis the wrong one.
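To see where such a matrix comes from, here is a rough sketch that estimates a covariance matrix from samples. The data is a hypothetical stand-in for the fitness study: we assume Gaussian scores drawn with the covariance above, and the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily
C_true = np.array([[5., 2.],
                   [2., 2.]])
# Hypothetical data: 5000 people, columns = (strength, endurance).
X = rng.multivariate_normal(mean=[0., 0.], cov=C_true, size=5000)

Xc = X - X.mean(axis=0)                  # center each column
C_hat = Xc.T @ Xc / (len(X) - 1)         # sample covariance; matches np.cov(X, rowvar=False)
corr = C_hat[0, 1] / np.sqrt(C_hat[0, 0] * C_hat[1, 1])
print(np.round(C_hat, 2))                # close to C_true, up to sampling noise
print("correlation:", round(corr, 2))
```

For the exact matrix above the correlation is $2/\sqrt{5\cdot 2} \approx 0.63$; the sample estimate should land near that value.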

The total amount of variation in the data is the trace of $C$, the sum of the variances: here $\operatorname{tr}(C) = 5 + 2 = 7$. And from the chapter we know something powerful about the trace: it is a similarity invariant, so $\operatorname{tr}(P^{-1}CP) = \operatorname{tr}(C)$ for every invertible $P$. For the orthonormal bases PCA uses, $P^{-1}CP$ is exactly the covariance of the data in the new coordinates, so however we rotate the axes, the total variance stays $7$. Such a change of basis redistributes the variance among the new coordinates, but it can never create or destroy any. That conservation is what makes the whole technique honest.
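A minimal sketch of that conservation, assuming orthonormal changes of basis (plain rotations of the axes). The helper `rotation` and the list of angles are illustrative choices, not part of the chapter.

```python
import numpy as np

C = np.array([[5., 2.],
              [2., 2.]])

def rotation(theta):
    """Orthonormal change-of-basis matrix: axes turned by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

traces = []
for deg in (0, 15, 30, 60, 90):
    R = rotation(np.radians(deg))
    C_new = R.T @ C @ R              # R^-1 = R^T for a rotation
    traces.append(np.trace(C_new))
    print(f"{deg:3d} deg  variances = {np.round(np.diag(C_new), 3)}  total = {np.trace(C_new):.3f}")
```

The individual variances on the diagonal shift as the axes turn, while their sum stays at 7 in every row of the printout.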

Re-gridding to diagonalize the spread

We want a new basis in which the covariance matrix becomes diagonal — a basis whose axes align with the ellipse, so that the new coordinates are uncorrelated and the off-diagonal entry vanishes. By the similarity story of §16.5, re-expressing $C$ in a new basis with change-of-basis matrix $P$ gives $P^{-1}CP$, and we are looking for the $P$ that makes this diagonal. That special $P$ is built from the eigenvectors of $C$ — the directions $C$ stretches without rotating — which Chapter 23 will teach us to find. For now, let numpy hand us the eigenbasis and watch the re-gridding work:

# PCA as a change of basis: diagonalize the covariance matrix.
import numpy as np
np.set_printoptions(suppress=True, precision=4)
C = np.array([[5., 2.],
              [2., 2.]])
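eigvals, P = np.linalg.eigh(C)        # eigh: symmetric matrix -> orthonormal eigenvectors
                                      # eigenvalues come back in ascending order, so the
                                      # last column of P is the first principal component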
print("eigenvalues (variances along new axes):", eigvals)   # [1. 6.]
print("P (columns = principal-component directions):\n", P)
D = np.linalg.inv(P) @ C @ P          # the covariance, re-gridded into the eigenbasis
print("C in the new basis = P^-1 C P:\n", np.round(D, 6))    # [[1. 0.] [0. 6.]]
print("total variance:  before =", np.trace(C), " after =", round(np.trace(D), 4))

The output is striking. The eigenvalues come back as $1$ and $6$, the re-gridded covariance is the clean diagonal $$D = P^{-1}CP = \begin{bmatrix} 1 & 0 \\ 0 & 6 \end{bmatrix},$$ and the total variance is $7$ both before and after — exactly the trace invariance the chapter promised. The off-diagonal $2$ has vanished. In the new basis, the two coordinates are uncorrelated: one new axis (the first principal component) carries variance $6$, and the perpendicular axis carries variance $1$. We have re-gridded the data into the coordinate system that the data itself prefers, and in that system its structure is laid bare — a long axis and a short axis, no tilt, no cross-talk.
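The same re-gridding can be checked on data points rather than on the matrix alone. The sketch below assumes a hypothetical Gaussian sample with covariance $C$ (seed and size are arbitrary), rotates every point into the eigenbasis of the sample covariance, and recomputes the covariance of the new coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)
C = np.array([[5., 2.],
              [2., 2.]])
X = rng.multivariate_normal([0., 0.], C, size=20000)   # hypothetical sample
C_hat = np.cov(X, rowvar=False)
eigvals, P = np.linalg.eigh(C_hat)       # eigenbasis of the *sample* covariance
Z = (X - X.mean(axis=0)) @ P             # each row: that point's coordinates in the eigenbasis
C_Z = np.cov(Z, rowvar=False)
print(np.round(C_Z, 3))                  # off-diagonals vanish; diagonal equals eigvals
```

Because the rows of `X` are points, multiplying on the right by `P` applies $P^{\mathsf{T}}$ to each point, which is the coordinate change for an orthonormal basis.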

Why this is the whole idea of dimensionality reduction

Now the payoff. The first principal component carries $6$ out of $7$ units of total variance — about 86% — and the second carries only $1$ unit. So if you projected every data point onto just the first principal axis, throwing away the second coordinate entirely, you would keep 86% of all the variation in the dataset while halving the number of numbers you store per person. The thin direction of the ellipse contributes so little that discarding it loses almost nothing.
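To make the projection concrete, here is a small sketch that keeps only the first principal coordinate and measures what is lost. The sample is hypothetical (Gaussian, arbitrary seed), and the names `scores`, `X_approx`, and `lost` are ours.

```python
import numpy as np

C = np.array([[5., 2.],
              [2., 2.]])
eigvals, P = np.linalg.eigh(C)           # ascending order: the last column is the first PC
pc1 = P[:, -1]
explained = eigvals[-1] / eigvals.sum()  # 6 / 7

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0., 0.], C, size=5000)    # hypothetical centered data
scores = X @ pc1                         # one number per person
X_approx = np.outer(scores, pc1)         # rebuild (strength, endurance) from PC1 alone
lost = np.mean(np.sum((X - X_approx) ** 2, axis=1))    # mean squared reconstruction error
print(f"explained variance ratio: {explained:.3f}")    # 0.857
print(f"mean squared error from dropping PC2: {lost:.3f}")
```

The reconstruction error should come out near 1, the variance along the discarded axis, which is the sense in which dropping the second coordinate costs one unit out of seven.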

That is dimensionality reduction in miniature, and it scales spectacularly. Real datasets do not live in two dimensions but in hundreds or thousands: every pixel of an image, every gene in an expression profile, every feature of a customer. In the standard basis those coordinates are often massively correlated and redundant. Re-grid into the principal-component basis (diagonalize the covariance matrix) and the variance typically concentrates into a handful of leading components, with a long tail of near-zero ones you can drop at little cost. A dataset with a thousand correlated features might be faithfully described by ten principal coordinates. The compression is possible only because the right change of basis revealed that the data's intrinsic dimensionality was far smaller than its coordinate count suggested.
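A toy version of that high-dimensional picture, under loudly stated assumptions: the data is synthetic, built so that 50 features are driven by only 3 hidden factors plus a little noise. The sizes, the noise level, and the 95% threshold are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 2000, 50, 3                    # hypothetical sizes: 50 features, 3 latent factors
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.1 * rng.normal(size=(n, d))    # correlated features plus small noise

C = np.cov(X, rowvar=False)              # 50 x 50 covariance matrix
eigvals = np.linalg.eigvalsh(C)[::-1]    # eigenvalues only, sorted descending
cumulative = np.cumsum(eigvals) / eigvals.sum()
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)    # components needed for 95% of variance
print("top 5 eigenvalues:", np.round(eigvals[:5], 2))
print("components for 95% of variance:", n_keep)
```

Real datasets are rarely this tidy, but the pattern to look for is the same: a few large eigenvalues followed by a tail that contributes almost nothing to the total.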

The connection to the chapter is exact and worth stating plainly: PCA is a change of basis. The principal components are the new basis vectors; the matrix of eigenvectors is the change-of-basis matrix $P$; "the variance along each principal component" is the diagonal of $P^{-1}CP$; and the conservation of total variance is the invariance of the trace under similarity. Every idea you need to understand PCA, you learned in this chapter. The only thing Chapter 32 adds is how to find the eigenbasis and the proof (the spectral theorem, Chapter 27) that a symmetric matrix like a covariance matrix always has an orthonormal eigenbasis — so the diagonalizing change of basis always exists.

Reading the new basis: what the principal axes actually are

It is worth looking at the change-of-basis matrix $P$ itself, because its columns have a concrete, interpretable meaning that demystifies the whole procedure. For our covariance $C = \begin{bmatrix}5&2\\2&2\end{bmatrix}$, the eigenvector for the large eigenvalue $6$ points roughly along $(0.89, 0.45)$ — up and to the right, the long axis of the tilted ellipse — while the eigenvector for the small eigenvalue $1$ points perpendicular to it, along $(-0.45, 0.89)$. (Eigenvectors are determined only up to sign, so numpy may hand you the opposite-pointing $(-0.89, -0.45)$ for the same axis — same line, same principal component.) These two directions are the new basis. The first principal component is the single direction in which the fitness data varies most; the second is the orthogonal direction of leftover variation.
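Those directions can be verified by hand. The rounded values above correspond to the exact unit vectors $(2,1)/\sqrt5$ and $(-1,2)/\sqrt5$ (our derivation, which the sketch below checks against the definition $Cv = \lambda v$):

```python
import numpy as np

C = np.array([[5., 2.],
              [2., 2.]])
v_long  = np.array([2., 1.]) / np.sqrt(5)     # ~ (0.894, 0.447)
v_short = np.array([-1., 2.]) / np.sqrt(5)    # ~ (-0.447, 0.894)

print(C @ v_long,  6 * v_long)     # equal: C stretches v_long by 6
print(C @ v_short, 1 * v_short)    # equal: C leaves v_short unchanged
print(C @ -v_long, 6 * -v_long)    # the flipped vector is an eigenvector too
print("perpendicular:", np.isclose(v_long @ v_short, 0.0))
```

The third line is the sign ambiguity in action: negating an eigenvector gives another eigenvector for the same eigenvalue, so either orientation describes the same axis.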

That gives the principal components a plain-language reading that practitioners rely on. The first component is a weighted blend, "mostly strength with a healthy dose of endurance," and a person's score on it is essentially their overall fitness — the one number that best summarizes both measurements at once. The second component, the small one, captures the imbalance between strength and endurance, the way one person can be the same overall fitness as another while being stronger-but-less-enduring. When PCA tells you the first component holds 86% of the variance, it is telling you that overall fitness explains most of how people differ, and the strength-versus-endurance trade-off explains only a little. The change of basis did not just compress the data; it named the latent structure — and that interpretive payoff is why PCA is a workhorse of exploratory data analysis, not merely a compression trick.
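A small sketch of that reading, using two hypothetical people whose scores are invented for illustration. The signs of the axes are fixed by hand so that a positive second coordinate means "more endurance than strength"; numpy might return either orientation.

```python
import numpy as np

# Principal axes for C = [[5, 2], [2, 2]], with signs fixed by hand for readability.
pc1 = np.array([2., 1.]) / np.sqrt(5)     # "overall fitness"
pc2 = np.array([-1., 2.]) / np.sqrt(5)    # "endurance-over-strength imbalance"

# Two hypothetical people, as (strength, endurance) deviations from the mean.
people = {"A": np.array([3., 0.]),
          "B": np.array([2., 2.])}
scores = {name: (x @ pc1, x @ pc2) for name, x in people.items()}
for name, (s1, s2) in scores.items():
    print(f"{name}: PC1 = {s1:.3f}, PC2 = {s2:+.3f}")
# A: PC1 = 2.683, PC2 = -1.342
# B: PC1 = 2.683, PC2 = +0.894
```

Both people get the same first coordinate, so they are equally fit overall in this model; the second coordinate is what separates the stronger A from the more enduring B.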

There is one subtlety the chapter's machinery handles cleanly. Because a covariance matrix is symmetric, its eigenvectors can always be chosen orthonormal (mutually perpendicular and unit length), so the change-of-basis matrix $P$ is an orthogonal matrix (Chapter 21), and $P^{-1} = P^{\mathsf{T}}$. This is the gift the chapter mentioned in §16.3.2: for an orthonormal basis the inversion is a free transpose, and the change of basis is a rigid turn of the coordinate axes (a rotation, possibly with one axis flipped, depending on the signs of the eigenvectors) with no shearing or rescaling. Geometrically, PCA rotates your graph paper until the axes line up with the ellipse: nothing more violent than a turn. The spectral theorem of Chapter 27 guarantees that such an orthonormal eigenbasis exists for every real symmetric matrix, which is exactly why PCA is available for any dataset whose covariance matrix can be computed.
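These properties are easy to confirm numerically. The sketch below is only a check on our running example, not a proof; the name `det_P` is ours.

```python
import numpy as np

C = np.array([[5., 2.],
              [2., 2.]])
eigvals, P = np.linalg.eigh(C)
print("P^T P = I:        ", np.allclose(P.T @ P, np.eye(2)))
print("P^-1 equals P^T:  ", np.allclose(np.linalg.inv(P), P.T))
print("P^T C P diagonal: ", np.allclose(P.T @ C @ P, np.diag(eigvals)))
det_P = np.linalg.det(P)
print("det(P) =", round(det_P, 6))    # +1 for a pure rotation, -1 if one axis is flipped
```

In practice this is why PCA code usually writes `P.T @ C @ P` rather than calling `np.linalg.inv`: the transpose is cheaper and avoids rounding error from an unnecessary inversion.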

The lesson

The standard basis is where data arrives, not where it is most honest. The strength axis and the endurance axis were given to us by how the measurements were taken, not by how the data is actually shaped. A change of basis, into the coordinate system aligned with the data's own variation, turned a tilted, correlated, redundant cloud into a clean pair of uncorrelated axes ordered by importance. The same maneuver that simplified a transformation in §16.4 simplifies a dataset here, because it is the same mathematics: find the basis that diagonalizes the relevant matrix, and the structure you were missing snaps into view.

This is why change of basis is not a footnote but a master technique. The transformation never changed; the data never changed. We changed the graph paper, and the right graph paper revealed that what looked like two entangled measurements was really one dominant pattern plus a sliver of leftover variation. Hold onto the picture: when a problem looks complicated, suspect that you are describing it in the wrong basis, and that somewhere there is a coordinate system in which it is almost trivial. Finding that system is what the rest of this book is about.