Case Study 27.1 — The Covariance Matrix Is Symmetric: A First Look at PCA

Field: data science & statistics. Anchor tie-in: this is the PCA payoff the chapter has teased throughout, and the direct ancestor of the full treatment in Chapter 32; the orthogonal eigenvectors come from §27.5, and the technique generalizes to PCA on high-dimensional data.

The problem

A data scientist at a fitness-tracking startup has a spreadsheet: for each of 500 users, two measured features — average daily steps and average active minutes. Plotted as points in the plane, the cloud is not a round blob; it is a tilted, cigar-shaped ellipse, because the two features are strongly correlated (people who walk more are active more). The scientist wants to answer a deceptively simple question: along which direction does the data vary the most? That single direction would let her summarize each user by one number instead of two, with minimal loss — the essence of dimensionality reduction. The tool she reaches for is Principal Component Analysis, and at its mathematical core sits a symmetric matrix and the Spectral Theorem of this chapter.

The bridge from "a cloud of data" to "a symmetric matrix" is the covariance matrix. After centering the data (subtracting the mean of each feature so the cloud is centered at the origin), the covariance matrix is

$$C = \frac{1}{n-1}X^{\mathsf{T}}X,$$

where $X$ is the $n\times 2$ data matrix (one row per user, one column per feature). Its diagonal entries are the variances of the two features, and its off-diagonal entry is their covariance — how much they move together. The crucial structural fact, and the reason this chapter applies, is that $C$ is symmetric: $C^{\mathsf{T}} = \tfrac{1}{n-1}(X^{\mathsf{T}}X)^{\mathsf{T}} = \tfrac{1}{n-1}X^{\mathsf{T}}X = C$, because $(X^{\mathsf{T}}X)^{\mathsf{T}} = X^{\mathsf{T}}X$ for any matrix $X$ (Chapter 8). The covariance of feature $i$ with feature $j$ is the same as feature $j$ with feature $i$ — symmetry is built into the very meaning of covariance. And the moment we know $C$ is symmetric, the Spectral Theorem hands us its entire structure for free.
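The construction above can be sketched in a few lines of NumPy. The data here are synthetic stand-ins for the spreadsheet (the means, spreads, and the `0.006` slope are assumptions chosen only to produce a correlated cloud, not values from the case study); the point is that the formula, the symmetry check, and NumPy's own `np.cov` all agree.

```python
import numpy as np

# Hypothetical stand-in for the spreadsheet: 500 users, two correlated features.
rng = np.random.default_rng(0)
steps = rng.normal(8000.0, 2500.0, size=500)               # average daily steps (made-up scale)
minutes = 0.006 * steps + rng.normal(0.0, 8.0, size=500)   # average active minutes (made-up relation)
X_raw = np.column_stack([steps, minutes])                  # shape (500, 2): one row per user

# Center each column so the cloud sits at the origin.
X = X_raw - X_raw.mean(axis=0)
n = X.shape[0]

# Covariance matrix from the formula C = X^T X / (n - 1).
C = X.T @ X / (n - 1)

print("C =\n", C)
print("symmetric?", np.allclose(C, C.T))
print("matches np.cov?", np.allclose(C, np.cov(X_raw, rowvar=False)))
```

Note that `np.cov` centers the data itself and divides by $n-1$ by default, so it can be fed the raw, uncentered matrix.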

Setting up a concrete, hand-checkable covariance matrix

To see the mechanism cleanly, take a simple covariance matrix whose numbers we can verify by hand:

$$C = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}.$$

The diagonal says each feature has variance $3$; the off-diagonal $1$ says they are positively correlated. Because $C$ is symmetric, the Spectral Theorem promises real eigenvalues and perpendicular eigenvectors. We find them exactly as in §27.3: the characteristic polynomial is $(3-\lambda)^2 - 1 = (\lambda - 2)(\lambda - 4) = 0$, so the eigenvalues are $\lambda_1 = 4$ and $\lambda_2 = 2$, with orthonormal eigenvectors $\mathbf{q}_1 = \tfrac{1}{\sqrt2}(1,1)$ and $\mathbf{q}_2 = \tfrac{1}{\sqrt2}(1,-1)$.
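For readers who want the eigenvector step spelled out, the two null-space computations behind those vectors can be written as follows (a worked restatement of the result above, with $(x, y)$ denoting the unknown eigenvector components):

$$C - 4I = \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix} \;\Rightarrow\; -x + y = 0 \;\Rightarrow\; \mathbf{q}_1 = \tfrac{1}{\sqrt2}(1,1),$$

$$C - 2I = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \;\Rightarrow\; x + y = 0 \;\Rightarrow\; \mathbf{q}_2 = \tfrac{1}{\sqrt2}(1,-1).$$

Their dot product is $\tfrac12\bigl(1\cdot 1 + 1\cdot(-1)\bigr) = 0$, which confirms the orthogonality the Spectral Theorem promises.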

Here is the interpretation that makes PCA click. The eigenvectors are the principal components — the perpendicular axes of the data ellipse — and the eigenvalues are the variances along those axes. The largest eigenvalue, $\lambda_1 = 4$, belongs to the direction $\tfrac{1}{\sqrt2}(1,1)$: the $45°$ diagonal, exactly the direction in which "steps" and "active minutes" increase together. That is the direction of maximum variance, the long axis of the cigar. The smaller eigenvalue, $\lambda_2 = 2$, belongs to the perpendicular direction $\tfrac{1}{\sqrt2}(1,-1)$, the short axis. The Spectral Theorem's guarantee that these two axes are orthogonal is not a convenience — it is what makes the principal components an honest, non-redundant coordinate system for the data.

# PCA in miniature: the covariance matrix is symmetric, so eigh gives ⊥ principal axes.
import numpy as np
C = np.array([[3.0, 1.0],
              [1.0, 3.0]])                      # a symmetric covariance matrix
print("symmetric?", np.allclose(C, C.T))
w, Q = np.linalg.eigh(C)                          # real eigenvalues, orthonormal eigenvectors
print("variances along principal axes:", w)      # [2, 4]
print("principal components (columns of Q):\n", np.round(Q, 4))
print("components orthogonal? q1·q2 =", round(Q[:, 0] @ Q[:, 1], 12))
print("total variance = trace =", np.trace(C), "= sum of eigenvalues =", w.sum())
print("fraction of variance on top component:", round(w[-1] / w.sum(), 4))

symmetric? True
variances along principal axes: [2. 4.]
principal components (columns of Q):
 [[-0.7071  0.7071]
 [ 0.7071  0.7071]]
components orthogonal? q1·q2 = 0.0
total variance = trace = 6.0 = sum of eigenvalues = 6.0
fraction of variance on top component: 0.6667

Three facts from this output are worth dwelling on, and each is a direct consequence of the chapter. The principal components are orthogonal (their dot product is $0$ up to floating-point rounding, and exactly $0$ in the hand computation): that is §27.5. The total variance, $6$, equals the trace of $C$, which equals the sum of the eigenvalues, $2 + 4$: that is the trace identity of §27.7.3, and it means PCA redistributes the total spread among perpendicular axes without creating or destroying any. And the largest principal component captures $4/6 \approx 67\%$ of the variance: if the scientist keeps only that one direction and discards the other, she retains two-thirds of the total variance while halving the number of features. With more strongly correlated data, that fraction can climb above $95\%$, which is why PCA can often compress a dataset of dozens of features down to a handful of principal components with little loss.
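The claim that eigenvalues are variances along the principal axes can be checked on actual points rather than on the matrix alone. The sketch below draws a hypothetical sample (the seed and sample size are arbitrary assumptions), projects it onto the eigenvectors of its own sample covariance, and compares the variance of the projected scores with the eigenvalues. Because the sample covariance only approximates $\begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}$, the printed numbers will be near, not equal to, $2$ and $4$.

```python
import numpy as np

# Hypothetical sample: 500 points drawn from a distribution with covariance [[3, 1], [1, 3]].
rng = np.random.default_rng(1)
X_raw = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.0], [1.0, 3.0]], size=500)
X = X_raw - X_raw.mean(axis=0)            # center the cloud

C_hat = np.cov(X, rowvar=False)           # sample covariance (close to the population C)
w, Q = np.linalg.eigh(C_hat)              # eigenvalues in ascending order

scores = X @ Q                            # coordinates of each point along the principal axes
score_var = scores.var(axis=0, ddof=1)    # sample variance along each axis

print("eigenvalues of sample covariance:", w)
print("variance of projected scores:   ", score_var)
print("fraction kept by top axis:", score_var[-1] / score_var.sum())
```

The two printed vectors agree to rounding error, since the variance of the data along a unit vector $\mathbf{q}$ is $\mathbf{q}^{\mathsf{T}}C\mathbf{q}$, which equals $\lambda$ when $\mathbf{q}$ is an eigenvector.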

Why the spectral decomposition is PCA

The connection runs deeper than "PCA uses eigenvectors." The spectral decomposition of §27.6 is the precise statement of what PCA does to the covariance matrix:

$$C = \lambda_1\,\mathbf{q}_1\mathbf{q}_1^{\mathsf{T}} + \lambda_2\,\mathbf{q}_2\mathbf{q}_2^{\mathsf{T}} = 4\,\mathbf{q}_1\mathbf{q}_1^{\mathsf{T}} + 2\,\mathbf{q}_2\mathbf{q}_2^{\mathsf{T}}.$$

Read this as a ranking: the covariance structure of the data is a sum of perpendicular pieces, each piece a rank-one projector weighted by how much variance lives along that axis. The pieces are ordered by importance through their eigenvalues. Dimensionality reduction is then simply truncation: keep the terms with the largest eigenvalues and drop the rest. Projecting the data onto the top $k$ principal components — onto $\operatorname{span}\{\mathbf{q}_1, \dots, \mathbf{q}_k\}$ — gives the best $k$-dimensional summary of the data in a precise, provable sense (the projection that retains the most variance). For our toy example, "reduce to 1 dimension" means projecting every user onto the $\tfrac{1}{\sqrt2}(1,1)$ axis: a single "overall activity" score that captures two-thirds of how users differ.
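The decomposition and its truncation can be verified numerically for the toy matrix. This is a minimal sketch: the `user` vector is a hypothetical, already centered data point introduced only to show what a one-number summary looks like, and the score is reported as a magnitude because the sign of an eigenvector returned by `eigh` is arbitrary.

```python
import numpy as np

C = np.array([[3.0, 1.0],
              [1.0, 3.0]])
w, Q = np.linalg.eigh(C)                  # ascending order: w = [2, 4]
q2, q1 = Q[:, 0], Q[:, 1]                 # q1 is the top component (eigenvalue 4)

P1 = np.outer(q1, q1)                     # rank-one projector onto q1
P2 = np.outer(q2, q2)                     # rank-one projector onto q2

print("C rebuilt from its pieces?", np.allclose(w[1] * P1 + w[0] * P2, C))
print("projectors resolve the identity?", np.allclose(P1 + P2, np.eye(2)))

C_top = w[1] * P1                         # truncation: keep only the top term
print("rank-one truncation:\n", np.round(C_top, 4))

# Reducing one (hypothetical, already centered) user to a single score.
user = np.array([2.0, 1.0])
score = user @ q1                         # coordinate along the top axis
print("1-D score magnitude:", round(abs(score), 4))
```

The truncated matrix is $4\,\mathbf{q}_1\mathbf{q}_1^{\mathsf{T}} = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}$, and the score for $(2, 1)$ has magnitude $3/\sqrt2 \approx 2.1213$.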

This is also why the orthogonality matters so much in practice. Because the principal components are perpendicular, the variance each one captures is independent of the others: there is no double-counting. The "$67\%$ on the first component" and "$33\%$ on the second" add to exactly $100\%$ precisely because the projectors $\mathbf{q}_i\mathbf{q}_i^{\mathsf{T}}$ are orthogonal and resolve the identity (§27.6.1). If the eigenvectors were skew, as they generally are for a non-symmetric matrix, the variances would overlap, the percentages would not add up cleanly, and the eigenvectors would no longer single out "the direction of maximum variance". The Spectral Theorem's orthogonality is the structural guarantee that PCA's accounting is honest.
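To see what skew eigenvectors look like, it may help to contrast the covariance matrix with a non-symmetric one. The matrix `A` below is a hypothetical example chosen purely for contrast (it is not a covariance matrix and does not come from the case study); the sketch uses NumPy's general solver `np.linalg.eig`, which returns unit-length eigenvectors.

```python
import numpy as np

# Hypothetical non-symmetric matrix, chosen only for contrast; it is not a covariance matrix.
A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
vals, V = np.linalg.eig(A)                # general eigen-solver; columns of V have unit length
cos_angle = V[:, 0] @ V[:, 1]
print("eigenvalues:", vals)
print("|cos(angle between eigenvectors)|:", round(abs(cos_angle), 4))

# The rank-one pieces built from skew eigenvectors do not resolve the identity.
pieces = np.outer(V[:, 0], V[:, 0]) + np.outer(V[:, 1], V[:, 1])
print("v v^T pieces sum to I?", np.allclose(pieces, np.eye(2)))

# The symmetric covariance matrix from the case study, for comparison.
C = np.array([[3.0, 1.0],
              [1.0, 3.0]])
_, Q = np.linalg.eigh(C)
print("symmetric case, |q1·q2|:", round(abs(Q[:, 0] @ Q[:, 1]), 12))
```

For this `A` the eigenvectors are $(1, 0)$ and $\tfrac{1}{\sqrt2}(1, -1)$ up to sign, so they meet at $45°$ rather than $90°$, and the outer-product pieces overlap instead of summing to $I$.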

Reading the eigenvalues: when does PCA help?

The eigenvalues do more than rank the components — their spread tells the scientist whether PCA will help at all. Two extreme cases illustrate the point.

If the two features were uncorrelated and equally variable, the covariance matrix would be $C = \begin{psmallmatrix}3 & 0\\ 0 & 3\end{psmallmatrix} = 3I$, with both eigenvalues equal to $3$. The data cloud would be a perfect circle, with no preferred direction — every direction captures the same $50\%$ of the variance, and there is nothing to reduce. (This is the repeated-eigenvalue case of §27.5.1: every direction is an eigenvector, and we are free to choose any orthonormal pair as the "principal" axes, because none is special.) PCA offers no compression here.

At the other extreme, if the features were almost perfectly correlated, say $C = \begin{psmallmatrix}3 & 2.9\\ 2.9 & 3\end{psmallmatrix}$, the eigenvalues would be $3 + 2.9 = 5.9$ and $3 - 2.9 = 0.1$: wildly unequal. Nearly all the variance (about $98\%$) lives along one axis, and dropping the tiny-eigenvalue direction loses almost nothing. The data is essentially one-dimensional dressed up as two, and PCA reveals it. The ratio of the largest eigenvalue to the total is the dial that tells you how compressible your data is, and it is readable directly off the spectrum of a symmetric matrix. This same eigenvalue-spread diagnostic, run on the covariance matrix of an image's pixels or a genome's expression levels, is how practitioners decide how many components to keep in real applications.
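The diagnostic can be put side by side for the three covariance matrices discussed in this case study. The helper name `top_fraction` is a hypothetical convenience introduced for this sketch; it uses `np.linalg.eigvalsh`, which returns only the eigenvalues of a symmetric matrix in ascending order.

```python
import numpy as np

def top_fraction(C):
    """Fraction of total variance carried by the largest eigenvalue of a symmetric C."""
    w = np.linalg.eigvalsh(C)             # eigenvalues only, ascending order
    return w[-1] / w.sum()

cases = {
    "uncorrelated (3I)":           np.array([[3.0, 0.0], [0.0, 3.0]]),
    "moderately correlated":       np.array([[3.0, 1.0], [1.0, 3.0]]),
    "almost perfectly correlated": np.array([[3.0, 2.9], [2.9, 3.0]]),
}
for name, C in cases.items():
    print(f"{name:28s} top fraction = {top_fraction(C):.4f}")
```

The three fractions are $3/6 = 0.5$, $4/6 \approx 0.6667$, and $5.9/6 \approx 0.9833$: the closer the value is to $1$, the less is lost by keeping a single component.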

Takeaways

A covariance matrix is symmetric by construction ($C = \tfrac{1}{n-1}X^{\mathsf{T}}X$), so the Spectral Theorem applies and hands us real eigenvalues and orthogonal eigenvectors automatically.
The eigenvectors are the principal components (perpendicular axes of the data ellipse) and the eigenvalues are the variances along them — the largest eigenvalue marks the direction of maximum variance.
The spectral decomposition $C = \sum_i\lambda_i\mathbf{q}_i\mathbf{q}_i^{\mathsf{T}}$ ordered by eigenvalue is PCA; dimensionality reduction is truncating this sum to the top components.
Orthogonality of the components (§27.5) is what makes the variance accounting honest — the captured fractions add to $100\%$ with no double-counting.
The spread of the eigenvalues tells you how compressible the data is: equal eigenvalues (a round cloud) mean no reduction is possible; one dominant eigenvalue means the data is essentially low-dimensional. We build all of this into full $n$-dimensional PCA in Chapter 32.