Case Study 1 — Genes Mirror Geography: PCA and Population Structure

DataField.Dev

Case Study 1 — Genes Mirror Geography: PCA and Population Structure

Field: data science / population genetics. This case study ties to the chapter's anchor — PCA on a real high-dimensional dataset, projected to a handful of components that reveal hidden structure.

The problem: a thousand dimensions of DNA

A geneticist sequences a panel of individuals and, for each person, records their genotype at hundreds of thousands of positions along the genome — single-nucleotide polymorphisms, or SNPs, sites where the DNA letter varies across people. Each SNP is coded as $0$, $1$, or $2$ (the count of a reference allele). So every individual becomes a point in a space with one axis per SNP: a point in five-hundred-thousand-dimensional space, in a real study. The data matrix $X$ has one row per person and one column per SNP, and it is enormous and utterly unplottable.

And yet a famous and beautiful fact emerges the moment you run PCA on this data: the top two principal components of human genetic variation reproduce a map of geography. When researchers applied PCA to the genotypes of Europeans (the landmark "Genes mirror geography within Europe" study of Novembre and colleagues, 2008), the scatter of individuals on the first two principal components looked startlingly like a map of Europe — Italians at the bottom, Scandinavians at the top, the Iberian Peninsula to one side. No geography was given to the algorithm. PCA found it, purely from the statistical structure of the DNA, because populations that are geographically close have intermarried more and so share more genetic variants, making the genetic data's grain line up with the continent. This case study walks through why PCA does this and how to read the result, on a synthetic dataset small enough to compute.

Setting up the data

We simulate three populations, each of eighty individuals genotyped at two hundred SNPs. Most SNPs vary randomly across everyone (noise, from PCA's point of view), but a subset are ancestry-informative: their allele frequencies differ between populations, which is what makes the populations genetically distinguishable. We arrange the three populations so that their ancestry differences span a two-dimensional plane — population A at one corner, B at another, C lifted off in a third direction — exactly the situation that produces a 2D "ancestry map."

# Synthetic population-genetics data: 3 populations, 200 SNPs, ancestry signal in a 2D plane.
import numpy as np
rng = np.random.default_rng(42)
N_per, d = 80, 200
def population(shift1, shift2, n):
    g = rng.binomial(2, 0.3, size=(n, d)).astype(float)   # genotypes 0/1/2 at 200 SNPs
    g[:, :60]    += shift1                                 # ancestry SNPs for axis 1
    g[:, 60:100] += shift2                                 # ancestry SNPs for axis 2
    return g
A = population(0.0, 0.0, N_per)
B = population(1.4, 0.0, N_per)
C = population(0.7, 1.4, N_per)
data = np.vstack([A, B, C])                               # 240 individuals x 200 SNPs
labels = np.array([0]*N_per + [1]*N_per + [2]*N_per)      # true population of each

The data matrix is $240\times200$ — two hundred dimensions, three hidden populations. The labels exist only so we can check the result; PCA never sees them. This is unsupervised learning: we hand the algorithm a cloud of points and ask what shape it has.

Running PCA the right way

We follow the chapter's recipe, and we use the SVD route on the centered data — not the covariance route — for exactly the reason of §32.8.1: the data matrix is wide ($d = 200$ features), forming the $200\times200$ covariance matrix squares the condition number and wastes precision, and the SVD avoids it entirely. (In real studies with hundreds of thousands of SNPs, forming the covariance matrix is not just imprecise but computationally prohibitive; the SVD, or its randomized variants, is the only practical route.)

# PCA via SVD of the centered genotype matrix; project to the top 2 components.
import numpy as np
Xc = data - data.mean(axis=0)                             # CENTER (subtract mean genotype per SNP)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)         # SVD route — never form covariance
evr = S**2 / (S**2).sum()                                 # explained-variance ratio
scores = U * S                                            # individuals in PC coordinates
print("top-5 explained-variance ratio:", np.round(evr[:5], 4))
print("top-2 cumulative:", round(100 * evr[:2].sum(), 2), "%")
for p in [0, 1, 2]:                                       # mean position of each population
    print(f"population {p}: PC1 {scores[labels==p,0].mean():+.2f}, "
          f"PC2 {scores[labels==p,1].mean():+.2f}")

top-5 explained-variance ratio: [0.1711 0.151  0.012  0.0115 0.0114]
top-2 cumulative: 32.21 %
population 0: PC1 -5.51, PC2 -3.01
population 1: PC1 +5.53, PC2 -2.99
population 2: PC1 -0.02, PC2 +5.99

Reading the result

Two things jump out, and both are characteristic of real population-structure PCA.

The top two components stand apart from the rest. PC1 explains $17.1\%$ of the variance and PC2 explains $15.1\%$ — and then there is a cliff, with PC3 onward each explaining barely $1\%$. That gap is the scree-plot elbow: the first two components capture structure (the ancestry signal living in our 2D plane), and the long flat tail is noise (the hundreds of SNPs that vary randomly). The Kaiser-style reading is unambiguous: keep two components. Notice the absolute numbers are modest — only $32\%$ of total variance in the top two — and this is honest and expected. Genetic variation is genuinely high-dimensional; most of it is individual idiosyncrasy with no population meaning. PCA's job is not to explain most of the variance but to isolate the structured part, and the gap after PC2 is how you see it has succeeded.

The populations occupy distinct regions of the PC1–PC2 plane. Population 0 sits at $(-5.5, -3.0)$, population 1 at $(+5.5, -3.0)$, and population 2 at $(0, +6.0)$ — three corners of a triangle. PC1 separates populations 0 and 1 (the left–right axis), and PC2 lifts population 2 away from the other two (the up axis). If you scatter-plot the 240 individuals by their first two scores and color them by population, you see three clean clusters. This is the ancestry map. The geneticist did not tell PCA there were three populations, nor where they were; PCA discovered the clusters from the covariance structure of the DNA alone, because individuals within a population share ancestry-informative alleles and so cluster together in the directions of greatest variance.

# The ancestry map: scatter individuals by their top-2 PCA scores, colored by population.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(5, 5))
for p, name in [(0, "Pop A"), (1, "Pop B"), (2, "Pop C")]:
    pts = scores[labels == p]
    ax.scatter(pts[:, 0], pts[:, 1], s=15, alpha=0.6, label=name)
ax.set_xlabel("PC1"); ax.set_ylabel("PC2")
ax.set_title("Genetic ancestry map (top 2 principal components)")
ax.legend(); ax.grid(True, alpha=0.3); ax.set_aspect("equal")
plt.show()

The plot shows three distinct blobs at three corners of the plane — the same picture, in miniature, as the European map that PCA reconstructed from real genotypes. We compressed two hundred dimensions to two and the essential biological structure survived intact, because that structure was linear: ancestry mixes additively, so populations sit at positions in a flat ancestry space, and PCA is exactly the tool for flat structure.

Why the linear-algebra story matters here

Every step rests on the chapter's theorems. The components are perpendicular — the ancestry axes are orthogonal — because they are eigenvectors of the symmetric covariance matrix, and the spectral theorem of Chapter 27 guarantees perpendicular eigenvectors. The eigenvalues rank the axes by how much genetic variance they capture, which is why the structured ancestry directions (large eigenvalues) come first and the noise (small eigenvalues) is relegated to the tail. And we computed it all through the SVD of the centered data, never forming the covariance matrix, because with $d = 200$ features (or $500{,}000$ in practice) squaring the data would be both wasteful and ruinous to precision (§32.8.1). PCA on genetic data is the spectral theorem and the SVD, applied to a covariance matrix the size of a genome.

The cautions, applied

The chapter's warnings are not abstract here. Scaling matters: SNPs are sometimes standardized (divided by $\sqrt{p(1-p)}$, a per-SNP standard deviation) before PCA, so that common and rare variants contribute comparably; the choice changes the map. Linearity matters: PCA captures the additive, continental-scale structure beautifully, but fine-grained or non-additive structure (recent admixture, complex demographic histories) may need more components or nonlinear methods. Interpretation matters: PC1 here is "the ancestry difference between populations 0 and 1," a linear combination of sixty SNPs — interpretable because we built the data that way, but in real studies the components must be related back to geography and known populations with care. Used with these cautions, PCA remains the first thing every population geneticist runs on a new dataset: it is the microscope that turns a genome's worth of numbers into a picture you can see.

A note on scale: why this could not be done any other way

It is worth pausing on the sheer size of the real problem, because it explains why the SVD route is not merely a tasteful preference here but a practical necessity. A modern genotyping array measures hundreds of thousands to millions of SNPs across tens of thousands of individuals, so the data matrix $X$ might be $50{,}000 \times 1{,}000{,}000$. The covariance matrix over SNPs would be $1{,}000{,}000 \times 1{,}000{,}000$ — a trillion entries, impossible to even store, let alone diagonalize. The eigen route is a non-starter. The SVD of the centered data, by contrast, needs only the top few singular vectors, which iterative methods (the same power-iteration ideas as Chapter 29, and randomized SVD algorithms) compute without ever forming a giant covariance matrix or even the full decomposition. This is the §32.8.1 lesson at industrial scale: you compute PCA on enormous data precisely because the SVD lets you avoid the covariance matrix. The trick that we justified on a $5 \times 2$ toy is what makes population-scale genomics computationally possible — the same theorem, the same reasoning, eight orders of magnitude larger. When you read that a study found "the top ten principal components of genetic variation," what was actually computed was a truncated SVD of a centered genotype matrix, and the perpendicular ancestry axes it returned are the eigenvectors of a covariance matrix no one ever built.