Chapter 31 — Key Takeaways

The one idea

A matrix is a sum of rank-one layers $A = \sum_i \sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ ordered by importance, and keeping only the top $k$ is the provably best low-rank approximation there is (the Eckart-Young theorem). That single move — stop the sum early — is image compression, signal denoising, and dimensionality reduction all at once. The interpretation of the discarded small-singular-value layers changes (fine detail, noise, redundancy), but the mathematics is identical and the optimality is guaranteed. This is the chapter's threshold concept: once you see compression, denoising, and dimensionality reduction as one operation seen three ways, the SVD stops being a collection of tricks and becomes a single, unifying tool.

The big ideas, in order

  1. Low-rank approximation = truncated SVD. The rank-$k$ approximation is $A_k = \sum_{i=1}^k\sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}} = U_k\Sigma_k V_k^{\mathsf{T}}$, the partial sum of the top $k$ rank-1 layers. It has rank exactly $k$; when $k = \operatorname{rank}(A)$ the sum is complete and $A_k = A$.
  2. Eckart-Young (state the conditions). For any matrix — no symmetry, squareness, or invertibility needed — $A_k$ is the closest rank-$k$ matrix to $A$ in both the Frobenius and operator norms. The errors are $\lVert A - A_k\rVert_F = \sqrt{\sum_{i>k}\sigma_i^2}$ and $\lVert A - A_k\rVert_2 = \sigma_{k+1}$. The minimizer is unique when $\sigma_k > \sigma_{k+1}$.
  3. A grayscale image is a matrix. An $m \times n$ image is a grid of pixel brightnesses; smooth regions make rows and columns nearly dependent (low rank). Rank-$k$ reconstruction sums the top $k$ visual "basis patterns": rank-10 is a blurry ghost, rank-50 recognizable, rank-200 indistinguishable.
  4. Storage and compression ratio. Rank-$k$ costs $k(m+n+1)$ numbers versus $mn$; the ratio is $mn / k(m+n+1)$. Compression only pays off below $k \approx mn/(m+n)$. A rank-50 image of a $512\times512$ photo uses under one-fifth the storage.
  5. Energy and choosing the rank. Energy is $\lVert A\rVert_F^2 = \sum_i\sigma_i^2$; the fraction captured by the top $k$ is $\sum_{i\le k}\sigma_i^2/\sum_i\sigma_i^2$. The exact link: squared relative error $= 1 - $ (energy captured). Choose $k$ at the scree-plot elbow or a target like $99\%$.
  6. Denoising. Signal is coherent (few large singular values); noise is incoherent (spread across small ones), filling a bulk up to the Marchenko-Pastur edge $\eta(\sqrt m + \sqrt n)$. Truncating in the gap strips the noise — optimally, by Eckart-Young — but only when a real spectral gap exists.
  7. Dimensionality reduction. The same truncation applied to a data matrix is the best $k$-dimensional representation of the points (projection onto the top $k$ right singular vectors). Centering the data first turns it into PCA (Chapter 32).

Skills you gained

  • Write the truncated SVD $A_k$ and compute its rank-$k$ reconstruction as a sum of outer products.
  • State the Eckart-Young theorem with its conditions and both error formulas, and sketch why truncation is optimal.
  • Compute the storage cost $k(m+n+1)$, the compression ratio, and the break-even rank for an image.
  • Compute the relative Frobenius error and show it equals $\sqrt{\sum_{i>k}\sigma_i^2}/\lVert A\rVert_F$ and that its square is one minus the energy captured.
  • Read a scree plot to choose a truncation rank; identify a spectral gap that licenses denoising.
  • Use truncation to denoise a low-rank signal and to separate a static background from motion.
  • Explain why SVD compression differs from JPEG and when low-rank approximation is the right tool.

Terms to know

low-rank approximation · truncated SVD · rank-$k$ approximation $A_k$ · Eckart-Young theorem (Eckart-Young-Mirsky) · Frobenius norm · operator (spectral) norm · singular value · rank-one layer · energy / variance captured · scree plot · elbow · compression ratio · storage $k(m+n+1)$ · denoising · spectral gap · Marchenko-Pastur edge · dimensionality reduction · latent factors / matrix factorization · robust PCA

How this connects to the book's themes

  • Linear algebra is the most applied branch of pure mathematics. This chapter is the theme at full force: one operation — keep the top singular layers — compresses an image, denoises a signal, builds a recommender, and reduces dimension. Learn it once, use it everywhere is here a literal description.
  • Geometry and algebra are two views of one object. A singular value is an entry of $\Sigma$ (algebra) and the length of an ellipsoid axis (geometry); low-rank approximation is "drop the small entries" (algebra) and "flatten the thin axes" (geometry) — the same act.
  • Computation validates theory and theory guides computation. Eckart-Young (theory) guarantees the truncation is optimal; the numpy (computation) confirms every error matches the dropped-energy formula to machine precision; the hand computation in §31.4.2 makes the formula trustworthy.
  • Eigenvalues/eigenvectors reveal what a matrix really does. The singular values are the eigenvalues of $A^{\mathsf{T}}A$ (Chapter 30); ranking the rank-1 layers by them reveals the essential structure of the matrix, stripped of the negligible tail.
  • Toolkit contribution. You built toolkit/capstone/image_compression.py — load, SVD, reconstruct at rank $k$, report compression ratio, error, and energy — and verified the two invariants (error decreases with $k$; error equals the Eckart-Young tail). It joins svd.py (Chapter 30) and feeds pca.py (Chapter 32).

Where this leads (forward references)

This chapter is the hinge of Part VI: it turns the SVD of Chapter 30 into the workhorse of the rest of the book.

  • Chapter 32 (Principal Component Analysis). PCA is the SVD of centered data. Every quantity here reappears in statistical dress: the right singular vectors become principal components, the squared singular values become variances, the energy captured becomes the fraction of variance explained, and the scree-plot elbow becomes how many components to keep. The orthogonality of the components traces back to the spectral theorem of Chapter 27, applied to the covariance matrix — closing a loop the book has been drawing since Part V.
  • Chapter 33 (Application: Machine Learning). The matrix-factorization recommenders of Case Study 31.1 are low-rank approximation applied to a table of preferences (with matrix completion for the missing entries); dimensionality reduction is the standard preprocessing that makes high-dimensional models trainable; and word and image embeddings are learned low-dimensional vectors in the same spirit. The decompositions of Part VI are the linear algebra under the systems reshaping the world.
  • Chapter 38 (Numerical Linear Algebra). Computing the SVD of a large matrix stably and quickly — and the randomized and truncated algorithms that get just the top $k$ singular values without forming the whole decomposition — is its own subject, and it is what makes SVD compression and PCA feasible at the scale of real images and datasets.

The humble move of stopping a sum early is one of the most powerful ideas in applied mathematics. You can now compress an image with it; next, you will watch it become the engine of data science.