> Learning paths. Math majors — read everything, especially the statement and meaning of the Eckart-Young theorem (§31.4) and the Math-Major Sidebar on why the operator-norm optimality is the harder, deeper claim. CS / Data Science — focus on the...
Prerequisites
- chapter-30-singular-value-decomposition
Learning Objectives
- State the **Eckart-Young theorem** with its conditions: the truncated SVD $A_k = \sum_{i=1}^k \sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ is the *best* rank-$k$ approximation of $A$ in both the Frobenius and operator (spectral) norms.
- Explain a grayscale image as a matrix and reconstruct it at rank $k$, computing the storage cost $k(m+n+1)$ versus $mn$ and the compression ratio.
- Compute the relative Frobenius error of a rank-$k$ approximation and show it equals the energy in the dropped singular values, $\sqrt{\sum_{i>k}\sigma_i^2}\,/\,\lVert A\rVert_F$.
- Interpret the squared singular values as **energy** (variance) and read off the rank needed to capture a target fraction of it.
- Use truncation to **denoise** a matrix, recognizing that signal lives in the large singular values and noise spreads across the small ones.
- Describe low-rank approximation as the unifying idea behind image compression, denoising, and dimensionality reduction, teeing up PCA (Chapter 32).
In This Chapter
- 31.1 What does it mean to approximate a matrix with a simpler one?
- 31.2 How is a photograph a matrix?
- 31.3 How much storage does a rank-$k$ image cost?
- 31.4 What is the best rank-$k$ approximation? (the Eckart-Young theorem)
- 31.5 What does rank-$k$ image compression actually look like?
- 31.6 What is the "energy," and how do we choose the rank?
- 31.7 How does the SVD remove noise?
- 31.8 How is low-rank approximation the same as dimensionality reduction?
- 31.9 How do you build an SVD image compressor in code?
- 31.10 What have we built, and where does it lead?
SVD Applications: Image Compression, Noise Reduction, and Dimensionality Reduction
Learning paths. Math majors — read everything, especially the statement and meaning of the Eckart-Young theorem (§31.4) and the Math-Major Sidebar on why the operator-norm optimality is the harder, deeper claim. CS / Data Science — focus on the Geometric Intuition, the image-compression demo (§31.3, §31.5), the energy/variance discussion (§31.6), and the denoising and dimensionality-reduction sections; the proof sketches build intuition but are optional. Physics / Engineering — focus on the geometry of stacking rank-one layers, the energy interpretation, and the signal-versus-noise separation in §31.7, which is the mathematics under matched filtering and tomography.
In Chapter 30 we proved the single most useful theorem in applied linear algebra: every matrix — tall, wide, square, singular, real or complex — factors as $A = U\Sigma V^{\mathsf{T}}$, a rotation, then a stretch along perpendicular axes by the singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$, then another rotation. We watched the unit square get carried through rotate–stretch–rotate one last time in the visualizer, and we ended with a promise: this decomposition is not just beautiful, it is the engine behind the most consequential data-processing tricks of the modern world. This chapter is where we cash that promise, and it is the book's signature "wow" moment. Image compression using SVD is the anchor we have tracked since Chapter 30, and by the end of this chapter you will be able to take a real photograph, throw away most of its data, and watch it reconstruct so faithfully that your eye cannot tell the difference.
The whole chapter rests on one idea, and it is worth stating before any algebra so you know where we are going. Recall from Chapter 30 the outer-product form of the SVD: a matrix is a sum of rank-one pieces, each one a singular value times an outer product of a left and a right singular vector,
$$A = \sum_{i=1}^{r} \sigma_i\,\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}, \qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0,$$
where $r = \operatorname{rank}(A)$. Because the singular values come in decreasing order, this sum is a list of layers sorted from most important to least. The first layer $\sigma_1\mathbf{u}_1\mathbf{v}_1^{\mathsf{T}}$ carries the most of the matrix; the last layers carry almost nothing. So a natural, almost irresistible question presents itself: what if we keep only the first few layers and discard the rest? The answer — that you get the best possible low-rank approximation, by a theorem we will state precisely — is the foundation of image compression, signal denoising, and the dimensionality reduction that powers all of data science. Low-rank approximation explained in one sentence: stop the sum early.
The Key Insight — The SVD writes a matrix as a sum of rank-one layers $\sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ ordered by importance. Keeping the top $k$ layers gives the best rank-$k$ approximation there is — provably, in a precise norm. Compressing an image, removing noise, and reducing dimension are all the same operation: keep the big singular values, throw away the small ones.
As always, we lead with the picture. Let us see why stopping the sum early is the right thing to do before we prove that it is the best thing to do — and let us do it on the object that makes the whole idea unforgettable: a photograph.
31.1 What does it mean to approximate a matrix with a simpler one?
Start with the geometric content of "low rank," because the word rank has carried geometric weight throughout this book. Recall from Chapter 14 that the rank of a matrix is the dimension of its column space — the number of genuinely independent directions its output can reach. A rank-1 matrix is the most degenerate non-trivial transformation there is: its entire output lives on a single line. A rank-2 matrix flattens everything onto a plane. A full-rank $m \times n$ matrix can reach an entire $r$-dimensional subspace, where $r = \min(m, n)$ at most. So "approximate $A$ by a rank-$k$ matrix" means: find the transformation, using only $k$ independent output directions, that comes closest to doing what $A$ does.
Why would we ever want a worse transformation? Because rank is expensive, and most real matrices are almost low rank. A rank-$k$ matrix can be stored and applied with far less data than a full-rank one — we will count exactly how much in §31.3 — and the astonishing empirical fact, true of photographs, signals, and data tables alike, is that the singular values of real matrices plummet. A handful of large singular values carry the structure; a long tail of tiny ones carry only fine detail and noise. Discarding the tail costs almost nothing and saves almost everything. That trade — a tiny loss in fidelity for a huge gain in compactness — is the entire business of this chapter.
Geometric Intuition — Picture the SVD's geometry from Chapter 30: $A$ takes the unit sphere to an ellipsoid whose semi-axis lengths are the singular values $\sigma_1 \ge \sigma_2 \ge \cdots$. A rank-$k$ approximation flattens the ellipsoid, keeping its $k$ longest axes and collapsing the rest to zero. If the remaining axes are short — if the ellipsoid is already nearly $k$-dimensional, a thin pancake floating in a high-dimensional space — then flattening it loses almost nothing, because there was almost nothing in those thin directions to begin with. Low-rank approximation works precisely when the ellipsoid is lopsided, and real-world ellipsoids are wildly lopsided.
Each rank-1 layer $\sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ deserves a closer look, because it is the atom of everything that follows. From Chapter 8, an outer product $\mathbf{u}\mathbf{v}^{\mathsf{T}}$ of a column vector $\mathbf{u} \in \mathbb{R}^m$ with a row vector $\mathbf{v}^{\mathsf{T}}$ (where $\mathbf{v} \in \mathbb{R}^n$) is an $m \times n$ matrix — every column is a scalar multiple of $\mathbf{u}$, every row a scalar multiple of $\mathbf{v}^{\mathsf{T}}$. It is the simplest non-zero matrix possible: rank exactly 1. Because $\mathbf{u}_i$ and $\mathbf{v}_i$ are unit vectors (the singular vectors are orthonormal, Chapter 30), the scalar $\sigma_i$ out front carries the entire "size" of the layer. So the SVD sum is a stack of rank-1 layers, each weighted by how big it is, and the singular values tell you at a glance which layers matter.
Common Pitfall — A rank-1 matrix $\sigma\mathbf{u}\mathbf{v}^{\mathsf{T}}$ is not a small matrix — it is a full $m \times n$ array of numbers, the same shape as $A$. What is "small" about it is its information content: every one of its $mn$ entries is determined by just $m + n + 1$ numbers (the vector $\mathbf{u}$, the vector $\mathbf{v}$, and the scalar $\sigma$). Confusing the shape of a low-rank matrix with its storage cost is the error that hides the whole point of compression. The rank-$k$ approximation $A_k$ is the same size as $A$ on screen; it is dramatically cheaper to store and transmit.
31.2 How is a photograph a matrix?
Here is the bridge from abstract linear algebra to the anchor that makes this chapter sing. A grayscale image is literally a matrix. A photograph that is 512 pixels tall and 512 pixels wide is a $512 \times 512$ grid of numbers, one per pixel, each recording a brightness: conventionally $0$ for black, $255$ for white, and the values in between for shades of gray. Stack those numbers in a matrix $A$, with $a_{ij}$ the brightness of the pixel in row $i$, column $j$, and you have turned a picture into an object every theorem in this book applies to. (A color image is three such matrices — one each for red, green, and blue — and we compress each separately; we will keep to grayscale for clarity. For the broader engineering of pixels, files, and color, see the companion treatment of working with images.)
This is one of the most satisfying realizations in applied mathematics, so let us dwell on it. The rows of the image matrix are the horizontal scan-lines of the picture; the columns are the vertical ones. Smooth regions — a clear sky, a painted wall — are regions where neighboring entries are nearly equal, which makes the rows and columns there nearly linearly dependent. And linear dependence is exactly what low rank means (Chapter 6). A photograph of a calm sky is close to a low-rank matrix because vast stretches of it repeat the same few patterns; a photograph of static or confetti is close to full rank because every pixel is independent. Most real photographs live near the low-rank end, which is precisely why SVD compression works on them.
We can make the "smooth means low rank" claim exact with a tiny example. Consider a patch of sky modeled as a linear gradient — the brightness rising steadily across the patch, $a_{ij} = i + j$. Such a patch looks like it has $mn$ independent numbers, but it does not:
# A smooth gradient patch is exactly RANK 2 — almost all its energy in one layer.
import numpy as np
g = (np.arange(1, 6)[:, None] + np.arange(1, 7)[None, :]).astype(float) # a_ij = i + j
print(g.astype(int))
s = np.linalg.svd(g, compute_uv=False)
print("singular values:", np.round(s, 3))
print("rank:", int(np.sum(s > 1e-9)), " energy in top-1 layer:", round(s[0]**2/np.sum(s**2), 4))
[[ 2 3 4 5 6 7]
[ 3 4 5 6 7 8]
[ 4 5 6 7 8 9]
[ 5 6 7 8 9 10]
[ 6 7 8 9 10 11]]
singular values: [37.567 1.929 0. 0. 0. ]
rank: 2 energy in top-1 layer: 0.9974
A linear gradient of any size is exactly rank 2 — only two singular values are nonzero — and $99.74\%$ of its energy sits in the first layer alone. A $5 \times 6$ gradient that appears to need $30$ numbers is captured by a single rank-1 layer to about a quarter-percent of its energy. Real skies are not perfectly linear, so they have a few more nonzero singular values, but the lesson holds: smooth structure collapses into a handful of large singular values, leaving a long tail of near-zeros to discard. That collapse is exactly the compressibility the SVD exploits, and it is why a photograph — mostly smooth, with edges and texture as the "interesting" minority — is so much more compressible than random noise.
Geometric Intuition — Think of the rank-1 layers of an image as basis patterns. The first layer $\sigma_1\mathbf{u}_1\mathbf{v}_1^{\mathsf{T}}$ is the single rank-1 image that best matches the photo — typically a broad gradient capturing the overall light-to-dark sweep. The second layer adds the next-most-important pattern, perhaps a coarse division into bright and dark halves. Each successive layer adds finer structure: edges, then textures, then the faint speckle of sensor noise. Reconstructing the image at rank $k$ means summing the first $k$ of these patterns — building the picture from its most important visual ingredients down to its least. The visual progression from blurry to sharp that we are about to see is literally watching these patterns accumulate.
Once the image is a matrix $A$, we run Chapter 30's machinery on it: compute $A = U\Sigma V^{\mathsf{T}}$, read off the singular values, and reconstruct at rank $k$ by keeping the top $k$ layers. Everything in this chapter is that one move, examined from different angles. Let us define the central object precisely and then go straight to the photograph.
31.2.1 The truncated SVD, in locked notation
The rank-$k$ approximation of $A$, also called the truncated SVD, is the partial sum that keeps the $k$ largest singular values and their singular vectors:
$$\boxed{\;A_k = \sum_{i=1}^{k} \sigma_i\,\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}} = U_k\Sigma_k V_k^{\mathsf{T}}\;}$$
Here $U_k$ is the $m \times k$ matrix of the first $k$ left singular vectors (the first $k$ columns of $U$), $V_k$ is the $n \times k$ matrix of the first $k$ right singular vectors, and $\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k)$ is the $k \times k$ diagonal block of the top singular values. Two facts are immediate from Chapter 30 and worth fixing now. First, $A_k$ has rank exactly $k$ (assuming $\sigma_k > 0$): it is a sum of $k$ independent rank-1 layers. Second, when $k = r = \operatorname{rank}(A)$ the sum is complete and $A_k = A$ exactly — no approximation at all. The whole game is choosing $k$ smaller than $r$: small enough to save data, large enough to keep the picture. The next two sections quantify both sides of that trade.
31.3 How much storage does a rank-$k$ image cost?
Before we look at how good a rank-$k$ image looks, let us count how cheap it is, because the entire motivation is the trade between the two. Storing the full image $A$ naively costs one number per entry: an $m \times n$ image needs
$$\text{full storage} = mn \quad\text{numbers.}$$
For our running $512 \times 512$ photograph that is $512 \times 512 = 262{,}144$ numbers — a quarter-million pixels. Now count the rank-$k$ approximation. To reconstruct $A_k = U_k\Sigma_k V_k^{\mathsf{T}}$ we do not store the full $m \times n$ array; we store only the ingredients of the $k$ layers. Each layer $\sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ needs the left vector $\mathbf{u}_i$ ($m$ numbers), the right vector $\mathbf{v}_i$ ($n$ numbers), and the singular value $\sigma_i$ (1 number). So $k$ layers cost
$$\boxed{\;\text{rank-}k\text{ storage} = k(m + n + 1)\;}$$
numbers. (We can absorb each $\sigma_i$ into $\mathbf{u}_i$ to save the "$+1$," but keeping it makes the bookkeeping honest and the formula memorable.) The compression ratio is the full cost divided by the compressed cost:
$$\text{compression ratio} = \frac{mn}{k(m+n+1)}.$$
For $m = n = 512$ this is $\dfrac{262{,}144}{k \cdot 1025}$. Plug in a few ranks and the savings are dramatic:
| rank $k$ | storage $k(m{+}n{+}1)$ | compression ratio | % of original |
|---|---|---|---|
| 5 | 5{,}125 | $51.1\times$ | $2.0\%$ |
| 10 | 10{,}250 | $25.6\times$ | $3.9\%$ |
| 20 | 20{,}500 | $12.8\times$ | $7.8\%$ |
| 50 | 51{,}250 | $5.1\times$ | $19.6\%$ |
| 100 | 102{,}500 | $2.6\times$ | $39.1\%$ |
| 200 | 205{,}000 | $1.3\times$ | $78.2\%$ |
Read the last column carefully, because it carries the punchline. A rank-50 reconstruction uses under one-fifth of the original data; a rank-10 reconstruction uses under one-twenty-fifth. The break-even point — where the compressed form costs the same as the original — happens when $k(m+n+1) = mn$, i.e. $k = mn/(m+n+1) \approx 256$ for our image. Beyond that rank you are spending more than the raw pixels would cost, so low-rank compression only makes sense for $k$ well below $m n/(m+n)$. The art is finding the smallest $k$ whose picture still looks right — which is what the next two sections are about.
Check Your Understanding — A grayscale image is $1000 \times 800$ pixels. How many numbers does it take to store it naively? How many for a rank-40 approximation, and what is the compression ratio? Above what rank $k$ does the "compressed" version stop saving anything?
Answer
Naive: $mn = 1000 \times 800 = 800{,}000$ numbers. Rank-40: $k(m+n+1) = 40 \times (1000 + 800 + 1) = 40 \times 1801 = 72{,}040$ numbers — a compression ratio of $800{,}000 / 72{,}040 \approx 11.1\times$, using about $9\%$ of the original. Break-even is at $k = mn/(m+n+1) = 800{,}000/1801 \approx 444$; for any $k$ above that, storing the $k$ layers costs more than the raw image, so compression only pays off for $k$ comfortably below $\sim 444$.
31.4 What is the best rank-$k$ approximation? (the Eckart-Young theorem)
We now arrive at the theoretical heart of the chapter, and it is a genuinely remarkable result. We have been assuming that "keep the top $k$ singular layers" is a good way to approximate $A$ by a rank-$k$ matrix. The Eckart-Young theorem says it is the best way — there is no rank-$k$ matrix anywhere, by any construction, that approximates $A$ more closely. The truncated SVD is not merely a good low-rank approximation; it is the optimal one. This is the theorem that elevates SVD compression from a clever heuristic to a provably optimal method, and it is why the SVD, and not some other factorization, is the right tool.
To state it we need a way to measure how far apart two matrices are. The natural one is the Frobenius norm from Chapter 18, the matrix analogue of vector length: the square root of the sum of the squares of all the entries,
$$\lVert A \rVert_F = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\sum_{i} \sigma_i^2}.$$
The second equality — that the Frobenius norm equals the root-sum-of-squares of the singular values — is the fact from Chapter 30 that makes everything in this chapter computable, and we will use it constantly. We measure approximation error by $\lVert A - B \rVert_F$, the Frobenius norm of the difference. There is also the operator norm (or spectral norm) $\lVert A \rVert_2 = \sigma_1$, the largest singular value, which measures the worst-case stretch $A$ applies to any unit vector (Chapter 30). Eckart-Young is optimal in both norms simultaneously — a striking and useful coincidence.
The Eckart-Young theorem (optimal low-rank approximation). Let $A$ be any $m \times n$ matrix (real or complex) with singular value decomposition $A = U\Sigma V^{\mathsf{T}}$ and singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. For any integer $k$ with $1 \le k \le r$, let $A_k = \sum_{i=1}^{k}\sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ be the truncated SVD. Then $A_k$ is the closest rank-$k$ (or lower) matrix to $A$ in both the Frobenius and operator norms: $$\lVert A - A_k \rVert_F \;\le\; \lVert A - B \rVert_F \qquad\text{for every matrix } B \text{ with } \operatorname{rank}(B) \le k,$$ and the same inequality holds with $\lVert\cdot\rVert_2$ in place of $\lVert\cdot\rVert_F$. The minimum errors are exactly $$\lVert A - A_k \rVert_F = \sqrt{\sum_{i=k+1}^{r}\sigma_i^{2}}, \qquad \lVert A - A_k \rVert_2 = \sigma_{k+1}.$$ Conditions. The theorem holds for every matrix — no symmetry, squareness, or invertibility is required. In the Frobenius norm, the minimizer is unique precisely when there is a gap $\sigma_k > \sigma_{k+1}$ (the operator-norm minimizer is generally not unique); if $\sigma_k = \sigma_{k+1}$, the rank-$k$ subspace is not unique and there are several equally-good minimizers.
Look hard at the two error formulas, because they are the most useful equations in the chapter. The error of the best rank-$k$ approximation is governed entirely by the singular values you threw away. In the operator norm, the worst-case error is exactly the first discarded singular value $\sigma_{k+1}$ — the largest axis of the part you flattened. In the Frobenius norm, the total error is the root-sum-of-squares of all the discarded singular values, $\sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$. Both say the same intuitive thing: you lose exactly what you dropped, no more and no less, and because the singular values decrease, dropping the tail costs little. This is the precise version of the "flatten the thin axes of the ellipsoid" picture from §31.1.
The Key Insight — Eckart-Young: the truncated SVD $A_k$ is the provably best rank-$k$ approximation of $A$ in both the Frobenius and operator norms, and its error is exactly the energy in the discarded singular values, $\lVert A - A_k\rVert_F = \sqrt{\sum_{i>k}\sigma_i^2}$. You cannot do better with rank $k$, and you know your error in advance from the spectrum.
Check Your Understanding — A matrix has singular values $\sigma = (10, 8, 6, 1, 1, 1)$. What is the operator-norm error of the best rank-3 approximation? The Frobenius-norm error? What fraction of the energy does rank 3 capture?
Answer
Operator-norm error $= \sigma_4 = 1$ (the first dropped singular value). Frobenius-norm error $= \sqrt{\sigma_4^2 + \sigma_5^2 + \sigma_6^2} = \sqrt{1 + 1 + 1} = \sqrt 3 \approx 1.732$. Total energy $= 100 + 64 + 36 + 1 + 1 + 1 = 203$; energy captured by the top 3 is $(100 + 64 + 36)/203 = 200/203 \approx 0.9852$, so $98.5\%$. The squared relative error is $1 - 0.9852 = 0.0148 = 3/203$ — consistent with the Frobenius error $\sqrt 3$ over $\lVert A\rVert_F = \sqrt{203}$. Notice how cheap dropping the last three layers is: they are a tiny tail, so rank 3 is an excellent approximation. This is exactly the compressibility a steep spectrum buys you.
31.4.1 Why is the truncation optimal? (the idea of the proof)
A full proof belongs to the math-major track, but the idea is worth seeing because it explains why no clever rival can beat the SVD. Here is the four-part shape we use for important results.
1. Why we care. Optimality is what makes the SVD trustworthy. If some other rank-$k$ matrix could approximate $A$ better, every SVD-based compression, denoising, and PCA method would be leaving accuracy on the table. Eckart-Young guarantees they are not.
2. Key idea. Any rank-$k$ matrix $B$ has a $k$-dimensional column space, so $A - B$ must "cover" everything $A$ does outside that $k$-dimensional space. The most $A$ can be captured by any $k$ directions is captured by its own top $k$ singular directions, because those are by construction the directions of largest action (Chapter 30 built them by maximizing the stretch one orthogonal direction at a time). So no $k$-dimensional subspace can capture more of $A$ than its own top-$k$ singular subspace does — which is exactly what the truncated SVD keeps.
3. Proof (Frobenius case, sketch). Compute the error of the truncated SVD directly. Since $A - A_k = \sum_{i=k+1}^{r}\sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ is itself an SVD (orthonormal $\mathbf{u}_i$ and $\mathbf{v}_i$, with singular values $\sigma_{k+1}, \dots, \sigma_r$), its Frobenius norm is the root-sum-of-squares of its singular values: $\lVert A - A_k\rVert_F^2 = \sum_{i>k}\sigma_i^2$. That establishes the error formula. For the lower bound — that no $B$ does better — one shows that for any rank-$k$ matrix $B$, the $i$-th singular value of $A$ is controlled by the singular values of $A - B$ shifted by $k$ (a consequence of the Courant-Fischer min-max characterization of singular values, the same variational idea from the spectral theorem's Math-Major Sidebar in Chapter 27). Summing these inequalities gives $\lVert A - B\rVert_F^2 \ge \sum_{i>k}\sigma_i^2 = \lVert A - A_k\rVert_F^2$. $\blacksquare$
4. What this means. Geometrically: among all ways to flatten the action of $A$ down to $k$ dimensions, flattening onto its own top-$k$ singular axes preserves the most. The SVD is not one option among many — it is the unique optimal one, and the proof is the same "maximize stretch one orthogonal direction at a time" idea that built the singular values in the first place.
Math-Major Sidebar — The operator-norm version, $\min_{\operatorname{rank}(B)\le k}\lVert A - B\rVert_2 = \sigma_{k+1}$, is the part usually attributed to Eckart and Young (1936), with the generalization to all unitarily invariant norms (which includes both Frobenius and operator) due to Mirsky (1960); the result is often called the Eckart-Young-Mirsky theorem [verify]. The operator-norm bound has a clean one-line core: if $\operatorname{rank}(B) \le k$, then $\ker(B)$ has dimension at least $n - k$, so it intersects the $(k+1)$-dimensional span of $\mathbf{v}_1, \dots, \mathbf{v}_{k+1}$ in some unit vector $\mathbf{x}$; on that vector $B\mathbf{x} = \mathbf{0}$, so $\lVert (A-B)\mathbf{x}\rVert = \lVert A\mathbf{x}\rVert \ge \sigma_{k+1}$ because $\mathbf{x}$ lies in the top-$(k{+}1)$ right-singular subspace where $A$ stretches by at least $\sigma_{k+1}$. Hence $\lVert A - B\rVert_2 \ge \sigma_{k+1}$, achieved by the truncated SVD. The pigeonhole-on-dimensions step is the whole argument, and it is one of the most elegant in the subject.
31.4.2 A rank-$k$ approximation by hand
Before we let numpy compress a photograph, let us do a low-rank approximation entirely by pencil on a matrix small enough to hold in our heads. This is the hand-computation stage of the chapter, and it makes the error formula concrete. Take
$$A = \begin{bmatrix} 2 & 2 \\ 1 & -1 \end{bmatrix}.$$
We want its SVD, and there is a shortcut that avoids the full machinery of Chapter 30 here, because $A A^{\mathsf{T}}$ turns out diagonal. Recall from Chapter 30 that the left singular vectors $\mathbf{u}_i$ are the orthonormal eigenvectors of $A A^{\mathsf{T}}$ and the squared singular values $\sigma_i^2$ are its eigenvalues. Compute
$$A A^{\mathsf{T}} = \begin{bmatrix} 2 & 2 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ 2 & -1 \end{bmatrix} = \begin{bmatrix} 8 & 0 \\ 0 & 2 \end{bmatrix}.$$
It is already diagonal, so its eigenvalues are $8$ and $2$ with eigenvectors $\mathbf{e}_1 = (1,0)$ and $\mathbf{e}_2 = (0,1)$. Therefore the singular values are $\sigma_1 = \sqrt 8 = 2\sqrt 2 \approx 2.828$ and $\sigma_2 = \sqrt 2 \approx 1.414$, and the left singular vectors are $\mathbf{u}_1 = (1,0)$, $\mathbf{u}_2 = (0,1)$. To get each right singular vector, use the defining relation $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$ from Chapter 30, equivalently $\mathbf{v}_i = \tfrac{1}{\sigma_i}A^{\mathsf{T}}\mathbf{u}_i$. For $\mathbf{u}_1 = (1,0)$, $A^{\mathsf{T}}\mathbf{u}_1 = (2,2)$, so $\mathbf{v}_1 = \tfrac{1}{2\sqrt2}(2,2) = \tfrac{1}{\sqrt2}(1,1)$; for $\mathbf{u}_2 = (0,1)$, $A^{\mathsf{T}}\mathbf{u}_2 = (1,-1)$, so $\mathbf{v}_2 = \tfrac{1}{\sqrt2}(1,-1)$. Both right singular vectors are unit length and perpendicular, as they must be.
Now build the rank-1 approximation $A_1 = \sigma_1\mathbf{u}_1\mathbf{v}_1^{\mathsf{T}}$ — the single most important layer:
$$A_1 = 2\sqrt2 \cdot \begin{bmatrix} 1 \\ 0 \end{bmatrix}\cdot\tfrac{1}{\sqrt2}\begin{bmatrix} 1 & 1 \end{bmatrix} = 2\sqrt2\cdot\tfrac{1}{\sqrt2}\begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 0 & 0 \end{bmatrix}.$$
The best rank-1 approximation of $A$ keeps its first row and zeroes the second — sensible, since the first row $(2,2)$ has length $\sqrt 8$ while the second $(1,-1)$ has length $\sqrt 2$, so the first row carries four times the energy. Now check Eckart-Young by hand. The first thing the theorem promises is that the error is exactly the dropped singular value. The error matrix is $A - A_1 = \begin{psmallmatrix}0&0\\1&-1\end{psmallmatrix}$, whose Frobenius norm is $\sqrt{0 + 0 + 1 + 1} = \sqrt 2$ — and indeed $\sqrt2 = \sigma_2$, exactly the operator-norm and Frobenius-norm error formulas (they coincide here because only one singular value was dropped). The relative error is
$$\frac{\lVert A - A_1\rVert_F}{\lVert A\rVert_F} = \frac{\sqrt 2}{\sqrt{10}} = \frac{1}{\sqrt 5} \approx 0.4472,$$
since $\lVert A\rVert_F = \sqrt{2^2 + 2^2 + 1^2 + (-1)^2} = \sqrt{10}$. And the energy captured by the rank-1 layer is $\sigma_1^2 / (\sigma_1^2 + \sigma_2^2) = 8/10 = 80\%$ — so the squared relative error should be $1 - 0.8 = 0.2$, and indeed $(1/\sqrt5)^2 = 1/5 = 0.2$. The whole web of identities — error equals dropped singular value, squared relative error equals one minus energy captured — checks out on a matrix you can verify in your head. Let us confirm with numpy.
# A rank-1 approximation by hand, verified: A = [[2,2],[1,-1]].
import numpy as np
A = np.array([[2.0, 2.0], [1.0, -1.0]])
U, s, Vt = np.linalg.svd(A)
print("singular values:", np.round(s, 6)) # [2.828427, 1.414214] = √8, √2
A1 = s[0] * np.outer(U[:, 0], Vt[0]) # rank-1 layer σ₁ u₁ v₁ᵀ
print("A_1 =\n", np.round(A1, 6))
rel_err = np.linalg.norm(A - A1, "fro") / np.linalg.norm(A, "fro")
print("relative error:", round(rel_err, 6), " = σ₂/‖A‖_F =", round(s[1]/np.linalg.norm(A,'fro'), 6))
print("energy captured by rank-1:", round(s[0]**2 / np.sum(s**2), 6))
singular values: [2.828427 1.414214]
A_1 =
[[2. 2.]
[0. 0.]]
relative error: 0.447214 = σ₂/‖A‖_F = 0.447214
energy captured by rank-1: 0.8
The code reproduces the hand result exactly: $A_1 = \begin{psmallmatrix}2&2\\0&0\end{psmallmatrix}$, relative error $0.4472 = \sigma_2/\lVert A\rVert_F$, energy captured $0.8$. The rank-2 approximation, of course, would be $A$ itself ($k = r = 2$, no error). This is the photograph demo in miniature — same truncation, same formulas — just small enough to trust without a computer.
31.4.3 What does low-rank approximation look like geometrically?
We can also see the rank-1 approximation, using the recurring 2D transformation visualizer from Chapter 1 one last time, because the matrix above is $2 \times 2$. The full matrix $A$ takes the unit square to a parallelogram (a genuine 2D region, since $\operatorname{rank} A = 2$). The rank-1 approximation $A_1 = \begin{psmallmatrix}2&2\\0&0\end{psmallmatrix}$ collapses the square onto a line segment — its entire output lives on the $x$-axis, because the second row is zero. Low-rank approximation is, visibly, a flattening: the rank-1 version throws away one whole dimension of the output, replacing the 2D parallelogram with the best 1D segment through it.
# Low-rank approximation as flattening: A (rank 2) vs A_1 (rank 1) in the visualizer.
import numpy as np, matplotlib.pyplot as plt
from visualizer import visualize_2d
A = np.array([[2.0, 2.0], [1.0, -1.0]])
U, s, Vt = np.linalg.svd(A)
A1 = s[0] * np.outer(U[:, 0], Vt[0]) # rank-1: [[2,2],[0,0]]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
visualize_2d(A, title="A (rank 2): square → parallelogram", ax=ax1)
visualize_2d(A1, title="A₁ (rank 1): square → line segment", ax=ax2)
plt.show()
Figure 31.1. Low-rank approximation is geometric flattening. On the left, the full rank-2 matrix $A$ carries the unit square (dashed blue) to a 2D parallelogram (orange). On the right, the rank-1 approximation $A_1$ collapses the same square onto a one-dimensional segment along the $x$-axis — its output has lost a dimension. The "error" of the approximation is exactly the perpendicular extent that got flattened away, governed by the dropped singular value $\sigma_2 = \sqrt2$. Alt-text: two side-by-side transformation plots; the left shows a unit square mapped to a parallelogram, the right shows the same square collapsed onto a flat line segment, illustrating that a rank-1 approximation removes one output dimension.
Geometric Intuition — Every low-rank approximation in this chapter — of a $2\times2$ matrix, a $512\times512$ image, or a thousand-dimensional dataset — is this same flattening, just in higher dimensions. The full matrix's output ellipsoid has axes $\sigma_1 \ge \sigma_2 \ge \cdots$; the rank-$k$ approximation keeps the $k$ longest and sets the rest to zero, projecting the ellipsoid down onto its own $k$ widest axes. For our $2\times2$ matrix that means parallelogram $\to$ segment; for the image it means collapsing the pixel-ellipsoid from $512$ dimensions onto its $50$ widest; for data it means projecting a cloud onto its principal directions. Flatten the thin axes is the one geometric act behind compression, denoising, and dimensionality reduction alike.
31.5 What does rank-$k$ image compression actually look like?
Now the payoff — the figure progression that is the reason this chapter exists. We take a $512 \times 512$ grayscale photograph, compute its SVD, and reconstruct it at increasing ranks. To make every number in this chapter reproducible to the last digit, we use a deterministic test image built from a known, photograph-like singular-value spectrum (a power-law decay $\sigma_i = 300/i$, which is realistic for natural images), so that running the code below yields exactly the numbers printed here. The mathematics is identical for any photograph you load from disk; only the specific singular values change.
Here is the complete demonstration. Recall the indexing convention from earlier chapters: mathematics counts $\sigma_1, \mathbf{u}_1$ from 1, while numpy's s[0], U[:,0] count from 0 — the same objects, shifted labels.
# Image compression via the truncated SVD: rank-k reconstruction of a 512x512 image.
import numpy as np
m, n = 512, 512
rng = np.random.default_rng(42) # fixed seed: fully reproducible
Qu, _ = np.linalg.qr(rng.standard_normal((m, m))) # random orthonormal U
Qv, _ = np.linalg.qr(rng.standard_normal((n, n))) # random orthonormal V
r = min(m, n)
sigma = 300.0 / np.arange(1, r + 1) # photo-like spectrum: 300, 150, 100, ...
A = (Qu[:, :r] * sigma) @ Qv[:, :r].T # the "image" matrix A = U Σ Vᵀ
U, s, Vt = np.linalg.svd(A, full_matrices=False) # recompute SVD of A
froA = np.linalg.norm(A, "fro") # = sqrt(Σ σ_i²)
print("σ_1..σ_6 =", np.round(s[:6], 1))
print("‖A‖_F =", round(froA, 3))
for k in (5, 10, 50, 200):
Ak = (U[:, :k] * s[:k]) @ Vt[:k] # rank-k reconstruction
rel_err = np.linalg.norm(A - Ak, "fro") / froA # relative Frobenius error
energy = np.sum(s[:k]**2) / np.sum(s**2) # fraction of energy kept
print(f"k={k:3d} rel_err={rel_err:.4f} energy_kept={energy:.4f}")
σ_1..σ_6 = [300. 150. 100. 75. 60. 50.]
‖A‖_F = 384.537
k= 5 rel_err=0.3304 energy_kept=0.8908
k= 10 rel_err=0.2382 energy_kept=0.9433
k= 50 rel_err=0.1042 energy_kept=0.9891
k=200 rel_err=0.0430 energy_kept=0.9982
These four numbers tell the entire story of the figure progression, so let us walk through them as a reader would walk through the pictures.
Figure 31.2. Rank-$k$ reconstructions of a $512 \times 512$ grayscale photograph, for $k = 5, 10, 50, 200$, shown beside the original. At rank 5 the image is a smear of broad light-and-dark bands — you can tell roughly where the bright and dark regions are, but no detail survives; the relative error is $33\%$ and only $89.1\%$ of the energy is kept. At rank 10 the picture is a recognizable but blurry ghost — major shapes have emerged, edges are soft, fine texture is gone; relative error $23.8\%$, energy $94.3\%$. At rank 50 the photograph is clearly recognizable, with sharp major features and only a faint softness in fine textures; relative error has dropped to $10.4\%$ and energy to $98.9\%$. At rank 200 the reconstruction is visually indistinguishable from the original to the naked eye — relative error $4.3\%$, energy $99.8\%$ — yet it was built from far less than the original's information. Alt-text: five image panels in a row, progressing from an unrecognizable blur at rank 5, to a blurry ghost at rank 10, to a clear image at rank 50, to a sharp image at rank 200, to the original — illustrating that low-rank SVD reconstructions sharpen as the rank increases.
This is the "wow" of the SVD made visible. Rank-10 blurry, rank-50 recognizable, rank-200 indistinguishable — the exact progression the part introduction promised, and it falls straight out of stopping the sum $\sum_i \sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ at different points. Notice the law of diminishing returns baked into the singular values: going from rank 5 to rank 10 (5 more layers) cuts the error from $33\%$ to $24\%$, but going from rank 50 to rank 200 (150 more layers) only cuts it from $10\%$ to $4\%$. The first few layers do the heavy lifting; the long tail polishes detail. That is the decreasing spectrum at work — and it is exactly why the rank-50 image, at one-fifth the storage, is already good enough for most purposes.
Geometric Intuition — Watch what each successive batch of layers adds to the picture. The first layer lays down the broadest gradient. The next few carve the image into large bright and dark zones. The layers from 10 to 50 sharpen edges and add midscale structure — the difference between "a blurry ghost" and "clearly that photograph." The layers from 50 to 200 fill in fine texture and faint detail your eye barely registers. Each layer is a rank-1 pattern; the reconstruction is their running sum; and the singular value $\sigma_i$ is the volume knob on layer $i$. Because the knobs turn down so fast, the late layers are nearly silent — which is exactly why discarding them is nearly free.
Common Pitfall — SVD compression is not JPEG, and you should not expect it to beat the image format on your phone. Real-world codecs like JPEG exploit how human vision works — they discard high-frequency detail the eye cannot see, using the discrete cosine transform on small $8\times8$ blocks, plus entropy coding — and they routinely achieve $10\times$ compression with no visible loss, beating naive SVD on photographs. SVD's value is not as a photo format; it is as the general-purpose, provably optimal low-rank approximation that works on any matrix — signals, data tables, recommendation matrices, covariance matrices — where there is no human visual system to exploit and no notion of "frequency block." The image is the perfect teaching example because you can see the rank, but the real power of low-rank approximation is everywhere matrices appear. We will see in §31.7 and §31.8 that the same operation denoises and reduces dimension, jobs JPEG cannot do.
31.6 What is the "energy," and how do we choose the rank?
The word energy has appeared in every numbers table, and it is time to make it precise, because it is the quantity that tells you how many singular values to keep without ever looking at the picture. Define the energy of a matrix as the square of its Frobenius norm — the sum of the squared singular values:
$$\text{energy}(A) = \lVert A \rVert_F^2 = \sum_{i=1}^{r}\sigma_i^2.$$
This is "energy" by analogy with physics and signal processing, where energy is a sum of squared amplitudes; in statistics, after centering, the very same sum is the total variance of the data (a connection we make exact in Chapter 32). The point is that each singular value contributes $\sigma_i^2$ to the total, and because the $\sigma_i$ decrease, the first few contribute the lion's share. The fraction of energy captured by the top $k$ layers is
$$\text{energy captured} = \frac{\sum_{i=1}^{k}\sigma_i^2}{\sum_{i=1}^{r}\sigma_i^2} = \frac{\lVert A_k\rVert_F^2}{\lVert A\rVert_F^2}.$$
This single number is the standard dial for choosing $k$ in practice: pick the smallest $k$ that captures, say, $95\%$ or $99\%$ of the energy, and you are guaranteed (by Eckart-Young) that your relative error is small. The relationship is exact and beautiful — the relative Frobenius error and the captured energy are two faces of one coin:
$$\left(\frac{\lVert A - A_k\rVert_F}{\lVert A\rVert_F}\right)^2 = \frac{\sum_{i>k}\sigma_i^2}{\sum_i\sigma_i^2} = 1 - (\text{energy captured}).$$
In words: the squared relative error equals one minus the fraction of energy captured. Capture $99\%$ of the energy and your squared relative error is $1\%$, so your relative error is $\sqrt{0.01} = 10\%$; capture $99.8\%$ and your relative error is $\sqrt{0.002} \approx 4.5\%$. Check this against §31.5: rank 200 captured $99.82\%$ energy and had relative error $4.30\%$, and indeed $\sqrt{1 - 0.9982} = \sqrt{0.0018} = 0.0424 \approx 0.043$. The numbers in this chapter are not independent measurements — they are linked by this identity, and that linkage is the Eckart-Young error formula in disguise.
The Key Insight — Choosing the rank is choosing how much energy to keep. The squared relative error of the best rank-$k$ approximation is exactly $1 - (\text{fraction of energy captured}) = \sum_{i>k}\sigma_i^2 / \sum_i\sigma_i^2$. Plot the cumulative energy against $k$, pick the knee of the curve (or a target like $99\%$), and you have your rank — no guessing, no looking at the image.
The diagnostic plot that operationalizes this is the scree plot: the singular values (or their squares, the energies) plotted against their index. For a compressible matrix it drops steeply and then flattens into a long low tail; the "elbow" where it flattens is a natural place to truncate, because the layers beyond it contribute almost no energy. Let us compute the ranks our running image needs to hit several energy targets:
# Choosing the rank: smallest k that captures a target fraction of the energy.
import numpy as np
m, n = 512, 512
rng = np.random.default_rng(42)
Qu, _ = np.linalg.qr(rng.standard_normal((m, m)))
Qv, _ = np.linalg.qr(rng.standard_normal((n, n)))
r = min(m, n)
A = (Qu[:, :r] * (300.0 / np.arange(1, r + 1))) @ Qv[:, :r].T
s = np.linalg.svd(A, compute_uv=False) # singular values only
cum_energy = np.cumsum(s**2) / np.sum(s**2) # cumulative fraction
for target in (0.90, 0.95, 0.99, 0.999):
k = int(np.searchsorted(cum_energy, target) + 1) # +1: 0-indexed -> rank
print(f"capture {target:6.1%} of energy -> rank k = {k}")
capture 90.0% of energy -> rank k = 6
capture 95.0% of energy -> rank k = 12
capture 99.0% of energy -> rank k = 54
capture 99.9% of energy -> rank k = 278
Just 6 layers capture $90\%$ of this image's energy, 12 capture $95\%$, and 54 capture $99\%$ — out of $512$ available. To squeeze out the last tenth of a percent requires $278$ layers, more than half. This is the diminishing-returns curve quantified: the steep early drop of the singular values means a tiny rank captures most of the energy, while the long flat tail demands many layers for marginal gains. In practice you stop at the elbow — for this image, somewhere around rank $12$ to $54$ depending on how much fidelity you need — and reap a large compression ratio for a small, known error.
Real-World Application — Latent factors in recommender systems. The same energy-truncation logic powers the recommendation engines behind streaming and shopping. The giant matrix of "user $i$'s rating of item $j$" is mostly empty and, where filled, nearly low-rank: a few hidden "taste factors" (genre, mood, era) explain most of who likes what. Truncating its SVD to the top $k$ singular values keeps those dominant taste factors and discards idiosyncratic noise, yielding compact user- and item-vectors whose dot products predict missing ratings. The number of factors $k$ is chosen exactly as above — enough to capture the structure, few enough to generalize. We develop this matrix-factorization view of recommenders in Chapter 33; it is low-rank approximation applied to a table of preferences.
31.7 How does the SVD remove noise?
The third great application, and the one that reveals low-rank approximation as more than compression, is denoising — and it rests on a single, sharp observation: signal concentrates in the large singular values; noise spreads across the small ones. If your data has genuine low-rank structure but is contaminated by random noise, the structure shows up as a few large singular values standing well above a "bulk" of small noise singular values. Truncating the SVD below the bulk keeps the signal and discards the noise — and by Eckart-Young, it does so optimally.
Here is why this works, and it is one of the most useful facts in applied linear algebra. Genuine structure is coherent: the rows and columns of a clean signal are related, which makes the signal nearly low-rank, so its energy piles into a few singular values. Random noise is incoherent: it has no preferred direction, so its energy spreads thinly and evenly across all the singular values. (For an $m \times n$ matrix of independent noise with standard deviation $\eta$, random-matrix theory says the noise singular values fill an interval up to roughly $\eta(\sqrt{m} + \sqrt{n})$ — the Marchenko-Pastur edge [verify].) When the signal's singular values rise above that noise floor, there is a visible gap in the spectrum, and truncating in the gap cleanly separates the two. Let us watch it happen on a concrete example: a rank-2 "signal" buried in noise.
# Denoising by truncation: keep the singular values above the noise floor.
import numpy as np
rng = np.random.default_rng(7)
m = n = 100
unit = lambda x: x / np.linalg.norm(x)
u1, v1 = unit(rng.standard_normal(m)), unit(rng.standard_normal(n))
u2, v2 = unit(rng.standard_normal(m)), unit(rng.standard_normal(n))
clean = 50 * np.outer(u1, v1) + 30 * np.outer(u2, v2) # rank-2 signal: σ ≈ 50, 30
noisy = clean + 0.5 * rng.standard_normal((m, n)) # add incoherent noise
s = np.linalg.svd(noisy, compute_uv=False)
print("noisy σ_1..σ_6 =", np.round(s[:6], 3)) # two big, then a noise bulk
U, sv, Vt = np.linalg.svd(noisy, full_matrices=False)
denoised = (U[:, :2] * sv[:2]) @ Vt[:2] # truncate to rank 2
err_noisy = np.linalg.norm(noisy - clean, "fro") / np.linalg.norm(clean, "fro")
err_denoised = np.linalg.norm(denoised - clean, "fro") / np.linalg.norm(clean, "fro")
print(f"relative error noisy = {err_noisy:.4f}, denoised = {err_denoised:.4f}")
print(f"improvement factor = {err_noisy / err_denoised:.2f}x")
noisy σ_1..σ_6 = [50.594 29.734 9.622 9.516 9.276 8.927]
relative error noisy = 0.8549, denoised = 0.1651
improvement factor = 5.18x
The spectrum tells the story at a glance. The two signal singular values, $50.6$ and $29.7$, tower above a bulk of noise singular values that tops out around $9.6$ — and that edge matches the theory: for $m = n = 100$ and noise level $\eta = 0.5$, the Marchenko-Pastur prediction $\eta(\sqrt{m} + \sqrt{n}) = 0.5(10 + 10) = 10$ lands right at the observed bulk edge. The gap between $29.7$ and $9.6$ is the signal-noise separation, and truncating to rank 2 (just below the gap) recovers the clean signal: the relative error drops from $0.855$ (the raw noisy matrix) to $0.165$ — a 5.2-fold improvement — even though we never knew the clean signal; we read its rank straight off the spectrum.
Geometric Intuition — Denoising is the same flattening as compression, aimed differently. Compression flattens the thin ellipsoid axes because they cost storage we are willing to lose; denoising flattens them because they are the noise — random, structureless directions added on top of the true signal. In both cases Eckart-Young guarantees that keeping the top $k$ axes is the optimal rank-$k$ summary. The only difference is interpretation: in compression the small singular values are fine detail we sacrifice; in denoising they are corruption we delete. The geometry — collapse the short axes — is identical.
Common Pitfall — Denoising by truncation works only when the signal is genuinely low-rank and the gap is real. If the clean signal is itself full-rank (rich, high-dimensional structure with no dominant directions), there is no spectral gap to truncate in, and cutting the small singular values throws away real signal along with the noise — you blur, you do not denoise. Always look at the scree plot first: a clear elbow (a few big values, then a flat bulk) is the license to truncate; a smooth, gapless decay is a warning that low-rank denoising will cost you signal. The method is powerful but not universal; the spectrum tells you whether it applies.
Real-World Application — Background removal in video and static-image cleanup. A surveillance video, stacked so each frame is a column of a matrix, is nearly low-rank in its background (the static scene repeats frame to frame) plus a sparse foreground (the moving people and cars). Separating the low-rank background from the sparse-plus-noise foreground — robust PCA, a cousin of the truncation here — is how background subtraction works in computer vision. The same idea cleans up scanned documents (low-rank page, sparse ink), seismographs, and MRI scans (where low-rank structure across measurements removes acquisition noise). Wherever the truth is structured and the corruption is not, the SVD's separation of large from small singular values is the tool.
31.7.1 How do you choose where to truncate for denoising?
The natural follow-up question is where to make the cut, and here denoising differs subtly from compression. In compression you choose $k$ by a fidelity target — keep $95\%$ of the energy, say. In denoising you choose $k$ by the spectral gap: keep every singular value that stands clearly above the noise bulk, and cut everything inside it. The principled threshold comes from the random-matrix prediction we used above: discard any singular value below roughly the Marchenko-Pastur edge $\eta(\sqrt m + \sqrt n)$, where $\eta$ is the noise standard deviation, because singular values below that line are statistically indistinguishable from pure noise. In our example the edge was $0.5(\sqrt{100}+\sqrt{100}) = 10$, the signal values $50.6$ and $29.7$ sat far above it, and the noise bulk topped out right at $\approx 9.6$ — so "keep everything above $10$" gives rank 2, the true signal rank. When you do not know $\eta$ in advance, you estimate it from the bulk itself (the median singular value is a robust gauge) or simply read the elbow off the scree plot. The key conceptual point is that denoising truncation is driven by the structure of the noise floor, not by a fidelity budget — you cut where signal stops and noise begins.
This also explains why over-keeping and under-keeping both hurt, in opposite ways. Keep too few layers (truncate above a genuine signal singular value) and you discard real structure — you blur. Keep too many (reach down into the noise bulk) and you re-admit the noise you were trying to remove — in our example, the rank-5 reconstruction, which keeps three noise components, has relative error $0.326$, nearly twice as bad as the rank-2 cut at $0.165$. The optimal truncation sits exactly in the gap, and the wider the gap (the higher the signal-to-noise ratio), the more forgiving the choice. A clear spectral gap is both the license to denoise and the guide to where.
31.8 How is low-rank approximation the same as dimensionality reduction?
We close the conceptual arc by naming the third face of the one idea, the face that opens onto the rest of Part VI. Dimensionality reduction is the task of representing high-dimensional data — points in $\mathbb{R}^n$ for large $n$ — using far fewer coordinates, while preserving as much of the data's structure as possible. And it is, once again, low-rank approximation. Stack your data as rows of an $m \times n$ matrix $A$ ($m$ data points, each an $n$-dimensional vector). The truncated SVD $A_k$ is the best rank-$k$ approximation of that matrix, which means it is the best way to represent every data point using only $k$ underlying directions — the top $k$ right singular vectors $\mathbf{v}_1, \dots, \mathbf{v}_k$. Project each data point onto those $k$ directions and you have compressed an $n$-dimensional dataset to a $k$-dimensional one, losing only the energy in the dropped singular values.
This is exactly the move that makes high-dimensional data tractable. A dataset with hundreds of features (columns) often has only a handful of genuinely independent directions of variation; the rest is redundancy and noise. The SVD finds those directions — the right singular vectors, ordered by how much of the data's spread they capture (the singular values) — and lets you keep only the important ones. The coordinates of the data in this reduced basis are its scores; the directions themselves are the new, compact axes. Whether you call it compression (of an image), denoising (of a signal), or dimensionality reduction (of a dataset), the operation is identical: compute the SVD, keep the top $k$ singular layers, discard the rest.
Geometric Intuition — A cloud of high-dimensional data points is an ellipsoidal blob (Chapter 28's quadratic-form picture), and most real blobs are lopsided — wide in a few directions, thin in most. Dimensionality reduction finds the few wide directions (the top singular vectors) and projects the cloud onto them, squashing the thin directions to zero. It is the same "flatten the thin axes of the ellipsoid" picture as compression and denoising, now applied to a data cloud instead of an image or a signal. The wide axes are the meaningful variation; the thin axes are redundancy and noise; and the SVD ranks them for you.
There is one crucial refinement that separates raw SVD dimensionality reduction from its most famous incarnation, and it is the entire subject of the next chapter. Principal Component Analysis is dimensionality reduction done on the centered data — you first subtract the mean of each feature so the cloud is centered at the origin — and then the right singular vectors are called principal components, the singular values squared are the variances along them, and the energy fraction becomes the fraction of variance explained. PCA is the SVD applied to centered data, and every quantity in this chapter — the rank-$k$ truncation, the energy captured, the scree-plot elbow, the Eckart-Young optimality — reappears there in the language of statistics. Chapter 32 makes the connection exact and shows why the principal components are also the eigenvectors of the covariance matrix (the spectral theorem of Chapter 27, come full circle).
Real-World Application — Compressing feature spaces in data science and machine learning. Before training a model on data with thousands of features — gene-expression levels, word counts in documents, sensor readings — practitioners routinely reduce dimension with the SVD (as PCA) to a few dozen components that capture most of the variance. This speeds up training, reduces overfitting, and reveals latent structure. The broader practice of dimensionality reduction — including PCA, its kernelized and randomized variants, and nonlinear methods — is built on the low-rank approximation idea of this chapter; the SVD is the linear foundation under all of it. We meet it again as a preprocessing step for machine-learning models in Chapter 33.
31.9 How do you build an SVD image compressor in code?
We have reached the chapter's contribution to the from-scratch toolkit you have been assembling since Chapter 2 — and, fittingly for the book's signature application, it is a complete, runnable demo rather than a single primitive. Everything we need is already in your toolkit: svd.py from Chapter 30 computes the decomposition, and the rank-$k$ reconstruction is a sum of outer products you can build from matrices.py (Chapter 8). The new module wires these together into the image-compression pipeline of this chapter and verifies the two laws we proved: that reconstruction error decreases as $k$ grows, and that it equals the energy in the dropped singular values (Eckart-Young).
Build Your Toolkit — Implement
image_compression.pyintoolkit/capstone/. It should: (1) load a grayscale image into a matrix with Pillow (from PIL import Image;np.asarray(Image.open(path).convert("L"), dtype=float)); (2) compute its SVD (call yoursvd_from_scratchfrom Chapter 30, ornp.linalg.svdto start); (3) reconstruct at rank $k$ as $A_k = \sum_{i=1}^{k}\sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ for a list of ranks; and (4) report, for each $k$, the compression ratio $mn / (k(m+n+1))$, the relative Frobenius error $\lVert A - A_k\rVert_F / \lVert A\rVert_F$, and the energy captured $\sum_{i\le k}\sigma_i^2 / \sum_i\sigma_i^2$. Then verify two invariants against numpy: that the relative error is monotonically decreasing in $k$, and that it matches the Eckart-Young formula $\sqrt{\sum_{i>k}\sigma_i^2}/\lVert A\rVert_F$ to tolerance. This capstone joinssvd.py(Chapter 30) and is the demo you can run on any photograph; the PCA module of Chapter 32 reuses the same truncation machinery on centered data.
Here is the shape of the core routine, written with numpy so you can confirm the expected behavior before wiring in your from-scratch SVD:
# Expected behavior of toolkit/capstone/image_compression.py — verify against numpy.
import numpy as np
def compress_image(A, ranks):
"""SVD-compress matrix A at each rank in `ranks`; return per-rank stats."""
A = np.asarray(A, dtype=float)
m, n = A.shape
U, s, Vt = np.linalg.svd(A, full_matrices=False)
froA = np.linalg.norm(A, "fro")
report = []
for k in ranks:
Ak = (U[:, :k] * s[:k]) @ Vt[:k] # rank-k reconstruction
rel_err = np.linalg.norm(A - Ak, "fro") / froA # relative error
ey_tail = np.sqrt(np.sum(s[k:]**2)) / froA # Eckart-Young error
ratio = (m * n) / (k * (m + n + 1)) # compression ratio
energy = np.sum(s[:k]**2) / np.sum(s**2) # energy captured
assert np.isclose(rel_err, ey_tail) # the two must agree
report.append((k, ratio, rel_err, energy))
errs = [row[2] for row in report]
assert all(errs[i] > errs[i + 1] for i in range(len(errs) - 1)) # monotone decrease
return report
A = np.diag([10.0, 6.0, 3.0, 1.0]) # tiny 4x4 test matrix, σ = 10,6,3,1
for k, ratio, err, energy in compress_image(A, [1, 2, 3]):
print(f"k={k} ratio={ratio:.2f}x rel_err={err:.4f} energy={energy:.4f}")
k=1 ratio=1.78x rel_err=0.5613 energy=0.6849
k=2 ratio=0.89x rel_err=0.2617 energy=0.9315
k=3 ratio=0.59x rel_err=0.0828 energy=0.9932
The assertions are the heart of the verification: at every rank the relative error equals the Eckart-Young tail formula, and the errors decrease as $k$ grows. For this diagonal test matrix the singular values are just $10, 6, 3, 1$, so the energy is $\sum\sigma_i^2 = 100 + 36 + 9 + 1 = 146$; the rank-1 approximation keeps $100/146 = 0.6849$ of it, with relative error $\sqrt{1 - 0.6849} = \sqrt{0.3151} = 0.5613$ — exactly the printed value. (The compression ratio is below $1\times$ for $k \ge 2$ here only because the matrix is a tiny $4 \times 4$, where $k(m{+}n{+}1)$ quickly exceeds $mn = 16$; on the $512\times512$ image of §31.3 the ratios were the dramatic $51\times$ to $1.3\times$ of that table. The point of this toy is the invariants, not the ratio.) Every number in this chapter is linked to every other by the same identity, and your toolkit confirms it.
Historical Note — The optimality of the truncated SVD for low-rank approximation was published by the psychometricians Carl Eckart and Gale Young in 1936, in the context of approximating data matrices in factor analysis — a striking early appearance of the SVD in data science rather than pure mathematics [verify]. Leon Mirsky extended the result in 1960 to the entire family of unitarily invariant norms (which includes both the Frobenius and operator norms at once), which is why the full statement is often called the Eckart-Young-Mirsky theorem [verify]. The decomposition itself is older, with roots in the work of Beltrami and Jordan in the 1870s; that this 19th-century factorization, sharpened by a 1936 optimality theorem, would become the workhorse of 21st-century image compression and recommender systems is a vivid case of the book's recurring theme — the same mathematics, rediscovered as useful again and again across the fields and the decades.
31.10 What have we built, and where does it lead?
We began with a photograph and a single idea — the SVD writes a matrix as a sum of rank-one layers $A = \sum_i \sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ ordered by importance — and from that idea everything in this chapter followed. Low-rank approximation is what you get by stopping the sum early, keeping the truncated SVD $A_k = \sum_{i=1}^{k}\sigma_i\mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$; and the Eckart-Young theorem elevates this from a good idea to the provably optimal one — no rank-$k$ matrix, by any construction, approximates $A$ more closely in the Frobenius or operator norm, and the error is exactly the energy in the singular values you dropped, $\sqrt{\sum_{i>k}\sigma_i^2}$. We saw the theorem made visible in image compression: a rank-10 reconstruction is a blurry ghost, rank-50 is clearly recognizable, and rank-200 is indistinguishable from the original, each using a fraction of the storage $mn$ — only $k(m+n+1)$ numbers. We measured the trade with the energy captured, the fraction $\sum_{i\le k}\sigma_i^2 / \sum_i\sigma_i^2$ whose complement is the squared relative error, and used it to choose the rank at the scree-plot elbow. We watched the same truncation denoise a signal — signal lives in the large singular values, noise spreads across the small ones — and we recognized dimensionality reduction as the very same operation applied to a data cloud.
The threshold concept to carry forward is the unity of these three: compression, denoising, and dimensionality reduction are one operation seen three ways — throw away the small-singular-value layers. The interpretation changes (fine detail, noise, redundancy), but the mathematics is identical and the optimality is guaranteed by the same theorem. This is the book's recurring theme made literal: the same SVD that compresses an image denoises a signal and reduces dimension — learn it once, use it everywhere. Keep the vocabulary close: the truncated SVD and rank-$k$ approximation, the Eckart-Young theorem and its error formulas, the Frobenius and operator norms, the energy (or variance) captured, the scree plot and its elbow, and the separation of signal from noise in the spectrum.
The forward references are immediate and they complete Part VI. Chapter 32 (Principal Component Analysis) is the direct sequel: PCA is the SVD applied to centered data, where the right singular vectors become the principal components, the squared singular values become the variances, and the energy captured becomes the fraction of variance explained — every quantity in this chapter, re-read in the language of statistics, and connected back to the spectral theorem of Chapter 27 through the covariance matrix. Chapter 33 (Application: Machine Learning) then shows these decompositions powering the systems reshaping the world: the matrix-factorization recommenders we previewed in §31.6 are low-rank approximation applied to a table of preferences, and dimensionality reduction is the standard preprocessing that makes models trainable. The humble move of stopping a sum early — of keeping the big singular values and discarding the small — turns out to be one of the most powerful ideas in all of applied mathematics. You can now compress an image with it, and you are ready to see it become the engine of data science.
# Forward look: the SVD of centered data IS PCA — the variance version of this chapter.
import numpy as np
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ np.diag([5, 3, 1, 0.4, 0.2]) # lopsided data
Xc = X - X.mean(axis=0) # center it (PCA step)
s = np.linalg.svd(Xc, compute_uv=False) # SVD of centered data
var_explained = s**2 / np.sum(s**2) # = fraction of variance
print("variance explained per component:", np.round(var_explained, 4))
print("cumulative:", np.round(np.cumsum(var_explained), 4))
variance explained per component: [0.6904 0.272 0.0302 0.0062 0.0013]
cumulative: [0.6904 0.9624 0.9925 0.9987 1. ]
The first two singular directions of this centered data capture $96\%$ of its variance — the same energy-truncation logic of §31.6, now reading "variance explained" instead of "energy captured." Two coordinates suffice where there were five. That is dimensionality reduction, that is PCA, and it is the truncated SVD of this chapter wearing a statistician's hat. Hold onto the picture of stacking rank-one layers and keeping only the big ones; it is about to organize all of data science.