The Singular Value Decomposition: The Most Important Factorization in Linear Algebra

DataField.Dev

53 min read

> Learning paths. Math majors — read everything, especially the existence proof in §30.5 (the SVD built from the spectral theorem on $A^{\mathsf{T}}A$) and the four-subspaces theorem in §30.8. CS / Data Science — focus on the rotate–stretch–rotate...

Prerequisites

chapter-27-spectral-theorem
chapter-21-orthogonal-matrices-and-rotations

Learning Objectives

State and interpret the Singular Value Decomposition $A = U\Sigma V^{\mathsf{T}}$ for **any** real matrix — square or rectangular, symmetric or not, full-rank or rank-deficient — and explain why it exists for every matrix when diagonalization does not.
Read the SVD geometrically as **rotate–stretch–rotate**: $V^{\mathsf{T}}$ is a rotation/reflection, $\Sigma$ an axis-aligned stretch by the singular values, and $U$ a final rotation/reflection.
Derive the SVD from the **Spectral Theorem** (Chapter 27) applied to the symmetric matrix $A^{\mathsf{T}}A$: right singular vectors are its eigenvectors, singular values are $\sigma_i = \sqrt{\lambda_i}$, and left singular vectors are $\mathbf{u}_i = A\mathbf{v}_i/\sigma_i$.
Compute a full $2\times 2$ SVD by hand via $A^{\mathsf{T}}A$ and reconcile it with `np.linalg.svd`, accounting for the sign and ordering conventions.
Use the columns of $U$ and $V$ to write orthonormal bases for all **four fundamental subspaces** (Chapter 14), and read the **rank** off the count of nonzero singular values.
Relate the singular values to the **operator norm**, the **Frobenius norm**, and the **condition number**, distinguishing the full SVD from the reduced SVD.

In This Chapter

30.1 What does the SVD say a matrix does to space?
30.2 What are the singular values and singular vectors?
30.3 How do you compute the SVD of a matrix? A full $2\times 2$ by hand
30.4 Does the hand computation match numpy?
30.5 Why does every matrix have an SVD? (the spectral theorem on $A^{\mathsf{T}}A$)
30.6 How is the SVD related to eigen-decomposition?
30.7 How do you read the rank off the singular values?
30.8 How does the SVD reveal the four fundamental subspaces?
30.9 What is the difference between the full and reduced SVD, and what do the singular values measure?
30.10 What is the pseudoinverse, and how does the SVD solve every least-squares problem?
30.11 What does the SVD look like? (the visualizer returns one last time)
30.12 How do you build the SVD from scratch?
30.13 What have we built, and where does it lead?

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

The Singular Value Decomposition: The Most Important Factorization in Linear Algebra

Learning paths. Math majors — read everything, especially the existence proof in §30.5 (the SVD built from the spectral theorem on $A^{\mathsf{T}}A$) and the four-subspaces theorem in §30.8. CS / Data Science — focus on the rotate–stretch–rotate geometry (§30.1–30.2), the worked example, the four subspaces, and the norm/rank/condition-number readings that power everything in Part VI; the existence proof can be skimmed. Physics / Engineering — focus on the geometry, the connection to $A^{\mathsf{T}}A$ and the spectral theorem, the pseudoinverse and least-squares case study, and the condition number that tees up Chapter 38.

If you remember one theorem from this entire book a decade from now, make it this one. The Singular Value Decomposition — the SVD — is the most important factorization in linear algebra, and it is not a close contest. Gilbert Strang calls it the climax of the subject; numerical analysts treat it as the reliable workhorse they reach for when nothing else can be trusted. Across the rest of this book it will compress images (Chapter 31), extract principal components (Chapter 32), power recommender systems (Chapter 33), solve least-squares problems, denoise signals, and reveal the rank and numerical health of any matrix. One decomposition does all of that. Learn it once, use it everywhere is not a slogan in Part VI; it is a literal description of what the SVD does.

Here is the headline result, the equation we will spend the chapter understanding. For any real matrix $A$ — and I mean any: square or rectangular, symmetric or wildly asymmetric, invertible or rank-deficient, $m\times n$ for any $m$ and $n$ — there exist an orthogonal matrix $U$, a "diagonal" matrix $\Sigma$ of non-negative numbers, and an orthogonal matrix $V$ such that

$$A = U\Sigma V^{\mathsf{T}}.$$

That is the singular value decomposition. The diagonal entries of $\Sigma$ are the singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$; the columns of $V$ are the right singular vectors; the columns of $U$ are the left singular vectors. The whole chapter is an unpacking of this single line: what each factor does to space, why such a factorization must exist for every matrix, and why that universality makes the SVD the keystone of applied linear algebra.

Before any of the algebra, sit with what is SVD asking, because the answer is the cleanest geometric statement in the book. We will see that every linear map is a rotation, then a stretch along perpendicular axes, then another rotation. That is the entire content of $A = U\Sigma V^{\mathsf{T}}$. The transformation $\mathbf{x}\mapsto A\mathbf{x}$ might look like an inscrutable tangle of stretching, shearing, and rotating — but the SVD says it is always, secretly, just three simple motions performed in sequence. We will watch this happen frame by frame in our recurring 2D visualizer, where the unit circle becomes an ellipse whose axis lengths are exactly the singular values.

This chapter opens Part VI, and it sits at the convergence of everything you have built. From Chapter 21 you have orthogonal matrices — the rotations and reflections that preserve length, satisfying $Q^{\mathsf{T}}Q = I$. From Chapter 14 you have the four fundamental subspaces — column space, null space, row space, left null space — the organizing framework of the subject. And from Chapter 27 you have the Spectral Theorem: every symmetric matrix is orthogonally diagonalizable, $S = Q\Lambda Q^{\mathsf{T}}$. The SVD weaves all three together. In fact, the SVD is the spectral theorem in disguise — applied not to $A$ (which need not be symmetric, or even square) but to the matrix $A^{\mathsf{T}}A$, which is always symmetric, no matter what $A$ is. That single observation, which we develop in §30.5, is the engine that makes the SVD exist for every matrix in the world. Diagonalization had fine print — it needed square matrices with enough eigenvectors, and it could fail. The SVD has no fine print at all. That universality is the gift of Part VI, and we begin, as always, with the picture.

30.1 What does the SVD say a matrix does to space?

Forget formulas for a moment and ask the only question that matters first: what does the transformation $\mathbf{x}\mapsto A\mathbf{x}$ look like? We have answered this for special matrices. A rotation (Chapter 21) spins every vector by a fixed angle. A diagonal matrix stretches along the coordinate axes. A symmetric matrix (Chapter 27) stretches along perpendicular eigen-axes and does nothing else. But a general matrix — say $A = \begin{psmallmatrix}2 & 2\\ -1 & 1\end{psmallmatrix}$ — seems to do an unholy mixture of everything at once: it stretches, it rotates, and it shears, all tangled together. Untangling that mess was hard in Chapter 7, and diagonalization could not always help, because a general matrix may have complex eigenvalues (Chapter 26) or too few eigenvectors (the defective case of Chapter 25).

The SVD cuts straight through the tangle with a startling claim: no matter how complicated $A$ looks, its action on space is always exactly three simple motions performed in order — a rotation (or reflection), then a stretch along perpendicular axes, then another rotation (or reflection). Reading $A\mathbf{x} = U\Sigma V^{\mathsf{T}}\mathbf{x}$ right to left, the input vector $\mathbf{x}$ first meets $V^{\mathsf{T}}$, which rotates it; then meets $\Sigma$, which stretches each coordinate axis by a singular value; then meets $U$, which rotates the stretched result into its final position. Rotate, stretch, rotate. Every linear map factors this way. The shearing and twisting you saw in a general matrix is an illusion created by performing these three clean motions back to back.

Geometric Intuition — Picture the unit circle, the set of all vectors of length 1. Apply $A$ to every point on it. The image is always an ellipse (in 2D; an ellipsoid in higher dimensions). The SVD reads this ellipse off completely. $V^{\mathsf{T}}$ rotates the circle (a circle looks the same after rotating, so nothing visible happens yet — but it lines up the special input directions with the coordinate axes). $\Sigma$ then stretches the circle into an axis-aligned ellipse, by $\sigma_1$ along the first axis and $\sigma_2$ along the second. Finally $U$ rotates that ellipse into its actual orientation. The singular values are the lengths of the ellipse's semi-axes, and the columns of $U$ are the directions those axes point. The SVD turns "what does $A$ do to space?" into "draw the ellipse, read off its axis lengths and directions."

This is worth slowing down on, because it is the conceptual heart of the chapter. The two rotations $U$ and $V$ contribute no stretching — they are rigid motions that preserve every length and angle (Chapter 21). All of the distortion, all of the actual squashing and stretching that $A$ does, is concentrated in the middle factor $\Sigma$, and $\Sigma$ does the simplest possible kind of stretching: independent scaling along perpendicular axes. The genius of the SVD is that it isolates a matrix's distortion into one diagonal factor, sandwiched between two distortion-free rotations. The complexity of a general transformation was never in the stretching — stretching is easy. It was in the fact that the "input axes" you stretch along and the "output axes" they land on are different sets of perpendicular directions. The SVD names both sets: $V$'s columns are the input axes, $U$'s columns are the output axes.

The Key Insight — Every linear transformation is rotate–stretch–rotate: $A = U\Sigma V^{\mathsf{T}}$ first rotates the input by $V^{\mathsf{T}}$, then stretches along perpendicular axes by the singular values in $\Sigma$, then rotates again by $U$. All of a matrix's distortion lives in the diagonal $\Sigma$; the two orthogonal factors are pure, length-preserving rotations. This holds for every matrix without exception.

30.1.1 Why two rotations, and not one?

You might reasonably ask: diagonalization used only one change of basis, $A = PDP^{-1}$ — rotate into eigen-coordinates, scale, rotate back the same way. Why does the SVD need two different rotations, $U$ and $V$? The answer reveals exactly why the SVD succeeds where diagonalization fails.

Diagonalization asks the matrix to stretch along a set of axes and leave those same axes pointing the same way — that is what an eigenvector is, a direction $A$ merely scales without rotating, $A\mathbf{v} = \lambda\mathbf{v}$. But most matrices have no such well-behaved real directions. A rotation matrix sends every direction somewhere else; it has no real eigenvectors at all. And a non-square matrix cannot even ask the question — if $A$ is $3\times 2$, then $A\mathbf{v}$ lives in a different space ($\mathbb{R}^3$) from $\mathbf{v}$ (in $\mathbb{R}^2$), so "$A\mathbf{v}$ points the same way as $\mathbf{v}$" is not even meaningful.

The SVD asks a humbler, always-answerable question instead. It does not demand that the input and output axes be the same. It finds one orthonormal set of input directions (the columns of $V$) and a possibly different orthonormal set of output directions (the columns of $U$), with the property that $A$ sends each input axis $\mathbf{v}_i$ straight onto a stretched output axis: $A\mathbf{v}_i = \sigma_i \mathbf{u}_i$. The input axis $\mathbf{v}_i$ gets scaled by $\sigma_i$ and relocated to the new direction $\mathbf{u}_i$. Because we allow the output directions to differ from the input directions, this is always possible — that freedom is precisely what removes diagonalization's fine print. One rotation was never enough for a general matrix; two rotations always are.

Common Pitfall — Do not confuse singular values with eigenvalues, or singular vectors with eigenvectors. Eigenvalues can be negative, complex, or fail to exist as real numbers; singular values are always real and non-negative ($\sigma_i \ge 0$), and they exist for every matrix. Eigenvectors satisfy $A\mathbf{v} = \lambda\mathbf{v}$ (same direction in, same out); right singular vectors satisfy $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$ (one direction in, generally a different direction out). The two coincide only in a special case — a symmetric positive-definite matrix, §30.6 — and conflating them in general is one of the most common errors in the subject.

30.2 What are the singular values and singular vectors?

Now let us name the pieces precisely, still keeping the geometric picture in view. The factorization $A = U\Sigma V^{\mathsf{T}}$ for an $m\times n$ matrix $A$ has three factors:

$\Sigma$ is an $m\times n$ matrix that is zero everywhere except on its main diagonal, where it carries the singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ (with any remaining diagonal entries equal to zero). By universal convention the singular values are listed in decreasing order and are non-negative. The number $r$ of strictly positive singular values is the rank of $A$, as we will prove in §30.7.
$V$ is an $n\times n$ orthogonal matrix ($V^{\mathsf{T}}V = I_n$). Its columns $\mathbf{v}_1, \dots, \mathbf{v}_n$ are the right singular vectors — an orthonormal basis for the input space $\mathbb{R}^n$.
$U$ is an $m\times m$ orthogonal matrix ($U^{\mathsf{T}}U = I_m$). Its columns $\mathbf{u}_1, \dots, \mathbf{u}_m$ are the left singular vectors — an orthonormal basis for the output space $\mathbb{R}^m$.

The single equation that ties them together — and the one to memorize, because every property in this chapter flows from it — is what happens when you apply $A$ to a right singular vector:

$$A\mathbf{v}_i = \sigma_i\,\mathbf{u}_i, \qquad i = 1, \dots, r.$$

Read it slowly. The $i$-th input axis $\mathbf{v}_i$ is sent by $A$ to $\sigma_i$ times the $i$-th output axis $\mathbf{u}_i$: same as before, scaled by the singular value and relocated to a new perpendicular direction. For $i > r$ (the directions with $\sigma_i = 0$), we get $A\mathbf{v}_i = \mathbf{0}$ — those input directions are crushed to zero, which is exactly the null space, as §30.8 will make precise. This relation $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$ is just the columns of $A V = U\Sigma$ written one at a time, and $AV = U\Sigma$ is the same statement as $A = U\Sigma V^{\mathsf{T}}$ after multiplying on the right by $V^{\mathsf{T}}$ (using $VV^{\mathsf{T}} = I$). Geometry and algebra, the same fact twice.

Geometric Intuition — Return to the unit circle becoming an ellipse. The right singular vectors $\mathbf{v}_i$ are the special input directions — the points on the unit circle that land exactly on the axes of the output ellipse. The left singular vectors $\mathbf{u}_i$ are the directions of those output axes. And the singular value $\sigma_i$ is how far the ellipse extends along $\mathbf{u}_i$ — the semi-axis length. So $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$ reads: "the input direction $\mathbf{v}_i$ is sent to the ellipse-axis direction $\mathbf{u}_i$, stretched to length $\sigma_i$." The largest singular value $\sigma_1$ is the longest the ellipse reaches — the maximum stretch $A$ applies to any unit vector. The smallest is the shortest reach — the minimum stretch.

That last observation is so important it deserves its own statement, because it gives the singular values a meaning independent of the whole factorization. The largest singular value answers the question, "by how much can $A$ stretch a vector, at most?" Among all unit vectors $\mathbf{x}$, the longest output $\lVert A\mathbf{x}\rVert$ is exactly $\sigma_1$, achieved when $\mathbf{x} = \mathbf{v}_1$. Likewise $\sigma_r$ (for a full-rank square matrix, $\sigma_n$) is the least stretch, the shortest the ellipse reaches. These extremal characterizations,

$$\sigma_1 = \max_{\lVert \mathbf{x}\rVert = 1}\lVert A\mathbf{x}\rVert, \qquad \sigma_{\min} = \min_{\lVert\mathbf{x}\rVert = 1}\lVert A\mathbf{x}\rVert \quad (\text{square, full rank}),$$

are why $\sigma_1$ is the operator norm of $A$ and why the ratio $\sigma_1/\sigma_{\min}$ is the condition number — both developed in §30.9. The singular values are the geometry of the matrix made into numbers.

30.3 How do you compute the SVD of a matrix? A full $2\times 2$ by hand

Geometry settled, let us earn the decomposition by pencil on a concrete matrix, then confirm every number with numpy. We deliberately choose a matrix that is not symmetric (so this is genuinely beyond Chapter 27) and not a rotation (so it really stretches), to show the machinery on an honest example:

$$A = \begin{bmatrix} 2 & 2 \\ -1 & 1 \end{bmatrix}.$$

The recipe, which we will justify in §30.5, is: form the symmetric matrix $A^{\mathsf{T}}A$, find its eigenvalues and orthonormal eigenvectors by the Spectral Theorem of Chapter 27, and read off the SVD from them. We follow it step by step.

Step 1 — Form $A^{\mathsf{T}}A$ (always symmetric). Compute the $2\times 2$ product:

$$A^{\mathsf{T}}A = \begin{bmatrix} 2 & -1 \\ 2 & 1 \end{bmatrix}\begin{bmatrix} 2 & 2 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 4+1 & 4-1 \\ 4-1 & 4+1 \end{bmatrix} = \begin{bmatrix} 5 & 3 \\ 3 & 5 \end{bmatrix}.$$

Notice it came out symmetric — the off-diagonal entries match, $3 = 3$. This is no accident: $(A^{\mathsf{T}}A)^{\mathsf{T}} = A^{\mathsf{T}}(A^{\mathsf{T}})^{\mathsf{T}} = A^{\mathsf{T}}A$ for any matrix $A$ (Chapter 8), so $A^{\mathsf{T}}A$ is always symmetric, which is precisely why the Spectral Theorem applies and the whole construction works.

Step 2 — Eigenvalues of $A^{\mathsf{T}}A$ give the singular values. This is the friendly symmetric matrix $\begin{psmallmatrix}5 & 3\\ 3 & 5\end{psmallmatrix}$, the kind we diagonalized in Chapter 27. Its characteristic polynomial is

$$\det\begin{bmatrix} 5-\lambda & 3 \\ 3 & 5-\lambda \end{bmatrix} = (5-\lambda)^2 - 9 = \lambda^2 - 10\lambda + 16 = (\lambda - 8)(\lambda - 2) = 0,$$

so the eigenvalues are $\lambda_1 = 8$ and $\lambda_2 = 2$ — both real and both positive, exactly as the Spectral Theorem promised (and as §30.6 will explain, $A^{\mathsf{T}}A$ can never have a negative eigenvalue). The singular values are the square roots of these eigenvalues:

$$\sigma_1 = \sqrt{\lambda_1} = \sqrt 8 = 2\sqrt 2 \approx 2.828, \qquad \sigma_2 = \sqrt{\lambda_2} = \sqrt 2 \approx 1.414.$$

This is the single most important formula for computing the SVD: $\sigma_i = \sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues of $A^{\mathsf{T}}A$. The square root is what connects the SVD to the spectral theorem, and §30.5 proves why it must be a square root.

Step 3 — Eigenvectors of $A^{\mathsf{T}}A$ are the right singular vectors $\mathbf{v}_i$. For $\lambda_1 = 8$, solve $(A^{\mathsf{T}}A - 8I)\mathbf{v} = \mathbf{0}$:

$$\begin{bmatrix} -3 & 3 \\ 3 & -3 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \mathbf{0} \;\Longrightarrow\; v_1 = v_2 \;\Longrightarrow\; \mathbf{v}_1 = \frac{1}{\sqrt 2}\begin{bmatrix} 1 \\ 1 \end{bmatrix}.$$

For $\lambda_2 = 2$, solve $(A^{\mathsf{T}}A - 2I)\mathbf{v} = \mathbf{0}$:

$$\begin{bmatrix} 3 & 3 \\ 3 & 3 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \mathbf{0} \;\Longrightarrow\; v_1 = -v_2 \;\Longrightarrow\; \mathbf{v}_2 = \frac{1}{\sqrt 2}\begin{bmatrix} -1 \\ 1 \end{bmatrix}.$$

These two are orthogonal automatically (the Spectral Theorem guarantees it, §27.5: distinct eigenvalues of a symmetric matrix force perpendicular eigenvectors), and we have normalized each to unit length. Stack them as the columns of $V$:

$$V = \frac{1}{\sqrt 2}\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2\sqrt 2 & 0 \\ 0 & \sqrt 2 \end{bmatrix}.$$

Step 4 — Left singular vectors from $\mathbf{u}_i = A\mathbf{v}_i / \sigma_i$. The defining relation $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$ tells us how to find each $\mathbf{u}_i$: apply $A$ to $\mathbf{v}_i$ and divide by $\sigma_i$. For $\mathbf{v}_1 = \tfrac{1}{\sqrt2}(1,1)$:

$$A\mathbf{v}_1 = \frac{1}{\sqrt 2}\begin{bmatrix} 2 & 2 \\ -1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \frac{1}{\sqrt 2}\begin{bmatrix} 4 \\ 0 \end{bmatrix} = \begin{bmatrix} 2\sqrt 2 \\ 0 \end{bmatrix} = 2\sqrt 2\begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad\text{so}\quad \mathbf{u}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.$$

For $\mathbf{v}_2 = \tfrac{1}{\sqrt2}(-1,1)$:

$$A\mathbf{v}_2 = \frac{1}{\sqrt 2}\begin{bmatrix} 2 & 2 \\ -1 & 1 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \end{bmatrix} = \frac{1}{\sqrt 2}\begin{bmatrix} 0 \\ 2 \end{bmatrix} = \sqrt 2\begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad\text{so}\quad \mathbf{u}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

So for this particular matrix the left singular vectors come out to be the standard basis, $U = I$. (That is a happy coincidence of this clean example, not a general fact — usually $U$ is some genuine rotation.) Assembling everything:

$$A = U\Sigma V^{\mathsf{T}} = \underbrace{\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}}_{U}\underbrace{\begin{bmatrix} 2\sqrt 2 & 0 \\ 0 & \sqrt 2 \end{bmatrix}}_{\Sigma}\underbrace{\frac{1}{\sqrt 2}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}}_{V^{\mathsf{T}}}.$$

Step 5 — Verify the reconstruction by hand. Multiply it back out, right to left. First $\Sigma V^{\mathsf{T}}$:

$$\begin{bmatrix} 2\sqrt2 & 0 \\ 0 & \sqrt2 \end{bmatrix}\frac{1}{\sqrt2}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} = \frac{1}{\sqrt2}\begin{bmatrix} 2\sqrt2 & 2\sqrt2 \\ -\sqrt2 & \sqrt2 \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ -1 & 1 \end{bmatrix}.$$

Then $U = I$ leaves this unchanged, so $U\Sigma V^{\mathsf{T}} = \begin{psmallmatrix}2 & 2\\ -1 & 1\end{psmallmatrix} = A$ exactly. We have computed our first full SVD by hand: $A$ rotates the input by $-45°$ (that is $V^{\mathsf{T}}$), stretches by $2\sqrt2$ along the first axis and $\sqrt2$ along the second (that is $\Sigma$), then does nothing more (that is $U = I$).

Check Your Understanding — In the worked example, what is $\det(A)$, and how does it relate to the singular values? Compute $\sigma_1\sigma_2$ and compare.
Answer
$\det(A) = (2)(1) - (2)(-1) = 2 + 2 = 4$. The product of the singular values is $\sigma_1\sigma_2 = 2\sqrt2 \cdot \sqrt2 = 2\cdot 2 = 4$. They match in absolute value: for a square matrix, $\lvert\det(A)\rvert = \prod_i \sigma_i$, because the rotations $U, V$ have determinant $\pm 1$ and the stretching $\Sigma$ scales area by $\prod_i\sigma_i$ (the area-scaling meaning of the determinant from Chapter 11). The singular values know the absolute area-scaling factor; they discard the sign (orientation) information that the determinant keeps.

30.4 Does the hand computation match numpy?

It must, and the differences that appear teach an important lesson about conventions. Let us run np.linalg.svd on the same matrix. (Recall the indexing convention from earlier chapters: mathematics writes $\mathbf{v}_1, \sigma_1$ one-indexed, while numpy's V[:,0], S[0] are zero-indexed — the same objects, shifted labels. And np.linalg.svd returns $V^{\mathsf{T}}$ directly, not $V$.)

# Full SVD of a non-symmetric 2x2 matrix; compare to the hand computation.
import numpy as np
A = np.array([[2.0,  2.0],
              [-1.0, 1.0]])
U, S, Vt = np.linalg.svd(A)        # returns U, the singular values S, and V^T (not V)
print("singular values S =", S)    # the diagonal of Sigma, descending
print("U  =\n", np.round(U, 6))
print("V^T =\n", np.round(Vt, 6))
print("reconstruct U Sigma V^T =\n",
      np.round(U @ np.diag(S) @ Vt, 10))   # should equal A

singular values S = [2.82842712 1.41421356]
U  =
 [[-1.  0.]
 [ 0.  1.]]
V^T =
 [[-0.707107 -0.707107]
 [-0.707107  0.707107]]
reconstruct U Sigma V^T =
 [[ 2.  2.]
 [-1.  1.]]

The singular values match exactly: [2.828..., 1.414...] is $[2\sqrt2, \sqrt2]$, our hand result. The reconstruction recovers $A$ to machine precision. But look closely at $U$ and $V^{\mathsf{T}}$: numpy returned $U = \begin{psmallmatrix}-1 & 0\\ 0 & 1\end{psmallmatrix}$, whereas we found $U = I$; and numpy's first row of $V^{\mathsf{T}}$ is $(-0.707, -0.707)$, the negative of our $\mathbf{v}_1 = \tfrac{1}{\sqrt2}(1,1)$. The decomposition disagrees with ours by a sign — and that is completely fine. Here is why.

Computational Note — The SVD is not unique; the singular values are, but the singular vectors carry sign and ordering conventions. (1) Sign flips. If $\mathbf{v}_i$ is a right singular vector with $\mathbf{u}_i = A\mathbf{v}_i/\sigma_i$, then flipping both signs — using $-\mathbf{v}_i$ and $-\mathbf{u}_i$ — gives an equally valid pair, because $A(-\mathbf{v}_i) = \sigma_i(-\mathbf{u}_i)$ and the product $(-\mathbf{u}_i)(-\mathbf{v}_i)^{\mathsf{T}} = \mathbf{u}_i\mathbf{v}_i^{\mathsf{T}}$ is unchanged. numpy flipped the first pair relative to our choice; the product $U\Sigma V^{\mathsf{T}}$ is identical. The cardinal rule: never flip a $\mathbf{u}_i$ without flipping its partner $\mathbf{v}_i$. (2) Ordering. The singular values are sorted descending by convention, so the order is fixed — but if two singular values are equal, their singular vectors can be rotated within that shared subspace (the repeated-eigenvalue freedom of §27.5.1). (3) What numpy returns. np.linalg.svd returns the triple (U, S, Vt) where S is a 1-D array of singular values and the third output is already $V^{\mathsf{T}}$, not $V$ — a frequent source of bugs. To get $V$, transpose: V = Vt.T. The singular values are always returned in descending order.

This non-uniqueness is exactly the same phenomenon we met in Chapter 27, where np.linalg.eigh returned an eigenvector as the negative of our hand choice. The deep fact is invariant — the ellipse, the axis lengths (singular values), and the axis lines are pinned down — but the labels (which way the arrows point along each axis) have a sign freedom, and different software resolves it differently. When you check a hand SVD against numpy, compare the singular values exactly, compare the reconstruction $U\Sigma V^{\mathsf{T}}$ exactly, and expect the singular vectors to possibly differ by paired sign flips. Let us confirm our hand $U=I$, $V=\tfrac{1}{\sqrt2}\begin{psmallmatrix}1&-1\\1&1\end{psmallmatrix}$ also reconstructs $A$:

# Our HAND decomposition also reconstructs A — sign convention differs, product agrees.
import numpy as np
s = np.sqrt(2)
U_hand  = np.array([[1.0, 0.0], [0.0, 1.0]])              # we found U = I
Sig     = np.diag([2*s, s])                                # 2√2, √2
V_hand  = (1/s) * np.array([[1.0, -1.0], [1.0, 1.0]])     # columns v1, v2
print("U Sigma V^T (hand) =\n", np.round(U_hand @ Sig @ V_hand.T, 10))

U Sigma V^T (hand) =
 [[ 2.  2.]
 [-1.  1.]]

Identical to $A$. Both decompositions are correct; they are the same ellipse described with different sign labels.

30.4.1 A second example with a genuine rotation $U$

Our first matrix had $U = I$, which is unusually clean — it meant the output ellipse happened to be axis-aligned. To see the typical case, where $U$ is a real rotation, run the same four-step recipe on

$$A = \begin{bmatrix} 3 & 0 \\ 4 & 5 \end{bmatrix}.$$

Form $A^{\mathsf{T}}A = \begin{psmallmatrix}9+16 & 20\\ 20 & 25\end{psmallmatrix} = \begin{psmallmatrix}25 & 20\\ 20 & 25\end{psmallmatrix}$, again symmetric. Its characteristic polynomial is $(25-\lambda)^2 - 400 = (\lambda - 45)(\lambda - 5) = 0$, so $\lambda_1 = 45$, $\lambda_2 = 5$, giving singular values $\sigma_1 = \sqrt{45} = 3\sqrt5 \approx 6.708$ and $\sigma_2 = \sqrt5 \approx 2.236$. The eigenvectors of $\begin{psmallmatrix}25&20\\20&25\end{psmallmatrix}$ are again the $\pm45°$ directions, $\mathbf{v}_1 = \tfrac{1}{\sqrt2}(1,1)$ and $\mathbf{v}_2 = \tfrac{1}{\sqrt2}(-1,1)$. Now the left singular vectors, from $\mathbf{u}_i = A\mathbf{v}_i/\sigma_i$:

$$\mathbf{u}_1 = \frac{1}{3\sqrt5}A\mathbf{v}_1 = \frac{1}{3\sqrt5}\cdot\frac{1}{\sqrt2}\begin{bmatrix}3\\9\end{bmatrix} = \frac{1}{\sqrt{10}}\begin{bmatrix}1\\3\end{bmatrix}, \qquad \mathbf{u}_2 = \frac{1}{\sqrt5}A\mathbf{v}_2 = \frac{1}{\sqrt{10}}\begin{bmatrix}-3\\1\end{bmatrix}.$$

These are orthonormal (check: $\mathbf{u}_1\cdot\mathbf{u}_2 = \tfrac{1}{10}(-3 + 3) = 0$, and each has length 1) but they are not the standard basis — $U = \tfrac{1}{\sqrt{10}}\begin{psmallmatrix}1 & -3\\ 3 & 1\end{psmallmatrix}$ is a genuine rotation. So here $A$ rotates the input by $-45°$ ($V^{\mathsf{T}}$), stretches by $3\sqrt5$ and $\sqrt5$ ($\Sigma$), then rotates the result by the angle of $U$ (about $71.6°$) into its final orientation — the full rotate–stretch–rotate, with no factor trivial. np.linalg.svd confirms the singular values $\{3\sqrt5, \sqrt5\}$ and reconstructs $A$, with $U$ and $V$ matching ours up to the usual paired sign flips. This is the general shape of every SVD; the first example was just lucky enough to land $U = I$.

30.5 Why does every matrix have an SVD? (the spectral theorem on $A^{\mathsf{T}}A$)

We have used the recipe; now we prove it works for every matrix, which is the universality that makes the SVD the keystone of the subject. This is the chapter's central theorem, and the proof is one of the most satisfying in the book: it builds the SVD of an arbitrary matrix entirely out of the Spectral Theorem of Chapter 27, applied to a symmetric matrix we manufacture from $A$.

The Singular Value Decomposition (existence). Let $A$ be any real $m\times n$ matrix. Then there exist an orthogonal $m\times m$ matrix $U$, an orthogonal $n\times n$ matrix $V$, and an $m\times n$ matrix $\Sigma$ that is zero off its main diagonal with non-negative, non-increasing diagonal entries $\sigma_1 \ge \cdots \ge \sigma_r > 0 = \sigma_{r+1} = \cdots$, such that $$A = U\Sigma V^{\mathsf{T}}.$$ The $\sigma_i$ (the singular values) are $\sqrt{\lambda_i}$ where $\lambda_i$ are the eigenvalues of $A^{\mathsf{T}}A$; the columns of $V$ are an orthonormal eigenbasis of $A^{\mathsf{T}}A$; the columns of $U$ extend the vectors $\mathbf{u}_i = A\mathbf{v}_i/\sigma_i$ to an orthonormal basis of $\mathbb{R}^m$.

1. Why we care. This is the universality that diagonalization lacked. Chapter 25 could diagonalize only square matrices, and only those with enough independent eigenvectors — defective matrices and matrices with complex eigenvalues escaped it entirely. The SVD escapes nothing. Every matrix that exists, of every shape, has this factorization. That is why a single algorithm — np.linalg.svd — can be pointed at any data matrix, any image, any operator, and always returns. The proof shows why there are no exceptions: it never asks $A$ to be nice, only $A^{\mathsf{T}}A$, and $A^{\mathsf{T}}A$ is automatically nice (symmetric, hence spectral-theorem-able) for every $A$.

2. Key idea. $A$ itself may be hopeless — non-square, non-symmetric, defective. But $A^{\mathsf{T}}A$ is always a real symmetric $n\times n$ matrix, so the Spectral Theorem (Chapter 27) hands us, with no fine print, an orthonormal eigenbasis and real eigenvalues. We show those eigenvalues are non-negative (so their square roots are real singular values), use the eigenvectors as the right singular vectors $V$, define the left singular vectors by $\mathbf{u}_i = A\mathbf{v}_i/\sigma_i$, and prove the $\mathbf{u}_i$ come out orthonormal for free. The symmetry of $A^{\mathsf{T}}A$ does all the heavy lifting.

3. Proof. Let $A$ be any real $m\times n$ matrix and set $S = A^{\mathsf{T}}A$, an $n\times n$ matrix.

Step (a): $S$ is symmetric, so the Spectral Theorem applies. Indeed $S^{\mathsf{T}} = (A^{\mathsf{T}}A)^{\mathsf{T}} = A^{\mathsf{T}}(A^{\mathsf{T}})^{\mathsf{T}} = A^{\mathsf{T}}A = S$. By the Spectral Theorem (§27.2), $S$ has an orthonormal basis of eigenvectors $\mathbf{v}_1, \dots, \mathbf{v}_n$ with real eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n$ (we may order them this way), and $S = V\Lambda V^{\mathsf{T}}$ with $V = [\mathbf{v}_1 \cdots \mathbf{v}_n]$ orthogonal.

Step (b): every eigenvalue $\lambda_i \ge 0$. This is what lets us take square roots. For each eigenvector, $S\mathbf{v}_i = \lambda_i\mathbf{v}_i$, compute the length of $A\mathbf{v}_i$:

$$\lVert A\mathbf{v}_i\rVert^2 = (A\mathbf{v}_i)^{\mathsf{T}}(A\mathbf{v}_i) = \mathbf{v}_i^{\mathsf{T}}A^{\mathsf{T}}A\mathbf{v}_i = \mathbf{v}_i^{\mathsf{T}}S\mathbf{v}_i = \mathbf{v}_i^{\mathsf{T}}(\lambda_i\mathbf{v}_i) = \lambda_i\lVert\mathbf{v}_i\rVert^2 = \lambda_i.$$

(The last step uses $\lVert\mathbf{v}_i\rVert = 1$.) A squared length is never negative, so $\lambda_i = \lVert A\mathbf{v}_i\rVert^2 \ge 0$. Define $\sigma_i = \sqrt{\lambda_i} \ge 0$. This is the algebraic origin of the square root in $\sigma_i = \sqrt{\lambda_i}$: $\sigma_i = \lVert A\mathbf{v}_i\rVert$ is a genuine length, and $\lambda_i$ is its square.

Step (c): define the left singular vectors and check orthonormality. Let $r$ be the number of strictly positive $\lambda_i$ (so $\sigma_1 \ge \cdots \ge \sigma_r > 0$ and $\sigma_{r+1} = \cdots = \sigma_n = 0$). For $i \le r$, define

$$\mathbf{u}_i = \frac{1}{\sigma_i}A\mathbf{v}_i.$$

Each $\mathbf{u}_i$ is a unit vector, since $\lVert\mathbf{u}_i\rVert = \tfrac{1}{\sigma_i}\lVert A\mathbf{v}_i\rVert = \tfrac{1}{\sigma_i}\sigma_i = 1$ by step (b). And distinct $\mathbf{u}_i$ are orthogonal: for $i \ne j$ (both $\le r$),

$$\mathbf{u}_i^{\mathsf{T}}\mathbf{u}_j = \frac{1}{\sigma_i\sigma_j}(A\mathbf{v}_i)^{\mathsf{T}}(A\mathbf{v}_j) = \frac{1}{\sigma_i\sigma_j}\mathbf{v}_i^{\mathsf{T}}A^{\mathsf{T}}A\mathbf{v}_j = \frac{1}{\sigma_i\sigma_j}\mathbf{v}_i^{\mathsf{T}}(\lambda_j\mathbf{v}_j) = \frac{\lambda_j}{\sigma_i\sigma_j}\,\mathbf{v}_i^{\mathsf{T}}\mathbf{v}_j = 0,$$

because $\mathbf{v}_i^{\mathsf{T}}\mathbf{v}_j = 0$ ($V$ is orthonormal). So $\mathbf{u}_1, \dots, \mathbf{u}_r$ are orthonormal in $\mathbb{R}^m$. If $r < m$, extend them to a full orthonormal basis $\mathbf{u}_1, \dots, \mathbf{u}_m$ of $\mathbb{R}^m$ (Gram–Schmidt on any completion, Chapter 20), and let $U = [\mathbf{u}_1 \cdots \mathbf{u}_m]$, which is orthogonal.

Step (d): assemble $A = U\Sigma V^{\mathsf{T}}$. By construction $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$ for $i \le r$, and $A\mathbf{v}_i = \mathbf{0}$ for $i > r$ (because $\lVert A\mathbf{v}_i\rVert^2 = \lambda_i = 0$ forces $A\mathbf{v}_i = \mathbf{0}$). These $n$ relations are exactly the columns of the single matrix equation $AV = U\Sigma$, where $\Sigma$ is the $m\times n$ matrix with the $\sigma_i$ on its diagonal. Since $V$ is orthogonal, $V^{\mathsf{T}} = V^{-1}$, so multiplying on the right by $V^{\mathsf{T}}$ gives $A = U\Sigma V^{\mathsf{T}}$. $\blacksquare$

4. What this means. The proof is a recipe and an explanation at once. The recipe: eigen-decompose the symmetric $A^{\mathsf{T}}A$ to get $V$ and the $\sigma_i = \sqrt{\lambda_i}$, then push the $\mathbf{v}_i$ through $A$ and normalize to get $U$ — exactly the four steps of §30.3. The explanation: the SVD exists for every matrix because $A^{\mathsf{T}}A$ is symmetric for every matrix, and symmetric matrices are always orthogonally diagonalizable. The universality of the SVD is inherited directly from the universality of the Spectral Theorem. There is no defective case, no complex eigenvalue, no shape restriction — because $A^{\mathsf{T}}A$ launders away all of those pathologies of $A$ and hands back a clean symmetric matrix. The SVD is, quite literally, the spectral theorem of $A^{\mathsf{T}}A$ pulled back through $A$.

It is worth dwelling on just how much this universality buys us, because it is the single fact that elevates the SVD above every other decomposition in the book. Recall the hierarchy of failure we cataloged for diagonalization. A rotation matrix (Chapter 26) has no real eigenvalues — diagonalization over the reals is impossible — yet it has a perfectly ordinary SVD with real singular values (a rotation's singular values are all $1$, since it stretches nothing). A defective shear (Chapter 25) has too few eigenvectors to diagonalize at all — yet it has a clean SVD (we noted $\begin{psmallmatrix}1&5\\0&1\end{psmallmatrix}$ has singular values about $5.19$ and $0.19$ in §30.6). And a rectangular matrix cannot even be asked whether it is diagonalizable, since the question requires a square matrix — yet the SVD treats a $1000\times 3$ data matrix and a $5\times 5$ matrix with exactly the same machinery. Every obstruction that could stop diagonalization is dissolved by routing through $A^{\mathsf{T}}A$, because the obstruction lived in $A$ and never survives the symmetrization. This is why, when you do not know anything about a matrix — when it is just data — the SVD is the tool you reach for: it is the one factorization guaranteed to exist and to mean something, no matter what the matrix turns out to be. Every matrix is rotate–stretch–rotate, full stop.

Math-Major Sidebar — The same construction with $AA^{\mathsf{T}}$ instead of $A^{\mathsf{T}}A$ produces the left singular vectors directly: $AA^{\mathsf{T}}$ is a symmetric $m\times m$ matrix whose orthonormal eigenvectors are the columns of $U$, with the same nonzero eigenvalues $\lambda_i = \sigma_i^2$. (That $A^{\mathsf{T}}A$ and $AA^{\mathsf{T}}$ share their nonzero eigenvalues is a general fact: $A^{\mathsf{T}}A$ and $AA^{\mathsf{T}}$ have identical nonzero spectra for any $A$ — see Exercise 30.20.) So you can build $V$ from $A^{\mathsf{T}}A$ or $U$ from $AA^{\mathsf{T}}$; the relation $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$ then determines the other family. In practice, forming $A^{\mathsf{T}}A$ explicitly is numerically poor — it squares the condition number (§30.9) and can lose half the significant digits — so real SVD algorithms (Golub–Kahan bidiagonalization) never form $A^{\mathsf{T}}A$ at all; they work on $A$ directly. The $A^{\mathsf{T}}A$ route is the right way to understand and hand-compute the SVD, and the wrong way to compute it at scale. We return to this stability point in Chapter 38.

The SVD and the eigen-decomposition are close cousins, and seeing exactly how they relate — and where they part ways — locks in the meaning of both. We have already seen the bridge: the right singular vectors and singular values of $A$ come from the eigen-decomposition of $A^{\mathsf{T}}A$. Let us make every part of that correspondence explicit, then identify the one case where the two decompositions coincide.

Start from $A = U\Sigma V^{\mathsf{T}}$ and compute $A^{\mathsf{T}}A$ directly, using $U^{\mathsf{T}}U = I$:

$$A^{\mathsf{T}}A = (U\Sigma V^{\mathsf{T}})^{\mathsf{T}}(U\Sigma V^{\mathsf{T}}) = V\Sigma^{\mathsf{T}}U^{\mathsf{T}}U\Sigma V^{\mathsf{T}} = V(\Sigma^{\mathsf{T}}\Sigma)V^{\mathsf{T}} = V\Lambda V^{\mathsf{T}},$$

where $\Lambda = \Sigma^{\mathsf{T}}\Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)$. This is an orthogonal diagonalization of the symmetric matrix $A^{\mathsf{T}}A$: its eigenvectors are the columns of $V$ (the right singular vectors) and its eigenvalues are $\sigma_i^2$. Symmetrically,

$$AA^{\mathsf{T}} = (U\Sigma V^{\mathsf{T}})(U\Sigma V^{\mathsf{T}})^{\mathsf{T}} = U(\Sigma\Sigma^{\mathsf{T}})U^{\mathsf{T}} = U\Lambda' U^{\mathsf{T}},$$

so the eigenvectors of $AA^{\mathsf{T}}$ are the columns of $U$ (the left singular vectors), with the same nonzero eigenvalues $\sigma_i^2$. This is the precise dictionary between the two worlds:

Singular value world (of $A$)	Eigenvalue world
right singular vectors $\mathbf{v}_i$ (columns of $V$)	eigenvectors of $A^{\mathsf{T}}A$
left singular vectors $\mathbf{u}_i$ (columns of $U$)	eigenvectors of $AA^{\mathsf{T}}$
singular values $\sigma_i$	$\sigma_i = \sqrt{\lambda_i}$, $\lambda_i$ eigenvalue of $A^{\mathsf{T}}A$ (or $AA^{\mathsf{T}}$)

We can verify the $AA^{\mathsf{T}}$ side directly on the worked example, where it turns out to be diagonal already (consistent with $U = I$):

# The SVD of A IS the eigen-decomposition of A^T A (gives V) and A A^T (gives U).
import numpy as np
A = np.array([[2.0, 2.0], [-1.0, 1.0]])
ATA, AAT = A.T @ A, A @ A.T
print("A^T A =\n", ATA, "\neig(A^T A):", np.linalg.eigvalsh(ATA))   # [2, 8] = σ²
print("A A^T =\n", AAT, "\neig(A A^T):", np.linalg.eigvalsh(AAT))   # [2, 8] = σ²
print("singular values from sqrt:", np.sqrt(np.linalg.eigvalsh(ATA))[::-1])  # √8, √2

A^T A =
 [[5. 3.]
 [3. 5.]] 
eig(A^T A): [2. 8.]
A A^T =
 [[8. 0.]
 [0. 2.]] 
eig(A A^T): [2. 8.]
singular values from sqrt: [2.82842712 1.41421356]

Both $A^{\mathsf{T}}A$ and $AA^{\mathsf{T}}$ have eigenvalues $\{8, 2\}$, whose square roots $\{2\sqrt2, \sqrt2\}$ are the singular values — confirming the dictionary.

Now the case where the SVD and the eigen-decomposition merge. Suppose $A$ is itself symmetric and positive definite (Chapter 28: symmetric with all eigenvalues $\lambda_i > 0$). Then its spectral decomposition $A = Q\Lambda Q^{\mathsf{T}}$ (Chapter 27) is already an SVD: take $U = V = Q$ and $\Sigma = \Lambda$, and since all $\lambda_i > 0$ they are already non-negative and play the role of singular values directly. So for a symmetric positive-definite matrix, the singular values equal the eigenvalues, and the left and right singular vectors are both the orthonormal eigenvectors. This is the one situation where "singular value" and "eigenvalue" coincide — and it is exactly the situation of the covariance matrix in PCA (Chapter 32), which is why PCA can be phrased interchangeably as "eigen-decomposition of the covariance matrix" or "SVD of the centered data."

Warning

— For a general matrix the singular values are not the absolute values of the eigenvalues, and the singular vectors are not the eigenvectors. Two cautions. First, even for a symmetric matrix with a negative eigenvalue, the singular value is $\lvert\lambda\rvert$ (the magnitude) while the matching $\mathbf{u}_i$ and $\mathbf{v}_i$ differ by a sign — the SVD records the size of the stretch but folds the sign into the singular vectors, since $\sigma_i \ge 0$ always. Second, for a non-symmetric matrix the eigenvalues and singular values can be completely unrelated: a shear $\begin{psmallmatrix}1&5\\0&1\end{psmallmatrix}$ has both eigenvalues equal to $1$ but singular values roughly $5.19$ and $0.19$ — the eigenvalues say "no stretch" while the singular values correctly report a large stretch. Singular values measure stretching; eigenvalues measure invariant directions. They answer different questions and only agree for symmetric positive-(semi)definite matrices.

30.7 How do you read the rank off the singular values?

One of the most useful and immediate consequences of the SVD is that it tells you the rank of a matrix at a glance, and in a way far more honest than counting pivots. The rule is as simple as it gets:

$$\operatorname{rank}(A) = r = \text{the number of nonzero singular values.}$$

The reason is geometric and falls right out of the factorization. Multiplying by the orthogonal matrices $U$ and $V^{\mathsf{T}}$ never changes rank — an invertible matrix times another matrix has the same rank (Chapter 14) — so $\operatorname{rank}(A) = \operatorname{rank}(\Sigma)$. And the rank of the "diagonal" $\Sigma$ is just the number of nonzero entries on its diagonal, because each nonzero $\sigma_i$ contributes one independent column and each zero contributes a zero column. So the count of strictly positive singular values is the rank.

Watch it work on a deliberately rank-deficient matrix, where the second row is twice the first (so the rows are dependent and the rank is 1):

# Rank = number of nonzero singular values. A rank-deficient 2x2.
import numpy as np
B = np.array([[2.0, 4.0],
              [1.0, 2.0]])                  # row 2 = 0.5 * row 1  → rank 1
U, S, Vt = np.linalg.svd(B)
print("singular values:", S)               # one is exactly 0
print("det(B) =", round(np.linalg.det(B), 12))
print("rank via SVD =", int(np.sum(S > 1e-12)))
print("np.linalg.matrix_rank(B) =", np.linalg.matrix_rank(B))

singular values: [5. 0.]
det(B) = 0.0
rank via SVD = 1
np.linalg.matrix_rank(B) = 1

The singular values are exactly $\{5, 0\}$: one nonzero, so rank 1, confirmed. And here is the part that makes the SVD indispensable in practice, where data is noisy and arithmetic is finite. Pivots from Gaussian elimination give a yes/no answer to "is this entry zero?" — but with floating-point data, a quantity that should be zero comes out as $10^{-15}$, and a genuinely tiny-but-real quantity looks the same. The SVD replaces the brittle yes/no with a continuous measure: the singular values tell you not just that a matrix is close to rank-deficient but how close. A matrix with singular values $\{5, 3, 0.0001\}$ is numerically rank-2 even though it is technically rank-3, because one direction is stretched ten-thousand times less than the others. This is the numerical rank, and it is how np.linalg.matrix_rank actually works: it counts singular values above a small tolerance, not pivots. The SVD is the only reliable rank detector for real, noisy matrices — a theme we develop fully in Chapter 38.

Check Your Understanding — A $4\times 6$ matrix has singular values $\{12, 7, 2, 0\}$ reported by numpy. (A $4\times 6$ matrix has $\min(4,6) = 4$ singular values.) What is the rank? What are the dimensions of the column space and the null space?
Answer
Three of the four singular values are nonzero, so $\operatorname{rank}(A) = 3$. The column space has dimension $r = 3$ (a subspace of $\mathbb{R}^4$). By rank–nullity (Chapter 14), the null space has dimension $n - r = 6 - 3 = 3$ (a subspace of $\mathbb{R}^6$). The SVD hands you the rank — and, as the next section shows, orthonormal bases for both subspaces — directly from the singular values and the columns of $U$ and $V$.

30.8 How does the SVD reveal the four fundamental subspaces?

This is where the SVD fulfills a promise made all the way back in Chapter 14 — and it is one of the most beautiful results in the whole book. Strang's four fundamental subspaces (the column space $C(A)$, the null space $N(A)$, the row space $C(A^{\mathsf{T}})$, and the left null space $N(A^{\mathsf{T}})$) are the organizing framework of linear algebra. Chapter 14 told you their dimensions (via rank–nullity) and how they fit together (orthogonal complements). But it never gave you a clean orthonormal basis for each one. The SVD gives you all four at once, for free, just by splitting the columns of $U$ and $V$ at the index $r$.

Suppose $A$ is $m\times n$ with rank $r$, and write the SVD with its singular vectors labeled. Split $V = [\mathbf{v}_1 \cdots \mathbf{v}_r \mid \mathbf{v}_{r+1}\cdots\mathbf{v}_n]$ and $U = [\mathbf{u}_1\cdots\mathbf{u}_r \mid \mathbf{u}_{r+1}\cdots\mathbf{u}_m]$ at the rank $r$ — the first $r$ vectors correspond to the nonzero singular values, the rest to the zeros. Then:

The Four Fundamental Subspaces from the SVD. For an $m\times n$ matrix $A$ of rank $r$, the columns of $U$ and $V$ split into orthonormal bases for all four subspaces: - Row space $C(A^{\mathsf{T}}) \subseteq \mathbb{R}^n$: $\{\mathbf{v}_1, \dots, \mathbf{v}_r\}$ — the first $r$ right singular vectors. (dimension $r$) - Null space $N(A) \subseteq \mathbb{R}^n$: $\{\mathbf{v}_{r+1}, \dots, \mathbf{v}_n\}$ — the last $n - r$ right singular vectors. (dimension $n - r$) - Column space $C(A) \subseteq \mathbb{R}^m$: $\{\mathbf{u}_1, \dots, \mathbf{u}_r\}$ — the first $r$ left singular vectors. (dimension $r$) - Left null space $N(A^{\mathsf{T}}) \subseteq \mathbb{R}^m$: $\{\mathbf{u}_{r+1}, \dots, \mathbf{u}_m\}$ — the last $m - r$ left singular vectors. (dimension $m - r$)

Each of these is provable in a line from the SVD, and the proofs are worth seeing because they make the geometry vivid. The null space is the easiest: $A\mathbf{v}_i = \sigma_i\mathbf{u}_i = \mathbf{0}$ exactly when $\sigma_i = 0$, i.e. for $i > r$, so the last $n - r$ right singular vectors are crushed to zero — they are an orthonormal basis for $N(A)$. The column space comes from $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$: as $\mathbf{x}$ ranges over all inputs, $A\mathbf{x}$ is a combination of the $\mathbf{u}_i$ with $\sigma_i > 0$, so $\{\mathbf{u}_1, \dots, \mathbf{u}_r\}$ spans the output — it is an orthonormal basis for $C(A)$. The row space and left null space follow by applying the same reasoning to $A^{\mathsf{T}} = V\Sigma^{\mathsf{T}}U^{\mathsf{T}}$ (whose SVD swaps the roles of $U$ and $V$), or by orthogonal complementarity: since $V$ is orthonormal, the row-space basis $\{\mathbf{v}_1, \dots, \mathbf{v}_r\}$ and the null-space basis $\{\mathbf{v}_{r+1}, \dots, \mathbf{v}_n\}$ are orthogonal complements in $\mathbb{R}^n$ — which is exactly the orthogonality $C(A^{\mathsf{T}}) \perp N(A)$ that Chapter 14 proved, now realized concretely by perpendicular columns of $V$.

The Key Insight — The SVD doesn't just tell you the four fundamental subspaces — it builds orthonormal bases for all four simultaneously, and arranges them so the orthogonality relations of Chapter 14 ($C(A^{\mathsf{T}}) \perp N(A)$ in $\mathbb{R}^n$, $C(A) \perp N(A^{\mathsf{T}})$ in $\mathbb{R}^m$) are visible as perpendicular columns of $V$ and $U$. The "big picture" of linear algebra and the SVD are the same picture.

This recasts the entire action of $A$ in the cleanest possible terms. $A$ takes the row space (spanned by $\mathbf{v}_1, \dots, \mathbf{v}_r$) and maps it invertibly onto the column space (spanned by $\mathbf{u}_1, \dots, \mathbf{u}_r$), stretching by the singular values along the way — that is the $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$ action. And it takes the null space (spanned by $\mathbf{v}_{r+1}, \dots$) and crushes it to zero. Every matrix, viewed through its SVD, is an invertible rotate-stretch-rotate between its row space and column space, plus a collapse of its null space. Let us see all four subspaces fall out of a single svd call on a rectangular, rank-deficient matrix.

# All four fundamental subspaces from one SVD call (Chapter 14, realized).
import numpy as np
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])             # row 2 = 2·row 1  → 2x3, rank 1
U, S, Vt = np.linalg.svd(A)
r = int(np.sum(S > 1e-12))                   # numerical rank
V = Vt.T
print("singular values:", np.round(S, 6), " → rank r =", r)
print("col space C(A)  basis (first r cols of U):\n", np.round(U[:, :r], 4))
print("left null N(Aᵀ) basis (rest of U):\n",       np.round(U[:, r:], 4))
print("row space C(Aᵀ) basis (first r cols of V):\n", np.round(V[:, :r], 4))
print("null space N(A) basis (rest of V):\n",         np.round(V[:, r:], 4))
print("check A·(null vector) = 0:", np.round(A @ V[:, r], 10))

singular values: [8.366600 0.      ] → rank r = 1
col space C(A)  basis (first r cols of U):
 [[-0.4472]
 [-0.8944]]
left null N(Aᵀ) basis (rest of U):
 [[-0.8944]
 [ 0.4472]]
row space C(Aᵀ) basis (first r cols of V):
 [[-0.2673]
 [-0.5345]
 [-0.8018]]
null space N(A) basis (rest of V):
 [[ 0.9449 -0.189 ]
 [-0.3085 -0.7868]
 [-0.1093  0.5875]]
check A·(null vector) = 0: [0. 0.]

The single nonzero singular value ($\sigma_1 = \sqrt{70} \approx 8.367$) says rank 1. The first column of $U$ spans the column space — it points along $(1,2)$, the direction of both rows' output, normalized to $\tfrac{1}{\sqrt5}(1,2)$ (numpy returns its negative, a valid sign choice). The second column of $U$ spans the left null space, perpendicular to it. The first column of $V$ spans the row space, along $(1,2,3)$ normalized; the remaining two columns of $V$ span the 2-dimensional null space — and the final check confirms $A$ annihilates a null-space vector. Four orthonormal bases, four subspaces, one decomposition. (The particular orthonormal basis numpy chooses inside the 2-dimensional null space is not unique — any orthonormal pair spanning that plane is valid, the repeated-singular-value freedom of §30.4 — but the subspace it spans is pinned down.) This is the SVD delivering on Chapter 14's promise.

30.9 What is the difference between the full and reduced SVD, and what do the singular values measure?

Two practical loose ends remain, and both matter for everything that follows in Part VI: the distinction between the full and reduced SVD, and the way the singular values encode a matrix's most important scalar quantities.

30.9.1 Full versus reduced SVD

For a square matrix the SVD is square all the way through, and there is nothing to discuss. For a rectangular matrix there is a choice, because much of the full $U$ or $V$ is "wasted" on the zero part of $\Sigma$. Take a tall $m\times n$ matrix with $m > n$ (more rows than columns). The full SVD makes $U$ a full $m\times m$ orthogonal matrix and $\Sigma$ an $m\times n$ matrix padded with $m - n$ rows of zeros at the bottom. But those bottom rows of $\Sigma$ are zero, so the last $m - n$ columns of $U$ get multiplied by nothing — they contribute zero to $A$. The reduced (or "thin," or "economy") SVD simply drops them: keep only the first $n$ columns of $U$ (call it $\hat U$, size $m\times n$) and the top $n\times n$ block of $\Sigma$, giving

$$A = \hat U \hat\Sigma V^{\mathsf{T}}, \qquad \hat U \text{ is } m\times n, \quad \hat\Sigma \text{ is } n\times n, \quad V \text{ is } n\times n.$$

Both forms reconstruct $A$ identically; the reduced form just throws away the columns of $U$ that the zeros of $\Sigma$ would have annihilated anyway. For a $1000\times 3$ data matrix, the full $U$ is a giant $1000\times 1000$ orthogonal matrix, while the reduced $\hat U$ is a lean $1000\times 3$ — and the reduced form holds all the same information. This is why, in practice, you almost always want full_matrices=False.

# Full vs reduced SVD of a tall 3x2 matrix.
import numpy as np
A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])                          # 3x2, tall
Uf, Sf, Vtf = np.linalg.svd(A)                       # FULL:   U is 3x3
Ur, Sr, Vtr = np.linalg.svd(A, full_matrices=False)  # REDUCED: U is 3x2
print("full   U shape:", Uf.shape, " Σ implied 3x2,  V^T:", Vtf.shape)
print("reduced U shape:", Ur.shape, " Σ implied 2x2,  V^T:", Vtr.shape)
print("singular values:", np.round(Sr, 6))          # same either way
print("reduced reconstructs A?", np.allclose(Ur @ np.diag(Sr) @ Vtr, A))

full   U shape: (3, 3)  Σ implied 3x2,  V^T: (2, 2)
reduced U shape: (3, 2)  Σ implied 2x2,  V^T: (2, 2)
singular values: [1.732051 1.      ]
reduced reconstructs A? True

Same singular values $\{\sqrt3, 1\}$, same reconstruction; the reduced $U$ is $3\times 2$ instead of $3\times 3$, discarding the one column that spanned the (1-dimensional) left null space. When we truncate the SVD even further in Chapter 31 — keeping only the largest few singular values — we get low-rank approximation, the engine of image compression. The reduced SVD is the first, lossless step on that road; low-rank approximation is the lossy continuation.

30.9.2 Singular values, the operator norm, the Frobenius norm, and the condition number

The singular values are not just bookkeeping — they are the most important scalar measurements of a matrix, and three of them tee up Chapter 38 (Numerical Linear Algebra) directly.

The operator norm (also called the spectral or 2-norm) of $A$ measures the largest factor by which $A$ can stretch any vector. By the extremal characterization of §30.2, that is exactly the top singular value:

$$\lVert A\rVert_2 = \max_{\lVert\mathbf{x}\rVert = 1}\lVert A\mathbf{x}\rVert = \sigma_1.$$

The Frobenius norm treats the matrix as a long list of numbers and takes the root-sum-of-squares of its entries. Remarkably, this also equals a clean function of the singular values — the root-sum-of-squares of them:

$$\lVert A\rVert_F = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\sum_i \sigma_i^2}.$$

(The proof: the Frobenius norm is invariant under multiplication by orthogonal matrices, since rotations preserve all lengths, so $\lVert A\rVert_F = \lVert U\Sigma V^{\mathsf{T}}\rVert_F = \lVert\Sigma\rVert_F = \sqrt{\sum_i\sigma_i^2}$.) This identity is the secret behind the Eckart–Young theorem of Chapter 31: it is what lets us measure the error of a low-rank approximation as the root-sum-of-squares of the discarded singular values.

The condition number measures how numerically sensitive a system $A\mathbf{x} = \mathbf{b}$ is — how much a small error in $\mathbf{b}$ can be amplified into a large error in $\mathbf{x}$. For an invertible square matrix it is the ratio of the largest to the smallest singular value:

$$\kappa(A) = \frac{\sigma_1}{\sigma_n} = \frac{\text{most stretch}}{\text{least stretch}}.$$

A condition number near $1$ means the matrix stretches all directions by similar amounts (a nearly-round ellipse — well-conditioned, safe to solve). A huge condition number means some direction is stretched thousands of times more than another (a long, thin, cigar-shaped ellipse — ill-conditioned, where tiny input errors explode). When the smallest singular value is zero, the condition number is infinite — the matrix is singular and the ellipse has collapsed flat in some direction. Let us read all three off the worked example.

# The singular values ARE the operator norm, Frobenius norm, and condition number.
import numpy as np
A = np.array([[2.0, 2.0], [-1.0, 1.0]])
S = np.linalg.svd(A, compute_uv=False)          # just the singular values
print("singular values:", S)                     # [2√2, √2]
print("operator 2-norm  = σ₁           :", np.linalg.norm(A, 2),  "=", S[0])
print("Frobenius norm   = √(Σσᵢ²)      :", np.linalg.norm(A,'fro'), "=", np.sqrt((S**2).sum()))
print("condition number = σ₁/σₙ        :", np.linalg.cond(A), "=", S[0]/S[-1])

singular values: [2.82842712 1.41421356]
operator 2-norm  = σ₁           : 2.82842712474619 = 2.82842712474619
Frobenius norm   = √(Σσᵢ²)      : 3.1622776601683795 = 3.1622776601683795
condition number = σ₁/σₙ        : 1.9999999999999998 = 2.0

The operator norm is $\sigma_1 = 2\sqrt2 \approx 2.828$; the Frobenius norm is $\sqrt{\sigma_1^2 + \sigma_2^2} = \sqrt{8 + 2} = \sqrt{10} \approx 3.162$; the condition number is $\sigma_1/\sigma_2 = 2\sqrt2/\sqrt2 = 2$, comfortably small (this matrix is well-conditioned). Three of the most important numbers in all of applied linear algebra, all read straight off the singular values. We develop the condition number into a full theory of numerical stability in Chapter 38; for now, hold the picture: the condition number is the eccentricity of the ellipse — how far from round the image of the unit circle is.

Real-World Application — Numerical reliability in scientific computing and machine learning. Before solving a large linear system — a finite-element simulation of a bridge, a least-squares regression with thousands of features, the normal equations inside a neural-network training step — engineers compute the condition number $\sigma_1/\sigma_n$ from the SVD to know how much they can trust the answer. A condition number of $10^{12}$ on a machine with about $16$ digits of precision warns that roughly $12$ of those digits will be lost to amplified rounding error, so the solution may be meaningless. The SVD is the diagnostic that turns "is my matrix safe to invert?" from a guess into a measurement — which is why it is the bedrock of numerical linear algebra (Chapter 38) and why production ML libraries fall back to SVD-based solvers when a problem is ill-conditioned.

30.10 What is the pseudoinverse, and how does the SVD solve every least-squares problem?

We have one more major payoff to collect, and it unifies a thread that has run through the whole book: solving $A\mathbf{x} = \mathbf{b}$ when $A$ is not a nice invertible square matrix. Chapter 9 inverted square matrices; Chapter 17 fit regression lines by least squares; Chapter 19 projected onto the column space. The SVD subsumes all of these into a single object — the pseudoinverse $A^{+}$ — that "inverts" any matrix, of any shape and any rank, in the only sense that is ever possible.

The construction is disarmingly simple. Given $A = U\Sigma V^{\mathsf{T}}$, you might hope to invert it as $V\Sigma^{-1}U^{\mathsf{T}}$ — but $\Sigma$ may not be invertible (it can be rectangular, or have zero singular values). So we invert what we can: define the pseudoinverse of $\Sigma$, written $\Sigma^{+}$, by transposing the shape and reciprocating each nonzero singular value, leaving the zeros alone:

$$\Sigma^{+} = \operatorname{diag}\!\left(\frac{1}{\sigma_1}, \dots, \frac{1}{\sigma_r}, 0, \dots, 0\right)^{\!\text{(shape } n\times m)}, \qquad\text{then}\qquad A^{+} = V\Sigma^{+}U^{\mathsf{T}}.$$

That is the Moore–Penrose pseudoinverse. The rule "reciprocate the nonzero $\sigma_i$, leave the zeros" is the entire idea: along directions the matrix stretches (nonzero $\sigma_i$), the pseudoinverse un-stretches by $1/\sigma_i$; along directions the matrix crushes ($\sigma_i = 0$), there is nothing to undo, so the pseudoinverse does nothing there either. When $A$ is square and invertible, every singular value is nonzero, so $\Sigma^{+} = \Sigma^{-1}$ and $A^{+} = V\Sigma^{-1}U^{\mathsf{T}} = A^{-1}$ exactly — the pseudoinverse is the ordinary inverse whenever the ordinary inverse exists. The pseudoinverse is the honest generalization of $A^{-1}$ to matrices that have no inverse.

Geometric Intuition — Recall that $A$ is an invertible rotate-stretch-rotate between its row space and column space, plus a collapse of its null space (§30.8). The pseudoinverse $A^{+}$ undoes the invertible part and ignores the collapse. It takes a vector in the column space, un-rotates by $U^{\mathsf{T}}$, un-stretches by $\Sigma^{+}$ (dividing by each $\sigma_i$), and rotates back by $V$ into the row space. What it cannot recover is the information $A$ destroyed by crushing the null space to zero — and it doesn't pretend to. So $A^{+}$ maps into the row space, never adding any null-space component. It is the best possible inverse: it perfectly undoes everything $A$ did reversibly, and cleanly declines to invent what $A$ erased.

The reason the pseudoinverse matters so much is that it solves the least-squares problem — the single most common computation in applied mathematics. When $A\mathbf{x} = \mathbf{b}$ has no exact solution (an overdetermined system, more equations than unknowns, as in fitting a line to scattered data, Chapter 17), the best we can do is find the $\mathbf{x}$ minimizing $\lVert A\mathbf{x} - \mathbf{b}\rVert$ — the least-squares solution. And the SVD delivers it directly:

$$\hat{\mathbf{x}} = A^{+}\mathbf{b} \quad\text{minimizes}\quad \lVert A\mathbf{x} - \mathbf{b}\rVert,$$

and when there are infinitely many minimizers (a rank-deficient $A$), $A^{+}\mathbf{b}$ returns the one of smallest norm. This is exactly the projection-onto-the-column-space story of Chapter 19, now solved by the SVD for any matrix. We verify it on a small overdetermined system — fitting a line $y = c_0 + c_1 x$ through four points that do not lie on any line:

# The pseudoinverse A⁺ = VΣ⁺Uᵀ solves least squares — for ANY matrix.
import numpy as np
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 4.0])               # 4 points, no exact line
A = np.column_stack([np.ones_like(x), x])        # 4x2 design matrix (overdetermined)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
Sigma_plus = np.diag(1.0 / S)                    # reciprocate the (nonzero) σ's
A_plus = Vt.T @ Sigma_plus @ U.T                 # A⁺ = V Σ⁺ Uᵀ
coef = A_plus @ y                                # least-squares [c₀, c₁]
print("pseudoinverse A⁺ =\n", np.round(A_plus, 4))
print("best-fit line: y =", round(coef[0], 4), "+", round(coef[1], 4), "x")
print("matches np.linalg.pinv?  ", np.allclose(A_plus, np.linalg.pinv(A)))
print("matches np.linalg.lstsq? ", np.allclose(coef, np.linalg.lstsq(A, y, rcond=None)[0]))

pseudoinverse A⁺ =
 [[ 0.7  0.4  0.1 -0.2]
 [-0.3 -0.1  0.1  0.3]]
best-fit line: y = 1.5 + 1.0 x
matches np.linalg.pinv?   True
matches np.linalg.lstsq?  True

The pseudoinverse built from the SVD recovers the best-fit line $\hat y = 1.5 + x$, matching both np.linalg.pinv and the dedicated least-squares solver np.linalg.lstsq. The SVD-based pseudoinverse is, in fact, the most numerically reliable way to solve least squares — far more stable than the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ of Chapter 17, because forming $A^{\mathsf{T}}A$ squares the condition number (§30.9) and can lose half the available precision, while the SVD works on $A$ directly. This is why production regression and engineering software solves least squares through the SVD. We develop a full engineering case study of the pseudoinverse and least squares in the case studies for this chapter.

Common Pitfall — Do not invert tiny singular values blindly when forming $\Sigma^{+}$. If a singular value is genuinely zero, $1/\sigma_i$ is infinite — so the pseudoinverse rule leaves it at zero rather than reciprocating. But the subtler trap is a singular value that is near zero (say $10^{-10}$): reciprocating it gives a gigantic $10^{10}$, which amplifies noise catastrophically and produces a wild, useless "solution." The remedy, the truncated SVD pseudoinverse, treats singular values below a tolerance as zero and discards them — trading a little bias for enormous stability. This is the regularization idea (closely related to ridge regression) that makes least squares usable on ill-conditioned, real-world data, and it is invisible without the SVD. Chapter 38 returns to it.

30.11 What does the SVD look like? (the visualizer returns one last time)

It is time to bring back the recurring 2D transformation visualizer from Chapter 1 — the same tool, unchanged, that has shown us shears, scalings, rotations, determinants as area, and eigenvectors as invariant directions. This is its final and most satisfying appearance, because the SVD lets us decompose a single deformation of the unit square into the three stages — rotate, stretch, rotate — and watch each one happen separately. We use the worked-example matrix $A = \begin{psmallmatrix}2 & 2\\ -1 & 1\end{psmallmatrix}$, whose SVD we computed by hand: $V^{\mathsf{T}}$ rotates by $-45°$, $\Sigma$ stretches by $2\sqrt2$ and $\sqrt2$, and $U = I$.

# toolkit/visualizer.py — the recurring 2D transformation visualizer.
# Shows what a 2x2 matrix A does to the unit square and the basis vectors.
import numpy as np
import matplotlib.pyplot as plt

def visualize_2d(A, title="", ax=None):
    """Plot the action of 2x2 matrix A on the unit square and i-hat, j-hat."""
    A = np.asarray(A, dtype=float)
    square = np.array([[0, 1, 1, 0, 0],
                       [0, 0, 1, 1, 0]])          # unit-square corners (closed)
    out = A @ square                               # transformed square
    e1, e2 = A @ np.array([1, 0]), A @ np.array([0, 1])   # images of basis vectors
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 5))
    ax.plot(square[0], square[1], "b--", lw=1, label="input (unit square)")
    ax.fill(out[0], out[1], alpha=0.25, color="C1")
    ax.plot(out[0], out[1], "C1-", lw=2, label="A · (unit square)")
    ax.arrow(0, 0, *e1, color="C3", width=0.02, length_includes_head=True)  # A e1
    ax.arrow(0, 0, *e2, color="C2", width=0.02, length_includes_head=True)  # A e2
    ax.axhline(0, color="gray", lw=0.5)
    ax.axvline(0, color="gray", lw=0.5)
    ax.set_aspect("equal")
    ax.grid(True, alpha=0.3)
    ax.set_title(title or f"det = {np.linalg.det(A):.2f}")
    ax.legend(loc="best", fontsize=8)
    return ax

# Example: a horizontal shear
# visualize_2d([[1, 1], [0, 1]], title="Shear")
# plt.show()

Now the experiment that makes this whole chapter visible. We apply the three SVD factors one at a time — $V^{\mathsf{T}}$, then $\Sigma V^{\mathsf{T}}$, then the full $U\Sigma V^{\mathsf{T}} = A$ — and draw each stage in its own panel, so you can watch a complicated transformation resolve into rotate, then stretch, then rotate:

# SVD as rotate–stretch–rotate: apply V^T, then ΣV^T, then UΣV^T, frame by frame.
import numpy as np, matplotlib.pyplot as plt
from visualizer import visualize_2d

A = np.array([[2.0, 2.0], [-1.0, 1.0]])
U, S, Vt = np.linalg.svd(A)
Sig = np.diag(S)

fig, axes = plt.subplots(1, 4, figsize=(18, 5))
visualize_2d(np.eye(2),         title="Start: unit square",         ax=axes[0])
visualize_2d(Vt,                title="After Vᵀ  (rotate input)",    ax=axes[1])
visualize_2d(Sig @ Vt,          title="After ΣVᵀ (stretch by σ's)",  ax=axes[2])
visualize_2d(U @ Sig @ Vt,      title="After UΣVᵀ = A (rotate out)", ax=axes[3])
plt.tight_layout(); plt.show()
print("singular values (semi-axes of the ellipse):", np.round(S, 4))

Figure 30.1. The SVD as rotate–stretch–rotate, in four frames. Panel 1 is the dashed-blue unit square with the standard basis vectors. Panel 2 applies $V^{\mathsf{T}}$ alone: the square is rotated (here by $-45°$) with no change in size — a rigid motion, since $V^{\mathsf{T}}$ is orthogonal. Panel 3 applies $\Sigma V^{\mathsf{T}}$: the rotated square is now stretched along the coordinate axes, by $\sigma_1 = 2\sqrt2 \approx 2.83$ horizontally and $\sigma_2 = \sqrt2 \approx 1.41$ vertically, turning it into a rectangle (the unit circle would become an axis-aligned ellipse with these semi-axes). Panel 4 applies the full $U\Sigma V^{\mathsf{T}} = A$: the final rotation $U$ orients the stretched shape into its true position — here $U = I$, so panel 4 equals panel 3, and the parallelogram is exactly $A$'s image of the unit square. Alt-text: four side-by-side plots showing a unit square first rotated, then stretched into a rectangle along the axes, then rotated into the final parallelogram, illustrating the rotate–stretch–rotate decomposition of a matrix.

The four frames are the theorem made visible. The complicated parallelogram in panel 4 — which looked like an inscrutable mix of stretch, shear, and rotation back in §30.1 — is revealed as three clean motions: a rotation, an axis-aligned stretch by the singular values, and a final rotation. That is what every matrix does. And the lengths of the stretched rectangle's sides are precisely the singular values, just as the semi-axes of the image ellipse are. If you ran this on a matrix with a nontrivial $U$ (say $A = \begin{psmallmatrix}3 & 0\\ 4 & 5\end{psmallmatrix}$), panel 4 would differ from panel 3 by a genuine final rotation — but the story would be identical: rotate, stretch, rotate.

Geometric Intuition — Compare this four-panel figure with the single-stage figures from earlier chapters. A pure rotation (Chapter 21) is only panel 2's motion. A pure diagonal scaling is only panel 3's motion. A symmetric matrix (Chapter 27) is the special case where the two rotations are inverses of each other ($U = V$), so it rotates in, stretches, and rotates back the same way — pure stretch along fixed perpendicular eigen-axes. A general matrix is the full three-stage sequence with $U \ne V$: it stretches along one set of perpendicular axes ($V$'s columns) and lands the result on a different set ($U$'s columns). Watching the square rotate, then stretch, then rotate is watching "every linear map is rotate–stretch–rotate" with your own eyes. This is the picture the whole chapter has been building toward.

30.12 How do you build the SVD from scratch?

We close the exposition with the chapter's contribution to the from-scratch toolkit you have been assembling since Chapter 2. The natural deliverable is a function that computes the SVD of a matrix using the $A^{\mathsf{T}}A$ construction we proved in §30.5 — exactly the four-step hand recipe, turned into code, and verified against numpy.

The honest part to acknowledge, in the spirit of the rest of the book, is that the hard sub-problem — finding the eigenvalues and orthonormal eigenvectors of the symmetric $A^{\mathsf{T}}A$ — is precisely what the eigen-routines of toolkit/eigen.py (Chapters 23 and 29) and the spectral decomposition of Chapter 27 already do. So your svd_from_scratch is allowed to call your earlier eigen-solver (or numpy's eigh) to diagonalize $A^{\mathsf{T}}A$; its own from-scratch job is the assembly: sort the eigenvalues in descending order, take square roots to get the singular values, stack the eigenvectors into $V$, and form $U$ column by column via $\mathbf{u}_i = A\mathbf{v}_i/\sigma_i$. That assembly is pure linear algebra you can write by hand, and it is where the understanding lives.

Build Your Toolkit — Implement svd_from_scratch(A) in toolkit/svd.py. Following the §30.5 construction: (1) form $S = A^{\mathsf{T}}A$ (symmetric); (2) obtain its eigenvalues and orthonormal eigenvectors (from your toolkit/eigen.py qr_algorithm, or np.linalg.eigh for now), and sort them in descending order of eigenvalue; (3) set the singular values $\sigma_i = \sqrt{\lambda_i}$ (clamp tiny negatives from rounding up to $0$ before the square root) and stack the eigenvectors as the columns of $V$; (4) for each $\sigma_i > \text{tol}$, set $\mathbf{u}_i = A\mathbf{v}_i / \sigma_i$, and if needed extend $\{\mathbf{u}_i\}$ to a full orthonormal basis with Gram–Schmidt (Chapter 20). Return $U$, the singular-value array, and $V^{\mathsf{T}}$ — matching numpy's output order. Then verify against np.linalg.svd: confirm the singular values match exactly, that $U^{\mathsf{T}}U = I$ and $V^{\mathsf{T}}V = I$, and that $U\Sigma V^{\mathsf{T}}$ reconstructs $A$ to tolerance — and, crucially, mind the conventions: your singular vectors may differ from numpy's by paired sign flips (§30.4), so compare the reconstruction and the singular values, not the raw $U$ and $V$ entrywise. This module joins eigen.py (Chapters 23/29) and is the direct ancestor of the PCA routine you will write in Chapter 32.

Here is the kind of check your finished function should pass — written with numpy here so you can confirm the expected behavior before coding the from-scratch assembly:

# Expected behavior of svd_from_scratch(A) via A^T A — verify against np.linalg.svd.
import numpy as np
def svd_from_scratch(A):
    A = np.asarray(A, dtype=float)
    w, V = np.linalg.eigh(A.T @ A)                  # eigenvalues ASCENDING
    order = np.argsort(w)[::-1]                      # sort DESCENDING
    w, V = w[order], V[:, order]
    sigma = np.sqrt(np.clip(w, 0, None))            # σ = √λ, clamp rounding
    U = np.zeros((A.shape[0], A.shape[0]))
    for i, s in enumerate(sigma):
        if s > 1e-12:
            U[:, i] = (A @ V[:, i]) / s              # u_i = A v_i / σ_i
    return U, sigma, V.T                             # match numpy: U, σ, Vᵀ

A = np.array([[2.0, 2.0], [-1.0, 1.0]])
U, sig, Vt = svd_from_scratch(A)
Un, Sn, Vtn = np.linalg.svd(A)
print("our singular values :", np.round(sig, 6))
print("numpy singular values:", np.round(Sn, 6))    # identical
print("reconstructs A?      :", np.allclose(U @ np.diag(sig) @ Vt, A))
print("U orthonormal?       :", np.allclose(U.T @ U, np.eye(2)))

our singular values : [2.828427 1.414214]
numpy singular values: [2.828427 1.414214]
reconstructs A?      : True
U orthonormal?       : True

The from-scratch construction reproduces numpy's singular values exactly and reconstructs $A$ to machine precision. (As discussed in §30.4, the raw $U$ and $V$ may differ from numpy's by paired sign flips — which is why we verify the reconstruction, not the entries.) That is the SVD, operationalized: the spectral theorem on $A^{\mathsf{T}}A$, assembled into the universal factorization.

Historical Note — The SVD has a long and tangled history. The decomposition for real square matrices was discovered independently by Eugenio Beltrami (1873) and Camille Jordan (1874), with later contributions by James Joseph Sylvester, Erhard Schmidt, and Hermann Weyl extending it to rectangular and infinite-dimensional settings [verify]. The result that truncating the SVD gives the best low-rank approximation — which powers the image compression of Chapter 31 — is usually credited to Carl Eckart and Gale Young (1936), and is often called the Eckart–Young theorem, though Schmidt and Weyl had related results earlier [verify]. The SVD became the practical workhorse of numerical linear algebra only after Gene Golub and William Kahan published a stable algorithm to compute it in 1965 [verify]. Treat these attributions as the standard account and corroborate before citing; the multiple independent discoveries make the precise history genuinely contested.

30.13 What have we built, and where does it lead?

We began with a tangled-looking transformation and ended with a clean, universal statement: every linear map is a rotation, then a stretch along perpendicular axes, then another rotation — $A = U\Sigma V^{\mathsf{T}}$, rotate–stretch–rotate, for any matrix that exists. From that one picture, everything followed. We saw the singular values $\sigma_i \ge 0$ as the semi-axes of the ellipse $A$ makes from the unit circle, the right singular vectors $\mathbf{v}_i$ as the special input axes, and the left singular vectors $\mathbf{u}_i$ as the output axes they land on, tied together by the single relation $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$. We proved the SVD exists for every matrix — the universality diagonalization never had — by applying the Spectral Theorem of Chapter 27 to the always-symmetric $A^{\mathsf{T}}A$, with $\sigma_i = \sqrt{\lambda_i}$ the square roots of its non-negative eigenvalues. We computed a full $2\times 2$ SVD by hand and reconciled it with numpy, learning that the singular values are pinned down but the singular vectors carry sign and ordering conventions. And we cashed the decomposition out: the rank is the count of nonzero singular values; the columns of $U$ and $V$ give orthonormal bases for all four fundamental subspaces of Chapter 14 at once; the singular values are the operator norm, the Frobenius norm, and (as a ratio) the condition number.

This chapter is the convergence of the book's deepest themes. Geometry and algebra are two views of one object: "$A = U\Sigma V^{\mathsf{T}}$" is algebra, "rotate–stretch–rotate" is geometry, and we proved they are the same statement. Linear algebra is the study of linear transformations: the SVD is the ultimate structural description of what a transformation does, valid for every transformation without exception. The four fundamental subspaces organize everything: the SVD builds orthonormal bases for all four and makes their orthogonality visible as perpendicular columns of $U$ and $V$. And eigenvalues reveal what a matrix really does: the SVD extends that revelation from the lucky symmetric matrices of Chapter 27 to the entire universe of matrices, by routing through $A^{\mathsf{T}}A$. The threshold idea to carry forward is exactly this universality — once you see that every matrix is rotate–stretch–rotate, the back half of the book stops being a collection of separate techniques and becomes one idea seen from different angles.

The forward references are the most consequential in the book, and Part VI exists to develop them. Chapter 31 (SVD Applications) delivers the signature payoff teased since Chapter 1: keep only the largest few singular values and you get a low-rank approximation — a rank-10 image is a blurry ghost, a rank-200 reconstruction is visually indistinguishable from the original, at a fraction of the data. The Eckart–Young theorem proves this truncation is the best possible approximation of each rank, and the error is measured by the discarded singular values, using the Frobenius-norm identity of §30.9. The same low-rank idea denoises signals and fills in missing data. Chapter 32 (Principal Component Analysis) reveals that the workhorse of data science is the SVD in disguise: the principal components of a dataset are the right singular vectors of its centered data matrix, and the variance along each is $\sigma_i^2/(n-1)$ — PCA is the SVD applied to data, the engine of dimensionality reduction. Chapter 33 (Application: Machine Learning) shows that the recommender systems behind streaming and shopping are matrix factorizations — the low-rank idea from Chapter 31 applied to a giant table of who-likes-what — and that the same singular values diagnose the numerical health of every layer of a neural network. And Chapter 38 (Numerical Linear Algebra) develops the condition number $\sigma_1/\sigma_n$ of §30.9 into a full theory of when computations can be trusted.

# Forward look: truncating the SVD to the top-k singular values is low-rank approximation.
import numpy as np
A = np.array([[2.0, 2.0], [-1.0, 1.0]])
U, S, Vt = np.linalg.svd(A)
print("singular values:", np.round(S, 4))
# Best rank-1 approximation: keep only σ₁, u₁, v₁ (Eckart–Young, Chapter 31).
A1 = S[0] * np.outer(U[:, 0], Vt[0, :])
print("best rank-1 approximation A₁ =\n", np.round(A1, 4))
print("approximation error ‖A - A₁‖_F =", round(np.linalg.norm(A - A1, 'fro'), 4),
      " = discarded σ₂ =", round(S[1], 4))

singular values: [2.8284 1.4142]
best rank-1 approximation A₁ =
 [[ 2.  2.]
 [ 0.  0.]]
approximation error ‖A - A₁‖_F = 1.4142  = discarded σ₂ = 1.4142

Keeping only the largest singular value gives the best rank-1 approximation of $A$, and the error is exactly the discarded singular value $\sigma_2 = \sqrt2$ — a first glimpse of the Eckart–Young theorem that Chapter 31 turns into image compression. The same decomposition that we proved factors every matrix into rotate–stretch–rotate is about to compress images, find principal components, power recommenders, and certify numerical reliability. Hold onto the rotate–stretch–rotate picture and the singular values as the ellipse's axes; they are the single most useful idea in the back half of this book, and they are about to do extraordinary work.

Prerequisites

Learning Objectives

In This Chapter

The Singular Value Decomposition: The Most Important Factorization in Linear Algebra

30.1 What does the SVD say a matrix does to space?

30.1.1 Why two rotations, and not one?

30.2 What are the singular values and singular vectors?

30.3 How do you compute the SVD of a matrix? A full $2\times 2$ by hand

30.4 Does the hand computation match numpy?

30.4.1 A second example with a genuine rotation $U$

30.5 Why does every matrix have an SVD? (the spectral theorem on $A^{\mathsf{T}}A$)

30.6 How is the SVD related to eigen-decomposition?

Warning

30.7 How do you read the rank off the singular values?

30.8 How does the SVD reveal the four fundamental subspaces?

30.9 What is the difference between the full and reduced SVD, and what do the singular values measure?

30.9.1 Full versus reduced SVD

30.9.2 Singular values, the operator norm, the Frobenius norm, and the condition number

30.10 What is the pseudoinverse, and how does the SVD solve every least-squares problem?

30.11 What does the SVD look like? (the visualizer returns one last time)

30.12 How do you build the SVD from scratch?

30.13 What have we built, and where does it lead?