Numerical Linear Algebra: Stability, Conditioning, and How Computers Actually Do It

DataField.Dev

45 min read

> Learning paths. Math majors — read everything; the condition-number bound and the backward-stability argument are the rigorous heart of the chapter, and the Math-Major Sidebar sharpens the inequality. CS / Data Science — focus on the Geometric...

Prerequisites

chapter-30-singular-value-decomposition
chapter-10-lu-decomposition

Learning Objectives

Explain how floating-point arithmetic represents real numbers and why rounding, machine epsilon, and catastrophic cancellation make exact computation impossible.
Define the condition number $\kappa(A)=\sigma_{\max}/\sigma_{\min}$ from the SVD and use it to bound the error in solving $A\mathbf{x}=\mathbf{b}$.
Diagnose an ill-conditioned system (e.g., a Hilbert matrix) and predict how many digits of accuracy you can trust.
Distinguish a problem's conditioning from an algorithm's stability, and explain why QR and SVD beat the normal equations and why we pivot in LU.
Describe the major iterative methods (Jacobi, Gauss–Seidel, the power method, conjugate gradient) and when they beat direct methods on large sparse systems.
Reason about what LAPACK and numpy actually do under the hood and verify your own error estimates against `np.linalg.cond`.

In This Chapter

38.1 What is floating-point arithmetic, and why can't computers do exact math?
38.2 What is the condition number of a matrix, and what does it tell us?
38.3 The Hilbert matrix: how a tiny perturbation wrecks the solution
38.4 What is numerical stability, and why is it different from conditioning?
38.5 Why do QR and SVD beat the normal equations for least squares?
38.6 How do iterative methods solve enormous systems?
38.7 What do LAPACK and numpy actually do under the hood?
38.8 Putting it together: a checklist for trustworthy numerical linear algebra

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Numerical Linear Algebra: Stability, Conditioning, and How Computers Actually Do It

Learning paths. Math majors — read everything; the condition-number bound and the backward-stability argument are the rigorous heart of the chapter, and the Math-Major Sidebar sharpens the inequality. CS / Data Science — focus on the Geometric Intuition, the numpy experiments, and the Hilbert-matrix and least-squares disasters; this is the chapter that explains why your regression sometimes returns nonsense. Physics / Engineering — focus on the geometry of error amplification, the pivoting and stability arguments, and the iterative-methods section that powers every large-scale simulation you will ever run.

Every chapter before this one lived in a fantasy. We wrote $A\mathbf{x}=\mathbf{b}$ and solved it; we computed eigenvalues and singular values and reported them to as many digits as we pleased; we factored $A=U\Sigma V^{\mathsf{T}}$ as though the entries of $U$, $\Sigma$, and $V$ were Platonic ideals waiting to be read off. That fantasy is exact arithmetic — the world of pencil, paper, and infinite precision, where $1/3$ is a number and not an approximation, and where the only errors are the ones you make yourself.

Real computers do not live in that world. A 64-bit machine stores a real number in 64 bits, which means it can represent only finitely many of the uncountably many reals. Every number you type, every product you form, every sum you accumulate is rounded to the nearest representable value. Most of the time you never notice — the errors are tiny, around fifteen or sixteen significant digits down, and they wash out. But sometimes a problem amplifies those tiny errors until the answer your computer hands you is not merely inaccurate but meaningless: a vector with no correct digits at all, returned with the same serene confidence as a perfect answer.

This chapter is about the difference between those two worlds, and about the two distinct things that can go wrong when we cross from one to the other. The first is conditioning — a property of the problem: how much does the true answer change when the inputs are perturbed a little? The second is stability — a property of the algorithm: does the method introduce more error than the problem inherently demands? A problem can be hopelessly ill-conditioned, in which case no algorithm on Earth can save it. Or a problem can be perfectly well-conditioned and a careless algorithm can still wreck it. Numerical linear algebra is the discipline of telling these two apart and choosing methods that respect both. This is numerical linear algebra explained from the ground up — the reality check that turns textbook formulas into software that works.

The Key Insight — Correctness in exact arithmetic is not enough. The two questions that matter on a real machine are: How sensitive is this problem to perturbation? (conditioning) and Does my algorithm add error the problem didn't ask for? (stability). The condition number answers the first; the choice between QR, SVD, pivoted LU, and the normal equations answers the second.

We begin with the machine itself.

38.1 What is floating-point arithmetic, and why can't computers do exact math?

Intuition first. Imagine you may write any number you like, but only using a fixed-width slip of paper: one sign, a handful of significant digits, and an exponent. You can write $6.022 \times 10^{23}$ and $1.602 \times 10^{-19}$ with equal ease — the exponent slides the decimal point wherever it needs to go, which is why we call it floating point. But you cannot write every number. Between any two of your representable numbers there is a gap, and any real number that falls in a gap must be rounded to one end of it. The slip of paper is finite; the real line is not. Something has to give, and what gives is exactness.

That is precisely what a computer does, in binary rather than decimal. The dominant standard, IEEE 754 double precision (float64 in numpy), stores a real number in 64 bits: 1 sign bit, 11 exponent bits, and 52 bits for the mantissa (the significant digits). A nonzero number is stored as

$$ x = \pm (1.b_1 b_2 \cdots b_{52})_2 \times 2^{e}, $$

a sign, a binary fraction with 52 bits after the implied leading 1, and an integer exponent $e$ roughly in the range $-1022$ to $1023$. The 52 stored bits plus the implied leading 1 give 53 bits of precision, which buy you about $53 \log_{10} 2 \approx 15.95$ decimal significant digits — call it "about sixteen digits." That is your entire budget. Everything beyond the sixteenth digit is gone.

Geometric Intuition — Picture the representable floating-point numbers as tick marks on the real line. Near zero the ticks are dense; far from zero they are sparse. The marks are logarithmically spaced: between $1$ and $2$ the gap between consecutive doubles is about $2.2\times 10^{-16}$, but between $2^{52}$ and $2^{53}$ the gap is exactly $1$ — there are no integers representable between consecutive doubles up there, and beyond $2^{53}$ you cannot even represent every integer. Floating point trades absolute precision for relative precision: it keeps roughly the same number of significant digits everywhere, which is exactly what you want for a world where quantities range from the size of an atom to the size of a galaxy.

The size of that gap near $1$ has a name. Machine epsilon, written $\varepsilon_{\text{mach}}$, is the distance from $1$ to the next larger representable double — equivalently, the smallest number you can add to $1$ and still get something different from $1$. For IEEE double precision,

$$ \varepsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}. $$

This single number governs almost everything in this chapter. The fundamental promise of IEEE arithmetic is that the result of any single basic operation ($+$, $-$, $\times$, $\div$, $\sqrt{\ }$) is the exactly rounded version of the true result: if $\text{fl}(x)$ denotes the floating-point value stored for a real $x$, then

$$ \text{fl}(x) = x(1 + \delta), \qquad |\delta| \le \tfrac{1}{2}\varepsilon_{\text{mach}}. $$

Each operation introduces a relative error no larger than half a machine epsilon. One operation is harmless. The trouble is that a linear-algebra computation performs millions of them, and the errors can interact.

# Floating point is finite: machine epsilon and the classic surprise.
import numpy as np
print(np.finfo(np.float64).eps)   # 2.220446049250313e-16  (= 2**-52)
print(np.finfo(np.float32).eps)   # 1.1920929e-07           (single precision)
print(0.1 + 0.2 == 0.3)           # False
print(repr(0.1 + 0.2))            # '0.30000000000000004'
print(1.0 + np.finfo(float).eps/2 == 1.0)  # True: half an eps vanishes

Run it and you meet the most famous gotcha in computing: 0.1 + 0.2 is not 0.3. None of $0.1$, $0.2$, or $0.3$ is representable exactly in binary — they are infinite repeating binary fractions, just as $1/3 = 0.333\ldots$ is an infinite repeating decimal. Each is rounded on input, and the rounded sum lands one tick above the rounded $0.3$. The error is about $5.6\times 10^{-17}$, far too small to matter for most purposes — but == does not care how small. It asks for exact equality, and exact equality is the wrong question to ask of floating-point numbers.

Common Pitfall — Never test floating-point numbers for equality with ==. 0.1 + 0.2 == 0.3 is False, and a loop that waits for a float to hit an exact value may never terminate. Compare with a tolerance instead: abs(a - b) <= tol, or np.isclose(a, b) (which uses both a relative and an absolute tolerance). The same applies to checking whether a matrix is singular: a determinant of 1e-18 is not a reliable signal of singularity — a small singular value relative to the largest is the honest test, which is exactly the condition number we are about to meet.

Single vs. double, and why precision is a budget

Double precision gives ~16 digits; single precision (float32) gives only ~7, with $\varepsilon_{\text{mach}} \approx 1.19\times 10^{-7}$. For decades, scientific computing defaulted to double because the extra digits were a cheap insurance policy against error growth. That calculus is shifting: modern machine learning runs enormous matrix multiplications in float32, float16, or even bfloat16, trading precision for speed and memory because a neural network's training is robust to noise in a way that solving an ill-conditioned linear system is not. Knowing how much precision a given computation needs — and how much it will lose — is the core skill this chapter builds. The connection between bit-level number formats and arithmetic is explored further in floating-point arithmetic, where you can see how IEEE 754 is actually laid out in hardware.

The cost of choosing the wrong precision is concrete, and we can measure it. Take the Hilbert matrix again (you will meet it in detail in Section 38.3) and solve the same problem twice — once in float64, once in float32:

# The same problem, two precisions: float32 runs out of digits much sooner.
import numpy as np
from scipy.linalg import hilbert
for n in (6, 8):
    H = hilbert(n); x_true = np.ones(n); b = H @ x_true
    x64 = np.linalg.solve(H.astype(np.float64), b.astype(np.float64))
    x32 = np.linalg.solve(H.astype(np.float32), b.astype(np.float32))
    e64 = np.linalg.norm(x64 - x_true) / np.linalg.norm(x_true)
    e32 = np.linalg.norm(x32.astype(np.float64) - x_true) / np.linalg.norm(x_true)
    print(n, '%.2e' % np.linalg.cond(H), 'float64 err %.2e' % e64, 'float32 err %.2e' % e32)
# 6 1.50e+07 float64 err 1.42e-10 float32 err 3.25e-02
# 8 1.53e+10 float64 err 6.12e-08 float32 err 3.55e+00

At $n=6$ ($\kappa\approx 1.5\times 10^7$), double precision keeps ten correct digits while single precision is already down to about one — a 3% relative error. At $n=8$ ($\kappa\approx 1.5\times 10^{10}$), single precision has no correct digits at all (a relative error of $3.55$), while double precision still holds seven. The pattern follows the rule of thumb exactly: you start with $\sim 16$ digits in double or $\sim 7$ in single, and you lose $\log_{10}\kappa$ of them. Single precision crosses zero usable digits at $\kappa\approx 10^7$; double precision survives until $\kappa\approx 10^{16}$. That nine-order-of-magnitude cushion is why scientific solvers default to double, and why ML's move to lower precision is a deliberate bet that its problems are well-conditioned enough to tolerate the smaller budget.

Historical Note — The IEEE 754 standard was ratified in 1985, with William Kahan its principal architect; he later won the 1989 Turing Award largely for this work [verify]. Before the standard, every manufacturer rounded differently, and a program that ran correctly on one machine could give wrong answers on another. IEEE 754 made floating-point behavior predictable — the same operation gives the same correctly-rounded result everywhere — which is precisely what makes the error analysis in this chapter possible at all.

The one operation that destroys precision: subtraction

Not all floating-point operations are equally dangerous. Multiplication and division are gentle — each introduces only the unavoidable half-an-epsilon relative error and never more. Addition of like-signed numbers is gentle too. The single operation that can annihilate precision is subtraction of two nearly-equal numbers, the phenomenon called catastrophic cancellation. Here is the mechanism, and it is worth understanding deeply because it is the root cause of most "mysterious" numerical failures.

Suppose $a = 1.23456789012345$ and $b = 1.23456789012300$, each stored to sixteen significant digits. Both are excellent approximations of their true values — sixteen good digits each. But their difference is $a - b = 0.00000000000045$, which has only two significant digits. The fourteen leading digits that the two numbers shared canceled exactly, leaving only the last two digits — and those last two digits are exactly where the rounding error of $a$ and $b$ lived. The subtraction did not create error; it exposed the error that was already there, by deleting the good digits in front of it. You went in with sixteen reliable digits and came out with two. That is the catastrophe.

We saw the consequence in the quadratic formula. Solving $x^2 + 10^8 x + 1 = 0$, the small root computed the textbook way, $(-b+\sqrt{b^2-4ac})/(2a)$, requires subtracting $\sqrt{b^2-4ac}\approx 10^8$ from $b = 10^8$ — two nearly-equal large numbers — and the naive formula returns $-7.45\times 10^{-9}$ when the true root is $-10^{-8}$, an error in the first significant digit. The algebraically equivalent form $2c/(-b-\sqrt{b^2-4ac})$ adds two like-signed numbers instead of subtracting, never cancels, and returns $-10^{-8}$ to full precision. Identical mathematics; one formula is stable and the other is not, and the only difference is whether a near-cancellation occurs. This is the purest possible illustration of the chapter's thesis that stability is a property of the algorithm, not the problem — and it is why, when a numerical result looks wrong, the first thing an expert hunts for is a subtraction of nearly-equal quantities.

38.2 What is the condition number of a matrix, and what does it tell us?

Intuition first. Suppose you are solving $A\mathbf{x}=\mathbf{b}$, and someone nudges the right-hand side $\mathbf{b}$ by a hair — perhaps because $\mathbf{b}$ came from a measurement with a little noise, or simply because storing $\mathbf{b}$ in floating point rounded it. The question that decides whether your answer is trustworthy is: how much does the solution $\mathbf{x}$ move in response? For some matrices, a hair's nudge in $\mathbf{b}$ produces a hair's nudge in $\mathbf{x}$ — the problem is forgiving. For others, a hair's nudge in $\mathbf{b}$ swings $\mathbf{x}$ wildly across the room. The condition number of a matrix is the number that measures exactly this amplification, and it is the single most important diagnostic in all of numerical linear algebra.

Here is the geometric source of the trouble. Recall from Chapter 30 that the SVD writes $A = U\Sigma V^{\mathsf{T}}$, and that geometrically $A$ acts on the unit sphere by rotating (by $V^{\mathsf{T}}$), stretching along the axes by the singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$ (that's $\Sigma$), and rotating again (by $U$). The image of the unit sphere is an ellipsoid whose semi-axis lengths are exactly the singular values. If the largest axis $\sigma_{\max}$ is enormous and the smallest axis $\sigma_{\min}$ is tiny, the ellipsoid is a long thin sliver. Inverting $A$ reverses the process: it stretches by $1/\sigma_i$. The direction that $A$ squashed by $\sigma_{\min}$ gets blown up by $1/\sigma_{\min}$ on the way back. A small error along that squashed direction becomes a giant error in the solution.

Geometric Intuition — Using the visualizer from Chapter 1, picture a $2\times 2$ matrix that maps the unit square to a long, thin parallelogram — wide in one direction, nearly flat in the other. The "flat" direction is the small singular value. To solve $A\mathbf{x}=\mathbf{b}$ is to ask "which input landed at $\mathbf{b}$?" — to run the transformation backward. Backward, the flat direction becomes a violent stretch: two right-hand sides $\mathbf{b}$ that look almost identical can have pre-images $\mathbf{x}$ that are wildly far apart. An ill-conditioned matrix is a near-degenerate transformation — it almost collapses a dimension, and "almost collapsing" forward means "almost exploding" backward.

That ratio of the biggest stretch to the smallest is the definition.

Definition (condition number, 2-norm). For an invertible matrix $A$ with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n > 0$, the condition number (in the 2-norm) is $$ > \kappa(A) \;=\; \lVert A \rVert_2 \, \lVert A^{-1} \rVert_2 \;=\; \frac{\sigma_{\max}}{\sigma_{\min}}. > $$ It satisfies $\kappa(A) \ge 1$ always, with $\kappa(A)=1$ for orthogonal matrices (which only rotate and reflect — no stretching). A matrix with large $\kappa(A)$ is called ill-conditioned; a singular matrix has $\sigma_{\min}=0$ and so $\kappa(A)=\infty$.

The middle expression, $\lVert A\rVert_2 \lVert A^{-1}\rVert_2$, is the general definition for any matrix norm; the right expression, $\sigma_{\max}/\sigma_{\min}$, is what it equals in the 2-norm, because the 2-norm of a matrix is its largest singular value and the 2-norm of $A^{-1}$ is $1/\sigma_{\min}$ (the singular values of $A^{-1}$ are the reciprocals of those of $A$). The condition number falls straight out of the SVD you already know.

The error bound: how $\kappa(A)$ controls accuracy

Now we make "amplification" precise. Suppose we solve $A\mathbf{x}=\mathbf{b}$ but the right-hand side is perturbed to $\mathbf{b} + \Delta\mathbf{b}$, giving a perturbed solution $\mathbf{x}+\Delta\mathbf{x}$. Then:

Theorem (perturbation bound for linear systems). For invertible $A$ and $\mathbf{b}\neq\mathbf{0}$, if $A(\mathbf{x}+\Delta\mathbf{x}) = \mathbf{b}+\Delta\mathbf{b}$, then $$ > \frac{\lVert \Delta\mathbf{x}\rVert}{\lVert \mathbf{x}\rVert} \;\le\; \kappa(A)\,\frac{\lVert \Delta\mathbf{b}\rVert}{\lVert \mathbf{b}\rVert}. > $$

Why we care. The left side is the relative error in the answer; the right side is the relative error in the input times the condition number. The condition number is exactly the worst-case factor by which input error is magnified into output error. If $\kappa(A)=10^6$ and your right-hand side is good to ten digits ($\lVert\Delta\mathbf{b}\rVert/\lVert\mathbf{b}\rVert \approx 10^{-10}$), your answer may be good to only four ($10^6 \cdot 10^{-10} = 10^{-4}$). You lose about $\log_{10}\kappa(A)$ digits of accuracy. That rule of thumb is the most useful sentence in the chapter.

Key idea. $A\mathbf{x}=\mathbf{b}$ and $A\,\Delta\mathbf{x} = \Delta\mathbf{b}$ (subtract the two equations), so $\Delta\mathbf{x}=A^{-1}\Delta\mathbf{b}$. Bound each piece by the relevant norm.

Proof. From $A\mathbf{x}=\mathbf{b}$ and $A(\mathbf{x}+\Delta\mathbf{x})=\mathbf{b}+\Delta\mathbf{b}$, subtracting gives $A\,\Delta\mathbf{x}=\Delta\mathbf{b}$, so $\Delta\mathbf{x}=A^{-1}\Delta\mathbf{b}$ and therefore $\lVert\Delta\mathbf{x}\rVert \le \lVert A^{-1}\rVert\,\lVert\Delta\mathbf{b}\rVert$. From $A\mathbf{x}=\mathbf{b}$ we get $\lVert\mathbf{b}\rVert = \lVert A\mathbf{x}\rVert \le \lVert A\rVert\,\lVert\mathbf{x}\rVert$, hence $\dfrac{1}{\lVert\mathbf{x}\rVert} \le \dfrac{\lVert A\rVert}{\lVert\mathbf{b}\rVert}$. Multiplying the two inequalities, $$ \frac{\lVert\Delta\mathbf{x}\rVert}{\lVert\mathbf{x}\rVert} \;\le\; \lVert A^{-1}\rVert\,\lVert\Delta\mathbf{b}\rVert \cdot \frac{\lVert A\rVert}{\lVert\mathbf{b}\rVert} \;=\; \lVert A\rVert\,\lVert A^{-1}\rVert \cdot \frac{\lVert\Delta\mathbf{b}\rVert}{\lVert\mathbf{b}\rVert} \;=\; \kappa(A)\,\frac{\lVert\Delta\mathbf{b}\rVert}{\lVert\mathbf{b}\rVert}. \qquad\blacksquare $$

What this means. The bound is tight — there exist perturbations $\Delta\mathbf{b}$ that achieve it (those aligned with the smallest singular direction). So $\kappa(A)$ is not a pessimistic overestimate; it is the genuine worst-case amplification baked into the problem. No clever algorithm can beat it, because it is a property of $A$, not of how you solve.

# The condition number from the SVD equals np.linalg.cond, and the bound holds.
import numpy as np
from scipy.linalg import hilbert
H = hilbert(6)                      # a famously ill-conditioned matrix (Sec 38.3)
U, S, Vt = np.linalg.svd(H)
print(S[0] / S[-1])                 # 14951058.642...  sigma_max/sigma_min
print(np.linalg.cond(H))            # 14951058.642...  exact match

x = np.ones(6); b = H @ x           # construct a problem with known answer x = (1,...,1)
np.random.seed(7)
db = 1e-10 * np.random.standard_normal(6)     # a tiny perturbation of b
xp = np.linalg.solve(H, b + db)
lhs = np.linalg.norm(xp - x) / np.linalg.norm(x)
rhs = np.linalg.cond(H) * np.linalg.norm(db) / np.linalg.norm(b)
print(lhs, rhs)        # 0.000291...  <=  0.000839...   the bound holds

The condition number from the SVD matches np.linalg.cond to every digit (it is the same computation), and the measured relative error sits comfortably under the $\kappa(A)\cdot(\text{input error})$ bound. The point of the bound is not to predict the error exactly but to cap it — and to warn you, before you trust a single digit, how much damage a noisy or merely rounded input can do.

A hand-sized example you can feel

The Hilbert matrix is dramatic but large; let us make the amplification visible in a matrix small enough to hold in your head. Consider

$$ A = \begin{bmatrix} 1 & 1 \\ 1 & 1.0001 \end{bmatrix}. $$

The two rows are nearly identical — the second is the first plus a whisper. Geometrically, the two row-vectors $(1,1)$ and $(1, 1.0001)$ point in almost exactly the same direction, so the lines $x_1 + x_2 = b_1$ and $x_1 + 1.0001\,x_2 = b_2$ are nearly parallel. Two nearly-parallel lines intersect at a point that is exquisitely sensitive to where the lines sit: nudge one line a hair and the intersection slides a mile down the corridor between them. That is ill-conditioning in its most elementary geometric form — and it is the same near-collinearity that plagued the Hilbert columns, now visible in two dimensions.

Let us watch it. With $\mathbf{b} = (2, 2.0001)$ the exact solution is $\mathbf{x} = (1, 1)$. Now perturb only the last component of $\mathbf{b}$ by $0.0001$, to $\mathbf{b}' = (2, 2.0002)$ — a relative change in the input of about $3.5\times 10^{-5}$, a few thousandths of a percent. The solution lurches to $(0, 2)$:

# A 2x2 you can check by hand: near-parallel rows mean a fragile intersection.
import numpy as np
A = np.array([[1.0, 1.0], [1.0, 1.0001]])
print(np.linalg.cond(A))                 # 40002.0  (sigma_max/sigma_min)
b  = np.array([2.0, 2.0001]); print(np.linalg.solve(A, b))   # [1. 1.]
b2 = np.array([2.0, 2.0002]); print(np.linalg.solve(A, b2))  # [0. 2.]  <-- huge swing
print(np.linalg.norm(b2-b)/np.linalg.norm(b))      # 3.54e-05   input changed this much
print(np.linalg.norm(np.array([0,2])-np.array([1,1]))/np.linalg.norm([1,1]))  # 1.0  output changed 100%

A $3.5\times 10^{-5}$ relative change in the input produced a 100% relative change in the output. The amplification factor is about $1/3.5\times 10^{-5}\approx 2.8\times 10^4$ — right in line with $\kappa(A) = 40002$. You can see every step of this by hand: subtract row 1 from row 2 to get $0.0001\,x_2 = b_2 - b_1$, so $x_2 = (b_2-b_1)/0.0001 = 10^4(b_2-b_1)$. That factor of $10^4$ — dividing by the tiny $0.0001$ — is the error amplification, and it is exactly what makes the condition number large. Whenever you must divide by a small pivot, you are amplifying error; the condition number is the systematic, basis-independent measure of how much.

Computational Note — np.linalg.cond(A) defaults to the 2-norm condition number $\sigma_{\max}/\sigma_{\min}$, computed via the SVD. You can ask for cheaper norms (np.linalg.cond(A, 1) or np.inf) that avoid the SVD but give a less geometrically meaningful number. When in doubt, use the default 2-norm — it is the one the theory above is written for. And note $\kappa$ is dimensionless and scale-invariant: multiplying $A$ by $1000$ leaves $\kappa(A)$ unchanged, because both $\sigma_{\max}$ and $\sigma_{\min}$ scale together.

38.3 The Hilbert matrix: how a tiny perturbation wrecks the solution

Intuition first. We have talked about ill-conditioning in the abstract; now let us watch it destroy a computation in front of us. The cleanest victim is the Hilbert matrix, the symmetric matrix $H_n$ whose entry in row $i$, column $j$ is

$$ (H_n)_{ij} = \frac{1}{i+j-1}, \qquad H_4 = \begin{bmatrix} 1 & \tfrac12 & \tfrac13 & \tfrac14 \\[2pt] \tfrac12 & \tfrac13 & \tfrac14 & \tfrac15 \\[2pt] \tfrac13 & \tfrac14 & \tfrac15 & \tfrac16 \\[2pt] \tfrac14 & \tfrac15 & \tfrac16 & \tfrac17 \end{bmatrix}. $$

It looks utterly innocent — just unit fractions, symmetric, every entry between 0 and 1. It is invertible for every $n$ (its determinant is a tiny but nonzero number — for $H_6$ it is about $5.4\times 10^{-18}$, vanishingly small but genuinely not zero). And it is one of the most viciously ill-conditioned matrices in all of mathematics. Here is the subtlety the determinant exposes: a near-zero determinant smells like singularity, but the determinant alone is a treacherous gauge of conditioning. Scaling $H_6$ by 10 multiplies its determinant by $10^6$ without changing $\kappa$ at all (recall $\kappa$ is scale-invariant). The determinant conflates "small entries" with "ill-conditioned," and only the ratio of singular values $\sigma_{\max}/\sigma_{\min}$ separates the two. The Hilbert matrix's condition number grows exponentially with size:

$n$	$\kappa(H_n)$	digits lost (≈ $\log_{10}\kappa$)
3	$5.24\times 10^{2}$	2.7
4	$1.55\times 10^{4}$	4.2
5	$4.77\times 10^{5}$	5.7
6	$1.50\times 10^{7}$	7.2
8	$1.53\times 10^{10}$	10.2
10	$1.60\times 10^{13}$	13.2
12	$1.76\times 10^{16}$	16.2

Look at the last row. By $n=12$, the condition number has reached $1.76\times 10^{16}$ — larger than $1/\varepsilon_{\text{mach}}$. The rule of thumb says we will lose about 16 digits. Double precision has about 16 digits. So at $n=12$ we expect to lose all of them — every digit of the answer can be garbage. Let us prove it by experiment.

We rig the experiment so the true answer is known exactly. Choose $\mathbf{x}_{\text{true}} = (1,1,\ldots,1)^{\mathsf{T}}$, the vector of all ones, and manufacture the right-hand side as $\mathbf{b} = H_{12}\,\mathbf{x}_{\text{true}}$. Now ask numpy to solve $H_{12}\mathbf{x}=\mathbf{b}$. Since we built $\mathbf{b}$ from the known $\mathbf{x}_{\text{true}}$, the answer must be all ones. Watch what comes back.

# The Hilbert matrix disaster: a solver returns nonsense for an exact problem.
import numpy as np
from scipy.linalg import hilbert
np.set_printoptions(precision=4, suppress=True)

n = 12
H = hilbert(n)
x_true = np.ones(n)
b = H @ x_true                      # b is built FROM the known answer
print(np.linalg.cond(H))            # 1.760619121841585e+16

x_hat = np.linalg.solve(H, b)       # ask for x back; it must be all ones... right?
print(x_hat)
# [1.     1.     0.9997 1.0037 0.973  1.1185 0.6706 1.595  0.3039 1.5089 0.7888 1.038 ]
print(np.max(np.abs(x_hat - x_true)))            # 0.6961...  worst entry off by ~70%
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # 0.3249  ~32% rel error

There it is. We asked for a vector of ones, and numpy — using the best direct solver available, LAPACK's pivoted LU — returned a vector with entries ranging from $0.30$ to $1.60$. The seventh component, which should be $1$, came back as $0.67$; the eighth came back as $1.59$. The relative error is about 32%. This is not a bug in numpy. numpy did everything right. The catastrophe is inherent in the matrix: $\kappa(H_{12})\approx 1.76\times 10^{16}$ means the problem amplifies the unavoidable rounding error in storing $\mathbf{b}$ (a perturbation of relative size $\varepsilon_{\text{mach}}\approx 2.2\times 10^{-16}$) by a factor of $10^{16}$, and $10^{16}\times 2.2\times 10^{-16}\approx 2$ — an error of order 1, which is exactly what we see.

Warning

— A wrong answer can be returned with no error message and a tiny residual. The vector x_hat above satisfies $H\,\hat{\mathbf{x}}=\mathbf{b}$ to machine precision — the residual $\lVert H\hat{\mathbf{x}}-\mathbf{b}\rVert$ is about $6\times 10^{-16}$, essentially zero — even though $\hat{\mathbf{x}}$ is 32% wrong. The solver faithfully solved a nearby problem; it is just that for an ill-conditioned matrix, "nearby problem" can have a far-away answer. Always compute and report the condition number before trusting the solution of a linear system. If $\kappa(A)$ is anywhere near $1/\varepsilon_{\text{mach}}\approx 4.5\times 10^{15}$, your answer in double precision may have zero correct digits.

Push it one notch further to $n=15$ and the situation becomes absurd: $\kappa(H_{15})\approx 3.7\times 10^{17}$, and solving for the all-ones vector returns garbage with a relative error of about 4.9 — the "solution" is not just wrong in its small digits, it is off by a factor of several hundred percent, a vector pointing in nearly the wrong direction entirely.

The Key Insight — The Hilbert disaster teaches the chapter's central lesson in one experiment: the residual being small does not mean the answer is right. A small residual ($\lVert A\hat{\mathbf{x}}-\mathbf{b}\rVert$) tells you that $\hat{\mathbf{x}}$ solves a problem close to yours. The condition number tells you whether "a problem close to yours" has an answer close to yours. For an ill-conditioned matrix, those are different questions, and only the second one is the one you care about.

Common Pitfall — Confusing the residual with the error. The error is $\lVert\hat{\mathbf{x}}-\mathbf{x}_{\text{true}}\rVert$ — how far your computed answer is from the truth. The residual is $\lVert A\hat{\mathbf{x}}-\mathbf{b}\rVert$ — how far $A\hat{\mathbf{x}}$ is from $\mathbf{b}$. In the Hilbert experiment the residual is $\approx 6\times 10^{-16}$ while the error is $\approx 1.13$ (for $n=12$, measured in the 2-norm). They are related by — you guessed it — the condition number: $\text{(relative error)} \le \kappa(A)\cdot\text{(relative residual)}$. You can measure the residual (you know $A$ and $\mathbf{b}$); you usually cannot measure the error (you don't know $\mathbf{x}_{\text{true}}$). The condition number is the bridge between the one you can see and the one you actually care about.

Why does the Hilbert matrix do this?

The deep reason is that the columns of $H_n$ are nearly linearly dependent. The Hilbert matrix is the matrix of inner products $\langle t^{i-1}, t^{j-1}\rangle = \int_0^1 t^{i-1}t^{j-1}\,dt = \frac{1}{i+j-1}$ of the monomials $1, t, t^2, \ldots$ on $[0,1]$ — and high-degree monomials look very similar on the unit interval ($t^9$ and $t^{10}$ are nearly the same curve there). Vectors that are nearly parallel make a matrix whose smallest singular value is nearly zero, which is exactly a huge condition number. This is the same near-collinearity that will sabotage least-squares fitting in Section 38.5 — and it is no coincidence, because polynomial curve-fitting is a Hilbert-like problem.

38.4 What is numerical stability, and why is it different from conditioning?

Intuition first. Conditioning is about the problem. Stability is about the algorithm. The cleanest way to feel the difference: imagine a well-conditioned problem — one where the true answer barely moves when you nudge the input. A good algorithm will return that answer accurately. But a bad algorithm can take that same forgiving problem and, through a poor sequence of operations, introduce so much rounding error that the answer comes out wrong anyway. The problem didn't betray you; your method did. Numerical stability is the property of an algorithm that it does not add more error than the problem inherently demands.

The gold standard is backward stability, an idea due largely to James Wilkinson in the 1960s [verify]. An algorithm for solving $A\mathbf{x}=\mathbf{b}$ is backward stable if the answer $\hat{\mathbf{x}}$ it returns is the exact solution of a slightly perturbed problem: $(A+\Delta A)\hat{\mathbf{x}} = \mathbf{b}$ with $\lVert\Delta A\rVert/\lVert A\rVert$ on the order of $\varepsilon_{\text{mach}}$. In words: the algorithm gives the right answer to nearly the right question. You then combine the two pieces — the backward error (from the algorithm, $\sim\varepsilon_{\text{mach}}$) and the conditioning (from the problem, $\kappa(A)$) — into the master estimate of numerical linear algebra:

$$ \underbrace{\frac{\lVert\hat{\mathbf{x}}-\mathbf{x}\rVert}{\lVert\mathbf{x}\rVert}}_{\text{forward error (what you get)}} \;\lesssim\; \underbrace{\kappa(A)}_{\text{conditioning of the problem}} \times \underbrace{\varepsilon_{\text{mach}}}_{\text{backward error of the algorithm}}. $$

This is the formula to memorize. A backward-stable algorithm contributes only $\varepsilon_{\text{mach}}$; the rest of the error is the problem's fault, unavoidable. An unstable algorithm contributes much more than $\varepsilon_{\text{mach}}$, manufacturing error the problem never asked for. The job of a numerical analyst is to use backward-stable algorithms so that the only error you suffer is the irreducible $\kappa(A)\varepsilon_{\text{mach}}$.

It helps to think of your sixteen double-precision digits as a budget that gets spent in two places. The problem's conditioning spends $\log_{10}\kappa(A)$ of them no matter what algorithm you use — that is the toll the problem charges at the door. A stable algorithm spends nothing more; whatever digits the conditioning leaves, you keep. An unstable algorithm reaches into the same budget and spends extra, leaving you with fewer correct digits than the problem alone would have cost. So for a problem with $\kappa(A) = 10^6$: conditioning takes 6 of your 16 digits, leaving 10. A stable solver hands you those 10. An unstable solver — say, one that needlessly squares the condition number by forming $A^{\mathsf{T}}A$ — takes a further 6, leaving only 4. Same problem, same machine, six digits of accuracy thrown away purely by algorithm choice. This is why the rest of the chapter cares so intensely about which algorithm computes a given quantity: the conditioning is fixed, but the stability is yours to lose.

The Key Insight — Forward error ≈ conditioning × backward error. You cannot fix a bad condition number — that is the problem's nature — but you can always choose a backward-stable algorithm so the backward error stays at $\varepsilon_{\text{mach}}$. Good numerical software is the relentless pursuit of backward stability.

Why we pivot in LU (Chapter 10)

You met LU decomposition in Chapter 10 and saw that we always pivot — swap rows so the largest available entry sits on the diagonal before each elimination step. Now you can see why, in numerical terms. Consider the tiny $2\times 2$ system

$$ \begin{bmatrix} \varepsilon & 1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \qquad \varepsilon = 10^{-20}. $$

This matrix is perfectly well-conditioned — $\kappa \approx 2.6$, not ill-conditioned at all — so a good algorithm should solve it accurately. The true answer is essentially $(1, 1)$. Now do naive Gaussian elimination without pivoting, using the tiny $\varepsilon$ as the pivot. The multiplier is $1/\varepsilon = 10^{20}$, an enormous number. Subtracting $10^{20}$ times row 1 from row 2 swamps the modest entries of row 2: $1 - 10^{20}\cdot 1$ rounds to $-10^{20}$, and the original "1" in that position is lost entirely to rounding. The computation forgets a number it needed, and the answer for $x_1$ comes out as $0$ instead of $1$ — completely wrong.

# A WELL-conditioned system that naive (no-pivot) elimination ruins.
import numpy as np
eps = 1e-20
A = np.array([[eps, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])

# naive elimination WITHOUT pivoting: multiplier m = 1/eps = 1e20
m  = 1.0 / eps
a22 = 1.0 - m*1.0          # 1 - 1e20  rounds to  -1e20  (the '1' is lost)
b2  = 2.0 - m*1.0
x2  = b2 / a22
x1  = (1.0 - 1.0*x2) / eps
print(np.array([x1, x2]))   # [0. 1.]   <-- x1 is WRONG (should be 1)

print(np.linalg.solve(A, b))  # [1. 1.]   numpy pivots, gets it right
print(np.linalg.cond(A))      # 2.618...  the PROBLEM was never ill-conditioned!

The system was well-conditioned; the algorithm (no-pivot LU) was unstable. Swapping the rows first — partial pivoting, putting the "1" on the diagonal — makes the multiplier $\varepsilon$ instead of $1/\varepsilon$, the tiny number now sits where it can do no harm, and the answer comes out right, as numpy's pivoted solver demonstrates. Pivoting is the cheap insurance that turns LU into a backward-stable algorithm. Without it, LU can fail on problems that have nothing wrong with them.

Common Pitfall — Catastrophic cancellation. The deepest source of algorithmic instability is subtracting two nearly-equal numbers. When $a \approx b$, the difference $a - b$ has very few significant digits — the leading digits cancel and only the noise in the trailing digits survives. Compute $(1-\cos x)/x^2$ at $x=10^{-8}$ in double precision and you get 0.0, though the true limit is $\tfrac12$: $\cos(10^{-8})$ rounds to exactly $1$, so $1-\cos x$ is computed as $0$. The fix is algebraic, not numerical — rewrite the formula to avoid the subtraction (here, use $1-\cos x = 2\sin^2(x/2)$). The classic case is the quadratic formula: for $x^2 + 10^8 x + 1 = 0$, computing the small root as $(-b+\sqrt{b^2-4ac})/2a$ cancels catastrophically and loses half the digits, while the mathematically equivalent $2c/(-b-\sqrt{b^2-4ac})$ recovers the small root exactly. Same math, different stability.

Math-Major Sidebar. The bound forward $\lesssim \kappa(A)\varepsilon_{\text{mach}}$ can be made rigorous. Wilkinson's analysis of Gaussian elimination with partial pivoting shows the computed solution satisfies $(A+\Delta A)\hat{\mathbf{x}}=\mathbf{b}$ with $\lVert\Delta A\rVert_\infty \le c_n\,\rho\,\varepsilon_{\text{mach}}\lVert A\rVert_\infty$, where $c_n$ is a low-degree polynomial in $n$ and $\rho$ is the growth factor — the ratio of the largest entry encountered during elimination to the largest entry of $A$. Partial pivoting keeps $\rho \le 2^{n-1}$ in the worst case (a bound that is essentially never approached in practice; typically $\rho$ is $O(\sqrt{n})$ or smaller). This is why pivoted LU is almost always backward stable, and why the rare matrices that trigger exponential growth (e.g., certain Wilkinson matrices) are a beloved exam question and a non-issue in real life. The honest statement is that partial pivoting is not provably backward stable in the worst case, but is overwhelmingly so in practice — one of numerical analysis's enduring "it works far better than we can prove" mysteries.

38.5 Why do QR and SVD beat the normal equations for least squares?

Intuition first. Back in Chapter 17 you solved linear regression — fitting a line or curve through noisy data — by the normal equations $A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$. Algebraically that is correct; it is the exact solution of the least-squares problem. But numerically it carries a hidden cost: forming the product $A^{\mathsf{T}}A$ squares the condition number. If $A$ is somewhat ill-conditioned, $A^{\mathsf{T}}A$ is catastrophically ill-conditioned, and the regression can blow up even though a stabler method handles the very same data without breaking a sweat. This is the most consequential stability lesson in applied linear algebra, because almost everyone fits a model at some point.

Here is the precise fact: for a tall matrix $A$ with full column rank, the condition number of the least-squares problem is governed by $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$, but the normal equations solve a system whose matrix is $A^{\mathsf{T}}A$, and

$$ \kappa(A^{\mathsf{T}}A) = \kappa(A)^2. $$

The reason is immediate from the SVD: if $A = U\Sigma V^{\mathsf{T}}$, then $A^{\mathsf{T}}A = V\Sigma^2 V^{\mathsf{T}}$, so the singular values of $A^{\mathsf{T}}A$ are the squares $\sigma_i^2$, and the ratio $\sigma_{\max}^2/\sigma_{\min}^2 = (\sigma_{\max}/\sigma_{\min})^2$. Squaring the condition number doubles the number of digits you lose. If $\kappa(A) = 10^8$ — bad but survivable — then $\kappa(A^{\mathsf{T}}A) = 10^{16}$, right at the edge of double-precision oblivion. This is the conditioning trap that Chapter 17 flagged and promised to explain here.

The cure is to never form $A^{\mathsf{T}}A$. The QR decomposition (Chapter 20) factors $A = QR$ with $Q$ having orthonormal columns and $R$ upper-triangular, and solves the least-squares problem via $R\hat{\mathbf{x}} = Q^{\mathsf{T}}\mathbf{b}$. Because $Q$ is orthogonal it has condition number $1$ — it neither stretches nor squashes — so the QR route works with the original $\kappa(A)$, never the squared version. The SVD (Chapter 30) goes further still and is the most robust of all, gracefully handling even rank-deficient $A$ via the pseudoinverse. Let us watch the difference on a deliberately ill-conditioned fit.

# Normal equations vs QR for an ill-conditioned least-squares fit.
import numpy as np
np.set_printoptions(precision=4)

m, d = 50, 11
x = np.linspace(0, 1, m)
A = np.vander(x, d+1, increasing=True)    # a 50x12 Vandermonde matrix (very ill-conditioned)
print(np.linalg.cond(A))                  # 1.1718e+08   kappa(A)
print(np.linalg.cond(A.T @ A))            # 1.3668e+16   kappa(A^T A) = kappa(A)^2 !

coef_true = np.ones(d+1)
rng = np.random.default_rng(0)
b = A @ coef_true + 1e-3 * rng.standard_normal(m)   # noisy data, known true coefficients

# Method 1: the normal equations  (A^T A) x = A^T b
coef_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Method 2: QR  ->  R x = Q^T b
Q, R = np.linalg.qr(A)
coef_qr = np.linalg.solve(R, Q.T @ b)

# Ground truth: numpy's SVD-based least-squares solver
coef_svd, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.linalg.norm(coef_normal - coef_svd))   # 6.4308   normal equations: WAY off
print(np.linalg.norm(coef_qr     - coef_svd))   # 1.51e-08 QR: essentially exact

The numbers tell the whole story. The Vandermonde matrix has $\kappa(A)\approx 1.17\times 10^8$ — bad, but the QR and SVD methods handle it to eight-plus digits. But $\kappa(A^{\mathsf{T}}A)\approx 1.37\times 10^{16}$, right at the precipice, and the normal-equations coefficients come out off by a norm of about 6.4 — utterly wrong, a fitted polynomial that bears no resemblance to the truth. The QR coefficients differ from the SVD ground truth by only $1.5\times 10^{-8}$. Same data, same problem, one method gives garbage and the other gives the right answer. The difference is entirely whether the algorithm formed $A^{\mathsf{T}}A$ and squared the conditioning.

Real-World Application — (Data science.) This is why no serious statistical or machine-learning library solves least squares by literally inverting $A^{\mathsf{T}}A$, even though the textbook formula $\hat{\mathbf{x}} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$ suggests it. scikit-learn's LinearRegression, numpy's lstsq, R's lm, and MATLAB's backslash all use QR or SVD under the hood precisely to dodge the squared condition number. When a regression with many correlated predictors returns wild, unstable coefficients that swing violently as you add a data point, the culprit is almost always near-collinearity inflating $\kappa(A)$ — and the fix is QR/SVD plus regularization (ridge regression nudges $\sigma_{\min}$ away from zero, which is to say it lowers the condition number on purpose). Case Study 1 walks through exactly such a blow-up.

Geometric Intuition — Forming $A^{\mathsf{T}}A$ is geometrically a projection onto the column space followed by re-expression in the (possibly skewed) basis of $A$'s columns. When two columns of $A$ point in nearly the same direction — the near-collinearity of the Hilbert and Vandermonde matrices — the Gram matrix $A^{\mathsf{T}}A$ has a near-zero eigenvalue, the squashed sliver again. QR sidesteps this by first orthonormalizing the columns (Gram–Schmidt, Chapter 20): it replaces the skewed near-parallel basis with a clean perpendicular one before doing any arithmetic, so no information hides in a near-degenerate direction.

38.6 How do iterative methods solve enormous systems?

Intuition first. Everything so far has been a direct method: pivoted LU, QR, SVD — algorithms that, in exact arithmetic, produce the answer in a finite, predetermined number of steps. Direct methods are wonderful for matrices up to a few thousand on a side. But the matrices of real science and engineering are often millions on a side — a finite-element model of a bridge, a discretized fluid flow, a web-scale graph — and they are almost always sparse: nearly all entries are zero. A direct method would need to store and factor a dense matrix with $10^{12}$ entries, which is impossible. Iterative methods take a different path: start with a guess and refine it, step by step, getting closer to the true solution each iteration, and stop when you are close enough. They never form the inverse and never even store a dense factor — they only need to multiply by $A$, which is cheap when $A$ is sparse.

The simplest iterative ideas split the matrix and bootstrap. Write $A = D + L + U$ (diagonal, strictly-lower, strictly-upper parts). The Jacobi method rearranges $A\mathbf{x}=\mathbf{b}$ into $\mathbf{x} = D^{-1}(\mathbf{b} - (L+U)\mathbf{x})$ and iterates: plug in your current guess on the right, get an improved guess on the left, repeat. The Gauss–Seidel method is the same idea but uses each freshly-updated component immediately within the same sweep instead of waiting for the next one — a small change that typically doubles the convergence rate.

# Jacobi vs Gauss-Seidel on a diagonally dominant system.
import numpy as np
A = np.array([[10.,-1., 2., 0.],
              [-1.,11.,-1., 3.],
              [ 2.,-1.,10.,-1.],
              [ 0., 3.,-1., 8.]])
b = np.array([6., 25., -11., 15.])
x_exact = np.linalg.solve(A, b)          # [ 1.  2. -1.  1.]

# Jacobi
Dinv = np.diag(1/np.diag(A)); R = A - np.diag(np.diag(A))
x = np.zeros(4)
for k in range(1, 26):
    x = Dinv @ (b - R @ x)
    if k in (5, 15, 25):
        print('Jacobi %2d: err=%.2e' % (k, np.linalg.norm(x - x_exact)))
# Jacobi  5: err=2.85e-02   Jacobi 15: err=5.39e-06   Jacobi 25: err=1.07e-09

# Gauss-Seidel (use updated components within the sweep)
x = np.zeros(4)
for k in range(1, 16):
    for i in range(4):
        s = A[i,:] @ x - A[i,i]*x[i]
        x[i] = (b[i] - s) / A[i,i]
    if k in (5, 15):
        print('Gauss-Seidel %2d: err=%.2e' % (k, np.linalg.norm(x - x_exact)))
# Gauss-Seidel  5: err=9.95e-05   Gauss-Seidel 15: err=3.41e-15

Both methods march steadily toward the true solution $(1, 2, -1, 1)$. Jacobi reaches an error of $\sim 10^{-9}$ in 25 sweeps; Gauss–Seidel reaches machine precision ($\sim 3\times 10^{-15}$) in just 15. Each sweep costs only a few matrix-vector products — and crucially, a matrix-vector product with a sparse matrix touches only the nonzero entries, so the cost scales with the number of nonzeros, not with $n^2$.

Warning

— Iterative splitting methods do not always converge. Jacobi and Gauss–Seidel converge for every starting guess if and only if the spectral radius (largest $|\lambda|$, Chapter 23) of the iteration matrix is less than $1$. A sufficient condition that is easy to check is strict diagonal dominance — each diagonal entry larger in magnitude than the sum of the other entries in its row. The matrix above is diagonally dominant (e.g., $|10| > |-1|+|2|+|0|$), and indeed its Jacobi iteration matrix has spectral radius $\approx 0.426 < 1$. Feed these methods a matrix that is not diagonally dominant and they may diverge to infinity — a reminder that an iterative method's convergence is itself a stability question.

Why eigenvalues decide the convergence rate

The connection to eigenvalues is not incidental — it is the whole story, and it ties this chapter straight back to the heart of the book. Every splitting method can be written in the form $\mathbf{x}^{(k+1)} = T\mathbf{x}^{(k)} + \mathbf{c}$ for some iteration matrix $T$ (for Jacobi, $T = -D^{-1}(L+U)$; for Gauss–Seidel, $T = -(D+L)^{-1}U$). Subtract the fixed point $\mathbf{x} = T\mathbf{x}+\mathbf{c}$ from both sides and the error $\mathbf{e}^{(k)} = \mathbf{x}^{(k)} - \mathbf{x}$ obeys the breathtakingly simple recurrence

$$ \mathbf{e}^{(k+1)} = T\,\mathbf{e}^{(k)}, \qquad\text{so}\qquad \mathbf{e}^{(k)} = T^k\,\mathbf{e}^{(0)}. $$

This is exactly the "powers of a matrix" situation you analyzed via diagonalization in Chapter 25 and the power method in Chapters 23 and 29. Whether $T^k\mathbf{e}^{(0)}\to\mathbf{0}$ — that is, whether the error dies — is governed entirely by the eigenvalues of $T$: if every $|\lambda_i| < 1$, the powers shrink to zero and the method converges; if any $|\lambda_i| > 1$, that eigendirection of the error grows and the iteration diverges. The spectral radius $\rho(T) = \max_i |\lambda_i|$ is the asymptotic error-reduction factor per step: each iteration multiplies the error by roughly $\rho(T)$. The smaller $\rho(T)$, the faster you converge.

This is precisely why Gauss–Seidel beats Jacobi on our system. Compute the two spectral radii and you find $\rho(T_{\text{Jacobi}})\approx 0.426$ but $\rho(T_{\text{Gauss–Seidel}})\approx 0.090$ — over five times smaller. To shrink the error by a factor of $10^6$ you need about $\log(10^{-6})/\log(0.426)\approx 16$ Jacobi sweeps but only $\log(10^{-6})/\log(0.090)\approx 6$ Gauss–Seidel sweeps, which matches the experiment above almost exactly. The eigenvalues that revealed the destiny of a dynamical system in Chapter 37 now reveal the speed of an iterative solver. The same idea drives modern accelerated methods (successive over-relaxation, multigrid, preconditioning): every one of them is, at heart, a scheme to make the iteration matrix's spectral radius as small as possible.

Real-World Application — (Graphics / video games.) Iterative solvers are the beating heart of real-time physics. A cloth, rope, or soft-body simulation in a game models the fabric as a mesh of point masses joined by stiff springs; enforcing all the spring constraints each frame is a large sparse linear system, and game engines solve it with a handful of Gauss–Seidel sweeps per frame — never a direct solve, because there is no time. (The popular "Position Based Dynamics" method is essentially Gauss–Seidel relaxation of the constraints.) The same is true of the pressure-projection step in real-time fluid simulation, which solves a sparse Poisson system by iterative methods every frame. Designers consciously trade iterations for frame rate: fewer sweeps means a springier, less rigid cloth but a faster game. The conditioning of the constraint matrix — driven by how stiff the springs are — is exactly what decides how many sweeps you need, linking this chapter's theory to the feel of the simulation in your hands. These techniques connect to the broader toolkit of numerical methods.

The power method and Krylov subspaces

You already know one iterative method intimately: the power method from Chapters 23 and 29, which finds the dominant eigenvector by repeatedly multiplying a starting vector by $A$ and renormalizing — the engine behind PageRank, where the matrix is the colossal, sparse web link graph. The power method is the spiritual ancestor of the most important class of large-scale solvers, the Krylov subspace methods.

The idea behind Krylov methods is gorgeous. Starting from $\mathbf{b}$, the only thing we can cheaply compute is repeated matrix-vector products: $\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, A^3\mathbf{b}, \ldots$ Their span, $\mathcal{K}_k = \operatorname{span}\{\mathbf{b}, A\mathbf{b}, \ldots, A^{k-1}\mathbf{b}\}$, is the Krylov subspace, and it turns out to be a remarkably good place to look for the solution of $A\mathbf{x}=\mathbf{b}$. The flagship method, the conjugate gradient (CG) algorithm — for symmetric positive-definite matrices (Chapter 28) — builds the best approximation in growing Krylov subspaces and converges astonishingly fast: in exact arithmetic it reaches the exact answer in at most $n$ steps, but in practice it gets close in far fewer, with the number of iterations governed by $\sqrt{\kappa(A)}$ rather than $\kappa(A)$ (a square-root improvement that, with good preconditioning, makes million-variable systems routine).

# Conjugate gradient on a large SPARSE symmetric positive-definite system.
import numpy as np, scipy.sparse as sp, scipy.sparse.linalg as spla
N = 500
main, off = 2*np.ones(N), -1*np.ones(N-1)
A = sp.diags([off, main, off], [-1, 0, 1]).tocsr()   # 1D Laplacian: tridiagonal, SPD
b = np.ones(N)
res = []
spla.cg(A, b, rtol=1e-10, callback=lambda xk: res.append(np.linalg.norm(b - A@xk)))
print(A.nnz, N*N)        # 1498 stored nonzeros vs 250000 in the dense version
print(len(res))          # 250 iterations to converge to rtol=1e-10
print(np.linalg.cond(A.toarray()))   # ~1.0e+05  (kappa governs the iteration count)

This $500\times 500$ Laplacian stores only $1498$ nonzeros — versus the $250{,}000$ entries of its dense form, a factor of 167 saved — and CG drives the residual to $10^{-10}$ in 250 iterations, each one nothing but a sparse matrix-vector product. Scale this matrix to $N = 10^7$ (a realistic 3D simulation) and a direct method is hopeless, while CG with a good preconditioner solves it in minutes. This is how the matrices of computational science actually get solved.

Real-World Application — (Engineering / physics.) Krylov methods are the backbone of large-scale simulation. Every finite-element structural analysis (will this bridge resonate?), every computational fluid dynamics solver (how does air flow over this wing?), every reservoir-flow model in petroleum engineering, and every implicit time-stepping scheme in climate modeling ultimately solves enormous sparse linear systems by CG, GMRES, or a relative — never by Gaussian elimination, which would exhaust the world's memory. Case Study 2 follows a structural simulation whose accuracy quietly decays as its stiffness matrix becomes ill-conditioned. The broader landscape of these techniques is surveyed in numerical methods.

38.7 What do LAPACK and numpy actually do under the hood?

Intuition first. When you call np.linalg.solve(A, b), you are not running Python — you are calling, through a thin wrapper, into LAPACK, the Linear Algebra PACKage, a body of meticulously engineered Fortran routines that has been the bedrock of numerical computing since 1992. Forty years of hard-won numerical wisdom — every pivoting strategy, every stability fix, every cache-aware block algorithm — is compiled into that library. Understanding what it does demystifies why your code is fast and why it is trustworthy.

When you call numpy's linear-algebra functions, here is roughly what dispatches underneath:

np.linalg.solve(A, b) → LAPACK's *gesv: pivoted LU decomposition (Chapter 10) followed by triangular solves. Backward stable thanks to partial pivoting.
np.linalg.lstsq(A, b) → LAPACK's *gelsd: an SVD-based least-squares solver (Chapter 30) that handles rank deficiency gracefully — not the normal equations.
np.linalg.qr(A) → LAPACK's *geqrf: QR via Householder reflections (Chapter 20), the backward-stable way to orthonormalize.
np.linalg.svd(A) → LAPACK's *gesdd: a divide-and-conquer SVD, among the most sophisticated algorithms in the library.
np.linalg.eig / eigh → LAPACK's *geev / *syev: the QR algorithm (Chapter 29) for eigenvalues, with eigh exploiting symmetry for speed and guaranteed real eigenvalues.

Two practical consequences follow. First, the heavy arithmetic happens in optimized, multi-threaded compiled code (often via a tuned BLAS like OpenBLAS or Intel MKL), which is why numpy on a million-element matrix multiply can be thousands of times faster than a Python loop. Second — and this is the deeper point of the whole chapter — the algorithm choice is already made for you, and it is the stable one. numpy does not invert $A^{\mathsf{T}}A$ for least squares; it does not skip pivoting in LU; it computes condition numbers via the SVD. The decades of numerical-stability research distilled in this chapter are baked into the libraries you call every day.

Never compute the inverse to solve a system

A corollary that trips up nearly every beginner: to solve $A\mathbf{x}=\mathbf{b}$, do not compute $A^{-1}$ and multiply. The mathematics says $\mathbf{x}=A^{-1}\mathbf{b}$, but the computation np.linalg.inv(A) @ b is both slower (forming the full inverse costs about three times the work of a single solve, and you usually need only one solve) and less accurate than np.linalg.solve(A, b). The inverse route does extra arithmetic — it effectively solves $n$ systems to build $A^{-1}$, then does a matrix-vector multiply — and each extra operation is another chance to accumulate rounding error.

# inv-and-multiply vs solve: same math, solve is faster AND more accurate.
import numpy as np
from scipy.linalg import hilbert
H = hilbert(10); x_true = np.ones(10); b = H @ x_true
x_solve = np.linalg.solve(H, b)
x_inv   = np.linalg.inv(H) @ b
print(np.linalg.norm(x_solve - x_true) / np.linalg.norm(x_true))  # 8.67e-05
print(np.linalg.norm(x_inv   - x_true) / np.linalg.norm(x_true))  # 2.85e-03  ~33x worse

On this Hilbert system the inverse-and-multiply answer is about 33 times less accurate than solve, for more work. The rule is absolute: if you find yourself computing a matrix inverse, stop and ask whether you really need the inverse itself, or just the solution of a system. Almost always it is the latter, and solve is the right tool. The explicit inverse is needed only in the rare cases where the matrix $A^{-1}$ is itself the object of interest (e.g., a covariance-to-precision-matrix conversion in statistics), and even then, factor once and reuse.

Computational Note — This is exactly why your from-scratch toolkit (the progressive project of this book) sometimes disagrees with numpy in the last few digits — and occasionally by a lot. Your hand-rolled gaussian_elimination from Chapter 4, if it skips pivoting, can be unstable on a matrix where LAPACK's pivoted *gesv is fine. Your svd_from_scratch from Chapter 30, built via the eigendecomposition of $A^{\mathsf{T}}A$, squares the condition number and so loses half the digits that LAPACK's direct bidiagonalization keeps. The lesson is not that your toolkit is wrong — it correctly implements the math — but that production numerical software is the math plus thirty years of stability engineering. That gap is the entire subject of this chapter.

Build Your Toolkit — Condition number and stability lab. Add three functions to toolkit/numerics.py. (1) condition_number(A) — compute $\kappa(A)=\sigma_{\max}/\sigma_{\min}$ from your own svd_from_scratch (Chapter 30); verify it matches np.linalg.cond(A). (2) hilbert_demo(n) — build the $n\times n$ Hilbert matrix, set $\mathbf{x}_{\text{true}}=\mathbf{1}$, form $\mathbf{b}=H\mathbf{x}_{\text{true}}$, solve, and report $\kappa(H_n)$, the relative error $\lVert\hat{\mathbf{x}}-\mathbf{x}_{\text{true}}\rVert/\lVert\mathbf{x}_{\text{true}}\rVert$, and the relative residual — watch the error grow with $n$ while the residual stays at $\varepsilon_{\text{mach}}$. (3) compare_lstsq(A, b) — solve the same least-squares problem two ways, by the normal equations $A^{\mathsf{T}}A\hat{\mathbf{x}}=A^{\mathsf{T}}\mathbf{b}$ and by your QR from Chapter 20, and report how far each lands from np.linalg.lstsq. On a Vandermonde or Hilbert-style $A$, the normal-equations answer will be visibly worse. Implement the linear-algebra core in pure Python; use numpy only to check. This is the chapter where your toolkit learns to tell you whether to trust its own answers.

38.8 Putting it together: a checklist for trustworthy numerical linear algebra

Intuition first. We have assembled a complete picture of what can go wrong and how to avoid it. The discipline of numerical linear algebra reduces, in daily practice, to a short list of habits — a checklist you can run before trusting any matrix computation. Internalize it and you will be in the top tier of practitioners who know when their answers are real.

Before you trust the solution of a numerical linear-algebra problem, ask:

What is the condition number? Compute np.linalg.cond(A). Compare it to $1/\varepsilon_{\text{mach}}\approx 4.5\times 10^{15}$. Expect to lose about $\log_{10}\kappa(A)$ of your $\sim 16$ digits. If $\kappa(A) > 10^{12}$, treat every answer with deep suspicion.
Am I using a stable algorithm? Prefer QR or SVD over the normal equations for least squares; trust the library's pivoted LU for square systems; never invert a matrix explicitly when you can solve a system instead (computing $A^{-1}$ and multiplying is both slower and less stable than solve).
Did I avoid catastrophic cancellation? Watch for subtractions of nearly-equal quantities, and reformulate algebraically when you find them.
Did I check the residual and respect the conditioning? A small residual is necessary but not sufficient; multiply the relative residual by $\kappa(A)$ to bound the true error.
For large sparse systems, am I exploiting sparsity? Use iterative Krylov methods (CG for SPD, GMRES otherwise) and a preconditioner; never densify a sparse matrix.
Never compare floats with ==. Use a tolerance.

Check Your Understanding — A colleague solves a $1000\times 1000$ system $A\mathbf{x}=\mathbf{b}$, reports a residual $\lVert A\hat{\mathbf{x}}-\mathbf{b}\rVert/\lVert\mathbf{b}\rVert \approx 10^{-15}$, and concludes the answer is accurate to ~15 digits. You check and find $\kappa(A)\approx 10^{10}$. How many digits of the answer can you actually trust, and why is the colleague's reasoning flawed?
Answer
About 5 digits, not 15. The relative error is bounded by $\kappa(A)$ times the relative residual: $10^{10}\times 10^{-15} = 10^{-5}$, so roughly 5 correct digits. The colleague confused the residual (how well $\hat{\mathbf{x}}$ satisfies the equation — small, because the algorithm was backward stable) with the error (how close $\hat{\mathbf{x}}$ is to the true $\mathbf{x}$ — much larger, because the problem is ill-conditioned). A small residual only certifies that you solved a nearby problem; the condition number decides whether the nearby problem has a nearby answer. The right report is "accurate to about 5 digits," and the honest move is to flag $\kappa(A)\approx 10^{10}$ explicitly.

Historical Note — The phrase "numerical linear algebra" as a named discipline owes much to the post-WWII explosion of digital computing and to figures like Alan Turing (who analyzed rounding error in matrix inversion in 1948), John von Neumann and Herman Goldstine (whose 1947 paper introduced the condition number for matrix inversion), and James Wilkinson (whose 1960s backward-error analysis gave the field its rigorous foundation) [verify]. The modern bibles — Golub & Van Loan's Matrix Computations, Trefethen & Bau's Numerical Linear Algebra, and Higham's Accuracy and Stability of Numerical Algorithms — are the inheritors of this lineage; see the Further Reading for this chapter.

The big picture

Step back and see what this chapter has done to the rest of the book. Every earlier chapter handed you an exact tool: solve a system, factor a matrix, find eigenvalues. This chapter has told you the price each tool pays on a real computer, and the insurance you buy with the right algorithm. The condition number $\kappa(A)=\sigma_{\max}/\sigma_{\min}$ — born from the SVD of Chapter 30 — is the gauge on the dashboard. Backward stability is the quality of the engine. Pivoting (Chapter 10), QR over normal equations (Chapters 17, 20), and SVD-based least squares (Chapter 30) are the stable engines. And iterative Krylov methods, descended from the power iteration of Chapters 23 and 29, are how we drive the systems too large for any direct method.

This is the chapter that closes Part VII's advanced arc by grounding the whole book back in reality: the proofs guarantee the right answer in exact arithmetic, but it takes numerical insight — conditioning, stability, the honest reckoning with finite precision — to get a trustworthy answer from a finite machine. It is also the chapter that retroactively justifies a hundred small choices made earlier in the book — why Chapter 10 insisted on pivoting, why Chapter 17 warned that the normal equations carry a hidden cost, why Chapter 30's SVD is called the most numerically robust factorization of all. Each of those was a deposit; this chapter is where the account is reconciled. The two themes of this part converge here. Computation validates theory and theory guides computation: the theory tells you the answer exists and is unique; the numerics tell you whether you can actually compute it; and the condition number is the precise quantity where those two worlds meet. You now know not just how to do linear algebra, but how to do it right — which is the difference between a formula and a result you can stake an engineering decision on.

In Part VIII you will put every piece of this together: the capstone (Chapter 39) drives your from-scratch toolkit on a real application, where the conditioning and stability ideas of this chapter become the difference between a demo that works and one that quietly produces nonsense — and Chapter 40 looks ahead to where linear algebra goes next.