> Learning paths. Math majors — read everything, especially the proof that Gram-Schmidt produces an orthonormal basis in §20.4, the existence-and-uniqueness statement of QR in §20.8, and the Math-Major Sidebar on the QR algorithm in §20.10. CS /...
Prerequisites
- chapter-19-orthogonal-projection
Learning Objectives
- Explain the Gram-Schmidt process geometrically as repeated orthogonal projection: subtract from each new vector its projection onto everything already built.
- Run Gram-Schmidt by hand on a small basis to produce an orthogonal, then orthonormal, set, and verify the result is orthonormal.
- State the QR factorization A = QR with its conditions (Q has orthonormal columns, R is upper-triangular with positive diagonal) and read R off as the matrix of dot products.
- Derive least squares via QR and explain why it avoids the ill-conditioned normal-equations matrix A-transpose A from Chapter 17.
- Describe how QR seeds an eigenvalue algorithm (the QR algorithm of Chapters 24 and 29).
- Distinguish classical from modified Gram-Schmidt and explain why the modified version is numerically stable, computing both against np.linalg.qr.
In This Chapter
- 20.1 What problem are we trying to solve?
- 20.2 What does Gram-Schmidt do, geometrically?
- 20.3 How do I run Gram-Schmidt by hand?
- 20.4 Why does the Gram-Schmidt process produce an orthonormal basis?
- 20.5 What is the QR decomposition, and where does it come from?
- 20.6 How do I read R off the Gram-Schmidt computation?
- 20.7 What does QR look like for a tall matrix?
- 20.8 Does QR agree with numpy, and what about the signs?
- 20.9 Why does QR give a more stable least squares than the normal equations?
- 20.10 What else is QR good for besides least squares?
- 20.11 What can go wrong numerically, and how does modified Gram-Schmidt fix it?
- 20.12 Bringing it together
Gram-Schmidt and QR Decomposition: Making Orthonormal Bases
Learning paths. Math majors — read everything, especially the proof that Gram-Schmidt produces an orthonormal basis in §20.4, the existence-and-uniqueness statement of QR in §20.8, and the Math-Major Sidebar on the QR algorithm in §20.10. CS / Data Science — focus on the Geometric Intuition callouts, the hand computation in §20.3, the numpy verifications, the stable-least-squares story in §20.9, and the numerical-stability Warning in §20.11. Physics / Engineering — focus on the geometry of stripping away shadows, the QR-as-rotation-into-axes picture, and the least-squares and signals applications. This chapter assumes the orthogonal projection of Chapter 19 (especially
project_onto) and the dot product and norm of Chapter 18.
20.1 What problem are we trying to solve?
In Chapter 18 you learned to measure length and angle in any dimension, and in Chapter 19 you learned the single most useful move that geometry offers: drop a perpendicular. Given a vector $\mathbf{b}$ and a subspace, the orthogonal projection of $\mathbf{b}$ onto that subspace is the closest point in it, and the leftover — the part of $\mathbf{b}$ that the subspace cannot explain — is exactly perpendicular to the subspace. That one idea solved least squares. This chapter takes the very same move and applies it repeatedly, and out of the repetition falls one of the most quietly important tools in all of scientific computing.
Here is the problem. You have a basis — some set of linearly independent vectors $\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_n$ that span a subspace. A basis is good: it lets you write every vector in the subspace as a unique combination of the $\mathbf{a}_i$. But a generic basis is awkward. The vectors point in arbitrary directions, at arbitrary angles to one another, with arbitrary lengths. To find the coordinates of a vector in such a basis you have to solve a linear system (Chapter 4), and if the basis vectors are nearly parallel, that system is nearly singular and the coordinates are unstable. The arbitrariness is the enemy.
What we want is an orthonormal basis: a basis whose vectors are mutually perpendicular and each of unit length. Orthonormal bases are the coordinate systems mathematics dreams about. Finding a coordinate in an orthonormal basis is just a dot product — no system to solve (you proved a version of this in Chapter 19, and we will see it again). Lengths and angles compute exactly as they do for the standard $\mathbf{e}_1, \mathbf{e}_2, \dots$ axes. And as you will see in Chapter 21, the matrix whose columns are an orthonormal basis is the easiest matrix in the world to invert, because its transpose is its inverse. The question is purely practical: given any basis, how do you manufacture an orthonormal one that spans the same space?
It helps to remember why perpendicular directions are so prized, the theme of all of Part IV. Orthogonal directions do not contaminate one another: the amount of a vector that lies along one of them tells you nothing about, and does not interfere with, the amount along any other. With a skewed basis, changing one coordinate of a vector forces all the others to readjust — the coordinates are entangled, because the basis directions overlap. With an orthonormal basis the coordinates are independent readings, each extracted by its own dot product, blind to the rest. That independence is the entire reason orthonormal coordinate systems make least squares, Fourier analysis, and principal component analysis so clean — and it is what Gram-Schmidt manufactures on demand.
The Key Insight — The answer, the Gram-Schmidt process, is not a new idea at all. It is the orthogonal projection of Chapter 19, applied over and over. Take your vectors one at a time; for each new vector, subtract its projection onto the space already spanned by the vectors you have orthogonalized so far. What remains is perpendicular to all of them. Normalize, and move on. Repeated projection — that is the whole secret.
Let's see the geometry before a single formula, because the picture makes the algebra inevitable.
20.2 What does Gram-Schmidt do, geometrically?
Stand in $\mathbb{R}^3$ and imagine two vectors, $\mathbf{a}_1$ and $\mathbf{a}_2$, that span a plane through the origin. They are independent but not perpendicular — say they meet at a sixty-degree angle. We want two perpendicular directions that span the same plane.
Keep the first vector as your anchor. Call its direction $\mathbf{q}_1$ (we will normalize at the end; for now think of building a direction). The first axis is just $\mathbf{a}_1$ itself, no work required — one vector cannot be "out of alignment" with nothing.
Now the second vector $\mathbf{a}_2$. It points partly along $\mathbf{a}_1$ and partly across it. The part along $\mathbf{a}_1$ is exactly the orthogonal projection of $\mathbf{a}_2$ onto the line through $\mathbf{a}_1$ — the shadow $\mathbf{a}_2$ casts on the first axis. That shadow is the redundant part: it is already spanned by $\mathbf{q}_1$, so it contributes nothing new to a perpendicular system. Strip it off. What remains, $$\mathbf{u}_2 = \mathbf{a}_2 - \operatorname{proj}_{\mathbf{q}_1}(\mathbf{a}_2),$$ is the part of $\mathbf{a}_2$ that points across $\mathbf{a}_1$ — and by the defining property of projection from Chapter 19, the leftover is perpendicular to what you projected onto. So $\mathbf{u}_2 \perp \mathbf{q}_1$. You now have two perpendicular directions, $\mathbf{q}_1$ and $\mathbf{u}_2$, spanning the same plane as $\mathbf{a}_1$ and $\mathbf{a}_2$. That is one full step of Gram-Schmidt.
Geometric Intuition — Every Gram-Schmidt step is a single act of subtraction: take the new vector, remove its shadow on everything you have already built, and keep what is left. The leftover sticks out perpendicular to the established directions, because "removing the projection" is precisely the operation that produces a perpendicular remainder (Chapter 19). The new vector's shadow was the part you already had; its anti-shadow is the genuinely new direction. Gram-Schmidt is shadow-removal, repeated.
The pattern continues into higher dimensions without changing in spirit. For a third vector $\mathbf{a}_3$, you subtract its shadow on the entire plane already spanned — which, because $\mathbf{q}_1$ and $\mathbf{u}_2$ are perpendicular, is just the sum of its shadows on each axis separately (this clean additivity is why we orthogonalize as we go). The remainder $\mathbf{u}_3$ sticks out perpendicular to the plane, into the third dimension. Each new vector contributes exactly one new perpendicular direction, after its overlap with all the previous directions is stripped away.
Geometric Intuition — Why subtract the shadow on each previous axis separately and add the results? Because once your earlier directions are mutually orthogonal, projecting onto the subspace they span decomposes into independent one-dimensional projections — there is no cross-contamination between perpendicular axes (this is the "non-interfering directions" theme of Part IV). Projecting onto a plane spanned by two perpendicular vectors equals projecting onto each line and adding. That additivity is exactly the gift orthogonality keeps giving, and it is why Gram-Schmidt orthogonalizes incrementally: each step makes the next step's projection a simple sum.
There is one last cosmetic step. The directions $\mathbf{q}_1, \mathbf{u}_2, \mathbf{u}_3$ are mutually perpendicular but not unit length. Normalizing — dividing each by its own length — turns an orthogonal set into an orthonormal one. We typically fold normalization into each step (normalize $\mathbf{u}_1$ to get $\mathbf{q}_1$ before projecting onto it), which makes every projection formula simpler, since projecting onto a unit vector $\mathbf{q}$ is just $(\mathbf{q}\cdot\mathbf{a})\,\mathbf{q}$ with no division. We will write the process both ways and reconcile them.
There is a way to see this whole process as a transformation of space, which connects it to the book's running visual anchor. Using the 2D transformation visualizer from Chapter 1, think of the original basis $\mathbf{a}_1, \mathbf{a}_2$ as the images of the standard basis vectors under some matrix $A$ — they form a slanted, possibly stretched parallelogram instead of the unit square. Gram-Schmidt asks: what rigid frame (perpendicular unit axes) spans the same parallelogram's edges in sequence? The answer $Q$ is the "un-slanted" version, a clean set of right-angled unit axes, and the matrix $R$ records exactly how much slant and stretch you removed. In Chapter 21 we will see that $Q$ is an orthogonal matrix — a pure rotation or reflection, the most rigid transformation there is — so the factorization $A = QR$ literally separates $A$ into a rigid rotation $Q$ followed by a shear-and-scale $R$. The visualizer's picture of "what does this matrix do to the unit square?" becomes "rotate the square into place ($Q$), then skew and stretch it ($R$)."
Why insist on building the directions in order, one at a time, rather than perpendicularizing them all at once? Because order is what makes the bookkeeping triangular, and triangular is what makes everything downstream cheap. Each vector only ever has to reckon with the directions that came before it; it never looks ahead. That "no looking ahead" discipline is precisely why the rebuild matrix $R$ will have zeros below its diagonal, and why solving with $R$ is a one-pass back substitution rather than a full elimination. The sequential structure is not an arbitrary choice of presentation — it is the source of the computational payoff.
20.3 How do I run Gram-Schmidt by hand?
Let's make all of this concrete with a worked example you can follow with a pencil. We will orthonormalize three vectors in $\mathbb{R}^3$, chosen so the arithmetic stays clean and every number is checkable. Take $$\mathbf{a}_1 = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \qquad \mathbf{a}_2 = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}, \qquad \mathbf{a}_3 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}.$$ These three are linearly independent (you can check the determinant of the matrix with these columns is $-2 \neq 0$, using Chapter 11), so they form a basis for $\mathbb{R}^3$. We will turn them into an orthonormal basis $\mathbf{q}_1, \mathbf{q}_2, \mathbf{q}_3$.
Step 1 — the first direction. No projection is needed; the first vector simply gets normalized. Its length is $\lVert \mathbf{a}_1 \rVert = \sqrt{1^2 + 1^2 + 0^2} = \sqrt{2}$, so $$\mathbf{q}_1 = \frac{\mathbf{a}_1}{\lVert \mathbf{a}_1 \rVert} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{bmatrix}.$$ Check: $\mathbf{q}_1 \cdot \mathbf{q}_1 = \tfrac12 + \tfrac12 + 0 = 1$. Unit length, as required.
Step 2 — strip $\mathbf{a}_2$'s shadow on $\mathbf{q}_1$. First compute how much of $\mathbf{a}_2$ lies along $\mathbf{q}_1$. Because $\mathbf{q}_1$ is already a unit vector, the projection of $\mathbf{a}_2$ onto it is $(\mathbf{q}_1 \cdot \mathbf{a}_2)\,\mathbf{q}_1$. The dot product is $$\mathbf{q}_1 \cdot \mathbf{a}_2 = \tfrac{1}{\sqrt 2}\cdot 1 + \tfrac{1}{\sqrt 2}\cdot 0 + 0 \cdot 1 = \tfrac{1}{\sqrt 2}.$$ So subtract that shadow from $\mathbf{a}_2$: $$\mathbf{u}_2 = \mathbf{a}_2 - (\mathbf{q}_1 \cdot \mathbf{a}_2)\,\mathbf{q}_1 = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} - \tfrac{1}{\sqrt 2}\cdot\tfrac{1}{\sqrt 2}\begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} - \tfrac12\begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1/2 \\ -1/2 \\ 1 \end{bmatrix}.$$ Confirm perpendicularity before normalizing: $\mathbf{q}_1 \cdot \mathbf{u}_2 = \tfrac{1}{\sqrt 2}\cdot\tfrac12 + \tfrac{1}{\sqrt 2}\cdot(-\tfrac12) + 0 = 0$. The leftover is perpendicular to the first axis, exactly as the geometry promised. Now normalize. Its length is $\lVert \mathbf{u}_2 \rVert = \sqrt{\tfrac14 + \tfrac14 + 1} = \sqrt{\tfrac32} = \tfrac{\sqrt 6}{2}$, so $$\mathbf{q}_2 = \frac{\mathbf{u}_2}{\lVert \mathbf{u}_2 \rVert} = \frac{2}{\sqrt 6}\begin{bmatrix} 1/2 \\ -1/2 \\ 1 \end{bmatrix} = \begin{bmatrix} 1/\sqrt 6 \\ -1/\sqrt 6 \\ 2/\sqrt 6 \end{bmatrix}.$$
Step 3 — strip $\mathbf{a}_3$'s shadow on both $\mathbf{q}_1$ and $\mathbf{q}_2$. Now $\mathbf{a}_3$ casts a shadow on each of the two established axes, and we subtract both. The two dot products are $$\mathbf{q}_1 \cdot \mathbf{a}_3 = \tfrac{1}{\sqrt 2}\cdot 0 + \tfrac{1}{\sqrt 2}\cdot 1 + 0\cdot 1 = \tfrac{1}{\sqrt 2}, \qquad \mathbf{q}_2 \cdot \mathbf{a}_3 = \tfrac{1}{\sqrt 6}\cdot 0 - \tfrac{1}{\sqrt 6}\cdot 1 + \tfrac{2}{\sqrt 6}\cdot 1 = \tfrac{1}{\sqrt 6}.$$ Subtract both shadows: $$\mathbf{u}_3 = \mathbf{a}_3 - (\mathbf{q}_1 \cdot \mathbf{a}_3)\,\mathbf{q}_1 - (\mathbf{q}_2 \cdot \mathbf{a}_3)\,\mathbf{q}_2.$$ The first shadow is $\tfrac{1}{\sqrt 2}\cdot\tfrac{1}{\sqrt 2}(1,1,0) = \tfrac12(1,1,0) = (\tfrac12, \tfrac12, 0)$. The second shadow is $\tfrac{1}{\sqrt 6}\cdot\tfrac{1}{\sqrt 6}(1,-1,2) = \tfrac16(1,-1,2) = (\tfrac16, -\tfrac16, \tfrac13)$. Therefore $$\mathbf{u}_3 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1/2 \\ 1/2 \\ 0 \end{bmatrix} - \begin{bmatrix} 1/6 \\ -1/6 \\ 1/3 \end{bmatrix} = \begin{bmatrix} 0 - \tfrac12 - \tfrac16 \\ 1 - \tfrac12 + \tfrac16 \\ 1 - 0 - \tfrac13 \end{bmatrix} = \begin{bmatrix} -2/3 \\ 2/3 \\ 2/3 \end{bmatrix}.$$ Check both perpendicularities: $\mathbf{q}_1 \cdot \mathbf{u}_3 = \tfrac{1}{\sqrt 2}(-\tfrac23) + \tfrac{1}{\sqrt 2}(\tfrac23) + 0 = 0$, and $\mathbf{q}_2 \cdot \mathbf{u}_3 = \tfrac{1}{\sqrt 6}(-\tfrac23) - \tfrac{1}{\sqrt 6}(\tfrac23) + \tfrac{2}{\sqrt 6}(\tfrac23) = \tfrac{1}{\sqrt 6}(-\tfrac23 - \tfrac23 + \tfrac43) = 0$. Both zero. Finally normalize: $\lVert \mathbf{u}_3 \rVert = \sqrt{\tfrac49 + \tfrac49 + \tfrac49} = \sqrt{\tfrac{12}{9}} = \tfrac{2}{\sqrt 3}$, so $$\mathbf{q}_3 = \frac{\mathbf{u}_3}{\lVert \mathbf{u}_3 \rVert} = \frac{\sqrt 3}{2}\begin{bmatrix} -2/3 \\ 2/3 \\ 2/3 \end{bmatrix} = \begin{bmatrix} -1/\sqrt 3 \\ 1/\sqrt 3 \\ 1/\sqrt 3 \end{bmatrix}.$$
We have produced an orthonormal basis $$\mathbf{q}_1 = \tfrac{1}{\sqrt 2}\!\begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \quad \mathbf{q}_2 = \tfrac{1}{\sqrt 6}\!\begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix}, \quad \mathbf{q}_3 = \tfrac{1}{\sqrt 3}\!\begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix}.$$ Three mutually perpendicular unit vectors, spanning the same $\mathbb{R}^3$ as the originals. That is Gram-Schmidt in full.
Common Pitfall — Projecting onto the original vectors instead of the orthogonalized ones. When you compute $\mathbf{u}_3$, you subtract its projection onto $\mathbf{q}_1$ and $\mathbf{q}_2$ — the already-orthogonalized directions — not onto the raw $\mathbf{a}_1$ and $\mathbf{a}_2$. If you project onto the non-orthogonal originals, the "subtract each shadow separately and add" shortcut breaks, because the shortcut relies on the previous directions being mutually perpendicular. Always project onto the $\mathbf{q}$'s you have built, never onto the $\mathbf{a}$'s you started with. (Equivalently, you may project $\mathbf{a}_3$ onto the whole subspace at once with the full projection-matrix formula of Chapter 19 — but the per-axis subtraction is only valid for an orthonormal set.)
# Gram-Schmidt by hand, confirmed: orthonormalize three vectors in R^3.
import numpy as np
A = np.array([[1.0, 1.0, 0.0], # a1, a2, a3 as COLUMNS
[1.0, 0.0, 1.0],
[0.0, 1.0, 1.0]])
Q = np.zeros_like(A)
for j in range(A.shape[1]): # math column j+1; numpy column j (0-indexed)
v = A[:, j].copy()
for i in range(j): # subtract shadows on already-built q_i
v -= (Q[:, i] @ A[:, j]) * Q[:, i]
Q[:, j] = v / np.linalg.norm(v) # normalize
print(np.round(Q, 4))
# columns: q1=(.7071,.7071,0), q2=(.4082,-.4082,.8165), q3=(-.5774,.5774,.5774)
print("Q^T Q =\n", np.round(Q.T @ Q, 10)) # the 3x3 identity -> orthonormal
The printed columns are $\mathbf{q}_1 \approx (0.7071, 0.7071, 0)$, $\mathbf{q}_2 \approx (0.4082, -0.4082, 0.8165)$, and $\mathbf{q}_3 \approx (-0.5774, 0.5774, 0.5774)$ — exactly $1/\sqrt 2 \approx 0.7071$, $1/\sqrt 6 \approx 0.4082$, $2/\sqrt 6 \approx 0.8165$, and $1/\sqrt 3 \approx 0.5774$ from the hand computation. The decisive check is the last line: $Q^{\mathsf{T}}Q$ comes back as the $3\times 3$ identity, which says every column has unit length (the diagonal $1$'s) and every pair of distinct columns is perpendicular (the off-diagonal $0$'s). That single matrix equation, $Q^{\mathsf{T}}Q = I$, is the definition of orthonormal columns, and it is the property we will exploit relentlessly.
20.4 Why does the Gram-Schmidt process produce an orthonormal basis?
We have seen it work; now let's prove it always does, because a worked example is not a theorem. The proof is short and follows the geometry exactly, which is the best kind.
Why we care. We are about to build the entire QR factorization on the claim that Gram-Schmidt turns any basis into an orthonormal one spanning the same space. If that claim has exceptions, every downstream use — stable least squares, the QR algorithm — inherits them. So we pin down precisely when it works and prove that it does.
Key idea. Induction on the step number. At each step, "subtract the projection onto everything already built" produces a remainder orthogonal to all previous directions (the projection theorem of Chapter 19), and that remainder is nonzero precisely because the original vectors are independent.
Theorem (Gram-Schmidt). Let $\mathbf{a}_1, \dots, \mathbf{a}_n$ be linearly independent vectors in $\mathbb{R}^m$. Define recursively $$\mathbf{u}_k = \mathbf{a}_k - \sum_{i=1}^{k-1} (\mathbf{q}_i \cdot \mathbf{a}_k)\,\mathbf{q}_i, \qquad \mathbf{q}_k = \frac{\mathbf{u}_k}{\lVert \mathbf{u}_k \rVert} \quad (k = 1, \dots, n).$$ Then $\mathbf{q}_1, \dots, \mathbf{q}_n$ is an orthonormal set, and for every $k$, $\operatorname{span}\{\mathbf{q}_1, \dots, \mathbf{q}_k\} = \operatorname{span}\{\mathbf{a}_1, \dots, \mathbf{a}_k\}$.
Proof. We induct on $k$. For $k = 1$, $\mathbf{u}_1 = \mathbf{a}_1$, which is nonzero (an independent set contains no zero vector), so $\mathbf{q}_1 = \mathbf{a}_1/\lVert\mathbf{a}_1\rVert$ is a unit vector and obviously spans the same line as $\mathbf{a}_1$.
Assume that $\mathbf{q}_1, \dots, \mathbf{q}_{k-1}$ are orthonormal and span the same space as $\mathbf{a}_1, \dots, \mathbf{a}_{k-1}$. Consider $\mathbf{u}_k = \mathbf{a}_k - \sum_{i Finally the span. By construction $\mathbf{q}_k$ is a combination of $\mathbf{a}_k$ and the earlier $\mathbf{q}_i$ (hence of $\mathbf{a}_1, \dots, \mathbf{a}_k$), and conversely $\mathbf{a}_k = \lVert\mathbf{u}_k\rVert\,\mathbf{q}_k + \sum_{i What this means. The condition is exactly linear independence — nothing more, nothing less. Feed Gram-Schmidt an independent set and it returns an orthonormal basis for the same span. Feed it a dependent set and the process announces the dependence by producing $\mathbf{u}_k = \mathbf{0}$ at the first redundant vector (you cannot normalize the zero vector). That failure is informative, not catastrophic: it pinpoints which vector was already spanned. Numerically it shows up as a $\mathbf{u}_k$ of tiny length, which is the first hint of the stability concerns we tackle in §20.11. Notice how lightly the proof leaned on machinery — it used essentially one result, the projection theorem of Chapter 19, applied at every step. The single fact "a vector minus its projection onto a subspace is orthogonal to that subspace" did all the work; induction merely applied it $n$ times and tracked the spans. That economy is worth savoring, because it confirms the chapter's thesis literally: Gram-Schmidt contains no genuinely new mathematics beyond Chapter 19. It is the projection theorem, iterated, with normalization for tidiness. When a later chapter invokes "an orthonormal basis exists," this short proof is the reason you are entitled to believe it — and the reason you could produce one if asked, not merely assert that it exists somewhere. There is also a subtle point about the order of the input vectors that the proof quietly settles. Gram-Schmidt's output depends on the order you feed the vectors: orthogonalizing $\mathbf{a}_1$ then $\mathbf{a}_2$ gives a different orthonormal pair than $\mathbf{a}_2$ then $\mathbf{a}_1$, because the first vector is always kept as-is and the rest are bent around it. Both outputs are valid orthonormal bases for the same span — the span never depends on order — but the individual axes differ. This is the same freedom that made QR non-unique up to signs in §20.8; pinning down the input order and the sign convention together is what makes the factorization canonical. — Gram-Schmidt requires a linearly independent input. If the vectors are dependent, some $\mathbf{u}_k$ is exactly $\mathbf{0}$ and the normalization step divides by zero. Do not blindly run the recursion on an arbitrary spanning set and expect an orthonormal basis of the right size — first ensure independence (or detect and skip the zero remainders, which yields an orthonormal basis of the span with fewer vectors than you started with). Stating the condition is not pedantry: the theorem is false without it. There is a beautiful corollary hiding in this theorem, and it is worth stating on its own because it answers a question you may not have realized was open. Back in Chapter 15 you learned that every subspace has a basis — but is there always an orthonormal one? The answer, now, is an emphatic yes: take any basis (Chapter 15 guarantees one exists), run Gram-Schmidt, and out comes an orthonormal basis for the same subspace. So every finite-dimensional subspace of $\mathbb{R}^m$ has an orthonormal basis — the convenient coordinate systems are not a lucky special case, they are always available. This is not obvious from the definitions alone; it is Gram-Schmidt that makes it true, by giving a constructive recipe rather than a mere existence claim. The recipe matters: it means whenever a proof or an algorithm would be cleaner with an orthonormal basis, you are entitled to assume one, because you know how to build it. This existence guarantee is quietly load-bearing for the rest of the book. The Spectral Theorem of Chapter 27 will assert that a symmetric matrix has an orthonormal basis of eigenvectors; the singular value decomposition of Chapter 30 will assert that every matrix has orthonormal bases on both ends. Both of those rest on the fact, established right here, that orthonormal bases are not exotic — they can always be manufactured from whatever basis you happen to have. Gram-Schmidt is the workhorse that quietly supplies them. Math-Major Sidebar (optional) — The same construction works in any inner product space. Nothing in the proof of §20.4 used that our vectors were lists of real numbers; it used only the dot product's defining properties — linearity, symmetry, and positivity ($\mathbf{v}\cdot\mathbf{v} > 0$ for $\mathbf{v}\neq\mathbf{0}$). So Gram-Schmidt runs verbatim in any inner product space (Chapter 34), including spaces of functions, where the "dot product" is an integral $\langle f, g\rangle = \int f g$. Apply it to the powers $1, x, x^2, x^3, \dots$ on the interval $[-1, 1]$ and you generate the Legendre polynomials; apply it with a different weight and you get the Chebyshev or Hermite families. These orthogonal polynomials are the backbone of numerical integration and approximation theory. The humble "subtract the shadow" step, run on functions instead of arrows, organizes a whole corner of mathematical physics — a preview of the function-space viewpoint that makes Chapter 22's Fourier series possible. Now comes the move that turns a procedure into a matrix factorization, and it requires no new work — only careful bookkeeping of what we already did. Look back at the rearranged formula from the proof. For each original vector,
$$\mathbf{a}_k = \lVert\mathbf{u}_k\rVert\,\mathbf{q}_k + \sum_{i=1}^{k-1} (\mathbf{q}_i \cdot \mathbf{a}_k)\,\mathbf{q}_i.$$
Read it carefully: every original vector $\mathbf{a}_k$ is written as a combination of the orthonormal vectors $\mathbf{q}_1, \dots, \mathbf{q}_k$ — and crucially, only of $\mathbf{q}_1$ through $\mathbf{q}_k$, never a $\mathbf{q}$ with a higher index. The first vector $\mathbf{a}_1$ uses only $\mathbf{q}_1$; the second uses $\mathbf{q}_1$ and $\mathbf{q}_2$; the $k$-th uses $\mathbf{q}_1$ through $\mathbf{q}_k$. That "only the earlier ones" structure is the seed of triangularity. Stack the $\mathbf{a}_k$ as the columns of a matrix $A$ (so $A$ is $m \times n$), and the $\mathbf{q}_k$ as the columns of a matrix $Q$ (also $m \times n$, with orthonormal columns). Then the collection of equations above is one matrix equation:
$$A = QR,$$
where $R$ is the $n \times n$ matrix that records the coefficients. Its entries are the dot products and lengths we computed: the $(i,k)$ entry of $R$ is $\mathbf{q}_i \cdot \mathbf{a}_k$ for $i < k$, the diagonal entry $(k,k)$ is $\lVert\mathbf{u}_k\rVert$, and everything below the diagonal is $0$ (because $\mathbf{a}_k$ never uses a $\mathbf{q}$ with index above $k$). A matrix with zeros below the diagonal is upper-triangular. So: The Key Insight — The QR decomposition factors a matrix $A$ with independent columns as $A = QR$, where $Q$ has orthonormal columns ($Q^{\mathsf{T}}Q = I$) and $R$ is upper-triangular. It is Gram-Schmidt written in matrix form: $Q$ holds the orthonormal directions Gram-Schmidt produces, and $R$ holds the bookkeeping — the dot products and lengths that rebuild the originals. The columns of $A$ and $Q$ span the same space; $R$ is the change-of-coordinates recipe between them. The triangularity is not a happy accident — it is the matrix shadow of the sequential nature of Gram-Schmidt. Because we built the orthonormal vectors one at a time, and each original vector only needed the directions available so far, the rebuild matrix can only reach "backward and up." That is exactly the upper-triangular pattern, and it is what makes QR so computationally pleasant: solving a system with a triangular matrix is trivial back substitution (Chapter 4), no elimination required. Common Pitfall — Confusing the shape of $R$ with the shape of $A$. For an $m \times n$ matrix $A$ with $m \geq n$ (tall or square), the reduced QR gives an $m \times n$ matrix $Q$ (orthonormal columns) and a square $n \times n$ upper-triangular $R$ — this is the Gram-Schmidt output, and it is what you almost always want. There is also a full QR, $Q$ an $m \times m$ square orthogonal matrix and $R$ an $m \times n$ matrix with an upper-triangular top block and zero rows beneath. numpy's Let's build $R$ explicitly for the running example so the abstract description becomes a concrete matrix. We have all the pieces already; we just collect them. The diagonal of $R$ holds the lengths $\lVert\mathbf{u}_k\rVert$ we divided by when normalizing:
$$R_{11} = \lVert\mathbf{u}_1\rVert = \sqrt 2, \quad R_{22} = \lVert\mathbf{u}_2\rVert = \tfrac{\sqrt 6}{2}, \quad R_{33} = \lVert\mathbf{u}_3\rVert = \tfrac{2}{\sqrt 3}.$$
The above-diagonal entries hold the dot products $\mathbf{q}_i \cdot \mathbf{a}_k$ for $i < k$ that we computed when stripping shadows:
$$R_{12} = \mathbf{q}_1 \cdot \mathbf{a}_2 = \tfrac{1}{\sqrt 2}, \quad R_{13} = \mathbf{q}_1 \cdot \mathbf{a}_3 = \tfrac{1}{\sqrt 2}, \quad R_{23} = \mathbf{q}_2 \cdot \mathbf{a}_3 = \tfrac{1}{\sqrt 6}.$$
Everything below the diagonal is zero. So
$$R = \begin{bmatrix} \sqrt 2 & 1/\sqrt 2 & 1/\sqrt 2 \\ 0 & \sqrt 6/2 & 1/\sqrt 6 \\ 0 & 0 & 2/\sqrt 3 \end{bmatrix} \approx \begin{bmatrix} 1.414 & 0.707 & 0.707 \\ 0 & 1.225 & 0.408 \\ 0 & 0 & 1.155 \end{bmatrix}.$$
Notice we computed no new numbers — every entry of $R$ is a length or a dot product that already appeared in §20.3. That is the sense in which QR is "free" once you have run Gram-Schmidt: the factorization is the byproduct, neatly arranged. It is worth pausing on what each entry of $R$ means, because the matrix is more than a container of leftover arithmetic. The diagonal entry $R_{kk} = \lVert\mathbf{u}_k\rVert$ measures how much genuinely new length the $k$-th vector contributed after its overlap with all the earlier directions was stripped away — it is the "height" of $\mathbf{a}_k$ above the subspace already built. A small diagonal entry is a warning sign: it means $\mathbf{a}_k$ was nearly in the span of its predecessors, i.e. the original vectors were close to dependent, which is exactly the ill-conditioning that §20.11 will show wrecks naive computation. The above-diagonal entry $R_{ik} = \mathbf{q}_i \cdot \mathbf{a}_k$ measures how much of $\mathbf{a}_k$ pointed along the established direction $\mathbf{q}_i$ — the size of the shadow you removed. So $R$ is a complete ledger of the orthogonalization: its diagonal records what was new, its off-diagonal records what was redundant, and together they reconstruct the original $A$ exactly. Reading conditioning off the diagonal of $R$ is a habit worth forming now; it returns in force in Chapter 38. Check Your Understanding — Without multiplying it all out, what should the product of the diagonal entries of $R$ equal, in terms of the original matrix $A$? (Hint: $\det(A) = \det(Q)\det(R)$, and think about what $\det(Q)$ can be for orthonormal columns.) The diagonal product is $R_{11}R_{22}R_{33} = \sqrt 2 \cdot \tfrac{\sqrt 6}{2} \cdot \tfrac{2}{\sqrt 3} = \tfrac{\sqrt 2 \cdot \sqrt 6 \cdot 2}{2\sqrt 3} = \tfrac{2\sqrt{12}}{2\sqrt 3} = \sqrt{12}/\sqrt 3 = \sqrt 4 = 2$. Since $R$ is triangular, its determinant is the product of its diagonal, so $\det(R) = 2$. For a square $Q$ with orthonormal columns, $\det(Q) = \pm 1$ (Chapter 21), so $|\det(A)| = |\det(Q)|\,|\det(R)| = |\det(R)| = 2$ — matching $|\det(A)| = |{-2}| = 2$ from the start of §20.3. The diagonal of $R$ literally carries the volume-scaling information of $A$. Let's confirm the whole factorization in code and reproduce $A$ exactly. The printed The first worked example was square, which is the gentlest case. But the situations where QR earns its keep — least squares, regression, overdetermined systems — are all tall: more rows than columns, $m > n$. Let's run Gram-Schmidt on a tall matrix end to end, both to cement the procedure and because the result feeds directly into the stable-least-squares story of §20.9. The geometry is identical; only the ambient dimension grows. Take the $4 \times 2$ matrix whose first column is all ones and whose second column is the numbers $0, 1, 2, 3$:
$$A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \qquad \mathbf{a}_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \quad \mathbf{a}_2 = \begin{bmatrix} 0 \\ 1 \\ 2 \\ 3 \end{bmatrix}.$$
This is no random matrix: it is the design matrix for fitting a straight line $y = c_0 + c_1 t$ to four data points at $t = 0, 1, 2, 3$ — the intercept multiplies the ones-column, the slope multiplies the $t$-column. We met exactly this matrix in the four-subspaces discussion of Chapter 14, and we will fit a line with it in §20.8. Its two columns are plainly independent (one is constant, the other increasing), so Gram-Schmidt applies. The vectors live in $\mathbb{R}^4$ now, but the recipe does not care about the dimension. Step 1. Normalize $\mathbf{a}_1$. Its length is $\lVert\mathbf{a}_1\rVert = \sqrt{1+1+1+1} = 2$, so
$$\mathbf{q}_1 = \tfrac12\,(1, 1, 1, 1) = (\tfrac12, \tfrac12, \tfrac12, \tfrac12).$$
Geometrically, $\mathbf{q}_1$ is the "average" direction — the unit vector along which every coordinate is equal. Projecting any vector onto it picks out that vector's mean, a fact that will reappear as the centering step in PCA (Chapter 32). Step 2. Strip $\mathbf{a}_2$'s shadow on $\mathbf{q}_1$. The dot product is
$$\mathbf{q}_1 \cdot \mathbf{a}_2 = \tfrac12(0 + 1 + 2 + 3) = \tfrac12 \cdot 6 = 3.$$
(That is $\tfrac12$ times the sum of the $t$-values — projecting onto the all-ones direction really does extract a multiple of the mean.) Subtract the shadow:
$$\mathbf{u}_2 = \mathbf{a}_2 - 3\,\mathbf{q}_1 = (0,1,2,3) - 3(\tfrac12,\tfrac12,\tfrac12,\tfrac12) = (0 - \tfrac32,\; 1 - \tfrac32,\; 2 - \tfrac32,\; 3 - \tfrac32) = (-\tfrac32, -\tfrac12, \tfrac12, \tfrac32).$$
Look what happened: subtracting the shadow on the all-ones direction centered $\mathbf{a}_2$ — the entries $-\tfrac32, -\tfrac12, \tfrac12, \tfrac32$ are the original $t$-values $0,1,2,3$ with their mean $\tfrac32$ removed, so they now sum to zero. That is exactly what "orthogonal to the constant direction" means, and it is no coincidence that centering data and orthogonalizing against the ones-vector are the same operation. Normalize: $\lVert\mathbf{u}_2\rVert = \sqrt{\tfrac94 + \tfrac14 + \tfrac14 + \tfrac94} = \sqrt{5}$, so
$$\mathbf{q}_2 = \tfrac{1}{\sqrt 5}(-\tfrac32, -\tfrac12, \tfrac12, \tfrac32) = \tfrac{1}{2\sqrt 5}(-3, -1, 1, 3).$$ The factorization assembles as before. The diagonal of $R$ holds the lengths $R_{11} = 2$ and $R_{22} = \sqrt 5$; the single above-diagonal entry is the dot product $R_{12} = \mathbf{q}_1\cdot\mathbf{a}_2 = 3$. So
$$Q = \begin{bmatrix} 1/2 & -3/(2\sqrt5) \\ 1/2 & -1/(2\sqrt5) \\ 1/2 & 1/(2\sqrt5) \\ 1/2 & 3/(2\sqrt5) \end{bmatrix}, \qquad R = \begin{bmatrix} 2 & 3 \\ 0 & \sqrt 5 \end{bmatrix}.$$
Here $Q$ is $4 \times 2$ (tall, like $A$) and $R$ is $2 \times 2$ (square) — the reduced QR of §20.5. The two columns of $Q$ are an orthonormal basis for the plane in $\mathbb{R}^4$ spanned by the ones-vector and the $t$-vector; that plane is the column space $C(A)$ from Chapter 13, and it is the haystack inside which least squares will search for the best fit. After the sign fix, Check Your Understanding — In Step 2 above, subtracting the projection onto the all-ones vector $\mathbf{q}_1$ turned the $t$-values $(0,1,2,3)$ into the centered values $(-\tfrac32, -\tfrac12, \tfrac12, \tfrac32)$. Why must the result always sum to zero, for any starting vector? The remainder $\mathbf{u}_2$ is, by construction, orthogonal to $\mathbf{q}_1$ — and $\mathbf{q}_1$ is a positive multiple of the all-ones vector $\mathbf{1} = (1,1,1,1)$. Orthogonality to $\mathbf{1}$ means $\mathbf{1} \cdot \mathbf{u}_2 = 0$, and $\mathbf{1}\cdot\mathbf{u}_2$ is literally the sum of the entries of $\mathbf{u}_2$. So the entries must sum to zero. Subtracting the projection onto the constant direction is exactly "subtract the mean," and a mean-centered vector always sums to zero — the geometric statement (orthogonal to $\mathbf{1}$) and the statistical statement (mean zero) are the same fact. This is the seed of why PCA centers data first (Chapter 32). You will rarely write the Gram-Schmidt loop in production; you will call a library routine, which uses a more stable method (Householder reflections — Chapter 38). But the result should agree with ours up to one well-known wrinkle: the sign convention. Let's confront it directly, because it is the single most common source of "my QR doesn't match the textbook" confusion. The QR factorization is not unique unless you pin down the signs. If $\mathbf{q}_k$ is a valid orthonormal direction, so is $-\mathbf{q}_k$ — flipping a direction keeps it a unit vector perpendicular to the others. Flipping the sign of a column of $Q$ and the corresponding row of $R$ leaves the product $QR$ unchanged. So there are $2^n$ different QR factorizations of the same $A$, differing only by which way each axis points. The standard way to make it unique is to require the diagonal of $R$ to be positive — which is exactly what our Gram-Schmidt produces, since each $R_{kk} = \lVert\mathbf{u}_k\rVert > 0$ is a length. LAPACK (and hence numpy) does not enforce that convention; it lets the diagonal signs fall where the Householder algorithm puts them. The Key Insight — QR with the condition "$R$ has a positive diagonal" is unique for a matrix with independent columns. Gram-Schmidt always returns this canonical form (lengths are positive). numpy's After the sign fix, numpy's Computational Note — Here is the payoff that makes QR a workhorse rather than a curiosity. In Chapter 17 you solved least squares — fitting a model to overdetermined data — by the normal equations $A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$. That formula is correct and geometrically beautiful (it is projection onto the column space, Chapter 19), but it has a numerical weakness that QR repairs completely. The trouble is the matrix $A^{\mathsf{T}}A$. Forming it squares the conditioning of the problem. Roughly, the condition number measures how much a matrix amplifies relative errors when you solve a system with it (we make this precise in Chapter 38); if $A$ has condition number $\kappa$, then $A^{\mathsf{T}}A$ has condition number $\kappa^2$. When the columns of $A$ are nearly parallel — common in real regression, where predictors are correlated — $\kappa$ is large, and $\kappa^2$ can be catastrophically large, drowning the answer in roundoff. You are throwing away half your significant digits before you even start solving. QR sidesteps the squaring entirely. Substitute $A = QR$ into the least-squares problem and watch $A^{\mathsf{T}}A$ vanish. The normal equations become
$$A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b} \;\Longrightarrow\; (QR)^{\mathsf{T}}(QR)\,\hat{\mathbf{x}} = (QR)^{\mathsf{T}}\mathbf{b}
\;\Longrightarrow\; R^{\mathsf{T}}\underbrace{Q^{\mathsf{T}}Q}_{=\,I}R\,\hat{\mathbf{x}} = R^{\mathsf{T}}Q^{\mathsf{T}}\mathbf{b}.$$
Because $Q^{\mathsf{T}}Q = I$, the middle collapses, leaving $R^{\mathsf{T}}R\,\hat{\mathbf{x}} = R^{\mathsf{T}}Q^{\mathsf{T}}\mathbf{b}$. Since $A$ has independent columns, $R$ is invertible, so $R^{\mathsf{T}}$ cancels from both sides, and we are left with the strikingly simple
$$\boxed{\,R\,\hat{\mathbf{x}} = Q^{\mathsf{T}}\mathbf{b}.\,}$$
This is a triangular system — solve it by back substitution in one pass (Chapter 4), no elimination. We never formed $A^{\mathsf{T}}A$, so we never squared the condition number; the conditioning of the QR route is $\kappa$, not $\kappa^2$. For well-conditioned problems both methods agree to full precision, but when the predictors are correlated, QR keeps the digits that the normal equations lose. Let's make $R\hat{\mathbf{x}} = Q^{\mathsf{T}}\mathbf{b}$ concrete by finishing the line fit we set up in §20.7. There the design matrix $A$ (ones-column and $t = 0,1,2,3$) factored as $Q = \begin{bmatrix} 1/2 & -3/(2\sqrt5) \\ \vdots & \vdots \end{bmatrix}$ and $R = \begin{bmatrix} 2 & 3 \\ 0 & \sqrt 5 \end{bmatrix}$. Suppose our four measured $y$-values are $\mathbf{b} = (1, 3, 4, 6)$ — roughly increasing, but not exactly on any line, so the system is overdetermined and we want the best fit. We need two dot products to form $Q^{\mathsf{T}}\mathbf{b}$. The first column of $Q$ dotted with $\mathbf{b}$ is $\tfrac12(1 + 3 + 4 + 6) = \tfrac12 \cdot 14 = 7$. The second column $\tfrac{1}{2\sqrt5}(-3,-1,1,3)$ dotted with $\mathbf{b}$ is $\tfrac{1}{2\sqrt5}(-3 - 3 + 4 + 18) = \tfrac{16}{2\sqrt5} = \tfrac{8}{\sqrt5}$. So the triangular system to solve is
$$R\hat{\mathbf{x}} = Q^{\mathsf{T}}\mathbf{b} \;\Longrightarrow\; \begin{bmatrix} 2 & 3 \\ 0 & \sqrt 5 \end{bmatrix}\begin{bmatrix} c_0 \\ c_1 \end{bmatrix} = \begin{bmatrix} 7 \\ 8/\sqrt 5 \end{bmatrix}.$$
Back substitution runs bottom-up. The last row gives $\sqrt 5\, c_1 = 8/\sqrt 5$, so the slope is $c_1 = 8/5 = 1.6$. Substitute into the first row: $2c_0 + 3(1.6) = 7$, so $2c_0 = 7 - 4.8 = 2.2$ and the intercept is $c_0 = 1.1$. The best-fit line is $y = 1.1 + 1.6\,t$ — and we obtained it by two dot products and a two-step back substitution, never forming or inverting $A^{\mathsf{T}}A$. That is the entire least-squares computation, done by the QR route, with numbers you can check. The code prints The Key Insight — Least squares via QR reduces to solving the triangular system $R\hat{\mathbf{x}} = Q^{\mathsf{T}}\mathbf{b}$ by back substitution. It is the same projection-onto-the-column-space solution from Chapters 17 and 19 — but computed without ever forming the ill-conditioned $A^{\mathsf{T}}A$, so it survives the correlated predictors that wreck the normal equations. This is why every serious least-squares library (numpy's The output shows Real-World Application — Stable curve fitting in econometrics and data science. Polynomial regression fits a curve by using $1, t, t^2, t^3, \dots$ as predictors — but those powers are terribly correlated (over $[0,1]$, $t^3$ and $t^4$ look nearly identical), so the design matrix is wildly ill-conditioned and the normal equations return garbage coefficients for degree $\gtrsim 8$. Statistical software dodges this by fitting with QR (R's Least squares is the headline application, but QR's reach is wider, and two further uses are worth previewing because they connect to the heart of the book. The first is a coordinate change you now get almost for free. Once $A = QR$, finding the coordinates of any vector $\mathbf{b}$ in the column space relative to the orthonormal basis $Q$ is a single matrix-vector product $Q^{\mathsf{T}}\mathbf{b}$ — no system to solve. This is the orthonormal-basis dividend promised in §20.1: with a general basis you would solve $A\mathbf{c} = \mathbf{b}$ by elimination, but with the orthonormal $Q$ the coordinates are just dot products, because $Q^{\mathsf{T}}Q = I$ makes $Q^{\mathsf{T}}$ a left inverse. We will lean on this in Chapter 21, where orthogonal matrices are revealed as the rigid motions, and again in Chapter 22, where projecting a signal onto an orthonormal basis of sines and cosines is the Fourier transform. The second use is the one that will surprise you most. QR is the engine of the single most important algorithm for computing eigenvalues, the QR algorithm, which you will meet in Chapters 24 and 29. The idea is almost magical in its simplicity: factor $A = Q_1 R_1$, then multiply the factors back in the opposite order to form $A_1 = R_1 Q_1$; factor that as $A_1 = Q_2 R_2$, reverse again, and repeat. Under mild conditions this sequence $A, A_1, A_2, \dots$ converges to an upper-triangular matrix whose diagonal entries are the eigenvalues of $A$. We have no business proving that here — eigenvalues are Part V — but it is worth knowing that the humble factorization you built by repeated projection in this chapter is, iterated, how the world actually computes the spectrum of a matrix. Real-World Application — Orthonormal bases in computer graphics and signals (graphics / signals). A 3D renderer constantly needs orthonormal frames — three mutually perpendicular unit vectors that define a local coordinate system for a camera or a surface. Given a camera's view direction and a rough "up" vector (which is usually not exactly perpendicular to the view), the engine runs a tiny Gram-Schmidt: keep the view direction, subtract its shadow from "up" to get a true perpendicular, and take the cross product for the third axis. The result is the orthonormal view matrix at the heart of every rendering pipeline (Chapter 12). The same orthogonalization appears in signal processing as the QR-based adaptive filter: streaming algorithms keep a QR factorization of the data matrix and update it as each new sample arrives, because updating a triangular $R$ is cheap and stable, whereas re-forming $A^{\mathsf{T}}A$ each step would be slow and ill-conditioned. From the orientation of a virtual camera to the noise-cancelling filter in your headphones, the engineering reaches for orthonormal directions — and Gram-Schmidt is how it builds them. Math-Major Sidebar (optional) — Why reversing the QR factors chases eigenvalues. The step $A_{k} = Q_k R_k \mapsto A_{k+1} = R_k Q_k$ is a similarity transformation: since $A_{k+1} = R_k Q_k = Q_k^{\mathsf{T}}(Q_k R_k) Q_k = Q_k^{\mathsf{T}} A_k Q_k$, and $Q_k^{\mathsf{T}} = Q_k^{-1}$ for an orthogonal matrix, each $A_{k+1}$ is similar to $A_k$ — same eigenvalues (Chapter 25 establishes that similar matrices share eigenvalues). So the whole sequence has the eigenvalues of the original $A$, and the iteration gradually rotates the matrix into upper-triangular form, where (Chapter 24) the eigenvalues sit on the diagonal. It is a beautiful instance of a recurring theme: an orthogonal change of coordinates ($Q$) reveals structure (the eigenvalues) that was always there but hidden by the original basis. Practical implementations add shifts and a preliminary reduction to Hessenberg form to converge fast (Chapter 38), but repeated QR is the conceptual core. Historical Note — The orthogonalization process is named for Jørgen Pedersen Gram (1883) and Erhard Schmidt (1907), though essentially the same construction appears earlier in work of Laplace and Cauchy [verify]. Schmidt introduced it in his study of integral equations, where the "vectors" were functions — the same infinite-dimensional viewpoint that powers Chapter 22's Fourier series. The QR algorithm for eigenvalues is far more recent: it was developed independently by John G. F. Francis and Vera Kublanovskaya around 1961, and is routinely ranked among the most important algorithms of the twentieth century [verify]. We close with a genuine trap, the kind that separates a textbook procedure from production code. The Gram-Schmidt recursion we proved correct in §20.4 — call it classical Gram-Schmidt — is mathematically perfect but numerically unstable. In exact arithmetic the $\mathbf{q}_k$ are exactly orthonormal; in floating-point arithmetic, with vectors that are nearly dependent, the computed $\mathbf{q}_k$ can drift alarmingly far from orthogonal. The later vectors lose their perpendicularity to the earlier ones, sometimes badly enough that $Q^{\mathsf{T}}Q$ is nowhere near the identity. This is called loss of orthogonality, and it is one of the classic cautionary tales of numerical linear algebra. The cause is subtle and worth understanding. Classical Gram-Schmidt computes all the dot products $\mathbf{q}_i \cdot \mathbf{a}_k$ against the original $\mathbf{a}_k$, then subtracts every shadow at once. If $\mathbf{a}_k$ is nearly in the span of the earlier vectors, the remainder $\mathbf{u}_k$ is the small difference of large nearly-equal quantities — the recipe for catastrophic cancellation (you met this hazard in earlier numerical notes; Chapter 38 treats it in full). The few surviving significant digits of $\mathbf{u}_k$ are corrupted, and its computed direction is unreliable, so $\mathbf{q}_k$ is not truly perpendicular to its predecessors. The fix is almost embarrassingly small. Modified Gram-Schmidt subtracts the shadows one at a time, updating the working vector after each subtraction, and computes each subsequent dot product against the updated remainder rather than the original $\mathbf{a}_k$. Algebraically the two are identical (in exact arithmetic they produce the same $\mathbf{q}_k$); numerically they are worlds apart, because each modified subtraction projects against the freshest, already-partly-orthogonalized vector, so cancellation does not accumulate. The change is just where you put the update inside the loop. — Classical Gram-Schmidt is numerically unstable; use modified Gram-Schmidt (or, better, a library QR) in real computation. The two are mathematically equivalent — same formula, same exact-arithmetic answer — but in floating point, classical GS can lose orthogonality catastrophically when the input vectors are nearly dependent, while modified GS (subtract each projection sequentially, updating the vector as you go) stays accurate. The difference is the order of operations inside the inner loop: classical projects the original vector onto each $\mathbf{q}_i$; modified projects the running remainder. Never ship classical Gram-Schmidt for ill-conditioned data. For maximum stability, prefer Householder QR, which is what The numbers tell the whole story, and they are more dramatic than you might expect. For classical Gram-Schmidt, Build Your Toolkit — Implement two functions in Step back and see what one idea accomplished. We started with the orthogonal projection of Chapter 19 — drop a perpendicular, keep the leftover — and applied it repeatedly. Each application stripped a vector of its shadow on everything already built, leaving a fresh perpendicular direction. Normalizing gave an orthonormal basis; recording the bookkeeping gave a matrix factorization, $A = QR$, with $Q$ orthonormal and $R$ upper-triangular. No new geometry, only the old geometry used $n$ times in a row. That factorization turned out to be one of the most useful objects in computational mathematics. It solves least squares without the conditioning penalty of the normal equations $A^{\mathsf{T}}A$, reducing the whole problem to a triangular back-substitution $R\hat{\mathbf{x}} = Q^{\mathsf{T}}\mathbf{b}$. It makes coordinates in the column space into mere dot products. And iterated — factor, reverse, repeat — it becomes the QR algorithm that computes eigenvalues, the subject waiting in Part V. Along the way we learned to respect the difference between mathematics and arithmetic: classical and modified Gram-Schmidt are the same on paper and worlds apart in floating point, a first taste of the numerical-stability concerns that Chapter 38 confronts head-on. It is worth naming what skills you now hold, because they recur for the rest of the book. You can take any independent set and produce an orthonormal basis for its span, by hand or in code, and you can check your work in one line by confirming $Q^{\mathsf{T}}Q = I$. You can read the QR factorization straight off that computation — diagonal of $R$ from the remainder lengths, above-diagonal from the dot products — and reconcile it with a library routine by fixing the sign convention. You can solve least squares the stable way, and you can explain to someone why it is more stable than the formula they learned first. And you can articulate the difference between an algorithm that is correct and one that is correct and numerically trustworthy. These are not five separate tricks; they are five faces of the one idea this chapter was built on. The forward connections are dense. The orthonormal $Q$ is the protagonist of Chapter 21 (orthogonal matrices and rotations) and reappears, in complex form, as the unitary matrices that act as quantum gates (Chapter 27 and beyond). The existence of orthonormal bases, which Gram-Schmidt guarantees, underwrites the Spectral Theorem (Chapter 27) and the singular value decomposition (Chapter 30) — the two great factorizations of the second half of the book. The QR algorithm previewed here is how Chapter 29 will actually compute PageRank's dominant eigenvector at scale. And the inner-product viewpoint of the sidebars opens directly onto Chapter 34's abstract inner product spaces and Chapter 22's Fourier series. Repeated projection, it turns out, is a thread that runs through everything still to come. The recurring theme of Part IV is fully in view now: orthogonality is at once a geometric picture (right angles), an algebraic identity ($Q^{\mathsf{T}}Q = I$), and a computational gift (triangular systems, stable least squares). The same perpendicular-dropping instinct from grade school, made precise and repeated, manufactures the cleanest coordinate systems mathematics has — and hands us a factorization we will use for the rest of the book. In Chapter 21 we ask what transformations preserve this orthogonality, and meet the rotations and reflections; the orthogonal matrices $Q$ you built here are exactly those rigid motions, seen from a new angle.Warning
20.5 What is the QR decomposition, and where does it come from?
np.linalg.qr defaults to the reduced form (mode='reduced'); ask for mode='complete' to get the full one. Know which you have: the reduced $Q$ has $Q^{\mathsf{T}}Q = I_n$ but generally $QQ^{\mathsf{T}} \neq I_m$ (it is the projection onto the column space), whereas the full $Q$ is a genuine orthogonal matrix with both products equal to identity.20.6 How do I read R off the Gram-Schmidt computation?
Answer
# QR built from our Gram-Schmidt output: verify A = Q R exactly.
import numpy as np
A = np.array([[1.0, 1.0, 0.0],
[1.0, 0.0, 1.0],
[0.0, 1.0, 1.0]])
Q = np.zeros_like(A); R = np.zeros((3, 3))
for j in range(3):
v = A[:, j].copy()
for i in range(j):
R[i, j] = Q[:, i] @ A[:, j] # above-diagonal entry = dot product
v -= R[i, j] * Q[:, i]
R[j, j] = np.linalg.norm(v) # diagonal entry = length
Q[:, j] = v / R[j, j]
print("R =\n", np.round(R, 4)) # upper-triangular, matches the hand R
print("max |A - QR| =", np.abs(A - Q @ R).max()) # ~1e-16: A reconstructed
R matches the hand-computed matrix to four places ($1.414, 0.707, 0.707$ across the top row, and so on), and the reconstruction error max |A - QR| is on the order of $10^{-16}$ — machine zero. We have factored $A$ into an orthonormal $Q$ and an upper-triangular $R$, by hand and in code, with matching numbers.20.7 What does QR look like for a tall matrix?
# QR of the tall 4x2 design matrix; check shapes, orthonormality, and reconstruction.
import numpy as np
A = np.array([[1.0, 0.0],
[1.0, 1.0],
[1.0, 2.0],
[1.0, 3.0]])
Q, R = np.linalg.qr(A)
Q, R = Q * np.sign(np.diag(R)), R * np.sign(np.diag(R))[:, None] # canonical positive diagonal
print("Q shape:", Q.shape, " R shape:", R.shape) # (4, 2) and (2, 2): reduced QR
print("R =\n", np.round(R, 4)) # [[2, 3], [0, 2.2361]] (2.2361 = sqrt 5)
print("Q^T Q =\n", np.round(Q.T @ Q, 10)) # 2x2 identity
print("max |A - QR| =", np.abs(A - Q @ R).max()) # ~1e-16
R prints as $\begin{bmatrix} 2 & 3 \\ 0 & 2.2361 \end{bmatrix}$ with $2.2361 \approx \sqrt 5$, matching the hand computation exactly, and $Q^{\mathsf{T}}Q$ is the $2\times 2$ identity. The shapes confirm the reduced QR: a $4\times 2$ orthonormal $Q$ and a $2\times 2$ triangular $R$. We will reuse this very factorization in the next section to fit a line — at which point the abstract $R\hat{\mathbf{x}} = Q^{\mathsf{T}}\mathbf{b}$ becomes two numbers, an intercept and a slope.
Answer
20.8 Does QR agree with numpy, and what about the signs?
np.linalg.qr returns a valid QR but with no sign guarantee, so individual columns of its $Q$ may be the negatives of yours. The fix is a one-liner: flip the sign of any column of $Q$ whose corresponding $R$ diagonal entry is negative (and flip that row of $R$ to match).# Compare our Gram-Schmidt QR to numpy's, reconciling the sign convention.
import numpy as np
A = np.array([[1.0, 1.0, 0.0],
[1.0, 0.0, 1.0],
[0.0, 1.0, 1.0]])
Qnp, Rnp = np.linalg.qr(A) # numpy: 'reduced' mode by default
# Force a positive R-diagonal so it matches the canonical Gram-Schmidt output:
signs = np.sign(np.diag(Rnp))
Qnp, Rnp = Qnp * signs, Rnp * signs[:, None] # flip columns of Q and rows of R together
print("R (sign-fixed) =\n", np.round(Rnp, 4))
print("R[0,0], R[1,1], R[2,2] =", np.round(np.diag(Rnp), 4)) # 1.4142 1.2247 1.1547
print("still factors A? max|A - QR| =", np.abs(A - Qnp @ Rnp).max()) # ~1e-16
R has the positive diagonal $1.4142, 1.2247, 1.1547$ — precisely $\sqrt 2, \sqrt 6/2, 2/\sqrt 3$ from our hand computation — and Qnp now matches our $Q$ column for column. The reconstruction $QR = A$ holds either way; the sign fix only aligns the representation with the canonical one. Whenever you compare a QR against a reference, normalize the signs first, or you will chase phantom discrepancies.
np.linalg.qr(A) returns (Q, R) in reduced mode by default: for a tall $m \times n$ matrix ($m > n$), Q is $m \times n$ and R is $n \times n$. Pass mode='complete' for the full $m \times m$ orthogonal Q. numpy computes QR by Householder reflections, not Gram-Schmidt, so it is numerically stable out of the box (§20.10) — but it makes no promise about the sign of each column. There is no mode that forces a positive diagonal; do that yourself with the np.sign(np.diag(R)) trick above if you need the canonical form.20.9 Why does QR give a more stable least squares than the normal equations?
# The concrete line fit via QR: solve R x = Q^T b by back substitution.
import numpy as np
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([1.0, 3.0, 4.0, 6.0])
Q, R = np.linalg.qr(A)
Q, R = Q * np.sign(np.diag(R)), R * np.sign(np.diag(R))[:, None] # canonical signs
x_hat = np.linalg.solve(R, Q.T @ b) # R x = Q^T b
print("intercept c0, slope c1 =", np.round(x_hat, 4)) # [1.1 1.6]
print("matches np.linalg.lstsq:", np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))
[1.1 1.6] — the intercept $1.1$ and slope $1.6$ we found by hand — and confirms it equals numpy's own lstsq (which is itself QR/SVD-based). The clean little design matrix has taken us all the way from "factor it with Gram-Schmidt" to "here is the fitted line," and every number along the way was checkable by pencil.
lstsq, MATLAB's backslash, R's lm) is built on QR or its cousin the SVD, not on the textbook normal-equations formula.# Least squares two ways: normal equations vs. QR, on a near-collinear design.
import numpy as np
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
# Two predictors that are ALMOST identical (correlated) -> ill-conditioned A:
A = np.column_stack([t, t + 1e-4 * rng.standard_normal(50), np.ones(50)])
b = 2 * t + 1.0 + 0.01 * rng.standard_normal(50)
x_normal = np.linalg.solve(A.T @ A, A.T @ b) # normal equations (squares kappa)
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b) # QR: R x = Q^T b (back substitution)
print("cond(A) =", f"{np.linalg.cond(A):.1e}") # large
print("cond(A^T A) =", f"{np.linalg.cond(A.T @ A):.1e}") # ~ square of cond(A)
print("normal eqns residual ||A x - b|| =", np.linalg.norm(A @ x_normal - b))
print("QR residual ||A x - b|| =", np.linalg.norm(A @ x_qr - b))
cond(A^T A) is roughly the square of cond(A) — exactly the amplification we warned about — and on this near-collinear design the QR residual is at least as small as (and on harder problems noticeably smaller than) the normal-equations residual. The lesson generalizes: when the design matrix is well-conditioned the two agree, but QR never creates the conditioning problem that squaring $A^{\mathsf{T}}A$ does. This is the bridge from the clean theory of Chapter 17 to the numerical reality of numerical methods, and it is precisely why the QR route is the default for regression in statistics.
lm uses a QR decomposition internally) or by orthogonalizing the predictors first (R's poly() builds orthogonal polynomials by Gram-Schmidt, so the design matrix already has $Q^{\mathsf{T}}Q = I$ and the fit is trivially stable). The same trick underlies stable Kalman filtering and the "QR update" methods that econometricians use to refit models as new data streams in. Orthogonality is not a mathematical nicety here — it is the difference between a fit you can trust and one you cannot.20.10 What else is QR good for besides least squares?
20.11 What can go wrong numerically, and how does modified Gram-Schmidt fix it?
Warning
np.linalg.qr uses. We return to conditioning and stability in depth in Chapter 38.# Classical vs. modified Gram-Schmidt on a near-dependent matrix: orthogonality lost vs. kept.
import numpy as np
eps = 1e-8 # makes columns nearly parallel -> ill-conditioned
A = np.array([[1.0, 1.0, 1.0],
[eps, 0.0, 0.0],
[0.0, eps, 0.0],
[0.0, 0.0, eps]])
def gs(A, modified):
Q = np.zeros_like(A)
for j in range(A.shape[1]):
v = A[:, j].copy()
for i in range(j):
ref = v if modified else A[:, j] # MODIFIED uses running v; CLASSICAL uses original
v = v - (Q[:, i] @ ref) * Q[:, i]
Q[:, j] = v / np.linalg.norm(v)
return Q
for name, mod in [("classical", False), ("modified", True)]:
Q = gs(A, mod)
loss = np.abs(Q.T @ Q - np.eye(3)).max() # how far Q^T Q strays from identity
print(f"{name:9s}: max |Q^T Q - I| = {loss:.2e}")
# classical: ~5e-01 (orthogonality DESTROYED) modified: ~7e-09 (near machine precision)
max |Q^T Q - I| comes back around $0.5$ — not a small error but a complete loss of orthogonality, with computed "orthonormal" vectors that are nowhere near perpendicular (an off-diagonal entry of $Q^{\mathsf{T}}Q$ near $0.5$ means two columns meet at roughly sixty degrees instead of ninety). Modified Gram-Schmidt, running the same arithmetic in a different order, holds orthogonality to about $7\times 10^{-9}$ — essentially machine precision for this $\varepsilon = 10^{-8}$ problem. Same formula on paper; the difference between a usable basis and a worthless one. This is the canonical demonstration that how you compute, not just what you compute, decides whether the answer is trustworthy.
toolkit/gram_schmidt.py, in pure Python (no numpy in the implementation), reusing dot and norm from toolkit/vectors.py (Chapter 18) and the project_onto idea from toolkit/projection.py (Chapter 19): gram_schmidt(vectors), which takes a list of independent vectors and returns the orthonormal set as repeated projection-and-subtraction (use the modified ordering — subtract each shadow from the running remainder), raising a clear error if a remainder collapses to (near) zero; and qr_decompose(A), which takes a matrix as a list of column vectors and returns (Q, R) with Q orthonormal and R upper-triangular, reading $R$'s diagonal off the remainder lengths and its above-diagonal entries off the dot products, exactly as in §20.6. Verify against np.linalg.qr — and remember to reconcile the sign convention (§20.8) by forcing a positive $R$-diagonal before you compare, or the columns of your $Q$ may legitimately be the negatives of numpy's.20.12 Bringing it together