Case Study 1 — When the Normal Equations Lie: Stable Regression by QR

DataField.Dev

Case Study 1 — When the Normal Equations Lie: Stable Regression by QR

Field: data science / statistics. This case study extends the chapter's anchor — building and using an orthonormal basis — to the everyday task of fitting a model to data, and shows in concrete numbers why the QR route is the one the professionals trust.

The setup: a polynomial that should be easy

A data analyst at a climate lab has thirty temperature readings taken over a single day and wants a smooth curve through them. The obvious tool is polynomial regression: model the temperature as $y \approx c_0 + c_1 t + c_2 t^2 + \dots + c_d t^d$ and find the coefficients $c_0, \dots, c_d$ that minimize the squared error. In matrix form this is least squares, $A\mathbf{c} \approx \mathbf{b}$, where the design matrix $A$ has one row per reading and columns $1, t, t^2, \dots, t^d$ (the Vandermonde matrix), and $\mathbf{b}$ holds the measured temperatures.

For a low degree — a straight line or a parabola — everything works. The analyst forms the normal equations $A^{\mathsf{T}}A\,\hat{\mathbf{c}} = A^{\mathsf{T}}\mathbf{b}$ from Chapter 17, solves the small system, and gets sensible coefficients. The trouble begins when the data has genuine structure — a morning dip, a midday peak, an evening fall — that a quadratic cannot capture. The analyst raises the degree to $d = 12$ and beyond to get a flexible enough curve, and the normal-equations route quietly starts to fail: the matrix $A^{\mathsf{T}}A$ becomes numerically singular, the solver returns coefficients corrupted by roundoff, and at still higher degrees the answer is simply not trustworthy. The code did not crash. It confidently returned numbers that no longer mean what they should — and crucially, the analyst has no warning that anything went wrong.

Why the normal equations failed

The villain is the design matrix's condition number, and the chapter named the mechanism precisely in §20.9. Over a time interval — say $t \in [0, 1]$ — the columns $t^{11}$ and $t^{12}$ are nearly indistinguishable: both are tiny near $t = 0$ and both climb steeply toward $t = 1$, tracing almost the same shape. Nearly identical columns mean the design matrix $A$ is ill-conditioned — its columns are close to linearly dependent, so the matrix is close to singular even though it is technically full rank.

That alone would be manageable. The fatal step is forming $A^{\mathsf{T}}A$. As §20.9 explained, this operation squares the condition number: a degree-12 Vandermonde matrix on $[0,1]$ has $\kappa \approx 8\times 10^8$, so $A^{\mathsf{T}}A$ has condition number $\kappa^2 \approx 10^{17}$. Standard double-precision floating point carries only about sixteen significant digits, so a condition number above $10^{16}$ means every digit of the answer can be swamped by roundoff — the matrix $A^{\mathsf{T}}A$ is, for numerical purposes, singular. The analyst threw away the entire precision of the computation in a single matrix multiplication, before solving anything. The normal equations are mathematically exact and numerically suicidal: $A$ itself was merely difficult ($\kappa \approx 10^9$, still workable), but squaring it pushed the problem past the precision cliff.

# Demonstrate the conditioning catastrophe on a degree-12 Vandermonde matrix.
import numpy as np
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * t) + 0.02 * rng.standard_normal(30)   # smooth signal + small noise
A = np.vander(t, 13, increasing=True)        # columns 1, t, t^2, ..., t^12  (30 x 13)
print("cond(A)     =", f"{np.linalg.cond(A):.1e}")        # ~8e8   (A itself: difficult)
print("cond(A^T A) =", f"{np.linalg.cond(A.T @ A):.1e}")  # ~1e17  -- the SQUARE, past the cliff
c_normal = np.linalg.solve(A.T @ A, A.T @ y)              # normal equations
print("largest |coefficient| (normal eqns):", f"{np.abs(c_normal).max():.2e}")

The printout makes the abstract warning visceral: cond(A) is around $10^9$ — large but still inside the range double precision can handle — while cond(A^T A) is its square, above $10^{17}$, which exceeds the roughly $10^{16}$ that sixteen significant digits can resolve. Once the condition number crosses that line, the solution to the normal equations is governed by roundoff rather than by the data; the coefficients it returns have lost their leading digits even though the solver reports no error.

The QR rescue

The analyst's senior colleague rewrites two lines. Instead of forming $A^{\mathsf{T}}A$, factor $A = QR$ by QR decomposition and solve the triangular system $R\hat{\mathbf{c}} = Q^{\mathsf{T}}\mathbf{b}$ by back substitution, exactly as the chapter derived. Because QR never forms $A^{\mathsf{T}}A$, the conditioning of the computation stays at $\kappa \approx 10^9$, not $\kappa^2 \approx 10^{17}$ — comfortably inside what double precision can resolve. The roughly seven surviving significant digits are plenty for a temperature curve.

# Same fit via QR: solve R c = Q^T y by back substitution. No A^T A is ever formed.
Q, R = np.linalg.qr(A)
c_qr = np.linalg.solve(R, Q.T @ y)                  # R c = Q^T y
resid_normal = np.linalg.norm(A @ c_normal - y)
resid_qr     = np.linalg.norm(A @ c_qr - y)
print("residual  normal eqns =", f"{resid_normal:.4e}")
print("residual  QR          =", f"{resid_qr:.4e}")   # no worse, and trustworthy
print("QR matches np.polyfit:",
      np.allclose(c_qr, np.polyfit(t, y, 12)[::-1], atol=1e-6))   # True: polyfit is QR-based

The QR residual is no worse than the normal-equations residual at degree 12, and as the degree climbs it pulls ahead — but the decisive line is the last one: numpy's own np.polyfit agrees with the hand-rolled QR solution to six places, because polyfit (like R's lm, like MATLAB's backslash) is built on QR, not on the normal equations. That agreement is the certification that the QR coefficients are the trustworthy ones. The library writers learned this lesson long ago and baked the stable algorithm into the default; when you call polyfit, you are calling the QR route of this chapter. (A subtlety worth honesty about: at this degree the fitted coefficients are genuinely large in either method, because a degree-12 polynomial on $[0,1]$ is an intrinsically oscillatory, ill-posed basis — large coefficients here reflect the basis, not a bug. The point of QR is not that it makes the coefficients small, but that it computes the correct coefficients for the chosen model without the precision loss of squaring $\kappa$.)

The deeper fix: orthogonal predictors

There is an even cleaner way to think about what went wrong, and it returns us to the chapter's central idea. The problem was that the predictor columns $1, t, t^2, \dots$ were nearly parallel — they overlapped, carrying redundant, mutually contaminating information. The cure for overlap is orthogonality. Statisticians fit high-degree polynomials using orthogonal polynomials: instead of the raw powers, they use a basis built by Gram-Schmidt so that the design matrix already has orthonormal columns, $Q^{\mathsf{T}}Q = I$. R's poly() function does exactly this; the Legendre and Chebyshev families from the chapter's Math-Major Sidebar are the same construction with specific weightings.

Once the predictors are orthonormal, the fit is trivially stable — there is no $A^{\mathsf{T}}A$ to invert, because $A^{\mathsf{T}}A = Q^{\mathsf{T}}Q = I$. Even better, the regression coefficients decouple: each coefficient is just $\mathbf{q}_i \cdot \mathbf{b}$, a single dot product, computed independently of all the others. Adding a higher-degree term does not disturb the lower-degree coefficients at all — a property the raw powers conspicuously lack, where adding one term shuffles every coefficient. This is the "non-interfering directions" theme of Part IV made financial: orthogonal predictors are coordinates that do not talk to each other, so each one's contribution is clean, stable, and interpretable.

What the analyst learned

Three lessons generalize far beyond polynomial fitting. First, a correct formula is not a correct computation — the normal equations are exactly right on paper and catastrophically wrong in floating point for ill-conditioned data, which is why understanding conditioning (Chapter 38) is not optional for anyone who fits models to real data. Second, QR is the default tool for least squares precisely because it avoids the conditioning penalty, and every serious numerical library reflects that choice. Third, and most conceptually, orthogonality is a computational gift, not just a geometric nicety: building an orthonormal basis by Gram-Schmidt turns a fragile, entangled problem into a robust, decoupled one. The same right-angle instinct from grade school, the same repeated projection from Chapter 19, is what stands between the climate lab and a curve full of lies. When a statistical fit misbehaves at high flexibility, the experienced analyst's first thought is not "the model is wrong" but "the basis is ill-conditioned — orthogonalize it."