Case Study 1 — The Regression That Blew Up: Multicollinearity, Conditioning, and a House-Price Model (Data Science)

DataField.Dev

Case Study 1 — The Regression That Blew Up: Multicollinearity, Conditioning, and a House-Price Model (Data Science)

A data scientist at a real-estate analytics startup is asked to build a model predicting home sale prices. She has 200 sales and a handful of features: square footage, number of bedrooms, and — because the listings come from a mix of US and international sources — the home's area in square meters as well. The model is a textbook multiple linear regression: stack the features into a design matrix $A$, the prices into a vector $\mathbf{b}$, and find the coefficients $\hat{\boldsymbol\beta}$ that minimize $\lVert A\boldsymbol\beta - \mathbf{b}\rVert^2$. She remembers the normal equations from Chapter 17 — $A^{\mathsf{T}}A\,\hat{\boldsymbol\beta} = A^{\mathsf{T}}\mathbf{b}$ — codes them up in three lines, and runs the fit.

The coefficients come back as nonsense. The square-footage coefficient is $0.76$ — implying each extra square foot adds 76 cents to the price of a house — while the square-meters coefficient is $1593$. Neither matches her domain knowledge (homes sell for roughly $150 per square foot in this market), and worse, the numbers swing wildly every time she adds or removes a single data point. The model "works" in the sense that its predictions on the training data are accurate, but the coefficients are uninterpretable garbage. What happened?

The hidden near-collinearity

The culprit is that square feet and square meters carry the same information. One square meter is $10.7639$ square feet, so the "sq m" column is, up to a tiny bit of rounding and unit noise, an exact scalar multiple of the "sq ft" column. The two columns of $A$ point in almost exactly the same direction in $\mathbb{R}^{200}$. As Section 38.2 explained geometrically, near-parallel columns make a matrix whose smallest singular value is nearly zero, which means a gigantic condition number. Let us measure it.

# A house-price design matrix with a redundant (near-collinear) column.
import numpy as np
rng = np.random.default_rng(42)
n = 200
sqft = rng.uniform(1000, 3000, n)
sqm  = sqft * 0.092903 + rng.normal(0, 1e-5, n)   # ~exact multiple of sqft -> collinear
beds = rng.integers(1, 5, n).astype(float)
A = np.column_stack([np.ones(n), sqft, sqm, beds])      # [intercept, sqft, sqm, beds]
price = 50000 + 150*sqft + 25000*beds + rng.normal(0, 10000, n)

print('%.4e' % np.linalg.cond(A))        # 2.0887e+08   kappa(A): already bad
print('%.4e' % np.linalg.cond(A.T @ A))  # 5.9385e+16   kappa(A^T A) = kappa(A)^2: catastrophic

There it is: $\kappa(A) \approx 2.1\times 10^8$, which is already dangerous, and $\kappa(A^{\mathsf{T}}A) \approx 5.9\times 10^{16}$ — past $1/\varepsilon_{\text{mach}}$. By forming $A^{\mathsf{T}}A$, the normal equations squared an already-bad condition number into one beyond the reach of double precision. The fit was doomed before it began.

QR rescues the same data

Now solve the identical least-squares problem with QR instead of the normal equations, and compare both against numpy's SVD-based lstsq (the most robust reference):

# Same problem, two algorithms. QR never forms A^T A.
beta_normal = np.linalg.solve(A.T @ A, A.T @ price)
Q, R = np.linalg.qr(A)
beta_qr = np.linalg.solve(R, Q.T @ price)
beta_svd, *_ = np.linalg.lstsq(A, price, rcond=None)

print(np.linalg.norm(beta_normal - beta_svd))   # 428056.02   normal eq: catastrophically off
print(np.linalg.norm(beta_qr     - beta_svd))   # 8.41e-07    QR: essentially exact

The normal-equations coefficients are off from the truth by a norm of about 428,000 — they are not approximately right, they are wrong. The QR coefficients agree with the SVD reference to within $8.4\times 10^{-7}$. Same 200 houses, same features, same mathematical problem; one algorithm produced a usable model and the other produced numerical rubble. The only difference is that the normal equations formed $A^{\mathsf{T}}A$ and squared the conditioning, exactly as Section 38.5 warned.

This is why production statistics and ML libraries — scikit-learn's LinearRegression, R's lm, numpy's lstsq, MATLAB's backslash — never solve least squares by literally inverting $A^{\mathsf{T}}A$, even though the closed-form formula $\hat{\boldsymbol\beta} = (A^{\mathsf{T}}A)^{-1}A^{\mathsf{T}}\mathbf{b}$ invites it. They all use QR or SVD under the hood precisely to dodge the squared condition number. The data scientist's three-line normal-equations implementation reinvented a wheel that the libraries had deliberately abandoned.

Two cures, and the deeper lesson

There are two ways to fix this, and they teach complementary lessons.

Cure 1 — remove the redundancy. The square-meters column adds no information the square-feet column doesn't already carry. Drop it, and the conditioning collapses:

# Drop the redundant column: the problem becomes well-conditioned again.
A2 = np.column_stack([np.ones(n), sqft, beds])
print(np.linalg.cond(A2))                          # 9130.1   thousands of times better
print(np.linalg.solve(A2.T @ A2, A2.T @ price))    # [50107.8, 148.8, 25560.4]  sane!

With the duplicate column gone, $\kappa$ drops from $2.1\times 10^8$ to about $9{,}100$, and even the normal equations now return sensible coefficients: an intercept near $50{,}000$, about $\$149$ per square foot, and about $\$25{,}560$ per bedroom — recovering the true model. The lesson is one every data scientist eventually learns the hard way: multicollinearity — predictors that are linear combinations of each other — inflates the condition number and makes coefficients unstable and uninterpretable. Feature engineering that removes or combines redundant predictors is not just tidy; it is numerical hygiene.

Cure 2 — regularize. Sometimes you cannot simply drop a column (in genomics or text models, predictors are all somewhat correlated and there is no obvious redundancy to remove). The standard fix is ridge regression, which solves $(A^{\mathsf{T}}A + \lambda I)\hat{\boldsymbol\beta} = A^{\mathsf{T}}\mathbf{b}$ for a small $\lambda > 0$. Adding $\lambda I$ shifts every eigenvalue of $A^{\mathsf{T}}A$ up by $\lambda$, which raises the smallest eigenvalue away from zero and therefore lowers the condition number on purpose. It is conditioning repair in the form of a statistical technique: you accept a little bias in the coefficients in exchange for a great deal of stability, and the connection between the two is exactly the condition number of this chapter.

What the data scientist should have done first

The whole episode would have been a non-event if she had run one diagnostic before trusting the fit: compute np.linalg.cond(A). A condition number of $2\times 10^8$ is a flashing red light — it says "your design matrix is nearly rank-deficient; do not believe these coefficients." The checklist from Section 38.8 is not academic ceremony; it is the difference between shipping a model whose coefficients mean something and shipping one that reorganizes itself every time the data changes. Conditioning is a property of the data she collected, and no choice of solver could have made an ill-posed question well-posed — but the right solver (QR/SVD) plus the right diagnostic (the condition number) would have told her immediately that the square-meters column had to go. In numerical data science, the condition number is the first thing you check and the last thing you should ignore.