Case Study 1 — Predicting House Prices: Multiple Regression as Projection in Feature Space
Field: data science / real estate analytics. Concepts used: overdetermined systems, design matrix, normal equations, projection onto $C(A)$, residual, $R^2$, full column rank. Why it matters: automated valuation models (Zillow's "Zestimate," a bank's mortgage-risk engine, a tax assessor's mass-appraisal system) are built on, or benchmarked against, a least-squares projection of a price vector onto the column space of a feature matrix. Regression of this kind is among the most common industrial uses of the mathematics in this chapter.
The problem: turn features into a price
A house has attributes — square footage, number of bedrooms, age, lot size, school district, distance to downtown — and it has a price. The central question of real-estate analytics is: given a new house's attributes, what is it worth? The simplest honest answer is a linear model: assume the price is, approximately, a weighted sum of the attributes plus a baseline, $$ \text{price} \approx c_0 + c_1(\text{size}) + c_2(\text{bedrooms}) + c_3(\text{age}). $$ The unknowns are the weights $c_0, c_1, c_2, c_3$ — the marginal value of each attribute. Once we know them, pricing a new house is a single dot product. The trouble is that no four numbers make the formula exactly right for every house in our records; real prices scatter around any linear rule because of noise, taste, timing, and a hundred attributes we did not measure. So we have more houses than parameters — an overdetermined system — and we want the weights that fit best. That is least squares, and the geometry is exactly this chapter's: project the price vector onto the column space of the attribute matrix.
Suppose we have records on eight houses (a tiny dataset, chosen so every number is checkable; real models use millions of rows). The attributes and sale prices are:
| House | Size (1000 sqft) | Bedrooms | Age (yrs) | Price ($1000s) |
|---|---|---|---|---|
| 1 | 1.0 | 2 | 20 | 210 |
| 2 | 1.5 | 3 | 5 | 305 |
| 3 | 1.2 | 2 | 30 | 190 |
| 4 | 2.0 | 4 | 10 | 410 |
| 5 | 1.8 | 3 | 15 | 350 |
| 6 | 2.5 | 4 | 2 | 480 |
| 7 | 1.1 | 2 | 40 | 175 |
| 8 | 2.2 | 4 | 8 | 430 |
Building the design matrix and the column space
Each house contributes one equation "predicted price = actual price," so each house is one row of the design matrix $A$. The columns are the four terms of the model: an all-ones column for the intercept $c_0$, then the size, bedrooms, and age columns. The price vector $\mathbf{b}$ holds the eight sale prices. Symbolically, $$ A = \begin{bmatrix} 1 & 1.0 & 2 & 20 \\ 1 & 1.5 & 3 & 5 \\ 1 & 1.2 & 2 & 30 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 2.2 & 4 & 8 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 210 \\ 305 \\ 190 \\ \vdots \\ 430 \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}. $$ This is an $8 \times 4$ system: eight equations, four unknowns, tall and overdetermined. The column space $C(A)$ is a subspace of dimension at most 4 sitting inside $\mathbb{R}^8$. The price vector $\mathbf{b}$ is a point in $\mathbb{R}^8$ that almost certainly does not lie in that subspace (the houses do not obey any exact linear price law), so no set of weights prices all eight houses perfectly. Least squares finds the weights whose predictions are collectively closest to the real prices, in the sense of minimizing $\|\mathbf{b} - A\mathbf{x}\|$; the resulting prediction vector $A\hat{\mathbf{x}}$ is the projection of $\mathbf{b}$ onto $C(A)$.
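The claim that $\mathbf{b}$ does not lie in $C(A)$ can be checked numerically by comparing the rank of $A$ with the rank of the augmented matrix $[A \mid \mathbf{b}]$. The following is a minimal sketch using the eight rows of the table; the variable names are illustrative choices, not part of the text.

```python
import numpy as np

# Data copied from the table of eight houses.
size  = np.array([1.0, 1.5, 1.2, 2.0, 1.8, 2.5, 1.1, 2.2])
beds  = np.array([2, 3, 2, 4, 3, 4, 2, 4], dtype=float)
age   = np.array([20, 5, 30, 10, 15, 2, 40, 8], dtype=float)
price = np.array([210, 305, 190, 410, 350, 480, 175, 430], dtype=float)

A = np.column_stack([np.ones_like(size), size, beds, age])  # 8x4 design matrix
b = price

rank_A  = np.linalg.matrix_rank(A)
rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))    # augmented [A | b]

print("rank(A)     =", rank_A)    # 4
print("rank([A|b]) =", rank_Ab)   # 5
print("b lies in C(A):", rank_A == rank_Ab)  # False
```

Appending $\mathbf{b}$ raises the rank from 4 to 5, so $\mathbf{b}$ points partly outside $C(A)$ and $A\mathbf{x} = \mathbf{b}$ has no exact solution.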
The four columns are linearly independent: no column is a linear combination of the others. (Bedroom count tracks size closely in this dataset, but not exactly; houses 1, 3, and 7 share a bedroom count yet differ in size and age.) So $A$ has full column rank, $A^{\mathsf{T}}A$ is invertible, and the least-squares weights are unique. That is the §17.5 condition, and it is what makes "the best-fit price model" a well-defined object rather than an ambiguous one. (If we had accidentally included both "size in sqft" and "size in square meters," those two columns would be proportional, $A$ would be rank-deficient, and the individual coefficients would no longer be uniquely determined; this real hazard is called perfect multicollinearity.)
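The rank condition and the multicollinearity hazard can both be seen in a few lines. This sketch assumes a hypothetical redundant column, `size_m2`, holding the same sizes converted to square meters (1,000 sqft = 92.90304 m²); that column is invented here for illustration and is not part of the dataset.

```python
import numpy as np

size = np.array([1.0, 1.5, 1.2, 2.0, 1.8, 2.5, 1.1, 2.2])   # 1000 sqft
beds = np.array([2, 3, 2, 4, 3, 4, 2, 4], dtype=float)
age  = np.array([20, 5, 30, 10, 15, 2, 40, 8], dtype=float)

A = np.column_stack([np.ones_like(size), size, beds, age])
print(A.shape[1], "columns, rank", np.linalg.matrix_rank(A))          # 4 columns, rank 4

# Hypothetical redundant feature: the same size expressed in square meters.
size_m2 = size * 92.90304
A_bad = np.column_stack([A, size_m2])
print(A_bad.shape[1], "columns, rank", np.linalg.matrix_rank(A_bad))  # 5 columns, rank 4
```

With five columns but rank 4, $A^{\mathsf{T}}A$ is singular: any amount of weight can be shifted between the two size columns without changing the predictions.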
Solving by least squares
We form the normal equations $A^{\mathsf{T}}A\,\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ and solve. By hand this is a $4\times 4$ system — tedious but elementary; in practice we hand it to numpy, which (per §17.9) uses the SVD rather than inverting $A^{\mathsf{T}}A$.
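For reference, here is the $4 \times 4$ system written out. Each entry of $A^{\mathsf{T}}A$ is a dot product of two columns of $A$, and each entry of $A^{\mathsf{T}}\mathbf{b}$ is a dot product of a column with the price vector; the sums below were computed from the eight table rows for this worked instance.

$$
\begin{bmatrix} 8 & 13.3 & 24 & 130 \\ 13.3 & 24.23 & 43.3 & 177.1 \\ 24 & 43.3 & 78 & 320 \\ 130 & 177.1 & 320 & 3318 \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}
=
\begin{bmatrix} 2550 \\ 4684 \\ 8395 \\ 32175 \end{bmatrix}.
$$

The first row, $8c_0 + 13.3\,c_1 + 24\,c_2 + 130\,c_3 = 2550$, says that the predicted prices must sum to the actual prices, that is, the residuals sum to zero.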
# Multiple regression: project the price vector onto the feature column space.
import numpy as np
size = np.array([1.0, 1.5, 1.2, 2.0, 1.8, 2.5, 1.1, 2.2])
beds = np.array([2, 3, 2, 4, 3, 4, 2, 4], dtype=float)
age = np.array([20, 5, 30, 10, 15, 2, 40, 8], dtype=float)
price = np.array([210, 305, 190, 410, 350, 480, 175, 430], dtype=float) # $1000s
A = np.column_stack([np.ones_like(size), size, beds, age]) # 8x4 design matrix
b = price
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None) # least squares (SVD)
print("coefficients [c0, c1, c2, c3] =", np.round(x_hat, 3))
# -> [ 36.442 122.499 35.265 -1.67 ]
r = b - A @ x_hat
R2 = 1 - (r @ r) / np.sum((b - b.mean())**2)
print("R^2 =", round(R2, 4)) # 0.9925
print("RMSE ($1000s) =", round(np.sqrt(r @ r / len(b)), 2)) # 9.54
The fitted weights are
$$
\hat{\mathbf{x}} = (c_0, c_1, c_2, c_3) \approx (36.4,\; 122.5,\; 35.3,\; -1.67),
$$
giving the model
$$
\widehat{\text{price}} = 36.4 + 122.5\,(\text{size}) + 35.3\,(\text{bedrooms}) - 1.67\,(\text{age}) \quad (\$1000\text{s}).
$$
The fit explains about 99.2% of the variation in price ($R^2 \approx 0.992$), which is extremely high, as expected for a clean synthetic dataset, and the root-mean-square error is about $\$9{,}500$ per house. Solving the normal equations directly (np.linalg.solve(A.T @ A, A.T @ b)) yields the same coefficients to the displayed precision here, because this small system is conditioned well enough that squaring the condition number (which forming $A^{\mathsf{T}}A$ does) costs nothing visible; on ill-conditioned data the two routes can diverge, and the SVD route used by lstsq is the more trustworthy one.
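The comparison between the two routes can be run directly. The sketch below refits the model both ways and reports the condition numbers involved; it is an illustration on the eight-house data, not a general benchmark.

```python
import numpy as np

size  = np.array([1.0, 1.5, 1.2, 2.0, 1.8, 2.5, 1.1, 2.2])
beds  = np.array([2, 3, 2, 4, 3, 4, 2, 4], dtype=float)
age   = np.array([20, 5, 30, 10, 15, 2, 40, 8], dtype=float)
price = np.array([210, 305, 190, 410, 350, 480, 175, 430], dtype=float)

A = np.column_stack([np.ones_like(size), size, beds, age])
b = price

# Route 1: SVD-based least squares.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Route 2: form and solve the normal equations A^T A x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

print("max |difference| =", np.max(np.abs(x_lstsq - x_normal)))

# In the 2-norm, cond(A^T A) equals cond(A) squared.
cond_A   = np.linalg.cond(A)
cond_AtA = np.linalg.cond(A.T @ A)
print("cond(A)     =", cond_A)
print("cond(A^T A) =", cond_AtA)
print("cond(A)**2  =", cond_A**2)
```

The squared condition number is the reason the normal-equations route loses roughly twice as many digits as the SVD route when the columns are nearly dependent.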
Reading the coefficients: what the projection tells us
The real payoff of regression is not the prediction but the interpretation of the weights, each of which is a marginal effect holding the other features fixed:
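- $c_1 \approx 122.5$: each additional 1,000 square feet of living area adds about \$122,500 to the predicted price, holding bedrooms and age constant. This is the largest driver in this dataset: across the observed sizes (1.0 to 2.5 thousand sqft) it spans roughly \$184,000 of predicted price, more than the bedroom or age terms span over their observed ranges. That matches intuition, since size is the dominant determinant of value.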
- $c_2 \approx 35.3$: each additional bedroom adds about \$35,300, beyond the effect of the extra square footage it usually brings. (Because size is held fixed, this isolates the value of partitioning space into another bedroom.)
- $c_3 \approx -1.67$: each additional year of age subtracts about \$1,670. The negative sign is the model learning, from data alone, that older houses are worth less — a genuine pattern recovered by the projection, not assumed.
- $c_0 \approx 36.4$: the intercept is the formal price of a hypothetical zero-size, zero-bedroom, brand-new house — not physically meaningful, but a necessary baseline that lets the other slopes fit correctly. (This is why we always include the all-ones column.)
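The phrase "holding the other features fixed" can be made concrete by changing one entry of a feature row at a time and watching the prediction move. The sketch below refits the model from the table; the baseline house (`base`) is an arbitrary illustrative choice.

```python
import numpy as np

size  = np.array([1.0, 1.5, 1.2, 2.0, 1.8, 2.5, 1.1, 2.2])
beds  = np.array([2, 3, 2, 4, 3, 4, 2, 4], dtype=float)
age   = np.array([20, 5, 30, 10, 15, 2, 40, 8], dtype=float)
price = np.array([210, 305, 190, 410, 350, 480, 175, 430], dtype=float)

A = np.column_stack([np.ones_like(size), size, beds, age])
x_hat, *_ = np.linalg.lstsq(A, price, rcond=None)

# Illustrative baseline row: [intercept, size, beds, age].
base = np.array([1.0, 1.6, 3.0, 12.0])

# Bump one feature by one unit, leaving the others untouched.
bumps = {"size (+1000 sqft)": 1, "bedrooms (+1)": 2, "age (+1 yr)": 3}
for label, j in bumps.items():
    bumped = base.copy()
    bumped[j] += 1.0
    change = bumped @ x_hat - base @ x_hat
    print(f"{label:18s} change = {change:8.3f}   coefficient = {x_hat[j]:8.3f}")

# The intercept is the prediction for the all-zero feature row.
zero_house = np.array([1.0, 0.0, 0.0, 0.0])
print("prediction at zero features =", round(zero_house @ x_hat, 3))
```

Each printed change equals the corresponding coefficient, which is all that "marginal effect" means in a linear model.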
To price a new house — say 1.6k sqft, 3 bedrooms, 12 years old — we evaluate the model, which is a single dot product of the new feature row with $\hat{\mathbf{x}}$:
new_house = np.array([1, 1.6, 3, 12.0]) # [intercept, size, beds, age]
print("predicted price = $%.1fk" % (new_house @ x_hat)) # $318.2k
The model predicts about **\$318,200**. Geometrically, we have taken a new point in feature space and read off its height on the fitted hyperplane, the same hyperplane whose heights at the eight training houses form $A\hat{\mathbf{x}}$, the projection of the training prices onto $C(A)$.
What the residual means, and why this scales
The residual vector $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ has one entry per house: the gap between what the house actually sold for and what the model predicts. House 1, for instance, sold for \$210k but the model predicts about \$196k, a residual of roughly $+\$14$k. The house sold for *more* than its measured attributes explain, perhaps because of a renovated kitchen or a corner lot the model never saw. Such residuals are not errors to be embarrassed by; they are the signal that *something the model omitted* mattered. Analysts read large residuals as leads: a cluster of houses in one neighborhood that the model consistently underprices (large positive residuals) might reveal a feature (a school, a view) worth adding as a new column. Adding that column enlarges $C(A)$, so the residual norm on the fitted data can only stay the same or shrink (§17.4), which is how feature engineering improves the fit, one column at a time.
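Both points, the per-house residuals and the effect of adding a column, can be sketched in code. The extra column below, `renovated`, is a made-up indicator invented purely for illustration (it is not in the source data); it flags two houses so that the enlarged fit has something to pick up.

```python
import numpy as np

size  = np.array([1.0, 1.5, 1.2, 2.0, 1.8, 2.5, 1.1, 2.2])
beds  = np.array([2, 3, 2, 4, 3, 4, 2, 4], dtype=float)
age   = np.array([20, 5, 30, 10, 15, 2, 40, 8], dtype=float)
price = np.array([210, 305, 190, 410, 350, 480, 175, 430], dtype=float)

A = np.column_stack([np.ones_like(size), size, beds, age])
b = price

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x_hat
for i, (actual, resid) in enumerate(zip(b, r), start=1):
    print(f"house {i}: sold {actual:5.0f}, predicted {actual - resid:6.1f}, residual {resid:+6.1f}")

# HYPOTHETICAL feature, not in the dataset: 1 if the house was renovated.
renovated = np.array([1, 0, 0, 0, 1, 0, 0, 0], dtype=float)
A_plus = np.column_stack([A, renovated])

x_plus, *_ = np.linalg.lstsq(A_plus, b, rcond=None)
r_plus = b - A_plus @ x_plus

print("squared residual norm, 4 columns:", round(r @ r, 2))
print("squared residual norm, 5 columns:", round(r_plus @ r_plus, 2))
```

The second norm is never larger than the first, whatever column is appended, because the old fit is still available in the enlarged column space by giving the new column a weight of zero.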
The orthogonality at the heart of this chapter has a concrete reading here too. The condition $A^{\mathsf{T}}\mathbf{r} = \mathbf{0}$ says the residual is orthogonal to every column of $A$. Orthogonality to the all-ones column means the residuals sum to zero; combined with orthogonality to a feature column, that makes the sample correlation between the residuals and that feature exactly zero. The model has squeezed out all the linear information in size, bedrooms, and age: whatever is left in the residual is, by construction, orthogonal to those columns. That is the precise sense in which least squares extracts "everything the linear model can know" from the features, leaving a residual that the chosen columns genuinely cannot explain.
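The orthogonality condition is easy to confirm numerically. This sketch recomputes the residual and checks both $A^{\mathsf{T}}\mathbf{r}$ and the sample correlations; values of order $10^{-10}$ or smaller should be read as zero up to floating-point rounding.

```python
import numpy as np

size  = np.array([1.0, 1.5, 1.2, 2.0, 1.8, 2.5, 1.1, 2.2])
beds  = np.array([2, 3, 2, 4, 3, 4, 2, 4], dtype=float)
age   = np.array([20, 5, 30, 10, 15, 2, 40, 8], dtype=float)
price = np.array([210, 305, 190, 410, 350, 480, 175, 430], dtype=float)

A = np.column_stack([np.ones_like(size), size, beds, age])
x_hat, *_ = np.linalg.lstsq(A, price, rcond=None)
r = price - A @ x_hat

# One dot product per column: ones, size, beds, age.
print("A^T r =", A.T @ r)

# Orthogonality to the ones column means the residuals sum to zero.
print("sum of residuals =", r.sum())

# Zero mean plus orthogonality gives zero sample correlation with each feature.
for name, col in [("size", size), ("beds", beds), ("age", age)]:
    print(f"corr(residual, {name}) =", np.corrcoef(col, r)[0, 1])
```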
Everything here scales from eight houses to eight million with no change in the mathematics: the design matrix grows rows and columns, the column space lives in a higher-dimensional space, but the fit is still the projection of $\mathbf{b}$ onto $C(A)$, and the weights still solve $A^{\mathsf{T}}A\hat{\mathbf{x}} = A^{\mathsf{T}}\mathbf{b}$ — computed by QR or SVD, never by inverting $A^{\mathsf{T}}A$. This is the workhorse behind automated valuation, and it is nothing more exotic than a perpendicular dropped onto a column space. For the data-science framing with regularization and train/test splits, see linear regression; for the statistical inference layer — standard errors and significance of each coefficient — see regression in statistics.
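As an illustration of the QR route mentioned above, the following sketch factors $A = QR$ and solves the small triangular system $R\hat{\mathbf{x}} = Q^{\mathsf{T}}\mathbf{b}$, then compares the result with the SVD-based answer. It uses `np.linalg.solve` for simplicity; a dedicated triangular solver (for example `scipy.linalg.solve_triangular`) would exploit the structure of $R$.

```python
import numpy as np

size  = np.array([1.0, 1.5, 1.2, 2.0, 1.8, 2.5, 1.1, 2.2])
beds  = np.array([2, 3, 2, 4, 3, 4, 2, 4], dtype=float)
age   = np.array([20, 5, 30, 10, 15, 2, 40, 8], dtype=float)
price = np.array([210, 305, 190, 410, 350, 480, 175, 430], dtype=float)

A = np.column_stack([np.ones_like(size), size, beds, age])
b = price

# Reduced QR: Q is 8x4 with orthonormal columns, R is 4x4 upper triangular.
Q, R = np.linalg.qr(A)

# Since Q^T Q = I, the normal equations reduce to R x = Q^T b.
x_qr = np.linalg.solve(R, Q.T @ b)

x_svd, *_ = np.linalg.lstsq(A, b, rcond=None)
print("QR  coefficients:", np.round(x_qr, 3))
print("SVD coefficients:", np.round(x_svd, 3))
print("agree:", np.allclose(x_qr, x_svd))
```

The QR route never forms $A^{\mathsf{T}}A$, so it works with the condition number of $A$ rather than its square.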