Case Study 1 — Degrees of Freedom: Rank-Nullity Behind Linear Regression

DataField.Dev

Field: statistics / data science. Concepts used: rank, nullity, the four fundamental subspaces, column space and left null space, orthogonality. Anchor tie-in: this is the four-subspaces picture standing behind the most-used model in all of applied statistics — and it explains the mysterious phrase "degrees of freedom" that every regression output reports.

The mystery in every regression output

Run a linear regression in any statistics package — R, Python's statsmodels, even a spreadsheet — and the output reports a number called the residual degrees of freedom, equal to $n - p$: the number of data points minus the number of parameters you fit. It shows up everywhere downstream: in the denominator of the variance estimate, in the $t$-tests on each coefficient, in the $F$-statistic for the whole model. Students are told to compute it and move on, but almost no one is told what it is. It is not a statistical convention pulled from a hat. It is the dimension of a fundamental subspace, handed to you by the rank-nullity theorem of this chapter.

Here is the setup, stripped to its linear-algebra skeleton. You have $n$ data points $(x_i, y_i)$ and you want to fit a straight line $y = \beta_0 + \beta_1 x$. Stack the data into a design matrix $X$ with one row per data point: a column of ones (for the intercept $\beta_0$) and a column of the $x$-values (for the slope $\beta_1$). The model says $\mathbf{y} \approx X\boldsymbol{\beta}$, where $\boldsymbol{\beta} = (\beta_0, \beta_1)$ are the two unknown parameters. With $n$ points and $p = 2$ parameters, $X$ is an $n \times p$ matrix — tall, because you (almost) always have more data than parameters.

Why the data vector usually isn't reachable

The tall shape is the whole story. The column space $C(X)$ — every vector $X\boldsymbol{\beta}$ the model can possibly produce — is a $p$-dimensional subspace (here $p = 2$, a plane) sitting inside the $n$-dimensional space $\mathbb{R}^n$ where the data vector $\mathbf{y}$ lives. When $n > p$, that plane is a thin slice of a much bigger space, and a generic data vector $\mathbf{y}$ does not lie in it. There is no exact solution to $X\boldsymbol{\beta} = \mathbf{y}$; the system is overdetermined, exactly the situation Chapter 14's tall-matrix example flagged.
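
Reachability can be checked numerically. The sketch below assumes the five-point data set used in this case study's regression example and relies on a standard fact: $X\boldsymbol{\beta} = \mathbf{y}$ has an exact solution only if appending $\mathbf{y}$ as an extra column leaves the rank unchanged. The helper `is_reachable` and the vector `y_on_line` are hypothetical names introduced for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5.])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])      # noisy data
X = np.column_stack([np.ones(5), x])           # 5 x 2 design matrix: [1 | x]

def is_reachable(X, y):
    """Return True if y lies in C(X): appending y must not raise the rank."""
    rank_X = np.linalg.matrix_rank(X)
    rank_aug = np.linalg.matrix_rank(np.column_stack([X, y]))
    return bool(rank_aug == rank_X)

y_on_line = 1.0 + 2.0 * x                      # made-up noise-free data on a line

print("noisy y in C(X)?     ", is_reachable(X, y))          # False
print("noise-free y in C(X)?", is_reachable(X, y_on_line))  # True
```

Only data that sit exactly on some line land in the plane $C(X)$; any noise at all pushes $\mathbf{y}$ off it.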

So what does regression do? It gives up on hitting $\mathbf{y}$ exactly and instead finds the $\boldsymbol{\beta}$ whose prediction $\hat{\mathbf{y}} = X\boldsymbol{\beta}$ is closest to $\mathbf{y}$ — the orthogonal projection of $\mathbf{y}$ onto the column space (the least-squares method, developed fully in Chapters 17 and 19). The leftover, $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$, is the residual vector: the part of the data the model cannot explain. And here is the payoff that connects straight back to this chapter — the residual vector lives in the left null space $N(X^{\mathsf{T}})$.

Why? Because least squares makes the residual orthogonal to the column space: $X^{\mathsf{T}}\mathbf{r} = \mathbf{0}$, which is exactly the defining equation of $N(X^{\mathsf{T}})$. The data vector $\mathbf{y}$ splits into two perpendicular pieces, the fitted part $\hat{\mathbf{y}} \in C(X)$ and the residual part $\mathbf{r} \in N(X^{\mathsf{T}})$. This is precisely the orthogonal-complement decomposition of the output space that §14.10 previewed; the $\mathbb{R}^m$ of that section is $\mathbb{R}^n$ here, because $X$ has $m = n$ rows. The four subspaces are not background; they are the architecture of the fit.
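
The orthogonality condition follows from a short calculation, sketched here under the usual assumption that least squares minimizes the squared distance $f(\boldsymbol{\beta}) = \lVert \mathbf{y} - X\boldsymbol{\beta} \rVert^2$. Writing $\hat{\boldsymbol{\beta}}$ for the minimizer and $\mathbf{r} = \mathbf{y} - X\hat{\boldsymbol{\beta}}$ for its residual, the gradient must vanish at $\hat{\boldsymbol{\beta}}$:

$$\nabla f(\hat{\boldsymbol{\beta}}) = -2\,X^{\mathsf{T}}\bigl(\mathbf{y} - X\hat{\boldsymbol{\beta}}\bigr) = \mathbf{0} \quad\Longleftrightarrow\quad X^{\mathsf{T}}X\hat{\boldsymbol{\beta}} = X^{\mathsf{T}}\mathbf{y} \quad\Longleftrightarrow\quad X^{\mathsf{T}}\mathbf{r} = \mathbf{0}.$$

The last form says that $\mathbf{r}$ is orthogonal to every column of $X$, which is membership in $N(X^{\mathsf{T}})$.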

# Regression as the four-subspaces decomposition of the data vector.
import numpy as np
np.set_printoptions(precision=4, suppress=True)
x = np.array([1, 2, 3, 4, 5.])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones(5), x])         # 5 x 2 design matrix: [1 | x]
n, p = X.shape
r = np.linalg.matrix_rank(X)
print("n =", n, " p =", p, " rank =", r)                 # 5 2 2 (full column rank)
print("residual degrees of freedom n - p =", n - p)      # 3
print("dim of left null space N(X^T) = m - r =", n - r)  # 3  -- the SAME number

beta, *_ = np.linalg.lstsq(X, y, rcond=None)             # least-squares fit
print("fitted (intercept, slope) =", np.round(beta, 4))  # [0.05 1.99]
resid = y - X @ beta                                      # residual vector
print("residual =", np.round(resid, 4))                  # [ 0.06 -0.13 0.18 -0.21 0.1]
print("X^T @ residual =", np.round(X.T @ resid, 6))       # [0. 0.] -> resid in N(X^T)

The code makes the identity concrete. The design matrix is $5 \times 2$ with full column rank $2$. The residual degrees of freedom, $n - p = 3$, is identical to $\dim N(X^{\mathsf{T}}) = m - r = 5 - 2 = 3$. And X.T @ resid is the zero vector — the residual really is orthogonal to the column space, i.e. it lives in the left null space. The phrase "3 degrees of freedom" means precisely: the residual is free to point anywhere in a 3-dimensional subspace of $\mathbb{R}^5$.
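
That 3-dimensional subspace can be written down explicitly. The sketch below uses the full singular value decomposition to get an orthonormal basis of $N(X^{\mathsf{T}})$; using the SVD is a choice made for this illustration (other routes, such as a QR factorization, would also work), and the variable names are hypothetical.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5.])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones(5), x])             # 5 x 2 design matrix
r = np.linalg.matrix_rank(X)                     # 2

# Full SVD: U is 5 x 5 orthogonal. Its first r columns span C(X),
# and the remaining 5 - r columns span the orthogonal complement N(X^T).
U, s, Vt = np.linalg.svd(X, full_matrices=True)
col_basis = U[:, :r]                             # 5 x 2
left_null_basis = U[:, r:]                       # 5 x 3

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Coordinates of the residual in the left-null-space basis: 3 free numbers.
coords = left_null_basis.T @ resid
rebuilt = left_null_basis @ coords

print("basis shape:", left_null_basis.shape)                     # (5, 3)
print("residual rebuilt from 3 coordinates?", np.allclose(rebuilt, resid))  # True
```

The residual has five entries but only three independent coordinates; the other two are pinned to zero by the fit.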

Reading the degrees-of-freedom count through rank-nullity

Now the rank-nullity theorem makes the whole accounting transparent, and it does so in both of its forms. Think of the data space $\mathbb{R}^n$ as split by the design matrix:

The column space $C(X)$ has dimension $r = p$ (assuming full column rank — independent predictors). This is the model space: $p$ dimensions of fitted values the model can produce. These are the model degrees of freedom.
The left null space $N(X^{\mathsf{T}})$ has dimension $m - r = n - p$. This is the residual space: the directions the model leaves untouched. These are the residual degrees of freedom.

The output-space form of rank-nullity, $\operatorname{rank}(X) + \dim N(X^{\mathsf{T}}) = m$ (with $m = n$ rows here), reads in this language as $$\underbrace{p}_{\text{model d.f.}} + \underbrace{(n - p)}_{\text{residual d.f.}} = \underbrace{n}_{\text{total d.f.}}.$$ That is the decomposition of degrees of freedom that every analysis-of-variance table reports, and it is rank-nullity, nothing more. Each data point contributes one degree of freedom; fitting $p$ parameters "uses up" $p$ of them (the model space); the remaining $n - p$ are free to wiggle as residual. The noise variance is estimated from those residual directions alone, as $\hat{\sigma}^2 = \lVert \mathbf{r} \rVert^2 / (n - p)$, which is why a model with as many parameters as data points ($p = n$) is useless for that purpose: with zero residual degrees of freedom it fits perfectly, explains nothing, and tells you nothing about the noise. The left null space has collapsed to $\{\mathbf{0}\}$, and with it your ability to assess the fit.
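
The collapse at $p = n$ is easy to see numerically. The following is a minimal sketch, assuming the same five data points and the usual estimate of the noise variance as the residual sum of squares divided by the residual degrees of freedom. The helper `residual_summary` and the degree-4 polynomial model are hypothetical, chosen only to get five parameters for five points.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5.])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def residual_summary(X, y):
    """Return (RSS, residual d.f., variance estimate) for a least-squares fit."""
    n = X.shape[0]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    df = n - int(np.linalg.matrix_rank(X))
    sigma2 = rss / df if df > 0 else float("nan")   # undefined with 0 d.f.
    return rss, df, sigma2

X_line = np.column_stack([np.ones(5), x])           # p = 2
X_sat = np.vander(x, N=5, increasing=True)          # p = 5: 1, x, x^2, x^3, x^4

rss_line, df_line, s2_line = residual_summary(X_line, y)
rss_sat, df_sat, s2_sat = residual_summary(X_sat, y)

print("line:      RSS =", round(rss_line, 4), " d.f. =", df_line,
      " sigma^2 =", round(s2_line, 4))              # RSS = 0.107, d.f. = 3, sigma^2 = 0.0357
print("saturated: d.f. =", df_sat, " sigma^2 =", s2_sat)   # d.f. = 0, sigma^2 = nan
```

The saturated model drives the residual to zero up to rounding error, leaving nothing from which to estimate the noise.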

The geometric picture a statistician carries: the $n$-dimensional data space is carved by the design matrix into a small $p$-dimensional model plane and its large $(n-p)$-dimensional orthogonal complement. The fit is the shadow of $\mathbf{y}$ on the plane; the residual is the perpendicular drop to it. "Degrees of freedom" counts dimensions in each piece, and rank-nullity guarantees they add to $n$.
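
The shadow and the perpendicular drop can both be computed with projection matrices. This is a sketch, not the chapter's own construction: it builds the orthogonal projector onto $C(X)$ as `X @ np.linalg.pinv(X)` (statisticians often call it the hat matrix) and uses the fact that the trace of an orthogonal projector equals the dimension of the subspace it projects onto. The names `H` and `M` are assumptions of this example.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5.])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones(5), x])
n, p = X.shape

H = X @ np.linalg.pinv(X)        # orthogonal projector onto C(X), 5 x 5
M = np.eye(n) - H                # orthogonal projector onto N(X^T)

y_hat = H @ y                    # the shadow of y on the model plane
resid = M @ y                    # the perpendicular drop

print("trace(H) =", round(float(np.trace(H)), 6))   # 2.0 -> model d.f.
print("trace(M) =", round(float(np.trace(M)), 6))   # 3.0 -> residual d.f.
print("y_hat + resid == y?", np.allclose(y_hat + resid, y))   # True
print("y_hat orthogonal to resid?", abs(float(y_hat @ resid)) < 1e-9)  # True
```

The two traces are the two degrees-of-freedom counts, and they sum to $n = 5$ because `H + M` is the identity.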

When predictors are redundant: the rank drops

Everything above assumed $X$ has full column rank — that the predictors are linearly independent. What if they are not? Suppose a careless analyst includes the same predictor twice in different units — say, a temperature in Celsius and in Fahrenheit, or (as in Chapter 6's collinearity story) a height in centimeters and in inches. Those two columns are linearly dependent, so the rank of $X$ drops below $p$, and now the other null space — the ordinary null space $N(X)$ — becomes nonzero.

# Redundant predictors drop the rank and open up the null space N(X).
import numpy as np
x = np.array([1, 2, 3, 4, 5.])
# Same predictor twice: x and x in "different units" (here, scaled by 2.54).
X_bad = np.column_stack([np.ones(5), x, 2.54 * x])   # 5 x 3, but rank only 2
p = X_bad.shape[1]
r = np.linalg.matrix_rank(X_bad)
print("p (columns) =", p, " rank =", r)              # 3 2  -> rank-deficient
print("nullity p - r (of N(X)) =", p - r)            # 1  -> a redundant direction
# rank-nullity (input form): rank + nullity = number of columns
print("rank + nullity =", r + (p - r), " == p =", p) # 3 == 3

The third column added a parameter ($p = 3$) but no rank ($r = 2$), so by rank-nullity the null space $N(X)$ now has dimension $p - r = 1$. That nonzero null space is fatal for the regression: it means infinitely many coefficient vectors $\boldsymbol{\beta}$ produce the exact same predictions (add any null-space vector to $\boldsymbol{\beta}$ and $X\boldsymbol{\beta}$ is unchanged), so the coefficients are not identifiable. This is multicollinearity, and the diagnosis is pure Chapter 14: the input-space form of rank-nullity says a deficient rank forces a nonzero null space, and a nonzero null space destroys the uniqueness of the coefficients. The cure is to drop redundant predictors until the columns are independent and $N(X) = \{\mathbf{0}\}$ again.
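
The loss of identifiability can be shown directly. In the sketch below, the null-space vector `v` is read off by hand from the way the third column was built, and `beta_a` and `beta_b` are made-up coefficient vectors used only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5.])
X_bad = np.column_stack([np.ones(5), x, 2.54 * x])   # 5 x 3, rank 2

# Column 3 is 2.54 times column 2, so 2.54*(column 2) - 1*(column 3) = 0.
v = np.array([0.0, 2.54, -1.0])
print("v in N(X)?", np.allclose(X_bad @ v, 0.0))     # True

beta_a = np.array([0.05, 1.99, 0.0])                 # one coefficient vector
beta_b = beta_a + 10.0 * v                           # a very different one
print("beta_a =", beta_a)
print("beta_b =", beta_b)
print("same predictions?", np.allclose(X_bad @ beta_a, X_bad @ beta_b))  # True
```

The data cannot distinguish `beta_a` from `beta_b`, or from any other point on the line `beta_a + t * v`.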

The lesson

Linear regression is the four-subspaces picture wearing a statistician's hat. The design matrix $X$ carries all four spaces, and each has a statistical name. The column space is the model space (its dimension, the rank, is the model degrees of freedom). The left null space is the residual space (its dimension, $n - r$, which equals $n - p$ at full column rank, is the residual degrees of freedom). The null space is the space of unidentifiable parameter combinations (zero when predictors are independent, nonzero under collinearity). The row space, its orthogonal complement in the parameter space, holds the coefficient combinations the data can pin down (all of $\mathbb{R}^p$ when predictors are independent). And the two forms of rank-nullity are the two degrees-of-freedom decompositions every regression reports: $p + (n - p) = n$ in the data space, and $r + (\text{nullity}) = p$ in the parameter space.

The practical upshot is that you can predict the entire degrees-of-freedom bookkeeping of a model before fitting it, from two numbers: the shape of $X$ and its rank. Full column rank means identifiable coefficients and $n - p$ residual degrees of freedom; rank deficiency means non-identifiable coefficients and $n - r$, not $n - p$, residual degrees of freedom. The next time a regression output shows "$\text{df} = n - p$," you will recognize it for what it is: the dimension of the left null space of the design matrix, dictated by the rank-nullity theorem. The abstraction of Chapter 14 is not abstract at all; it is the silent engine of applied statistics, and it returns in full force in Chapter 17, where we build least-squares regression as projection onto the column space.
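
That bookkeeping fits in a few lines. The helper below, `df_bookkeeping`, is a hypothetical function written for this sketch; it needs only the shape and rank of a design matrix and never looks at the response $\mathbf{y}$.

```python
import numpy as np

def df_bookkeeping(X):
    """Degrees-of-freedom accounting from the shape and rank of X alone."""
    n, p = X.shape
    r = int(np.linalg.matrix_rank(X))
    return {
        "model_df": r,            # dim C(X)
        "residual_df": n - r,     # dim N(X^T)
        "nullity": p - r,         # dim N(X)
        "identifiable": r == p,   # True only at full column rank
    }

x = np.array([1, 2, 3, 4, 5.])
X_good = np.column_stack([np.ones(5), x])              # independent predictors
X_bad = np.column_stack([np.ones(5), x, 2.54 * x])     # one redundant predictor

print(df_bookkeeping(X_good))
# {'model_df': 2, 'residual_df': 3, 'nullity': 0, 'identifiable': True}
print(df_bookkeeping(X_bad))
# {'model_df': 2, 'residual_df': 3, 'nullity': 1, 'identifiable': False}
```

Adding the redundant column changes neither the model nor the residual degrees of freedom; it only opens up a one-dimensional null space.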