Case Study 1 — When Features Are Redundant: Linear Independence and Collinearity in Data Science

DataField.Dev

Field: data science / statistics. Concepts used: linear independence, span, rank, uniqueness of coordinates. Anchor tie-in: this case belongs to the "feature spaces in machine learning" story, in which the dimensionality of a feature space is governed by how many independent features it really has.

The setup: a model that won't behave

Imagine you are a data scientist at a health-tech startup, building a simple linear model to predict a person's resting metabolic rate from a few body measurements. You collect three features for each of 200 people — height_cm, height_inches, and weight_kg — and you stack them, with a column of ones for the intercept, into a design matrix $X$ with one row per person and one column per feature. Then you hand $X$ to a least-squares solver and get back nonsense: coefficients that are astronomically large, that flip sign and explode when you add one more data point, and that no two software packages agree on.

Nothing is wrong with your data in the ordinary sense — there are no typos, no missing values. The problem is linear-algebraic, and it is exactly the subject of this chapter. Two of your feature columns are linearly dependent, and that dependence quietly poisons the entire model. Understanding why turns an inscrutable numerical failure into an obvious, fixable mistake.

The dependence hiding in plain sight

Look at the two height columns. A person's height in inches is just their height in centimeters divided by $2.54$ — every entry, exactly. As vectors (columns of $X$), that means $$\text{height\_inches} = \tfrac{1}{2.54}\,\text{height\_cm}.$$ One column is a scalar multiple of the other: they are parallel, hence linearly dependent, exactly like the pair $(1, 2)$ and $(3, 6)$ from §6.5. The two columns carry identical information — knowing a person's height in centimeters tells you their height in inches with zero new content. In the language of this chapter, the second feature is redundant: it lies in the span of the first.

Let's confirm the dependence with a rank check, the one-line diagnostic from §6.5.

# A design matrix with two dependent (redundant) feature columns.
import numpy as np
np.random.seed(0)
n = 200
height_cm = np.random.normal(170, 10, n)
height_in = height_cm / 2.54                 # EXACTLY a multiple of height_cm
weight    = np.random.normal(70, 12, n)
X = np.column_stack([np.ones(n), height_cm, height_in, weight])  # intercept + 3 features = 4 columns
print("design matrix shape:", X.shape)        # (200, 4)
print("rank(X) =", np.linalg.matrix_rank(X))  # 3  -- NOT 4!
# Just the two height columns, on their own:
print("rank of the two height columns =",
      np.linalg.matrix_rank(np.column_stack([height_cm, height_in])))  # 1

The design matrix has four columns but rank 3, not 4. One column is doing no independent work. Isolating the two height columns, their rank is 1 — they span only a line, confirming they are parallel. The feature space you think is 4-dimensional is really only 3-dimensional; you have described a 3-dimensional space with 4 vectors, and one of them is wasted.

Why dependence breaks the model: coordinates stop being unique

Here is the deep reason the model misbehaves, and it comes straight from §6.10. A linear model tries to write its prediction of the target (metabolic rate) as a linear combination of the feature columns, and the regression coefficients are the coordinates of that prediction in the feature "basis." But uniqueness of coordinates requires the columns to be independent. When two columns are dependent, the coordinates are no longer unique: there are infinitely many coefficient vectors that produce the exact same predictions.
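
The argument can be written out in symbols. The notation is an assumption made for illustration, since the text does not fix any: $\beta$ stands for a coefficient vector and $v$ for a nonzero vector with $Xv = 0$, which exists exactly when the columns of $X$ are dependent. Then for every scalar $t$, $$X(\beta + t\,v) = X\beta + t\,Xv = X\beta,$$ so the whole line of coefficient vectors $\beta + t\,v$ yields the same predictions. For the design matrix built earlier, with columns ordered (ones, height_cm, height_in, weight), one such vector is $v = (0,\ 1,\ -2.54,\ 0)$, because $\text{height\_cm} - 2.54\,\text{height\_inches} = 0$.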

Concretely, suppose the true relationship needs "$2$ units of height." Because height_in $= \tfrac{1}{2.54}\,$height_cm, the model can satisfy this with $2$ on the cm column and $0$ on the inches column, or $0$ on cm and $2 \times 2.54 = 5.08$ on inches, or any of infinitely many other splits (for example $1$ on cm and $2.54$ on inches, or huge coefficients of opposite sign that nearly cancel), all giving identical fitted values. Nothing in the data favors one split over another, so the numbers a solver reports depend on its algorithm and on rounding error, which is why they can be wild, unstable, and different from package to package. This is the practical face of "dependence destroys uniqueness of coordinates": the regression problem has no single answer.
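
The splits described here can be checked numerically. The following is a minimal sketch, not part of the original case study: the intercept value 5.0 and the weight coefficient 1.0 are arbitrary illustrative choices, and the data is regenerated with `np.random.default_rng` rather than taken from a real study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
height_cm = rng.normal(170, 10, n)
height_in = height_cm / 2.54          # exact multiple of height_cm
weight = rng.normal(70, 12, n)

# Column order: intercept, height_cm, height_in, weight
X = np.column_stack([np.ones(n), height_cm, height_in, weight])

# Three different coefficient vectors that all encode "2 units of height".
# The intercept (5.0) and weight coefficient (1.0) are illustrative assumptions.
beta_cm = np.array([5.0, 2.0, 0.0, 1.0])    # all of it on the cm column
beta_in = np.array([5.0, 0.0, 5.08, 1.0])   # all of it on the inches column
beta_mix = np.array([5.0, 1.0, 2.54, 1.0])  # split between the two

pred_cm = X @ beta_cm
pred_in = X @ beta_in
pred_mix = X @ beta_mix

print("cm vs inches predictions agree:", np.allclose(pred_cm, pred_in))
print("cm vs mixed predictions agree: ", np.allclose(pred_cm, pred_mix))

# The difference of two such coefficient vectors is sent to (numerically) zero by X.
print("max |X @ (beta_cm - beta_in)| =", np.abs(X @ (beta_cm - beta_in)).max())
```

The three coefficient vectors differ, yet the fitted values coincide up to floating-point rounding, which is the non-uniqueness in question.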

The geometric picture: with independent features, the target's shadow on the feature space lands at one well-defined point with one set of coordinates. With a redundant feature, the "coordinate axes" overlap, and the same point has endlessly many coordinate readings — the model cannot pin down which to report.

Exact versus near dependence: the condition number

Real data rarely has exactly duplicated columns; more often two features are almost dependent — height in centimeters and height in inches recorded with slight rounding, or two lab measurements that track each other at $0.999$ correlation. Then the rank is technically full, but the matrix is ill-conditioned: nearly rank-deficient, and almost as troublesome. The standard measure is the condition number (Chapter 38 develops this properly), which blows up as columns approach dependence.

# Exact dependence vs. near-dependence vs. fixed: watch the condition number.
import numpy as np
np.random.seed(0)
n = 200
height_cm = np.random.normal(170, 10, n)
weight    = np.random.normal(70, 12, n)
height_in_exact = height_cm / 2.54
height_in_noisy = height_cm / 2.54 + np.random.normal(0, 1e-3, n)   # almost dependent

X_exact = np.column_stack([np.ones(n), height_cm, height_in_exact, weight])
X_noisy = np.column_stack([np.ones(n), height_cm, height_in_noisy, weight])
X_fixed = np.column_stack([np.ones(n), height_cm, weight])           # drop the duplicate

print("rank exact :", np.linalg.matrix_rank(X_exact),
      " cond = {:.1e}".format(np.linalg.cond(X_exact)))   # rank 3, cond ~1e16 (singular)
print("rank noisy :", np.linalg.matrix_rank(X_noisy),
      " cond = {:.1e}".format(np.linalg.cond(X_noisy)))   # rank 4, cond ~2e5 (ill-conditioned)
print("rank fixed :", np.linalg.matrix_rank(X_fixed),
      " cond = {:.1e}".format(np.linalg.cond(X_fixed)))   # rank 3, cond ~3e3 (healthy)

The exact-duplicate matrix has rank 3 and a condition number around $10^{16}$: effectively singular, numerically hopeless. The near-duplicate matrix is technically full-rank, but its condition number near $2 \times 10^5$ still signals trouble: small changes in the data produce large swings in the coefficients. Dropping the redundant column leaves three columns of rank 3, that is, full column rank, with a healthy condition number around $3 \times 10^3$, and the model behaves. The cure is to remove dependent features so the remaining columns are independent.
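
The claim that small changes in the data produce large swings in the coefficients can be probed directly. The sketch below is an illustration under stated assumptions: the target `y` is invented (the case study gives no metabolic-rate data), the helper names `fit` and `height_cm_swing` are hypothetical, and the perturbation is small random noise added to `y`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
height_cm = rng.normal(170, 10, n)
weight = rng.normal(70, 12, n)
height_in_noisy = height_cm / 2.54 + rng.normal(0, 1e-3, n)  # almost dependent

X_noisy = np.column_stack([np.ones(n), height_cm, height_in_noisy, weight])
X_fixed = np.column_stack([np.ones(n), height_cm, weight])

# Hypothetical target, invented purely for illustration.
y = 500 + 5 * height_cm + 10 * weight + rng.normal(0, 20, n)

def fit(X, y):
    """Ordinary least-squares coefficients via np.linalg.lstsq."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def height_cm_swing(X, y, trials=20, scale=1.0):
    """Largest change in the height_cm coefficient (column 1) over
    `trials` refits, each with small random noise added to y."""
    base = fit(X, y)[1]
    swings = []
    for _ in range(trials):
        y_perturbed = y + rng.normal(0, scale, n)
        swings.append(abs(fit(X, y_perturbed)[1] - base))
    return max(swings)

swing_noisy = height_cm_swing(X_noisy, y)
swing_fixed = height_cm_swing(X_fixed, y)
print("height_cm coefficient swing, near-duplicate design:", swing_noisy)
print("height_cm coefficient swing, duplicate dropped:    ", swing_fixed)
```

In the near-duplicate design the height_cm coefficient should move by orders of magnitude more than in the reduced design, even though the perturbation to `y` is the same size in both cases.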

The lesson, and what a data scientist actually does

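This phenomenon has a name in statistics, multicollinearity, and it is one of the most common practical failures in linear modeling. The two height columns are an obvious case, but collinearity sneaks in subtly: total income included alongside all of its components, or one-hot encoded categories whose columns sum to the intercept column. Each of these is an exact linear dependence among feature columns, and each is detected by the same rank check you learned in §6.5. A "BMI" feature alongside the height and weight it is computed from is a subtler cousin: BMI is a nonlinear function of the two, so the rank does not drop, but the columns can still be strongly correlated, and the symptom to look for is then near-dependence of the kind described in the previous section rather than a rank deficit.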
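
The one-hot case can be checked in a few lines. This is a minimal sketch, assuming a made-up categorical feature with three levels; the labels and the hand-rolled encoding are illustrative and do not come from the case study.

```python
import numpy as np

# Hypothetical categorical feature with three levels.
labels = np.array(["A", "B", "C", "A", "B", "C", "A", "C"])
levels = np.unique(labels)

# One indicator column per level: entry is 1.0 where the label matches.
one_hot = (labels[:, None] == levels[None, :]).astype(float)

intercept = np.ones((len(labels), 1))
X_full = np.hstack([intercept, one_hot])            # intercept + all three indicators
X_dropped = np.hstack([intercept, one_hot[:, 1:]])  # intercept + two indicators

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print("indicator columns sum to the intercept:",
      np.allclose(one_hot.sum(axis=1), 1.0))
print("full design:    columns =", X_full.shape[1], " rank =", rank_full)
print("one level dropped: columns =", X_dropped.shape[1], " rank =", rank_dropped)
```

Because the indicator columns add up to the column of ones, the full design has one more column than its rank; dropping any one indicator restores full column rank.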

A practicing data scientist therefore treats the feature matrix as a set of vectors and asks the chapter's central question: are these independent, or is some feature redundant? The workflow is exactly the linear algebra of this chapter:

Span: the feature columns span a subspace of $\mathbb{R}^{200}$ — the set of all predictions the model can possibly make. Adding a redundant feature does not enlarge that span (§6.3's pitfall: more vectors need not mean a bigger span), so it adds capability nowhere.
Independence: check np.linalg.matrix_rank(X) against the number of columns. If rank $<$ columns, some feature is a linear combination of the others — find it (via the null space, as in §6.5) and drop it.
Basis: the surviving independent columns form a basis for the feature subspace, and now the regression coefficients are unique — the model is identifiable and stable.
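
The "find it and drop it" step can be sketched with a singular value decomposition, which is one common way to compute a null space numerically; the method §6.5 uses may differ. The column names, the `1e-8` threshold for treating an entry as nonzero, and the choice to drop the last involved column are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
height_cm = rng.normal(170, 10, n)
height_in = height_cm / 2.54
weight = rng.normal(70, 12, n)

names = ["intercept", "height_cm", "height_in", "weight"]
X = np.column_stack([np.ones(n), height_cm, height_in, weight])

# Right singular vectors whose singular values are numerically zero
# span the null space of X.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
tol = s.max() * max(X.shape) * np.finfo(X.dtype).eps
null_vectors = Vt[s <= tol]

# Nonzero entries of a null vector mark the columns taking part in a dependence.
v = null_vectors[0]
involved_idx = np.flatnonzero(np.abs(v) > 1e-8)
involved = [names[j] for j in involved_idx]
print("columns in the dependence:", involved)
print("ratio of their coefficients:", v[involved_idx[1]] / v[involved_idx[0]])

# Drop one of the involved columns (here, arbitrarily, the last one).
drop = involved_idx[-1]
X_reduced = np.delete(X, drop, axis=1)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print("dropped:", names[drop])
print("reduced design: columns =", X_reduced.shape[1], " rank =", rank_reduced)
```

Which of the involved columns to drop is a modeling decision, not a linear-algebra one: either choice leaves the same span, so the set of achievable predictions is unchanged.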

The payoff is conceptual as much as practical. A "feature space" is not just a bag of columns; it is a vector space, and its true dimensionality is the rank: the number of independent features, not the number of features you happened to collect. Two researchers can describe the same data with different feature sets; what is invariant is the dimension of the span. That is the same insight that makes a basis the right object to coordinatize with (§6.10), and it scales all the way up to the high-dimensional embeddings of modern machine learning, where "intrinsic dimensionality" (how many directions the data really varies in) is, for noise-free data, the rank of a data matrix and, for real data, an approximate version of it, which is the subject of principal component analysis in Chapter 32.