Case Study 28.1 — The Shape of Data: Covariance, Mahalanobis Distance, and Outlier Detection
Field: statistics & data science. Anchor tie-in: this is the elliptical-contour anchor of Figure 28.1, now describing the spread of data rather than the level sets of a bowl; it builds directly on covariance matrices and previews PCA (Chapter 32).
The problem
A fraud-detection team at a bank monitors two features of each transaction: the dollar amount and the time-of-day deviation from the account's usual pattern. Most legitimate transactions cluster together, but the cluster is not a circle: large amounts tend to happen at unusual hours, so the two features are correlated, and the cloud of normal transactions forms a tilted ellipse. The team needs an automatic rule for "how unusual is this transaction?" that respects the shape of the cloud. A transaction that is moderately large and at a moderately odd hour might be perfectly normal (it sits along the long axis of the ellipse), while one that is large but at a typical hour might be a glaring outlier (it juts out along the short axis). Plain Euclidean distance from the mean cannot tell these apart, because it treats all directions alike. The right tool is the Mahalanobis distance, whose square is a quadratic form built from the inverse of a positive definite covariance matrix.
This is the central lesson of the chapter applied to real data: the shape of a data cloud is a positive (semi)definite matrix, and measuring distance "correctly" means measuring it in the metric that matrix defines. The covariance matrix's positive definiteness is not a technicality — it is exactly what guarantees the distance is real and the cloud is a genuine bounded ellipse rather than an impossible saddle.
The covariance matrix and why it is positive definite
Suppose the team estimates, from a large sample of normal transactions (with features centered to mean zero), the covariance matrix $$\Sigma = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}.$$ The diagonal entries are the variances of the two features (amount has variance 4, hour-deviation has variance 3); the off-diagonal entry 2 is their covariance — positive, so the two features rise together. As §28.7 proved, any covariance matrix is positive semidefinite, and this one is strictly positive definite because both features genuinely vary and no exact linear dependence ties them together. Let us confirm it three ways and read off its geometry.
# The covariance matrix is symmetric positive definite; its eigenvectors are the data axes.
import numpy as np
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
print("symmetric?", np.allclose(Sigma, Sigma.T))
print("eigenvalues (all > 0 => PD):", np.round(np.linalg.eigvalsh(Sigma), 4))
print("leading minors:", Sigma[0,0], np.linalg.det(Sigma)) # Sylvester: 4 and 8
w, V = np.linalg.eigh(Sigma)
print("eigenvectors (columns):\n", np.round(V, 4))
symmetric? True
eigenvalues (all > 0 => PD): [1.4384 5.5616]
leading minors: 4.0 7.999999999999998
eigenvectors (columns):
 [[ 0.6154 -0.7882]
 [-0.7882 -0.6154]]
All three checks pass: $\Sigma$ is symmetric, its eigenvalues $1.44$ and $5.56$ are positive, and Sylvester's leading minors $4$ and $8$ are positive, so $\Sigma$ is positive definite. (The determinant printing as $7.999\ldots$ rather than exactly $8$ is the usual floating-point rounding from np.linalg.det.) The geometry is now readable: the data cloud is an ellipse whose long axis points along the eigenvector for the larger eigenvalue $5.56$, roughly the direction $(0.79, 0.62)$, the up-and-to-the-right diagonal along which amount and hour-deviation increase together. (The printed column $(-0.79, -0.62)$ spans the same axis, since the sign of an eigenvector is arbitrary.) The short axis is the perpendicular eigenvector. The cloud is stretched along the correlated direction and squeezed across it, exactly the tilted ellipse the team observed. This is the data version of Figure 28.1: same ellipse, same eigenvector axes, same $\sqrt{\lambda}$ extents, now drawn by the data rather than by a bowl's contours.
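To see the claim that the eigenvectors are the data axes on an actual point cloud, the following minimal sketch draws synthetic transactions and re-estimates the covariance from them. The Gaussian sampling model, the sample size, and the seed are assumptions made purely for illustration; the bank's real data is not part of this case study.

```python
import numpy as np

# Hypothetical stand-in for the bank's data: synthetic "normal" transactions
# drawn from a zero-mean Gaussian with the case study's covariance.
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
rng = np.random.default_rng(0)            # seed chosen arbitrarily
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=50_000)

# Estimate the covariance from the sample (each row is one observation).
Sigma_hat = np.cov(X, rowvar=False)

# Principal axes and variances of the estimated cloud.
w_hat, V_hat = np.linalg.eigh(Sigma_hat)  # eigenvalues in ascending order
long_axis = V_hat[:, -1]                  # eigenvector of the largest eigenvalue

# The sample variance of the data projected onto the long axis
# equals the largest eigenvalue of the estimated covariance.
var_along = np.var(X @ long_axis, ddof=1)

print("estimated covariance:\n", np.round(Sigma_hat, 2))
print("estimated eigenvalues:", np.round(w_hat, 2))
print("variance along long axis:", np.round(var_along, 2))
```

With a sample this large the estimated matrix should land close to $\Sigma$, and the projected variance matches the top eigenvalue of the estimate up to rounding, which is the sense in which an eigenvalue is "the variance along its axis."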
Mahalanobis distance: a quadratic form that respects the shape
The Mahalanobis distance of a point $\mathbf{x}$ from the mean $\boldsymbol\mu$ measures distance in units of standard deviations along each principal direction: $$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol\mu)^{\mathsf{T}}\,\Sigma^{-1}\,(\mathbf{x} - \boldsymbol\mu)}.$$ The squared distance is a quadratic form $\mathbf{r}^{\mathsf{T}}\Sigma^{-1}\mathbf{r}$ in the residual $\mathbf{r} = \mathbf{x} - \boldsymbol\mu$, and the matrix of the form is $\Sigma^{-1}$. Here positive definiteness pays off twice. First, because $\Sigma$ is positive definite, so is $\Sigma^{-1}$ (Exercise 28.23: the eigenvalues invert but keep their positive sign), so $D_M$ is the square root of a quantity that is strictly positive whenever $\mathbf{x} \neq \boldsymbol\mu$, and is a genuine distance. Second, the inverse is exactly the right move geometrically: where $\Sigma$ stretches the cloud along its long axis, $\Sigma^{-1}$ compresses that direction, so a long step along the stretched axis counts as a small Mahalanobis distance, while the same-length step across the cloud counts as large. The metric automatically discounts variation in the directions where the data naturally spreads.
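The phrase "units of standard deviations along each principal direction" can be made concrete. Writing $\Sigma = V \Lambda V^{\mathsf{T}}$ with orthonormal eigenvectors $\mathbf{v}_i$ and eigenvalues $\lambda_i$, the quadratic form splits as $\mathbf{r}^{\mathsf{T}}\Sigma^{-1}\mathbf{r} = \sum_i (\mathbf{v}_i^{\mathsf{T}}\mathbf{r})^2 / \lambda_i$, so $D_M$ is the ordinary Euclidean length of the residual after rotating into the eigenbasis and dividing each coordinate by its standard deviation $\sqrt{\lambda_i}$. The sketch below checks this numerically; the function names and the test residual are illustrative choices, not part of the original text.

```python
import numpy as np

Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
w, V = np.linalg.eigh(Sigma)   # Sigma = V @ diag(w) @ V.T

def mahalanobis_direct(r, Sigma):
    """sqrt(r^T Sigma^-1 r), using a linear solve instead of an explicit inverse."""
    return float(np.sqrt(r @ np.linalg.solve(Sigma, r)))

def mahalanobis_whitened(r, w, V):
    """Rotate r into the eigenbasis, rescale each coordinate by 1/sqrt(lambda_i),
    then take the plain Euclidean norm."""
    z = (V.T @ r) / np.sqrt(w)
    return float(np.linalg.norm(z))

r = np.array([1.0, -2.0])      # arbitrary residual x - mu, chosen for illustration
d_direct = mahalanobis_direct(r, Sigma)
d_white = mahalanobis_whitened(r, w, V)
print(f"direct   = {d_direct:.6f}")
print(f"whitened = {d_white:.6f}")
```

The two numbers agree to rounding. Using `np.linalg.solve` rather than forming $\Sigma^{-1}$ is a common numerical preference, not something the derivation requires.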
Let us compare two points that are identically far from the mean in ordinary Euclidean terms but worlds apart in Mahalanobis terms. Take $\boldsymbol\mu = \mathbf{0}$ and the two points $\mathbf{a} = (2, 0)$ (large amount, typical hour) and $\mathbf{b} = (0, 2)$ (typical amount, odd hour). Both sit at Euclidean distance 2 from the center.
# Mahalanobis distance respects the cloud's shape; Euclidean distance does not.
import numpy as np
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
Si = np.linalg.inv(Sigma)
print("Sigma^-1 =\n", np.round(Si, 4))
for name, x in [("a = (2,0)", np.array([2.0, 0.0])),
                ("b = (0,2)", np.array([0.0, 2.0]))]:
    eucl = np.linalg.norm(x)    # ordinary Euclidean length
    maha = np.sqrt(x @ Si @ x)  # sqrt of the quadratic form x^T Sigma^-1 x (mu = 0)
    print(f"{name}: Euclidean = {eucl:.4f}, Mahalanobis = {maha:.4f}")
Sigma^-1 =
 [[ 0.375 -0.25 ]
 [-0.25   0.5  ]]
a = (2,0): Euclidean = 2.0000, Mahalanobis = 1.2247
b = (0,2): Euclidean = 2.0000, Mahalanobis = 1.4142
Euclidean distance calls both points equally far (both $2.0$). Mahalanobis distance does not: point $\mathbf{a}$ scores $1.22$ while point $\mathbf{b}$ scores $1.41$. Point $\mathbf{b}$ — an unusual hour with an ordinary amount — is the more anomalous of the two, because it lies more across the grain of the correlated cloud. The difference is modest in this gentle example, but with strongly correlated features (a long thin ellipse) the gap becomes dramatic, and that gap is precisely what lets the system flag the right transactions. The quadratic form has taught the distance metric the shape of normal behavior.
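To illustrate how the gap widens for a long thin ellipse, here is a small sketch with a hypothetical covariance whose two unit-variance features have correlation $0.99$. The matrix and the two test points are assumptions chosen for illustration. One point lies along the correlated diagonal, the other across it, and both are at Euclidean distance 2 from the mean.

```python
import numpy as np

rho = 0.99                                   # hypothetical strong correlation
Sigma_thin = np.array([[1.0, rho],
                       [rho, 1.0]])          # eigenvalues are 1 + rho and 1 - rho

# Two points at Euclidean distance 2: one along the long axis, one across it.
along = 2.0 * np.array([1.0, 1.0]) / np.sqrt(2.0)
across = 2.0 * np.array([1.0, -1.0]) / np.sqrt(2.0)

def mahalanobis(x, Sigma):
    """Mahalanobis distance from the origin (mean assumed to be zero)."""
    return float(np.sqrt(x @ np.linalg.solve(Sigma, x)))

for name, x in [("along ", along), ("across", across)]:
    print(f"{name}: Euclidean = {np.linalg.norm(x):.4f}, "
          f"Mahalanobis = {mahalanobis(x, Sigma_thin):.4f}")
```

Because each point lies exactly on an eigenvector, its Mahalanobis distance is $2/\sqrt{\lambda}$ for the corresponding eigenvalue: about $1.4$ standard deviations along the cloud ($\lambda = 1.99$) versus $20$ across it ($\lambda = 0.01$).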
The Mahalanobis circle is an ellipse — the anchor returns
What does "all points at Mahalanobis distance 1" look like? It is the level set $\mathbf{r}^{\mathsf{T}}\Sigma^{-1}\mathbf{r} = 1$, a quadratic form equal to a constant, which §28.5 tells us is an ellipse. Its axes are the eigenvectors of $\Sigma^{-1}$, which are the same eigenvectors as $\Sigma$ (an inverse keeps the eigenvectors and inverts the eigenvalues). Its half-lengths are $\sqrt{\lambda_i}$, where the $\lambda_i$ are the eigenvalues of $\Sigma$ itself, not of $\Sigma^{-1}$. So the unit-Mahalanobis ellipse is stretched along the long axis of the data cloud, with extent set by the standard deviation $\sqrt{5.56} \approx 2.36$ in that direction and $\sqrt{1.44} \approx 1.20$ across it. This is identical to the confidence-ellipse picture of §28.5's Real-World Application: the curve of constant Mahalanobis distance is the error ellipse, the contour of equal statistical "unusualness," and it is a level set of a positive definite quadratic form read exactly as we read Figure 28.1.
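The description above can be checked by constructing the ellipse explicitly. The sketch below, an illustration under the assumption $\boldsymbol\mu = \mathbf{0}$, maps the unit circle through $V\Lambda^{1/2}$ and confirms that every resulting point has Mahalanobis distance 1 and that the longest and shortest radii are $\sqrt{\lambda_{\max}}$ and $\sqrt{\lambda_{\min}}$.

```python
import numpy as np

Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
w, V = np.linalg.eigh(Sigma)                     # ascending eigenvalues

# Unit circle sampled in 1-degree steps, one point per column: shape (2, 361).
t = np.linspace(0.0, 2.0 * np.pi, 361)
circle = np.vstack([np.cos(t), np.sin(t)])

# Scale each eigen-coordinate by sqrt(lambda_i), then rotate by V.
ellipse = V @ (np.sqrt(w)[:, None] * circle)

# Mahalanobis distance of every column from the origin.
solved = np.linalg.solve(Sigma, ellipse)         # Sigma^-1 applied to each column
d = np.sqrt(np.einsum("ij,ij->j", ellipse, solved))

radii = np.linalg.norm(ellipse, axis=0)          # Euclidean distance of each point
print("all Mahalanobis distances equal 1?", np.allclose(d, 1.0))
print("longest half-axis :", np.round(radii.max(), 4))
print("shortest half-axis:", np.round(radii.min(), 4))
```

The same array `ellipse` could be passed to a plotting library to draw the contour; scaling it by a constant $c$ gives the curve at Mahalanobis distance $c$.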
This closes the loop with the chapter's anchor. The bowl's contour ellipses, the confidence ellipses of an estimate, the spread ellipse of a data cloud, and the equal-Mahalanobis-distance curves are all the same geometric object — level sets of a positive definite quadratic form, with eigenvector axes and $\sqrt{\lambda}$ extents. Once you see one, you can read all of them.
Why definiteness is the load-bearing assumption
It is worth being explicit about where positive definiteness does the work, because every step quietly depends on it. If $\Sigma$ were merely positive semidefinite (if one feature were an exact linear combination of the other, giving a zero eigenvalue), then $\Sigma^{-1}$ would not exist, the Mahalanobis distance would be undefined, and the data cloud would collapse onto a line (the flat direction of §28.3.1). Real datasets brush against this constantly: two nearly-redundant features make $\Sigma$ nearly singular (a tiny smallest eigenvalue, a very long thin ellipse), and then $\Sigma^{-1}$ has a huge largest eigenvalue and the Mahalanobis distance becomes numerically unstable, because small errors in the data or in the estimate of $\Sigma$ are amplified by the large condition number (in the language of §28.6's Real-World Application and Chapter 38). Practitioners guard against this by regularizing the covariance, adding a small multiple of the identity, $\Sigma + \varepsilon I$ with $\varepsilon > 0$, which bumps every eigenvalue up by $\varepsilon$ and so guarantees strict positive definiteness, and hence invertibility and a condition number of at most $(\lambda_{\max} + \varepsilon)/\varepsilon$. That fix is the same "add $\varepsilon I$ to make it positive definite" trick that appears in ridge regression, Gaussian-process models, and the normal equations of least squares: a direct application of the fact that adding $\varepsilon I$ shifts the spectrum into the strictly-positive range.
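A minimal sketch of the regularization step follows. The near-singular matrix and the value of $\varepsilon$ are hypothetical, picked only to make the effect visible; in practice the choice of $\varepsilon$ depends on the data and is not prescribed here.

```python
import numpy as np

# Hypothetical covariance of two nearly redundant unit-variance features.
Sigma_bad = np.array([[1.0, 0.999999],
                      [0.999999, 1.0]])
eps = 1e-3                                   # illustrative regularization strength
Sigma_reg = Sigma_bad + eps * np.eye(2)      # shift every eigenvalue up by eps

w_bad = np.linalg.eigvalsh(Sigma_bad)        # ascending eigenvalues
w_reg = np.linalg.eigvalsh(Sigma_reg)

# For a symmetric positive definite matrix the 2-norm condition number
# is the ratio of largest to smallest eigenvalue.
cond_bad = w_bad[-1] / w_bad[0]
cond_reg = w_reg[-1] / w_reg[0]

print("eigenvalues before:", w_bad, " condition number: %.3e" % cond_bad)
print("eigenvalues after :", w_reg, " condition number: %.3e" % cond_reg)
```

The eigenvectors are untouched by the shift, so the ellipse keeps its orientation; only its thinnest direction is widened. The price is a small bias: distances across the cloud are reported as somewhat smaller than the unregularized matrix would give.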
Takeaways
- A covariance matrix is symmetric positive semidefinite, and positive definite exactly when the data spreads in every direction (no exact linear dependence among features); its eigenvectors are the principal axes of the data ellipse and its eigenvalues are the variances along them.
- The Mahalanobis distance $\sqrt{\mathbf{r}^{\mathsf{T}}\Sigma^{-1}\mathbf{r}}$ is the square root of a positive definite quadratic form. It measures distance in the metric of the data, discounting variation along the cloud's long axis, which makes it the right notion of "unusualness" for correlated features and a tool plain Euclidean distance cannot replace.
- The curve of constant Mahalanobis distance is an ellipse — the same level-set-of-a-positive-definite-form anchor as Figure 28.1 — with eigenvector axes and standard-deviation extents.
- Positive definiteness is the load-bearing assumption: it guarantees $\Sigma^{-1}$ exists and the distance is real; near-singular covariance (near-zero eigenvalue) is cured by regularizing to $\Sigma + \varepsilon I$, the same spectrum-shifting trick used across statistics and machine learning.