Case Study 31.2 — Separating Motion from Stillness: SVD Background Removal in Video
Field: computer vision & graphics / signal processing. Anchor tie-in: this is the denoising idea of §31.7 turned into a tool — the background-removal application named there, worked end to end. Where the chapter's image-compression anchor discarded small singular values to save space, here we split a video at the spectral gap to separate the static scene from everything that moves.
The problem: what is moving in this scene?
Point a fixed security camera at a parking lot and record for an hour. Almost nothing in the resulting video changes: the asphalt, the painted lines, the building behind, the parked cars all sit perfectly still, frame after frame. The interesting content — a person walking through, a car pulling in — is a small, transient deviation from that unchanging backdrop. A foundational task in computer vision is background subtraction: automatically separate the static background (the still scene) from the dynamic foreground (the moving objects), so that downstream systems can detect, track, and count whatever moves. The same operation cleans up time-lapse photography, isolates a presenter from a fixed slide, and removes the static structure from medical and scientific imaging.
The connection to linear algebra appears the moment we decide how to store the video as a matrix. Flatten each frame — an $H \times W$ grid of pixels — into a single long column of $H W$ numbers, and stack the columns side by side, one per frame. A one-minute clip at $30$ frames per second becomes a matrix $A$ with $H W$ rows and $1800$ columns. And now the structure of the scene becomes the structure of the matrix, in a way that the SVD reads off immediately.
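The flattening step can be sketched in a few lines. This is a minimal illustration, assuming grayscale frames stored as a NumPy array of shape `(n_frames, H, W)`; the helper names `frames_to_matrix` and `column_to_frame` are hypothetical, introduced here only for the example.

```python
import numpy as np

def frames_to_matrix(frames):
    """Turn a (n_frames, H, W) stack of grayscale frames into an (H*W, n_frames) matrix.

    Each frame is flattened row by row and becomes one column.
    """
    n_frames = frames.shape[0]
    return frames.reshape(n_frames, -1).T

def column_to_frame(column, shape):
    """Undo the flattening for a single column, given the frame shape (H, W)."""
    return column.reshape(shape)

# A tiny made-up clip: 5 frames of 4 x 6 pixels.
H, W, n_frames = 4, 6, 5
frames = np.arange(n_frames * H * W, dtype=float).reshape(n_frames, H, W)

A = frames_to_matrix(frames)
restored = column_to_frame(A[:, 2], (H, W))

print(A.shape)                               # (24, 5)
print(np.array_equal(restored, frames[2]))   # True
```

The only design choice that matters is consistency: whatever ordering flattens a frame into a column must be the one used to turn a column back into an image.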
Why the background is low rank and the foreground is not
Here is the key observation, and it is the §31.7 signal-versus-noise picture in a new guise. The background is nearly rank 1. Because the static scene is identical in every frame, every column of the background matrix is the same vector, the flattened background image, repeated. A matrix whose $n$ columns are all copies of one vector $\mathbf{b}$ is exactly $\mathbf{b}\mathbf{1}^{\mathsf{T}}$, an outer product: rank 1, with the single nonzero singular value $\sigma_1 = \|\mathbf{b}\|\sqrt{n}$. Even with the small lighting flicker and sensor noise of a real camera, the background stays nearly rank 1, concentrating almost all of the video's energy into a single dominant singular value whose left singular vector is the background image rescaled to unit length (up to sign).
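A quick numerical check of the rank-1 claim may help. The sketch below uses a made-up background vector `b` and frame count `n` (both illustrative, not taken from the chapter) and confirms that $\mathbf{b}\mathbf{1}^{\mathsf{T}}$ has rank 1, that its one nonzero singular value is $\|\mathbf{b}\|\sqrt{n}$, and that its leading left singular vector is $\mathbf{b}$ scaled to unit length.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.uniform(0.2, 0.8, 50)      # hypothetical flattened background image
n = 12                             # hypothetical number of frames
B = np.outer(b, np.ones(n))        # every column of B is a copy of b

U, s, Vt = np.linalg.svd(B, full_matrices=False)

rank = np.linalg.matrix_rank(B)
sigma_matches = np.isclose(s[0], np.linalg.norm(b) * np.sqrt(n))
# Singular vectors are defined only up to sign, so compare absolute values
# (valid here because every entry of b is positive).
u1_matches = np.allclose(np.abs(U[:, 0]), b / np.linalg.norm(b))

print(rank, sigma_matches, u1_matches)
```

The factorization behind the check is $\mathbf{b}\mathbf{1}^{\mathsf{T}} = (\|\mathbf{b}\|\sqrt{n})\,(\mathbf{b}/\|\mathbf{b}\|)\,(\mathbf{1}/\sqrt{n})^{\mathsf{T}}$, which is already in SVD form.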
The foreground is the opposite: incoherent across frames. A person walking through appears in a different place in each frame, so the foreground contributes a different small pattern to each column — there is no single repeated structure, so its energy spreads across many small singular values, just like the noise of §31.7. Add genuine sensor noise on top, equally incoherent, and the picture is unmistakable in the spectrum: one giant singular value for the static background, and a bulk of small singular values for everything that moves or flickers. Truncating at the gap separates the two — the rank-1 approximation is the background, and what is left over is the foreground.
A concrete demonstration
Let us simulate a short clip and watch the separation happen. We build a static background, add a small bright "object" that moves across the scene frame by frame, add sensor noise, and then ask the SVD to recover the background it never saw in isolation.
# Background removal by rank-1 truncation: separate the still scene from the motion.
import numpy as np
rng = np.random.default_rng(11)
H = W = 64; pixels = H * W; n_frames = 100
bg = rng.uniform(0.2, 0.8, pixels) # the static background (flattened)
video = np.tile(bg[:, None], (1, n_frames)) # background repeated in every column
for t in range(n_frames): # add a moving 8x8 bright block
    obj = np.zeros((H, W))
    r, c = (5 + t // 2) % (H - 8), (10 + t) % (W - 8)
    obj[r:r+8, c:c+8] = 0.9
    video[:, t] += obj.ravel()
video += 0.05 * rng.standard_normal((pixels, n_frames)) # sensor noise
U, s, Vt = np.linalg.svd(video, full_matrices=False)
print("top 8 singular values:", np.round(s[:8], 1))
print("background energy fraction (rank 1):", round(s[0]**2 / np.sum(s**2), 4))
background = (U[:, :1] * s[:1]) @ Vt[:1] # rank-1 ≈ static background
foreground = video - background # leftover = motion + noise
bg_err = np.linalg.norm(background[:, 0] - bg) / np.linalg.norm(bg)
print("rank-1 background recovery, relative error:", round(bg_err, 4))
print("foreground energy fraction:", round(np.sum(foreground**2) / np.sum(video**2), 4))
top 8 singular values: [345.4 18.7 18.5 18.2 18. 17.5 17. 16.4]
background energy fraction (rank 1): 0.9531
rank-1 background recovery, relative error: 0.0548
foreground energy fraction: 0.0469
The spectrum matches the prediction. The first singular value, $345$, dwarfs everything else; the next seven decline slowly from $18.7$ to $16.4$, the combined contribution of the moving object and the noise spread across the frames. There is a single, enormous gap after $\sigma_1$, which is the spectral signature of "one static thing plus a lot of incoherent change." Truncating to rank 1 captures $95.3\%$ of the video's energy and recovers the background image to a relative error of $5.5\%$. The recovered background is essentially identical in every column: the object, being in a different place each frame, averages out and only faintly perturbs the dominant pattern, and that faint time-averaged trace along the object's path is the main source of the remaining error. Subtracting the rank-1 background from the video leaves the foreground, which holds the remaining $4.7\%$ of the energy: the moving block plus the sensor noise, isolated automatically. We separated motion from stillness without ever labeling a single pixel, purely by splitting the SVD at the gap.
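To turn the residual into something a tracker could use, one option is to reshape a residual column back into an image and threshold it. The sketch below rebuilds a clip with the same construction as the demonstration so that it runs on its own; the helper `foreground_mask` and the threshold `0.45` (half the block's brightness) are assumptions chosen for this toy scene, not part of the chapter's method.

```python
import numpy as np

rng = np.random.default_rng(11)
H = W = 64
n_frames, size = 100, 8

# Static background plus a moving bright block plus sensor noise.
bg = rng.uniform(0.2, 0.8, H * W)
video = np.tile(bg[:, None], (1, n_frames))
corners = []                                   # true top-left corner per frame
for t in range(n_frames):
    r, c = (5 + t // 2) % (H - size), (10 + t) % (W - size)
    corners.append((r, c))
    obj = np.zeros((H, W))
    obj[r:r + size, c:c + size] = 0.9
    video[:, t] += obj.ravel()
video += 0.05 * rng.standard_normal(video.shape)

# Rank-1 background, then the residual foreground.
U, s, Vt = np.linalg.svd(video, full_matrices=False)
foreground = video - s[0] * np.outer(U[:, 0], Vt[0])

def foreground_mask(residual_column, shape, threshold=0.45):
    """Boolean image marking pixels whose residual exceeds the threshold."""
    return residual_column.reshape(shape) > threshold

t = 30
mask = foreground_mask(foreground[:, t], (H, W))
rows, cols = np.nonzero(mask)
print("true top-left corner:", corners[t])
print("detected box rows:", int(rows.min()), "to", int(rows.max()))
print("detected box cols:", int(cols.min()), "to", int(cols.max()))
```

A fixed threshold is enough here because the block is much brighter than the noise; on real footage the threshold would have to be tuned or replaced by something adaptive.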
Reading the result through the chapter's lens
Every piece of this is the chapter's machinery, relabeled for video. The "signal we keep" is the background, living in the one large singular value, exactly as the clean signal of §31.7 lived in its large singular values. The "thing we separate out" is the foreground, spread across the small singular values like the noise of §31.7 — except that here the small-singular-value content is not garbage to discard but the very thing we want (the moving object). This is the deepest point of the case study: truncation does not destroy the small-singular-value content; it isolates it. Compression throws the tail away; denoising throws the tail away; background subtraction keeps both halves separately — the rank-1 part as the background, the residual as the foreground. Same decomposition, different use of the pieces.
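The sense in which both halves are kept can be written out in one line. Using the singular triples $(\sigma_i, \mathbf{u}_i, \mathbf{v}_i)$ of $A$ and writing $A_1$ for the rank-1 truncation, the split and its energy accounting are:

$$
A \;=\; \underbrace{\sigma_1 \mathbf{u}_1 \mathbf{v}_1^{\mathsf{T}}}_{A_1\ \text{(background)}} \;+\; \underbrace{\sum_{i \ge 2} \sigma_i \mathbf{u}_i \mathbf{v}_i^{\mathsf{T}}}_{A - A_1\ \text{(foreground)}},
\qquad
\|A\|_F^2 \;=\; \underbrace{\sigma_1^2}_{\|A_1\|_F^2} \;+\; \underbrace{\sum_{i \ge 2} \sigma_i^2}_{\|A - A_1\|_F^2}.
$$

Because the singular vectors are orthonormal, the two pieces are orthogonal in the Frobenius inner product, so their energies add with no cross term. In the demonstration this is why the two printed fractions sum to one: $0.9531 + 0.0469 = 1$.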
The choice of where to truncate follows §31.7.1 precisely. We cut at the obvious gap after $\sigma_1$ because the background is rank 1. If the scene had a slowly changing background — a gradual sunset, swaying trees — the background would be rank 2 or 3 rather than rank 1, the gap would appear a few singular values later, and we would truncate there instead. The scree plot, with its giant first value and flat tail, is the diagnostic that tells us the rank of the static structure, just as it told us the rank of the signal in the chapter and the number of taste factors in Case Study 31.1.
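Reading the gap off a scree plot by eye can be mimicked with a simple heuristic: look for the largest ratio between consecutive singular values. The function below is a rough sketch under that assumption; the name `rank_from_gap` and the cap `max_rank` are hypothetical, it assumes strictly positive singular values, and the second spectrum is invented purely to illustrate a rank-3 background.

```python
import numpy as np

def rank_from_gap(singular_values, max_rank=10):
    """Return k such that the largest ratio s[k-1] / s[k] occurs right after the k-th value.

    Only the first max_rank gaps are examined, so a noisy tail cannot win.
    Assumes the singular values are sorted in decreasing order and positive.
    """
    s = np.asarray(singular_values, dtype=float)
    k_max = min(max_rank, len(s) - 1)
    ratios = s[:k_max] / s[1:k_max + 1]
    return int(np.argmax(ratios)) + 1

# Spectrum printed in the demonstration: one giant value, then a slow decline.
demo_spectrum = [345.4, 18.7, 18.5, 18.2, 18.0, 17.5, 17.0, 16.4]
print(rank_from_gap(demo_spectrum))      # 1

# Invented spectrum for a slowly changing background: the gap sits after the third value.
drifting_spectrum = [300.0, 120.0, 40.0, 3.1, 3.0, 2.9]
print(rank_from_gap(drifting_spectrum))  # 3
```

A ratio test like this is only a convenience; when the gap is not obvious, the scree plot itself remains the better diagnostic.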
From the toy to the real thing: robust PCA
Real surveillance video adds one complication our simulation glossed over, and naming it shows where the field goes next. A moving object is not just "small"; it is sparse: it changes a few pixels by a large amount rather than every pixel by a little, which is a different kind of disturbance from the dense Gaussian noise that plain truncation handles well. A widely used refinement, robust PCA, extends the SVD idea to handle exactly this: it decomposes the video matrix as a low-rank part (the background) plus a sparse part (the moving foreground), $A = L + S$, by an optimization that penalizes the rank of $L$ and the number of nonzero entries of $S$ together. In practice both are replaced by convex surrogates, the nuclear norm $\|L\|_*$ (the sum of the singular values of $L$) and the entrywise $\ell_1$ norm $\|S\|_1$. Plain rank-1 truncation, as we did above, is the simple first cousin of robust PCA: it works well when the foreground is small and the background is genuinely static, and it conveys the whole idea. The same low-rank-plus-sparse decomposition cleans scanned documents (low-rank page, sparse ink), separates reflections from photographs, and removes structured artifacts from MRI and astronomical images. In every case the engine is the SVD's sorting of structure into large singular values and clutter into small ones.
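For readers who want to see the low-rank-plus-sparse split in action, here is a compact sketch of one common formulation, principal component pursuit, solved with a basic augmented-Lagrangian iteration. It is an illustration under stated assumptions rather than a production solver: the function names are hypothetical, the defaults $\lambda = 1/\sqrt{\max(m, n)}$ and $\mu = mn / (4\|M\|_1)$ are conventional choices from the robust PCA literature, and the test matrix is a small synthetic one with randomly placed bright outliers instead of a real video.

```python
import numpy as np

def shrink(X, tau):
    """Soft-threshold entries toward zero (proximal step for the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Soft-threshold the singular values (proximal step for the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def robust_pca(M, max_iter=2000, tol=1e-7):
    """Split M into low-rank L plus sparse S by minimizing ||L||_* + lam * ||S||_1."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))           # conventional sparsity weight
    mu = m * n / (4.0 * np.abs(M).sum())     # conventional penalty parameter
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                     # Lagrange multipliers for M = L + S
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S

# Synthetic test: a rank-1 "background" plus sparse bright outliers (about 5% of entries).
rng = np.random.default_rng(0)
pixels, n_frames = 200, 50
b = rng.uniform(0.2, 0.8, pixels)
L_true = np.outer(b, np.ones(n_frames))
S_true = 0.9 * (rng.random((pixels, n_frames)) < 0.05)
M = L_true + S_true

L_hat, S_hat = robust_pca(M)

# Plain rank-1 truncation of the same matrix, for comparison.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
A1 = s[0] * np.outer(U[:, 0], Vt[0])

def rel_err(X):
    return np.linalg.norm(X - L_true) / np.linalg.norm(L_true)

print("rank-1 truncation error:", rel_err(A1))
print("robust PCA error:       ", rel_err(L_hat))
```

The comparison at the end is the point of the exercise: truncation lets the bright outliers leak into the estimated background, while the sparse term gives them somewhere else to go.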
Takeaways
- Flattening each video frame into a column makes a video a matrix; the static background, being identical across frames, is the rank-1 outer product $\mathbf{b}\mathbf{1}^{\mathsf{T}}$ and dominates the spectrum.
- The moving foreground is incoherent across frames, so (like noise) it spreads across many small singular values — producing a giant gap after $\sigma_1$ that the SVD reads off directly.
- Rank-1 truncation recovers the background (here to $5.5\%$ error, capturing $95.3\%$ of the energy); the residual $A - A_1$ isolates the moving object — separation with no pixel labeling.
- Unlike compression and denoising, which discard the small-singular-value tail, background subtraction keeps both halves: the low-rank part as background, the residual as foreground. Same decomposition, different use.
- A widely used refinement, robust PCA, turns this into a low-rank-plus-sparse split $A = L + S$, but the SVD's separation of structure (large singular values) from clutter (small) is its foundation: the denoising idea of §31.7, made into a tool.