Case Study 1 — How MP3 Throws Away Sound You Cannot Hear

Field: audio signal processing and compression. This case study is tied to the chapter's anchor — decomposing and reconstructing a signal — and shows the projection-and-truncate idea doing commercially consequential work.

The problem: a CD is enormous

A single minute of CD-quality stereo audio is about $10$ megabytes. At $44{,}100$ samples per second, two channels, and $16$ bits per sample, the raw numbers pile up fast: a three-minute song is roughly $30$ MB. In the late 1990s, when home internet ran at a few tens of kilobits per second and a hard drive held a few gigabytes, storing or sending music at that size was impractical. The MP3 format (and later codecs such as AAC, Vorbis, and Opus) shrank that same song to two or three megabytes, roughly a tenfold reduction, with quality most listeners cannot distinguish from the original. The mathematics that makes this possible is the mathematics of this chapter: a sound is a vector, the pure frequencies are an orthogonal basis, and the encoder keeps the projections that matter and discards the ones that do not.
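The sizes quoted above follow from simple arithmetic, sketched here. The CD parameters come from the text; the MP3 bitrate of 128 kbit/s is an assumption (a commonly used setting, not a figure the text states), and the function names are illustrative.

```python
# Back-of-envelope storage arithmetic for raw CD audio versus an MP3 file.
SAMPLE_RATE_HZ = 44_100      # CD sampling rate (from the text)
CHANNELS = 2                 # stereo
BITS_PER_SAMPLE = 16
MP3_BITRATE_BPS = 128_000    # ASSUMPTION: a common MP3 bitrate, not stated in the text


def raw_pcm_bytes(seconds):
    """Bytes of uncompressed CD-quality audio for the given duration."""
    return SAMPLE_RATE_HZ * CHANNELS * BITS_PER_SAMPLE // 8 * seconds


def mp3_bytes(seconds, bitrate_bps=MP3_BITRATE_BPS):
    """Bytes of a constant-bitrate MP3 stream for the given duration."""
    return bitrate_bps // 8 * seconds


one_minute_raw = raw_pcm_bytes(60)
song_raw = raw_pcm_bytes(180)
song_mp3 = mp3_bytes(180)

print(f"one minute raw : {one_minute_raw / 1e6:.2f} MB")
print(f"3-minute raw   : {song_raw / 1e6:.2f} MB")
print(f"3-minute MP3   : {song_mp3 / 1e6:.2f} MB")
print(f"reduction      : {song_raw / song_mp3:.1f}x")
```

Under that bitrate assumption the ratio comes out slightly above ten, consistent with the "tenfold" figure.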

Step 1: a sound is a vector; its frequencies are an orthogonal basis

A digital sound is a long list of pressure samples — exactly the sampled-signal vector $\mathbf{f} \in \mathbb{R}^N$ of §22.8. By itself that list is opaque; you cannot look at a million numbers and see "a flute playing A." But projected onto the frequency basis, the sound reveals its structure. Most musical sounds are dominated by a handful of frequencies — a fundamental pitch and a few overtones — so when you compute the Fourier (more precisely, the closely related cosine-transform) coefficients, most of the energy concentrates into a small number of coefficients, and the vast majority are nearly zero.

This concentration is the whole game, and Parseval's identity from §22.10 is what lets us measure it. Parseval says the total energy of the sound equals the sum of the energies of its frequency coefficients. If $95\%$ of that energy sits in $10\%$ of the coefficients (which is typical for natural sound), then keeping that $10\%$ reconstructs the sound with $95\%$ of its energy intact, and the error is exactly the discarded $5\%$, confined to the coefficients you dropped. Because the basis is orthogonal, dropping a small coefficient costs exactly its small energy and contaminates nothing else: this is the clean, controllable trade-off that the chapter's Key Insight promised. In a non-orthogonal representation there would be no such guarantee; discarding one component could corrupt the reconstruction unpredictably.
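A quick numerical check of both claims, as a sketch: the signal below is a made-up stand-in for a musical note (a 440 Hz fundamental with three decaying overtones plus faint noise), and the sample rate and amplitudes are illustrative assumptions. It uses NumPy's FFT with `norm="ortho"`, which makes the transform orthonormal so that Parseval's identity holds without extra scale factors.

```python
import numpy as np

fs = 8000                          # samples per second (illustrative choice)
t = np.arange(fs) / fs             # one second of signal
rng = np.random.default_rng(0)

# A fundamental plus three overtones with decaying amplitude, plus faint noise.
partials = [(1.0, 440), (0.5, 880), (0.25, 1320), (0.125, 1760)]
x = sum(a * np.sin(2 * np.pi * f * t) for a, f in partials)
x = x + 0.01 * rng.standard_normal(fs)

# Orthonormal DFT: energy in time equals energy in frequency (Parseval).
X = np.fft.fft(x, norm="ortho")
energy_time = np.sum(x ** 2)
energy_freq = np.sum(np.abs(X) ** 2)

# How many coefficients, taken largest first, hold 95% of the energy?
coeff_energy = np.sort(np.abs(X) ** 2)[::-1]
cumulative = np.cumsum(coeff_energy)
n_needed = int(np.searchsorted(cumulative, 0.95 * energy_freq) + 1)

print(f"time-domain energy      : {energy_time:.6f}")
print(f"frequency-domain energy : {energy_freq:.6f}")
print(f"coefficients for 95%    : {n_needed} of {len(X)}")
```

For a signal this clean the count is a tiny fraction of the total; real recordings are less extreme, but the same measurement applies.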

In practice MP3 does not transform the whole song at once. It slides a short window — about $26$ milliseconds — along the signal and transforms each window separately, because the frequency content of music changes over time (a song is not one chord held forever). Each window is projected onto a frequency basis via a modified discrete cosine transform, a windowed, overlap-friendly relative of the Fourier cosine series we built in §22.4. The overlap between consecutive windows is engineered so the windows blend smoothly back together on reconstruction, avoiding clicks at the window seams. But within each window, the operation is precisely ours: project onto an orthogonal frequency basis, obtaining a spectrum.
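The windowed, overlapping transform can be sketched in a few lines. This is a minimal textbook-style MDCT with a sine window and 50% overlap, not the MP3 filterbank itself (which also includes a polyphase stage and window switching that are omitted here). The function names and the frame size of 64 are assumptions made for illustration. Each frame of $2N$ samples yields only $N$ coefficients, so a single frame cannot be inverted alone; it is the overlap-add of neighboring frames that cancels the aliasing and restores the signal.

```python
import numpy as np


def mdct_basis(N):
    """N x 2N matrix of MDCT cosines: row k, column n."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))


def sine_window(N):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    n = np.arange(2 * N)
    return np.sin(np.pi * (n + 0.5) / (2 * N))


def mdct_analyze(x, N):
    """Split x into 50%-overlapping frames and return (n_frames, N) coefficients."""
    C, w = mdct_basis(N), sine_window(N)
    padded = np.concatenate([np.zeros(N), x, np.zeros(N)])
    padded = np.concatenate([padded, np.zeros((-len(padded)) % N)])
    n_frames = len(padded) // N - 1
    frames = np.stack([padded[i * N: i * N + 2 * N] for i in range(n_frames)])
    return (frames * w) @ C.T


def mdct_synthesize(coeffs, N, length):
    """Inverse transform each frame, window again, and overlap-add."""
    C, w = mdct_basis(N), sine_window(N)
    out = np.zeros((coeffs.shape[0] + 1) * N)
    for i, X in enumerate(coeffs):
        out[i * N: i * N + 2 * N] += (2.0 / N) * w * (X @ C)
    return out[N: N + length]      # strip the leading zero padding


rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
coeffs = mdct_analyze(x, N=64)
y = mdct_synthesize(coeffs, N=64, length=len(x))
print("max reconstruction error:", np.max(np.abs(x - y)))
```

With unmodified coefficients the reconstruction matches the input to rounding error, which is the seamless blending the overlap is designed to provide.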

Step 2: the psychoacoustic twist — keep only what the ear hears

Here MP3 does something cleverer than plain truncation, and it is what separates audio compression from the generic "keep the big coefficients" of image compression. The encoder does not merely keep the largest coefficients; it keeps the audible ones, using a psychoacoustic model of human hearing. Two facts about the ear drive the model. First, the ear's sensitivity varies with frequency — we hear midrange frequencies (where speech lives) far more acutely than very low or very high ones, so coefficients in the insensitive bands can be stored coarsely or dropped. Second, and more subtly, a loud tone masks quieter tones at nearby frequencies: if a loud $1000$ Hz note is present, a quiet $1100$ Hz component right next to it is literally inaudible, drowned out by the louder neighbor.

The encoder exploits masking ruthlessly. For each window's spectrum, it computes a masking threshold — the loudness below which a coefficient cannot be heard given the louder coefficients around it — and then it allocates bits accordingly: many bits (fine precision) to coefficients well above threshold, few or zero bits to coefficients below it. Coefficients that are masked are quantized so coarsely that they are effectively discarded. The decoder reconstructs the sound by summing the retained frequency coefficients back into a waveform — the reconstruction step of §22.6, $\mathbf{f} \approx \sum_k \hat f_k \mathbf{w}_k$, restricted to the kept coefficients. Because what was thrown away was inaudible to begin with, the reconstructed sound is, to the ear, indistinguishable from the original despite being a tenth the size.
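The masking idea can be illustrated with a deliberately crude model. Everything in this sketch is an assumption for illustration: the threshold sits a fixed 20 dB below each masker and falls by 1 dB per frequency bin of distance, a constant floor stands in for the threshold of hearing, and an ordinary FFT replaces the codec's cosine transform. Real psychoacoustic models are far more elaborate. The test signal uses the text's example of a loud $1000$ Hz tone next to a quiet $1100$ Hz one, plus an equally quiet tone far away at $5000$ Hz.

```python
import numpy as np


def toy_masking_threshold(mag, offset_db=20.0, slope_db_per_bin=1.0, quiet_floor=1e-6):
    """Toy threshold: each bin is masked by its strongest attenuated neighbor.

    ASSUMPTION: the offset, slope, and floor are made-up illustrative values.
    """
    k = np.arange(len(mag))
    distance = np.abs(k[:, None] - k[None, :])             # |k - j| in bins
    atten = 10.0 ** (-(offset_db + slope_db_per_bin * distance) / 20.0)
    np.fill_diagonal(atten, 0.0)                           # a bin does not mask itself
    masked_level = (atten * mag[None, :]).max(axis=1)      # strongest masker per bin
    return np.maximum(masked_level, quiet_floor * mag.max())


fs, n = 16000, 1600                                        # 10 Hz per bin
t = np.arange(n) / fs
x = (1.00 * np.sin(2 * np.pi * 1000 * t)                   # loud masker
     + 0.01 * np.sin(2 * np.pi * 1100 * t)                 # quiet, right next to it
     + 0.01 * np.sin(2 * np.pi * 5000 * t))                # equally quiet, far away

X = np.fft.rfft(x)
mag = np.abs(X)
audible = mag > toy_masking_threshold(mag)

kept_bins = np.flatnonzero(audible)
kept_freqs = np.fft.rfftfreq(n, d=1 / fs)[audible]
decoded = np.fft.irfft(np.where(audible, X, 0), n=n)       # sum only the kept coefficients

print("kept frequencies (Hz):", kept_freqs)
```

In this toy setting the quiet tone beside the masker is discarded while the equally quiet distant tone survives, which is the behavior the paragraph describes. A real encoder would quantize coarsely rather than zero outright.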

Step 3: why orthogonality is non-negotiable here

It is worth dwelling on why the orthogonality of the frequency basis, rather than just the use of frequencies, is what makes the scheme work — because this is the chapter's recurring theme. The masking model decides, frequency by frequency, how many bits each coefficient deserves. That decision is only sound if the coefficients are independent: the encoder must be able to coarsen one coefficient without that error bleeding into the others. Orthogonality guarantees exactly this. Each coefficient is a projection onto its own axis, blind to the rest (§22.4); the energy of the quantization error in one coefficient adds independently to the total error (Parseval, §22.10); and so the encoder can reason about each frequency in isolation, confident that the sum of small independent errors stays small. If the frequency components overlapped — if the basis were not orthogonal — the masking decisions would be coupled, the error analysis would not factor, and the careful bit-allocation would be unreliable.
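The independence claim can be made concrete with a short calculation. As a sketch, suppose quantization replaces each real coefficient $\hat f_k$ by $\hat f_k + \varepsilon_k$ (the symbol $\varepsilon_k$ for the per-coefficient quantization error is introduced here for illustration). The reconstruction error is then $\sum_k \varepsilon_k \mathbf{w}_k$, and its energy expands as

$$
\Big\| \sum_k \varepsilon_k \mathbf{w}_k \Big\|^2
= \sum_k \sum_j \varepsilon_k \varepsilon_j \langle \mathbf{w}_k, \mathbf{w}_j \rangle
= \sum_k \varepsilon_k^2 \, \|\mathbf{w}_k\|^2
+ \sum_{k \neq j} \varepsilon_k \varepsilon_j \langle \mathbf{w}_k, \mathbf{w}_j \rangle .
$$

When the basis is orthogonal, every inner product $\langle \mathbf{w}_k, \mathbf{w}_j \rangle$ with $k \neq j$ is zero, so the second sum vanishes and the total error energy is just $\sum_k \varepsilon_k^2 \|\mathbf{w}_k\|^2$: one term per coefficient, each controlled by its own bit allocation. Without orthogonality the cross terms survive, and the error caused by coarsening one coefficient depends on what was done to all the others.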

This is the same reason regression coefficients are clean when the design is orthogonal (Chapter 17) and the same reason Gram–Schmidt's orthonormal coordinates are independent (Chapter 20). The application changes — fitting a line, building a basis, compressing a song — but the structural fact is identical: orthogonality decouples, and decoupling is what lets you treat each coordinate on its own terms.

A small experiment you can run

You can reproduce the heart of the idea — concentration of energy and graceful reconstruction from few coefficients — on a synthetic signal in a few lines, using the fourier_coeffs and reconstruct you build in this chapter, or np.fft. Take a signal that is a sum of three tones plus a little noise. Transform it, sort the coefficients by magnitude, keep only the largest $k$ of them, zero the rest, and reconstruct. Plot the reconstruction error against $k$ and you will see it drop steeply at first — the few large coefficients capture the tones — then flatten, because the remaining tiny coefficients (mostly the noise) carry little energy. That elbow in the error curve is the compression sweet spot: the point past which extra coefficients buy almost no fidelity. Real codecs find a domain-specific version of that elbow using the psychoacoustic model rather than raw magnitude, but the shape of the trade-off is the one Parseval predicts.
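One way to set up this experiment is sketched below using `np.fft` (the chapter's own `fourier_coeffs` and `reconstruct` would serve equally well but are not reproduced here). The tone frequencies, amplitudes, noise level, and the helper name `keep_largest` are illustrative assumptions.

```python
import numpy as np


def keep_largest(signal, k):
    """Zero all but the k largest-magnitude rfft coefficients, then invert."""
    spectrum = np.fft.rfft(signal)
    order = np.argsort(np.abs(spectrum))[::-1]     # indices, largest magnitude first
    truncated = np.zeros_like(spectrum)
    truncated[order[:k]] = spectrum[order[:k]]
    return np.fft.irfft(truncated, n=len(signal))


n = 1024
t = np.arange(n) / n
rng = np.random.default_rng(0)

# Three tones of decreasing strength plus a little noise.
signal = (1.0 * np.sin(2 * np.pi * 50 * t)
          + 0.6 * np.sin(2 * np.pi * 120 * t)
          + 0.3 * np.sin(2 * np.pi * 300 * t)
          + 0.05 * rng.standard_normal(n))

ks = [1, 2, 3, 5, 10, 50]
rel_errors = [
    np.linalg.norm(signal - keep_largest(signal, k)) / np.linalg.norm(signal)
    for k in ks
]

for k, err in zip(ks, rel_errors):
    print(f"k = {k:3d}   relative error = {err:.4f}")
```

The printed errors fall sharply over the first three values of $k$, one per tone, and then decline only slowly as further coefficients recover noise. Passing `ks` and `rel_errors` to a plotting library shows the elbow directly.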

What to take away

MP3 is orthogonal projection with a human-hearing model bolted on top. Strip away the engineering — the windowing, the overlap, the entropy coding that packs the surviving bits efficiently — and the load-bearing mathematics is the chapter's: a sound is a vector in an inner product space, the pure frequencies are an orthogonal basis, the spectrum is the set of projections onto that basis, and compression is the controlled discarding of coefficients that carry little (or, here, inaudible) energy. The reason you can fit a thousand songs on a device that once held a handful is, at bottom, that the right angle organizes the spectrum of a sound — and orthogonal coordinates can be thrown away one at a time without the rest noticing.

The same projection-and-truncate philosophy returns, in a richer form, when we meet the singular value decomposition in Chapter 30: there too a signal (or an image, or a data matrix) is expanded in an orthogonal basis, and there too keeping the largest components yields the best low-rank approximation. Fourier compression is your first encounter with that idea; the SVD will make it universal.