Case Study 2 — Inside JPEG: The Discrete Cosine Transform on an 8×8 Block

DataField.Dev

Case Study 2 — Inside JPEG: The Discrete Cosine Transform on an 8×8 Block

Field: image compression and data engineering. Where Case Study 1 compressed a one-dimensional sound, this one compresses a two-dimensional image — and shows that the projection-onto-an-orthogonal-basis idea generalizes from functions of one variable to functions of two, almost without change. It is also a direct preview of the low-rank approximation you will meet with the SVD in Chapter 30.

The problem: photographs are mostly redundant

A modest photograph — say $4000 \times 3000$ pixels, three color channels — is $36$ million numbers, well over $30$ MB raw. Yet the JPEG version on your phone is perhaps $2$–$4$ MB and looks, to the eye, identical. The reason is that natural images are enormously redundant: neighboring pixels are almost always similar, large regions are nearly uniform, and sharp edges are the exception rather than the rule. That redundancy means an image's information is concentrated in a small number of slowly-varying patterns, with little in the fine, high-frequency detail. JPEG exposes and exploits this concentration with a two-dimensional cousin of the Fourier cosine series: the discrete cosine transform (DCT).

Step 1: tile the image, and treat each tile as a vector

JPEG does not transform the whole image at once. It chops it into a grid of $8 \times 8$ pixel blocks and processes each block independently. An $8 \times 8$ block of grayscale values is, for our purposes, a vector with $64$ components — a point in $\mathbb{R}^{64}$ — exactly the "a signal is a vector" move of §22.2, now with $64$ samples arranged in a square instead of a line. The question JPEG asks of each block is the question this chapter asks of every signal: what is its content in an orthogonal frequency basis?

The basis JPEG uses is a set of $64$ two-dimensional cosine patterns. Picture them in an $8 \times 8$ grid of little tiles. The top-left pattern is a flat, constant gray — the DC term, the average brightness of the block, the two-dimensional analogue of our $a_0$. Moving right, the patterns oscillate faster horizontally (vertical stripes, finer and finer). Moving down, they oscillate faster vertically (horizontal stripes). The bottom-right patterns are fine checkerboards oscillating fast in both directions — the highest spatial frequencies. Every one of the $64$ patterns is a product of a horizontal cosine and a vertical cosine, $\cos\!\big(\tfrac{(2x+1)u\pi}{16}\big)\cos\!\big(\tfrac{(2y+1)v\pi}{16}\big)$, the direct two-dimensional descendant of the $\cos kx$ basis we projected onto in §22.4.

Step 2: the basis is orthogonal, so a coordinate is a projection

The crucial fact — the reason this is the same mathematics — is that these $64$ cosine patterns are mutually orthogonal under the natural inner product on $8 \times 8$ blocks (sum the products of matching entries, the discrete dot product of Chapter 18). Distinct patterns are perpendicular; each has a known norm. Therefore the DCT coefficient of a block along a given pattern is the projection of the block onto that pattern — the inner product of the $64$-vector block with the $64$-vector pattern, divided by the pattern's squared norm. This is, verbatim, the projection-onto-an-orthogonal-basis formula of §22.4, applied in $\mathbb{R}^{64}$ rather than in function space. The DCT of a block is its list of $64$ coordinates in the orthogonal cosine basis — a change of basis (Chapter 16) into a frame where the image's structure is laid bare.

And just as with sound, the coordinates concentrate. For a typical photographic block — smooth, with maybe a gentle gradient — almost all the energy lands in the top-left handful of coefficients (the low spatial frequencies: average brightness and slow variation), while the bottom-right coefficients (fine texture) are tiny. By Parseval (§22.10), the energy of the block equals the sum of the squares of its $64$ DCT coefficients, so a block whose energy sits in, say, six coefficients can be reconstructed from those six with almost no loss. The high-frequency coefficients are the redundant fine detail the eye barely registers.

Step 3: quantization — the lossy truncation

JPEG now does its version of the truncation from §22.6, tuned to human vision. It divides each DCT coefficient by an entry of a quantization table and rounds to the nearest integer. The table assigns small divisors to the low-frequency coefficients (preserving them precisely) and large divisors to the high-frequency coefficients (crushing them, often to zero). Because the human visual system is far less sensitive to fine high-frequency detail than to coarse low-frequency structure — the optical analogue of the auditory masking in Case Study 1 — zeroing most of the high-frequency coefficients is nearly invisible. After quantization, a typical $8 \times 8$ block that began as $64$ nonzero numbers has only a few nonzero coefficients clustered in the top-left, and a long run of zeros elsewhere. Those zeros compress trivially (JPEG reads the block in a zig-zag order precisely to group them into one long run), which is where the dramatic size reduction comes from.

To view the image, the decoder reverses the steps: multiply each coefficient back by its quantization-table entry (approximately undoing the division), then run the inverse DCT — which, because the cosine basis is orthogonal, is again just a sum of projections: rebuild the block as $\sum_{u,v} (\text{coefficient})_{uv} \times (\text{pattern})_{uv}$. This is the reconstruction $\mathbf{f} \approx \sum_k \hat f_k \mathbf{w}_k$ of §22.6, in two dimensions and restricted to the surviving (mostly low-frequency) coefficients. The reconstructed block is the orthogonal projection of the original onto the retained patterns — the best approximation using those patterns, by the closest-point property of Chapter 19.

Step 4: where you can see the truncation — and where it tells the truth

The fingerprints of this projection-and-truncate scheme are visible if you know where to look. Crank JPEG quality very low and you will see two artifacts, both predicted by the mathematics. First, blocking: the $8 \times 8$ tiles become visible as little squares, because each block was approximated independently and they no longer agree perfectly at their seams (MP3 dodges the analogous seam problem with overlapping windows; baseline JPEG does not). Second, ringing near sharp edges — faint ripples beside a crisp boundary. That ringing is the two-dimensional Gibbs phenomenon of §22.6: a sharp edge is a jump, smooth cosine patterns overshoot at jumps, and truncating the high-frequency coefficients that would tame the overshoot leaves visible ripples. The Common Pitfall of the chapter — Gibbs ringing that truncation cannot fully remove — is literally printed on every over-compressed image with a hard edge in it.

There is an honest caveat worth stating, in the spirit of the book's verification standards. JPEG's DCT is closely related to the Fourier cosine series but is not identical to it — it uses a specific finite cosine basis on $8 \times 8$ blocks with particular boundary conventions chosen to avoid spurious edge discontinuities, and the quantization tables are empirically tuned to vision rather than derived from first principles. What transfers exactly from this chapter is the structure: an orthogonal basis of cosine patterns, coefficients as projections, energy concentration measured by Parseval, and lossy compression as optimal truncation. The engineering details are JPEG-specific; the linear algebra is ours.

A connection forward, and what to take away

JPEG and MP3 are the same idea in different dimensions: project a signal onto an orthogonal basis of frequencies, observe that natural signals concentrate their energy into few coefficients, and discard the rest — the small ones (image) or the inaudible ones (sound). Orthogonality is what makes the discarding safe, because it makes the coefficients independent and the energy additive, so you can throw coordinates away one at a time without the survivors noticing.

This is also a stepping stone to one of the book's destinations. In Chapter 30 the singular value decomposition will expand any matrix — including an image, treated as a matrix of pixels rather than a grid of $8 \times 8$ blocks — in an orthogonal basis whose coordinates are ordered by importance, and keeping the largest few yields the best low-rank approximation: a blurry-but-recognizable image from a handful of components, sharpening as you add more. JPEG's block-DCT is, in spirit, a fixed-basis version of that adaptive idea. When you watch a rank-$10$ SVD image resolve into a rank-$200$ one in Chapter 31, remember that you first met the principle here — keep the projections that carry the energy, discard the rest, and orthogonality guarantees the trade is clean.