Case Study 30.1 — Latent Semantic Analysis: Finding Meaning with the SVD

DataField.Dev

Case Study 30.1 — Latent Semantic Analysis: Finding Meaning with the SVD

Field: data science & natural language processing. Anchor tie-in: this is the SVD's first "wow" application — the same decomposition we proved factors every matrix into rotate–stretch–rotate here uncovers hidden topics in a collection of documents, and it is the direct ancestor of the low-rank approximation of Chapter 31 and the dimensionality reduction techniques of data science.

The problem: a computer that does not know what words mean

Imagine you run a small library's search engine. A user types "rocket," and you want to return documents about space — including ones that never use the word "rocket" but talk about "orbit," "planet," and "spacecraft." A naive keyword search fails completely: it matches the literal string "rocket" and misses every synonym and related concept. The deeper problem is that a computer has no idea that "rocket" and "orbit" are about the same thing, while "rocket" and "recipe" are not. Words are just symbols to it; meaning is invisible.

Latent Semantic Analysis (LSA), introduced in the late 1980s, solves this with a startling idea: meaning is hidden in the patterns of which words co-occur in which documents, and the SVD extracts it. Words that appear together in the same documents are probably about the same topic; documents that share vocabulary are probably about the same subject. These co-occurrence patterns form a matrix, and the singular value decomposition of that matrix reveals the underlying topics — the "latent semantic" structure — without anyone ever telling the computer what any word means. It is one of the cleanest demonstrations that the SVD does not just factor a matrix; it finds structure in data.

From documents to a matrix

The bridge from text to linear algebra is the term-document matrix $A$. Each row corresponds to a word (a term), each column to a document, and the entry $a_{ij}$ counts how many times term $i$ appears in document $j$ (real systems weight these counts, but raw counts suffice to see the mechanism). A document becomes a column vector — a point in "term space" — and the whole corpus becomes a matrix. Crucially, this matrix is almost always rectangular (a vocabulary of 50,000 words, a corpus of 10,000 documents — a $50000 \times 10000$ matrix) and rank-deficient (most documents use a tiny fraction of the vocabulary, so the matrix is mostly zeros). This is exactly the kind of matrix that has no eigen-decomposition — it is not square — but always has an SVD (§30.5). LSA could not work without the universality we proved this chapter.

To see the mechanism on something hand-checkable, take a tiny corpus of six documents and six terms. Documents 1–3 are about space (using "rocket," "planet," "orbit") and documents 4–6 are about cooking (using "recipe," "oven," "bake"), with each document leaning on one main term plus the others in its topic:

$$A = \begin{array}{r|cccccc} & d_1 & d_2 & d_3 & d_4 & d_5 & d_6 \\\hline \text{rocket} & 2 & 1 & 1 & 0 & 0 & 0 \\ \text{planet} & 1 & 2 & 1 & 0 & 0 & 0 \\ \text{orbit} & 1 & 1 & 2 & 0 & 0 & 0 \\ \text{recipe} & 0 & 0 & 0 & 2 & 1 & 1 \\ \text{oven} & 0 & 0 & 0 & 1 & 2 & 1 \\ \text{bake} & 0 & 0 & 0 & 1 & 1 & 2 \end{array}$$

The block structure is deliberate and visible to us, but the computer sees only numbers. The question is whether the SVD can recover the two-topic structure from those numbers alone.

The SVD reveals the topics

# Latent Semantic Analysis: the SVD of a term-document matrix reveals topics.
import numpy as np
A = np.array([
    [2, 1, 1, 0, 0, 0],   # rocket
    [1, 2, 1, 0, 0, 0],   # planet
    [1, 1, 2, 0, 0, 0],   # orbit
    [0, 0, 0, 2, 1, 1],   # recipe
    [0, 0, 0, 1, 2, 1],   # oven
    [0, 0, 0, 1, 1, 2],
], dtype=float)            # bake
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print("singular values:", np.round(S, 4))
print("energy in top 2 σ:", round((S[:2]**2).sum() / (S**2).sum(), 4))
# Document coordinates in the top-2 "latent semantic" space (Σ V^T, transposed).
doc_coords = (np.diag(S[:2]) @ Vt[:2, :]).T
print("document latent coordinates (rows = d1..d6):\n", np.round(doc_coords, 3))

singular values: [4. 4. 1. 1. 1. 1.]
energy in top 2 σ: 0.8889
document latent coordinates (rows = d1..d6):
 [[-2.309  0.   ]
 [-2.309  0.   ]
 [-2.309  0.   ]
 [ 0.    -2.309]
 [ 0.    -2.309]
 [ 0.    -2.309]]

Read this output carefully, because the SVD has just discovered the topics on its own. The singular values are $\{4, 4, 1, 1, 1, 1\}$: there are two dominant singular values (both equal to 4), and they tower over the rest. Those two singular values are the two topics — space and cooking. The remaining small singular values capture only the minor within-topic variation (which document leans on which specific word). The top two singular values carry $\frac{4^2 + 4^2}{4^2 + 4^2 + 1^2 + 1^2 + 1^2 + 1^2} = \frac{32}{36} \approx 89\%$ of the matrix's total "energy" (the Frobenius-norm-squared, §30.9), so two dimensions describe almost everything.

Now look at the document coordinates in the 2D latent space spanned by the top two singular directions. Documents $d_1, d_2, d_3$ all land at $(-2.309, 0)$ — they pile onto the first latent axis. Documents $d_4, d_5, d_6$ all land at $(0, -2.309)$ — they pile onto the second. The space documents and the cooking documents have separated into two perpendicular clusters, even though the computer was given nothing but word counts. The first latent dimension is the "space" topic; the second is the "cooking" topic, and each document's coordinates say how much it belongs to each. The right singular vectors $\mathbf{v}_i$ have become topic detectors.

Why this works: topics are singular directions

The reason the SVD finds topics is exactly the geometry of this chapter. The term-document matrix maps "document space" to "term space," and the SVD identifies the perpendicular input directions ($\mathbf{v}_i$, the right singular vectors) that get stretched the most. A direction stretched by a large singular value is a strong, recurring co-occurrence pattern — a topic. A direction stretched by a small singular value is noise or idiosyncrasy. So ranking the singular values ranks the topics by importance, and keeping only the top few — a low-rank approximation (Chapter 31) — keeps only the meaningful topics and discards the noise. This is the same truncation that compresses images; here it compresses meaning.

The "latent" in latent semantic analysis is the key word. The topics were never labeled in the data; they are latent — hidden in the co-occurrence patterns — and the SVD surfaces them as its dominant singular directions. This is why LSA can match "rocket" to a document about "orbit": both words load heavily on the same latent dimension (the space topic), so a query and a document that share no literal words can still be close in the low-dimensional latent space. LSA replaced brittle keyword matching with geometry: similarity becomes closeness in the space of topics, measured by the cosine similarity (Chapter 18) of the documents' latent coordinates.

We can also confirm what the rank-2 truncation does to the matrix itself:

# Rank-2 truncation: keep only the two topic directions.
A2 = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, :]
print("rank-2 reconstruction:\n", np.round(A2, 3))
print("Frobenius error =", round(np.linalg.norm(A - A2, 'fro'), 4),
      " = sqrt(sum of dropped σ²) =", round(np.sqrt((S[2:]**2).sum()), 4))

rank-2 reconstruction:
 [[1.333 1.333 1.333 0.    0.    0.   ]
 [1.333 1.333 1.333 0.    0.    0.   ]
 [1.333 1.333 1.333 0.    0.    0.   ]
 [0.    0.    0.    1.333 1.333 1.333]
 [0.    0.    0.    1.333 1.333 1.333]
 [0.    0.    0.    1.333 1.333 1.333]]

The rank-2 reconstruction has smoothed away the within-topic differences — every space document now looks like the average space document, every cooking document like the average cooking document. The within-topic variation (the four small singular values) has been discarded, and the Frobenius error is exactly the root-sum-of-squares of the dropped singular values, $\sqrt{1^2+1^2+1^2+1^2} = 2$ — the Eckart–Young error formula of §30.9 and Chapter 31. What remains is pure topic structure. For search, that is a feature, not a bug: the smoothed matrix says "these three documents are interchangeable as far as their topic goes," which is precisely the generalization that lets a query about "rocket" retrieve a document about "orbit."

From toy to real, and the cautions

Real LSA scales this to enormous matrices — tens of thousands of terms and documents — and keeps perhaps 100–300 singular values out of thousands, compressing the corpus to a few hundred latent dimensions. The same pipeline (build a sparse term-document matrix, take a truncated SVD, work in the low-dimensional latent space) underlies early search engines, document clustering, essay-grading systems, and the "people who read this also read…" recommendations on book sites. It is also the conceptual ancestor of modern word embeddings: methods like word2vec and GloVe can be understood as factorizing a (transformed) word co-occurrence matrix, and the embeddings of Chapter 33 are their deep-learning descendants. The thread from this chapter's $A = U\Sigma V^{\mathsf{T}}$ to the embeddings powering today's language models is direct.

Two honest cautions. First, LSA's latent dimensions are not guaranteed to be human-interpretable topics — they are the directions of maximum co-occurrence variance, which often align with topics but sometimes mix several or capture stylistic artifacts. The SVD finds statistically dominant structure, not necessarily the structure a human would name. Second, choosing how many singular values to keep is a genuine modeling decision: too few and you merge distinct topics, too many and you keep noise. The singular value spectrum guides the choice (look for a "gap" or "elbow" where the values drop off), but there is no formula — this is the same "how many components?" question that PCA faces in Chapter 32. These caveats do not diminish the central lesson, which is the one this chapter has been building toward: the SVD does not merely factor a matrix; pointed at data, it extracts the data's dominant structure, and it does so for the rectangular, rank-deficient matrices where no other decomposition even applies.

Takeaways

A term-document matrix turns a text corpus into a (rectangular, rank-deficient) matrix — exactly the kind that has no eigen-decomposition but always has an SVD (§30.5).
The dominant singular values are the topics: each large singular value marks a strong co-occurrence pattern, and ranking the singular values ranks the topics by importance.
The right singular vectors give each document's coordinates in a low-dimensional "latent semantic" space, where documents sharing a topic cluster together — even with no shared literal words.
Truncating to the top few singular values (low-rank approximation, Chapter 31) keeps the meaningful topics and discards noise; the error is the root-sum-of-squares of the discarded singular values (§30.9).
This is the same machinery as image compression and the direct ancestor of word embeddings (Chapter 33) — one decomposition, many fields, the defining theme of Part VI.