Case Study 1 — How a Search Engine Finds the Right Document: Cosine Similarity in NLP

Field: natural language processing / data science.
Concepts used: the dot product, the norm, cosine similarity, orthogonality, the angle in high dimensions.
Anchor tie-in: this is the chapter's anchor, cosine similarity between document and embedding vectors, shown doing the job it was built for: ranking text by relevance. It is also the operation behind word embeddings and the semantic search inside modern chatbots.

The problem: ranking text by meaning, not length

You type a few words into a search box and, out of millions of documents, the engine returns the handful most relevant to what you asked. How does a machine, which understands no English, decide that one document is "about" your query and another is not? The modern answer is geometric: turn every document and the query into a vector, and measure the angle between them. Small angle — high cosine similarity — means "points the same way," which here means "about the same thing." The entire chapter we just finished is the machinery of that one decision.

The first step is to represent text as numbers. The simplest scheme, the bag of words, fixes a vocabulary and counts how often each word appears, giving each document a vector in $\mathbb{R}^V$ where $V$ is the vocabulary size. A real engine uses a smarter weighting (TF-IDF, which down-weights words like "the" that appear everywhere) and modern systems use learned embeddings of a few hundred dimensions, but the comparison step is identical: take the cosine similarity between vectors. We will use small integer count vectors so every number is checkable by hand, then explain how the same arithmetic scales to embeddings.
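The bag-of-words step can be sketched in a few lines. This is a minimal illustration, not how a production engine tokenizes: the function name `bag_of_words`, the whitespace tokenizer, and the sample sentence are all assumptions made for the example, and the vocabulary is the eight-word one used in the worked example of this case study.

```python
from collections import Counter

import numpy as np

# The eight-word vocabulary from the worked example in this case study.
VOCAB = ["neural", "network", "deep", "learning", "recipe", "oven", "bake", "sugar"]


def bag_of_words(text, vocab=VOCAB):
    """Count how often each vocabulary word occurs in `text`.

    Tokenization is a deliberately naive lowercase split on whitespace;
    words outside the vocabulary are ignored.
    """
    counts = Counter(text.lower().split())
    return np.array([counts[word] for word in vocab])


# Hypothetical sentence, made up for illustration.
sample = "deep learning uses a deep neural network"
vec = bag_of_words(sample)
print(vec)  # [1 1 2 1 0 0 0 0]
```

Each position in the output corresponds to one vocabulary word, so "deep" (which appears twice) contributes the 2 in the third slot, and "uses" and "a" are dropped because they are not in the vocabulary.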

Why not just use distance?

The naive idea is to call two documents similar when their vectors are close, meaning a small Euclidean distance $\lVert\mathbf{d}_1-\mathbf{d}_2\rVert$. This fails badly, and the reason is length. A 50-word abstract and a 5,000-word paper on the very same topic use roughly the same words in roughly the same proportions, so their vectors point in nearly the same direction. But one is about a hundred times longer than the other, so the Euclidean distance between them is enormous. Distance would rank the long paper as wildly dissimilar to its own abstract. That is exactly backwards.

Cosine similarity fixes this by ignoring length and keeping only direction. Because it is length-invariant (§18.9: scaling a vector by a positive constant leaves its cosine similarity unchanged), the abstract and the paper score near $1$: same direction, same topic, regardless of size. This is why cosine similarity, not Euclidean distance, is the default measure of textual relevance.
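A quick numerical check of the abstract-versus-paper argument, under an idealized assumption: the "paper" below is exactly one hundred times the "abstract" vector, so the two share word proportions perfectly. The count values are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Hypothetical word counts standing in for a short abstract.
abstract = np.array([1.0, 1.0, 2.0, 2.0])
# Same word proportions, one hundred times as many words.
paper = 100 * abstract

distance = np.linalg.norm(paper - abstract)
similarity = cosine_similarity(abstract, paper)

print(f"Euclidean distance: {distance:.2f}")    # 313.07
print(f"cosine similarity:  {similarity:.4f}")  # 1.0000
```

The distance grows with the length gap while the cosine stays at $1$, which is the length invariance the paragraph above relies on. Real documents will not match proportions exactly, so in practice the cosine would sit near $1$ rather than at it.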

The geometric picture a search engineer carries: each document is an arrow from the origin into a very high-dimensional "word space." Length encodes how long the document is; direction encodes what it is about. Relevance is a question about direction only, so we measure the angle — and the cosine of that angle is the relevance score.

Working it by hand

Fix the eight-word vocabulary $[\textit{neural}, \textit{network}, \textit{deep}, \textit{learning}, \textit{recipe}, \textit{oven}, \textit{bake}, \textit{sugar}]$ — four machine-learning words and four baking words. Here are four documents and a query, as word-count vectors:

$$ \begin{aligned} \mathbf{D}_1\ (\text{deep-learning note}) &= (2,2,3,3,0,0,0,0), &\quad \mathbf{D}_2\ (\text{neural-nets note}) &= (3,3,1,2,0,0,0,0),\\ \mathbf{D}_3\ (\text{baking recipe}) &= (0,0,0,0,3,2,2,2), &\quad \mathbf{D}_4\ (\text{ML-meets-baking blog}) &= (1,1,1,1,1,1,1,0),\\ \mathbf{q}\ (\text{query: ``deep learning networks''}) &= (1,1,2,2,0,0,0,0). && \end{aligned} $$

To rank the documents against the query, compute $\operatorname{cossim}(\mathbf{q},\mathbf{D}_i)$ for each. Take $\mathbf{D}_1$ first. The dot product is $\mathbf{q}\cdot\mathbf{D}_1=1\cdot2+1\cdot2+2\cdot3+2\cdot3+0+0+0+0=2+2+6+6=16$. The norms are $\lVert\mathbf{q}\rVert=\sqrt{1+1+4+4}=\sqrt{10}$ and $\lVert\mathbf{D}_1\rVert=\sqrt{4+4+9+9}=\sqrt{26}$. So

$$ \operatorname{cossim}(\mathbf{q},\mathbf{D}_1)=\frac{16}{\sqrt{10}\,\sqrt{26}}=\frac{16}{\sqrt{260}}\approx 0.9923 . $$

A cosine very close to $1$ — the query and the deep-learning note point almost the same way, as they should. Now the baking recipe $\mathbf{D}_3$: the query has zeros in every baking slot and $\mathbf{D}_3$ has zeros in every ML slot, so every product in $\mathbf{q}\cdot\mathbf{D}_3$ is zero. The dot product is $0$, the cosine similarity is $0$ — the query and the recipe are orthogonal. They share no vocabulary, so they share no direction, and the geometry reports "completely unrelated" with a perfect right angle. The chapter's orthogonality condition (§18.5) is doing real work: disjoint vocabularies are perpendicular vectors.
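The two remaining documents follow the same recipe. As a check on the arithmetic, using the vectors defined above and $\lVert\mathbf{q}\rVert=\sqrt{10}$:

$$ \mathbf{q}\cdot\mathbf{D}_2=1\cdot3+1\cdot3+2\cdot1+2\cdot2=12,\qquad \lVert\mathbf{D}_2\rVert=\sqrt{9+9+1+4}=\sqrt{23},\qquad \operatorname{cossim}(\mathbf{q},\mathbf{D}_2)=\frac{12}{\sqrt{10}\,\sqrt{23}}=\frac{12}{\sqrt{230}}\approx 0.7913 . $$

$$ \mathbf{q}\cdot\mathbf{D}_4=1\cdot1+1\cdot1+2\cdot1+2\cdot1=6,\qquad \lVert\mathbf{D}_4\rVert=\sqrt{7},\qquad \operatorname{cossim}(\mathbf{q},\mathbf{D}_4)=\frac{6}{\sqrt{10}\,\sqrt{7}}=\frac{6}{\sqrt{70}}\approx 0.7171 . $$

Both scores fall strictly between the near-perfect match of $\mathbf{D}_1$ and the orthogonal $\mathbf{D}_3$.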

Verifying and ranking in code

# Ranking documents against a query by cosine similarity (the search-engine core).
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; both must be nonzero vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# vocab: [neural, network, deep, learning, recipe, oven, bake, sugar]
docs = {
    "D1 deep-learning note": np.array([2, 2, 3, 3, 0, 0, 0, 0]),
    "D2 neural-nets note":   np.array([3, 3, 1, 2, 0, 0, 0, 0]),
    "D3 baking recipe":      np.array([0, 0, 0, 0, 3, 2, 2, 2]),
    "D4 ML-meets-baking":    np.array([1, 1, 1, 1, 1, 1, 1, 0]),
}
query = np.array([1, 1, 2, 2, 0, 0, 0, 0])   # "deep learning networks"

ranked = sorted(docs.items(), key=lambda kv: -cosine_similarity(query, kv[1]))
for name, vec in ranked:
    print(f"{name:24s}  cossim = {cosine_similarity(query, vec):.4f}")
# D1 deep-learning note     cossim = 0.9923
# D2 neural-nets note       cossim = 0.7913
# D4 ML-meets-baking        cossim = 0.7171
# D3 baking recipe          cossim = 0.0000

The ranking is exactly what intuition demands: the deep-learning note ($0.9923$) clearly outranks the neural-nets note ($0.7913$) because the query emphasized "deep" and "learning" (count $2$ each), which $\mathbf{D}_1$ matches more strongly; the mixed ML-and-baking blog ($0.7171$) is moderately relevant because four of its seven counted words are on-topic; and the pure baking recipe scores a flat $0.0000$, orthogonal to the query. The engine would return $\mathbf{D}_1$, then $\mathbf{D}_2$, then $\mathbf{D}_4$, and never show $\mathbf{D}_3$. Every number traces back to one operation: the dot product, normalized.

What TF-IDF adds, and why it does not change the geometry

Raw word counts have an obvious flaw: common words like "the," "is," and "of" appear in every document and dominate the counts, swamping the rare words that actually distinguish topics. The standard repair is TF-IDF weighting (term frequency times inverse document frequency), which multiplies each count by a factor that shrinks toward zero for words appearing in many documents and stays large for words appearing in few. The effect is to re-weight the components of each document vector before comparison — a word that appears everywhere contributes little to the direction, a distinctive word contributes a lot. Crucially, this is still just a vector in $\mathbb{R}^V$, and documents are still compared by cosine similarity. TF-IDF changes which directions in word space matter, not the operation that measures alignment. Our small counts already behave like a crude TF-IDF, because the vocabulary was chosen to contain only discriminating words — no "the" to drown out "neural." The lesson generalizes: every refinement of how text becomes a vector, from raw counts to TF-IDF to learned embeddings, feeds into the same cosine-similarity comparison, which is exactly why this chapter's geometry is the durable, transferable skill.
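To make the re-weighting concrete, here is a rough sketch that applies one TF-IDF variant to the four documents of the worked example. The IDF formula $\log(N/\mathrm{df})$ is an assumption chosen for simplicity; real libraries typically use smoothed versions, so their numbers would differ. The comparison step is the same `cosine_similarity` as before.

```python
import numpy as np

# Rows are D1..D4 from the worked example; columns follow the vocabulary
# [neural, network, deep, learning, recipe, oven, bake, sugar].
counts = np.array([
    [2, 2, 3, 3, 0, 0, 0, 0],
    [3, 3, 1, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 3, 2, 2, 2],
    [1, 1, 1, 1, 1, 1, 1, 0],
], dtype=float)
query = np.array([1, 1, 2, 2, 0, 0, 0, 0], dtype=float)
names = ["D1", "D2", "D3", "D4"]

# Assumed IDF variant: log(N / df), with no smoothing.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each word
idf = np.log(n_docs / doc_freq)       # every word occurs somewhere, so df > 0


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Re-weight every component, then compare exactly as with raw counts.
weighted_docs = counts * idf
weighted_query = query * idf

scores = [cosine_similarity(weighted_query, d) for d in weighted_docs]
for name, score in zip(names, scores):
    print(f"{name}  cossim = {score:.4f}")
```

In this tiny corpus the four ML words each occur in three of the four documents, so they receive a smaller weight than the baking words. The scores for $\mathbf{D}_1$ and $\mathbf{D}_2$ are unchanged (all their nonzero components are scaled by the same factor), the mixed blog $\mathbf{D}_4$ drops to roughly $0.41$, and the ranking order stays the same. The weights moved; the operation that measures alignment did not.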

From word counts to embeddings: the same geometry, learned directions

Bag-of-words vectors are sparse and literal: two documents are orthogonal the instant they share no exact words, even if one says "physician" and the other "doctor." Modern systems fix this with embeddings — dense vectors of a few hundred numbers, learned by a neural network so that words and documents used in similar contexts get vectors pointing in similar directions, whether or not they share surface words. "Physician" and "doctor" land at a small angle; "deep learning" and "neural networks" land at a small angle. But — and this is the point — the comparison step does not change at all. You still rank by cosine similarity, still compute $\frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\lVert\mathbf{v}\rVert}$, still read a number in $[-1,1]$. The vectors got smarter; the geometry of §18.9 stayed the same.

This is why cosine similarity is the single most-deployed formula of the chapter. When a chatbot retrieves relevant context to answer your question (retrieval-augmented generation), it embeds your question and the candidate passages and keeps the highest-cosine matches. When a "find similar" button surfaces related articles, it is returning nearest neighbors by angle. The famous word-vector arithmetic, $\mathbf{king}-\mathbf{man}+\mathbf{woman}\approx\mathbf{queen}$, is checked by cosine similarity: the result vector is closest in angle to $\mathbf{queen}$. The broader menu of similarity measures includes alternatives, but in embedding spaces cosine is king, precisely because direction encodes meaning and length usually just encodes incidental magnitude.
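The "keep the highest-cosine matches" step can be sketched as a single matrix product. This is a simplified illustration, not a description of any particular retrieval system: the helper name `top_k_by_cosine` is hypothetical, and the case study's count vectors stand in for learned embeddings, which this document does not supply.

```python
import numpy as np


def top_k_by_cosine(query, matrix, k):
    """Return (indices, scores) of the k rows of `matrix` best aligned with `query`.

    Assumes the query and every row are nonzero vectors.
    """
    # Normalize to unit length; the dot product of unit vectors is their cosine.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_rows @ unit_query
    order = np.argsort(-scores)[:k]   # highest cosine first
    return order, scores[order]


# Stand-in "passage vectors": D1..D4 from the worked example.
passages = np.array([
    [2, 2, 3, 3, 0, 0, 0, 0],
    [3, 3, 1, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 3, 2, 2, 2],
    [1, 1, 1, 1, 1, 1, 1, 0],
], dtype=float)
query = np.array([1, 1, 2, 2, 0, 0, 0, 0], dtype=float)

idx, scores = top_k_by_cosine(query, passages, k=2)
print(idx, np.round(scores, 4))  # [0 1] [0.9923 0.7913]
```

Normalizing once and then taking dot products is a common design choice when the same collection is searched many times, since the row norms do not need to be recomputed per query.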

There is even a high-dimensional reason cosine works so well as a signal detector, straight from §18.6. In a 768-dimensional embedding space, two unrelated vectors are nearly orthogonal — their cosine hovers near $0$ — simply because random high-dimensional vectors almost always are. So a genuinely relevant document, with a cosine of $0.7$ or $0.9$, stands out sharply against a background of near-zero scores from everything irrelevant. The curse of dimensionality becomes a blessing: the noise floor sits near $90^\circ$, and real matches rise far above it.
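The near-orthogonality claim is easy to probe with a small simulation. One loud assumption: "unrelated" is modeled here as independent standard-normal vectors, which is an idealization; real embedding vectors are not distributed this way, so their background cosines may sit somewhat above zero.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
dim = 768
n_pairs = 2000

# Assumption: unrelated vectors are modeled as independent Gaussian draws.
u = rng.standard_normal((n_pairs, dim))
v = rng.standard_normal((n_pairs, dim))

# Row-wise cosine similarity for each of the n_pairs pairs.
cosines = np.sum(u * v, axis=1) / (
    np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
)

print(f"mean |cos| = {np.abs(cosines).mean():.3f}")
print(f"max  |cos| = {np.abs(cosines).max():.3f}")
```

Under this model the cosines cluster tightly around $0$, with a typical magnitude on the order of $1/\sqrt{768}\approx 0.036$, so a score of $0.7$ is far outside what chance alignment produces.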

The lesson

A search engine's relevance ranking, a chatbot's context retrieval, and a "more like this" recommendation are all the same geometric act: represent text as vectors, then sort by cosine similarity — the cosine of the angle between query and document. The dot product supplies the alignment, the two norms cancel out length so that a short abstract and a long paper on one topic score alike, orthogonality ($\cos=0$) flags documents that share nothing, and Cauchy–Schwarz guarantees every score is a comparable number in $[-1,1]$. The arrows you measured in the plane in §18.6 are, in a few hundred dimensions, how machines decide what you meant. And the very same projection idea that powers this — keeping the component of one vector along another — becomes, in Chapter 19, the orthogonal projection that solves least squares. Direction is meaning; the dot product reads it off.