Case Study 2.2 — Words as Vectors: A First Look at Embeddings in Data Science
Field: data science / natural language processing. Ties to: the list view of a vector and the "vectors as data" idea (§2.1, §2.9), plus vector addition and subtraction (§2.2). This is a preview — the full machinery (dot products, cosine similarity) arrives in Chapter 18 — but the core move is already in your hands: meaning lives in vector arithmetic.
The setting
How does a computer "understand" a word? It cannot read; it can only compute with numbers. So the first thing any modern language system does is turn each word into a vector of numbers — a word embedding — and from then on, every question about words becomes a question about vectors. "Which words are similar?" becomes "which vectors are close?" "What is the relationship between king and queen?" becomes "what vector connects them?" The arrow picture is hopeless here, because real embeddings live in 100 to 1000 dimensions, but the list picture — a vector is just an ordered tuple of numbers you can add and scale — is exactly right, and it is the reason §2.9 insisted you take the list view seriously even when you cannot draw it.
The remarkable empirical discovery, around 2013, was that if you learn these vectors well (from billions of words of text), relationships between words show up as consistent directions in the vector space, and you can do arithmetic with meaning. The most famous example: take the vector for king, subtract the vector for man, add the vector for woman, and you land very close to the vector for queen. The "male → female" relationship is a direction you can add. The precise behavior depends on the embedding model and the training corpus: the king/queen analogy is the canonical illustration, and in real systems it holds approximately, not perfectly.
We will demonstrate the mechanism on a tiny, hand-built 2D example so you can verify every number, then explain how it scales.
A toy in two dimensions
Real embeddings have hundreds of components; for a verifiable demonstration we hand-pick four vectors in $\mathbb{R}^2$ arranged so the analogy works exactly. Imagine our two coordinates loosely encode "maleness vs. femaleness" (horizontal: each male word sits two units to the right of its female counterpart) and "royalty" (vertical: the royal words sit at height 3, the commoners at height 1), though in real systems the axes have no such clean labels.
$$ \mathbf{king} = \begin{bmatrix} 4 \\ 3 \end{bmatrix}, \quad \mathbf{man} = \begin{bmatrix} 3 \\ 1 \end{bmatrix}, \quad \mathbf{woman} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \mathbf{queen} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}. $$
Now compute the analogy "king is to man as ? is to woman" by the standard recipe — subtract man, add woman:
$$ \mathbf{king} - \mathbf{man} + \mathbf{woman} = \begin{bmatrix} 4 \\ 3 \end{bmatrix} - \begin{bmatrix} 3 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} = \mathbf{queen}. $$
It lands exactly on queen. Read the arithmetic geometrically: $\mathbf{king} - \mathbf{man}$ is the displacement vector $\begin{bmatrix} 1 \\ 2 \end{bmatrix}$ that means, roughly, "strip away maleness, keep royalty." Adding that same displacement to woman moves it to queen. The relationship is a direction, and applying it is vector addition — the exact operation from §2.2, now carrying semantic weight.
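As a cross-check that uses only the four vectors above, the same displacement appears on the other side of the analogy, and the step from man to woman matches the step from king to queen:

$$ \mathbf{queen} - \mathbf{woman} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \mathbf{king} - \mathbf{man}, \qquad \mathbf{woman} - \mathbf{man} = \begin{bmatrix} -2 \\ 0 \end{bmatrix} = \mathbf{queen} - \mathbf{king}. $$

In other words, the four words sit at the corners of a parallelogram, which is what it means for the analogy to hold exactly.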
# Word analogy as vector arithmetic (toy 2D embeddings, exact for illustration).
import numpy as np
king = np.array([4.0, 3.0])
man = np.array([3.0, 1.0])
woman = np.array([1.0, 1.0])
queen = np.array([2.0, 3.0])
result = king - man + woman
print("king - man + woman =", result) # [2. 3.]
print("matches 'queen'? ", np.allclose(result, queen)) # True
The output [2. 3.] and True confirm the analogy resolves to queen exactly in this constructed example. In a real embedding the result would be near (not exactly on) the queen vector, and the system would return the closest actual word vector — but the operation is identical: add and subtract vectors.
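The "return the closest actual word vector" step can be sketched as follows. This is a minimal illustration, not how any particular library does it: the helper `nearest_word`, the two extra words (`child`, `crown`) with their coordinates, and the small perturbation added to mimic an imperfect real embedding are all assumptions made up for this example.

```python
import numpy as np

# Toy vocabulary: the four vectors from the example plus two hypothetical
# extra words with made-up coordinates (for illustration only).
vocab = {
    "king":  np.array([4.0, 3.0]),
    "man":   np.array([3.0, 1.0]),
    "woman": np.array([1.0, 1.0]),
    "queen": np.array([2.0, 3.0]),
    "child": np.array([2.0, 0.5]),   # hypothetical
    "crown": np.array([3.0, 4.0]),   # hypothetical
}

def nearest_word(target, vocab, exclude=()):
    """Return (word, distance) for the vocabulary vector closest to `target`."""
    best_word, best_dist = None, float("inf")
    for word, vec in vocab.items():
        if word in exclude:
            continue
        dist = float(np.linalg.norm(vec - target))  # magnitude of the difference
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word, best_dist

# Nudge the exact result slightly so it no longer sits on any word vector.
noisy_result = vocab["king"] - vocab["man"] + vocab["woman"] + np.array([0.2, -0.1])

word, dist = nearest_word(noisy_result, vocab, exclude=("king", "man", "woman"))
print(word, round(dist, 4))  # queen 0.2236
```

Excluding the three input words is a common convention in analogy lookups, since in real embeddings the nearest vector to the result is often one of the inputs themselves.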
Measuring "close": a peek at Chapter 18
Once words are vectors, "similarity" needs a definition. One natural choice uses the magnitude of the difference: words whose vectors are close (small $\lVert \mathbf{a} - \mathbf{b}\rVert$) are similar. Consider a tiny document-search example: three documents represented by counts of three keywords, and a query.
$$ \mathbf{A} = \begin{bmatrix} 5 \\ 0 \\ 1 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 4 \\ 1 \\ 0 \end{bmatrix}, \quad \mathbf{C} = \begin{bmatrix} 0 \\ 5 \\ 4 \end{bmatrix}, \quad \mathbf{q} = \begin{bmatrix} 5 \\ 0 \\ 0 \end{bmatrix}. $$
Which document best matches the query $\mathbf{q}$? Compute the distance (magnitude of the difference) from the query to each:
$$ \lVert \mathbf{A} - \mathbf{q}\rVert = \sqrt{0 + 0 + 1} = 1, \quad \lVert \mathbf{B} - \mathbf{q}\rVert = \sqrt{1 + 1 + 0} = \sqrt{2} \approx 1.41, \quad \lVert \mathbf{C} - \mathbf{q}\rVert = \sqrt{25 + 25 + 16} \approx 8.12. $$
# Nearest-document search: smaller difference-magnitude = better match.
import numpy as np
docs = {"A": np.array([5,0,1]), "B": np.array([4,1,0]), "C": np.array([0,5,4])}
q = np.array([5, 0, 0])
for name, d in docs.items():
    # Euclidean distance from the query to this document
    print(name, round(float(np.linalg.norm(d - q)), 4))
# Prints, one per line: A 1.0, B 1.4142, C 8.124
The output (A 1.0, B 1.4142, C 8.124) ranks document A as the closest match to the query and document C as wildly off-topic — which matches intuition, since A's keyword profile most resembles the query's. This nearest-neighbor idea — represent items as vectors, then rank by distance — is the backbone of search engines, recommendation systems, and the "find similar images" button, and it is built entirely on the magnitude of a difference vector from §2.5.
A subtlety worth flagging now and resolving in Chapter 18: distance is not the only notion of similarity, and often not the best one for text. Two documents about the same topic but of very different lengths can be far apart in raw distance yet point in nearly the same direction. Measuring the angle between vectors (cosine similarity) ignores length and captures direction alone, which is usually what we want for words. We do not have the tool for angles yet — it needs the dot product — but you can already see why we'll want it: this chapter gives magnitude (length); Chapter 18 adds angle, and together they complete the geometry of vectors.
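The length problem can be seen with nothing more than scaling and magnitude. In the sketch below, the tripled document is hypothetical, and rescaling each vector to unit length is a stand-in chosen for this illustration; it is not the cosine similarity that Chapter 18 develops, only a way to compare direction with the tools already available.

```python
import numpy as np

def unit(v):
    """Scale v to length 1 (assumes v is not the zero vector)."""
    return v / np.linalg.norm(v)

short_doc = np.array([5.0, 0.0, 1.0])  # document A from the search example
long_doc = 3 * short_doc               # hypothetical: same keyword mix, three times as long

# Raw distance treats the two as very different...
raw_distance = float(np.linalg.norm(long_doc - short_doc))
# ...but after rescaling to unit length they coincide.
unit_distance = float(np.linalg.norm(unit(long_doc) - unit(short_doc)))

print(round(raw_distance, 4))   # 10.198
print(round(unit_distance, 4))  # 0.0
```

The raw distance of about 10.2 is larger than the distance from the query to the off-topic document C, even though the two vectors point in exactly the same direction.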
Why this matters
The leap here is conceptual, not computational: once data is represented as vectors, the operations of this chapter become operations on meaning. Adding and subtracting word vectors composes and decomposes relationships; the magnitude of a difference measures dissimilarity; averaging document vectors (a linear combination, §2.6) summarizes a collection. None of this requires anything beyond add, scale, and magnitude — the trio you implemented in toolkit/vectors.py. What changes is the interpretation: the arrows now live in a space of meaning rather than physical space.
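As a small illustration of averaging as a linear combination, the sketch below reuses documents A, B, and C from the search example. The equal weights and the name `centroid` are assumptions made for this illustration.

```python
import numpy as np

docs = {
    "A": np.array([5.0, 0.0, 1.0]),
    "B": np.array([4.0, 1.0, 0.0]),
    "C": np.array([0.0, 5.0, 4.0]),
}

# The average is the linear combination (1/3)A + (1/3)B + (1/3)C.
weights = {name: 1.0 / len(docs) for name in docs}
centroid = sum(weights[name] * vec for name, vec in docs.items())

# Components are approximately 3, 2, and 1.6667.
print(np.round(centroid, 4))

# Same result via NumPy's built-in mean over the stacked vectors.
print(np.allclose(centroid, np.mean(list(docs.values()), axis=0)))  # True
```

Only scaling and addition are involved, so the summary vector lives in the same space as the documents and can be compared with a query by the same distance computation.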
This is also a clean illustration of the book's recurring theme that linear algebra is the most applied branch of mathematics. The same vector arithmetic that steered a game character in Case Study 2.1 here powers a search engine and previews how large language models represent text. The embeddings themselves are learned by an optimization process that repeatedly nudges the vectors so that words used in similar contexts end up close together; the analogy structure emerges as a by-product. That process runs on the gradient (calculus) of a loss measured in this very vector space, and the connection between moving along vectors and the derivatives that guide the movement is developed in the material on vectors in calculus. For now, the headline is enough: feed a computer vectors, and the geometry of $\mathbb{R}^n$ (addition, scaling, distance) quietly becomes a theory of similarity and analogy. Chapters 18 and 33 return to make this rigorous and to push it to full-scale machine learning.