Case Study 2 — Meaning as Direction: How Word and Image Embeddings Turn Concepts into Vectors

DataField.Dev

Field: natural language processing & computer vision (data science).

Concepts used: embeddings as learned vectors, cosine similarity (Chapter 18), the dot product as alignment, nearest-neighbor search, vector arithmetic on concepts, normalization and the angle–distance link.

Why it matters: embeddings are the connective tissue of modern AI: they are how text, images, audio, and users are fed into systems that only multiply matrices. Search engines, translation, face recognition, duplicate detection, and the retrieval behind question-answering all rest on placing meaning in a vector space and reading it back with an angle.

The core idea: a dictionary you can do geometry on

A computer cannot multiply a matrix by the word "cat" or by a photograph. Something must turn these objects into vectors first, and the something is an embedding: a learned map that assigns every word, image, or item a point in a high-dimensional space — a few hundred real numbers — arranged so that geometric relationships mirror semantic relationships. Things that mean similar things get vectors pointing in nearly the same direction; unrelated things get vectors at large angles. The meaning is not stored as a definition; it is stored as a position, and the position is everything.

This is the place Chapter 18's cosine similarity stops being a geometry exercise and becomes the central tool of an industry. Recall that the cosine of the angle between two vectors, $$ \cos\theta = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}, $$ measures alignment independent of length: $+1$ same direction, $0$ perpendicular, $-1$ opposite. In an embedding space this one number is a meter for relatedness. "How similar are these two words?" becomes "what is the cosine of the angle between their vectors?" — a single dot product and two norms. Everything downstream is built on it.
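The three reference values of the cosine can be checked numerically. The sketch below is only an illustration: the 2-D vector `u` is an arbitrary choice, and `cosine` is a small helper written for this example.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([3.0, 4.0])  # arbitrary illustrative vector

same = cosine(u, 2.5 * u)                          # same direction, different length
perpendicular = cosine(u, np.array([-4.0, 3.0]))   # dot product is zero
opposite = cosine(u, -u)                           # reversed direction

print(same, perpendicular, opposite)
```

Scaling `u` by 2.5 leaves the first value at 1, which is the "independent of length" property in action.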

Word embeddings: the company a word keeps

The first great embeddings were for words. A model such as word2vec (Mikolov and colleagues, 2013) starts every word with a random vector and reads billions of words of text, nudging each word's vector to better predict the words around it. The linguistic principle is old — "you shall know a word by the company it keeps" — and the geometric consequence is striking: words that appear in similar contexts drift toward similar vectors. After training, "Monday" sits near "Tuesday," "happy" near "joyful," and "Paris" near "London," because each pair keeps similar company.

The dimensions of this space are not human-labeled. No one tells the model "axis $42$ is formality" or "axis $7$ is gender"; the axes are whatever directions the training discovers, and an individual coordinate rarely means anything tidy to a person. What is interpretable, remarkably, is certain directions — consistent difference vectors that encode relationships. The displacement from a country to its capital is roughly constant ($\text{Paris}-\text{France}\approx\text{Rome}-\text{Italy}$); so is singular-to-plural, and present-to-past tense. This consistency is exactly what makes the famous analogy arithmetic work: $$ \text{king} - \text{man} + \text{woman} \approx \text{queen}. $$ Subtracting "man" removes the male-human component; adding "woman" attaches the female-human component; the royalty component rides along untouched. The result lands near "queen" because the "gender" direction is (approximately) a fixed vector you can add to any word. In the chapter's hand-built toy the equality is exact; in a real learned embedding it is approximate — hence the "$\approx$" — and the nearest neighbor is found by an actual cosine search that is sometimes wrong. But the mechanism is genuine: a relationship is a direction, and an analogy is vector addition.
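The analogy arithmetic can be sketched on a small hand-built vocabulary. Everything here is an assumption made for illustration: the coordinates (royalty, male, female, fruit), the six words, and the helper `analogy` are hypothetical and are not the chapter's own toy. Skipping the three input words during the search is a common convention, since otherwise one of the inputs can turn out to be the nearest vector.

```python
import numpy as np

# Hypothetical toy coordinates: (royalty, male, female, fruit).
vocab = {
    "king":   np.array([1.0, 1.0, 0.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0, 0.0]),
    "man":    np.array([0.0, 1.0, 0.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0, 0.0]),
    "prince": np.array([0.6, 1.0, 0.0, 0.0]),
    "apple":  np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vocab):
    """Word nearest by cosine to vocab[a] - vocab[b] + vocab[c], skipping a, b, c."""
    target = vocab[a] - vocab[b] + vocab[c]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vocab[w]))

answer = analogy("king", "man", "woman", vocab)
print(answer)  # queen
```

In this toy the target vector equals the "queen" vector exactly; in a learned embedding the target would only land near it, and the cosine search would do the rest.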

A nearest-neighbor query makes the geometry concrete. "Find words similar to ocean" means "find the vectors at the smallest angle to the ocean vector" — sort all words by cosine similarity and take the top few. This is the engine of semantic search: your query becomes a vector, and the system returns documents whose embeddings align best with it, matching by meaning rather than by exact keyword. The same query answers "related products," "similar songs," and "is this a duplicate question?"
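A minimal sketch of such a query, assuming a tiny hand-assigned vocabulary: the coordinates (water, vastness, machinery) and the helper `nearest` are hypothetical, chosen only to make the ranking easy to follow.

```python
import numpy as np

# Hypothetical toy coordinates: (water, vastness, machinery).
words = {
    "ocean":  np.array([0.9, 0.9, 0.0]),
    "sea":    np.array([0.9, 0.7, 0.0]),
    "lake":   np.array([0.8, 0.2, 0.0]),
    "desert": np.array([0.0, 0.9, 0.1]),
    "engine": np.array([0.1, 0.0, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(query, words, k=3):
    """Return the k words with the highest cosine similarity to the query word."""
    scored = [(w, cosine(words[query], vec)) for w, vec in words.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

top = nearest("ocean", words)
print([w for w, _ in top])  # ['sea', 'lake', 'desert']
```

A real system would run the same sort over learned vectors for a full vocabulary or document collection rather than five hand-made ones.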

Image embeddings: the same trick, different sense

The identical idea works for pictures, which is where it feels almost magical. A vision model maps each image to an embedding vector — a few hundred numbers summarizing its visual content — trained so that images that look alike, or contain the same kind of thing, get nearby vectors. Two photos of the same cat land at a tiny angle; a cat and a dog land fairly close (both furry mammals); a cat and a car land nearly perpendicular (nothing in common). Watch it on a transparent toy where the four coordinates stand for visual features — roughly (fur, whiskers, wheels, metal):

# Image embeddings: cosine similarity ranks visual relatedness.
import numpy as np
emb = {
    "cat_photo1": np.array([0.90, 0.80, 0.00, 0.10]),   # (fur, whiskers, wheels, metal)
    "cat_photo2": np.array([0.85, 0.90, 0.05, 0.00]),
    "dog_photo":  np.array([0.80, 0.30, 0.00, 0.10]),
    "car_photo":  np.array([0.00, 0.00, 0.95, 0.90]),
}
def cosine(u, v):
    # Cosine of the angle between u and v: dot product divided by the product of the norms.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print("cos(cat1, cat2) =", round(cosine(emb["cat_photo1"], emb["cat_photo2"]), 3))  # 0.992
print("cos(cat1, dog)  =", round(cosine(emb["cat_photo1"], emb["dog_photo"]),  3))  # 0.933
print("cos(cat1, car)  =", round(cosine(emb["cat_photo1"], emb["car_photo"]),  3))  # 0.057

The numbers rank relatedness exactly as your eye would: the two cat photos are nearly the same direction ($0.992$), a cat and a dog are still close ($0.933$ — both are furry, whiskered-ish animals), and a cat and a car are essentially orthogonal ($0.057$). Real image embeddings live in hundreds of dimensions and are learned from millions of photos rather than hand-assigned, but the readout is this cosine, unchanged. Face recognition is the same computation with a threshold: embed two face photos, and if their cosine similarity exceeds a learned cutoff, declare them the same person. Reverse image search is a nearest-neighbor query in image-embedding space. Content moderation flags an upload whose embedding is too close to a known prohibited image. One angle, many products.
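The threshold decision can be sketched with the toy vectors from the example above. The cutoff of 0.95 and the helper name `is_match` are assumptions made for illustration; a deployed system would learn its cutoff from labeled pairs.

```python
import numpy as np

def is_match(u, v, cutoff):
    """Declare a match when the cosine similarity of u and v exceeds the cutoff."""
    cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return cos > cutoff

# Toy vectors from the example above: (fur, whiskers, wheels, metal).
cat_photo1 = np.array([0.90, 0.80, 0.00, 0.10])
cat_photo2 = np.array([0.85, 0.90, 0.05, 0.00])
dog_photo = np.array([0.80, 0.30, 0.00, 0.10])

CUTOFF = 0.95  # hypothetical value, not a recommended setting

cats_match = is_match(cat_photo1, cat_photo2, CUTOFF)    # cosine 0.992
cat_dog_match = is_match(cat_photo1, dog_photo, CUTOFF)  # cosine 0.933
print(cats_match, cat_dog_match)  # True False
```

Lowering the cutoff below 0.933 would merge the cat and the dog into one "identity", which is why the cutoff has to be tuned rather than guessed.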

The deepest modern twist is that text and images can be embedded into the same space, so that the vector for the word "cat" lands near the vectors for photos of cats. Then "find images matching this caption" becomes a cross-modal cosine search — text vector against image vectors — which is how you search a photo library by typing a description. The space is shared; the meter is still the angle.

One practical wrinkle is worth naming, because it is where linear algebra meets engineering. Computing the cosine of a query against every vector in a catalog of a billion items, one dot product at a time, is too slow for an interactive search. So systems precompute the item embeddings, normalize them, and store them in approximate nearest-neighbor indexes that find the highest-cosine matches without scanning everything. The quantity being approximated is still Chapter 18's cosine, and the operation at the bottom is still the dot product. Often the whole catalog of embeddings is stacked as the rows of one large matrix $E$, and scoring a query $\mathbf{q}$ against all items at once is the single matrix-vector product $E\mathbf{q}$ (Chapter 8): every entry is a dot product, and when the rows of $E$ and the query are unit length, every entry is exactly a cosine, so the entire similarity ranking comes out of one multiply. The geometry sets the goal; the matrix algebra makes it fast.
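The matrix form can be sketched with the four toy photo vectors from earlier, stacked as rows. This is an exact scan rather than an approximate index, and the variable names (`E_unit`, `scores`, `ranking`) are chosen for this example only.

```python
import numpy as np

names = ["cat_photo1", "cat_photo2", "dog_photo", "car_photo"]
E = np.array([
    [0.90, 0.80, 0.00, 0.10],   # (fur, whiskers, wheels, metal)
    [0.85, 0.90, 0.05, 0.00],
    [0.80, 0.30, 0.00, 0.10],
    [0.00, 0.00, 0.95, 0.90],
])

# Normalize every row once, ahead of time, so dot products become cosines.
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)

q = E_unit[0]          # query: the first cat photo, already unit length
scores = E_unit @ q    # one matrix-vector product scores the whole catalog

ranking = [names[i] for i in np.argsort(-scores)]
print(np.round(scores, 3))  # [1.    0.992 0.933 0.057]
print(ranking)
```

The scores reproduce the pairwise cosines computed one at a time earlier, with the query matching itself at 1.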

Why the angle, and a caution about what it really measures

Two practical points round out the picture, both straight from Chapter 18. First, why the angle and not the distance? Because in most embedding spaces the direction carries the meaning while the length carries something incidental, such as how frequent or how strongly activated the item is. Cosine divides out both lengths, so it judges meaning, not magnitude. When vectors are normalized to unit length (a common step), cosine and Euclidean distance agree on the ranking of neighbors, via the law-of-cosines identity $\lVert\hat{\mathbf{u}}-\hat{\mathbf{v}}\rVert^2 = 2 - 2\cos\theta$: the distance shrinks exactly as the cosine grows. Systems therefore often normalize first and then use fast distance-based search structures while still ranking by meaning.
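The identity follows from expanding the squared norm for unit vectors: $\lVert\hat{\mathbf{u}}-\hat{\mathbf{v}}\rVert^2 = \hat{\mathbf{u}}\cdot\hat{\mathbf{u}} - 2\,\hat{\mathbf{u}}\cdot\hat{\mathbf{v}} + \hat{\mathbf{v}}\cdot\hat{\mathbf{v}} = 2 - 2\cos\theta$. A quick numerical check, assuming random vectors as stand-ins for real embeddings (the sizes and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability

# Random stand-ins for embeddings, normalized to unit length.
items = rng.normal(size=(50, 8))
items /= np.linalg.norm(items, axis=1, keepdims=True)
query = rng.normal(size=8)
query /= np.linalg.norm(query)

cos = items @ query                               # cosine to each item
sq_dist = np.sum((items - query) ** 2, axis=1)    # squared Euclidean distance

identity_holds = np.allclose(sq_dist, 2 - 2 * cos)

# Highest cosine first versus smallest distance first.
by_cosine = np.argsort(-cos)
by_distance = np.argsort(sq_dist)
same_ranking = np.array_equal(by_cosine, by_distance)

print(identity_holds, same_ranking)
```

Without the normalization step the two orderings can disagree, because a long vector can be far away in distance while still pointing in nearly the same direction.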

Second, a caution that matters. An embedding learns the statistics of its training data, not objective truth, and it faithfully reproduces whatever associations live in that data — including human biases. Gender, racial, and occupational stereotypes show up as real, measurable directions in word-embedding space, because the model absorbed the patterns of the text it read. The analogy machinery that gives "king − man + woman ≈ queen" will, on biased data, also complete "doctor − man + woman" toward "nurse" — not because the linear algebra is wrong, but because the geometry honestly encodes a biased corpus. Treating embedding geometry as neutral "meaning" rather than as learned association is a documented hazard, and mitigating it is an active area of research. For why a model's internal representations reflect its data rather than ground truth, see how models work; the data-science treatment of representation learning and neural networks develops how these vectors are produced inside larger models.

The throughline

Strip away the application and every case here is one move: place objects in a vector space so that direction encodes meaning, then read meaning back with the cosine of the angle. Words, images, faces, products, even users — all become points, and "similar," "related," "the same," and "analogous to" all become statements about angles and dot products. It is Chapter 18 running the modern world. And it connects straight back to this chapter's anchor: the item factor vectors a recommender learns are embeddings, discovered by the low-rank factorization of Part VI, so "more like this" on a streaming service and "similar words" in a search box are, underneath, the very same geometry.