> Learning paths. Math majors — read everything; the collapse theorem in §33.4 (a stack of linear layers is one matrix) is the cleanest proof in the chapter, and the Math-Major Sidebar on the SVD-versus-gradient view of matrix factorization ties the...
Prerequisites
- chapter-30-singular-value-decomposition
- chapter-18-dot-products-and-norms
Learning Objectives
- Explain why a neural-network layer is a matrix multiplication plus a bias plus a nonlinearity, a = σ(Wx + b), and identify each piece geometrically.
- Show that stacking layers composes transformations (Chapter 8), and prove that WITHOUT the nonlinearity any stack of linear layers collapses to a single matrix.
- Interpret embeddings as learned vectors in which geometry encodes meaning, and use cosine similarity (Chapter 18) to measure relatedness, including the king − man + woman ≈ queen analogy.
- Describe a recommendation system as a low-rank matrix factorization R ≈ UVᵀ (Chapters 30–31), and explain how it predicts missing ratings from observed ones.
- Run a forward pass through a tiny network by hand and confirm it against numpy, and fit a small matrix-factorization recommender that recovers known ratings and predicts unknown ones.
- Describe, at a high level, why training is the matrix calculus of gradients (backpropagation) and how it connects to the rest of this book.
In This Chapter
- 33.1 Why is "linear algebra in machine learning" almost the whole story?
- 33.2 What is a neural-network layer, geometrically?
- 33.3 How do we compute a forward pass, by hand and in numpy?
- 33.4 Why does the nonlinearity matter — what happens without it?
- 33.5 What is an embedding, and why is geometry the same as meaning?
- 33.6 How does "king − man + woman ≈ queen" actually work?
- 33.7 What is a recommendation system, geometrically?
- 33.8 How does matrix factorization predict missing ratings?
- 33.9 A worked recommender: factoring a small rating matrix
- 33.10 How are the weights learned? Training in one breath
- 33.11 Build it from scratch: a matrix-factorization recommender
- 33.12 What should you carry forward from this chapter?
Application: Machine Learning — How Neural Networks, Embeddings, and Recommendation Systems Use Linear Algebra
Learning paths. Math majors — read everything; the collapse theorem in §33.4 (a stack of linear layers is one matrix) is the cleanest proof in the chapter, and the Math-Major Sidebar on the SVD-versus-gradient view of matrix factorization ties the recommender back to Chapter 30. CS / Data Science — this is the chapter the whole book has been pointing at: focus on the forward-pass computation in §33.3, the embeddings geometry in §33.5–33.6, the matrix-factorization recommender in §33.7–33.9, and both case studies; the numpy is the main event. Physics / Engineering — focus on the geometric pictures (a layer warps space; an embedding places meaning in space; a recommender is low-rank structure in noisy data) and the high-level account of training in §33.10. This chapter closes Part VI by showing that neural networks, embeddings, and recommendation systems all rest on the matrix multiplication of Chapter 8 and the low-rank approximation of Chapters 30–31. It needs the SVD of Chapter 30, cosine similarity from Chapter 18, and the matrix-as-function idea of Chapter 7.
33.1 Why is "linear algebra in machine learning" almost the whole story?
You have spent thirty-two chapters learning that a matrix is a function that transforms space, that every matrix factors as rotate–stretch–rotate, and that keeping only the largest singular values gives the best low-rank approximation to a pile of data. This chapter is the moment all of that turns into the technology reshaping the world. The honest one-sentence summary of modern machine learning is this: it is linear algebra with a little nonlinearity sprinkled between the matrix multiplies. That is not a slogan to make you feel good about the subject — it is a literal description of how neural networks, word embeddings, and recommendation systems actually work, and this chapter will make the claim precise enough that you could build small versions of all three with the tools you already own.
Let me name the three systems we will dissect, because they look, from the outside, like utterly different magic. A neural network takes an image and says "cat," or takes a sentence and continues it — and we will see that each of its layers is nothing more than a matrix multiplication $W\mathbf{x}$, plus a shift by a bias vector $\mathbf{b}$, followed by bending the result through a simple nonlinear function. An embedding turns a word, an image, or a user into a vector of a few hundred numbers, and the geometry of those vectors encodes meaning — words with similar meanings sit at small angles, and you can do arithmetic on concepts. A recommendation system decides which movie to suggest next, and at its core it is a single low-rank factorization $R \approx UV^{\mathsf{T}}$ of a giant table of who-liked-what — the exact low-rank idea from Chapter 31, applied to a matrix of ratings instead of an image.
The reason to study these together, in the final chapter of Part VI, is that they are not three tricks but one idea seen from three angles: vectors live in space, matrices move them, and the structure worth keeping is low-rank. A neural network moves vectors (Part II); an embedding places them so geometry means something (Part IV, dot products and angles); a recommender factors a data matrix into low rank (Part VI). Once you see that, the field stops being a bag of acronyms and becomes applied linear algebra with very large matrices and a great deal of compute.
The Key Insight — Three of the most consequential systems in modern computing reduce to operations you already understand. A neural-network layer is $\mathbf{a} = \sigma(W\mathbf{x} + \mathbf{b})$ — matrix multiply, then bias, then a nonlinearity. An embedding is a learned vector where angle measures meaning (cosine similarity, Chapter 18). A recommender is a low-rank factorization $R \approx UV^{\mathsf{T}}$ of a rating matrix (the SVD idea of Chapters 30–31). Strip the branding and machine learning is the linear algebra of this book, run at scale.
A word on accuracy before we begin, because this is a chapter where it is easy to overclaim. Real production systems have details we will not capture in a 2×3 toy matrix: convolutions that share weights across an image, attention mechanisms that let a model weigh different words, normalization layers, dropout, and training pipelines that run for weeks on thousands of processors. We will be honest about what our small examples do and do not show. But the linear-algebraic skeleton — multiply by a matrix, add a bias, bend, repeat; place meaning in the geometry of vectors; approximate a data matrix by low rank — is genuinely the load-bearing structure, not a simplification we are imposing. The engineers building these systems think in exactly these terms. For the data-science framing of neural networks as predictive models and a non-technical account of how models work, the two cross-linked treatments complement this chapter's linear-algebra view.
33.2 What is a neural-network layer, geometrically?
Start, as always, with the picture. Picture a layer of a neural network as a machine that reshapes space. It takes in a vector $\mathbf{x}$ — say a list of pixel brightnesses, or a list of numbers summarizing a sentence — and it produces a new vector $\mathbf{a}$, the activations, which become the input to the next layer. The reshaping happens in two stages that you already know cold and one that is new. First the layer applies a linear transformation: it multiplies the input by a weight matrix $W$, exactly the matrix-as-function of Chapter 7. Then it shifts the result by adding a bias vector $\mathbf{b}$ — a translation, sliding the whole output off the origin. Finally — and this is the only genuinely new ingredient in the chapter — it bends each coordinate of the result through a fixed nonlinear function $\sigma$, applied entry by entry.
In symbols, a single layer computes $$ \mathbf{a} = \sigma(W\mathbf{x} + \mathbf{b}). $$ This compact formula is the entire definition of a (fully-connected) neural-network layer, and it deserves to be read slowly. The matrix $W$ is $m \times n$ if the layer takes an $n$-dimensional input to an $m$-dimensional output. The bias $\mathbf{b}$ is a vector in $\mathbb{R}^m$. The function $\sigma$ is a scalar function — it takes one number and returns one number — applied to each of the $m$ entries of the vector $W\mathbf{x} + \mathbf{b}$ independently. The quantity inside, $\mathbf{z} = W\mathbf{x} + \mathbf{b}$, is often called the pre-activation (or "logits" at the final layer); applying $\sigma$ gives the activation $\mathbf{a} = \sigma(\mathbf{z})$.
Geometric Intuition — Watch what each piece does to space. The matrix multiply $W\mathbf{x}$ is the full vocabulary of Part II: it can rotate, stretch, shear, project, and reflect the input space, sending the grid of inputs to a new (possibly squashed or flipped) grid — exactly what the recurring 2D visualizer of Chapter 1 has been showing you all along. Adding $\mathbf{b}$ then slides that warped grid bodily away from the origin, a pure translation. So $W\mathbf{x} + \mathbf{b}$ is an affine map — a linear transformation followed by a shift. Up to this point the layer can only do "straight-line" geometry: parallel lines stay parallel, the grid stays a grid. The nonlinearity $\sigma$ is what curves the picture — it lets the layer carve out regions, bend boundaries, and do the genuinely non-flat things that separating cats from dogs requires.
Each row of $W$ has a concrete reading worth holding onto. The $i$-th entry of $W\mathbf{x}$ is the dot product of row $i$ of $W$ with the input $\mathbf{x}$ (Chapter 8's row-times-column view, which we earned as composition, not as a memorized rule). So each output coordinate is a weighted sum of the inputs — row $i$ of $W$ asks "how much of each input feature do I care about?", the bias $b_i$ sets a threshold, and $\sigma$ decides how strongly to fire. A single such unit — one row of $W$, one bias, one $\sigma$ — is a neuron; a layer stacks $m$ of them, which is exactly why we package them as a matrix. The dot-product reading is the bridge from Chapter 18: a neuron measures the alignment between its weight vector and the input, and fires when they point the same way.
Real-World Application — Image classification (computer vision). A network that labels a photo as "cat" begins by flattening the image into a long vector of pixel values (a $224\times 224$ color image is a vector in $\mathbb{R}^{150528}$). The first layer multiplies that vector by a weight matrix, shifts by a bias, and bends through a nonlinearity, producing a new vector of "features." After several such layers, a final layer outputs a short vector of scores, one per category, and the largest score is the prediction. Every arrow in the familiar "neural network diagram" is one multiply-add inside a matrix-vector product; the whole forward pass is a chain of the operation $\sigma(W\mathbf{x} + \mathbf{b})$.
33.2.1 What do the shapes of the weight matrices tell you?
The dimensions of the weight matrices are not arbitrary — they are the architecture of the network, and reading them is the first thing an engineer does with someone else's model. A layer with weight matrix $W$ of shape $m \times n$ takes an $n$-dimensional vector in and puts an $m$-dimensional vector out, so the columns count inputs and the rows count outputs (the same convention as every matrix in this book since Chapter 7). A network is therefore a sequence of matrices whose shapes must chain: if layer $1$ outputs $\mathbb{R}^{h}$ and layer $2$ takes $\mathbb{R}^{h}$, then $W_1$ is $h \times n$ and $W_2$ is $m \times h$ — the inner dimensions match, exactly the conformability rule for matrix multiplication from Chapter 8. Our tiny net was $2 \xrightarrow{W_1} 3 \xrightarrow{W_2} 2$, so $W_1$ was $3\times 2$ and $W_2$ was $2\times 3$; the hidden dimension $3$ is the width of the hidden layer, and the number of matrices is the depth.
This shape bookkeeping is also how you count a network's parameters — the numbers that training must learn. A layer with an $m \times n$ weight matrix and an $m$-vector bias has $mn + m$ learnable numbers. Our two layers together have $(3\cdot 2 + 3) + (2\cdot 3 + 2) = 9 + 8 = 17$ parameters — a number you could fit by hand. A real network has billions: a single layer mapping $\mathbb{R}^{4096} \to \mathbb{R}^{4096}$ already carries $4096^2 + 4096 \approx 16.8$ million weights, and large language models stack hundreds of such layers. The operation is identical to our toy; only the count of numbers — and therefore the amount of data and compute needed to learn them — explodes. When you hear that a model "has $70$ billion parameters," that is a count of entries across all its weight matrices and biases.
Geometric Intuition — Use the recurring 2D visualizer from Chapter 1 one last time, now as a window into a single layer. Feed it the linear part of our collapsed network, the matrix $\begin{bmatrix} 0 & -2 \\ 2 & 0\end{bmatrix}$ (which we will meet in §33.4):
visualize_2d([[0, -2], [2, 0]])shows the unit square rotated a quarter turn and stretched — a clean rotate-and-scale, exactly the rotate–stretch–rotate story of Chapter 30, since this matrix has SVD with singular values $2$ and $2$. That is what the linear part of a layer does to space: one of the affine motions you have been watching all book. The bias then slides the whole transformed square off the origin, and the nonlinearity — which the visualizer cannot draw, because it is not linear — folds the plane along the axes where ReLU clamps to zero. A layer is "a visualizer transformation, shifted, then folded." Depth stacks fold upon fold, which is how flat building blocks assemble curved decision boundaries.
33.3 How do we compute a forward pass, by hand and in numpy?
Definitions become real when you push numbers through them. Let us build the smallest network that has everything — two layers, a nonlinearity between them — and compute its output by pencil, then confirm every digit with numpy. This single worked example is the computational heart of the chapter.
Our network takes a 2-dimensional input, expands it to 3 dimensions in a hidden layer, and contracts back to a 2-dimensional output. The first layer uses $$ W_1 = \begin{bmatrix} 1 & -1 \\ 0 & 2 \\ 1 & 1 \end{bmatrix}, \qquad \mathbf{b}_1 = \begin{bmatrix} 0 \\ -1 \\ 0.5 \end{bmatrix}, $$ a $3\times 2$ weight matrix and a bias in $\mathbb{R}^3$. Its nonlinearity is the ReLU (rectified linear unit), $\sigma(t) = \max(0, t)$ — the most common activation in modern networks, which simply zeroes out negative numbers and leaves positive ones alone. Feed it the input $\mathbf{x} = (2, 1)$.
Step 1 — the first pre-activation $\mathbf{z}_1 = W_1\mathbf{x} + \mathbf{b}_1$. Each entry is a row of $W_1$ dotted with $\mathbf{x}$, plus the matching bias: $$ \mathbf{z}_1 = \begin{bmatrix} 1\cdot 2 + (-1)\cdot 1 \\ 0\cdot 2 + 2\cdot 1 \\ 1\cdot 2 + 1\cdot 1 \end{bmatrix} + \begin{bmatrix} 0 \\ -1 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 0 \\ -1 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 3.5 \end{bmatrix}. $$
Step 2 — the first activation $\mathbf{a}_1 = \mathrm{ReLU}(\mathbf{z}_1)$. Every entry is already positive, so ReLU leaves them unchanged: $\mathbf{a}_1 = (1, 1, 3.5)$. (Had any entry been negative, it would have become $0$ — that is the entire action of ReLU, and it is the bend that makes the network nonlinear.)
Step 3 — the second pre-activation. The second layer maps $\mathbb{R}^3 \to \mathbb{R}^2$ with $$ W_2 = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 1 & 0 \end{bmatrix}, \qquad \mathbf{b}_2 = \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix}, $$ so $$ \mathbf{z}_2 = W_2\mathbf{a}_1 + \mathbf{b}_2 = \begin{bmatrix} 1\cdot 1 + 0\cdot 1 + (-1)\cdot 3.5 \\ 2\cdot 1 + 1\cdot 1 + 0\cdot 3.5 \end{bmatrix} + \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix} = \begin{bmatrix} -2.5 \\ 3 \end{bmatrix} + \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix} = \begin{bmatrix} -2 \\ 2.5 \end{bmatrix}. $$
Step 4 — the output activation. Suppose the output layer uses the sigmoid $\sigma(t) = 1/(1 + e^{-t})$, which squashes any real number into the interval $(0, 1)$ — handy when you want outputs that read as probabilities. Then $$ \mathbf{a}_2 = \begin{bmatrix} \sigma(-2) \\ \sigma(2.5) \end{bmatrix} = \begin{bmatrix} 1/(1+e^{2}) \\ 1/(1+e^{-2.5}) \end{bmatrix} \approx \begin{bmatrix} 0.1192 \\ 0.9241 \end{bmatrix}. $$ That is the network's output for the input $(2, 1)$: roughly $(0.12, 0.92)$. The whole computation was four matrix-vector products and two entrywise bends — nothing the reader of Chapter 8 cannot do.
Now confirm it in numpy. Recall the indexing reminder from the style guide: the math writes $W_1$'s first row as row $1$, while numpy stores it at W1[0].
# Forward pass through a tiny 2-3-2 network: a = sigma(W x + b), layer by layer.
import numpy as np
W1 = np.array([[1., -1.], [0., 2.], [1., 1.]]) # layer 1 weights (3x2)
b1 = np.array([0., -1., 0.5]) # layer 1 bias
W2 = np.array([[1., 0., -1.], [2., 1., 0.]]) # layer 2 weights (2x3)
b2 = np.array([0.5, -0.5]) # layer 2 bias
x = np.array([2., 1.]) # input
relu = lambda z: np.maximum(0.0, z) # ReLU activation
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z)) # sigmoid activation
z1 = W1 @ x + b1; a1 = relu(z1) # hidden layer
z2 = W2 @ a1 + b2; a2 = sigmoid(z2) # output layer
print("z1 =", z1) # [1. 1. 3.5]
print("a1 =", a1) # [1. 1. 3.5]
print("z2 =", z2) # [-2. 2.5]
print("a2 =", a2) # [0.119203 0.924142]
Every number matches the hand computation: $\mathbf{z}_1 = (1, 1, 3.5)$, $\mathbf{a}_1 = (1, 1, 3.5)$, $\mathbf{z}_2 = (-2, 2.5)$, and the output $\mathbf{a}_2 \approx (0.1192, 0.9241)$. That is a complete neural network inference — what happens every time a model answers a query — reduced to matrix multiplications you can verify by hand. A real network differs only in scale (matrices with millions of entries) and in the values of $W$ and $\mathbf{b}$, which are learned from data rather than written down by us. The arithmetic is identical.
Check Your Understanding — In the forward pass above, the hidden pre-activation came out $\mathbf{z}_1 = (1, 1, 3.5)$ — all positive — so ReLU did nothing. Suppose instead the input had been $\mathbf{x} = (0, 0)$. What would $\mathbf{z}_1$, then $\mathbf{a}_1 = \mathrm{ReLU}(\mathbf{z}_1)$, be? Which neuron gets "switched off"?
Answer
With $\mathbf{x} = (0,0)$, $W_1\mathbf{x} = \mathbf{0}$, so $\mathbf{z}_1 = \mathbf{b}_1 = (0, -1, 0.5)$. Then $\mathrm{ReLU}(\mathbf{z}_1) = (\max(0,0), \max(0,-1), \max(0,0.5)) = (0, 0, 0.5)$. The second neuron is switched off — its pre-activation $-1$ is negative, so ReLU clamps it to $0$. This is exactly how ReLU introduces nonlinearity: depending on the input, different neurons fire or stay silent, and the pattern of which neurons are active changes the effective linear map the network applies. That input-dependent switching is impossible for a pure matrix.
33.4 Why does the nonlinearity matter — what happens without it?
Here is the question that separates "a neural network" from "an expensive way to multiply matrices," and it has a clean, provable answer. Without the nonlinearity, stacking layers gains you nothing: the whole network collapses to a single matrix. This is the most important structural fact in the chapter, and it is pure Chapter 8 — composition of linear maps.
Recall from Chapter 8 that composing two linear transformations is itself a linear transformation, and its matrix is the product of the two matrices: applying $W_1$ then $W_2$ is the single map $W_2 W_1$. Now imagine a network with the nonlinearities removed — just affine layers $\mathbf{z} \mapsto W\mathbf{z} + \mathbf{b}$ stacked up. Two such layers give $$ W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1)\mathbf{x} + (W_2\mathbf{b}_1 + \mathbf{b}_2) = W'\mathbf{x} + \mathbf{b}', $$ where $W' = W_2 W_1$ is a single matrix and $\mathbf{b}' = W_2\mathbf{b}_1 + \mathbf{b}_2$ is a single bias. The composition of two affine maps is one affine map. By induction, a stack of any number of linear layers — ten, a hundred, a thousand — is exactly equivalent to one affine layer $\mathbf{x} \mapsto W'\mathbf{x} + \mathbf{b}'$. All that depth buys you precisely nothing: you could replace the entire deep network with a single matrix and get identical outputs.
Let me make the collapse concrete with the very matrices from §33.3, this time with the nonlinearity stripped out (and biases set aside for clarity). Applying $W_1$ then $W_2$ to $\mathbf{x} = (2, 1)$: $$ W_2 W_1 = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 1 & 0 \end{bmatrix}\begin{bmatrix} 1 & -1 \\ 0 & 2 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & -2 \\ 2 & 0 \end{bmatrix}, $$ so the two "layers" together are just multiplication by the single $2\times 2$ matrix $\begin{bmatrix} 0 & -2 \\ 2 & 0\end{bmatrix}$.
# Without a nonlinearity, two linear layers collapse to ONE matrix W2 @ W1.
import numpy as np
W1 = np.array([[1., -1.], [0., 2.], [1., 1.]])
W2 = np.array([[1., 0., -1.], [2., 1., 0.]])
x = np.array([2., 1.])
print("two layers, no sigma: W2 @ (W1 @ x) =", W2 @ (W1 @ x)) # [-2. 4.]
print("single matrix W2 @ W1: (W2 @ W1) @ x =", (W2 @ W1) @ x) # [-2. 4.]
print("the collapsed matrix W2 @ W1 =\n", W2 @ W1) # [[ 0. -2.] [ 2. 0.]]
The two outputs are identical — $(-2, 4)$ either way — because the two linear layers are the one matrix $W_2 W_1$. No matter how many linear layers you stack, you can multiply their matrices together and get a single equivalent transformation. Depth without nonlinearity is an illusion.
Common Pitfall — "A deep network with many layers is automatically more powerful than a shallow one." Only with the nonlinearity. A stack of purely linear layers $W_k \cdots W_2 W_1$ is mathematically identical to the single matrix $W' = W_k\cdots W_1$ — by the composition rule of Chapter 8 — so a "deep linear network" can represent only what one matrix can: a linear (affine) map, with all the limits that implies (it can never carve a curved decision boundary). Without the nonlinearity, deep = shallow. The activation function $\sigma$ between layers is precisely what breaks the collapse: because $\sigma(W_2\sigma(W_1\mathbf{x})) \neq (W_2 W_1)\mathbf{x}$ in general, each extra nonlinear layer genuinely enlarges what the network can express. The matrices do the moving; the nonlinearities are what let depth accumulate power.
The Key Insight — The nonlinearity is not a numerical afterthought; it is the load-bearing wall. Linear operations compose into a single linear operation (Chapter 8), so a network needs something between the matrix multiplies to escape the flat world of affine maps. That "something" is the entrywise bend $\sigma$ — ReLU, sigmoid, tanh, and friends. With it, a network of enough width and depth can approximate essentially any continuous function (the universal approximation theorem) [verify]; without it, you have reinvented matrix multiplication. Machine learning is linear algebra plus nonlinearity, and the "plus" is doing real work.
Math-Major Sidebar — The reason a single nonlinearity rescues expressive power, while zero leaves you with one matrix, is worth stating precisely. The set of affine maps $\mathbb{R}^n \to \mathbb{R}^m$ is closed under composition (it forms a monoid), so no finite composition of affine maps escapes it — that is the collapse. ReLU networks instead produce continuous piecewise-linear functions: ReLU partitions input space into polyhedral regions (where each neuron is on or off), and the network is a different affine map on each region, stitched continuously across the boundaries. Adding layers multiplies the number of regions, which is the precise sense in which depth buys expressive power. The universal approximation theorems (Cybenko 1989 for sigmoid; Hornik 1991 more generally) make the approximation claim rigorous, though they say nothing about how many neurons you need or whether training will find the right weights [verify].
33.5 What is an embedding, and why is geometry the same as meaning?
Switch fields, from vision to language, and meet the second great use of linear algebra in machine learning. How do you feed a word to a network that only multiplies matrices? You cannot multiply by "cat." The answer — one of the most beautiful ideas in the subject — is the embedding: assign every word (or image, or user, or product) a vector of a few hundred real numbers, a point in a high-dimensional space, and arrange these vectors so that geometric relationships encode semantic relationships. Words with similar meanings get vectors that point in nearly the same direction; unrelated words get vectors at large angles. The meaning lives in the geometry.
This is the payoff of Chapter 18 promised in advance. There we learned that the cosine of the angle between two vectors, $$ \cos\theta = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}, $$ measures their alignment regardless of length: $+1$ for the same direction, $0$ for perpendicular, $-1$ for opposite. In an embedding space this single number — the cosine similarity — becomes a meter for relatedness. "King" and "queen" point in similar directions (high cosine); "king" and "apple" point in unrelated directions (cosine near zero). The model never stores a dictionary definition; it stores a position in space, and closeness in angle is closeness in meaning.
Geometric Intuition — Picture a vast cloud of points, one per word, floating in a 300-dimensional space (we can only draw three of those dimensions, but the geometry is the same). Words about royalty cluster in one region; words about food cluster in another; words about emotions in a third. "Closeness" is measured by angle, not by walking distance — two vectors pointing the same way are "synonyms" even if one is longer. A query like "find words similar to ocean" becomes "find the vectors at the smallest angle to the ocean vector" — a nearest-neighbor search using cosine similarity. The embedding has turned the fuzzy human notion of meaning into the crisp geometric notion of direction in space.
Where do these vectors come from? They are learned, not assigned by hand. A model such as word2vec (Mikolov and colleagues, 2013) reads billions of words of text and adjusts each word's vector so that words appearing in similar contexts end up with similar vectors — the linguistic principle that "you shall know a word by the company it keeps." Crucially, the dimensions of the resulting space are not labeled by humans; no one tells the model "dimension 7 means royalty." The axes are whatever directions the training discovers, and they usually do not correspond to any tidy human concept. We will use hand-assigned, interpretable coordinates in the next section purely to illustrate the geometry — but you should hold in mind that real embeddings are learned, opaque in their individual axes, and high-dimensional.
Real-World Application — Semantic search and retrieval (data science / NLP). When you search a modern document store or ask a question of a retrieval system, your query is converted to an embedding vector, and the system returns the documents whose embeddings have the highest cosine similarity to it — finding matches by meaning rather than by exact keyword. The same machinery powers "related products," "similar songs," duplicate-question detection, and the retrieval step in retrieval-augmented question answering. Every one of these is a nearest-neighbor search under cosine similarity in an embedding space — Chapter 18's angle formula, run over millions of vectors.
33.6 How does "king − man + woman ≈ queen" actually work?
The single most famous demonstration that embeddings capture meaning is the word analogy: take the vector for "king," subtract the vector for "man," add the vector for "woman," and the result lands closest to the vector for "queen." Vector arithmetic on words produces a meaningful word. This looks like sorcery; it is linear algebra, and once you see the geometry it is almost obvious.
The key idea is that a well-trained embedding encodes relationships as consistent difference vectors. If the space has learned a "gender" relationship, then the vector you add to go from a male word to its female counterpart is roughly the same everywhere: $\text{queen} - \text{king} \approx \text{woman} - \text{man}$. Rearrange that approximate equation and you get exactly the analogy: $\text{king} - \text{man} + \text{woman} \approx \text{queen}$. The arithmetic works because the direction "make it female" is a fixed displacement in the space — a single vector you can add to any word to shift its gender while leaving its other attributes alone.
Let me build a tiny, fully transparent embedding to watch this happen. I will hand-assign four interpretable coordinates — (royalty, gender, human, fruit) — to five words. This is a teaching cartoon: real embeddings are learned and their axes are not labeled, but a designed example lets you check every number and see the mechanism. Take $$ \begin{aligned} \text{king} &= (1, 1, 1, 0), & \text{queen} &= (1, -1, 1, 0), \\ \text{man} &= (0, 1, 1, 0), & \text{woman} &= (0, -1, 1, 0), \\ \text{apple} &= (0, 0, 0, 1). && \end{aligned} $$ Read the axes: "king" is royal ($1$), male ($+1$ in the gender slot), human ($1$), not fruit ($0$). "Queen" is the same but female ($-1$ in the gender slot). "Apple" is none of the human attributes and fully fruit. Now the analogy: $$ \text{king} - \text{man} + \text{woman} = (1,1,1,0) - (0,1,1,0) + (0,-1,1,0) = (1, -1, 1, 0) = \text{queen}. $$ The result is exactly the queen vector. Subtracting "man" removes the human-male part shared by king and man, leaving the pure "royalty" displacement $(1,0,0,0)$; adding "woman" then attaches the female-human part — and royalty + female human = queen. The gender axis flipped from $+1$ to $-1$ while royalty and humanity were preserved.
# Embeddings: cosine similarity and the king - man + woman analogy.
import numpy as np
emb = {
"king": np.array([1., 1., 1., 0.]), # (royalty, gender, human, fruit)
"queen": np.array([1., -1., 1., 0.]),
"man": np.array([0., 1., 1., 0.]),
"woman": np.array([0., -1., 1., 0.]),
"apple": np.array([0., 0., 0., 1.]),
}
def cosine(u, v): # Chapter 18's angle meter
return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print("cos(king, queen) =", round(cosine(emb["king"], emb["queen"]), 3)) # 0.333
print("cos(king, man) =", round(cosine(emb["king"], emb["man"]), 3)) # 0.816
print("cos(king, apple) =", round(cosine(emb["king"], emb["apple"]), 3)) # 0.0
result = emb["king"] - emb["man"] + emb["woman"] # vector arithmetic on meaning
sims = {w: round(float(cosine(result, e)), 3)
for w, e in emb.items() if w not in ("king", "man", "woman")}
print("nearest to (king - man + woman):", sims) # {'queen': 1.0, 'apple': 0.0}
The numbers tell the whole story. "King" and "queen" share royalty and humanity but differ in gender, giving a moderate cosine of $0.333$; "king" and "man" are both male humans, giving a high $0.816$; "king" and "apple" share nothing, giving exactly $0$ (orthogonal — unrelated). And the analogy result is the queen vector itself, so its cosine with "queen" is $1.0$ and it is the unambiguous nearest neighbor. In a real embedding the equality is approximate (hence the "$\approx$" in the famous formula) — the learned gender direction is only roughly constant across word pairs — but the mechanism is precisely this: relationships are directions, and analogy is vector addition.
Check Your Understanding — Using the toy embedding above, predict (by hand, no code) where $\text{queen} - \text{woman} + \text{man}$ should land, and confirm it is "king."
Answer
$\text{queen} - \text{woman} + \text{man} = (1,-1,1,0) - (0,-1,1,0) + (0,1,1,0) = (1, 1, 1, 0) = \text{king}$. Subtracting "woman" strips the female-human part from "queen," leaving the royalty displacement $(1,0,0,0)$; adding "man" attaches the male-human part. Royalty + male human = king. The analogy runs both directions because the gender displacement $\text{man} - \text{woman} = (0, 2, 0, 0)$ is the same vector that separates king from queen — which is exactly the consistency that makes embedding arithmetic work.
Warning — Do not over-read the analogy magic. The clean "$=$" here is an artifact of hand-built, orthogonal axes; with learned embeddings the relation is only approximate, the nearest neighbor is found by an actual cosine search (and is sometimes wrong), and the famous examples were curated. Embeddings also faithfully encode the biases present in their training text — gender, racial, and other stereotypes show up as real directions in the space, because the model learned the statistics of human language, warts and all. Treating embedding geometry as objective "meaning" rather than as learned statistical association is a genuine and well-documented hazard; see how models work for why a model's representations reflect its data, not ground truth.
33.6.1 Cosine similarity or Euclidean distance — which measures meaning?
A natural question, and a good place to sharpen Chapter 18. We have measured relatedness by the cosine of the angle between embedding vectors, not by the Euclidean distance between their tips. Why the angle? Because in most embedding spaces the direction of a vector carries the meaning and its length carries something incidental — roughly, how often or how strongly the word appeared, not what it means. Two vectors pointing the same way are "the same meaning" even if one is twice as long; cosine similarity ignores length by dividing out both norms, exactly as Chapter 18 defined it. Euclidean distance, by contrast, would call a long "king" vector and a short "king" vector far apart merely because their magnitudes differ, which is usually not what we want.
There is a clean relationship between the two when the vectors are normalized to unit length, a common preprocessing step. For unit vectors $\hat{\mathbf{u}}, \hat{\mathbf{v}}$, the squared Euclidean distance and the cosine are tied by the law of cosines from Chapter 18: $$ \lVert\hat{\mathbf{u}} - \hat{\mathbf{v}}\rVert^2 = \lVert\hat{\mathbf{u}}\rVert^2 + \lVert\hat{\mathbf{v}}\rVert^2 - 2\,\hat{\mathbf{u}}\cdot\hat{\mathbf{v}} = 2 - 2\cos\theta. $$ So on the unit sphere, small angle and small distance say the same thing — ranking neighbors by cosine and by distance give the identical order. This is why systems often normalize embeddings first: it lets them use fast Euclidean nearest-neighbor data structures while still ranking by meaning. The lesson is the Chapter 18 lesson exactly: when scale is noise and direction is signal, use the angle.
33.6.2 How are embeddings actually learned?
It is worth a paragraph on where learned embeddings come from, so the "learned, not assigned" claim is concrete. In a model like word2vec, every word starts with a random vector. The model then slides a window across a huge text corpus and, at each position, nudges the vectors so that a word's vector becomes a better predictor of the words around it (or, in the converse "skip-gram" setup, so the context words predict the center word). Words that keep appearing in the same contexts — "cat" and "dog," "Monday" and "Tuesday" — get pushed toward similar vectors, because similar vectors make similar predictions. After billions of such nudges (gradient steps, §33.10), the geometry has self-organized: related words cluster, and consistent relationships like singular–plural or country–capital emerge as repeated difference vectors, which is why the analogy arithmetic works at all. Modern embeddings from large language models are richer still — the same word gets different vectors in different sentences (contextual embeddings), so "bank" near "river" lands far from "bank" near "money" — but the principle is unchanged: meaning is learned as position in a vector space.
Real-World Application — Recommendation by item embeddings (data science). Embeddings and the recommender of this chapter are two faces of one idea. The item factor vectors $\mathbf{v}_j$ we will learn in §33.9 are embeddings of the items: movies that many of the same users enjoy get nearby vectors, so "more like this" becomes a cosine-similarity search over item embeddings — no genre labels required. Music and video services build exactly such item-embedding spaces, and "because you watched X" is a nearest-neighbor query in that space. The neural-network, embedding, and recommender threads of this chapter are not three separate applications; they are one geometry — vectors whose arrangement encodes relatedness — used three ways.
33.7 What is a recommendation system, geometrically?
Now the chapter's anchor, and the place where Part VI's low-rank idea pays off most directly. Every time a streaming service suggests your next show, a shopping site recommends a product, or a music app builds a playlist, a recommendation system is at work. The dominant linear-algebra approach to recommendation is matrix factorization, and it is the low-rank approximation of Chapter 31 applied to a table of preferences instead of an image. This is the system we will build by hand and in code.
Set up the picture. Arrange all the data as one giant matrix $R$, the rating matrix: one row per user, one column per item, and the entry $R_{ij}$ is the rating user $i$ gave item $j$ — a number of stars, a thumbs up, a watch-time fraction. The catch that makes the problem interesting is that $R$ is mostly empty: no user has rated more than a tiny fraction of the catalog, so the overwhelming majority of entries are unknown. The job of a recommender is to fill in the blanks — to predict the rating each user would give each item they have not yet seen — and then recommend the items with the highest predicted ratings.
Geometric Intuition — Here is the leap. Even though the catalog has thousands of items, people's tastes are not thousands of independent whims — they are governed by a handful of hidden themes. A few dozen factors (how much you like action versus romance, indie versus blockbuster, recent versus classic) largely determine your ratings. In linear-algebra terms: the full rating matrix $R$, though enormous, is approximately low-rank — it is close to a matrix of small rank $k$, because only $k$ underlying taste-dimensions are really at work. Chapter 31 taught us that the best rank-$k$ approximation of a matrix throws away the small singular values and keeps the large ones. A recommender does exactly this: it finds the low-rank structure hiding in the sea of ratings and uses it to predict the missing entries.
That low-rank structure is made explicit by factoring $R$ into a thin product. Write $$ R \approx U V^{\mathsf{T}}, $$ where $U$ is $m \times k$ (one row of $k$ numbers per user) and $V$ is $n \times k$ (one row of $k$ numbers per item), with $k$ small — say $k = 20$ for a real system, $k = 2$ in our toy. Each user gets a latent factor vector (a row of $U$), each item gets a latent factor vector (a row of $V$), and the predicted rating of user $i$ for item $j$ is the dot product of their two vectors: $$ \widehat{R}_{ij} = \mathbf{u}_i \cdot \mathbf{v}_j = \sum_{f=1}^{k} u_{if}\,v_{jf}. $$ This is the four-fundamental-theme picture made literal: $u_{if}$ is "how much user $i$ likes theme $f$," $v_{jf}$ is "how much item $j$ expresses theme $f$," and their dot product is high when the user's tastes align with the item's character — exactly Chapter 18's alignment reading of the dot product, now in a learned latent space.
The Key Insight — A recommendation system is a low-rank matrix factorization $R \approx UV^{\mathsf{T}}$ (Chapters 30–31). Users and items both become vectors in a small shared latent space; a predicted rating is the dot product of a user vector with an item vector. The "magic" of recommendation — predicting how you would rate a film you have never seen — is just reconstructing a missing entry of a low-rank matrix from the entries you do have. The same low-rank idea that compressed an image in Chapter 31 fills in a rating here; learn it once, use it everywhere.
33.8 How does matrix factorization predict missing ratings?
The mechanism deserves a careful walk-through, because it is the chapter's anchor and because it shows why low rank is what lets a recommender generalize. The discipline that does this is called collaborative filtering: users collaborate, without knowing it, by collectively revealing the latent themes, so that one user's ratings help predict another's.
Here is the logic. Suppose we have observed some of user $i$'s ratings and some of everyone else's. We look for user vectors $\mathbf{u}_i$ and item vectors $\mathbf{v}_j$ such that the dot products $\mathbf{u}_i \cdot \mathbf{v}_j$ reproduce the observed ratings as closely as possible. Because $k$ is small, there are far fewer numbers to choose (the entries of $U$ and $V$) than there are observed ratings — so the factors cannot memorize each rating individually; they are forced to discover genuine shared structure, the handful of themes that explain many ratings at once. Then — and this is the payoff — we use those same factors to predict the unobserved entries: for a film $j$ that user $i$ never rated, we simply compute $\mathbf{u}_i \cdot \mathbf{v}_j$. The user's taste vector, learned from the films they did rate, combined with the film's character vector, learned from the users who did rate it, yields a prediction for the pair that was never observed.
Low rank is exactly what makes this honest rather than circular. If we allowed $U V^{\mathsf{T}}$ to have full rank, we could fit the observed entries perfectly and learn nothing about the missing ones — the unobserved entries would be unconstrained (this is the overfitting trap of Chapter 17's high-degree polynomials, in matrix form). By forcing the rank down to $k$, we demand that the predictions for a user lie in the same $k$-dimensional pattern as everyone with similar taste, which is precisely what lets us generalize from the ratings we have seen to the ones we have not.
Two refinements that real systems use, and that we will include because they materially improve the predictions. First, fit only the observed entries — never the blanks — so the unknown ratings exert no false pull on the factors. Second, subtract a baseline before factoring: a global mean $\mu$ (the average rating across all observed pairs) so the factors model deviations from average rather than absolute levels, which keeps predictions in a sensible range. The model becomes $$ \widehat{R}_{ij} = \mu + \mathbf{u}_i \cdot \mathbf{v}_j, $$ and the factors $U, V$ are chosen to minimize the squared error on the observed entries, with a small penalty on the size of the factors to prevent overfitting (a regularizer, the same idea that tames ill-conditioning in Chapter 17). Real systems add per-user and per-item bias terms too — some users rate generously, some items are universally loved — but the global mean alone captures the heart of it.
Computational Note — There are two routes to the factorization, and it is worth knowing both. The SVD route treats $R$ (with the blanks filled by a baseline) directly with
np.linalg.svd, keeps the top $k$ singular triples, and reads off $U$ and $V$ — clean and exact, but it requires some filling of the unknowns, which biases the result. The gradient route (used by essentially every production recommender, including the systems that won the Netflix Prize) never fills the blanks: it initializes $U, V$ randomly and nudges them downhill to reduce the error on the observed entries only, using the gradient-descent idea sketched in §33.10. The gradient route scales to matrices with billions of entries where a full SVD is impossible, and it handles the missing-data structure exactly. Our toolkit demo uses the gradient route; both are matrix factorization, and both rest on the low-rank picture of Chapter 31.
33.9 A worked recommender: factoring a small rating matrix
Let us make the anchor concrete with numbers you can inspect. Take five users and four movies. Movies $0$ and $1$ are action films; movies $2$ and $3$ are romances. Users $0$ and $1$ love action; users $3$ and $4$ love romance; user $2$ is a centrist who rates everything middling. Ratings run $1$–$5$, and a $0$ marks an unknown entry we want to predict: $$ R = \begin{bmatrix} 5 & 4 & 1 & \bullet \\ 4 & 5 & 2 & 1 \\ 3 & 3 & 3 & 3 \\ 1 & \bullet & 5 & 4 \\ 1 & 2 & 4 & 5 \end{bmatrix}, $$ where $\bullet$ marks the two missing ratings: user $0$ never rated movie $3$ (a romance), and user $3$ never rated movie $1$ (an action film). We will factor $R \approx \mu + UV^{\mathsf{T}}$ with rank $k = 2$ — two latent themes, which we hope the model will discover correspond to "action-ness" and "romance-ness" — fitting only the $18$ observed entries, and then read off the two predictions.
# Matrix-factorization recommender: R ~= mu + U V^T, gradient descent on KNOWN entries.
import numpy as np
R = np.array([[5., 4., 1., 0.], # 0 marks an UNKNOWN rating to predict
[4., 5., 2., 1.],
[3., 3., 3., 3.],
[1., 0., 5., 4.],
[1., 2., 4., 5.]])
mask = (R > 0).astype(float) # 1 where rating is known, 0 where unknown
m, n, k = 5, 4, 2
mu = R[mask > 0].mean() # global mean of observed ratings (~3.111)
rng = np.random.default_rng(0) # fixed seed -> reproducible numbers
U = rng.normal(0, 0.1, (m, k)) # latent user factors (taste)
V = rng.normal(0, 0.1, (n, k)) # latent item factors (genre mix)
lr, reg = 0.01, 0.1
for _ in range(6000): # gradient descent (see 33.10)
E = mask * (R - (mu + U @ V.T)) # error, ZEROED on unknown entries
U += lr * (E @ V - reg * U)
V += lr * (E.T @ U - reg * V)
P = mu + U @ V.T # full predicted matrix (fills the blanks)
rmse = np.sqrt((mask * (R - P) ** 2).sum() / mask.sum())
np.set_printoptions(precision=3, suppress=True)
print("reconstruction mu + U V^T:\n", P)
print("train RMSE on known entries:", round(rmse, 3)) # 0.210
print("predict (user0, movie3) =", round(P[0, 3], 2)) # 1.50 (action fan, romance film)
print("predict (user3, movie1) =", round(P[3, 1], 2)) # 2.98 (romance fan, action film)
The reconstruction comes out $$ \mu + UV^{\mathsf{T}} \approx \begin{bmatrix} 5.19 & 3.97 & 1.32 & \mathbf{1.50} \\ 4.18 & 4.89 & 2.22 & 1.01 \\ 3.13 & 3.11 & 3.10 & 3.11 \\ 1.04 & \mathbf{2.98} & 4.92 & 4.03 \\ 1.46 & 1.93 & 4.52 & 4.87 \end{bmatrix}, $$ and three things are worth savoring. First, the observed entries are reproduced closely — a root-mean-square error of about $0.21$ stars on the $18$ known ratings, so the rank-$2$ model genuinely captures the data with just two themes. Second, the predictions for the missing entries are sensible and interpretable: user $0$, an action fan, is predicted to give the romance movie $3$ a low $\mathbf{1.50}$ (the model learned they dislike romance from their other ratings), while user $3$, a romance fan, is predicted to give the action movie $1$ a middling $\mathbf{2.98}$. Third — and this is the whole point — the model was never told which movies are action or romance. It discovered the two-theme structure purely from the pattern of ratings, then used it to predict pairs it had never seen. That is collaborative filtering: the action-loving users collectively taught the model what "action-ness" is, and that knowledge transferred to predict an entry no single user supplied.
To recommend for user $0$, we look at their row of predictions, ignore the movies they have already rated, and suggest the unrated movie with the highest predicted score. Here user $0$'s only unrated movie is movie $3$, predicted $1.50$ — so the system correctly learns it should not push a romance on an action fan. With a real catalog of thousands of titles, the same row-scan over predicted ratings produces the ranked recommendation list you see in the app.
33.9.1 Reading the learned factors: what did the model discover?
The reconstruction is satisfying, but the factors themselves are where the geometry becomes vivid. After training, inspect the item matrix $V$ (one row of two numbers per movie). Because rank is $2$, each movie is a point in a $2$-dimensional latent plane, and the two action movies land near each other in one region while the two romances land together in another — the model has spontaneously arranged the catalog so that "similar movies are nearby vectors," precisely the embedding geometry of §33.5. Likewise each user's row of $U$ places them in the same plane, near the movies they like. A predicted rating $\mu + \mathbf{u}_i\cdot\mathbf{v}_j$ is high exactly when user vector and item vector point the same way — Chapter 18's alignment reading of the dot product, now in a space the model invented for itself.
# After training: the latent factors ARE embeddings. Similar items cluster.
np.set_printoptions(precision=2, suppress=True)
print("item factors V (one row per movie):\n", V)
def cos(a, b): # cosine similarity (Chapter 18)
return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print("cos(movie0, movie1) action-action:", round(cos(V[0], V[1]), 3)) # 0.346 (positive)
print("cos(movie0, movie3) action-romance:", round(cos(V[0], V[3]), 3)) # -0.677 (opposite)
print("cos(movie2, movie3) romance-romance:",round(cos(V[2], V[3]), 3)) # 0.662 (positive)
The two same-genre pairs have positive cosine similarity (their factor vectors lean the same way), while the action–romance pair has a clearly negative one ($-0.677$) — the two genres point in opposite directions in the latent plane. Nobody labeled the genres — the factorization discovered them as the directions that best explain the ratings, the same way the SVD of Chapter 31 discovered the directions that best explain an image. This is the deep unity of the chapter: the recommender's item factors are item embeddings, and the low-rank factorization is what learned them. (With only $18$ ratings and a random initialization the exact factor values are not unique — the factorization is non-convex, per the Math-Major Sidebar — but the sign structure that separates the genres is robust.)
33.9.2 The cold-start problem: where the geometry runs out
An honest chapter names where the method fails. Matrix factorization predicts a missing entry $\mathbf{u}_i\cdot\mathbf{v}_j$ from a user's other ratings and an item's other raters — so it is helpless when there are none. A brand-new user with zero ratings has no learned vector $\mathbf{u}_i$ (their row is undetermined); a brand-new movie nobody has rated has no $\mathbf{v}_j$. This is the cold-start problem, and it is a direct consequence of the geometry: you cannot place a point in the latent space with no observations to pin it down, just as you cannot fit a line through zero data points. Real systems patch the gap with side information — a new user's signup survey, a new movie's genre tags and cast — folded in as extra features, or they fall back to non-personalized popularity ("trending now") until a few ratings arrive. The lesson is the same one Chapter 17 taught about rank: you need enough independent observations to determine your unknowns, and matrix factorization is no exception.
Check Your Understanding — In the $5\times 4$ rating matrix, suppose we removed every rating from user $2$ (the centrist), leaving their whole row unknown. Could the factorization still predict user $2$'s ratings? Why or why not?
Answer
No — this is exactly the cold-start problem. With user $2$'s entire row unobserved, the masked error $E$ has all zeros in row $2$, so the gradient updates never move user $2$'s factor vector $\mathbf{u}_2$ away from its random initialization. There is nothing to pin it down, and the predictions $\mu + \mathbf{u}_2\cdot\mathbf{v}_j$ would be essentially the baseline $\mu$ plus noise. You need at least a few of a user's ratings to locate them in the latent space; the method interpolates among observed entries, it cannot conjure a position from none.
Real-World Application — Streaming and shopping recommendations (the anchor). The recommender systems behind video streaming, music apps, and online retail are, at their core, matrix factorizations of exactly this form, scaled to hundreds of millions of users and millions of items. The famous \$1M Netflix Prize (2006–2009) was won by an ensemble built on matrix factorization — learning user and item latent vectors so that $\mathbf{u}_i \cdot \mathbf{v}_j$ predicts ratings — and the technique remains a workhorse, now often combined with neural networks that learn richer factor representations [verify]. The $R \approx UV^{\mathsf{T}}$ you just ran by hand is the same equation, the same low-rank idea from Chapter 31, the same dot-product-as-alignment from Chapter 18 — only bigger.
Math-Major Sidebar — How does the gradient route relate to the SVD of Chapter 30? If $R$ were fully observed and we used no regularization, the rank-$k$ factorization minimizing $\lVert R - UV^{\mathsf{T}}\rVert_F^2$ would be exactly the truncated SVD: by the Eckart–Young theorem (Chapter 31), the best rank-$k$ approximation is $U_k\Sigma_k V_k^{\mathsf{T}}$, and you can absorb $\Sigma_k$ into the factors, e.g. $U = U_k\Sigma_k^{1/2}$, $V = V_k\Sigma_k^{1/2}$. So with complete data the recommender is the SVD. The reasons production systems use gradient descent instead are the missing entries (the SVD wants a full matrix; ratings are $99\%$ empty) and scale (you cannot form, let alone factor, a billion-by-million matrix). The factorization is non-convex in $U$ and $V$ jointly, so gradient descent finds a good local optimum, not the unique global one — but in practice that is enough. The low-rank target, though, is pure Eckart–Young: the recommender seeks the rank-$k$ matrix closest to the observed ratings.
33.10 How are the weights learned? Training in one breath
We have treated the weight matrices $W$, the embedding vectors, and the latent factors $U, V$ as if they were handed to us. In reality they are learned from data, and the linear algebra of how is a forward reference — it is the matrix calculus of derivatives and gradients, the subject of a calculus course rather than this book. But the shape of the idea is simple enough to state in a breath, and it connects directly to everything you have learned.
Learning means minimizing a loss. Pick a loss function $L$ that measures how wrong the model's outputs are on training data — for the recommender, the squared error on observed ratings; for a classifier, how far the predicted probabilities sit from the true labels. The loss is a function of all the model's numbers (every entry of every $W$, every $\mathbf{b}$, every factor). Training adjusts those numbers to push the loss downhill. The downhill direction is the negative gradient $-\nabla L$ — the vector of partial derivatives of the loss with respect to every parameter — and gradient descent takes small repeated steps in that direction:
$$
W \leftarrow W - \eta\,\frac{\partial L}{\partial W},
$$
where $\eta$ is a small learning rate. You saw this exact update in the recommender of §33.9: U += lr * (E @ V - reg * U) is a gradient step, the matrix $E @ V$ being (up to sign) the gradient of the squared error with respect to $U$.
The reason this is tractable for a deep network is backpropagation — and here is the linear-algebra punchline. The forward pass is a chain of matrix multiplications and nonlinearities (§33.3); computing the gradient runs the same chain backward, multiplying by transposed weight matrices $W^{\mathsf{T}}$ and by the derivatives of the nonlinearities. Backpropagation is, at heart, the chain rule of calculus organized as a sequence of matrix products — the transpose $W^{\mathsf{T}}$ (Chapter 8) appearing because the gradient flows in the opposite direction to the forward signal. The forward pass sends information through $W$; the backward pass sends error-sensitivity through $W^{\mathsf{T}}$. We will not derive it here, but you should recognize that training is matrix calculus, built on the very transpose and matrix-multiplication operations this book has developed.
Real-World Application — Training a large model (machine learning engineering). Training a modern network means computing the loss on a batch of examples (a forward pass — matrix multiplies), running backpropagation to get the gradient (the same matrix multiplies in reverse, with transposes), and taking a gradient-descent step — then repeating billions of times across many processors. The dominant cost is matrix multiplication, which is why specialized hardware (GPUs and tensor-processing units) is, fundamentally, hardware for multiplying matrices very fast. The entire training pipeline is the linear algebra of this book, executed at enormous scale; the calculus supplies the gradient, the linear algebra supplies the operations. See neural networks for the optimization-and-training view in depth.
33.11 Build it from scratch: a matrix-factorization recommender
You now have the entire anchor in hand: a rating matrix, a low-rank target $R \approx \mu + UV^{\mathsf{T}}$, a way to fit it on observed entries, and a way to read off predictions. Time to make it your capstone toolkit contribution — the runnable demo that brings together the SVD/low-rank thread of Part VI with the dot-product geometry of Part IV.
Build Your Toolkit — Implement
recommender.pyintoolkit/capstone/, a matrix-factorization recommender for a small rating matrix with missing entries:
matrix_factorize(R, mask, k, lr, reg, steps)— factor $R \approx \mu + UV^{\mathsf{T}}$ by gradient descent on the observed entries only (maskis $1$ where known, $0$ where unknown). Subtract the global mean $\mu$ of the known ratings, initialize $U$ ($m\times k$) and $V$ ($n\times k$) with small random values, and repeatedly update $U \mathrel{+}= \eta(E V - \lambda U)$ and $V \mathrel{+}= \eta(E^{\mathsf{T}} U - \lambda V)$ where $E = \text{mask}\odot(R - (\mu + UV^{\mathsf{T}}))$ is the masked error. Return $U, V, \mu$.predict(U, V, mu)— return the full predicted matrix $\mu + UV^{\mathsf{T}}$, which fills in the blanks.rmse_on_known(R, mask, P)— root-mean-square error on the observed entries, your fit diagnostic.You may use numpy here (this is the capstone demo, not a from-scratch primitive — though you can build it on your
toolkit/vectors.pydot product if you want the pure version). The two non-negotiables: fit only the known entries (multiply the error by the mask), and verify reconstruction on the known entries before trusting any prediction.Verify: on the $5\times 4$ matrix of §33.9 with $k=2$, your
rmse_on_knownshould be about $0.21$, and the two predictions should beP[0,3] ≈ 1.5(action fan, romance film — low) andP[3,1] ≈ 3.0(romance fan, action film — moderate). As a second check, build a fully observed matrix, factor it, and confirm the reconstruction error shrinks as you raise $k$ toward $\min(m, n)$ — recovering the truncated-SVD behavior of Chapter 31. Optionally, compare against an SVD-route version (fill blanks with the column mean, take the rank-$k$ truncated SVD via yourtoolkit/svd.py) and note that both give sensible predictions.
A sketch of the core loop, to show how little code the anchor takes once the geometry is clear (implement it yourself before peeking):
# Sketch: matrix factorization on observed entries. The whole recommender is here.
def matrix_factorize(R, mask, k=2, lr=0.01, reg=0.1, steps=6000, seed=0):
m, n = R.shape
mu = R[mask > 0].mean() # baseline: global mean of known ratings
rng = np.random.default_rng(seed)
U = rng.normal(0, 0.1, (m, k)) # user factors
V = rng.normal(0, 0.1, (n, k)) # item factors
for _ in range(steps):
E = mask * (R - (mu + U @ V.T)) # error, zeroed on unknown entries
U += lr * (E @ V - reg * U) # gradient step (Chapter 8 + 33.10)
V += lr * (E.T @ U - reg * V)
return U, V, mu
# On the 5x4 matrix of 33.9 this gives RMSE ~0.21 and P[0,3]~1.5, P[3,1]~3.0.
Notice what the recommender is: a low-rank factorization (Part VI) whose predictions are dot products (Part IV), fit by gradient steps that are themselves matrix multiplications (Part II). It is the perfect capstone for Part VI — and a natural seed for the full capstone of Chapter 39, where the from-scratch toolkit is assembled on a chosen application and a recommender is one of the offered domains.
33.12 What should you carry forward from this chapter?
Step back and see what just happened. Three systems that define modern computing — the neural networks behind vision and language, the embeddings behind search and analogy, the recommenders behind streaming and shopping — turned out to be the linear algebra of this book, run at scale, with a little nonlinearity in the mix. A layer is $\sigma(W\mathbf{x} + \mathbf{b})$: matrix multiply (Chapter 7), bias, bend. A deep network is a composition of those (Chapter 8) — and without the bend, the composition collapses to a single matrix, which is why the nonlinearity is the load-bearing wall. An embedding is a learned vector where the cosine of the angle (Chapter 18) measures meaning, and relationships are consistent difference vectors, so analogy is vector addition. A recommender is a low-rank factorization $R \approx UV^{\mathsf{T}}$ (Chapters 30–31) whose predicted ratings are dot products of latent user and item vectors.
The single sentence to engrave: machine learning is linear algebra plus nonlinearity, and approximation by low rank. Every theme of the book converges here. Matrices are transformations — a layer transforms a vector of features. Geometry and algebra are one — an embedding turns meaning into direction, and cosine similarity reads it back. Linear algebra is the most applied branch of pure mathematics — the SVD that compressed an image in Chapter 31 fills in a missing rating here, unchanged. And computation validates theory — every number in this chapter, from the forward pass to the recommender's predictions, was confirmed against numpy.
We were honest about the limits, too. Real systems add convolutions, attention, normalization, and training pipelines we did not build; embeddings encode the biases of their training data and their axes are not human-interpretable; the gradient route to factorization finds a good local optimum, not a guaranteed global one. But the skeleton is genuine, not a simplification we imposed: the engineers who build these systems reason in exactly the terms of this chapter. What we deferred — how the weights are learned — is the matrix calculus of gradients and backpropagation, where the transpose $W^{\mathsf{T}}$ of Chapter 8 reappears to carry error backward through the same chain the forward pass ran forward. That is a calculus course's worth of material, flagged here and waiting.
This chapter closes Part VI and, with it, the core of the book. Part VII opens the doors that lead onward — abstract inner product spaces that generalize the geometry an embedding lives in, the Jordan form for matrices diagonalization could not tame, the matrix exponential that solves differential equations, and the numerical realities of doing all this in finite precision. And in Chapter 39, the capstone, the from-scratch toolkit you have built decomposition by decomposition gets assembled on a real application — the recommender of this chapter being one of the choices, the place where rotate–stretch–rotate, low rank, and the dot product all come together in one runnable system. You have seen, end to end, that the mathematics of everything is, in a precise sense, the mathematics of the machines now reshaping everything.