Chapter 33 Exercises — Machine Learning and Linear Algebra

Twenty-five problems in four tiers. ⭐ conceptual · ⭐⭐ hand computation · ⭐⭐⭐ proof (A) or coding (C) · ⭐⭐⭐⭐ application. Notation is locked: a layer is $\mathbf{a} = \sigma(W\mathbf{x} + \mathbf{b})$; a recommender is $R \approx UV^{\mathsf{T}}$; cosine similarity is $\cos\theta = \dfrac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\lVert\mathbf{v}\rVert}$. Where a problem says "verify with numpy," compute by hand first, then check; your hand answer and the code must agree.


Tier 1 — Conceptual (⭐)

1. ⭐ Write down the equation for a single neural-network layer and name each of the four symbols ($\mathbf{a}$, $\sigma$, $W$, $\mathbf{b}$). In one sentence each, say what the matrix multiply and the bias do geometrically.

2. ⭐ A network has layers of shapes $\mathbb{R}^{10} \to \mathbb{R}^{32} \to \mathbb{R}^{32} \to \mathbb{R}^{4}$. Give the shape ($\text{rows}\times\text{cols}$) of each of the three weight matrices.

3. ⭐ Explain in your own words why a stack of linear layers with no nonlinearity has exactly the same expressive power as a single matrix. Which earlier chapter's result is this?

4. ⭐ True or false, with one sentence of justification: "ReLU, $\sigma(t) = \max(0, t)$, is a linear function."

5. ⭐ What does it mean to say an embedding "encodes meaning in its geometry"? Which quantity from Chapter 18 do we use to read relatedness off two embedding vectors, and what do the values $+1$, $0$, and $-1$ signify?

6. ⭐ In the recommender model $R \approx UV^{\mathsf{T}}$, what does a single row of $U$ represent? A single row of $V$? What does the dot product of one row of each compute?

7. ⭐ Why must a recommender's factorization be low rank? What goes wrong (with respect to predicting unseen ratings) if we let $UV^{\mathsf{T}}$ have full rank?


Tier 2 — Hand computation (⭐⭐)

8. ⭐⭐ A layer has $W = \begin{bmatrix} 2 & 0 \\ 1 & -1 \\ 0 & 3 \end{bmatrix}$, $\mathbf{b} = \begin{bmatrix} 1 \\ 0 \\ -2 \end{bmatrix}$, and ReLU activation. Compute the pre-activation $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ and the activation $\mathbf{a} = \mathrm{ReLU}(\mathbf{z})$ for the input $\mathbf{x} = (1, 2)$. Which neuron, if any, is switched off?

9. ⭐⭐ Using the layer from Problem 8, now feed $\mathbf{x} = (0, 0)$. Compute $\mathbf{z}$ and $\mathbf{a}$. Explain why the bias alone determines the output here.

10. ⭐⭐ Two linear layers (no nonlinearity, no bias) use $A = \begin{bmatrix} 2 & 1 \\ 0 & 1 \end{bmatrix}$ (applied first) and $B = \begin{bmatrix} 1 & 0 \\ 3 & 2 \end{bmatrix}$ (applied second). Find the single matrix equivalent to the two layers, then apply it to $\mathbf{x} = (1, 1)$. (Careful about the order of multiplication.)

11. ⭐⭐ Compute the cosine similarity of $\mathbf{u} = (3, 4)$ and $\mathbf{v} = (4, 3)$ by hand. Are they more aligned, less aligned, or equally aligned compared with $\mathbf{u} = (3,4)$ and $\mathbf{w} = (-4, 3)$? Compute both.

12. ⭐⭐ Toy embeddings (axes = country-ness, Italy-ness, capital-ness): $\text{Paris} = (1,0,1)$, $\text{France} = (1,0,0)$, $\text{Italy} = (0,1,0)$. Compute $\text{Paris} - \text{France} + \text{Italy}$ by hand. If $\text{Rome} = (0,1,1)$, did the analogy land on Rome?

13. ⭐⭐ Compute the sigmoid $\sigma(t) = 1/(1 + e^{-t})$ at $t = 0$ and confirm it equals $0.5$. Without a calculator, state whether $\sigma(-3)$ is closer to $0$ or to $1$, and why.

14. ⭐⭐ A tiny rating matrix is $R = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$ (no missing entries). Show that $R$ has rank $1$ by exhibiting column vectors $\mathbf{u}\in\mathbb{R}^2$ and $\mathbf{v}\in\mathbb{R}^2$ with $R = \mathbf{u}\mathbf{v}^{\mathsf{T}}$. (A rank-1 recommender would reproduce this $R$ exactly.)

15. ⭐⭐ In the recommender model with a global mean baseline, the prediction is $\widehat{R}_{ij} = \mu + \mathbf{u}_i\cdot\mathbf{v}_j$. If $\mu = 3$, $\mathbf{u}_i = (1, -2)$, and $\mathbf{v}_j = (0.5, -1)$, compute the predicted rating.


Tier 3 — Proof (A) / Coding (C) (⭐⭐⭐)

16. ⭐⭐⭐ (A) Prove that the composition of two affine maps is affine. That is, if $f(\mathbf{x}) = A\mathbf{x} + \mathbf{c}$ and $g(\mathbf{y}) = B\mathbf{y} + \mathbf{d}$, show $g(f(\mathbf{x}))$ has the form $W\mathbf{x} + \mathbf{e}$ and give $W$ and $\mathbf{e}$ explicitly. Conclude that any finite stack of affine layers is a single affine layer.

17. ⭐⭐⭐ (A) Show that ReLU breaks the collapse: find a $1\times 1$ example (scalars $w_1, w_2$ and the map $x \mapsto w_2\,\mathrm{ReLU}(w_1 x)$) for which no single linear function $x \mapsto cx$ agrees with it for all $x$. (Hint: pick signs so the composition bends at the origin.)

18. ⭐⭐⭐ (C) Write a function forward(layers, x) that takes a list of (W, b, activation) triples and an input vector, and returns the network's output by chaining $\mathbf{a} = \sigma(W\mathbf{x} + \mathbf{b})$. Test it on the §33.3 network (two layers, ReLU then sigmoid) and confirm it reproduces $\mathbf{a}_2 \approx (0.1192, 0.9241)$.

19. ⭐⭐⭐ (C) Verify the collapse numerically. For random $W_1$ ($4\times 3$) and $W_2$ ($2\times 4$), confirm with np.allclose that W2 @ (W1 @ x) == (W2 @ W1) @ x for several random $\mathbf{x}$. Then insert a ReLU between them and show the two sides now differ.

20. ⭐⭐⭐ (C) Write nearest(word, emb, k) that returns the $k$ words with the highest cosine similarity to a given word in an embedding dictionary emb (a {word: np.array} map), excluding the query word itself. Test it on the §33.6 toy embedding: the nearest word to king should be man.

21. ⭐⭐⭐ (C) Implement cosine_matrix(V) that, given an $n\times k$ matrix of item embeddings, returns the $n\times n$ matrix of pairwise cosine similarities. (Hint: normalize each row to unit length, then one matrix product gives all pairs at once — a nice use of Chapter 8.)


Tier 4 — Application (⭐⭐⭐⭐)

22. ⭐⭐⭐⭐ (C) The anchor recommender. Reproduce the §33.9 demo: build the $5\times 4$ rating matrix with the two missing entries, factor it as $R \approx \mu + UV^{\mathsf{T}}$ with $k=2$ by gradient descent on the observed entries (seed $0$), and confirm the train RMSE is $\approx 0.21$ and the predictions are $P[0,3] \approx 1.5$ and $P[3,1] \approx 3.0$. Then change the rank to $k=1$ and to $k=3$ and report how the train RMSE changes. Explain the trend using the low-rank-approximation idea of Chapter 31.

23. ⭐⭐⭐⭐ (C) Recovering planted structure. Construct a "true" rank-$2$ rating matrix $R_{\text{true}} = U_0 V_0^{\mathsf{T}}$ from random $U_0$ ($20\times 2$), $V_0$ ($15\times 2$). Hide $30\%$ of the entries at random (mask them). Factor the observed part with $k=2$ and measure the error of your predictions on the hidden entries (not the observed ones). Does low-rank factorization recover the planted ratings it never saw? Now repeat with $k=2$ but $R_{\text{true}}$ of true rank $5$ — does a rank-$2$ model still predict well? Discuss.

24. ⭐⭐⭐⭐ (C) Bias in embeddings. Build a small embedding where one axis is a "sentiment" direction. Show that a neutral word's vector, when you add a learned "profession" difference vector, can pick up unintended sentiment — illustrating how analogy arithmetic can surface bias. Write two sentences on why this is a consequence of learned statistics, not a bug in the linear algebra, and connect it to how models work.

25. ⭐⭐⭐⭐ (A/C) From SVD to recommender (synthesis). On a fully observed small matrix, show that the truncated SVD of Chapter 30 (np.linalg.svd, keep top $k$) and a gradient-descent factorization (no regularization, no missing entries) reach the same reconstruction error $\lVert R - UV^{\mathsf{T}}\rVert_F$. Explain, citing the Eckart–Young theorem from Chapter 31, why the truncated SVD is the best rank-$k$ factorization, and state the two reasons production recommenders use gradient descent instead (missing entries; scale).


Notes on the coding problems

All coding problems use only numpy and the chapter's snippets as starting points. For Problems 22–23 and 25, set a fixed random seed (np.random.default_rng(0)) so your results are reproducible and match the chapter. Problem 25 is the natural bridge to the Chapter 39 capstone, where the recommender is assembled on a larger, real dataset.