Chapter 33 Quiz — Machine Learning and Linear Algebra
Twelve quick checks on neural-network layers, embeddings, and matrix-factorization recommenders. Try each before opening the answer. Notation is locked: $\mathbf{a} = \sigma(W\mathbf{x} + \mathbf{b})$, $R \approx UV^{\mathsf{T}}$, cosine similarity $\cos\theta = \dfrac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\lVert\mathbf{v}\rVert}$.
Q1. In one sentence, what is a fully-connected neural-network layer in linear-algebra terms?
Answer
A **matrix multiplication plus a bias plus a nonlinearity**: $\mathbf{a} = \sigma(W\mathbf{x} + \mathbf{b})$. The matrix $W$ applies a linear transformation, the bias $\mathbf{b}$ shifts it (making it affine), and the activation $\sigma$ bends each coordinate, breaking out of flat/linear geometry.
Q2. Why is the nonlinearity essential? What does a deep network become without it?
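Worked sketch for Q1 (the layer formula $\mathbf{a} = \sigma(W\mathbf{x} + \mathbf{b})$): the snippet below evaluates one layer by hand using only the Python standard library. The weights, bias, input, and the choice of a sigmoid for $\sigma$ are made-up illustrative values, and `dense_layer` is a hypothetical helper name, not something defined in the chapter.

```python
import math

def sigmoid(t):
    """Logistic activation, applied to one coordinate at a time."""
    return 1.0 / (1.0 + math.exp(-t))

def dense_layer(W, x, b, sigma):
    """Compute sigma(W x + b), with W given as a list of rows."""
    # Each pre-activation is one row of W dotted with x, plus that row's bias.
    pre = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [sigma(t) for t in pre]

# Made-up 3x2 weights: this layer maps a 2-vector to a 3-vector.
W = [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]]
b = [0.0, -1.5, 1.0]
x = [2.0, 1.0]

a = dense_layer(W, x, b, sigmoid)
print(a)  # three activations, one per row of W
```

The pre-activations here are $1$, $0$, and $3$, so the middle activation is $\sigma(0) = 0.5$.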
Answer
Without $\sigma$, a stack of layers $W_k(\cdots(W_1\mathbf{x}+\mathbf{b}_1)\cdots)+\mathbf{b}_k$ collapses to a *single* affine map $W'\mathbf{x}+\mathbf{b}'$ with $W' = W_k\cdots W_1$, by the composition rule of Chapter 8. So **without the nonlinearity, deep = shallow**: depth buys nothing, and the network can only represent linear (affine) functions. The nonlinearity between layers is what lets depth accumulate genuine expressive power.
Q3. Each output coordinate of $W\mathbf{x}$ is what, in the language of Chapter 18?
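Worked sketch for Q2 (the collapse of stacked layers): with two activation-free layers, the combined map should equal a single affine map with $W' = W_2W_1$ and $\mathbf{b}' = W_2\mathbf{b}_1 + \mathbf{b}_2$. The matrices below are arbitrary illustrative values, and the helper names are hypothetical.

```python
def matvec(A, x):
    """Matrix times vector, with A as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix product A B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Two layers with no activation between them (made-up values).
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [1.0, 0.0]
W2, b2 = [[3.0, 1.0], [1.0, 1.0]], [0.0, 2.0]

def two_layers(x):
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The single affine map they collapse to.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W_prime, x), b_prime)

x = [2.0, -1.0]
print(two_layers(x), one_layer(x))  # the two results agree
```

Inserting any nonlinear $\sigma$ between the two layers breaks this equality, which is the point of the answer.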
Answer
A **dot product** of a row of $W$ with the input $\mathbf{x}$, that is, a weighted sum of the inputs. A single neuron (one row of $W$, one bias, one $\sigma$) measures the alignment between its weight vector and the input and fires accordingly. This is the row-times-column view of matrix-vector multiplication, which Chapter 8 derived from composition rather than presenting as a memorized rule.
Q4. ReLU is $\sigma(t)=\max(0,t)$. Is the layer $\mathbf{x}\mapsto \mathrm{ReLU}(W\mathbf{x})$ a linear map? Is the pre-activation $\mathbf{x}\mapsto W\mathbf{x}$ linear?
Answer
The pre-activation $W\mathbf{x}$ **is** linear (it is just a matrix times a vector). The full layer with ReLU is **not** linear: ReLU is nonlinear (it bends at $0$), so $\mathrm{ReLU}(W(\mathbf{x}+\mathbf{y})) \neq \mathrm{ReLU}(W\mathbf{x}) + \mathrm{ReLU}(W\mathbf{y})$ in general. That failure of additivity is exactly the point: it is what makes the network more than a single matrix.
Q5. A model has "70 billion parameters." What, concretely, is being counted?
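Worked sketch for Q4 (ReLU breaks additivity): one counterexample is enough to show a map is not linear. The identity matrix and the two input vectors below are chosen only for illustration.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu_layer(W, x):
    """Apply ReLU coordinate-wise to the pre-activation W x."""
    return [max(0.0, t) for t in matvec(W, x)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

W = [[1.0, 0.0], [0.0, 1.0]]  # identity, so any failure is due to ReLU alone
x = [1.0, -2.0]
y = [-1.0, 3.0]

# The pre-activation is additive: W(x + y) equals Wx + Wy.
pre_sum = matvec(W, add(x, y))
sum_pre = add(matvec(W, x), matvec(W, y))

# The full layer is not: ReLU(W(x + y)) differs from ReLU(Wx) + ReLU(Wy).
layer_of_sum = relu_layer(W, add(x, y))
sum_of_layers = add(relu_layer(W, x), relu_layer(W, y))

print(pre_sum, sum_pre)             # equal
print(layer_of_sum, sum_of_layers)  # different
```

The mismatch comes from the negative coordinates: ReLU zeroes them before the sum on one side, but after the sum on the other.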
Answer
The total number of learnable numbers across **all the weight matrices and bias vectors** (and embedding tables) in the network. A layer with an $m\times n$ weight matrix and an $m$-vector bias contributes $mn + m$ parameters; summing over all layers gives the parameter count. Training's job is to choose good values for all of them.
Q6. What does it mean that "geometry is meaning" in an embedding space, and which quantity measures it?
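Worked sketch for Q5 (counting parameters): the $mn + m$ rule can be applied layer by layer and summed. The layer widths below are a hypothetical small network, not a model discussed in the chapter, and embedding tables are left out.

```python
def dense_params(n_in, n_out):
    """Parameters in one fully-connected layer: an n_out x n_in matrix plus an n_out bias."""
    return n_out * n_in + n_out

# Hypothetical layer widths: input 784, two hidden layers, output 10.
layer_sizes = [784, 128, 64, 10]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [100480, 8256, 650]
print(total)      # 109386
```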
Answer
Each word/item/user is a **learned vector**, arranged so that related things point in similar directions. Relatedness is measured by **cosine similarity** (Chapter 18): $+1$ for the same direction (very related), $0$ for perpendicular (unrelated), $-1$ for opposite. The model stores a *position in space*, and closeness in angle is closeness in meaning.
Q7. Why does "king − man + woman ≈ queen" work? What property of the embedding makes vector arithmetic meaningful?
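Worked sketch for Q6 (cosine similarity): the three landmark values $+1$, $0$, and $-1$ can be checked directly from the formula $\cos\theta = \dfrac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\lVert\mathbf{v}\rVert}$. The vectors are toy values, not learned embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine([1.0, 2.0], [2.0, 4.0])        # same direction, different length
perp = cosine([1.0, 0.0], [0.0, 3.0])        # perpendicular
opposite = cosine([1.0, 2.0], [-1.0, -2.0])  # opposite direction

print(same, perp, opposite)  # approximately 1, 0, -1 (up to floating-point rounding)
```

Note that the first pair differs in length by a factor of two yet still scores as the same direction.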
Answer
Because the embedding encodes **relationships as consistent difference vectors.** If the "gender" relationship is a fixed displacement, then $\text{queen}-\text{king}\approx\text{woman}-\text{man}$; rearranging gives the analogy. Subtracting "man" strips the male-human part, leaving the royalty direction; adding "woman" attaches the female-human part. The arithmetic works because a relationship is a *direction* you can add to any word.
Q8. Why is cosine similarity, rather than Euclidean distance, the usual meter for embedding relatedness?
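Worked sketch for Q7 (analogy arithmetic): the embedding below is hand-built, not learned. It assumes a two-dimensional space whose first axis means "royalty" and whose second means "gender", which is far tidier than any real embedding, so the analogy holds exactly here rather than approximately.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy vectors: [royalty, gender], with gender +1 for male and -1 for female.
emb = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

# king - man + woman, computed coordinate by coordinate.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Nearest remaining word by cosine similarity, skipping the three query words.
candidates = {w: v for w, v in emb.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(target, best)  # [1.0, -1.0] queen
```

The difference vectors `queen - king` and `woman - man` are both `[0.0, -2.0]` in this toy space, which is the "consistent displacement" the answer describes.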
Answer
In most embedding spaces the **direction** carries the meaning and the **length** carries something incidental (like frequency). Cosine ignores length (it divides out both norms), so two vectors pointing the same way count as "same meaning" regardless of magnitude. If vectors are normalized to unit length, cosine and Euclidean distance rank neighbors identically ($\lVert\hat{\mathbf{u}}-\hat{\mathbf{v}}\rVert^2 = 2 - 2\cos\theta$), which is why systems often normalize first.
Q9. State the matrix-factorization model of a recommender, and say what each factor means.
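Worked derivation for Q8 (the unit-vector identity): expanding the squared distance as a dot product, and using $\hat{\mathbf{u}}\cdot\hat{\mathbf{u}} = \hat{\mathbf{v}}\cdot\hat{\mathbf{v}} = 1$ together with $\hat{\mathbf{u}}\cdot\hat{\mathbf{v}} = \cos\theta$ for unit vectors, gives

$$
\lVert\hat{\mathbf{u}}-\hat{\mathbf{v}}\rVert^2
= (\hat{\mathbf{u}}-\hat{\mathbf{v}})\cdot(\hat{\mathbf{u}}-\hat{\mathbf{v}})
= \hat{\mathbf{u}}\cdot\hat{\mathbf{u}} - 2\,\hat{\mathbf{u}}\cdot\hat{\mathbf{v}} + \hat{\mathbf{v}}\cdot\hat{\mathbf{v}}
= 1 - 2\cos\theta + 1
= 2 - 2\cos\theta .
$$

Since $2 - 2\cos\theta$ shrinks as $\cos\theta$ grows, sorting neighbors by smallest distance gives the same order as sorting by largest cosine.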
Answer
$R \approx UV^{\mathsf{T}}$ (often $\mu + UV^{\mathsf{T}}$, with a global-mean baseline $\mu$ added to every entry). $U$ is $m\times k$: row $i$ is user $i$'s **latent factor vector** (their taste along $k$ hidden themes). $V$ is $n\times k$: row $j$ is item $j$'s latent factor vector (how much it expresses each theme). The predicted rating is the **dot product** $\mathbf{u}_i\cdot\mathbf{v}_j$ (plus $\mu$ when the baseline is used), which is high when a user's tastes align with an item's character. This is the low-rank approximation of Chapters 30–31 applied to a rating table.
Q10. How does a matrix-factorization recommender predict a rating for a user–item pair that was never observed?
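Worked sketch for Q9 (the factorization model): given factor matrices, every predicted rating is one row of $U$ dotted with one row of $V$, plus the baseline. The values of $\mu$, $U$, and $V$ below are invented for illustration (3 users, 4 items, $k = 2$); in practice they would be learned from data.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

mu = 3.0  # assumed global-mean rating

# U is m x k: one row of latent factors per user (m = 3, k = 2).
U = [[1.0, 0.0],
     [0.5, -1.0],
     [0.0, 1.0]]

# V is n x k: one row of latent factors per item (n = 4, k = 2).
V = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [-1.0, 0.5]]

def predict(i, j):
    """Predicted rating of item j by user i: mu + u_i . v_j."""
    return mu + dot(U[i], V[j])

# The full prediction table is mu + U V^T, an m x n matrix.
R_hat = [[predict(i, j) for j in range(len(V))] for i in range(len(U))]

print(R_hat[0][2])  # 4.0: user 0 aligns with item 2
print(R_hat[1][3])  # 2.0: user 1 points away from item 3
```

The table has $3\times 4 = 12$ entries but is generated from only $(3 + 4)\times 2 = 14$ factor numbers plus $\mu$; for realistic sizes the factors are far fewer than the entries.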
Answer
It learns the user's factor vector $\mathbf{u}_i$ from the items they *did* rate and the item's factor vector $\mathbf{v}_j$ from the users who *did* rate it, then predicts $\mu + \mathbf{u}_i\cdot\mathbf{v}_j$ for the unseen pair. The **low rank** is what makes this prediction meaningful: with few factors, the model cannot memorize each rating individually, so it must discover shared structure (themes) that generalizes to unseen pairs. This is collaborative filtering.
Q11. Why must the factorization be low-rank rather than full-rank?
Answer
A full-rank $UV^{\mathsf{T}}$ could fit the observed entries *perfectly* while leaving the missing entries completely unconstrained, so it would learn nothing about what to predict (the matrix version of overfitting from Chapter 17). Forcing the rank down to a small $k$ makes every user's row of predictions a combination of the same $k$ item patterns (the columns of $V$), which is precisely what lets the model **generalize** from observed to unobserved ratings.
Q12. At a high level, how are the weights of a network — or the factors of a recommender — learned? Where does the transpose come in?
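Worked sketch for Q11 (why full rank overfits): in the tiny $2\times 2$ table below, three ratings are observed and one is missing. With $k = 2$ (full rank), taking $V = I$ and $U$ equal to any completed table fits the observed entries exactly, whatever value is guessed for the gap. With $k = 1$, the observed entries pin the missing one down. All numbers are made up, and the rank-1 argument assumes the observed entries are nonzero.

```python
def times_transpose(U, V):
    """Compute U V^T for list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(u, v)) for v in V] for u in U]

# Observed ratings; entry (1, 1) is missing.
observed = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 4.0}

def fits(R_hat):
    return all(R_hat[i][j] == r for (i, j), r in observed.items())

def full_rank_fit(guess):
    """k = 2: U is the table with the gap filled by `guess`, V is the identity."""
    U = [[5.0, 1.0], [4.0, guess]]
    V = [[1.0, 0.0], [0.0, 1.0]]
    return times_transpose(U, V)

A = full_rank_fit(1.0)
B = full_rank_fit(5.0)
print(fits(A), fits(B))  # both fit the observed entries exactly
print(A[1][1], B[1][1])  # yet they disagree on the missing rating

# k = 1: u v^T must match 5, 1, 4, which forces the last entry to 4 * 1 / 5.
u = [[5.0], [4.0]]
v = [[1.0], [0.2]]
C = times_transpose(u, v)
print(C[1][1])  # about 0.8, determined by the observed entries
```

The full-rank model has no reason to prefer one guess over another, while the rank-1 model has no freedom left, which is the generalization the answer refers to.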