Case Study 1 — How a Streaming Service Predicts What You'll Watch: Matrix Factorization in Action

DataField.Dev

Field: data science / media & entertainment (the chapter anchor).

Concepts used: rating matrix, low-rank approximation (Chapters 30–31), matrix factorization $R \approx UV^{\mathsf{T}}$, latent factors, dot product as alignment (Chapter 18), collaborative filtering, cold start.

Why it matters: the recommendation engines behind video, music, and shopping platforms drive an enormous fraction of what people watch and buy. At their mathematical core they are a single low-rank factorization of a giant, mostly-empty table: the exact idea you met compressing an image in Chapter 31, redirected at human preference.

The problem: a giant table with the answers missing

Imagine a streaming service with a hundred million subscribers and tens of thousands of titles. Conceptually there is one enormous matrix $R$ behind the whole business: rows are users, columns are titles, and the entry $R_{ij}$ is how much user $i$ liked title $j$ — a star rating, a thumbs up/down, or an implicit signal like "watched to the end." The trouble is that $R$ is almost entirely empty. A typical subscriber has rated, or even watched, a vanishing fraction of the catalog. If the matrix is a hundred million by fifty thousand, it has five trillion cells, of which perhaps $99.9\%$ are blank. The recommender's whole job is to fill in the blanks: predict the rating each user would give each unseen title, then surface the titles with the highest predicted scores.
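To make the scale concrete, the arithmetic behind those figures can be sketched in a few lines. The subscriber count, title count, and 99.9% blank rate are the round numbers from the paragraph above; the 4-byte cell size is an assumption added here purely for the storage estimate.

```python
# Back-of-envelope size of the rating matrix, using the round figures in the text.
n_users = 100_000_000        # a hundred million subscribers
n_titles = 50_000            # fifty thousand titles

total_cells = n_users * n_titles
observed_cells = total_cells // 1000           # 99.9% blank leaves 0.1% observed
ratings_per_user = observed_cells / n_users

# Assumption: 4 bytes per cell (float32) if the table were stored densely.
dense_terabytes = total_cells * 4 / 1e12

print(f"total cells:      {total_cells:,}")       # 5,000,000,000,000
print(f"observed cells:   {observed_cells:,}")    # 5,000,000,000
print(f"ratings per user: {ratings_per_user:.0f}")  # 50
print(f"dense storage:    {dense_terabytes:.0f} TB")  # 20 TB
```

Even at 99.9% blank, that still leaves billions of observed ratings to learn from, about fifty per subscriber on average.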

You cannot do this by looking up an answer — the answer is precisely what is missing. And you cannot do it title-by-title in isolation, because most user–title pairs have never met. What you can exploit is that human taste is not a chaos of independent whims. People's preferences are governed by a modest number of hidden themes — appetite for sci-fi versus cooking shows, prestige drama versus light comedy, recent versus classic — and those few themes explain a great deal of the rating pattern. In the language of this book, the rating matrix $R$, though astronomically large, is approximately low-rank: it sits close to a matrix of small rank $k$, because only $k$ underlying taste-dimensions are really at work. This is the same observation that let Chapter 31 compress an image — most of the matrix's "energy" lives in a few singular directions — applied now to preference instead of pixels.
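A small experiment can make "approximately low-rank" tangible. The sketch below builds a fully observed $6\times 6$ table from two made-up taste dimensions plus a little noise, then asks how much of the matrix's energy the top two singular values carry. The weights, the noise level, and the names `user_taste` and `title_theme` are all illustrative assumptions, and the SVD applies here only because this synthetic table has no blanks.

```python
import numpy as np

# Hypothetical taste weights; columns are (sci-fi, cooking).
user_taste = np.array([[2.0, 0.5]] * 3 + [[0.5, 2.0]] * 3)    # 6 users
title_theme = np.array([[2.0, 0.5]] * 3 + [[0.5, 2.0]] * 3)   # 6 titles

rng = np.random.default_rng(0)
noise = rng.normal(0, 0.1, (6, 6))                # small idiosyncratic wobble
ratings = user_taste @ title_theme.T + noise      # rank-2 signal plus noise

s = np.linalg.svd(ratings, compute_uv=False)      # singular values, largest first
energy = s**2 / np.sum(s**2)                      # share of squared Frobenius norm

print("singular values:", np.round(s, 2))
print("energy in top 2:", round(float(energy[:2].sum()), 4))
```

Two directions account for nearly all of the energy, and the remaining four singular values only describe the noise.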

The model: factor the ratings into user and item vectors

The low-rank structure is captured by factoring $R$ into a thin product, $$ R \approx UV^{\mathsf{T}}, $$ where $U$ is $m\times k$ (a row of $k$ numbers per user) and $V$ is $n\times k$ (a row of $k$ numbers per title), with $k$ small — a few dozen in practice. Each user gets a latent factor vector $\mathbf{u}_i$ and each title a latent factor vector $\mathbf{v}_j$, and the predicted rating is their dot product, often offset by a global-mean baseline $\mu$: $$ \widehat{R}_{ij} = \mu + \mathbf{u}_i\cdot\mathbf{v}_j = \mu + \sum_{f=1}^{k} u_{if}\,v_{jf}. $$ Read this as Chapter 18's alignment meter: $u_{if}$ is "how much user $i$ likes theme $f$," $v_{jf}$ is "how much title $j$ expresses theme $f$," and the dot product is large when the user's tastes point the same way as the title's character. A sci-fi lover ($\mathbf{u}_i$ pointing toward the sci-fi direction) dotted with a sci-fi title ($\mathbf{v}_j$ also pointing that way) gives a high predicted rating; the same lover dotted with a cooking show gives a low one.
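The prediction formula is short enough to run by hand. In this minimal sketch the baseline `mu` and the three factor vectors are invented for illustration, with the first coordinate standing for "sci-fi-ness" and the second for "cooking-ness"; learned factors rarely come out this tidy.

```python
import numpy as np

mu = 3.0                                   # hypothetical global mean rating
u_scifi_fan = np.array([1.0, -0.8])        # made-up user vector
v_scifi_show = np.array([1.0, -0.9])       # made-up title vectors
v_cooking_show = np.array([-0.9, 1.0])

def predict(mu, u, v):
    """Predicted rating: baseline plus the user-title dot product."""
    return mu + float(u @ v)

print(round(predict(mu, u_scifi_fan, v_scifi_show), 2))     # 4.72
print(round(predict(mu, u_scifi_fan, v_cooking_show), 2))   # 1.3
```

Aligned vectors push the prediction above the baseline, and opposed vectors pull it below.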

The factors are learned, never assigned. We choose $U$ and $V$ to reproduce the observed ratings as closely as possible — and crucially we fit only the entries we actually have, never the blanks, so the missing data exerts no false pull. Because $k$ is small there are far fewer numbers in $U$ and $V$ than there are observed ratings, so the factors cannot memorize each rating; they are forced to discover the genuine shared themes. Then we predict the blanks with the same factors. This is collaborative filtering: users collaborate, unknowingly, by collectively revealing the themes, so that one viewer's ratings inform predictions for another.
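One common way to make "fit only the entries we have" precise is a regularized squared-error objective over the set $\Omega$ of observed user–title pairs. The symbols $\Omega$, $\lambda$ (regularization strength), $\eta$ (step size), and $e_{ij}$ (residual) are notation introduced here as an assumption, not the chapter's own:

$$ \min_{U,V}\; \frac{1}{2}\sum_{(i,j)\in\Omega}\bigl(R_{ij}-\mu-\mathbf{u}_i\cdot\mathbf{v}_j\bigr)^2 \;+\; \frac{\lambda}{2}\Bigl(\lVert U\rVert_F^2+\lVert V\rVert_F^2\Bigr). $$

Writing $e_{ij} = R_{ij}-\mu-\mathbf{u}_i\cdot\mathbf{v}_j$ for the residual on an observed cell, one gradient-descent step on this objective is

$$ \mathbf{u}_i \leftarrow \mathbf{u}_i + \eta\Bigl(\sum_{j:(i,j)\in\Omega} e_{ij}\,\mathbf{v}_j - \lambda\,\mathbf{u}_i\Bigr), \qquad \mathbf{v}_j \leftarrow \mathbf{v}_j + \eta\Bigl(\sum_{i:(i,j)\in\Omega} e_{ij}\,\mathbf{u}_i - \lambda\,\mathbf{v}_j\Bigr). $$

The sums run over observed cells only, so a blank contributes nothing to either update. The $\lambda$ term shrinks the factors toward zero, which discourages them from chasing noise in the few ratings each user supplies.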

A worked instance: six viewers, six shows

Let us shrink the trillion-cell problem to something we can read in full. Six users, six shows; shows $0,1,2$ are sci-fi, shows $3,4,5$ are cooking. Users $0,1,2$ lean sci-fi; users $3,4,5$ lean cooking. Ratings run $1$–$5$, and four cells are blank ($\bullet$): user $0$ never rated show $4$ (a cooking title), user $1$ never rated show $5$ (another cooking title), user $2$ never rated show $2$ (a sci-fi title), and user $4$ never rated show $0$ (another sci-fi title). $$ R = \begin{bmatrix} 5 & 5 & 4 & 1 & \bullet & 1 \\ 4 & 5 & 5 & 2 & 1 & \bullet \\ 5 & 4 & \bullet & 1 & 2 & 1 \\ 1 & 2 & 1 & 5 & 4 & 5 \\ \bullet & 1 & 2 & 4 & 5 & 4 \\ 2 & 1 & 1 & 5 & 5 & 4 \end{bmatrix} $$

Two of the blanks are placed deliberately on sci-fi shows, one for a sci-fi fan (user $2$, show $2$) and one for a cooking fan (user $4$, show $0$), so we can watch the model predict in two opposite directions. We factor $R \approx \mu + UV^{\mathsf{T}}$ with rank $k = 2$, fitting only the observed cells by gradient descent and hoping the two latent dimensions will line up with "sci-fi-ness" and "cooking-ness."

# Streaming recommender: factor a 6x6 rating matrix with four blanks, rank 2.
import numpy as np
R = np.array([[5,5,4,1,0,1],   # 0 marks an unknown rating
              [4,5,5,2,1,0],
              [5,4,0,1,2,1],
              [1,2,1,5,4,5],
              [0,1,2,4,5,4],
              [2,1,1,5,5,4]], dtype=float)
mask = (R > 0).astype(float)
m, n, k = 6, 6, 2
mu = R[mask > 0].mean()                       # global mean ~ 3.062
rng = np.random.default_rng(1)
U = rng.normal(0, 0.1, (m, k)); V = rng.normal(0, 0.1, (n, k))
lr, reg = 0.01, 0.05
for _ in range(8000):
    E = mask * (R - (mu + U @ V.T))           # error on KNOWN entries only
    U += lr * (E @ V - reg * U)
    V += lr * (E.T @ U - reg * V)
P = mu + U @ V.T
rmse = np.sqrt((mask * (R - P) ** 2).sum() / mask.sum())
print("train RMSE:", round(rmse, 3))          # 0.242
print("predict (user2, show2) =", round(P[2, 2], 2))  # 4.51  sci-fi fan, sci-fi show
print("predict (user4, show0) =", round(P[4, 0], 2))  # 2.42  cooking fan, sci-fi show

The model fits the $32$ observed ratings with a root-mean-square error of about $0.24$ stars — excellent for a two-factor model — and its predictions for the blanks are exactly what taste would suggest. User $2$, a sci-fi fan, is predicted to rate the missing sci-fi show $2$ a high $4.51$; user $4$, a cooking fan, is predicted to rate the missing sci-fi show $0$ a low $2.42$. The recommender would happily suggest show $2$ to user $2$ and would not push show $0$ on user $4$. And — the point worth repeating — nobody told the model which shows are sci-fi. It inferred the two-theme structure from the rating pattern alone, then transferred that knowledge to pairs it had never observed. The sci-fi-loving users collectively taught the model what "sci-fi-ness" is; that learned direction predicted a rating no single user supplied.

Reading the factors, and the geometry of "more like this"

The learned item matrix $V$ places each show as a point in a $2$-dimensional latent plane, and the three sci-fi shows cluster in one region while the three cooking shows cluster in another — the model has built an embedding of the catalog in which similar shows are nearby vectors. That is not a separate feature bolted on; it falls out of the factorization for free. The everyday "Because you watched X" row is then a cosine-similarity search (Chapter 18) over these item vectors: find the shows whose factor vectors point most nearly the same way as $X$'s. The user vectors live in the same plane, so "recommended for you" is "find the shows whose vectors best align with your vector" — a dot-product ranking, exactly the prediction formula. The recommender, the embedding, and the similarity search are one geometry used three ways.
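Both searches can be sketched in a few lines. The item vectors below are hypothetical stand-ins chosen to form two clusters, not the values the fit above actually produces, and the helper names `most_similar` and `recommend` are invented for this example.

```python
import numpy as np

# Hypothetical 2-D item vectors (illustrative, not the fitted values).
V = np.array([[ 1.1, -0.9],    # show 0, sci-fi
              [ 1.0, -1.0],    # show 1, sci-fi
              [ 0.9, -0.8],    # show 2, sci-fi
              [-1.0,  1.0],    # show 3, cooking
              [-0.9,  1.1],    # show 4, cooking
              [-1.1,  0.9]])   # show 5, cooking

def most_similar(V, item, top=2):
    """'Because you watched X': items whose vectors point most nearly the same way."""
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)   # normalize each row
    cos = unit @ unit[item]                               # cosine with every item
    cos[item] = -np.inf                                   # exclude the item itself
    return np.argsort(-cos)[:top]

def recommend(V, u, seen, top=2):
    """'Recommended for you': rank unseen items by dot product with user vector u."""
    scores = V @ u
    for j in seen:
        scores[j] = -np.inf                               # skip already-watched items
    return np.argsort(-scores)[:top]

print(most_similar(V, 0).tolist())                              # [2, 1]
print(recommend(V, np.array([1.0, -0.8]), seen={0}).tolist())   # [1, 2]
```

Cosine similarity ignores vector length and compares direction only, which suits "more like this." The user-facing ranking keeps the raw dot product because that is the quantity the model was trained to predict.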

What this leaves out, honestly, and why it still scales

Our six-by-six toy omits much of what a production system layers on top, and a careful reader should know what. Real systems add per-user and per-item bias terms (some users rate generously; some titles are universally loved), implicit feedback (what you watched and abandoned, not just what you rated), temporal effects (taste drifts), and increasingly neural networks that learn richer, nonlinear factor representations than a plain dot product. The famous \$1M Netflix Prize (2006–2009) was won by a large ensemble in which matrix factorization models of this $R\approx UV^{\mathsf{T}}$ form were a central component, and the technique remains foundational even where neural models now sit on top.
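The first of those additions is easy to write down. A widely used form of the biased model, with $b_i$ and $c_j$ as notation assumed here for the user and title offsets, extends the prediction to

$$ \widehat{R}_{ij} = \mu + b_i + c_j + \mathbf{u}_i\cdot\mathbf{v}_j, $$

where $b_i$ absorbs how generously user $i$ rates overall and $c_j$ absorbs how well title $j$ is liked across the board. The biases are learned alongside $U$ and $V$, leaving the dot product to explain only the interaction between this user's taste and this title's character.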

But the skeleton genuinely scales. Going from six users to a hundred million changes the size of $U$ and $V$, not the mathematics: still a low-rank factorization, still fit on observed entries by gradient descent (never by forming the full matrix, which would be impossible at trillions of cells), still predicting via dot products. The gradient route is preferred over a literal SVD for precisely the reasons the chapter's Math-Major Sidebar gives — the matrix is mostly missing, and you cannot factor a matrix you cannot even store — but the target is the low-rank approximation of Chapter 31, the rank-$k$ matrix closest to the observed ratings.
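One way to honor "never form the full matrix" is to store only the observed ratings as (user, title, rating) triples and update the factors one rating at a time. The sketch below does this for the same 32 observed ratings as the toy example, reusing its learning rate and regularization. The per-rating stochastic update is a swapped-in variant of the full-batch loop shown earlier, and the epoch count is an assumption, so its numbers need not match that fit exactly.

```python
import numpy as np

# Observed ratings only, as (user, title, rating); the four blanks are simply absent.
triples = np.array([
    (0,0,5),(0,1,5),(0,2,4),(0,3,1),(0,5,1),
    (1,0,4),(1,1,5),(1,2,5),(1,3,2),(1,4,1),
    (2,0,5),(2,1,4),(2,3,1),(2,4,2),(2,5,1),
    (3,0,1),(3,1,2),(3,2,1),(3,3,5),(3,4,4),(3,5,5),
    (4,1,1),(4,2,2),(4,3,4),(4,4,5),(4,5,4),
    (5,0,2),(5,1,1),(5,2,1),(5,3,5),(5,4,5),(5,5,4)])
users, titles = triples[:, 0], triples[:, 1]
ratings = triples[:, 2].astype(float)

n_users, n_titles, k = 6, 6, 2
lr, reg = 0.01, 0.05                      # settings borrowed from the toy fit
mu = ratings.mean()
rng = np.random.default_rng(1)
U = rng.normal(0, 0.1, (n_users, k))
V = rng.normal(0, 0.1, (n_titles, k))

def rmse(U, V):
    """Root-mean-square error over the observed triples only."""
    pred = mu + np.sum(U[users] * V[titles], axis=1)   # one dot product per rating
    return float(np.sqrt(np.mean((ratings - pred) ** 2)))

rmse_start = rmse(U, V)
for epoch in range(500):                               # assumed epoch count
    for idx in rng.permutation(len(ratings)):          # visit ratings in random order
        i, j, r = users[idx], titles[idx], ratings[idx]
        err = r - (mu + U[i] @ V[j])                   # residual on this one rating
        u_old = U[i].copy()                            # keep pre-update user vector
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * u_old - reg * V[j])
rmse_end = rmse(U, V)

print("train RMSE before:", round(rmse_start, 3), "after:", round(rmse_end, 3))
```

Memory here grows with the number of observed ratings plus the size of $U$ and $V$, never with the $m \times n$ table itself, which is what lets the same loop run against billions of ratings.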

The method's honest limit is the cold-start problem. A brand-new user with no ratings has no learned vector to place them in the latent space, and a brand-new title nobody has rated has no vector either — there is simply nothing to pin the point down, just as you cannot fit a line through zero data points. Platforms bridge this with signup surveys, content metadata (genre, cast), and non-personalized "trending" fallbacks until a few ratings arrive. That limit, too, is a linear-algebra fact: you need enough independent observations to determine your unknowns, and a recommender obeys the same rank logic as every system in this book.
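A fallback chain of this kind might be sketched as follows. The function name, the dictionary layout, and the ids `"ana"`, `"star_saga"`, and `"new_show"` are all hypothetical. The sketch covers only the non-personalized fallbacks; survey-based or metadata-based vectors would slot in as additional branches.

```python
import numpy as np

def predict_with_fallback(user, title, mu, U, V, item_mean):
    """Hypothetical cold-start fallback chain.

    U, V: dicts mapping known user / title ids to learned factor vectors.
    item_mean: dict mapping title ids to their average observed rating.
    """
    if user in U and title in V:
        return mu + float(U[user] @ V[title])   # personalized dot-product score
    if title in item_mean:
        return item_mean[title]                 # unknown user: title's popularity
    return mu                                   # nothing known: global mean

mu = 3.0
U = {"ana": np.array([1.0, -0.8])}
V = {"star_saga": np.array([1.0, -0.9])}
item_mean = {"star_saga": 4.1}

print(round(predict_with_fallback("ana", "star_saga", mu, U, V, item_mean), 2))  # 4.72
print(predict_with_fallback("new_user", "star_saga", mu, U, V, item_mean))       # 4.1
print(predict_with_fallback("new_user", "new_show", mu, U, V, item_mean))        # 3.0
```

Each branch uses strictly less information than the one before it, mirroring the point that a prediction can only be as specific as the observations that pin it down.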

The takeaway for this chapter's anchor: the engine deciding your next show is not exotic. It is a cousin of the SVD that drops the orthogonality requirement on its factors: a low-rank factorization $R\approx UV^{\mathsf{T}}$ whose predicted ratings are dot products of learned user and item vectors. The mathematics of Part IV (the dot product) and Part VI (low rank) is, quite literally, choosing what a hundred million people watch tonight.