Case Study 2 — Normalizing a Joint Distribution for a Recommender System

Field: Data science (probability modeling, statistical inference)

A data scientist named Theo is building the scoring engine for a streaming service's recommender. For each user he tracks two normalized features: $X$, how much of a title's runtime they watched (a fraction in $[0,1]$), and $Y$, how quickly they started watching after the title appeared in their feed (also rescaled to $[0,1]$, where $1$ means "instantly"). His team's hypothesis is that these two behaviors are correlated — people who start fast also tend to finish — and they want a joint probability density $p(x,y)$ over the unit square that captures it. Theo's job today is to turn a raw, unnormalized model into a legitimate probability density, then use it to answer two business questions. Every step is a double integral, and getting the normalization wrong would silently corrupt every probability the system reports.

The unnormalized model

From fitting the data, Theo's model for the shape of the density is

$$\tilde p(x,y) = x + y \quad\text{on the unit square } [0,1]^2,$$

and zero outside. The form $x + y$ rises toward the corner $(1,1)$ — high-completion, fast-start users are the densest cluster — and falls toward $(0,0)$. But $\tilde p$ is only proportional to a probability density; it does not yet integrate to $1$. A genuine density must satisfy the normalization condition from §32.11:

$$\iint_{\mathbb{R}^2} p(x,y)\,dA = 1.$$

So Theo needs a constant $c$ with $p(x,y) = c\,(x+y)$ and $\iint_{[0,1]^2} c(x+y)\,dA = 1$.

Step 1 — Find the normalizing constant

He computes the total mass of $\tilde p$ over the square. Because the region is a rectangle he can iterate in either order (Fubini, §32.2); he integrates $y$ first:

$$\iint_{[0,1]^2}(x+y)\,dA = \int_0^1\!\!\int_0^1 (x+y)\,dy\,dx = \int_0^1\left[xy + \frac{y^2}{2}\right]_{y=0}^{y=1}dx = \int_0^1\left(x + \frac12\right)dx.$$

The outer integral is $\left[\frac{x^2}{2} + \frac{x}{2}\right]_0^1 = \frac12 + \frac12 = 1$. So $\iint \tilde p\,dA = 1$ already — a lucky accident of this particular model — and the normalizing constant is simply $c = 1$. Theo double-checks by noting the integrand $x + y$ is a sum, so the rectangle factoring shortcut does not apply; he was right to iterate honestly. His density is

$$p(x,y) = x + y, \qquad (x,y)\in[0,1]^2.$$

Had the integral come out to, say, $\tfrac32$, the constant would have been $c = \tfrac23$ — the reciprocal of the total. The normalization step is never optional; it is what converts a shape into probabilities.
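Step 1 can be sanity-checked numerically. The sketch below is illustrative only and is not part of Theo's pipeline as described: it approximates the total mass over the unit square with a midpoint rule in plain Python and takes the reciprocal as the normalizing constant. The helper names `midpoint_double_integral` and `normalizing_constant`, the grid size `n`, and the second test shape $x + y + \tfrac12$ (chosen because its total mass is $\tfrac32$) are assumptions made for this example.

```python
def midpoint_double_integral(f, n=200):
    """Approximate the integral of f(x, y) over [0,1]^2 with an n-by-n midpoint rule."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h          # midpoint of the i-th x-strip
        for j in range(n):
            y = (j + 0.5) * h      # midpoint of the j-th y-strip
            total += f(x, y)
    return total * h * h           # each cell has area h * h


def normalizing_constant(f, n=200):
    """Return c = 1 / (total mass), so that c * f integrates to 1 over the square."""
    return 1.0 / midpoint_double_integral(f, n)


def p_tilde(x, y):
    # Theo's unnormalized shape from the text
    return x + y


total_mass = midpoint_double_integral(p_tilde)
c = normalizing_constant(p_tilde)
print(f"total mass ~ {total_mass:.6f}, c ~ {c:.6f}")

# Hypothetical shape with total mass 3/2, illustrating the c = 2/3 remark
c_alt = normalizing_constant(lambda x, y: x + y + 0.5)
print(f"c for a total mass of 3/2 ~ {c_alt:.6f}")
```

Because both integrands are linear, the midpoint rule is exact here up to floating-point rounding. For a curved shape the same code would give only an approximation that improves as `n` grows.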

Step 2 — A region probability (the business question)

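Marketing wants to know: what fraction of users are "low-engagement," meaning their two scores sum to at most $1$, that is, $X + Y \le 1$? (For a continuous density the boundary line carries zero probability, so "less than" and "at most" give the same answer.) This is a probability over a triangular sub-region $R$, the part of the unit square on or below the line $x + y = 1$ (§32.11):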

$$P(X + Y \le 1) = \iint_R (x+y)\,dA.$$

Theo sketches the region first, exactly as the chapter insists. The triangle has vertices $(0,0)$, $(1,0)$, $(0,1)$. Using Type I limits (§32.3), $x$ runs from $0$ to $1$, and for each $x$ the strip runs from $y = 0$ up to the line $y = 1 - x$:

$$P(X+Y\le 1) = \int_0^1\!\!\int_0^{1-x}(x+y)\,dy\,dx.$$

The inner integral:

$$\int_0^{1-x}(x+y)\,dy = \left[xy + \frac{y^2}{2}\right]_{0}^{1-x} = x(1-x) + \frac{(1-x)^2}{2}.$$

Expand: $x(1-x) = x - x^2$, and $\frac{(1-x)^2}{2} = \frac{1 - 2x + x^2}{2} = \frac12 - x + \frac{x^2}{2}$. Adding,

$$x - x^2 + \frac12 - x + \frac{x^2}{2} = \frac12 - \frac{x^2}{2}.$$

The linear terms cancelled, leaving a clean $\frac12 - \frac{x^2}{2}$. Now the outer integral:

$$P(X+Y\le 1) = \int_0^1\left(\frac12 - \frac{x^2}{2}\right)dx = \left[\frac{x}{2} - \frac{x^3}{6}\right]_0^1 = \frac12 - \frac16 = \frac13.$$

So one-third of users fall in the low-engagement triangle. A useful cross-check: if the density were uniform ($p = 1$), the probability would be the triangle's area, $\tfrac12$. Theo's density tilts mass toward the high-engagement corner $(1,1)$, which lies outside the triangle, so it is sensible that the low-engagement probability ($\tfrac13$) comes out below the uniform value ($\tfrac12$). The number passed his intuition test before he reported it.
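The Type I setup of Step 2 translates almost line for line into a nested loop. The following is a minimal sketch, assuming a midpoint rule in both directions; the function name `prob_sum_at_most_one` and the grid size `n` are hypothetical choices for illustration.

```python
def prob_sum_at_most_one(n=400):
    """Approximate P(X + Y <= 1) for p(x, y) = x + y using Type I limits."""
    hx = 1.0 / n
    outer = 0.0
    for i in range(n):
        x = (i + 0.5) * hx         # x runs from 0 to 1
        y_top = 1.0 - x            # strip runs from y = 0 up to the line y = 1 - x
        hy = y_top / n             # inner step shrinks as the strip gets shorter
        inner = 0.0
        for j in range(n):
            y = (j + 0.5) * hy
            inner += x + y         # the density p(x, y) = x + y
        outer += inner * hy        # inner integral for this x
    return outer * hx


p_low = prob_sum_at_most_one()
triangle_area = 0.5                # value under a uniform density, for comparison
print(f"P(X + Y <= 1) ~ {p_low:.6f} (uniform baseline {triangle_area})")
```

The variable upper limit `y_top` is the piece that encodes the triangle. Replacing it with `1.0` would integrate over the whole square and recover the total mass from Step 1.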

Step 3 — A marginal distribution

Product wants to understand the completion feature $X$ on its own, ignoring start-speed. That means the marginal density $p_X(x)$, obtained by integrating out $y$ (§32.11):

$$p_X(x) = \int_0^1 (x+y)\,dy = \left[xy + \frac{y^2}{2}\right]_{y=0}^{y=1} = x + \frac12, \qquad 0 \le x \le 1.$$

This is the distribution of completion fraction across the whole user base. Theo verifies it is a legitimate 1D density: $\int_0^1\left(x + \frac12\right)dx = \frac12 + \frac12 = 1$ ✓. The marginal rises with $x$, confirming that more users skew toward higher completion. He can now report, for instance, that the average completion fraction is $\int_0^1 x\,p_X(x)\,dx = \int_0^1 x(x + \tfrac12)\,dx = \left[\frac{x^3}{3} + \frac{x^2}{4}\right]_0^1 = \frac13 + \frac14 = \frac{7}{12} \approx 0.58$.
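The marginal and the mean can be checked the same way. This is a sketch under the same assumptions as the hand computation ($p(x,y) = x + y$ on the unit square); the helper names `marginal_x` and `integrate_unit_interval` are invented for the example.

```python
def integrate_unit_interval(g, n=500):
    """Midpoint-rule approximation of the integral of g over [0, 1]."""
    h = 1.0 / n
    return sum(g((i + 0.5) * h) for i in range(n)) * h


def marginal_x(x):
    """Integrate out y: p_X(x) = integral over [0, 1] of (x + y) dy."""
    return integrate_unit_interval(lambda y: x + y)


# The marginal should itself integrate to 1
mass = integrate_unit_interval(marginal_x)

# Mean completion fraction: E[X] = integral of x * p_X(x) over [0, 1]
mean_x = integrate_unit_interval(lambda x: x * marginal_x(x))

print(f"p_X(0.3) ~ {marginal_x(0.3):.6f}")
print(f"mass of marginal ~ {mass:.6f}, E[X] ~ {mean_x:.6f}")
```

Computing `marginal_x` numerically rather than hard-coding $x + \tfrac12$ keeps the check independent of the algebra it is meant to verify.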

Step 4 — Why this scales to the Gaussian

Theo's $x+y$ model is a toy; the production system uses a bivariate Gaussian density, the smooth hill the chapter built in §32.11. Its independent standard form is

$$p(x,y) = \frac{1}{2\pi}\,e^{-(x^2+y^2)/2}.$$

The constant $\frac{1}{2\pi}$ is there for the same reason Theo's $c$ was: to force the total integral to $1$. Verifying that integral is precisely the polar Gaussian trick of §32.5. Switch to polar coordinates, where the extra factor $r$ in $dA = r\,dr\,d\theta$ makes $\int_0^\infty e^{-r^2/2}\,r\,dr = 1$ elementary; the angular integral contributes $2\pi$, so the unnormalized total is $2\pi$ and the constant must be $\frac{1}{2\pi}$. Equivalently, $\int_{-\infty}^\infty e^{-x^2/2}\,dx = \sqrt{2\pi}$, which is the rescaled form of $\int_{-\infty}^\infty e^{-x^2}\,dx = \sqrt\pi$, and squaring it gives $2\pi$. The same normalized 2D Gaussian, Theo notes, is the blur kernel used to smooth the recommender's heatmaps before display: each pixel becomes a $p$-weighted average of its neighbors, and the weights must integrate to $1$ so the image neither brightens nor darkens. The humble normalization integral he did by hand in Step 1 is the conceptual seed of both the statistical model and the image-processing filter.
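The polar computation can also be confirmed numerically. The sketch below is an illustration, not the production system's code: it integrates the radial part with a midpoint rule, truncating the infinite range at an assumed cutoff `r_max = 10` (beyond which the integrand is negligible), and multiplies by the angular factor $2\pi$. The function name and parameters are hypothetical.

```python
import math


def gaussian_total_mass_polar(r_max=10.0, n=20000):
    """Approximate the integral of (1 / 2pi) * exp(-(x^2 + y^2) / 2) over the plane in polar form."""
    h = r_max / n
    radial = 0.0
    for k in range(n):
        r = (k + 0.5) * h
        # The factor r comes from the area element dA = r dr dtheta
        radial += math.exp(-0.5 * r * r) * r
    radial *= h
    angular = 2.0 * math.pi        # the integrand does not depend on theta
    return (1.0 / (2.0 * math.pi)) * angular * radial


total = gaussian_total_mass_polar()
print(f"total mass of the standard bivariate Gaussian ~ {total:.8f}")
```

Dropping the factor `r` from the loop would turn the radial integrand back into $e^{-r^2/2}$, which has no elementary antiderivative. That is the reason the change to polar coordinates helps.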

Closing the loop

In four steps Theo went from a raw shape to a deployable model: he normalized the density (Step 1), answered a region-probability question for marketing (Step 2), extracted a one-feature marginal for product (Step 3), and connected the whole pipeline to the Gaussian that powers production (Step 4). The recurring discipline of sketching the region, normalizing before computing probabilities, and sanity-checking against the uniform case is exactly what keeps a probabilistic system honest.

Discussion Questions

In Step 1 the normalization integral happened to equal $1$, so $c = 1$. Rework the problem for the model $\tilde p(x,y) = xy$ on the unit square: find the $c$ that normalizes it. (You will need the rectangle factoring shortcut from §32.2.)
The uniform-density cross-check in Step 2 gave $\tfrac12$, and Theo's tilted density gave $\tfrac13$. Construct a different density on the square for which $P(X+Y\le 1)$ would come out greater than $\tfrac12$, and explain where its mass concentrates.
Step 3 found the marginal $p_X(x) = x + \tfrac12$ by integrating out $y$. What would the marginal $p_Y(y)$ be, and what does the symmetry of $p(x,y) = x+y$ tell you before computing?
The chapter computes $\int_{-\infty}^\infty e^{-x^2}\,dx = \sqrt\pi$ by squaring and going to polar (§32.5). Explain in your own words why this "lift to 2D" works for the Gaussian but the same trick gives nothing useful for, say, $\int_0^1 (x+y)\,dx$.

Annotated Reading

Wasserman, All of Statistics, §2.4–2.5 (Joint and Marginal Distributions). Concise, rigorous definitions of joint densities, normalization, and marginals — the formal backbone of Steps 1–3. Pairs directly with §32.11.
OpenStax, Calculus Volume 3, §5.2 (Double Integrals over General Regions), example on probability. A free worked example computing a probability as a double integral over a non-rectangular region, mirroring Step 2.
Bishop, Pattern Recognition and Machine Learning, §2.3 (The Gaussian Distribution). Where the bivariate Gaussian of Step 4 becomes the workhorse of modeling; shows the normalization constant in its general (covariance-matrix) form and why it matters for every probabilistic model.