Case Study 1 — The Normal Distribution and Why It Appears Everywhere

DataField.Dev

Field: Statistics and data science

Tools from this chapter: the definite integral as area under a curve (§13.4), signed area (§13.6), the midpoint and trapezoidal rules (§13.13), and the standard-normal anchor (§13.14).

The problem nobody can solve with a formula

A quality engineer at a bottling plant measures the fill volume of a "500 mL" bottle. No machine is perfect: some bottles come out at 498 mL, some at 503 mL, most near 500. Plot thousands of these measurements and a familiar shape emerges — a symmetric mound, tall in the middle, tapering smoothly on both sides. It is the bell curve, and it governs an astonishing range of natural and manufactured quantities: human heights, measurement errors, exam scores, the daily returns of a stock index, the thermal noise in a sensor.

After rescaling the measurements to "how many standard deviations from the mean," the curve becomes the standard normal density of §13.14: $$\phi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}.$$ The engineer's real question is a probability: what fraction of bottles fall within one standard deviation of the target? In the language of this chapter, that fraction is an area under the curve: $$P(-1 \le X \le 1) = \int_{-1}^{1} \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx.$$
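The rescaling step can be sketched in a few lines of Python. The mean of 500 mL comes from the example above; the standard deviation of 2 mL and the sample readings are assumed values chosen purely for illustration, and `standardize` is a hypothetical helper name.

```python
def standardize(x_ml: float, mean_ml: float, sd_ml: float) -> float:
    """Return how many standard deviations x_ml lies from the mean."""
    return (x_ml - mean_ml) / sd_ml

MEAN_ML = 500.0  # target fill volume from the example
SD_ML = 2.0      # assumed standard deviation, for illustration only

readings = [498.0, 500.0, 503.0]  # illustrative readings, not real plant data
z_scores = [standardize(x, MEAN_ML, SD_ML) for x in readings]
print(z_scores)  # [-1.0, 0.0, 1.5]
```

Under that assumed spread, the event $-1 \le X \le 1$ corresponds to fills between 498 mL and 502 mL.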

And here is the wall every statistics student eventually hits. There is no elementary antiderivative for $\phi$. You can hunt forever for a tidy formula $F(x)$ with $F' = \phi$ built from powers, roots, exponentials, logs, and trig functions, and you will never find one — Liouville proved in the 1830s that none exists (§13.14). The Fundamental Theorem of Calculus, which arrives in Chapter 14, will apply perfectly to this integral; it will simply hand us an antiderivative we are unable to write down. So the FTC shortcut, powerful as it is, cannot rescue us here.

What can rescue us is the entire content of Chapter 13: the definite integral is the limit of Riemann sums, and a Riemann sum is just arithmetic. We will compute the bell-curve area the honest way — with rectangles and trapezoids — and watch it converge to the number every statistician knows.

Slicing the bell

Set $a = -1$, $b = 1$. We chop $[-1, 1]$ into $n$ equal strips of width $\Delta x = 2/n$, build a rectangle over each strip, and add the areas. The only decision is where to sample the height. The midpoint rule (§13.13) samples $\phi$ at the center of each strip, $\overline{x}_i = a + (i - \tfrac12)\Delta x$, because centering balances the curve's overshoot on one side of a strip against its undershoot on the other.

Start crude, with $n = 4$ so $\Delta x = 0.5$ and midpoints $-0.75, -0.25, 0.25, 0.75$. The density is symmetric, so $\phi(-0.75) = \phi(0.75)$ and $\phi(-0.25) = \phi(0.25)$. Evaluating the bell (these are the heights of our four rectangles): $$\phi(0.25) \approx 0.38667, \qquad \phi(0.75) \approx 0.30114.$$ The midpoint estimate is the common width times the sum of heights: $$M_4 = \Delta x \sum_{i=1}^{4} \phi(\overline{x}_i) = 0.5\big[2(0.38667) + 2(0.30114)\big] = 0.5(1.37562) = 0.68781.$$ Four rectangles, with heights you could read off a calculator, already give $0.6878$. The true value, as we are about to see, is $0.6827$. The estimate is off by about $0.005$, a relative error of roughly three-quarters of a percent, on the very first try.
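The same arithmetic can be checked in plain Python. This is a minimal sketch using only the standard library; the names `phi`, `midpoints`, and `m4` are illustrative.

```python
import math

def phi(x: float) -> float:
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

a, b, n = -1.0, 1.0, 4
dx = (b - a) / n                                            # strip width, 0.5
midpoints = [a + (i - 0.5) * dx for i in range(1, n + 1)]   # centers of the strips
m4 = dx * sum(phi(x) for x in midpoints)                    # midpoint estimate M_4

print(midpoints)      # [-0.75, -0.25, 0.25, 0.75]
print(round(m4, 5))   # 0.68781
```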

It is worth pausing on why the midpoint rule is so good here. Over each narrow strip the bell curve is nearly straight; the midpoint rectangle cuts through that near-line so that the little triangle of curve it misses on the left is almost exactly cancelled by the triangle it includes on the right. The errors are not just small — they are self-cancelling. The endpoint rules enjoy no such luck, which is why §13.13 ranks the midpoint rule $O(h^2)$ against the rectangles' $O(h)$.
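One way to see the difference between $O(h)$ and $O(h^2)$ is to double $n$ and watch how much the error shrinks. The sketch below is an illustration under stated assumptions: it works on $[0, 1]$ instead of $[-1, 1]$, because on a symmetric interval the endpoint sums of an even function coincide with the trapezoidal value and so look better than they usually are, and it takes its reference value from Python's `math.erf`, a tool this chapter does not otherwise use. The helper names are hypothetical.

```python
import math

def phi(x: float) -> float:
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def left_sum(f, a: float, b: float, n: int) -> float:
    dx = (b - a) / n
    return dx * sum(f(a + i * dx) for i in range(n))

def midpoint_sum(f, a: float, b: float, n: int) -> float:
    dx = (b - a) / n
    return dx * sum(f(a + (i + 0.5) * dx) for i in range(n))

a, b = 0.0, 1.0
exact = 0.5 * math.erf(1 / math.sqrt(2))   # reference value for the area over [0, 1]

def error_ratio(rule, n: int) -> float:
    """Error with n strips divided by error with 2n strips."""
    coarse = abs(rule(phi, a, b, n) - exact)
    fine = abs(rule(phi, a, b, 2 * n) - exact)
    return coarse / fine

left_ratio = error_ratio(left_sum, 50)
mid_ratio = error_ratio(midpoint_sum, 50)
print(round(left_ratio, 2), round(mid_ratio, 2))
```

Halving the strip width should roughly halve the left-endpoint error (a ratio near 2) and roughly quarter the midpoint error (a ratio near 4), which is what first-order and second-order accuracy mean in practice.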

Refining toward the truth

A single estimate is a guess; a sequence of estimates that stabilizes is a measurement. Refine the partition and watch the numbers lock in. The following code implements the four workhorse rules of §13.13 (left, right, midpoint, and the trapezoidal rule, which §13.13 shows equals the average of left and right) and reports them as $n$ grows. The outputs are the values the chapter's §13.14 computation produces.

import numpy as np

def phi(x: np.ndarray) -> np.ndarray:
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density

def rules(f, a: float, b: float, n: int) -> dict:
    dx = (b - a) / n
    xl = a + dx * np.arange(n)            # left endpoints
    xr = a + dx * np.arange(1, n + 1)     # right endpoints
    xm = a + dx * (np.arange(n) + 0.5)    # midpoints
    left  = dx * np.sum(f(xl))
    right = dx * np.sum(f(xr))
    mid   = dx * np.sum(f(xm))
    trap  = 0.5 * (left + right)          # trapezoid = average of left & right (§13.13)
    return {"left": left, "right": right, "midpoint": mid, "trapezoid": trap}

for n in (4, 50, 1000):
    r = rules(phi, -1.0, 1.0, n)
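    # float(v) keeps the printout as plain numbers; newer NumPy versions otherwise show np.float64(...)
    print(n, {k: round(float(v), 6) for k, v in r.items()})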

# Output:
# 4    {'left': 0.672522, 'right': 0.672522, 'midpoint': 0.687806, 'trapezoid': 0.672522}
# 50   {'left': 0.682625, 'right': 0.682625, 'midpoint': 0.682722, 'trapezoid': 0.682625}
# 1000 {'left': 0.682689, 'right': 0.682689, 'midpoint': 0.68269,  'trapezoid': 0.682689}
# Every rule converges to 0.6827 -- the famous "68%" within one standard deviation.

Two features deserve comment. First, the left and right sums come out identical. That is not a bug — it is the symmetry of $\phi$ (§13.6). Because $\phi(-x) = \phi(x)$ and the interval $[-1,1]$ is symmetric about $0$, every left rectangle has a mirror-image right rectangle of exactly the same height, so the two sums are forced to agree. Second, by $n = 1000$ all four rules report $0.6827$ to four decimals. The estimate has stopped moving; we have measured the area.
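The symmetry argument can also be written in one line. With sample points $x_i = a + i\,\Delta x$, as in the code, the left and right sums share every interior sample, so their difference collapses to the two endpoint terms: $$R_n - L_n = \Delta x \sum_{i=1}^{n} \phi(x_i) - \Delta x \sum_{i=0}^{n-1} \phi(x_i) = \Delta x\,\big[\phi(b) - \phi(a)\big].$$ With $a = -1$ and $b = 1$, evenness gives $\phi(1) = \phi(-1)$, so $R_n = L_n$ for every $n$, and the trapezoidal value, being their average, equals both.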

That number, $0.6827$, is the most quoted figure in all of applied statistics: the "68%" of the empirical "68–95–99.7" rule. About 68% of any normally distributed quantity lands within one standard deviation of its mean. The bottling engineer now knows that roughly 68% of bottles fall within one standard deviation of 500 mL — a fact computed from nothing but rectangles.

Why this matters beyond one integral

Every confidence interval, every $p$-value, every $z$-score you will ever meet is, underneath, an area under a bell curve exactly like this one (§13.14). When statistical software reports "$p = 0.03$," it did not consult a closed-form formula, because none exists; it evaluated $\int \phi$ numerically to high accuracy, the industrial-strength descendant of the midpoint sum above. The "standard normal table" printed in the back of every statistics text is a list of such integrals, $\Phi(z) = \int_{-\infty}^{z}\phi(x)\,dx$, each entry the output of a numerical approximation in the same spirit as our Riemann sums.
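To illustrate what such a table contains, the sketch below regenerates a few entries. It relies on the identity $\Phi(z) = \tfrac12\big[1 + \operatorname{erf}(z/\sqrt{2})\big]$ and on Python's `math.erf`. This is one convenient numerical route, not necessarily the one any particular statistics package uses, and `Phi` is an illustrative name.

```python
import math

def Phi(z: float) -> float:
    """Standard normal cumulative area, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A few table-style entries: z, then the area to the left of z
for z in (0.0, 0.5, 1.0, 1.5):
    print(f"{z:4.1f}  {Phi(z):.4f}")

# The area between -1 and 1 is a difference of two table entries
within_one_sd = Phi(1.0) - Phi(-1.0)
print(round(within_one_sd, 4))   # 0.6827
```

The last line recovers the same $0.6827$ that the rectangle sums converged to, now as a difference of two cumulative areas.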

This single integral is also the anchor example that threads through the rest of this book (see the continuity tracker's anchor list). Chapter 14 will explain why the cumulative-area function $\Phi(x) = \int_{-\infty}^{x}\phi(t)\,dt$ satisfies $\Phi'(x) = \phi(x)$ even though $\Phi$ has no closed form — that is the Fundamental Theorem at work. Chapter 23 will expand $e^{-x^2/2}$ as a Taylor series and integrate it term by term, producing a series for the bell-curve area that converges to any precision we like. The area under the normal curve is the thread tying Riemann sums, the Fundamental Theorem, and infinite series into one story — and it begins here, with four rectangles and a sum.
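As a small preview of that series computation, here is a hedged sketch. It assumes the Maclaurin expansion $e^{-x^2/2} = \sum_{k \ge 0} (-1)^k x^{2k} / (2^k\,k!)$, integrates each term over $[-1, 1]$ by the power rule, and adds up the first few terms. The function name `series_area` and the choice of term counts are illustrative only.

```python
import math

def series_area(terms: int) -> float:
    """Partial sum of the term-by-term integral of phi over [-1, 1]."""
    total = 0.0
    for k in range(terms):
        # The integral of x^(2k) over [-1, 1] is 2 / (2k + 1)
        total += (-1) ** k * 2.0 / (2 ** k * math.factorial(k) * (2 * k + 1))
    return total / math.sqrt(2 * math.pi)

for terms in (1, 2, 4, 10):
    print(terms, round(series_area(terms), 6))
```

With ten terms the partial sum agrees with the rectangle-based value $0.6827$ to well beyond four decimals, since the terms shrink factorially.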

Discussion Questions

1. The midpoint rule beat the endpoint rules badly at $n = 4$ ($0.6878$ vs. $0.6725$) but the gap nearly vanished by $n = 50$. Using the error-scaling table of §13.13, explain why the advantage of the midpoint rule is largest when $n$ is small.
2. The left and right sums were identical here because of symmetry. Construct an interval, such as $[0, 2]$, on which they would not be equal, and predict which one overestimates $\int_0^2 \phi\,dx$. (Hint: is $\phi$ increasing or decreasing on $[0,2]$? Recall §13.2.)
3. A colleague proposes to find $\int_{-1}^1 \phi\,dx$ by "just finding the antiderivative of $\phi$." Explain, in one paragraph, why this plan fails and why numerical integration is not a second-best workaround but the correct method (§13.14).
4. The bottling engineer wants the fraction within two standard deviations, $P(-2 \le X \le 2)$. Describe how you would adapt the code above, and state which famous percentage you expect the sum to converge to.

A Short Annotated Reading

OpenStax, Introductory Statistics, "The Normal Distribution" — the applied side of this case study: how areas under $\phi$ become probabilities, $z$-scores, and the empirical rule, with no calculus prerequisites assumed. Read it to see where the number $0.6827$ goes once you have computed it.
Stewart, Calculus: Early Transcendentals (9th ed.), §5.2 "The Definite Integral" — the rigorous source for the Riemann-sum machinery used here; pairs directly with §13.4–13.5 of this chapter.
Stewart, §7.7 "Approximate Integration" — the full treatment of the midpoint and trapezoidal rules (and Simpson's rule, our Chapter 16), including the error bounds that explain the convergence you watched above.