Case Study 1: The Bell Curve in Standardized Testing — How SAT and ACT Scores Are Designed to Be Normal
The Scenario
It's late October, and Sam Okafor is sitting in the Riverside University library, staring at a practice SAT score report. He scored 1280. His roommate scored 1180. His girlfriend scored 1350. But what do those numbers actually mean?
"How do they know what counts as a 'good' score?" Sam asks his statistics tutor, a grad student named Priya. "Like, who decided that 1060 is average? Why not 500, or 1000, or something round?"
Priya grins. "That's actually a really interesting question. And the answer is: the College Board literally constructed the test so that the scores would follow a normal distribution. The bell curve isn't something they discovered in the data. It's something they built."
Sam blinks. "Wait — the scores are designed to be normal? I thought the normal distribution was some natural thing."
"Sometimes it is," Priya says. "But sometimes we engineer it. And understanding the difference is one of the most important things you can learn in a statistics class."
How the SAT Creates a Bell Curve
The SAT (and its competitor, the ACT) goes through an elaborate process to ensure scores are approximately normally distributed. Here's how:
Step 1: Item Selection
The College Board tests thousands of potential questions on sample groups before any question appears on a real SAT. Each question is evaluated for its difficulty level — the percentage of students who answer it correctly. The final test is assembled with a carefully calibrated mix of difficulties:
- A few very easy questions (85-95% of students get them right)
- Many moderately difficult questions (40-60% correct)
- A few very hard questions (10-25% correct)
This mix is specifically chosen to produce a bell-shaped distribution of total scores. If the test were all easy questions, everyone would score high, creating a left-skewed distribution. If all questions were hard, everyone would score low, creating a right-skewed distribution. The mix ensures the familiar bell shape.
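To see why the mix matters, here is a small simulation of my own (not the College Board's actual procedure): treat every student as identical, give each of 50 questions one of the difficulty levels listed above, and sum the correct answers. By the Central Limit Theorem, the totals come out bell-shaped.

```python
import numpy as np

rng = np.random.default_rng(42)
n_students, n_questions = 10_000, 50

# Hypothetical difficulty mix echoing the bullets above:
# p = chance a student answers that question correctly.
p_correct = np.concatenate([
    np.full(5, 0.90),    # a few very easy questions
    np.full(35, 0.50),   # many moderately difficult questions
    np.full(10, 0.20),   # a few very hard questions
])

# Each student answers every question independently;
# a raw score is the count of correct answers.
answers = rng.random((n_students, n_questions)) < p_correct
scores = answers.sum(axis=1)

print(f"mean = {scores.mean():.1f}, sd = {scores.std():.1f}")
```

A histogram of `scores` is approximately bell-shaped even though no individual question is: summing many items of mixed difficulty manufactures the curve.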
Step 2: Equating and Scaling
Raw scores (number correct) are converted to scaled scores (200-800 per section) through a process called equating. This transformation is designed to:
- Make scores comparable across different test dates
- Produce a distribution with a predetermined mean and standard deviation
- Ensure the distribution is approximately symmetric
The SAT aims for a mean of approximately 1060 (combined) and a standard deviation of about 217. The ACT aims for a mean around 20-21 and a standard deviation around 5-6.
Step 3: Norming
Both tests are "normed" — calibrated against a representative national sample. This norming process ensures that the percentile ranks associated with particular scores remain stable from year to year, give or take small shifts.
The Numbers: SAT Score Distribution
Based on recent College Board data, the SAT score distribution looks approximately like this:
$$X_{\text{SAT}} \sim N(1060, 217^2)$$
Let's use this model to answer questions Sam and his friends might ask.
"What percentile is my 1280?"
$$z = \frac{1280 - 1060}{217} = \frac{220}{217} = 1.01$$
From the z-table: $P(Z \leq 1.01) = 0.8438$
Sam is at approximately the 84th percentile — he scored higher than about 84% of test takers. Not bad.
"How does my roommate's 1180 compare?"
$$z = \frac{1180 - 1060}{217} = \frac{120}{217} = 0.55$$
$P(Z \leq 0.55) = 0.7088$
About the 71st percentile — above average, but not as far above as it might seem from the raw score.
"What score do you need for the top 5%?"
We need the 95th percentile. From the z-table, $P(Z \leq 1.645) \approx 0.95$.
$$x = 1060 + 1.645 \times 217 = 1060 + 357 = 1417$$
You need about a 1420 (rounding to the nearest 10) to be in the top 5%.
Python Verification
```python
from scipy import stats

mu, sigma = 1060, 217

# Sam's percentile
sam_z = (1280 - mu) / sigma
sam_pct = stats.norm.cdf(1280, mu, sigma)
print(f"Sam (1280): z = {sam_z:.2f}, percentile = {sam_pct:.1%}")

# Roommate's percentile
roommate_pct = stats.norm.cdf(1180, mu, sigma)
print(f"Roommate (1180): percentile = {roommate_pct:.1%}")

# Top 5% cutoff
top_5 = stats.norm.ppf(0.95, mu, sigma)
print(f"Top 5% cutoff: {top_5:.0f}")

# Score ranges
print("\nScore distribution:")
for pct in [10, 25, 50, 75, 90, 95, 99]:
    score = stats.norm.ppf(pct / 100, mu, sigma)
    print(f" {pct}th percentile: {score:.0f}")
```
Output:

```
Sam (1280): z = 1.01, percentile = 84.5%
Roommate (1180): percentile = 71.0%
Top 5% cutoff: 1417

Score distribution:
 10th percentile: 782
 25th percentile: 914
 50th percentile: 1060
 75th percentile: 1206
 90th percentile: 1338
 95th percentile: 1417
 99th percentile: 1565
```
The ACT Comparison
The ACT has a different scale but the same normal structure:
$$X_{\text{ACT}} \sim N(20.8, 5.8^2)$$
This creates an interesting problem: how do you compare a student who scored 1280 on the SAT with one who scored 28 on the ACT?
Z-scores to the rescue — the same tool from Chapter 6 that let Alex compare watch time to engagement scores.
$$z_{\text{SAT}} = \frac{1280 - 1060}{217} = 1.01 \qquad z_{\text{ACT}} = \frac{28 - 20.8}{5.8} = 1.24$$
The ACT 28 is slightly more impressive ($z = 1.24$, about the 89th percentile) than the SAT 1280 ($z = 1.01$, about the 84th percentile). Without z-scores, you'd have no way to make this comparison.
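The cross-test comparison generalizes to a short helper; the normal parameters below are the ones used throughout this case study:

```python
from scipy import stats

def normal_percentile(score, mu, sigma):
    """Return (z, percentile rank) under a normal model N(mu, sigma^2)."""
    z = (score - mu) / sigma
    return z, stats.norm.cdf(z)

sat_z, sat_p = normal_percentile(1280, mu=1060, sigma=217)
act_z, act_p = normal_percentile(28, mu=20.8, sigma=5.8)

print(f"SAT 1280: z = {sat_z:.2f}, percentile = {sat_p:.1%}")
print(f"ACT 28:   z = {act_z:.2f}, percentile = {act_p:.1%}")
# The higher z wins the comparison, regardless of the raw scales.
```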
The Deeper Question: Normal by Nature or by Design?
This is where the case study gets interesting — and connects to Theme 4 (the bell curve isn't destiny).
Sam's question was incisive: "Are scores naturally normal, or designed to be normal?"
The answer is: designed. And this has profound implications.
What the Design Tells Us
The fact that SAT scores are normal tells us about the test, not about the students. The College Board chooses questions, sets difficulty levels, and applies scaling specifically to produce a bell curve. If they wanted to, they could construct a test that produced a uniform distribution, or a bimodal one, or almost any shape they desired.
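That claim is easy to demonstrate. A rank-based (quantile) transform can reshape the same raw scores into nearly any target distribution: convert each raw score to its percentile rank, then push the ranks through the inverse CDF of whatever shape you want. The raw-score model here is made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
raw = rng.binomial(50, 0.55, size=10_000)  # hypothetical raw scores

# Percentile rank of each raw score (average ranks for ties).
ranks = stats.rankdata(raw) / (len(raw) + 1)

bell_scores = stats.norm.ppf(ranks, loc=1060, scale=217)  # engineered bell curve
flat_scores = 400 + 1200 * ranks                          # engineered uniform

print(f"bell: mean = {bell_scores.mean():.0f}, sd = {bell_scores.std():.0f}")
print(f"flat: mean = {flat_scores.mean():.0f}, sd = {flat_scores.std():.0f}")
```

Same students, same answers, two completely different score distributions: the shape is a design decision, not a discovery.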
This means:
- The bell curve of scores does not represent a bell curve of intelligence or ability. The scores are a transformed, engineered measurement, not a direct observation of cognitive capacity.
- The percentile ranks are relative, not absolute. Saying someone is at the 84th percentile means they scored higher than 84% of test takers on this particular test. It says nothing about their absolute intellectual capability.
- Changes in the test change the distribution. When the SAT was redesigned in 2016, the new scoring scale produced different means and standard deviations. Students didn't suddenly get smarter or dumber — the measuring instrument changed.
The Misuse of the Bell Curve
Throughout history, the bell-shaped distribution of test scores has been used to argue that intelligence is "naturally" distributed — that most people are "average," a few are "gifted," and a few are "deficient," and that this is simply a law of nature.
This reasoning is flawed for several reasons:
- The normal distribution of scores is an artifact of test design. You could build a test that gives everyone the same score, or one that separates people into two clumps. The bell shape is a choice.
- Environmental factors affect scores. Access to test preparation, quality of schools, family income, nutrition, exposure to test-taking strategies — all of these influence scores. The distribution of scores reflects the distribution of opportunity, not just the distribution of ability.
- The model is useful, not true. The normal distribution is a good approximation of SAT scores, which makes it a useful tool for admissions offices and students. But treating the model as a law of nature leads to fatalistic thinking: "The bell curve says most people are mediocre." No. The bell curve says that a particular measurement tool, applied to a particular population, produces a particular distribution of scores. That's a much more modest — and more accurate — claim.
Professor Washington's Lens
This matters beyond standardized testing. Professor Washington encounters the same issue with risk assessment algorithms. If a predictive policing algorithm assigns risk scores that follow a bell curve, it's tempting to think the bell curve represents some fundamental distribution of "criminality" in the population. But the distribution reflects the algorithm's design choices: which features it includes, how it weights them, what training data it uses.
"When someone shows you a bell curve," Washington tells his students, "your first question should be: Who built this instrument, and what choices did they make?"
Discussion Questions
- If the SAT were redesigned to produce a uniform distribution of scores (everyone equally spread from 400 to 1600), how would this change the college admissions process? Would it be fairer? Less fair? Why?
- The College Board publishes mean scores broken down by race, gender, and family income, and significant gaps exist between groups. Given what you've learned about how the test is designed, what can and cannot be concluded from these score gaps?
- Some universities have moved to "test-optional" admissions, partly because of concerns about what standardized tests actually measure. Based on what you've learned about the normal distribution as a model, write a paragraph supporting or opposing this policy change. Use statistical reasoning.
- Apply z-scores to your own test experience. If a class exam has $\mu = 72$ and $\sigma = 10$, and a standardized test has $\mu = 500$ and $\sigma = 100$, on which exam is a score of 92 more impressive? Show your work.
Key Takeaways
- SAT and ACT scores are engineered to follow a normal distribution through careful question selection, scaling, and norming
- The bell curve of scores reflects the test design, not a natural law of intelligence
- Z-scores allow comparison across different tests and scoring scales
- The normal model is useful for calculating percentiles and making comparisons — but it should not be mistaken for a description of human potential
- Always ask: "Is this bell curve a feature of the phenomenon, or a feature of the measurement?"