Case Study 2: Medical Screening and the False Positive Paradox — When Proportion Inference Meets Bayes' Theorem

The Scenario

Dr. Maya Chen has been asked to evaluate whether a new rapid screening test for a respiratory illness should be deployed across her county. The test manufacturer claims the following performance characteristics:

  • Sensitivity: 92% (if you have the illness, the test correctly identifies it 92% of the time)
  • Specificity: 95% (if you don't have the illness, the test correctly says "negative" 95% of the time)

These numbers sound impressive. A test that's right 92-95% of the time seems pretty good. But Maya, who studied Bayes' theorem in her statistics education (Chapter 9), knows she needs one more piece of information: the prevalence of the illness in the population being screened.

And that's where proportion inference comes in.

Step 1: Estimating Prevalence (Proportion Inference)

Maya doesn't know the exact prevalence of the respiratory illness in her county. National estimates put it at about 3%, but her county is different — it's more rural, has an older population, and has had recent outbreaks.

She randomly samples 500 adults in the county and tests them using the gold-standard diagnostic test (not the rapid screening test). Of the 500, 24 test positive.

The Hypothesis Test

Maya wants to test whether her county's prevalence exceeds the national rate of 3%.

Step 1: Hypotheses

  • $H_0: p = 0.03$ (county prevalence equals the national rate)
  • $H_a: p > 0.03$ (county prevalence exceeds the national rate)

Step 2: Check Conditions

  • Random sample? Yes.
  • Independence? The sample is less than 10% of the county's 85,000 adults: $500 \leq 0.10 \times 85{,}000$. ✓
  • Success-failure? $np_0 = 500 \times 0.03 = 15 \geq 10$ ✓ and $n(1-p_0) = 500 \times 0.97 = 485 \geq 10$ ✓

Step 3: Test Statistic

$$\hat{p} = \frac{24}{500} = 0.048$$

$$z = \frac{0.048 - 0.03}{\sqrt{0.03 \times 0.97 / 500}} = \frac{0.018}{\sqrt{0.0000582}} = \frac{0.018}{0.00763} = 2.36$$

Step 4: P-value

$$P(Z \geq 2.36) = 0.0091$$

Step 5: Decision

$p = 0.009 < \alpha = 0.05$. Reject $H_0$.

Conclusion: There is statistically significant evidence that the prevalence of the respiratory illness in Maya's county exceeds the national rate of 3% ($z = 2.36$, $p = 0.009$).
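
Maya's calculation can be checked in a few lines of Python — a sketch using only the standard library, where `erfc` supplies the upper-tail normal probability:

```python
from math import sqrt, erfc

# One-proportion z-test for H0: p = 0.03 vs Ha: p > 0.03
n, x, p0 = 500, 24, 0.03
p_hat = x / n                        # sample proportion: 24/500 = 0.048
se = sqrt(p0 * (1 - p0) / n)         # standard error under H0
z = (p_hat - p0) / se                # test statistic
p_value = 0.5 * erfc(z / sqrt(2))    # upper-tail P(Z >= z)

print(f"p-hat = {p_hat:.3f}, z = {z:.2f}, p-value = {p_value:.3f}")
# prints: p-hat = 0.048, z = 2.36, p-value = 0.009
```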

The Confidence Interval

Maya also constructs a 95% CI for the county prevalence:

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.048 \pm 1.960 \times \sqrt{\frac{0.048 \times 0.952}{500}}$$

$$= 0.048 \pm 1.960 \times 0.00956 = 0.048 \pm 0.019$$

$$\text{95% CI: } (0.029, 0.067)$$

Because $\hat{p}$ is close to 0 and $n$ is moderate, Maya also computes the Wilson interval:

$$\text{95% Wilson CI: } (0.032, 0.070)$$

The Wilson interval suggests the prevalence is between 3.2% and 7.0% — plausibly above the national average.
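
Both intervals can be recomputed directly — a sketch using the textbook Wald and Wilson formulas:

```python
from math import sqrt

# 95% CIs for the county prevalence: standard (Wald) and Wilson
n, x, z = 500, 24, 1.96
p_hat = x / n

# Wald: p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n)
se = sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - z * se, p_hat + z * se)

# Wilson: recenters the interval and widens it, behaving better near 0
center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
half = (z / (1 + z**2 / n)) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
wilson = (center - half, center + half)

print(f"Wald:   ({wald[0]:.3f}, {wald[1]:.3f})")     # (0.029, 0.067)
print(f"Wilson: ({wilson[0]:.3f}, {wilson[1]:.3f})") # (0.032, 0.070)
```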

Step 2: Connecting Prevalence to Bayes' Theorem

Now comes the critical part. Maya needs to figure out what a positive rapid-screening test result actually means for patients in her county. This is the positive predictive value (PPV) — the probability that a patient who tests positive actually has the illness.

From Chapter 9, she knows:

$$\text{PPV} = P(\text{illness} \mid \text{positive test}) = \frac{P(\text{positive} \mid \text{illness}) \times P(\text{illness})}{P(\text{positive})}$$

Where:

  • $P(\text{positive} \mid \text{illness})$ = sensitivity = 0.92
  • $P(\text{illness})$ = prevalence = the proportion she just estimated
  • $P(\text{positive}) = P(\text{positive} \mid \text{illness}) \times P(\text{illness}) + P(\text{positive} \mid \text{no illness}) \times P(\text{no illness})$

PPV Depends on Prevalence

Here's the key insight: the PPV changes dramatically depending on the prevalence. Let's compute it for three scenarios using the endpoints and center of Maya's confidence interval.

Scenario A: Prevalence = 3% (national rate)

$$P(\text{positive}) = 0.92 \times 0.03 + 0.05 \times 0.97 = 0.0276 + 0.0485 = 0.0761$$

$$\text{PPV} = \frac{0.92 \times 0.03}{0.0761} = \frac{0.0276}{0.0761} = 0.363 = 36.3\%$$

Scenario B: Prevalence = 4.8% (Maya's point estimate)

$$P(\text{positive}) = 0.92 \times 0.048 + 0.05 \times 0.952 = 0.0442 + 0.0476 = 0.0918$$

$$\text{PPV} = \frac{0.92 \times 0.048}{0.0918} = \frac{0.0442}{0.0918} = 0.481 = 48.1\%$$

Scenario C: Prevalence = 7% (near the upper end of the CI)

$$P(\text{positive}) = 0.92 \times 0.07 + 0.05 \times 0.93 = 0.0644 + 0.0465 = 0.1109$$

$$\text{PPV} = \frac{0.92 \times 0.07}{0.1109} = \frac{0.0644}{0.1109} = 0.581 = 58.1\%$$

The Results Are Sobering

| Prevalence | PPV | What It Means |
|---|---|---|
| 3% (national) | 36.3% | About 2 in 3 positive results are false positives |
| 4.8% (Maya's estimate) | 48.1% | About half of positive results are false positives |
| 7% (upper CI bound) | 58.1% | Still more than 2 in 5 positives are false |
Even with a test that has 92% sensitivity and 95% specificity, the majority of positive results could be false positives — depending on the prevalence.

This is the false positive paradox from Chapter 9, now with a real-world twist: the proportion inference from Step 1 directly determines how useful the screening test will be.

Step 3: The Decision

Maya presents her findings to the county health board:

"Our county's illness prevalence appears to be approximately 4.8%, statistically significantly above the national rate of 3%. However, even at this elevated prevalence, the rapid screening test would produce false positives for about half of all patients who test positive.

If we screen the entire county population of 85,000 adults:

  • Expected true positives: $85{,}000 \times 0.048 \times 0.92 \approx 3{,}754$
  • Expected false positives: $85{,}000 \times 0.952 \times 0.05 \approx 4{,}046$

The total number of positive results would be approximately 7,800, but only about 3,750 of those — fewer than half — would represent real cases. The remaining roughly 4,050 people would face needless anxiety, follow-up testing, and potentially unnecessary treatment."
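
The arithmetic behind Maya's county-wide numbers, as a quick sketch:

```python
# Expected screening outcomes for the county's 85,000 adults
population = 85_000
prevalence = 0.048    # Maya's point estimate
sensitivity = 0.92
specificity = 0.95

true_pos = population * prevalence * sensitivity                # ill and correctly flagged
false_pos = population * (1 - prevalence) * (1 - specificity)   # healthy but flagged
total_pos = true_pos + false_pos

print(f"True positives:  {true_pos:,.0f}")    # 3,754
print(f"False positives: {false_pos:,.0f}")   # 4,046
print(f"PPV among positives: {true_pos / total_pos:.1%}")   # 48.1%
```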

The Board's Response

The board has to weigh several factors:

  1. Without screening: Many of the 4,080 estimated cases ($85{,}000 \times 0.048$) go undetected, leading to worse health outcomes and potential spread.

  2. With screening: Nearly all cases are detected (3,754 out of 4,080), but at the cost of 4,046 false alarms.

  3. The compromise: Screen only high-risk populations (e.g., adults over 65, people with certain conditions) where the prevalence is likely higher — say 10-15%. At 10% prevalence, the PPV jumps to 67%. At 15%, it reaches 76%.

Connection to Chapter 9 — Bayesian Updating:

This analysis is Bayes' theorem in action. The "prior" is the population prevalence (which we estimated using proportion inference). The "evidence" is the test result. The "posterior" is the PPV — the updated probability after seeing the evidence.

The chain of reasoning: random sample → proportion inference → prevalence estimate → Bayes' theorem → PPV → screening policy.

Every link in this chain uses tools from this course. Remove any one, and the analysis breaks down.

The AI Connection

This same logic applies to any classification algorithm — not just medical tests.

Predictive Policing (Professor Washington)

When a predictive policing algorithm flags someone as "high risk":

  • Sensitivity = the algorithm's ability to flag people who will actually reoffend
  • Specificity = the algorithm's ability to correctly leave alone people who won't reoffend
  • Base rate = the overall recidivism rate in the population

If the base rate is low (say, 5% of all people in the database will actually commit a new offense), then even a "highly accurate" algorithm will produce many false positives — people flagged as "high risk" who never reoffend.
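
To see the effect numerically, here is a sketch with hypothetical performance figures — the 90% sensitivity and 90% specificity are assumptions chosen for illustration, not numbers from any real system:

```python
# Hypothetical classifier (numbers assumed for illustration, not from a real algorithm)
sensitivity = 0.90
specificity = 0.90
base_rate = 0.05    # 5% of people in the database will actually reoffend

# Bayes' theorem: P(reoffends | flagged) = sens * base / P(flagged)
p_flagged = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
ppv = sensitivity * base_rate / p_flagged

print(f"P(actually reoffends | flagged 'high risk') = {ppv:.0%}")
```

Even with these optimistic assumptions, the PPV is about 32% — roughly two out of every three people flagged "high risk" would never reoffend.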

And here's the key connection to proportion inference: that base rate is itself an estimate. If you use the wrong base rate (from a different population, or from biased historical data), your entire analysis is off.

Professor Washington's concern: the base rate might be different for different racial groups (because of differential policing, not differential behavior). If Black defendants are arrested at higher rates for the same behavior, the historical data shows a higher base rate, the algorithm assigns higher risk scores, and the cycle perpetuates itself.

Spam Filters and AI Classification

Alex Rivera's domain connects too. StreamVibe's recommendation algorithm classifies users into segments — "likely to upgrade," "likely to churn," etc. The accuracy of these classifications depends on:

  1. The algorithm's sensitivity and specificity (how well it classifies)
  2. The base rate of each behavior (what proportion of users actually churn?)

If only 5% of users churn, a classifier needs very high specificity to avoid flooding the "likely to churn" category with false positives. Proportion inference on historical churn data is the first step in evaluating any retention algorithm.

For Your Analysis

The Python Implementation

def compute_ppv(sensitivity, specificity, prevalence):
    """Compute the Positive Predictive Value given test characteristics
    and disease prevalence."""
    p_positive = (sensitivity * prevalence +
                  (1 - specificity) * (1 - prevalence))
    ppv = (sensitivity * prevalence) / p_positive
    return ppv

# Test characteristics
sensitivity = 0.92
specificity = 0.95

# Maya's proportion inference results
p_hat = 0.048
ci_lower = 0.029
ci_upper = 0.067

# Compute PPV across the CI
prevalences = [0.03, ci_lower, p_hat, ci_upper, 0.10, 0.15]
labels = ['National (3%)', 'CI lower (2.9%)', 'Point estimate (4.8%)',
          'CI upper (6.7%)', 'High-risk (10%)', 'Very high-risk (15%)']

print("=" * 60)
print("POSITIVE PREDICTIVE VALUE BY PREVALENCE")
print(f"Sensitivity: {sensitivity:.0%}, Specificity: {specificity:.0%}")
print("=" * 60)

for prev, label in zip(prevalences, labels):
    ppv = compute_ppv(sensitivity, specificity, prev)
    print(f"  {label:30s} → PPV = {ppv:.1%}")

# Output:
# ============================================================
# POSITIVE PREDICTIVE VALUE BY PREVALENCE
# Sensitivity: 92%, Specificity: 95%
# ============================================================
#   National (3%)                  → PPV = 36.3%
#   CI lower (2.9%)                → PPV = 35.5%
#   Point estimate (4.8%)          → PPV = 48.1%
#   CI upper (6.7%)                → PPV = 56.9%
#   High-risk (10%)                → PPV = 67.2%
#   Very high-risk (15%)           → PPV = 76.5%

Discussion Questions

  1. Maya's 95% CI for prevalence is (2.9%, 6.7%). The corresponding PPV range is 35.5% to 56.9%. How should Maya communicate this uncertainty to the health board? Should she present a single PPV number or a range?

  2. The test manufacturer advertises "92% sensitivity, 95% specificity." Is this misleading? Why might a patient interpret these numbers differently from how a statistician would?

  3. If Maya could increase her sample size from 500 to 2,000, her CI for prevalence would narrow. How would this affect the precision of her PPV estimate? Is a more precise prevalence estimate worth the cost of a larger study?

  4. Professor Washington's concern about differential base rates by race connects directly to this case study. If the "prevalence" (base rate) is different for different subgroups, the PPV will be different too. What are the ethical implications of using a single PPV to evaluate an algorithm applied to multiple groups?

  5. A hospital administrator says: "If the PPV is only 48%, this test is useless — it's basically a coin flip." Evaluate this statement. Consider what happens if the alternative is no screening at all (and the 4,080 cases go undetected).