Case Study 1: Medical Screening — When a Positive Test Doesn't Mean What You Think

The Scenario

It's 8:47 AM on a Tuesday, and Dr. Maya Chen is staring at a spreadsheet that could reshape health policy for her entire county.

Last year, the county health board — under pressure from a concerned parents' coalition — implemented universal screening for a rare metabolic disorder in all newborns. The disorder, if caught early, is treatable. Left untreated, it causes developmental delays. The screening seemed like a no-brainer.

The test's specifications looked impressive on paper:

  • Sensitivity: 98% (the test catches 98 out of 100 babies who actually have the disorder)
  • Specificity: 95% (the test correctly clears 95 out of 100 babies who are healthy)
  • Prevalence: The disorder affects approximately 1 in 5,000 newborns

The county screened 42,000 newborns last year. Maya has the results — and they've caused a crisis.

The Numbers

Maya lays out the natural frequency breakdown:

Step 1: How many newborns have the disorder?

$$42{,}000 \times \frac{1}{5{,}000} = 8.4 \approx 8 \text{ babies}$$

About 8 out of 42,000 newborns actually have the metabolic disorder.

Step 2: How did the test perform on those 8 babies?

$$8 \times 0.98 = 7.84 \approx 8 \text{ true positives}$$

The test caught all (or nearly all) of the affected babies. This is the sensitivity doing its job.

Step 3: How did the test perform on the 41,992 healthy babies?

$$41{,}992 \times 0.05 = 2{,}099.6 \approx 2{,}100 \text{ false positives}$$

This is where the problem hits. A 5% false positive rate sounds small — but applied to nearly 42,000 healthy babies, it produces over 2,100 false alarms.

Step 4: The positive predictive value (PPV)

$$PPV = \frac{8}{8 + 2{,}100} = \frac{8}{2{,}108} \approx 0.0038$$

A positive screening result means there's a 0.38% chance the baby actually has the disorder. More than 99.6% of positive results are false alarms.
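These four steps collapse into a few lines of Python. A quick sanity check, keeping the fractional counts instead of rounding (the inputs are exactly the numbers above):

population = 42_000
prevalence = 1 / 5_000
sensitivity = 0.98
specificity = 0.95

diseased = population * prevalence                       # Step 1: ~8.4 babies
true_pos = diseased * sensitivity                        # Step 2: ~8.2 caught
false_pos = (population - diseased) * (1 - specificity)  # Step 3: ~2,099.6 false alarms
ppv = true_pos / (true_pos + false_pos)                  # Step 4

print(f"PPV: {ppv:.4f}")   # 0.0039 (the rounded counts above give 0.0038)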

The Human Cost

Maya's spreadsheet tells a mathematical story, but behind each row is a family. Here's what happened to those 2,100+ families who received a false positive:

Immediate impact:

  • Parents received a phone call saying their newborn may have a serious metabolic disorder.
  • They were told to bring the baby in for confirmatory testing (blood draws, specialist appointments).
  • The waiting period for confirmatory results was 5-10 days.

What research tells us:

  • A 2012 study in Pediatrics found that parents who received false-positive newborn screening results experienced significantly elevated anxiety and depression levels even after the results were cleared.
  • A follow-up study found that some parents continued to perceive their children as "vulnerable" years later, leading to more doctor visits, more missed school days, and higher rates of protective parenting, despite the children being completely healthy.
  • The phenomenon is so well-documented it has a name: vulnerable child syndrome triggered by false-positive screening.

Financial cost:

  • Each confirmatory workup costs approximately $800-$1,200 in specialist visits and lab tests.
  • Total cost of false-positive follow-ups: approximately 2,100 × $1,000 ≈ $2.1 million.
  • Cost per true case detected: about $265,000 (total program cost divided by the 8 cases found).

Maya presents these numbers to the county health board, and the room goes quiet.

Bayes' Theorem Explains It All

Let's verify Maya's findings using Bayes' theorem formally:

$$P(\text{disorder} \mid \text{positive}) = \frac{P(\text{positive} \mid \text{disorder}) \times P(\text{disorder})}{P(\text{positive})}$$

$$= \frac{0.98 \times 0.0002}{0.98 \times 0.0002 + 0.05 \times 0.9998}$$

$$= \frac{0.000196}{0.000196 + 0.04999}$$

$$= \frac{0.000196}{0.050186}$$

$$= 0.0039$$

The Bayes' theorem calculation confirms it: about 0.39%, matching our natural frequency approach (the small difference from 0.38% comes from rounding the counts to whole babies).
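In code, the whole formula is one line. A minimal sketch, wrapped in a function so the same update can be reused when we compare screening options below:

def posterior(prior, sensitivity, false_pos_rate):
    """P(disorder | positive) via Bayes' theorem."""
    return (sensitivity * prior) / (sensitivity * prior + false_pos_rate * (1 - prior))

print(f"{posterior(1/5000, 0.98, 0.05):.4f}")   # 0.0039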

But Wait — The Test Catches Sick Babies

Before you conclude that screening is useless, consider what would happen without the program.

Without screening, those 8 babies with the disorder would go undetected until symptoms appeared — potentially months or years later, when treatment is far less effective. The disorder causes irreversible developmental damage. Early detection saves lives and prevents disability.

So the question isn't "should we screen?" — it's "how should we screen smarter?"

Maya's Recommendations

Based on her Bayesian analysis, Maya proposes a revised screening protocol:

Option 1: Two-Stage Screening

Instead of one test, use two:

  1. First screen (high sensitivity, moderate specificity): catches nearly all affected babies but produces many false positives.
  2. Confirmatory test (high specificity): applied only to babies who tested positive on the first screen.

If the confirmatory test has 99.5% specificity (and, we'll assume, the same 98% sensitivity as the first screen, since only its specificity is specified):

  • Start with 2,108 positive first-screen results.
  • Of the 8 true positives: $8 \times 0.98 = 7.84 \approx 8$ still test positive.
  • Of the 2,100 false positives: $2{,}100 \times 0.005 = 10.5 \approx 11$ still test positive.
  • New PPV: $8 / (8 + 11) = 8/19 \approx 42\%$.

By adding a second test, the PPV jumps from 0.4% to 42%. Still not perfect, but a massive improvement: the number of families left with a false alarm after the full protocol drops from about 2,100 to roughly 11 (of the 19 final positives, 8 are true cases).
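The two-stage arithmetic in code (a sketch; as noted above, the 98% confirmatory sensitivity is an assumption, since the scenario specifies only the confirmatory test's specificity):

# Stage 1 results, from the natural frequency breakdown above
true_pos_1, false_pos_1 = 8, 2_100

# Stage 2: confirmatory test applied only to stage-1 positives
conf_sensitivity = 0.98     # assumed; not specified in the scenario
conf_specificity = 0.995

true_pos_2 = true_pos_1 * conf_sensitivity            # 7.84, about 8
false_pos_2 = false_pos_1 * (1 - conf_specificity)    # 10.5, about 11

ppv = true_pos_2 / (true_pos_2 + false_pos_2)
print(f"Two-stage PPV: {ppv:.2f}")   # 0.43 (about 42% with rounded counts)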

Option 2: Targeted Screening

Instead of screening all 42,000 newborns, screen only those with known risk factors (family history, ethnic background associated with higher prevalence). If the at-risk population has a prevalence of 1 in 500 instead of 1 in 5,000, the PPV improves dramatically:

$$PPV = \frac{0.98 \times 0.002}{0.98 \times 0.002 + 0.05 \times 0.998} = \frac{0.00196}{0.00196 + 0.0499} = \frac{0.00196}{0.05186} \approx 0.038$$

Still only 3.8%, but roughly 10 times better than universal screening. Combined with the two-stage approach, targeted screening becomes much more efficient.
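The same check in code, with the posterior formula inlined so the snippet stands alone:

prior = 1 / 500           # prevalence in the at-risk group
sens, fpr = 0.98, 0.05    # same test as before
ppv = sens * prior / (sens * prior + fpr * (1 - prior))
print(f"Targeted screening PPV: {ppv:.3f}")   # 0.038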

Option 3: Improve the Test

If a new test could achieve 99.5% specificity (instead of 95%):

$$PPV = \frac{0.98 \times 0.0002}{0.98 \times 0.0002 + 0.005 \times 0.9998} = \frac{0.000196}{0.000196 + 0.004999} = \frac{0.000196}{0.005195} \approx 0.038$$

Cutting the false positive rate tenfold (from 5% to 0.5%) produces roughly a tenfold improvement in PPV. This is why test developers focus so heavily on specificity when designing screens for rare conditions.
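Putting the baseline and all three options side by side makes the trade-offs easy to compare. A sketch: the two-stage row treats the stage-one PPV as the new prior for the confirmatory test (and again assumes 98% confirmatory sensitivity):

def ppv(prior, sens, fpr):
    return sens * prior / (sens * prior + fpr * (1 - prior))

baseline = ppv(1/5000, 0.98, 0.05)        # universal, single test
two_stage = ppv(baseline, 0.98, 0.005)    # confirmatory test on positives
targeted = ppv(1/500, 0.98, 0.05)         # higher-prevalence population
improved = ppv(1/5000, 0.98, 0.005)       # 99.5%-specificity test

for name, value in [("Baseline (universal)", baseline),
                    ("Option 1: two-stage", two_stage),
                    ("Option 2: targeted", targeted),
                    ("Option 3: improved test", improved)]:
    print(f"{name:<24} PPV = {value:.3f}")

The unrounded two-stage figure here, about 0.43, sits slightly above the 42% computed earlier because the earlier counts were rounded to whole babies.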

The Broader Lesson

Maya's case illustrates a principle that extends far beyond medicine:

When the condition you're looking for is rare, even accurate tests produce mostly false alarms.

This principle applies to:

  • Airport security: When terrorists are extremely rare among travelers, even a good behavioral screening system will flag mostly innocent people.
  • Drug testing in the workplace: When drug use is uncommon in a workforce, most positive drug tests are false positives.
  • Fraud detection in banking: When fraudulent transactions are rare, most flagged transactions are legitimate.
  • AI content moderation: When policy violations are rare among posts, most AI-flagged content is not actually violating policies.
  • Predictive policing: When future offenders are rare in a population, most "high risk" labels are applied to people who will not re-offend.

In every case, the base rate is the invisible force that determines whether a positive result is informative or misleading. Bayes' theorem makes that force visible.

Questions for Discussion

  1. The ethical trade-off. Maya's county must decide between universal screening (catches nearly all cases but produces 2,100 false alarms) and targeted screening (fewer false alarms but might miss affected babies outside the at-risk group). What values are in tension here? How would you decide?

  2. The communication challenge. How should a doctor explain a positive screening result to anxious parents? Draft a one-paragraph explanation that uses natural frequencies (not technical terms like "specificity" or "PPV").

  3. The AI connection. A hospital is considering using an AI system to screen chest X-rays for lung nodules. The AI has 97% sensitivity and 93% specificity. Lung nodules appear in about 5% of screening X-rays. Calculate the PPV. Would you recommend implementing this system? Under what conditions?

  4. Base rate matters. Consider COVID-19 testing in early 2020 (when prevalence in the general population was very low) versus late 2020 (when prevalence was much higher). How would the PPV of the same test change between these two scenarios? What does this imply about testing strategy?

  5. Two tests vs. one. Why does a two-stage testing protocol dramatically improve PPV? Work through the math: if the first test has 98% sensitivity and 95% specificity, and the confirmatory test has 95% sensitivity and 99.5% specificity, what is the overall PPV for the two-stage process when prevalence is 1 in 5,000?

Python Extension

import matplotlib.pyplot as plt

def screening_analysis(prevalence, sensitivity, specificity,
                       population=100000):
    """
    Analyze screening test performance using natural frequencies.

    Returns a dictionary with counts and probabilities.
    """
    # round() rather than int(): truncation would drop a baby from the
    # true-positive count (int(7.84) == 7) and understate the PPV
    diseased = round(population * prevalence)
    healthy = population - diseased

    true_pos = round(diseased * sensitivity)
    false_neg = diseased - true_pos
    false_pos = round(healthy * (1 - specificity))
    true_neg = healthy - false_pos

    total_pos = true_pos + false_pos
    total_neg = true_neg + false_neg
    ppv = true_pos / total_pos if total_pos > 0 else 0
    npv = true_neg / total_neg if total_neg > 0 else 0

    return {
        'population': population,
        'prevalence': prevalence,
        'sensitivity': sensitivity,
        'specificity': specificity,
        'true_pos': true_pos,
        'false_neg': false_neg,
        'false_pos': false_pos,
        'true_neg': true_neg,
        'total_pos': total_pos,
        'ppv': ppv,
        'npv': npv,
        'false_alarm_ratio': false_pos / true_pos if true_pos > 0 else float('inf')
    }

# Maya's scenario
result = screening_analysis(
    prevalence=1/5000,
    sensitivity=0.98,
    specificity=0.95,
    population=42000
)

print("=== Maya's Newborn Screening Analysis ===")
print(f"Population screened: {result['population']:,}")
print(f"Prevalence: 1 in {int(1/result['prevalence']):,}")
print(f"\nResults:")
print(f"  True positives:  {result['true_pos']:,}")
print(f"  False positives: {result['false_pos']:,}")
print(f"  False negatives: {result['false_neg']:,}")
print(f"  True negatives:  {result['true_neg']:,}")
print(f"\n  PPV (P(disease|positive)): {result['ppv']:.4f} ({result['ppv']*100:.2f}%)")
print(f"  NPV (P(healthy|negative)): {result['npv']:.6f} ({result['npv']*100:.4f}%)")
print(f"  False alarms per true case: {result['false_alarm_ratio']:.0f}:1")

# How PPV changes with prevalence
prevalences = [1/100000, 1/50000, 1/10000, 1/5000, 1/1000,
               1/500, 1/100, 1/50, 1/10]
ppvs = []

for prev in prevalences:
    r = screening_analysis(prev, 0.98, 0.95)
    ppvs.append(r['ppv'])
    label = f"1 in {int(1/prev):,}"
    print(f"  Prevalence {label:>12}: PPV = {r['ppv']:.4f}")

plt.figure(figsize=(10, 5))
plt.semilogx([p*100 for p in prevalences], [p*100 for p in ppvs],
             'o-', color='steelblue', linewidth=2, markersize=8)
plt.xlabel('Disease Prevalence (%)', fontsize=12)
plt.ylabel('Positive Predictive Value (%)', fontsize=12)
plt.title('How Disease Prevalence Affects Test Reliability\n'
          '(Sensitivity = 98%, Specificity = 95%)', fontsize=13)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Key Takeaway

A test's accuracy is only half the story. The other half — and often the more important half — is the base rate of whatever you're testing for. Bayes' theorem connects the two, and ignoring it means treating a 0.4% probability as though it were 98%. That misunderstanding costs money, causes anxiety, and undermines trust in screening programs. Understanding Bayes' theorem doesn't just make you better at math — it makes you a more informed patient, a better-equipped healthcare professional, and a more thoughtful citizen.