Case Study 2: Clinical Trials and FDA Approval — When Type I and Type II Errors Are Life-or-Death
The Scenario
In March 2020, as COVID-19 swept the globe, the FDA faced an agonizing decision. A pharmaceutical company submitted early data from a clinical trial of a new antiviral drug. The trial had enrolled 1,062 hospitalized patients: 538 received the drug, and 524 received a placebo.
The results:
| Outcome | Drug Group ($n = 538$) | Placebo Group ($n = 524$) |
|---|---|---|
| Median recovery time | 10 days | 15 days |
| Mortality rate | 6.7% | 11.9% |
The drug appeared to reduce recovery time by 5 days and cut the mortality rate nearly in half. But the question the FDA had to answer wasn't "does the drug look promising?" It was: "Is the evidence strong enough to approve this drug for millions of people?"
This is hypothesis testing with the highest possible stakes.
The Hypotheses
The FDA evaluates drugs using a rigorous hypothesis testing framework. For the mortality outcome:
$H_0:$ The drug has no effect on mortality. Any observed difference is due to chance.
$H_a:$ The drug reduces mortality.
The FDA typically uses a significance level of $\alpha = 0.05$, though for some applications it requires $\alpha = 0.025$ (one-sided) or two independent trials that each reach significance.
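These two standards are closer than they look: for a normal-approximation test, a one-sided test at $\alpha = 0.025$ and a two-sided test at $\alpha = 0.05$ demand exactly the same critical value. A quick check (a sketch using Python's scipy, not part of the trial analysis):

```python
from scipy.stats import norm

# One-sided test at alpha = 0.025: all the rejection probability in one tail.
z_one_sided = norm.isf(0.025)
# Two-sided test at alpha = 0.05: alpha/2 = 0.025 in each tail.
z_two_sided = norm.isf(0.05 / 2)

print(z_one_sided, z_two_sided)  # both 1.959..., the same evidential bar
```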
The Statistical Evidence
Recovery Time
The statistical test for recovery time (a log-rank test, comparing the distribution of recovery times between groups) produced:
$$p = 0.001$$
At $\alpha = 0.05$, this is clearly statistically significant. The data are highly incompatible with the null hypothesis of no effect.
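For readers who want to see the mechanics of a log-rank test, here is a small simulation sketch. It assumes the `lifelines` package and exponential recovery times with medians of 10 and 15 days; these are illustrative assumptions, not the trial's data, so the simulated p-value will come out far smaller than the reported $p = 0.001$ (real trials also involve censoring, which this sketch omits).

```python
import numpy as np
from lifelines.statistics import logrank_test  # assumes lifelines is installed

rng = np.random.default_rng(42)

# For an exponential distribution, median = scale * ln(2), so scale = median / ln(2).
drug_times = rng.exponential(scale=10 / np.log(2), size=538)
placebo_times = rng.exponential(scale=15 / np.log(2), size=524)

# No censoring in this toy example: every recovery time is fully observed.
result = logrank_test(
    drug_times, placebo_times,
    event_observed_A=np.ones(538), event_observed_B=np.ones(524),
)
print(f"log-rank statistic = {result.test_statistic:.1f}, p = {result.p_value:.2g}")
```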
Mortality
The test for mortality was less clear-cut:
$$p = 0.059$$
At $\alpha = 0.05$, this just barely fails to reach statistical significance. The drug appears to reduce mortality, but the evidence doesn't quite cross the conventional threshold.
The Dilemma: Two Types of Errors, Two Types of Harm
This is where hypothesis testing becomes deeply human.
If the FDA Commits a Type I Error (Approves an Ineffective Drug)
- Patients take a drug that doesn't actually work
- Patients may experience side effects without any benefit
- Resources are diverted from searching for truly effective treatments
- Public trust in the drug approval process is eroded
- Billions of dollars in healthcare spending are wasted
If the FDA Commits a Type II Error (Rejects an Effective Drug)
- An effective treatment is withheld from patients
- During a pandemic, thousands of additional deaths may occur while waiting for more data
- Patients and families suffer preventable harm
- The delay in treatment availability has cascading effects on healthcare systems, economies, and society
The Asymmetry During a Pandemic
Under normal circumstances, the FDA is deliberately conservative: getting the decision right matters more than getting it fast. The burden of proof is high because Type I errors (approving ineffective or harmful drugs) erode trust in the entire drug approval system.
But COVID-19 changed the calculus. With thousands dying daily, a Type II error — withholding an effective treatment — had unprecedented costs. The FDA had to weigh the standard scientific caution against the urgent reality of a global pandemic.
The Decision Framework
Let's examine how different $\alpha$ levels would affect the decision for the mortality outcome ($p = 0.059$):
| $\alpha$ Level | Decision on Mortality Effect | Reasoning |
|---|---|---|
| 0.10 | Reject $H_0$ (significant) | Evidence is sufficient at a more lenient threshold |
| 0.05 | Fail to reject $H_0$ (not significant) | Just barely misses the conventional threshold |
| 0.01 | Fail to reject $H_0$ (not significant) | Far from significance at this stricter level |
At $\alpha = 0.05$, the mortality benefit is "not significant." But notice how arbitrary this feels. The p-value is 0.059 — just barely above 0.05. Does the 0.009 difference between $p = 0.050$ and $p = 0.059$ really warrant different conclusions?
This is one of the strongest arguments against treating $\alpha = 0.05$ as a rigid cutoff. The evidence doesn't suddenly vanish when the p-value crosses 0.05. It changes continuously.
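The decision rule in the table is purely mechanical, which is exactly the problem. A minimal sketch (the thresholds are the table's, the p-value is the section's):

```python
p_mortality = 0.059

for alpha in (0.10, 0.05, 0.01):
    decision = "reject H0" if p_mortality < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {decision}")
```

The evidence never changes between the three lines; only the threshold does.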
What the FDA Actually Did
The FDA granted the drug an Emergency Use Authorization (EUA) — a special pathway that allows use of unapproved treatments during a public health emergency. The EUA has a lower evidence threshold than full approval.
The reasoning was essentially:
- The recovery time benefit was statistically significant ($p = 0.001$) and clinically meaningful (5 fewer days in the hospital)
- The mortality benefit was suggestive ($p = 0.059$) — not conclusive, but consistent with a real effect
- The pandemic context elevated the cost of Type II errors
- Safety data showed the drug was reasonably safe
- The alternative (no treatment) was clearly worse
The FDA's decision illustrates a key principle: statistical evidence is one input into a decision, not the only input. Context, consequences, and the relative costs of errors all matter.
The Follow-Up: More Data, More Clarity
After the EUA, larger studies continued. A subsequent analysis with more patients produced:
- Mortality difference: $p = 0.019$ (now statistically significant at $\alpha = 0.05$)
- Recovery time: $p < 0.001$ (confirmed)
The additional data resolved the ambiguity. But the critical question remains: was the FDA right to act on the earlier, less definitive data?
Lessons for Hypothesis Testing
Lesson 1: The P-Value Is Not a Binary Switch
A p-value of 0.049 and a p-value of 0.051 represent essentially the same amount of evidence. Treating one as "significant" and the other as "not significant" is a consequence of the decision framework, not a reflection of reality. In high-stakes decisions, the continuous p-value (and the confidence interval) is more informative than the binary significant/not-significant label.
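To make this concrete, back out the test statistics that a two-sided z-test would need to produce $p = 0.049$ versus $p = 0.051$. A brief sketch using scipy:

```python
from scipy.stats import norm

# The z-statistic corresponding to a two-sided p-value p is norm.isf(p / 2).
z_sig = norm.isf(0.049 / 2)   # ~1.97: labeled "significant"
z_not = norm.isf(0.051 / 2)   # ~1.95: labeled "not significant"
print(f"{z_sig:.3f} vs {z_not:.3f} (difference: {z_sig - z_not:.3f})")
```

Two opposite verdicts, resting on test statistics that differ by less than one percent.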
Lesson 2: Context Determines Which Error Matters More
The FDA's standard $\alpha = 0.05$ reflects a world where Type I errors (approving bad drugs) are very costly and there's time to gather more evidence. During a pandemic, the cost of Type II errors skyrockets, and the appropriate $\alpha$ may need to shift.
This doesn't mean abandoning scientific rigor. It means acknowledging that $\alpha$ is a decision about risk tolerance, not a law of physics.
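One way to make "a decision about risk tolerance" precise is a toy expected-loss model: pick a prior probability that the drug works, assign relative costs to the two errors, and see which $\alpha$ minimizes expected loss. Everything in the sketch below is an assumption for illustration; the effect-to-standard-error ratio of about 1.9 matches the mortality result, but the priors and costs are invented, not the FDA's.

```python
from scipy.stats import norm

def expected_loss(alpha, effect_over_se, p_h1, cost_type1, cost_type2):
    """Toy expected loss for a two-sided z-test run at level alpha."""
    # Power to detect the effect (upper-tail normal approximation).
    power = norm.sf(norm.isf(alpha / 2) - effect_over_se)
    false_alarm = (1 - p_h1) * alpha * cost_type1    # Type I contribution
    missed_effect = p_h1 * (1 - power) * cost_type2  # Type II contribution
    return false_alarm + missed_effect

scenarios = {
    "normal times (Type I costs 10x Type II)": dict(cost_type1=10.0, cost_type2=1.0),
    "pandemic (Type II costs 10x Type I)":     dict(cost_type1=1.0, cost_type2=10.0),
}
for label, costs in scenarios.items():
    losses = {a: round(expected_loss(a, effect_over_se=1.9, p_h1=0.5, **costs), 3)
              for a in (0.01, 0.05, 0.10)}
    best = min(losses, key=losses.get)
    print(f"{label}: {losses} -> prefer alpha = {best}")
```

With identical evidence, flipping which error is expensive flips the loss-minimizing $\alpha$ from 0.01 to 0.10. The specific numbers are toys; the structure of the tradeoff is the point.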
Lesson 3: Effect Sizes Matter as Much as P-Values
The recovery time difference was 5 days. The mortality reduction was from 11.9% to 6.7% — a 44% relative reduction. These are clinically meaningful effects. A "non-significant" p-value for a clinically meaningful effect usually means the study needs more participants, not that the effect doesn't exist.
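A back-of-the-envelope power calculation supports the "more participants" reading. The sketch below uses the standard normal-approximation sample-size formula for comparing two proportions, taking the observed rates as the assumed truth; the actual trial used time-to-event methods, so treat these numbers as indicative only.

```python
from scipy.stats import norm

p1, p2 = 0.067, 0.119   # assumed true mortality rates (drug, placebo)
alpha = 0.05            # two-sided significance level

# n per group = (z_{alpha/2} + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
for power in (0.80, 0.90):
    z_a = norm.isf(alpha / 2)
    z_b = norm.isf(1 - power)
    n = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
    print(f"power = {power:.0%}: about {n:.0f} patients per group")
```

The answer is roughly 490 patients per group for 80% power and 650 for 90%. With about 530 per group, the trial sat in between: even if the mortality effect is real, a study this size misses significance at $\alpha = 0.05$ a nontrivial fraction of the time.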
Lesson 4: Multiple Endpoints Require Careful Thinking
The clinical trial had two primary outcomes: recovery time and mortality. Testing both raises the multiple testing problem from Section 13.12. Some statisticians argue the significance level should be adjusted (e.g., Bonferroni correction: $\alpha' = 0.05/2 = 0.025$). Under this correction, the recovery time result ($p = 0.001$) remains significant, but the mortality result ($p = 0.059$) is even further from significance.
However, the two outcomes are not independent — a drug that speeds recovery might also reduce mortality. The optimal approach to multiple endpoints in clinical trials is an active area of statistical research.
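A minimal sketch of the Bonferroni adjustment applied to the two primary endpoints, using the section's simplified p-values:

```python
raw_p = {"recovery time": 0.001, "mortality": 0.059}
m = len(raw_p)  # number of endpoints tested

for endpoint, p in raw_p.items():
    p_adj = min(1.0, p * m)  # Bonferroni: scale each p-value by m, cap at 1
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{endpoint}: raw p = {p}, adjusted p = {p_adj:.3f} -> {verdict}")
```

Equivalently, compare each raw p-value to $\alpha' = 0.05/2 = 0.025$; the two forms of the correction always agree.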
Lesson 5: Confidence Intervals Tell a Richer Story
Instead of fixating on whether $p$ is above or below 0.05, consider the confidence interval for the mortality difference:
The 95% confidence interval for the difference in mortality rates (placebo minus drug) was approximately $(-0.1\%, 10.5\%)$.
This interval includes zero (consistent with the "not significant" finding), but it's heavily weighted toward a positive effect. The upper bound suggests the drug could reduce mortality by up to 10.5 percentage points. The clinical picture — a likely benefit with a very small chance of harm — supports the EUA decision even when the formal hypothesis test doesn't reach significance.
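The quoted interval can be roughly reconstructed from the reported summary numbers alone, under a normal approximation: back out the standard error implied by the 5.2-point difference and the two-sided $p = 0.059$, then form the 95% interval. A sketch (small mismatches with the quoted interval come from rounding):

```python
from scipy.stats import norm

diff = 0.119 - 0.067          # observed difference in mortality rates (placebo - drug)
z_obs = norm.isf(0.059 / 2)   # z-statistic implied by a two-sided p of 0.059
se = diff / z_obs             # standard error implied by that z

lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"95% CI: ({lo:+.1%}, {hi:+.1%})")  # roughly (-0.2%, +10.6%)
```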
Connection to Our Running Examples
This case study resonates with several of our running examples:
Dr. Maya Chen faces similar decisions in public health. When testing whether a community's disease prevalence exceeds a threshold, a Type II error means failing to allocate resources for a real problem. Maya might argue for a more lenient $\alpha$ in screening studies where the goal is detection, not definitive proof.
Alex Rivera faces the business version: if the A/B test for a new recommendation algorithm produces $p = 0.07$, should StreamVibe roll out the change? The cost of a Type I error (deploying an unhelpful algorithm) is measured in engineering resources and potential user frustration. The cost of a Type II error (keeping an inferior algorithm) is measured in lost watch time and revenue. Unlike the FDA's, Alex's decision is reversible: she can always roll back the change, which makes her more willing to tolerate Type I errors.
Professor Washington faces a scenario where Type I and Type II errors both have serious consequences. If he concludes the algorithm is biased when it isn't (Type I), the department might discard a useful tool. If he fails to detect real bias (Type II), innocent people continue to be affected. The stakes are high in both directions.
Discussion Questions
- Do you think the FDA made the right decision to issue an EUA based on a mortality p-value of 0.059? What factors beyond the p-value should inform such decisions?
- Should the significance level $\alpha$ be different for pandemic-era drug approvals vs. normal circumstances? If so, who should decide the appropriate $\alpha$?
- The drug reduced recovery time by 5 days ($p = 0.001$) but showed a "non-significant" mortality benefit ($p = 0.059$). If you were a patient, would you want access to this drug? If you were a policymaker, would you make it available? Are these the same question?
- One critic argued: "A p-value of 0.059 means there's roughly a 6% chance the mortality benefit is random noise. That's a one-in-17 chance. For a deadly pandemic, those are excellent odds." Identify the statistical error in this argument. What is the p-value actually measuring?
- Clinical trials sometimes use $\alpha = 0.025$ (one-sided) instead of $\alpha = 0.05$ (two-sided) for the primary endpoint. Explain why a one-sided test is appropriate for a drug trial where the question is "does the drug help?" rather than "does the drug do anything?"
- How does the multiple testing issue apply to clinical trials that measure several outcomes (recovery time, mortality, hospitalization duration, need for mechanical ventilation, etc.)? What are the tradeoffs of applying a Bonferroni correction in this context?
A Broader Reflection
The clinical trial story highlights something profound about hypothesis testing: it is ultimately a decision-making framework, not a truth-discovering one.
The p-value doesn't tell you whether the drug works. It tells you how surprising the data would be if the drug didn't work. The significance level doesn't tell you the "right" answer. It tells you how much risk of a false alarm you're willing to accept.
The actual decision — approve the drug, deploy the algorithm, allocate the resources, convict the defendant — requires weighing the statistical evidence against the consequences of being wrong in each direction. Two people looking at the same p-value can rationally reach different decisions if they have different assessments of the costs of Type I vs. Type II errors.
This is why statistics is as much about judgment as it is about calculation. The formulas give you the evidence. What you do with that evidence is a human decision.
Note: This case study is inspired by real events (particularly the development of remdesivir and the FDA's EUA process during COVID-19) but uses simplified numbers for pedagogical clarity. The actual clinical trial data, statistical methods, and FDA decision process were more complex than presented here.
Sources:
- Beigel, J. H., et al. (2020). "Remdesivir for the Treatment of Covid-19 — Final Report." New England Journal of Medicine, 383(19), 1813-1826.
- FDA (2020). Emergency Use Authorization documentation.
- Goodman, S. N. (1999). "Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy." Annals of Internal Medicine, 130(12), 995-1004.