Case Study 1: The Replication Crisis Revisited — Now You Understand Why

The Setup

In Chapter 1, you read about the replication crisis — the discovery that a disturbing percentage of published scientific findings couldn't be reproduced. At the time, you learned the vocabulary: p-hacking, publication bias, HARKing, small sample sizes. But you didn't yet have the tools to understand why these practices are so damaging.

Now you do. You understand p-values, significance levels, Type I errors, and the logic of hypothesis testing. Let's revisit the crisis with your new toolkit.

Daryl Bem's Precognition Study: A Statistical Autopsy

Recall that in 2011, psychologist Daryl Bem published a study claiming that humans could perceive the future. The paper appeared in the Journal of Personality and Social Psychology, one of the most respected journals in psychology.

Here's what the study actually involved: Bem ran nine different experiments with a total of about 1,000 participants. Eight of the nine produced results with $p < 0.05$ in the predicted direction.

At first glance, this seems compelling. Eight out of nine significant results? That's a lot of evidence.

But let's look more carefully, using what you've learned in this chapter.

Problem 1: Multiple Testing

Bem didn't run one test. He ran nine separate experiments, each with multiple outcome measures. Within each experiment, he analyzed several subgroups and conditions. By some estimates, the total number of statistical tests was in the dozens.

You know from Section 13.12 that testing multiple hypotheses inflates the false positive rate. If Bem ran 40 statistical tests (a conservative estimate), the probability of finding at least one "significant" result by chance alone is:

$$P(\text{at least one false positive in 40 tests}) = 1 - (1 - 0.05)^{40} = 1 - 0.95^{40} \approx 0.871$$

An 87% chance of finding something "significant" even when nothing real is happening. The question isn't whether he'd find something — it's whether he'd find enough to fill a paper.
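The arithmetic above is easy to check in code. Here is a minimal sketch in Python, assuming the 40 tests are independent (the function name is ours, chosen for illustration):

```python
# Probability of at least one false positive among k independent tests,
# each run at significance level alpha (the family-wise error rate).
def familywise_error_rate(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

print(f"{familywise_error_rate(40):.3f}")   # matches the ~0.871 above
print(f"{familywise_error_rate(9):.3f}")    # even 9 tests give a ~37% chance
```

Note how quickly the error rate climbs: even Bem's nine headline experiments, with no within-study multiplicity at all, would carry more than a one-in-three chance of a spurious "hit."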

Problem 2: Researcher Degrees of Freedom

Statisticians Eric-Jan Wagenmakers and colleagues (2011) re-analyzed Bem's data and found that the results were sensitive to seemingly minor analytical choices:

  • Which participants to include or exclude
  • Which outcome measures to emphasize
  • How to handle reaction time data (means vs. medians, trimming outliers)
  • Whether to analyze men and women together or separately

Each choice is a "fork in the garden of forking paths" (Andrew Gelman's phrase). Bem may not have been deliberately cherry-picking — but the sheer number of analytical choices means the reported p-values don't have their face-value interpretation.

Problem 3: Small Effect Sizes

Even taking Bem's p-values at face value, the effects were tiny. The average effect size across his nine studies (measured as Cohen's $d$) was approximately 0.22 — a "small" effect by conventional standards. To put this in perspective: an effect this small would mean that a person with "precognitive ability" would guess correctly about 53% of the time instead of 50%.

You learned in this chapter that a small p-value does NOT mean a large effect. Bem's studies demonstrate this perfectly. Even if the p-values were valid (and they're not, due to the issues above), they'd be telling us about a negligibly small effect.

Problem 4: Failure to Replicate

Multiple research teams attempted to replicate Bem's findings. The results were uniformly negative:

| Replication Study | Sample Size | Result |
|---|---|---|
| Ritchie, Wiseman, & French (2012) | 150 participants | $p > 0.50$ (no effect) |
| Galak et al. (2012) | 3,289 participants | $p > 0.80$ (no effect) |
| Wagenmakers et al. (2012) | multiple studies | Bayesian analysis strongly favoring $H_0$ |

The replications used much larger samples and pre-registered their analysis plans. They found nothing.

The Anatomy of a False Positive

Let's build a quantitative model of how the replication crisis happened, using the tools from this chapter and Chapter 9.

Setting the Stage

Imagine a field where:

  • Researchers test 1,000 hypotheses per year
  • Only 10% of hypotheses are actually true (most ideas don't pan out)
  • Each study uses $\alpha = 0.05$
  • Studies have 60% power (probability of detecting a real effect — a typical value for psychology studies)

The Math

| | $H_0$ is TRUE (900 hypotheses) | $H_0$ is FALSE (100 hypotheses) | Total |
|---|---|---|---|
| Significant ($p < 0.05$) | $900 \times 0.05 = 45$ (false positives) | $100 \times 0.60 = 60$ (true positives) | 105 |
| Not significant ($p \geq 0.05$) | $900 \times 0.95 = 855$ (true negatives) | $100 \times 0.40 = 40$ (missed effects) | 895 |

Of the 105 "significant" findings:

  • 60 are real (true positives)
  • 45 are false (Type I errors)

False discovery rate: $45 / 105 = 42.9\%$

That's stunning: nearly 43% of significant findings are false. And this is before accounting for p-hacking, publication bias, or any other questionable practice. The math alone — the interplay between base rates, power, and $\alpha$ — produces a high false discovery rate.
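The table reduces to a short computation. Here is a sketch in Python (the function name `false_discovery_rate` is our own, not a standard library call):

```python
# False discovery rate among "significant" findings, given the number of
# hypotheses tested, the fraction that are actually true, alpha, and power.
def false_discovery_rate(n_hypotheses, base_rate, alpha, power):
    n_true = n_hypotheses * base_rate          # hypotheses with a real effect
    n_null = n_hypotheses * (1 - base_rate)    # hypotheses where H0 holds
    true_positives = n_true * power            # real effects detected
    false_positives = n_null * alpha           # Type I errors
    return false_positives / (true_positives + false_positives)

fdr = false_discovery_rate(1000, 0.10, 0.05, 0.60)
print(f"{fdr:.1%}")   # 42.9%
```

Making the inputs explicit like this also makes the levers obvious: raising the base rate or the power lowers the false discovery rate, while raising $\alpha$ inflates it.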

Adding Publication Bias

Now add publication bias. Suppose only significant results get published:

  • Published: 105 studies (60 true positives + 45 false positives)
  • Unpublished: 895 studies (855 true negatives + 40 missed effects)

The published literature looks like "science works" — 105 significant findings. But 43% of them are false. And the reader has no way to know which 43%.

Adding P-Hacking

Now add p-hacking. Suppose 20% of researchers (intentionally or not) test multiple hypotheses or tweak analyses until they find significance. If this raises the effective false positive rate in those studies from 5% to roughly 30%, the 180 p-hacked null-hypothesis studies (20% of 900) contribute about $180 \times 0.25 = 45$ extra false positives, doubling the total:

  • Additional false positives from p-hacking: ~45 more
  • Total "significant" findings: ~150
  • False positives: ~90
  • False discovery rate: ~60%

Now the majority of published findings are false. This is roughly what the Open Science Collaboration found in 2015 — 64% of psychology studies failed to replicate.
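The adjusted numbers follow directly from the text's rough assumption that p-hacking doubles the false positives. A quick Python check:

```python
# Recompute the false discovery rate under the text's p-hacking assumption:
# the original 45 Type I errors plus ~45 more from p-hacked analyses.
true_positives = 60
false_positives = 45 + 45                       # original + p-hacked extras
total_significant = true_positives + false_positives
fdr_phacked = false_positives / total_significant
print(f"{total_significant} significant findings, FDR = {fdr_phacked:.0%}")
```

With 90 of 150 "significant" findings false, the model lands close to the 64% non-replication rate the Open Science Collaboration observed.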

The Bayes Connection

Notice how this analysis mirrors the medical testing scenario from Chapter 9. The false discovery rate depends on:

  1. The base rate (how many hypotheses are actually true) — analogous to disease prevalence
  2. The false positive rate ($\alpha$) — analogous to the test's false positive rate
  3. Power ($1 - \beta$) — analogous to sensitivity

Just as a screening test in a low-prevalence population produces mostly false positives, hypothesis testing in a field where most hypotheses are false produces mostly false discoveries.

This insight is captured by the positive predictive value of a study:

$$\text{PPV} = \frac{(1-\beta) \times P(\text{true effect})}{(1-\beta) \times P(\text{true effect}) + \alpha \times P(\text{no effect})}$$

For our example:

$$\text{PPV} = \frac{0.60 \times 0.10}{0.60 \times 0.10 + 0.05 \times 0.90} = \frac{0.060}{0.105} \approx 0.571$$

Only 57% of significant findings reflect real effects. The rest are noise.
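The PPV formula is simple enough to explore directly. A sketch in Python (the function name `ppv` is ours); varying the base rate shows why studying plausible hypotheses matters:

```python
# Positive predictive value of a significant result, per the formula above:
# power = 1 - beta, prior = P(true effect), alpha = significance level.
def ppv(power, prior, alpha):
    return (power * prior) / (power * prior + alpha * (1 - prior))

print(f"{ppv(0.60, 0.10, 0.05):.3f}")   # base rate 10%: the 0.571 above
print(f"{ppv(0.60, 0.50, 0.05):.3f}")   # base rate 50%: far better odds
```

This is the same machinery as Bayes' theorem for a diagnostic test, just with "hypothesis is true" in place of "patient has the disease."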

What Changed After the Crisis

The replication crisis sparked a revolution in how science handles statistical evidence. Key reforms include:

1. Pre-registration. Researchers publicly commit to their hypotheses, sample size, and analysis plan before collecting data. This eliminates HARKing and constrains the garden of forking paths. Major pre-registration platforms: OSF (Open Science Framework), AsPredicted, ClinicalTrials.gov.

2. Registered reports. Some journals now review studies before data collection. If the design is sound, the paper is accepted regardless of whether the results are significant. This eliminates publication bias entirely.

3. Effect size reporting. Many journals now require effect sizes (e.g., Cohen's $d$, correlation coefficients) alongside p-values. This helps readers distinguish between statistically significant and practically meaningful results.

4. Larger samples and multi-site studies. Many fields now require much larger sample sizes than before, and replications are conducted across multiple laboratories simultaneously.

5. The ASA statement. The American Statistical Association's 2016 statement on p-values was a direct response to the crisis. By clarifying what p-values can and cannot do, the ASA hoped to curb misuse.

6. Bayesian alternatives. Some researchers advocate using Bayesian methods (previewed in Chapter 9), which allow quantifying evidence for and against a hypothesis without the binary significant/not-significant framework.

Discussion Questions

  1. Using the false discovery rate model above, calculate the PPV for a field where 50% of hypotheses are true (instead of 10%). How does this compare? What does it tell you about the importance of studying plausible hypotheses?

  2. The Bonferroni correction adjusts $\alpha$ for multiple tests by using $\alpha' = \alpha / k$ (where $k$ is the number of tests). If Bem ran 40 tests, his adjusted threshold would be $\alpha' = 0.05/40 = 0.00125$. How many of his results would survive this correction?

  3. A colleague argues: "The replication crisis shows that statistics is broken. We should stop using p-values entirely." Evaluate this argument. Is the problem with the tool or with how the tool is used?

  4. Pre-registration has been called "the single most important reform" in response to the replication crisis. But critics argue it stifles exploratory research. How would you balance the need for confirmatory rigor with the value of unexpected discoveries?

  5. Return to Bem's precognition study. Wagenmakers et al. conducted a Bayesian analysis and found a Bayes factor of approximately 1 — meaning the data provided roughly equal support for the null and alternative hypotheses. How does this contrast with Bem's reported p-values? What does the discrepancy tell you about the limitations of p-values as measures of evidence?

Connection to Your Learning

This case study connects several threads from across the course:

| Concept | Where You Learned It | How It Appears Here |
|---|---|---|
| P-value | Ch.13 §13.5 | Bem's p-values lose their meaning due to multiple testing |
| Type I error | Ch.13 §13.9 | False positives accumulate across many studies |
| Conditional probability | Ch.9 §9.2 | $P(\text{data} \mid H_0)$ vs. $P(H_0 \mid \text{data})$ |
| Bayes' theorem | Ch.9 §9.6 | PPV of a study depends on base rate of true effects |
| Base rate fallacy | Ch.9 §9.14 | Low base rate of true effects → high false discovery rate |
| Sampling variability | Ch.11 §11.2 | Small samples → high variability → inflated effects |
| Study design | Ch.4 §4.7 | Pre-registration as a design principle |

Sources:

  • Bem, D. J. (2011). "Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect." Journal of Personality and Social Psychology, 100(3), 407–425.
  • Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). "Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi." Journal of Personality and Social Psychology, 100(3), 426–432.
  • Open Science Collaboration (2015). "Estimating the Reproducibility of Psychological Science." Science, 349(6251).
  • Ioannidis, J. P. A. (2005). "Why Most Published Research Findings Are False." PLOS Medicine, 2(8), e124.
  • Gelman, A., & Loken, E. (2014). "The Statistical Crisis in Science." American Scientist, 102(6), 460.