Case Study: The Replication Crisis — When Statistics Goes Wrong

The Setup

In 2011, the Journal of Personality and Social Psychology — one of the most prestigious journals in the field — published a study with a finding that seemed impossible: people could predict the future. The study, by psychologist Daryl Bem of Cornell University, reported that participants performed better on memory tests when they were going to study the material later, as if they could sense future events.

The paper passed peer review. The statistics were technically correct. The methodology followed accepted practices. And yet, the conclusion — that humans have precognitive abilities — struck most scientists as absurd.

This single paper helped ignite what is now known as the replication crisis: the discovery that a disturbing percentage of published scientific findings cannot be reproduced when other researchers try to repeat the experiments.

What Happened

Bem's study was a wake-up call, not because anyone believed in precognition, but because it exposed something uncomfortable: if standard statistical practices could produce such an obviously wrong conclusion, what else had they gotten wrong?

The answer turned out to be: a lot.

The Open Science Collaboration

In 2015, a massive international effort known as the Open Science Collaboration attempted to replicate 100 psychology studies that had been published in top journals. The results, reported in Science, were sobering:

  • Of the original 100 studies, 97% had reported statistically significant results
  • When replicated, only 36% produced statistically significant results
  • The average effect size in the replications was roughly half that of the original studies

In other words, most of the original findings were either wrong or dramatically overstated.

How Did This Happen?

The replication crisis wasn't caused by fraud (though that exists too). It was caused by a collection of statistical practices that, individually, seem harmless but collectively produce unreliable results:

1. P-hacking: Researchers would analyze their data in many different ways — different subgroups, different outcome measures, different statistical tests — until they found a result with a p-value below 0.05 (the conventional threshold for "statistical significance"). We'll learn exactly what p-values mean in Chapter 13, but for now, the key insight is: if you test enough things, you'll find something that looks significant by chance alone. The first simulation sketch after this list makes this concrete.

2. Publication bias: Journals overwhelmingly published studies with positive (statistically significant) results and rejected studies that found nothing. This created a published literature full of "hits" and an invisible graveyard of unpublished "misses." The second sketch after this list shows how this filter, combined with the small samples described next, inflates the effects that make it into print.

3. Small sample sizes: Many studies used too few participants, making their results highly sensitive to random fluctuation. A study with 30 participants might find a large effect that disappears with 300 participants.

4. HARKing (Hypothesizing After Results are Known): Researchers would analyze their data, see what patterns emerged, and then write their paper as if they had predicted those patterns from the start. This makes exploratory findings look like confirmed hypotheses.
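The "test enough things" problem from item 1 is easy to see in a simulation. The sketch below is illustrative only: the group sizes, the number of outcome measures, and the 0.05 threshold are assumed values, not figures from any real study. It generates pure noise (no true effect anywhere) and counts how often a researcher who tries 20 different outcomes finds at least one "significant" result.

```python
# Minimal sketch: how trying many analyses on pure noise produces "significant" findings.
# All numbers here are illustrative assumptions, not data from the studies discussed above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_studies = 2000      # simulated studies, each with NO true effect
n_per_group = 30      # participants per group
n_outcomes = 20       # different outcome measures "tried" per study
alpha = 0.05

studies_with_a_hit = 0
for _ in range(n_studies):
    for _ in range(n_outcomes):
        control = rng.normal(0, 1, n_per_group)
        treatment = rng.normal(0, 1, n_per_group)   # same distribution: no real effect
        _, p = stats.ttest_ind(treatment, control)
        if p < alpha:
            studies_with_a_hit += 1
            break                                   # a p-hacker stops once something "works"

print(f"Share of null studies reporting a 'significant' result: "
      f"{studies_with_a_hit / n_studies:.0%}")
```

With 20 independent looks at noise, the chance of at least one false positive is about 1 - 0.95^20, roughly 64%, which is what the simulation returns: most of these studies "find" something even though nothing is there.

Items 2 and 3 interact in a similar way: when samples are small and only significant results get published, the published effects are systematically inflated, which is one reason replications tend to find effects much smaller than the originals. The following sketch uses assumed values (a true standardized effect of 0.2, 30 participants per group in the original studies, 300 per group in a replication); it is an illustration of the mechanism, not a model of any particular literature.

```python
# Rough sketch: publication bias plus small samples inflates the published effect sizes.
# The true effect, sample sizes, and threshold are made-up illustrative values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_d = 0.2          # small but real standardized effect (population SD = 1)
n_small = 30          # per group in the original, underpowered studies
n_large = 300         # per group in the replication
n_studies = 5000

published_effects = []
for _ in range(n_studies):
    control = rng.normal(0, 1, n_small)
    treatment = rng.normal(true_d, 1, n_small)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:                                    # journals only see the "hits"
        published_effects.append(treatment.mean() - control.mean())

# One large replication, reported regardless of outcome
control = rng.normal(0, 1, n_large)
treatment = rng.normal(true_d, 1, n_large)
replication_d = treatment.mean() - control.mean()

print(f"True effect:                {true_d:.2f}")
print(f"Mean published effect:      {np.mean(published_effects):.2f}")
print(f"Share of studies published: {len(published_effects) / n_studies:.0%}")
print(f"Replication estimate:       {replication_d:.2f}")
```

Because a 30-per-group study must observe an effect of roughly 0.5 or larger to clear p < 0.05, the "hits" that reach print exaggerate the true effect of 0.2, while the large replication lands near the truth. Published effects that shrink on replication are exactly what the Open Science Collaboration observed.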

Why This Matters for You

You might be thinking: "I'm not a psychology researcher. Why should I care about the replication crisis?"

Because the same statistical pitfalls that produced the replication crisis affect every field that uses data — medicine, business, criminal justice, education, public health, and technology.

  • Drug trials that can't be replicated waste billions of dollars and delay effective treatments
  • Business decisions based on unreliable A/B tests waste resources
  • Criminal justice policies based on flawed statistical analyses affect real people's freedom
  • AI systems trained on unreliable findings perpetuate those errors at scale

Understanding why statistics can go wrong is just as important as learning how to do it right. This course will teach you both.

Connection to This Chapter

The replication crisis perfectly illustrates the four pillars of statistical investigation — and what happens when any of them is compromised:

  • Ask a good question: compromised by HARKing, with questions formulated after seeing the data
  • Collect good data: compromised by small samples and convenience sampling
  • Analyze carefully: compromised by p-hacking, running analyses until something "works"
  • Interpret honestly: compromised by publication bias, with only positive results shared

Discussion Questions

  1. If a study with 50 participants finds a large effect, but a replication with 500 participants finds no effect, which should you trust more? Why?

  2. How does publication bias create a distorted picture of reality? Can you think of an analogy from everyday life where only showing "wins" gives a misleading impression?

  3. The textbook argues that "uncertainty is not failure." How does the replication crisis support this argument?

  4. Should Bem's precognition paper have been published? Make arguments for both sides.

Mini-Project

Find a news article that reports on a scientific study. Using only the information in the article, answer:

  • How many participants were in the study?
  • Was the study replicated, or is this the first time it has been run?
  • Does the article mention any limitations?
  • On a scale of 1-5, how confident are you in the study's conclusion? Write a paragraph justifying your rating.


Sources:

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407-425.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251).