Exercises: Power, Effect Sizes, and What "Significant" Really Means
These exercises progress from conceptual understanding through effect size calculation, power analysis, critical evaluation, and Python implementation. Estimated completion time: 4 hours.
Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)
Part A: Conceptual Understanding ⭐
A.1. Explain in your own words the difference between statistical significance and practical significance. Give an example of a result that is statistically significant but not practically significant, and an example that is practically significant but not statistically significant.
A.2. A news headline reads: "Study proves new teaching method significantly improves test scores." A student asks you what "significantly" means. Explain at least two different interpretations of "significantly" and why the distinction matters.
A.3. True or false (explain each):
(a) A p-value of 0.001 means the effect is very large.
(b) Failing to reject $H_0$ proves there is no effect.
(c) Cohen's d depends on sample size.
(d) A study with 80% power has a 20% chance of a Type II error.
(e) If two studies both find $p < 0.05$, they found effects of similar magnitude.
(f) An underpowered study that finds significance probably overestimates the true effect size.
A.4. Rank the following from largest to smallest effect size. No calculations needed — just use your intuition and the context.
(a) The difference in height between adult men and women (about 5 inches, with SD about 3 inches)
(b) The effect of aspirin on heart attack risk (reduces absolute risk by about 0.7 percentage points)
(c) The difference in SAT scores between students who take a prep course and those who don't (about 30 points, with SD about 200 points)
A.5. A pharmaceutical company runs a clinical trial with 100,000 participants and finds that their new drug reduces systolic blood pressure by 0.5 mmHg ($p < 0.001$, $d = 0.02$). A competitor runs a trial with 200 participants and finds their drug reduces blood pressure by 15 mmHg ($p = 0.06$, $d = 0.45$). Which drug would you recommend and why?
Part B: Effect Size Calculation ⭐⭐
B.1. Calculate Cohen's d for each scenario:
(a) Group 1: $\bar{x}_1 = 75$, $s_1 = 10$, $n_1 = 30$; Group 2: $\bar{x}_2 = 80$, $s_2 = 12$, $n_2 = 30$
(b) Group 1: $\bar{x}_1 = 100$, $s_1 = 15$, $n_1 = 500$; Group 2: $\bar{x}_2 = 101$, $s_2 = 15$, $n_2 = 500$
(c) Group 1: $\bar{x}_1 = 3.2$, $s_1 = 0.8$, $n_1 = 20$; Group 2: $\bar{x}_2 = 4.1$, $s_2 = 0.9$, $n_2 = 20$
For each, classify the effect as small, medium, or large using Cohen's benchmarks.
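If you want to check your hand calculations, the pooled-standard-deviation formula can be wrapped in a small helper. This is a minimal sketch; the function name `cohens_d` is ours, not from the chapter:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    # the sign follows the order of the groups; the magnitude is what
    # Cohen's benchmarks (0.2 / 0.5 / 0.8) refer to
    return (mean1 - mean2) / math.sqrt(pooled_var)
```

For example, `cohens_d(75, 10, 30, 80, 12, 30)` should match your answer to (a), up to sign.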
B.2. A two-sample t-test produces $t = 3.50$ with $df = 98$. Calculate $r^2$ and interpret it. What percentage of the variance in the outcome is explained by group membership?
B.3. Convert the following Cohen's d values to $r^2$: (a) $d = 0.2$, (b) $d = 0.5$, (c) $d = 0.8$, (d) $d = 1.2$. Explain why even a "large" effect explains a relatively small proportion of variance.
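The conversions in B.2 and B.3 use $r^2 = t^2/(t^2 + df)$ and, assuming equal group sizes, $r = d/\sqrt{d^2 + 4}$. A quick sketch for checking your work (helper names are ours):

```python
import math

def r_squared_from_t(t, df):
    """Proportion of variance explained, from a t statistic."""
    return t**2 / (t**2 + df)

def r_squared_from_d(d):
    """Convert Cohen's d to r^2 (assumes equal group sizes)."""
    r = d / math.sqrt(d**2 + 4)
    return r**2
```

Notice how slowly `r_squared_from_d` grows with `d`: that is the point of B.3's final question.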
B.4. A study compares the average GPAs of students who study with music ($n = 45$, $\bar{x} = 3.21$, $s = 0.48$) versus those who study in silence ($n = 42$, $\bar{x} = 3.35$, $s = 0.51$).
(a) Compute Cohen's d.
(b) Compute $r^2$.
(c) A news article about this study says "Students who listen to music while studying earn significantly lower grades." Critique this claim using your effect size calculations.
B.5. In Sam's analysis from Chapter 14, Daria's three-point shooting rate was 38.5% (25/65) compared to a baseline of 31%.
(a) Calculate Cohen's h for this comparison.
(b) What does this effect size tell you about the practical significance of Daria's improvement?
(c) Is the effect size the reason Sam's test was not significant, or is sample size the bigger issue? Explain.
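Cohen's h is the difference of arcsine-transformed proportions, $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$. A one-function sketch for checking B.5(a) (the helper name is ours):

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
```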
Part C: Power Concepts ⭐⭐
C.1. For each scenario below, determine whether power increases or decreases (explain your reasoning):
(a) You increase the sample size from 50 to 200.
(b) You change $\alpha$ from 0.05 to 0.01.
(c) The true effect size is larger than originally expected.
(d) The population standard deviation turns out to be larger than expected.
(e) You switch from a two-tailed test to a one-tailed test.
(f) You use a paired design instead of an independent-groups design.
C.2. A researcher conducts a study with $n = 30$ per group, $\alpha = 0.05$, and gets $p = 0.08$. She concludes: "There is no effect." Explain why this conclusion may be premature and what she should do next.
C.3. Why is an underpowered study that does find significance potentially more misleading than one that doesn't? (Hint: think about effect size inflation and the winner's curse.)
C.4. A grant reviewer says: "This proposal requests funding for 500 participants, but the pilot study with 30 participants already found a significant effect. Why do you need more data?" Write a brief response explaining why the larger sample is necessary.
Part D: Power Analysis Problems ⭐⭐
D.1. Use the following information to determine the required sample size per group:
(a) Expected Cohen's d = 0.5, $\alpha = 0.05$, power = 0.80, two-tailed test.
(b) Expected Cohen's d = 0.3, $\alpha = 0.05$, power = 0.80, two-tailed test.
(c) Expected Cohen's d = 0.3, $\alpha = 0.01$, power = 0.80, two-tailed test.
(d) Compare answers (b) and (c). How does changing $\alpha$ affect the required sample size?
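For D.1 and D.2 you can solve for sample size directly rather than reading a table. A minimal sketch using `statsmodels` (assumed installed), shown for scenario (a):

```python
from statsmodels.stats.power import TTestIndPower

# solve for n per group given effect size, alpha, and desired power
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05,
                                   power=0.80, alternative='two-sided')
# n_per_group is fractional; always round UP to guarantee the stated power
```

The same call with `power=None` and `nobs1` supplied instead solves for achieved power, which is what D.2(c) asks for.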
D.2. A psychologist wants to study the effect of mindfulness meditation on anxiety scores. Previous research suggests the effect is around $d = 0.4$.
(a) How many participants per group are needed for 80% power at $\alpha = 0.05$?
(b) How many for 90% power?
(c) The psychologist can only recruit 60 participants total (30 per group). What is the achieved power? Is this adequate?
(d) Given the constraint of 60 participants, what is the minimum detectable effect size at 80% power?
D.3. Maya is planning a study to compare hypertension rates between two communities. She expects rates of about 35% in the industrial community and 25% in the suburban community.
(a) Calculate Cohen's h for this expected difference.
(b) How many participants does Maya need per group for 80% power at $\alpha = 0.05$?
(c) If Maya can afford to survey 150 people per community, what is her power to detect this difference?
D.4. A tech company wants to detect a 2-percentage-point increase in click-through rate (from 10% to 12%) in an A/B test.
(a) Calculate the required sample size per group for 80% power at $\alpha = 0.05$.
(b) The company gets 50,000 visitors per day. How many days does the test need to run?
(c) Now suppose the company wants to detect a 0.5-percentage-point increase (from 10% to 10.5%). How does the required sample size change? What does this tell you about the relationship between effect size and sample size?
Part E: Interpretation and Critical Evaluation ⭐⭐⭐
E.1. A study reports: "Participants who received the intervention scored higher on the test ($M = 78.2$, $SD = 14.3$) than those who received the placebo ($M = 74.8$, $SD = 15.1$), $t(198) = 1.63$, $p = 0.105$."
(a) Calculate Cohen's d.
(b) Calculate $r^2$.
(c) Calculate the 95% confidence interval for the difference in means.
(d) Is this result practically significant? Could it be that the study was simply underpowered?
(e) How many participants per group would be needed to detect this effect with 80% power?
E.2. Consider two studies testing the same hypothesis:
- Study A: $n = 50$ per group, $d = 0.72$, $p = 0.003$
- Study B: $n = 5,000$ per group, $d = 0.08$, $p = 0.001$
(a) Which study found a more convincing effect?
(b) Which study has the more impressive p-value? Why is this misleading?
(c) If you had to make a real-world decision based on one of these studies, which would you rely on? Why?
E.3. A meta-analysis of 50 studies on a particular intervention finds an average effect size of $d = 0.35$ with a 95% CI of (0.22, 0.48). The p-value for the meta-analytic effect is $p < 0.001$.
(a) Is this effect statistically significant?
(b) Is this effect practically significant? (Discuss what additional information you'd need.)
(c) Should you be concerned about publication bias? Why or why not?
(d) If the 50 studies had an average power of 50% to detect $d = 0.35$, what does that suggest about the published effect sizes?
E.4. A researcher tests 20 different dietary supplements for their effect on concentration, using $\alpha = 0.05$ for each test. One supplement shows $p = 0.03$.
(a) Should the researcher conclude that this supplement improves concentration? Why or why not?
(b) What is the probability of finding at least one significant result among 20 tests when no supplement actually works?
(c) If the researcher had pre-registered only this one supplement, would your answer change?
(d) What correction could the researcher apply to account for multiple testing? (Look up "Bonferroni correction" if you haven't encountered it.)
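Once you have worked E.4(b) by hand, you can sanity-check it with a short simulation, using the fact that p-values are uniformly distributed on $[0, 1]$ under $H_0$:

```python
import numpy as np

# simulate 20 null tests per "study" and ask how often at least one
# p-value falls below alpha by chance alone
rng = np.random.default_rng(0)
n_sims, n_tests, alpha = 100_000, 20, 0.05
pvals = rng.uniform(size=(n_sims, n_tests))
frac_any_significant = np.mean(pvals.min(axis=1) < alpha)
```

The simulated fraction should land close to your analytic answer.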
Part F: Anchor Example Extensions ⭐⭐⭐
F.1. (Alex — StreamVibe) Return to Alex's A/B test results from Chapter 16.
(a) Alex computed $d = 0.23$ for watch time. The engineering team asks: "Should we deploy the new algorithm?" Write a paragraph advising them, incorporating the effect size, confidence interval, power, and business context (12 million users, 3 sessions/week).
(b) Alex's manager asks: "How many users would we need in the next A/B test to detect a smaller improvement of $d = 0.1$ with 90% power?" Calculate the answer and discuss whether this is feasible.
(c) The data science team suspects the effect might be larger for Premium users ($d \approx 0.4$) than for Free users ($d \approx 0.1$). If Alex runs separate tests for each group with $n = 250$ per group, what is the power for each subgroup analysis?
F.2. (Maya — Public Health) Maya found that respiratory illness rates were significantly different between the industrial and suburban communities ($z = 2.92$, $p = 0.004$ from Chapter 16).
(a) The industrial community had 68/200 (34%) respiratory illness and the suburban community had 42/200 (21%). Calculate Cohen's h.
(b) Maya wants to design a follow-up study with 90% power. How many participants per community does she need?
(c) A policy maker says: "The study proved that the factory causes respiratory illness." Using concepts from both this chapter and Chapter 4, explain what the study actually showed and what it didn't show.
F.3. (James — Criminal Justice) Return to James's bail algorithm study.
(a) The overall recidivism comparison had $h \approx 0.14$ (small) and $p = 0.049$ (barely significant). The racial FP rate disparity had $h \approx 0.43$ (medium) and $p < 0.001$. Compare these two findings in terms of both statistical and practical significance.
(b) James had about 48% power for the overall comparison. What does this mean for how we should interpret the barely-significant result? Would you trust this result if it had gone the other way ($p = 0.051$)?
(c) A county official argues: "Since the overall effect is small, the algorithm is basically equivalent to human judges." Respond using both the effect size and the disaggregated racial analysis.
Part G: Ethical Reasoning and Debate ⭐⭐⭐
G.1. A researcher finds $p = 0.12$ for their primary hypothesis. They then analyze 8 different subgroups and find $p = 0.03$ for women aged 25-34. They write up the paper with the subgroup analysis as the primary finding.
(a) Is this p-hacking? Why or why not?
(b) How should the researcher have handled this analysis?
(c) If the subgroup effect is real, how could it be confirmed in a future study?
G.2. A pharmaceutical company suppresses 4 out of 5 clinical trials that showed no significant effect for their drug. The one published trial showed $p = 0.02$.
(a) Is the published p-value still valid? Why or why not?
(b) What is the actual false positive rate across the company's research program?
(c) What reforms (pre-registration, registered reports, trial registries) could prevent this?
G.3. Prepare arguments for both sides of the following debate:
"Universities should require that all thesis research include a pre-registered power analysis and report effect sizes. Studies that fail to meet 80% power should not count toward the thesis requirement."
Consider: (1) the benefits of requiring rigor, (2) the constraints faced by student researchers (limited funding, time, access to participants), (3) whether the 80% threshold is appropriate for all types of research, and (4) whether this policy could have unintended consequences.
Part H: Synthesis and Application ⭐⭐⭐
H.1. You read a newspaper article with the headline: "Study of 2 million people finds that drinking one glass of wine per day reduces risk of heart disease ($p < 0.0001$)."
(a) Why should the large sample size make you cautious rather than more convinced?
(b) What additional information would you need to evaluate this claim?
(c) Write a more complete and accurate headline that includes effect size information.
H.2. A colleague says: "Power analysis is just a bureaucratic requirement for grant applications. In the real world, you take the data you can get." Respond to this claim with specific examples of how neglecting power analysis can lead to:
(a) Wasted resources
(b) Overestimated effects
(c) Missed important findings
(d) Ethical harm
H.3. Create a "Statistical Significance Reality Check" — a one-page summary that a non-statistician could use to evaluate claims from news articles and press releases. Include at least 5 questions to ask, with brief explanations of why each matters.
Part I: Python Implementation ⭐⭐⭐
I.1. Write Python code to compute Cohen's d, $r^2$, a 95% CI for the difference, and the achieved power for the following data:
- Control group: $n = 45$, $\bar{x} = 72.3$, $s = 11.8$
- Treatment group: $n = 48$, $\bar{x} = 78.1$, $s = 13.2$
Present your results in a formatted report similar to the template in Section 17.12.
I.2. Create a power analysis function that takes an expected effect size (Cohen's d), desired power, and significance level, and returns:
(a) The required sample size per group
(b) A power curve plot
(c) A table showing required sample sizes for $d = \{0.2, 0.3, 0.4, 0.5, 0.6, 0.8\}$
I.3. Replicate the p-hacking simulation from Section 17.9 with the following modification: instead of running completely independent tests, have each "study" test the same two groups but measure 5 different outcome variables. Generate all 5 outcomes as independent draws from the same null distribution. Record:
(a) The false positive rate when reporting the minimum p-value across 5 outcomes
(b) The false positive rate when applying the Bonferroni correction ($\alpha/5 = 0.01$ per test)
(c) A histogram of the minimum p-values across 10,000 simulated studies (compare to the uniform distribution that p-values follow under $H_0$)
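A possible skeleton for parts (a) and (b), assuming normally distributed null outcomes; the histogram in (c) and any comparison plot are left to you:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies, n_outcomes, n = 10_000, 5, 30

# same two groups per study, with 5 outcome variables drawn
# independently from the same null distribution
a = rng.normal(size=(n_studies, n_outcomes, n))
b = rng.normal(size=(n_studies, n_outcomes, n))
p = stats.ttest_ind(a, b, axis=2).pvalue   # shape (n_studies, n_outcomes)
min_p = p.min(axis=1)

rate_min = np.mean(min_p < 0.05)                       # (a) report-the-minimum
rate_bonferroni = np.mean(min_p < 0.05 / n_outcomes)   # (b) corrected
```

Before running it, predict both rates from the formula you used in E.4.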
I.4. Write a Python function that generates a complete "effect size report" for a two-sample comparison, including:
- Cohen's d with interpretation
- $r^2$ with interpretation
- 95% CI for the difference
- A visualization showing the two group distributions with the effect size annotated
- The achieved power
- A sentence summarizing whether the result is statistically significant, practically significant, both, or neither
Test your function on at least two datasets: one where the result is significant but trivially small, and one where the result is non-significant but potentially meaningful.
Part J: Integration with Prior Chapters ⭐⭐⭐⭐
J.1. Return to the replication crisis case study from Chapter 1 and the quantitative model from Chapter 13 (Case Study 1). That model assumed:
- 10% of tested hypotheses are true
- Power = 80% for each test
- $\alpha = 0.05$
(a) Recompute the false discovery rate if power is only 50% (more realistic for many fields). How does this change the proportion of significant findings that are false?
(b) Now add publication bias: assume only 10% of null results are published. What is the apparent false discovery rate in the published literature?
(c) How do these calculations connect to the replication crisis finding that only 36% of psychology studies replicated?
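The Chapter 13 model can be written as a short function whose defaults restate the assumptions listed above, so that J.1(a) becomes a one-argument change (the function name is ours):

```python
def false_discovery_rate(prior_true=0.10, power=0.80, alpha=0.05):
    """Of all significant results, what fraction are false positives?"""
    true_positives = prior_true * power          # true hypotheses detected
    false_positives = (1 - prior_true) * alpha   # nulls wrongly rejected
    return false_positives / (false_positives + true_positives)

baseline_fdr = false_discovery_rate()  # Chapter 13 assumptions
```

Compare `false_discovery_rate(power=0.50)` to the baseline for part (a); part (b) requires extending the function to down-weight published null results.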
J.2. The following table shows real data from the Open Science Collaboration's replication project:
| Measure | Original Studies | Replications |
|---|---|---|
| Mean effect size ($r$) | 0.403 | 0.197 |
| Significant results | 97% | 36% |
| Mean sample size | 79 | 218 |
(a) The original studies had larger effect sizes but smaller samples. Explain why underpowered studies combined with publication bias lead to inflated effect sizes.
(b) The replications had nearly 3x the sample size. If the original effect sizes were accurate, what power should the replications have had?
(c) The fact that only 36% of replications were significant suggests the original effect sizes were inflated. Is this consistent with the winner's curse phenomenon? Explain.
J.3. (Capstone Integration) Choose one of the four anchor examples (Alex, Maya, James, or Sam) and write a complete statistical report that includes:
- The research question and study design (connecting to Ch.4)
- Descriptive statistics and visualizations (connecting to Ch.5-6)
- The hypothesis test with full five-step procedure (connecting to Ch.13-16)
- Effect size calculation and interpretation (this chapter)
- Power analysis and sample size evaluation (this chapter)
- A confidence interval interpretation (connecting to Ch.12)
- A discussion of limitations (connecting to Ch.4 and this chapter)
- Recommendations for future research (connecting to this chapter)
This report should be 2-3 pages and suitable for a non-technical audience.
Part K: Progressive Project ⭐⭐⭐
K.1. Return to your Data Detective Portfolio and the hypothesis test you conducted in the Chapter 16 component.
(a) Calculate the effect size (Cohen's d or Cohen's h, depending on your test) for your two-group comparison.
(b) Classify the effect as small, medium, or large using Cohen's benchmarks. Then discuss whether this classification makes sense in the context of your specific dataset and research question.
(c) Calculate $r^2$ — what proportion of variance does the group variable explain?
(d) Conduct a power analysis:
- What was the achieved power of your test?
- If the power was below 80%, how many observations would you need for adequate power?
- Create a power curve for your specific effect size.
(e) Write a paragraph interpreting your results that includes: the point estimate, the 95% CI, the effect size, the p-value, and the achieved power. Model it on the reporting template from Section 17.11.
(f) Reflect: Is your result statistically significant? Is it practically significant? Are these answers the same or different?