Key Takeaways: Power, Effect Sizes, and What "Significant" Really Means
One-Sentence Summary
Statistical significance tells you whether a result is unlikely under the null hypothesis, but it cannot tell you whether the result is important — for that, you need effect sizes, confidence intervals, and power analysis, which together form the complete toolkit for evaluating evidence.
Core Concepts at a Glance
| Concept | Definition | Why It Matters |
|---|---|---|
| Effect size | A standardized measure of the magnitude of a phenomenon, independent of sample size | Tells you how big the effect is, not just whether it exists |
| Cohen's d | Difference between group means divided by the pooled SD; measures separation in standard deviation units | The go-to effect size for two-group comparisons; small $\approx$ 0.2, medium $\approx$ 0.5, large $\approx$ 0.8 |
| $r^2$ (proportion of variance) | Proportion of total variance explained by the group variable: $r^2 = t^2 / (t^2 + df)$ | Puts effect sizes in humbling perspective: even "large" effects explain only ~14% of variance |
| Statistical power | $P(\text{reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta$; probability of detecting a real effect | Studies with low power miss real effects and overestimate those they do find |
The Key Formulas
Cohen's d (Two Independent Groups)
$$\boxed{d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \quad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}}$$
$r^2$ from a t-test
$$\boxed{r^2 = \frac{t^2}{t^2 + df}}$$
Cohen's d to $r^2$
$$\boxed{r^2 = \frac{d^2}{d^2 + 4}}$$
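A quick numeric check (a sketch, not part of the chapter's own code) shows that this conversion maps the Cohen's d benchmarks almost exactly onto the $r^2$ benchmarks in the table below:

```python
def d_to_r_squared(d):
    """Convert Cohen's d to the proportion of variance explained."""
    return d**2 / (d**2 + 4)

# The d benchmarks land on (or very near) the r-squared benchmarks:
print(round(d_to_r_squared(0.2), 3))  # small  -> 0.010
print(round(d_to_r_squared(0.5), 3))  # medium -> 0.059
print(round(d_to_r_squared(0.8), 3))  # large  -> 0.138
```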
Cohen's h (Two Proportions)
$$\boxed{h = 2\arcsin(\sqrt{p_1}) - 2\arcsin(\sqrt{p_2})}$$
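As a worked example (the 50% vs. 40% figures are illustrative, not from the chapter), a ten-percentage-point gap between two proportions comes out as roughly a "small" effect on the h scale:

```python
import numpy as np

def cohens_h(p1, p2):
    """Effect size for the difference between two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# 50% vs. 40% is roughly a "small" effect (h near 0.2):
print(round(cohens_h(0.50, 0.40), 3))  # -> 0.201
```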
Statistical Power
$$\boxed{\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})}$$
Effect Size Benchmarks
| | Small | Medium | Large |
|---|---|---|---|
| Cohen's d | 0.2 | 0.5 | 0.8 |
| Cohen's h | 0.2 | 0.5 | 0.8 |
| $r^2$ | 0.01 (1%) | 0.06 (6%) | 0.14 (14%) |
Use with caution: These benchmarks are generic guidelines, not rigid rules. What counts as "small" or "large" depends on the field and the stakes. Always interpret in context.
Four Factors Affecting Power
| Factor | Increase Power By... | Tradeoff |
|---|---|---|
| Sample size ($n$) | Collecting more data | Costs time and money |
| Effect size | Studying larger effects | The true effect size is set by nature, not the researcher |
| Significance level ($\alpha$) | Using a more lenient threshold (e.g., 0.10) | Increases false positive rate |
| Variability ($\sigma$) | Reducing noise (better measurement, paired designs) | May limit generalizability |
The Significance Matrix
| | Practically Significant | Not Practically Significant |
|---|---|---|
| Statistically Significant | Best case: real and important effect | "Significant but trivial" — large $n$, tiny effect |
| Not Statistically Significant | "Important but missed" — small $n$, real effect | Consistent with no meaningful effect |
Key insight: You need both statistical significance and practical significance. A p-value alone cannot tell you which cell you're in.
The Statistical Reporting Checklist
Every statistical analysis should report:
| Component | What It Tells You |
|---|---|
| Point estimate | The observed effect size in original units |
| 95% Confidence interval | The plausible range of the true effect |
| Cohen's d (or h) | Standardized effect magnitude |
| $r^2$ | Proportion of variance explained |
| P-value | Compatibility with $H_0$ |
| Power | Probability the study could detect this effect |
| Sample size | How much data the conclusion rests on |
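Most of the checklist can be assembled in one helper (a sketch; the function name and output format are illustrative, and it assumes two independent groups with equal variances):

```python
import numpy as np
from scipy import stats

def checklist_report(group1, group2, alpha=0.05):
    """Compute the reporting-checklist quantities for a two-group comparison."""
    g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
    n1, n2 = len(g1), len(g2)
    diff = g1.mean() - g2.mean()                       # point estimate
    t_stat, p_val = stats.ttest_ind(g1, g2)            # pooled-variance t-test
    df = n1 + n2 - 2
    s_p = np.sqrt(((n1-1)*g1.var(ddof=1) + (n2-1)*g2.var(ddof=1)) / df)
    se = s_p * np.sqrt(1/n1 + 1/n2)
    t_crit = stats.t.ppf(1 - alpha/2, df)
    ci = (diff - t_crit*se, diff + t_crit*se)          # confidence interval
    return {"estimate": diff, "ci": ci,
            "d": diff / s_p,                           # Cohen's d
            "r2": t_stat**2 / (t_stat**2 + df),        # variance explained
            "p": p_val, "n": (n1, n2)}
```

The power entry is the exception: it should come from a prospective power analysis using the hypothesized effect size, not be recomputed from the observed one (see the misconceptions table below).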
Python Quick Reference
```python
from statsmodels.stats.power import TTestIndPower
from scipy import stats
import numpy as np

# --- Cohen's d ---
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    s1, s2 = np.std(group1, ddof=1), np.std(group2, ddof=1)
    s_p = np.sqrt(((n1-1)*s1**2 + (n2-1)*s2**2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / s_p

# --- r-squared from t-test ---
def r_squared(t_stat, df):
    return t_stat**2 / (t_stat**2 + df)

# --- Power analysis: find required n per group ---
power = TTestIndPower()
n_needed = power.solve_power(effect_size=0.5, alpha=0.05,
                             power=0.80, alternative='two-sided')

# --- Power analysis: find achieved power ---
achieved = power.solve_power(effect_size=0.23, nobs1=250,
                             alpha=0.05, alternative='two-sided')

# --- Cohen's h for proportions ---
def cohens_h(p1, p2):
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
```
Common Misconceptions
| Misconception | Reality |
|---|---|
| "Significant = important" | A result can be significant but trivially small (with large $n$) |
| "Not significant = no effect" | A result can be non-significant because the study was underpowered |
| "$p = 0.001$ means a huge effect" | P-values mix effect size and sample size; $p = 0.001$ can come from a tiny effect with a massive sample |
| "$d = 0.2$ is always small" | Effect size benchmarks depend on context; a $d$ of 0.2 on mortality could save thousands of lives |
| "Post-hoc power analysis is useful" | Computing power from the observed effect size is circular; use the hypothesized effect size or focus on the CI |
| "We need $n = 30$ for everything" | Required sample size depends on the effect size you want to detect; small effects require hundreds or thousands |
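The third misconception is easy to demonstrate numerically (illustrative numbers; with equal groups of size $n$, the t-statistic is approximately $d\sqrt{n/2}$):

```python
import numpy as np
from scipy import stats

# A trivially small effect (d = 0.05) with a massive sample still
# produces a "highly significant" p-value.
d, n = 0.05, 10_000                    # n per group
t_stat = d * np.sqrt(n / 2)
p = 2 * stats.t.sf(t_stat, df=2*n - 2)  # two-sided p-value
print(f"d = {d}, n = {n:,} per group, t = {t_stat:.2f}, p = {p:.4f}")
```

The p-value comes out well below 0.001 even though the effect explains a negligible share of the variance.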
How This Chapter Connects
| This Chapter | Builds On | Leads To |
|---|---|---|
| Effect sizes (Cohen's d, $r^2$) | Means and SDs (Ch.6), two-sample t-test (Ch.16) | Regression $R^2$ (Ch.22-23) |
| Statistical power | Type I/II errors (Ch.13), standard error (Ch.11) | ANOVA power (Ch.20), sample size planning (throughout) |
| Practical significance | P-value definition (Ch.13), CI interpretation (Ch.12) | Communicating results (Ch.25), ethical practice (Ch.27) |
| Publication bias, p-hacking | Replication crisis (Ch.1, Ch.13) | Bootstrap methods (Ch.18), ethics (Ch.27) |
| Power analysis in Python | scipy.stats (Ch.13-16) | statsmodels throughout (Ch.20-24) |
The Key Themes
Theme 4: Uncertainty is not failure. The confidence interval is more honest and more useful than a binary declaration of "significant" or "not significant." Reporting that the true difference is "somewhere between 1 and 8 minutes" communicates both what we know and what we don't. Treating uncertainty as information rather than failure is the hallmark of mature statistical thinking.
Theme 6: P-hacking and ethical data practice. Testing many hypotheses and reporting only the significant ones inflates the false positive rate from 5% to as high as 64%. Publication bias compounds the problem by making the published literature systematically overconfident. The ethical obligation: pre-register hypotheses, report all analyses, report effect sizes, and treat null results as valuable information.
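The 64% figure is the standard back-of-the-envelope calculation for 20 independent tests at $\alpha = 0.05$:

```python
# Probability of at least one false positive across m independent
# tests of true null hypotheses, each at level alpha:
alpha, m = 0.05, 20
family_wise_error = 1 - (1 - alpha) ** m
print(f"P(at least one false positive in {m} tests) = {family_wise_error:.2f}")
# -> 0.64
```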
The One Thing to Remember
If you forget everything else from this chapter, remember this:
"Statistically significant" does not mean "important." A p-value tells you whether a result is surprising under $H_0$ — not whether the effect is large, meaningful, or worth acting on. For that, you need the effect size. Cohen's d expresses the difference in standard deviation units (small $\approx$ 0.2, medium $\approx$ 0.5, large $\approx$ 0.8). $r^2$ tells you the proportion of variance explained. Statistical power (minimum 80%) tells you whether you had enough data to find the effect. And the confidence interval — which simultaneously conveys the direction, magnitude, and precision of the effect — is the single most informative summary of any statistical analysis. Always report all of these. Never report just a p-value.
Key Terms
| Term | Definition |
|---|---|
| Statistical power | The probability of correctly rejecting $H_0$ when it is false: Power $= 1 - \beta$; depends on $\alpha$, $n$, effect size, and variability |
| Effect size | A quantitative, sample-size-independent measure of the magnitude of a phenomenon; answers "how big?" rather than "is there an effect?" |
| Cohen's d | Effect size for comparing two means: $d = (\bar{x}_1 - \bar{x}_2) / s_p$; expresses the group difference in standard deviation units |
| Practical significance | Whether an effect is large enough to matter in the real world, as opposed to merely being statistically detectable |
| Power analysis | The process of determining the sample size needed to detect a given effect size with a specified power and significance level |
| Sample size planning | Using power analysis prospectively to determine how many observations are needed before collecting data |
| Underpowered study | A study with insufficient sample size to reliably detect the effect of interest; typically power $< 80\%$ |
| P-hacking | Manipulating data analysis — testing multiple hypotheses, variables, or subgroups — until a statistically significant result is found; inflates the false positive rate |
| Publication bias | The tendency for journals to publish significant results and reject null results, creating a distorted literature that overestimates effect sizes |
| Replication crisis | The discovery that many published scientific findings cannot be reproduced; driven by underpowered studies, p-hacking, and publication bias |