Key Takeaways: Correlation and Simple Linear Regression
One-Sentence Summary
The Pearson correlation coefficient measures the strength and direction of a linear relationship between two numerical variables, while simple linear regression fits a least squares line that predicts one variable from another — but correlation does not imply causation, $R^2$ measures the proportion of variability explained, residual plots reveal whether a linear model is appropriate, and regression to the mean explains why extreme observations tend to be followed by less extreme ones.
Core Concepts at a Glance
| Concept | Definition | Why It Matters |
|---|---|---|
| Pearson's $r$ | A number from $-1$ to $+1$ measuring the strength and direction of a linear association | Quantifies what a scatterplot shows; unitless and symmetric |
| Least squares regression | The line $\hat{y} = b_0 + b_1 x$ that minimizes the sum of squared residuals | Provides predictions and a precise description of the average relationship |
| $R^2$ (coefficient of determination) | Proportion of variability in $y$ explained by the linear relationship with $x$ | Measures how useful the model is; connects to $\eta^2$ from ANOVA |
| Correlation $\neq$ causation | An observed correlation can arise from direct causation, reverse causation, confounding, or coincidence | Prevents drawing false causal conclusions from observational data |
| Regression to the mean | Extreme observations tend to be followed by less extreme ones (threshold concept) | Prevents attributing natural regression to interventions, treatments, or jinxes |
The Regression Procedure
Step by Step
- Always plot first: Create a scatterplot. Look for direction, form, strength, and outliers.
- Calculate the correlation:
  $$r = \frac{1}{n-1}\sum\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
- Fit the regression line:
  - Slope: $b_1 = r \cdot \frac{s_y}{s_x}$
  - Intercept: $b_0 = \bar{y} - b_1\bar{x}$
  - Equation: $\hat{y} = b_0 + b_1 x$
- Interpret slope and intercept in context.
- Assess model fit:
  - $R^2 = r^2$ — proportion of variability explained
  - Residual plot — check for patterns
- Check LINE conditions (for inference):
  - Linearity
  - Independence
  - Normality of residuals
  - Equal variance of residuals
- Discuss causation: Is this an experiment (causal) or observational study (association only)? Identify lurking variables.
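The calculation steps above can be sketched from scratch with NumPy, mirroring the formulas exactly (the data values here are made up for illustration):

```python
import numpy as np

# Hypothetical data: hours studied (x) and exam score (y)
x = np.array([2.0, 3.5, 5.0, 6.5, 8.0, 9.5])
y = np.array([55.0, 62.0, 70.0, 74.0, 83.0, 88.0])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

# Pearson r: sum of products of z-scores, divided by n - 1
r = np.sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)) / (n - 1)

# Least squares slope and intercept
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

# Predictions and residuals
y_hat = b0 + b1 * x
residuals = y - y_hat

print(f"r = {r:.3f}, slope = {b1:.3f}, intercept = {b0:.3f}")
```

A useful sanity check: least squares residuals always sum to zero (up to rounding), and the hand-computed `r` should match `np.corrcoef(x, y)[0, 1]`.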
Key Python Code
```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

# Example data (replace with your own arrays)
x = np.array([2.0, 3.5, 5.0, 6.5, 8.0, 9.5])
y = np.array([55.0, 62.0, 70.0, 74.0, 83.0, 88.0])

# Correlation
r, p = stats.pearsonr(x, y)
# Also: np.corrcoef(x, y)[0, 1]

# Simple regression
result = stats.linregress(x, y)
slope = result.slope
intercept = result.intercept
r_value = result.rvalue
r_squared = result.rvalue ** 2

# Full output with statsmodels
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print(model.summary())

# Visualization with seaborn
sns.regplot(x=x, y=y, ci=95)

# Residual plot: a patternless band around 0 supports a linear model
predicted = intercept + slope * x
residuals = y - predicted
plt.figure()
plt.scatter(predicted, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.show()
```
Excel Functions
| Task | Function |
|---|---|
| Correlation | =CORREL(x_range, y_range) |
| Slope | =SLOPE(y_range, x_range) |
| Intercept | =INTERCEPT(y_range, x_range) |
| $R^2$ | =RSQ(y_range, x_range) |
| Prediction | =FORECAST.LINEAR(new_x, y_range, x_range) |
| Trendline | Right-click chart data → Add Trendline → Linear |
The Threshold Concept: Regression to the Mean
Extreme observations tend to be followed by less extreme observations — not because of any causal force, but because the correlation between measurements is less than perfect.
| Example | What People Think | What's Really Happening |
|---|---|---|
| Sports Illustrated Jinx | Appearing on the cover causes bad performance | The athlete made the cover because of an extreme (lucky) performance; the next performance naturally regresses toward their true average |
| Sophomore slump | Success in year 1 creates pressure that hurts year 2 | An outstanding rookie year included some luck; year 2 reflects the true skill level more accurately |
| Medical treatment "works" | Patients improve after treatment, so treatment caused improvement | Patients were selected because they had extreme symptoms; symptoms naturally regress toward average with or without treatment |
Mathematical explanation: In z-score terms, predicted $z_y = r \times z_x$. Since $|r| < 1$, the prediction is always less extreme than the observation it's based on.
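The "Sports Illustrated Jinx" row can be reproduced with a small simulation. The setup is hypothetical: each season's score is true skill plus independent luck, so the between-season correlation is well below 1, and the top performers of one season regress toward their true average the next.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical model: score = skill + luck, luck independent each season.
# corr(year1, year2) = 0.5 here, so extremes regress halfway to the mean.
n = 100_000
skill = rng.normal(0, 1, n)
year1 = skill + rng.normal(0, 1, n)
year2 = skill + rng.normal(0, 1, n)

# "Cover athletes": the top 1% of year-1 performances
top = year1 >= np.quantile(year1, 0.99)

print(f"mean year-1 score of top group: {year1[top].mean():.2f}")
print(f"mean year-2 score of same group: {year2[top].mean():.2f}")
```

No jinx is programmed in, yet the top group's year-2 mean is roughly half its year-1 mean (still above average, just less extreme), exactly as $z_y = r \times z_x$ predicts.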
Correlation vs. Causation
Four Explanations for a Correlation
| Explanation | Meaning | Example |
|---|---|---|
| Direct causation | $x$ causes $y$ | Smoking → lung cancer |
| Reverse causation | $y$ causes $x$ | Police and crime correlate because high crime drives police hiring, not the other way around |
| Common cause | $z$ causes both $x$ and $y$ | Temperature → (ice cream, drowning) |
| Coincidence | No meaningful connection | Cheese consumption & engineering PhDs |
How to Establish Causation
| Method | Strength |
|---|---|
| Randomized controlled experiment | Gold standard — eliminates confounders |
| Natural experiment | Exploits random-like variation in the real world |
| Multiple lines of evidence | Temporal precedence + dose-response + biological plausibility + consistency + elimination of alternatives (Bradford Hill criteria) |
| Regression alone | Can describe associations but cannot prove causation |
Key Formulas
| Formula | Description |
|---|---|
| $r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}$ | Pearson correlation coefficient |
| $b_1 = r \cdot \frac{s_y}{s_x} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$ | Slope of regression line |
| $b_0 = \bar{y} - b_1\bar{x}$ | Intercept of regression line |
| $\hat{y} = b_0 + b_1 x$ | Regression prediction equation |
| $e_i = y_i - \hat{y}_i$ | Residual (observed $-$ predicted) |
| $R^2 = r^2 = 1 - \frac{SS_{\text{Res}}}{SS_{\text{Total}}}$ | Coefficient of determination |
| $SS_{\text{Total}} = SS_{\text{Regression}} + SS_{\text{Residual}}$ | Variability decomposition |
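The last two identities in the table can be checked numerically. This sketch (with made-up data) computes $R^2$ three ways and verifies the variability decomposition:

```python
import numpy as np
from scipy import stats

# Made-up data just to verify the identities numerically
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

fit = stats.linregress(x, y)
y_hat = fit.intercept + fit.slope * x

ss_total = np.sum((y - y.mean()) ** 2)   # total variability in y
ss_reg = np.sum((y_hat - y.mean()) ** 2) # variability explained by the line
ss_res = np.sum((y - y_hat) ** 2)        # leftover (residual) variability

# R^2 three equivalent ways: r^2, 1 - SSRes/SSTot, SSReg/SSTot
print(fit.rvalue ** 2, 1 - ss_res / ss_total, ss_reg / ss_total)
```

All three printed values agree, and `ss_total` equals `ss_reg + ss_res` up to floating-point rounding.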
Properties of $r$
| Property | Details |
|---|---|
| Range | $-1 \leq r \leq +1$ |
| $r = +1$ | Perfect positive linear relationship |
| $r = -1$ | Perfect negative linear relationship |
| $r = 0$ | No linear relationship (but could be nonlinear!) |
| Units | None — $r$ is unitless |
| Symmetry | $r(x, y) = r(y, x)$ |
| Only linear | Measures linear relationships only |
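Three of these properties (symmetry, unitlessness, and "only linear") can be demonstrated in a few lines. The quadratic example is the classic case of a perfect relationship with $r \approx 0$:

```python
import numpy as np

x = np.linspace(-3, 3, 101)
y = x ** 2  # perfect, but purely nonlinear, relationship

r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]          # symmetry: identical to r_xy

# Unitless: rescaling x (e.g. meters -> feet) leaves r unchanged
r_scaled = np.corrcoef(3.28 * x, y)[0, 1]

print(f"r for y = x^2 on symmetric x: {r_xy:.3f}")
```

Because `x` is symmetric about 0, `r_xy` comes out essentially 0 even though `y` is completely determined by `x`, which is why plotting before computing $r$ matters.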
Interpreting Slope and Intercept
Slope Template
"For each one-unit increase in [explanatory variable], the predicted [response variable] changes by $b_1$ [units], on average."
Intercept Interpretation
"When [explanatory variable] = 0, the predicted [response variable] is $b_0$."
Caution: The intercept only has practical meaning when $x = 0$ is within or near the range of observed data. If $x = 0$ is implausible (shoe size = 0, years of experience = 0 for a CEO study), the intercept is just a mathematical anchor.
Common Mistakes
| Mistake | Correction |
|---|---|
| Concluding causation from correlation | Only randomized experiments establish causation; always consider lurking variables |
| Not plotting before calculating $r$ | Always scatterplot first (Anscombe's Quartet) |
| Extrapolating beyond the data range | Only predict within the observed range of $x$ |
| Ignoring residual plots | High $R^2$ does not guarantee the model is appropriate (Anscombe Dataset II) |
| Interpreting $r = 0$ as "no relationship" | $r$ measures linear relationships; strong nonlinear patterns can have $r \approx 0$ |
| Confusing $r$ with $R^2$ | $r$ = correlation; $R^2 = r^2$ = proportion of variance explained |
| Ignoring regression to the mean | Extreme observations naturally become less extreme; don't credit/blame interventions |
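Two of these mistakes (skipping the plot, trusting $R^2$ alone) are exactly what Anscombe's Quartet illustrates. Using the published values for datasets I and II, both share essentially the same $r$ and fitted line, yet dataset II is a smooth curve that a residual plot exposes immediately:

```python
import numpy as np
from scipy import stats

# Anscombe's Quartet, datasets I and II (classic published values)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

fit1 = stats.linregress(x, y1)
fit2 = stats.linregress(x, y2)

# Nearly identical r, slope, and intercept for both datasets...
print(f"I:  r = {fit1.rvalue:.3f}, line = {fit1.intercept:.2f} + {fit1.slope:.2f}x")
print(f"II: r = {fit2.rvalue:.3f}, line = {fit2.intercept:.2f} + {fit2.slope:.2f}x")

# ...but dataset II's residuals trace a parabola, which only a
# residual plot (not r or R^2) reveals
res2 = y2 - (fit2.intercept + fit2.slope * x)
```

Plotting `res2` against `x` shows the systematic curvature that the summary statistics hide.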
Connections
| Connection | Details |
|---|---|
| Ch.5 (Scatterplots) | Scatterplots described qualitatively in Ch.5 are now quantified with $r$ and regression |
| Ch.6 (Standard deviation) | SD is inside the correlation formula; $r$ is the average product of z-scores |
| Ch.4 (Confounding) | Confounding variables create spurious correlations; regression cannot prove causation |
| Ch.13 (Hypothesis testing) | Hypothesis test for the slope: $H_0: \beta_1 = 0$ tests whether the linear relationship is statistically significant |
| Ch.17 (Effect sizes) | $R^2$ is the effect size for regression, analogous to $\eta^2$ for ANOVA |
| Ch.20 (ANOVA) | $SS_T = SS_{\text{Reg}} + SS_{\text{Res}}$ is the same decomposition as $SS_T = SS_B + SS_W$ |
| Ch.23 (Multiple regression) | Extends to multiple predictors; "holding other variables constant" partially controls confounders |
| Ch.24 (Logistic regression) | When the response is binary (yes/no), linear regression fails; logistic regression is needed |