Key Takeaways: Correlation and Simple Linear Regression

One-Sentence Summary

The Pearson correlation coefficient measures the strength and direction of a linear relationship between two numerical variables, while simple linear regression fits a least squares line that predicts one variable from another — but correlation does not imply causation, $R^2$ measures the proportion of variability explained, residual plots reveal whether a linear model is appropriate, and regression to the mean explains why extreme observations tend to be followed by less extreme ones.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
|---|---|---|
| Pearson's $r$ | A number from $-1$ to $+1$ measuring the strength and direction of a linear association | Quantifies what a scatterplot shows; unitless and symmetric |
| Least squares regression | The line $\hat{y} = b_0 + b_1 x$ that minimizes the sum of squared residuals | Provides predictions and a precise description of the average relationship |
| $R^2$ (coefficient of determination) | Proportion of variability in $y$ explained by the linear relationship with $x$ | Measures how useful the model is; connects to $\eta^2$ from ANOVA |
| Correlation $\neq$ causation | An observed correlation can arise from direct causation, reverse causation, confounding, or coincidence | Prevents drawing false causal conclusions from observational data |
| Regression to the mean | Extreme observations tend to be followed by less extreme ones (threshold concept) | Prevents attributing natural regression to interventions, treatments, or jinxes |

The Regression Procedure

Step by Step

  1. Always plot first: Create a scatterplot. Look for direction, form, strength, and outliers.

  2. Calculate the correlation: $$r = \frac{1}{n-1}\sum\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

  3. Fit the regression line:
     - Slope: $b_1 = r \cdot \frac{s_y}{s_x}$
     - Intercept: $b_0 = \bar{y} - b_1\bar{x}$
     - Equation: $\hat{y} = b_0 + b_1 x$

  4. Interpret slope and intercept in context.

  5. Assess model fit:
     - $R^2 = r^2$ — proportion of variability explained
     - Residual plot — check for patterns

  6. Check LINE conditions (for inference):
     - Linearity
     - Independence
     - Normality of residuals
     - Equal variance of residuals

  7. Discuss causation: Is this an experiment (causal) or observational study (association only)? Identify lurking variables.
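Steps 2–5 can be carried out directly from the formulas, without a library regression routine. A minimal sketch using NumPy and small hypothetical data (the height/weight numbers are made up for illustration):

```python
import numpy as np

# Hypothetical example data: heights (cm) and weights (kg)
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
y = np.array([55.0, 60.0, 63.0, 70.0, 72.0, 80.0])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)   # sample SDs (n - 1 denominator)

# Step 2: correlation as the (near-)average product of z-scores
r = np.sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)) / (n - 1)

# Step 3: slope and intercept from r and the two SDs
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

# Step 5: R^2 is the squared correlation
r_squared = r ** 2

print(f"r = {r:.3f}, slope = {b1:.3f}, intercept = {b0:.3f}, R^2 = {r_squared:.3f}")
```

The same numbers come out of `np.corrcoef` and `np.polyfit`, which is a useful sanity check that the formula-based computation is right.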

Key Python Code

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

# Example data (replace with your own arrays or DataFrame columns)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Correlation (returns r and the p-value for H0: rho = 0)
r, p = stats.pearsonr(x, y)
# Also: np.corrcoef(x, y)[0, 1]

# Simple regression
result = stats.linregress(x, y)
slope = result.slope
intercept = result.intercept
r_value = result.rvalue
r_squared = result.rvalue**2

# Full output with statsmodels (add_constant adds the intercept column)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print(model.summary())

# Visualization with seaborn (scatter, fitted line, 95% confidence band)
sns.regplot(x=x, y=y, ci=95)

# Residual plot: patterns here signal that a linear model is inappropriate
predicted = intercept + slope * x
residuals = y - predicted
plt.scatter(predicted, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.show()
```
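For step 6 (LINE conditions), parts of the residual diagnostics can be automated. A sketch using scipy's Shapiro–Wilk test for normality of residuals, on hypothetical simulated data that satisfies the conditions by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data generated to satisfy the LINE conditions
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=100)

result = stats.linregress(x, y)
residuals = y - (result.intercept + result.slope * x)

# Normality of residuals: small Shapiro-Wilk p-value is evidence against normality
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Equal variance (rough check): compare residual spread in lower vs upper half of x
lower = residuals[x < np.median(x)]
upper = residuals[x >= np.median(x)]
print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"residual SD, lower half: {lower.std(ddof=1):.2f}, upper half: {upper.std(ddof=1):.2f}")
```

These numerical checks supplement, not replace, the residual plot: a visual pattern (curvature, fanning) is often easier to see than to test for.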

Excel Functions

| Task | Function |
|---|---|
| Correlation | =CORREL(x_range, y_range) |
| Slope | =SLOPE(y_range, x_range) |
| Intercept | =INTERCEPT(y_range, x_range) |
| $R^2$ | =RSQ(y_range, x_range) |
| Prediction | =FORECAST.LINEAR(new_x, y_range, x_range) |
| Trendline | Right-click chart data → Add Trendline → Linear |

The Threshold Concept: Regression to the Mean

Extreme observations tend to be followed by less extreme observations — not because of any causal force, but because the correlation between measurements is less than perfect.

| Example | What People Think | What's Really Happening |
|---|---|---|
| Sports Illustrated Jinx | Appearing on the cover causes bad performance | The athlete made the cover because of an extreme (lucky) performance; the next performance naturally regresses toward their true average |
| Sophomore slump | Success in year 1 creates pressure that hurts year 2 | An outstanding rookie year included some luck; year 2 reflects the true skill level more accurately |
| Medical treatment "works" | Patients improve after treatment, so treatment caused improvement | Patients were selected because they had extreme symptoms; symptoms naturally regress toward average with or without treatment |

Mathematical explanation: In z-score terms, predicted $z_y = r \times z_x$. Since $|r| < 1$, the prediction is always less extreme than the observation it's based on.
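The $z_y = r \times z_x$ prediction can be verified by simulation. A sketch with synthetic standardized test–retest scores (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
r_true = 0.5

# Two standardized measurements with correlation r_true:
# score2 = r*score1 + noise, so both are standard normal
score1 = rng.normal(size=n)
score2 = r_true * score1 + rng.normal(scale=np.sqrt(1 - r_true**2), size=n)

# Among extreme first scores (z > 2), the average second score is much closer to 0
extreme = score1 > 2.0
mean_first = score1[extreme].mean()
mean_second = score2[extreme].mean()
print(f"mean first score among extremes:  {mean_first:.2f}")
print(f"mean second score among extremes: {mean_second:.2f}")  # roughly r * mean_first
```

No causal force pushes the second score down; selecting on an extreme first score is what guarantees the regression.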

Correlation vs. Causation

Four Explanations for a Correlation

| Explanation | Meaning | Example |
|---|---|---|
| Direct causation | $x$ causes $y$ | Smoking → lung cancer |
| Reverse causation | $y$ causes $x$ | Police presence correlates with crime because more crime leads to more police being deployed |
| Common cause | $z$ causes both $x$ and $y$ | Temperature → (ice cream, drowning) |
| Coincidence | No meaningful connection | Cheese consumption & engineering PhDs |
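The common-cause case is easy to demonstrate by simulation. A sketch in which a hypothetical confounder (temperature) drives both outcomes, producing a strong correlation between two variables with no causal link; all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical common cause: daily temperature drives both outcomes
temperature = rng.normal(25, 5, size=n)
ice_cream_sales = 10 + 2.0 * temperature + rng.normal(0, 5, size=n)
drownings = 1 + 0.1 * temperature + rng.normal(0, 1, size=n)

# Strong correlation despite no causal link between the two outcomes
r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"r(ice cream, drownings) = {r:.2f}")

# Controlling for temperature: correlate the residuals left after removing its effect
resid_ice = ice_cream_sales - np.polyval(np.polyfit(temperature, ice_cream_sales, 1), temperature)
resid_drown = drownings - np.polyval(np.polyfit(temperature, drownings, 1), temperature)
r_partial = np.corrcoef(resid_ice, resid_drown)[0, 1]
print(f"partial r, controlling for temperature = {r_partial:.2f}")
```

Once temperature's effect is removed from both variables, the residual correlation is essentially zero; this residual-on-residual idea is what "holding other variables constant" means in multiple regression.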

How to Establish Causation

| Method | Strength |
|---|---|
| Randomized controlled experiment | Gold standard — eliminates confounders |
| Natural experiment | Exploits random-like variation in the real world |
| Multiple lines of evidence | Temporal precedence + dose-response + biological plausibility + consistency + elimination of alternatives (Bradford Hill criteria) |
| Regression alone | Can describe associations but cannot prove causation |

Key Formulas

| Formula | Description |
|---|---|
| $r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}$ | Pearson correlation coefficient |
| $b_1 = r \cdot \frac{s_y}{s_x} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$ | Slope of regression line |
| $b_0 = \bar{y} - b_1\bar{x}$ | Intercept of regression line |
| $\hat{y} = b_0 + b_1 x$ | Regression prediction equation |
| $e_i = y_i - \hat{y}_i$ | Residual (observed $-$ predicted) |
| $R^2 = r^2 = 1 - \frac{SS_{\text{Res}}}{SS_{\text{Total}}}$ | Coefficient of determination |
| $SS_{\text{Total}} = SS_{\text{Regression}} + SS_{\text{Residual}}$ | Variability decomposition |
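The two identities in the last rows ($R^2 = r^2 = 1 - SS_{\text{Res}}/SS_{\text{Total}}$ and the sum-of-squares decomposition) can be checked numerically. A sketch on hypothetical simulated data:

```python
import numpy as np

# Hypothetical data to verify R^2 = r^2 = 1 - SS_res/SS_total
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=50)
y = 3 + 0.8 * x + rng.normal(0, 2, size=50)

b1, b0 = np.polyfit(x, y, 1)   # least squares slope and intercept
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)   # total variability
ss_res = np.sum((y - y_hat) ** 2)        # unexplained (residual) variability
ss_reg = np.sum((y_hat - y.mean()) ** 2) # variability explained by the line

r = np.corrcoef(x, y)[0, 1]
print(f"SS_total = {ss_total:.2f} = SS_reg + SS_res = {ss_reg + ss_res:.2f}")
print(f"r^2 = {r**2:.4f},  1 - SS_res/SS_total = {1 - ss_res / ss_total:.4f}")
```

The decomposition holds exactly (up to floating-point error) only for the least squares line with an intercept; that is part of what "least squares" buys you.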

Properties of $r$

| Property | Details |
|---|---|
| Range | $-1 \leq r \leq +1$ |
| $r = +1$ | Perfect positive linear relationship |
| $r = -1$ | Perfect negative linear relationship |
| $r = 0$ | No linear relationship (but could be nonlinear!) |
| Units | None — $r$ is unitless |
| Symmetry | $r(x, y) = r(y, x)$ |
| Only linear | Measures linear relationships only |
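Three of these properties — symmetry, unit invariance, and "only linear" — can be demonstrated in a few lines. A sketch on hypothetical data:

```python
import numpy as np

x = np.linspace(-3, 3, 101)

# Only linear: a perfect parabola has r ~ 0 despite a deterministic relationship
y_quad = x ** 2
r_quad = np.corrcoef(x, y_quad)[0, 1]

# Symmetry and unit invariance on hypothetical noisy linear data
rng = np.random.default_rng(3)
y_lin = 2 * x + rng.normal(0, 1, size=x.size)
r_xy = np.corrcoef(x, y_lin)[0, 1]
r_yx = np.corrcoef(y_lin, x)[0, 1]
r_scaled = np.corrcoef(100 * x + 7, y_lin)[0, 1]  # changing units leaves r unchanged

print(f"r for y = x^2:       {r_quad:.3f}")   # near 0
print(f"r(x, y) vs r(y, x):  {r_xy:.3f} vs {r_yx:.3f}")
print(f"r after rescaling x: {r_scaled:.3f}")
```

The parabola case is why "$r = 0$" must be reported as "no *linear* relationship", never "no relationship".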

Interpreting Slope and Intercept

Slope Template

"For each one-unit increase in [explanatory variable], the predicted [response variable] changes by $b_1$ [units], on average."

Intercept Interpretation

"When [explanatory variable] = 0, the predicted [response variable] is $b_0$."

Caution: The intercept only has practical meaning when $x = 0$ is within or near the range of observed data. If $x = 0$ is implausible (shoe size = 0, years of experience = 0 for a CEO study), the intercept is just a mathematical anchor.

Common Mistakes

| Mistake | Correction |
|---|---|
| Concluding causation from correlation | Only randomized experiments establish causation; always consider lurking variables |
| Not plotting before calculating $r$ | Always scatterplot first (Anscombe's Quartet) |
| Extrapolating beyond the data range | Only predict within the observed range of $x$ |
| Ignoring residual plots | High $R^2$ does not guarantee the model is appropriate (Anscombe Dataset II) |
| Interpreting $r = 0$ as "no relationship" | $r$ measures linear relationships; strong nonlinear patterns can have $r \approx 0$ |
| Confusing $r$ with $R^2$ | $r$ = correlation; $R^2 = r^2$ = proportion of variance explained |
| Ignoring regression to the mean | Extreme observations naturally become less extreme; don't credit or blame interventions |
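The "plot first" and "check residuals" warnings are exactly what Anscombe's Quartet illustrates. Datasets I and II share nearly identical summary statistics, yet dataset II is a perfect curve that a line fits poorly; a sketch using the published Anscombe values:

```python
import numpy as np

# Anscombe's Quartet, datasets I and II (they share the same x values)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]
print(f"r (dataset I):  {r1:.3f}")   # ~0.816
print(f"r (dataset II): {r2:.3f}")   # ~0.816, yet dataset II is a smooth curve

# Residuals expose the difference: dataset II's residuals form a clear arch
for y, label in [(y1, "I"), (y2, "II")]:
    b1, b0 = np.polyfit(x, y, 1)
    residuals = y - (b0 + b1 * x)
    print(f"dataset {label} residuals:", np.round(residuals, 2))
```

Identical $r$, identical fitted line, completely different stories — which is why summary statistics without a scatterplot and residual plot can mislead.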

Connections

| Connection | Details |
|---|---|
| Ch.5 (Scatterplots) | Scatterplots described qualitatively in Ch.5 are now quantified with $r$ and regression |
| Ch.6 (Standard deviation) | SD is inside the correlation formula; $r$ is the average product of z-scores |
| Ch.4 (Confounding) | Confounding variables create spurious correlations; regression cannot prove causation |
| Ch.13 (Hypothesis testing) | Hypothesis test for the slope: $H_0: \beta_1 = 0$ tests whether the linear relationship is statistically significant |
| Ch.17 (Effect sizes) | $R^2$ is the effect size for regression, analogous to $\eta^2$ for ANOVA |
| Ch.20 (ANOVA) | $SS_T = SS_{\text{Reg}} + SS_{\text{Res}}$ is the same decomposition as $SS_T = SS_B + SS_W$ |
| Ch.23 (Multiple regression) | Extends to multiple predictors; "holding other variables constant" partially controls confounders |
| Ch.24 (Logistic regression) | When the response is binary (yes/no), linear regression fails; logistic regression is needed |