Key Takeaways: Correlation and Simple Linear Regression

One-Sentence Summary

The Pearson correlation coefficient measures the strength and direction of a linear relationship between two numerical variables, while simple linear regression fits a least squares line that predicts one variable from another — but correlation does not imply causation, $R^2$ measures the proportion of variability explained, residual plots reveal whether a linear model is appropriate, and regression to the mean explains why extreme observations tend to be followed by less extreme ones.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
|---|---|---|
| Pearson's $r$ | A number from $-1$ to $+1$ measuring the strength and direction of a linear association | Quantifies what a scatterplot shows; unitless and symmetric |
| Least squares regression | The line $\hat{y} = b_0 + b_1 x$ that minimizes the sum of squared residuals | Provides predictions and a precise description of the average relationship |
| $R^2$ (coefficient of determination) | Proportion of variability in $y$ explained by the linear relationship with $x$ | Measures how useful the model is; connects to $\eta^2$ from ANOVA |
| Correlation $\neq$ causation | An observed correlation can arise from direct causation, reverse causation, confounding, or coincidence | Prevents drawing false causal conclusions from observational data |
| Regression to the mean | Extreme observations tend to be followed by less extreme ones (threshold concept) | Prevents attributing natural regression to interventions, treatments, or jinxes |

The Regression Procedure

Step by Step

  1. Always plot first: Create a scatterplot. Look for direction, form, strength, and outliers.

  2. Calculate the correlation: $$r = \frac{1}{n-1}\sum\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

  3. Fit the regression line:
     - Slope: $b_1 = r \cdot \frac{s_y}{s_x}$
     - Intercept: $b_0 = \bar{y} - b_1\bar{x}$
     - Equation: $\hat{y} = b_0 + b_1 x$

  4. Interpret slope and intercept in context.

  5. Assess model fit:
     - $R^2 = r^2$ — proportion of variability explained
     - Residual plot — check for patterns

  6. Check LINE conditions (for inference):
     - Linearity
     - Independence
     - Normality of residuals
     - Equal variance of residuals

  7. Discuss causation: Is this an experiment (causal) or observational study (association only)? Identify lurking variables.
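Steps 2–5 can be carried out directly from the formulas, without a library regression routine. A minimal sketch using NumPy and small hypothetical data (the height/weight numbers are made up for illustration):

```python
import numpy as np

# Hypothetical example data: heights (cm) and weights (kg)
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
y = np.array([55.0, 60.0, 63.0, 70.0, 72.0, 80.0])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)   # sample SDs (n - 1 denominator)

# Step 2: correlation as the (near-)average product of z-scores
r = np.sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)) / (n - 1)

# Step 3: slope and intercept from r and the two SDs
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

# Step 5: R^2 is the squared correlation
r_squared = r ** 2

print(f"r = {r:.3f}, slope = {b1:.3f}, intercept = {b0:.3f}, R^2 = {r_squared:.3f}")
```

The same numbers come out of `np.corrcoef` and `np.polyfit`, which is a useful sanity check that the formula-based computation is right.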

Key Python Code

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

# Example data (replace with your own arrays or DataFrame columns)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Correlation (returns r and the p-value for H0: rho = 0)
r, p = stats.pearsonr(x, y)
# Also: np.corrcoef(x, y)[0, 1]

# Simple regression
result = stats.linregress(x, y)
slope = result.slope
intercept = result.intercept
r_value = result.rvalue
r_squared = result.rvalue**2

# Full output with statsmodels (add_constant adds the intercept column)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print(model.summary())

# Visualization with seaborn (scatter, fitted line, 95% confidence band)
sns.regplot(x=x, y=y, ci=95)

# Residual plot: patterns here signal that a linear model is inappropriate
predicted = intercept + slope * x
residuals = y - predicted
plt.scatter(predicted, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.show()
```
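For step 6 (LINE conditions), parts of the residual diagnostics can be automated. A sketch using scipy's Shapiro–Wilk test for normality of residuals, on hypothetical simulated data that satisfies the conditions by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data generated to satisfy the LINE conditions
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=100)

result = stats.linregress(x, y)
residuals = y - (result.intercept + result.slope * x)

# Normality of residuals: small Shapiro-Wilk p-value is evidence against normality
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Equal variance (rough check): compare residual spread in lower vs upper half of x
lower = residuals[x < np.median(x)]
upper = residuals[x >= np.median(x)]
print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"residual SD, lower half: {lower.std(ddof=1):.2f}, upper half: {upper.std(ddof=1):.2f}")
```

These numerical checks supplement, not replace, the residual plot: a visual pattern (curvature, fanning) is often easier to see than to test for.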

Excel Functions

| Task | Function |
|---|---|
| Correlation | =CORREL(x_range, y_range) |
| Slope | =SLOPE(y_range, x_range) |
| Intercept | =INTERCEPT(y_range, x_range) |
| $R^2$ | =RSQ(y_range, x_range) |
| Prediction | =FORECAST.LINEAR(new_x, y_range, x_range) |
| Trendline | Right-click chart data → Add Trendline → Linear |

The Threshold Concept: Regression to the Mean

Extreme observations tend to be followed by less extreme observations — not because of any causal force, but because the correlation between measurements is less than perfect.

| Example | What People Think | What's Really Happening |
|---|---|---|
| Sports Illustrated Jinx | Appearing on the cover causes bad performance | The athlete made the cover because of an extreme (lucky) performance; the next performance naturally regresses toward their true average |
| Sophomore slump | Success in year 1 creates pressure that hurts year 2 | An outstanding rookie year included some luck; year 2 reflects the true skill level more accurately |
| Medical treatment "works" | Patients improve after treatment, so treatment caused improvement | Patients were selected because they had extreme symptoms; symptoms naturally regress toward average with or without treatment |

Mathematical explanation: In z-score terms, predicted $z_y = r \times z_x$. Since $|r| < 1$, the prediction is always less extreme than the observation it's based on.
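The $z_y = r \times z_x$ prediction can be verified by simulation. A sketch with synthetic standardized test–retest scores (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
r_true = 0.5

# Two standardized measurements with correlation r_true:
# score2 = r*score1 + noise, so both are standard normal
score1 = rng.normal(size=n)
score2 = r_true * score1 + rng.normal(scale=np.sqrt(1 - r_true**2), size=n)

# Among extreme first scores (z > 2), the average second score is much closer to 0
extreme = score1 > 2.0
mean_first = score1[extreme].mean()
mean_second = score2[extreme].mean()
print(f"mean first score among extremes:  {mean_first:.2f}")
print(f"mean second score among extremes: {mean_second:.2f}")  # roughly r * mean_first
```

No causal force pushes the second score down; selecting on an extreme first score is what guarantees the regression.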

Correlation vs. Causation

Four Explanations for a Correlation

| Explanation | Meaning | Example |
|---|---|---|
| Direct causation | $x$ causes $y$ | Smoking → lung cancer |
| Reverse causation | $y$ causes $x$ | Police presence correlates with crime because more crime leads to more police being deployed |
| Common cause | $z$ causes both $x$ and $y$ | Temperature → (ice cream, drowning) |
| Coincidence | No meaningful connection | Cheese consumption & engineering PhDs |
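The common-cause case is easy to demonstrate by simulation. A sketch in which a hypothetical confounder (temperature) drives both outcomes, producing a strong correlation between two variables with no causal link; all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical common cause: daily temperature drives both outcomes
temperature = rng.normal(25, 5, size=n)
ice_cream_sales = 10 + 2.0 * temperature + rng.normal(0, 5, size=n)
drownings = 1 + 0.1 * temperature + rng.normal(0, 1, size=n)

# Strong correlation despite no causal link between the two outcomes
r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"r(ice cream, drownings) = {r:.2f}")

# Controlling for temperature: correlate the residuals left after removing its effect
resid_ice = ice_cream_sales - np.polyval(np.polyfit(temperature, ice_cream_sales, 1), temperature)
resid_drown = drownings - np.polyval(np.polyfit(temperature, drownings, 1), temperature)
r_partial = np.corrcoef(resid_ice, resid_drown)[0, 1]
print(f"partial r, controlling for temperature = {r_partial:.2f}")
```

Once temperature's effect is removed from both variables, the residual correlation is essentially zero; this residual-on-residual idea is what "holding other variables constant" means in multiple regression.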

How to Establish Causation

| Method | Strength |
|---|---|
| Randomized controlled experiment | Gold standard — eliminates confounders |
| Natural experiment | Exploits random-like variation in the real world |
| Multiple lines of evidence | Temporal precedence + dose-response + biological plausibility + consistency + elimination of alternatives (Bradford Hill criteria) |
| Regression alone | Can describe associations but cannot prove causation |

Key Formulas

| Formula | Description |
|---|---|
| $r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}$ | Pearson correlation coefficient |
| $b_1 = r \cdot \frac{s_y}{s_x} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$ | Slope of regression line |
| $b_0 = \bar{y} - b_1\bar{x}$ | Intercept of regression line |
| $\hat{y} = b_0 + b_1 x$ | Regression prediction equation |
| $e_i = y_i - \hat{y}_i$ | Residual (observed $-$ predicted) |
| $R^2 = r^2 = 1 - \frac{SS_{\text{Res}}}{SS_{\text{Total}}}$ | Coefficient of determination |
| $SS_{\text{Total}} = SS_{\text{Regression}} + SS_{\text{Residual}}$ | Variability decomposition |
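The two identities in the last rows ($R^2 = r^2 = 1 - SS_{\text{Res}}/SS_{\text{Total}}$ and the sum-of-squares decomposition) can be checked numerically. A sketch on hypothetical simulated data:

```python
import numpy as np

# Hypothetical data to verify R^2 = r^2 = 1 - SS_res/SS_total
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=50)
y = 3 + 0.8 * x + rng.normal(0, 2, size=50)

b1, b0 = np.polyfit(x, y, 1)   # least squares slope and intercept
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)   # total variability
ss_res = np.sum((y - y_hat) ** 2)        # unexplained (residual) variability
ss_reg = np.sum((y_hat - y.mean()) ** 2) # variability explained by the line

r = np.corrcoef(x, y)[0, 1]
print(f"SS_total = {ss_total:.2f} = SS_reg + SS_res = {ss_reg + ss_res:.2f}")
print(f"r^2 = {r**2:.4f},  1 - SS_res/SS_total = {1 - ss_res / ss_total:.4f}")
```

The decomposition holds exactly (up to floating-point error) only for the least squares line with an intercept; that is part of what "least squares" buys you.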

Properties of $r$

| Property | Details |
|---|---|
| Range | $-1 \leq r \leq +1$ |
| $r = +1$ | Perfect positive linear relationship |
| $r = -1$ | Perfect negative linear relationship |
| $r = 0$ | No linear relationship (but could be nonlinear!) |
| Units | None — $r$ is unitless |
| Symmetry | $r(x, y) = r(y, x)$ |
| Only linear | Measures linear relationships only |
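Three of these properties — symmetry, unit invariance, and "only linear" — can be demonstrated in a few lines. A sketch on hypothetical data:

```python
import numpy as np

x = np.linspace(-3, 3, 101)

# Only linear: a perfect parabola has r ~ 0 despite a deterministic relationship
y_quad = x ** 2
r_quad = np.corrcoef(x, y_quad)[0, 1]

# Symmetry and unit invariance on hypothetical noisy linear data
rng = np.random.default_rng(3)
y_lin = 2 * x + rng.normal(0, 1, size=x.size)
r_xy = np.corrcoef(x, y_lin)[0, 1]
r_yx = np.corrcoef(y_lin, x)[0, 1]
r_scaled = np.corrcoef(100 * x + 7, y_lin)[0, 1]  # changing units leaves r unchanged

print(f"r for y = x^2:       {r_quad:.3f}")   # near 0
print(f"r(x, y) vs r(y, x):  {r_xy:.3f} vs {r_yx:.3f}")
print(f"r after rescaling x: {r_scaled:.3f}")
```

The parabola case is why "$r = 0$" must be reported as "no *linear* relationship", never "no relationship".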

Interpreting Slope and Intercept

Slope Template

"For each one-unit increase in [explanatory variable], the predicted [response variable] changes by $b_1$ [units], on average."

Intercept Interpretation

"When [explanatory variable] = 0, the predicted [response variable] is $b_0$."

Caution: The intercept only has practical meaning when $x = 0$ is within or near the range of observed data. If $x = 0$ is implausible (shoe size = 0, years of experience = 0 for a CEO study), the intercept is just a mathematical anchor.

Common Mistakes

| Mistake | Correction |
|---|---|
| Concluding causation from correlation | Only randomized experiments establish causation; always consider lurking variables |
| Not plotting before calculating $r$ | Always scatterplot first (Anscombe's Quartet) |
| Extrapolating beyond the data range | Only predict within the observed range of $x$ |
| Ignoring residual plots | High $R^2$ does not guarantee the model is appropriate (Anscombe Dataset II) |
| Interpreting $r = 0$ as "no relationship" | $r$ measures linear relationships; strong nonlinear patterns can have $r \approx 0$ |
| Confusing $r$ with $R^2$ | $r$ = correlation; $R^2 = r^2$ = proportion of variance explained |
| Ignoring regression to the mean | Extreme observations naturally become less extreme; don't credit or blame interventions |
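The "plot first" and "check residuals" warnings are exactly what Anscombe's Quartet illustrates. Datasets I and II share nearly identical summary statistics, yet dataset II is a perfect curve that a line fits poorly; a sketch using the published Anscombe values:

```python
import numpy as np

# Anscombe's Quartet, datasets I and II (they share the same x values)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]
print(f"r (dataset I):  {r1:.3f}")   # ~0.816
print(f"r (dataset II): {r2:.3f}")   # ~0.816, yet dataset II is a smooth curve

# Residuals expose the difference: dataset II's residuals form a clear arch
for y, label in [(y1, "I"), (y2, "II")]:
    b1, b0 = np.polyfit(x, y, 1)
    residuals = y - (b0 + b1 * x)
    print(f"dataset {label} residuals:", np.round(residuals, 2))
```

Identical $r$, identical fitted line, completely different stories — which is why summary statistics without a scatterplot and residual plot can mislead.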

Connections

| Connection | Details |
|---|---|
| Ch.5 (Scatterplots) | Scatterplots described qualitatively in Ch.5 are now quantified with $r$ and regression |
| Ch.6 (Standard deviation) | SD is inside the correlation formula; $r$ is the average product of z-scores |
| Ch.4 (Confounding) | Confounding variables create spurious correlations; regression cannot prove causation |
| Ch.13 (Hypothesis testing) | Hypothesis test for the slope: $H_0: \beta_1 = 0$ tests whether the linear relationship is statistically significant |
| Ch.17 (Effect sizes) | $R^2$ is the effect size for regression, analogous to $\eta^2$ for ANOVA |
| Ch.20 (ANOVA) | $SS_T = SS_{\text{Reg}} + SS_{\text{Res}}$ is the same decomposition as $SS_T = SS_B + SS_W$ |
| Ch.23 (Multiple regression) | Extends to multiple predictors; "holding other variables constant" partially controls confounders |
| Ch.24 (Logistic regression) | When the response is binary (yes/no), linear regression fails; logistic regression is needed |