Key Takeaways: Multiple Regression

One-Sentence Summary

Multiple regression extends simple regression to multiple predictors, enabling you to estimate the effect of each variable "holding all others constant" — thereby partially controlling for confounders — while adjusted $R^2$ penalizes unnecessary complexity, VIF detects multicollinearity, indicator variables incorporate categorical predictors, and residual diagnostics verify assumptions.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
|---|---|---|
| Multiple regression | $\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$ — predicting an outcome from several variables simultaneously | Reflects the real world, where outcomes have multiple causes |
| Partial regression coefficient | The predicted change in $y$ for a one-unit increase in $x_i$, holding all other predictors constant | Isolates each variable's unique contribution; the threshold concept |
| Adjusted $R^2$ | $R^2$ with a penalty for model complexity: $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$ | Prevents rewarding models just for adding more variables |
| Multicollinearity | High correlation among predictors, detected by VIF | Inflates standard errors; makes individual coefficients hard to interpret |
| Indicator (dummy) variable | A 0/1 variable representing membership in a category; use $k-1$ for $k$ categories | Lets categorical variables enter the regression framework |

The Multiple Regression Procedure

Step by Step

  1. Define the question. What outcome? What predictors? Why these predictors (theory-driven)?

  2. Explore the data:
     - Scatterplots of $y$ vs. each $x$
     - Correlation matrix among predictors (watch for multicollinearity)
     - Descriptive statistics

  3. Fit the model using smf.ols('y ~ x1 + x2 + x3', data=df).fit() in Python or the Data Analysis ToolPak in Excel.

  4. Check overall fit:
     - $R^2$ and adjusted $R^2$
     - F-test (is at least one predictor useful?)

  5. Interpret individual predictors:
     - Coefficient: "For each one-unit increase in $x_i$, predicted $y$ changes by $b_i$, holding all other variables constant"
     - t-test / p-value: Is this specific predictor significant?
     - 95% CI: Plausible range for the true effect

  6. Check multicollinearity: Calculate VIF for each predictor

  7. Diagnose residuals:
     - Residuals vs. predicted → linearity, equal variance
     - QQ-plot → normality
     - Independence → study design

  8. Report carefully: Discuss causation, unmeasured confounders, and limitations

Key Python Code

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
from scipy import stats

# Fit model
model = smf.ols('y ~ x1 + x2 + C(category)', data=df).fit()
print(model.summary())

# VIF: compute on a design matrix that includes the constant
# (omitting it gives misleading values), then skip the constant itself
X = sm.add_constant(df[['x1', 'x2']])
for i, col in enumerate(X.columns):
    if col == 'const':
        continue
    print(f"VIF for {col}: "
          f"{variance_inflation_factor(X.values, i):.2f}")

# Residual diagnostics: three panels side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].scatter(model.fittedvalues, model.resid, alpha=0.6)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_title('Residuals vs. Predicted')
stats.probplot(model.resid, plot=axes[1])   # QQ-plot against the Normal
axes[1].set_title('QQ-Plot')
axes[2].hist(model.resid, bins=15, edgecolor='navy', alpha=0.7)
axes[2].set_title('Residual Distribution')
plt.tight_layout()
plt.show()
```
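Beyond `print(model.summary())`, the numbers you report can be pulled straight off the fitted results object; attribute names below are from statsmodels' `RegressionResults` API, and the dataset is synthetic, for illustration only:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with known coefficients (b1 = 1.5, b2 = -0.5)
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=100), 'x2': rng.normal(size=100)})
df['y'] = 2.0 + 1.5 * df['x1'] - 0.5 * df['x2'] + rng.normal(scale=0.5, size=100)

model = smf.ols('y ~ x1 + x2', data=df).fit()

coefs = model.params             # point estimates b0, b1, b2
ci = model.conf_int(alpha=0.05)  # 95% CIs, one row per coefficient
print(f"b1 = {coefs['x1']:.2f}, 95% CI [{ci.loc['x1', 0]:.2f}, {ci.loc['x1', 1]:.2f}]")
print(f"Adjusted R^2 = {model.rsquared_adj:.3f}, F p-value = {model.f_pvalue:.2g}")
```

The same object also exposes `model.pvalues` and `model.bse` (standard errors) for per-predictor t-tests.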

Excel Procedure

| Step | Action |
|---|---|
| 1. Enter data | Each variable in its own column; create 0/1 dummy columns manually for categorical variables |
| 2. Data Analysis | Data tab → Data Analysis → Regression |
| 3. Input Y Range | Select the response variable column |
| 4. Input X Range | Select ALL predictor columns (including dummies) at once |
| 5. Options | Check "Labels," "Residuals," and "Residual Plots" |
| 6. Output | Read Regression Statistics, ANOVA table, and Coefficients table |

The Threshold Concept: "Holding Other Variables Constant"

Each partial regression coefficient estimates the effect of one predictor on the outcome, assuming all other predictors remain unchanged. This is how regression statistically approximates what an experiment achieves through randomization: isolating the effect of one variable by controlling for others.

| Key Implication | Details |
|---|---|
| Coefficients change from simple to multiple regression | Because multiple regression isolates the unique effect, not the total association |
| Simpson's Paradox can be resolved | Multiple regression controls for confounders that reverse apparent trends |
| Statistical control $\neq$ experimental control | Regression controls for measured confounders only; unmeasured confounders can still bias results |
| "Otherwise identical" is a mathematical fiction | In the real world, you can rarely change one variable while holding all others fixed — but the model approximates this |
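A small simulation makes the threshold concept concrete. In the synthetic data below, a confounder $z$ raises both $x$ and $y$; the true effect of $x$ on $y$ is $-1$, yet the simple regression slope comes out positive:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Confounded data: z drives both x and y; true effect of x on y is -1
rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)
y = 2.0 * z - 1.0 * x + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({'x': x, 'y': y, 'z': z})

simple = smf.ols('y ~ x', data=df).fit()        # omits the confounder
multiple = smf.ols('y ~ x + z', data=df).fit()  # controls for it

print(f"Simple slope:  {simple.params['x']:+.2f}")    # positive, misleading
print(f"Partial slope: {multiple.params['x']:+.2f}")  # near the true -1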

Key Formulas

| Formula | Description |
|---|---|
| $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$ | Multiple regression equation |
| $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$ | Adjusted R-squared |
| $F = \frac{R^2 / k}{(1-R^2)/(n-k-1)}$ | F-statistic for overall model significance |
| $t_i = \frac{b_i}{SE(b_i)}$ | t-statistic for individual predictor $x_i$ |
| $\text{VIF}_i = \frac{1}{1 - R_i^2}$ | Variance Inflation Factor for predictor $x_i$ |
| $\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | Matrix formula for least squares estimates |
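These formulas can be verified numerically against statsmodels on synthetic data; a sketch, using the result attributes `rsquared_adj`, `fvalue`, and `params`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n, k = 80, 2
df = pd.DataFrame(rng.normal(size=(n, k)), columns=['x1', 'x2'])
df['y'] = 1.0 + 0.8 * df['x1'] + 0.3 * df['x2'] + rng.normal(size=n)

model = smf.ols('y ~ x1 + x2', data=df).fit()
R2 = model.rsquared

# Adjusted R^2 from the formula
adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)

# F-statistic from the formula
F = (R2 / k) / ((1 - R2) / (n - k - 1))

# Least squares via the matrix formula b = (X'X)^(-1) X'y
X = np.column_stack([np.ones(n), df['x1'], df['x2']])
b = np.linalg.solve(X.T @ X, X.T @ df['y'])

print(np.isclose(adj, model.rsquared_adj))   # matches statsmodels
print(np.isclose(F, model.fvalue))
print(np.allclose(b, model.params.values))   # [b0, b1, b2]
```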

Adjusted $R^2$ vs. $R^2$

| Property | $R^2$ | Adjusted $R^2$ |
|---|---|---|
| When to use | Simple regression (one predictor) | Multiple regression (comparing models) |
| Adding a useless variable | Always increases or stays the same | Usually decreases (increases only when the variable's $\lvert t \rvert > 1$) |
| Adding a useful variable | Increases | Increases |
| Range | 0 to 1 | Can be negative (rare; means the model fits worse than just using $\bar{y}$) |

In every model, $R^2_{\text{adj}} \leq R^2$.
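A quick simulated check of the "useless variable" row: add a pure-noise column (unrelated to `y` by construction) and compare the two fits:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on x1 only; 'noise' is pure junk
rng = np.random.default_rng(3)
n = 60
df = pd.DataFrame({'x1': rng.normal(size=n)})
df['y'] = 2.0 + 1.0 * df['x1'] + rng.normal(size=n)
df['noise'] = rng.normal(size=n)

base = smf.ols('y ~ x1', data=df).fit()
bloated = smf.ols('y ~ x1 + noise', data=df).fit()

# R^2 can never decrease when a predictor is added...
print(f"R^2:     {base.rsquared:.4f} -> {bloated.rsquared:.4f}")
# ...but adjusted R^2 charges for the extra degree of freedom
print(f"Adj R^2: {base.rsquared_adj:.4f} -> {bloated.rsquared_adj:.4f}")
```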

Multicollinearity Reference

| VIF | Interpretation | Action |
|---|---|---|
| 1 | No multicollinearity | None needed |
| 1–5 | Moderate | Generally acceptable |
| 5–10 | High | Investigate; be cautious interpreting individual coefficients |
| > 10 | Severe | Consider removing or combining correlated predictors |

Key distinction: Multicollinearity affects individual coefficient interpretation but does NOT affect overall model predictions.

Dummy Variables Quick Reference

| Situation | Number of Dummies | Reference Category |
|---|---|---|
| 2 categories (e.g., Male/Female) | 1 | The excluded category |
| 3 categories (e.g., Low/Med/High) | 2 | The excluded category |
| $k$ categories | $k - 1$ | The excluded category |

The dummy variable trap: Including all $k$ dummies creates perfect multicollinearity. Python/R handle this automatically; in Excel, you must create $k-1$ columns manually.
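Both routes, sketched on synthetic data: `C()` in a formula versus building the dummies yourself with pandas. The reference level here is whichever category sorts first alphabetically.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({'group': rng.choice(['Low', 'Med', 'High'], size=90)})
means = {'Low': 1.0, 'Med': 2.0, 'High': 3.0}
df['y'] = df['group'].map(means) + rng.normal(scale=0.3, size=90)

# Route 1: C() builds k-1 = 2 dummies; the alphabetically first level
# ('High') is the reference, so the intercept estimates its group mean
model = smf.ols('y ~ C(group)', data=df).fit()
print(model.params.index.tolist())

# Route 2: manual dummies (what you'd paste into Excel); drop_first
# keeps k-1 columns and avoids the dummy variable trap
dummies = pd.get_dummies(df['group'], drop_first=True)
print(dummies.columns.tolist())
```

To choose the reference level yourself, patsy accepts `C(group, Treatment(reference='Low'))` in the formula.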

Residual Diagnostics (LINE)

| Condition | What to Check | Plot | Good Sign | Bad Sign |
|---|---|---|---|---|
| Linearity | Relationship is linear | Residuals vs. predicted | Random scatter | Curved pattern |
| Independence | Observations are independent | Study design | Independent sampling | Time series, clustering |
| Normality | Residuals ~ Normal | QQ-plot, histogram | Points on line | Curved QQ-plot |
| Equal variance | Constant spread | Residuals vs. predicted | Uniform width | Fan/funnel shape |

Interaction Terms

  • What they capture: The effect of one variable depends on the level of another
  • In the model: Add $x_1 \times x_2$ as a predictor; in Python: y ~ x1 * x2 (includes both main effects and interaction)
  • Interpretation: When the interaction is significant, you cannot interpret main effects in isolation — the "effect" of $x_1$ changes depending on $x_2$
  • Caution: Don't add interactions without theoretical justification; with $k$ predictors, there are $k(k-1)/2$ possible pairwise interactions
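A sketch on synthetic data where the true model really does contain $x_1 \times x_2$; note how the formula `y ~ x1 * x2` expands into three terms plus the intercept:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 300
df = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
# True model: the slope on x1 depends on the level of x2
df['y'] = (1.0 + 0.5 * df['x1'] + 0.5 * df['x2']
           + 1.0 * df['x1'] * df['x2'] + rng.normal(scale=0.5, size=n))

# 'x1 * x2' expands to x1 + x2 + x1:x2 (both main effects + interaction)
model = smf.ols('y ~ x1 * x2', data=df).fit()
print(model.params.index.tolist())   # ['Intercept', 'x1', 'x2', 'x1:x2']
print(f"slope on x1 when x2 = 0: {model.params['x1']:.2f}")
print(f"extra slope per unit x2: {model.params['x1:x2']:.2f}")
```

This is why main effects lose their stand-alone meaning: the coefficient on `x1` is the slope only at `x2 = 0`.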

Model Building Strategy

| Approach | When to Use |
|---|---|
| Substantive (theory-driven) | Always the primary approach; include variables you have reason to believe matter |
| Forward selection | Exploratory supplement; starts simple, adds predictors |
| Backward elimination | Exploratory supplement; starts full, removes predictors |
| Rule of thumb | Need 10–15 observations per predictor to avoid overfitting |
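As an exploratory supplement only, forward selection by adjusted $R^2$ can be sketched in a few lines. Here `forward_select` is an illustrative helper, not a statsmodels function, and the data are synthetic, with `x3` and `x4` pure noise:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 150
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=['x1', 'x2', 'x3', 'x4'])
df['y'] = 2 * df['x1'] + 1 * df['x2'] + rng.normal(size=n)

def forward_select(df, response, candidates):
    """Greedy forward selection: at each round, add the predictor that most
    improves adjusted R^2; stop when no addition helps."""
    chosen, best_adj = [], -np.inf
    while candidates:
        scores = {}
        for c in candidates:
            formula = f"{response} ~ {' + '.join(chosen + [c])}"
            scores[c] = smf.ols(formula, data=df).fit().rsquared_adj
        best = max(scores, key=scores.get)
        if scores[best] <= best_adj:
            break
        best_adj = scores[best]
        chosen.append(best)
        candidates = [c for c in candidates if c != best]
    return chosen

selected = forward_select(df, 'y', ['x1', 'x2', 'x3', 'x4'])
print(selected)   # the real predictors x1, x2 are picked up first
```

Treat the result as a starting point for substantive reasoning, not as the final model.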

Common Mistakes

| Mistake | Correction |
|---|---|
| Interpreting coefficients without "holding other variables constant" | Always include this phrase — it changes the meaning |
| Comparing coefficients across predictors on different scales | Use standardized coefficients or compare practical impact instead |
| Adding variables to maximize $R^2$ | Use adjusted $R^2$; build models based on theory |
| Ignoring multicollinearity | Calculate VIF; investigate values above 5 |
| Using $k$ dummies for $k$ categories | Use $k-1$ dummies; choose a reference category |
| Claiming causation | Say "associated with," not "causes"; acknowledge unmeasured confounders |
| Skipping residual diagnostics | Always check residuals vs. predicted, QQ-plot, and histogram |

Connections

| Connection | Details |
|---|---|
| Ch.4 (Confounding) | Multiple regression partially controls for confounders by including them in the model; fulfills the Ch.4 promise of statistical control |
| Ch.17 (Effect sizes) | Partial coefficients are effect sizes; comparing them across predictors requires standardization |
| Ch.20 (ANOVA) | The F-test is the same signal-to-noise ratio; ANOVA is a special case of regression with only categorical predictors |
| Ch.22 (Simple regression) | Multiple regression extends simple regression; coefficients change because confounders are now controlled |
| Ch.24 (Logistic regression) | When the response is binary, logistic regression replaces linear; same multiple-predictor framework |
| AI/ML (Theme 3) | Multiple regression is the foundation; neural networks are regression with many more parameters and nonlinear functions |