Key Takeaways: Multiple Regression
One-Sentence Summary
Multiple regression extends simple regression to multiple predictors, enabling you to estimate the effect of each variable "holding all others constant" — thereby partially controlling for confounders — while adjusted $R^2$ penalizes unnecessary complexity, VIF detects multicollinearity, indicator variables incorporate categorical predictors, and residual diagnostics verify assumptions.
Core Concepts at a Glance
| Concept | Definition | Why It Matters |
|---|---|---|
| Multiple regression | $\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$ — predicting an outcome from several variables simultaneously | Reflects the real world, where outcomes have multiple causes |
| Partial regression coefficient | The predicted change in $y$ for a one-unit increase in $x_i$, holding all other predictors constant | Isolates each variable's unique contribution; the threshold concept |
| Adjusted $R^2$ | $R^2$ with a penalty for model complexity: $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$ | Prevents rewarding models just for adding more variables |
| Multicollinearity | High correlation among predictors, detected by VIF | Inflates standard errors; makes individual coefficients hard to interpret |
| Indicator (dummy) variable | A 0/1 variable representing membership in a category; use $k-1$ for $k$ categories | Lets categorical variables enter the regression framework |
The Multiple Regression Procedure
Step by Step

1. Define the question. What outcome? What predictors? Why these predictors (theory-driven)?
2. Explore the data:
   - Scatterplots of $y$ vs. each $x$
   - Correlation matrix among predictors (watch for multicollinearity)
   - Descriptive statistics
3. Fit the model using `smf.ols('y ~ x1 + x2 + x3', data=df).fit()` in Python or the Data Analysis ToolPak in Excel.
4. Check overall fit:
   - $R^2$ and adjusted $R^2$
   - F-test (is at least one predictor useful?)
5. Interpret individual predictors:
   - Coefficient: "For each one-unit increase in $x_i$, predicted $y$ changes by $b_i$, holding all other variables constant"
   - t-test / p-value: Is this specific predictor significant?
   - 95% CI: Plausible range for the true effect
6. Check multicollinearity: Calculate VIF for each predictor
7. Diagnose residuals:
   - Residuals vs. predicted → linearity, equal variance
   - QQ-plot → normality
   - Independence → study design
8. Report carefully: Discuss causation, unmeasured confounders, and limitations
Key Python Code
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
from scipy import stats

# Fit model
model = smf.ols('y ~ x1 + x2 + C(category)', data=df).fit()
print(model.summary())

# VIF (add a constant so each auxiliary regression has an intercept)
X = sm.add_constant(df[['x1', 'x2']])
for i, col in enumerate(X.columns):
    if col == 'const':
        continue
    print(f"VIF for {col}: "
          f"{variance_inflation_factor(X.values, i):.2f}")

# Residual diagnostics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].scatter(model.fittedvalues, model.resid, alpha=0.6)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_title('Residuals vs. Predicted')
stats.probplot(model.resid, plot=axes[1])
axes[1].set_title('QQ-Plot')
axes[2].hist(model.resid, bins=15, edgecolor='navy', alpha=0.7)
axes[2].set_title('Residual Distribution')
plt.tight_layout()
plt.show()
```
Excel Procedure
| Step | Action |
|---|---|
| 1. Enter data | Each variable in its own column; create 0/1 dummy columns manually for categorical variables |
| 2. Data Analysis | Data tab → Data Analysis → Regression |
| 3. Input Y Range | Select the response variable column |
| 4. Input X Range | Select ALL predictor columns (including dummies) at once |
| 5. Options | Check "Labels," "Residuals," and "Residual Plots" |
| 6. Output | Read Regression Statistics, ANOVA table, and Coefficients table |
The Threshold Concept: "Holding Other Variables Constant"
Each partial regression coefficient estimates the effect of one predictor on the outcome, assuming all other predictors remain unchanged. This is how regression statistically approximates what an experiment achieves through randomization: isolating the effect of one variable by controlling for others.
| Key Implication | Details |
|---|---|
| Coefficients change from simple to multiple regression | Because multiple regression isolates the unique effect, not the total association |
| Simpson's Paradox can be resolved | Multiple regression controls for confounders that reverse apparent trends |
| Statistical control $\neq$ experimental control | Regression controls for measured confounders only; unmeasured confounders can still bias results |
| "Otherwise identical" is a mathematical fiction | In the real world, you can rarely change one variable while holding all others fixed — but the model approximates this |
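This shift can be seen numerically. The sketch below (entirely synthetic data, with an invented confounder `z`) fits the same outcome with and without the confounder; the slope on `x` shrinks toward its true value of 1 once `z` is held constant:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: z confounds the x-y relationship.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)                   # confounder
x = 2 * z + rng.normal(size=n)           # x is partly driven by z
y = 1 * x + 3 * z + rng.normal(size=n)   # true effect of x is 1
df = pd.DataFrame({'x': x, 'y': y, 'z': z})

simple = smf.ols('y ~ x', data=df).fit()
multiple = smf.ols('y ~ x + z', data=df).fit()

# Simple regression absorbs z's effect into x's slope (biased upward);
# adding z to the model recovers a slope near the true value of 1.
print(f"simple slope:   {simple.params['x']:.2f}")
print(f"multiple slope: {multiple.params['x']:.2f}")
```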
Key Formulas
| Formula | Description |
|---|---|
| $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$ | Multiple regression equation |
| $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$ | Adjusted R-squared |
| $F = \frac{R^2 / k}{(1-R^2)/(n-k-1)}$ | F-statistic for overall model significance |
| $t_i = \frac{b_i}{SE(b_i)}$ | t-statistic for individual predictor $x_i$ |
| $\text{VIF}_i = \frac{1}{1 - R_i^2}$ | Variance Inflation Factor for predictor $x_i$ |
| $\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | Matrix formula for least squares estimates |
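The matrix formula in the last row can be checked on a tiny made-up design matrix (the numbers are arbitrary); it should agree with NumPy's least-squares solver:

```python
import numpy as np

# Design matrix: an intercept column plus two predictors (invented numbers).
X = np.array([
    [1., 1., 2.],
    [1., 2., 1.],
    [1., 3., 4.],
    [1., 4., 3.],
    [1., 5., 5.],
])
y = np.array([3., 4., 8., 9., 12.])

b_matrix = np.linalg.inv(X.T @ X) @ X.T @ y       # textbook formula
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # numerically stable solver

# Both routes give the same least-squares coefficients.
print(b_matrix)
```

In practice software uses the stable solver, not the explicit inverse, but the formula is what it computes.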
Adjusted $R^2$ vs. $R^2$
| Property | $R^2$ | Adjusted $R^2$ |
|---|---|---|
| When to use | Simple regression (one predictor) | Multiple regression (comparing models) |
| Adding a useless variable | Always increases or stays the same | Typically decreases (decreases whenever the new variable's $|t| < 1$) |
| Adding a useful variable | Increases | Increases |
| Range | 0 to 1 | Can be negative (rare; means model is worse than just using $\bar{y}$) |
Relationship: $R^2_{\text{adj}} \leq R^2$ always.
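As a sanity check, the adjusted $R^2$ formula can be computed by hand and compared against statsmodels (synthetic data; variable names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on x1 only; 'noise' is a useless predictor.
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({'x1': rng.normal(size=n), 'noise': rng.normal(size=n)})
df['y'] = 2 * df['x1'] + rng.normal(size=n)

m1 = smf.ols('y ~ x1', data=df).fit()
m2 = smf.ols('y ~ x1 + noise', data=df).fit()   # add the useless predictor

# Adjusted R² from the formula, for the two-predictor model (k = 2)
adj_by_hand = 1 - (1 - m2.rsquared) * (n - 1) / (n - 2 - 1)

print(f"R²: {m1.rsquared:.4f} -> {m2.rsquared:.4f}  (never decreases)")
print(f"adj R² by hand: {adj_by_hand:.4f}, statsmodels: {m2.rsquared_adj:.4f}")
```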
Multicollinearity Reference
| VIF | Interpretation | Action |
|---|---|---|
| 1 | No multicollinearity | None needed |
| 1–5 | Moderate | Generally acceptable |
| 5–10 | High | Investigate; be cautious interpreting individual coefficients |
| > 10 | Severe | Consider removing or combining correlated predictors |
Key distinction: Multicollinearity affects individual coefficient interpretation but does NOT affect overall model predictions.
Dummy Variables Quick Reference
| Situation | Number of Dummies | Reference Category |
|---|---|---|
| 2 categories (e.g., Male/Female) | 1 | The excluded category |
| 3 categories (e.g., Low/Med/High) | 2 | The excluded category |
| $k$ categories | $k - 1$ | The excluded category |
The dummy variable trap: Including all $k$ dummies creates perfect multicollinearity. Python/R handle this automatically; in Excel, you must create $k-1$ columns manually.
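With pandas, `get_dummies(drop_first=True)` is one way to get the $k-1$ columns automatically (the category labels here are made up):

```python
import pandas as pd

# A three-level categorical variable (invented labels).
df = pd.DataFrame({'dose': ['Low', 'Med', 'High', 'Low', 'High']})

all_k = pd.get_dummies(df['dose'])                       # k columns: the trap
k_minus_1 = pd.get_dummies(df['dose'], drop_first=True)  # k-1 columns

# Every row of all_k sums to 1, exactly duplicating the intercept column,
# which is the perfect multicollinearity the trap warns about.
print(all_k.columns.tolist())
print(k_minus_1.columns.tolist())  # the dropped level becomes the reference
```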
Residual Diagnostics (LINE)
| Condition | What to Check | Plot | Good Sign | Bad Sign |
|---|---|---|---|---|
| Linearity | Relationship is linear | Residuals vs. predicted | Random scatter | Curved pattern |
| Independence | Observations are independent | Study design | Independent sampling | Time series, clustering |
| Normality | Residuals ~ Normal | QQ-plot, histogram | Points on line | Curved QQ-plot |
| Equal variance | Constant spread | Residuals vs. predicted | Uniform width | Fan/funnel shape |
Interaction Terms
- What they capture: The effect of one variable depends on the level of another
- In the model: Add $x_1 \times x_2$ as a predictor; in Python, `y ~ x1 * x2` includes both main effects and the interaction
- Interpretation: When the interaction is significant, you cannot interpret main effects in isolation — the "effect" of $x_1$ changes depending on $x_2$
- Caution: Don't add interactions without theoretical justification; with $k$ predictors, there are $k(k-1)/2$ possible pairwise interactions
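A sketch of the formula expansion on synthetic data with a built-in interaction (the coefficients 1.0, 0.5, and 2.0 are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# The slope of x1 depends on x2: true interaction coefficient is 2.0.
y = 1.0 * x1 + 0.5 * x2 + 2.0 * x1 * x2 + rng.normal(size=n)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

# 'x1 * x2' expands to x1 + x2 + x1:x2 (main effects plus interaction).
model = smf.ols('y ~ x1 * x2', data=df).fit()
print(model.params.index.tolist())

# The effect of x1 at a given x2 is b_x1 + b_x1:x2 * x2, not b_x1 alone.
print(f"interaction estimate: {model.params['x1:x2']:.2f}")
```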
Model Building Strategy
| Approach | When to Use |
|---|---|
| Substantive (theory-driven) | Always the primary approach; include variables you have reason to believe matter |
| Forward selection | Exploratory supplement; starts simple, adds predictors |
| Backward elimination | Exploratory supplement; starts full, removes predictors |
| Rule of thumb | Need 10–15 observations per predictor to avoid overfitting |
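Forward selection by adjusted $R^2$ can be sketched in a few lines (synthetic data; this is an exploratory supplement, not a substitute for theory-driven modeling):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 150
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=['x1', 'x2', 'x3'])
df['y'] = 2 * df['x1'] + 1 * df['x2'] + rng.normal(size=n)  # x3 is pure noise

selected, remaining = [], ['x1', 'x2', 'x3']
best_adj = -np.inf
while remaining:
    # Try adding each remaining predictor; score by adjusted R².
    scores = {c: smf.ols(f"y ~ {' + '.join(selected + [c])}",
                         data=df).fit().rsquared_adj
              for c in remaining}
    best = max(scores, key=scores.get)
    if scores[best] <= best_adj:   # stop when adjusted R² stops improving
        break
    best_adj = scores[best]
    selected.append(best)
    remaining.remove(best)

print(selected)  # the genuine predictors should enter first
```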
Common Mistakes
| Mistake | Correction |
|---|---|
| Interpreting coefficients without "holding other variables constant" | Always include this phrase — it changes the meaning |
| Comparing coefficients across predictors on different scales | Use standardized coefficients or compare practical impact instead |
| Adding variables to maximize $R^2$ | Use adjusted $R^2$; build models based on theory |
| Ignoring multicollinearity | Calculate VIF; investigate values above 5 |
| Using $k$ dummies for $k$ categories | Use $k-1$ dummies; choose a reference category |
| Claiming causation | Say "associated with," not "causes"; acknowledge unmeasured confounders |
| Skipping residual diagnostics | Always check residuals vs. predicted, QQ-plot, and histogram |
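For the different-scales mistake, standardized coefficients can be obtained by z-scoring every variable and refitting (made-up data; the variable names and effect sizes are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({'income': rng.normal(50_000, 10_000, n),
                   'age': rng.normal(40, 12, n)})
df['y'] = 0.0001 * df['income'] + 0.2 * df['age'] + rng.normal(size=n)

z = (df - df.mean()) / df.std()   # z-score every column, including y

raw = smf.ols('y ~ income + age', data=df).fit()
std = smf.ols('y ~ income + age', data=z).fit()

# Raw slopes are on wildly different scales; standardized slopes are in
# "SDs of y per SD of x" and can be compared directly across predictors.
print(raw.params[['income', 'age']])
print(std.params[['income', 'age']])
```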
Connections
| Connection | Details |
|---|---|
| Ch.4 (Confounding) | Multiple regression partially controls for confounders by including them in the model; fulfills the Ch.4 promise of statistical control |
| Ch.17 (Effect sizes) | Partial coefficients are effect sizes; comparing them across predictors requires standardization |
| Ch.20 (ANOVA) | The F-test is the same signal-to-noise ratio; ANOVA is a special case of regression with only categorical predictors |
| Ch.22 (Simple regression) | Multiple regression extends simple regression; coefficients change because confounders are now controlled |
| Ch.24 (Logistic regression) | When the response is binary, logistic regression replaces linear; same multiple-predictor framework |
| AI/ML (Theme 3) | Multiple regression is the foundation; neural networks are regression with many more parameters and nonlinear functions |