Key Takeaways: Multiple Regression
One-Sentence Summary
Multiple regression extends simple regression to multiple predictors, enabling you to estimate the effect of each variable "holding all others constant" — thereby partially controlling for confounders — while adjusted $R^2$ penalizes unnecessary complexity, VIF detects multicollinearity, indicator variables incorporate categorical predictors, and residual diagnostics verify assumptions.
Core Concepts at a Glance
| Concept | Definition | Why It Matters |
|---|---|---|
| Multiple regression | $\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$ — predicting an outcome from several variables simultaneously | Reflects the real world, where outcomes have multiple causes |
| Partial regression coefficient | The predicted change in $y$ for a one-unit increase in $x_i$, holding all other predictors constant | Isolates each variable's unique contribution; the threshold concept |
| Adjusted $R^2$ | $R^2$ with a penalty for model complexity: $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$ | Prevents rewarding models just for adding more variables |
| Multicollinearity | High correlation among predictors, detected by VIF | Inflates standard errors; makes individual coefficients hard to interpret |
| Indicator (dummy) variable | A 0/1 variable representing membership in a category; use $k-1$ for $k$ categories | Lets categorical variables enter the regression framework |
The Multiple Regression Procedure
Step by Step

1. Define the question. What outcome? What predictors? Why these predictors (theory-driven)?
2. Explore the data:
   - Scatterplots of $y$ vs. each $x$
   - Correlation matrix among predictors (watch for multicollinearity)
   - Descriptive statistics
3. Fit the model using `smf.ols('y ~ x1 + x2 + x3', data=df).fit()` in Python or the Data Analysis ToolPak in Excel.
4. Check overall fit:
   - $R^2$ and adjusted $R^2$
   - F-test (is at least one predictor useful?)
5. Interpret individual predictors:
   - Coefficient: "For each one-unit increase in $x_i$, predicted $y$ changes by $b_i$, holding all other variables constant"
   - t-test / p-value: Is this specific predictor significant?
   - 95% CI: Plausible range for the true effect
6. Check multicollinearity: Calculate VIF for each predictor
7. Diagnose residuals:
   - Residuals vs. predicted → linearity, equal variance
   - QQ-plot → normality
   - Independence → study design
8. Report carefully: Discuss causation, unmeasured confounders, and limitations
Key Python Code
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
from scipy import stats

# Fit model
model = smf.ols('y ~ x1 + x2 + C(category)', data=df).fit()
print(model.summary())

# VIF (add a constant so each auxiliary regression has an intercept)
X = sm.add_constant(df[['x1', 'x2']])
for i, col in enumerate(X.columns):
    if col == 'const':
        continue
    print(f"VIF for {col}: "
          f"{variance_inflation_factor(X.values, i):.2f}")

# Residual diagnostics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].scatter(model.fittedvalues, model.resid, alpha=0.6)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_title('Residuals vs. Predicted')
stats.probplot(model.resid, plot=axes[1])
axes[1].set_title('QQ-Plot')
axes[2].hist(model.resid, bins=15, edgecolor='navy', alpha=0.7)
axes[2].set_title('Residual Distribution')
plt.tight_layout()
plt.show()
```
Excel Procedure
| Step | Action |
|---|---|
| 1. Enter data | Each variable in its own column; create 0/1 dummy columns manually for categorical variables |
| 2. Data Analysis | Data tab → Data Analysis → Regression |
| 3. Input Y Range | Select the response variable column |
| 4. Input X Range | Select ALL predictor columns (including dummies) at once |
| 5. Options | Check "Labels," "Residuals," and "Residual Plots" |
| 6. Output | Read Regression Statistics, ANOVA table, and Coefficients table |
The Threshold Concept: "Holding Other Variables Constant"
Each partial regression coefficient estimates the effect of one predictor on the outcome, assuming all other predictors remain unchanged. This is how regression statistically approximates what an experiment achieves through randomization: isolating the effect of one variable by controlling for others.
| Key Implication | Details |
|---|---|
| Coefficients change from simple to multiple regression | Because multiple regression isolates the unique effect, not the total association |
| Simpson's Paradox can be resolved | Multiple regression controls for confounders that reverse apparent trends |
| Statistical control $\neq$ experimental control | Regression controls for measured confounders only; unmeasured confounders can still bias results |
| "Otherwise identical" is a mathematical fiction | In the real world, you can rarely change one variable while holding all others fixed — but the model approximates this |
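This shift can be seen numerically. The sketch below (entirely synthetic data, with an invented confounder `z`) fits the same outcome with and without the confounder; the slope on `x` shrinks toward its true value of 1 once `z` is held constant:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: z confounds the x-y relationship.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)                   # confounder
x = 2 * z + rng.normal(size=n)           # x is partly driven by z
y = 1 * x + 3 * z + rng.normal(size=n)   # true effect of x is 1
df = pd.DataFrame({'x': x, 'y': y, 'z': z})

simple = smf.ols('y ~ x', data=df).fit()
multiple = smf.ols('y ~ x + z', data=df).fit()

# Simple regression absorbs z's effect into x's slope (biased upward);
# adding z to the model recovers a slope near the true value of 1.
print(f"simple slope:   {simple.params['x']:.2f}")
print(f"multiple slope: {multiple.params['x']:.2f}")
```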
Key Formulas
| Formula | Description |
|---|---|
| $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$ | Multiple regression equation |
| $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$ | Adjusted R-squared |
| $F = \frac{R^2 / k}{(1-R^2)/(n-k-1)}$ | F-statistic for overall model significance |
| $t_i = \frac{b_i}{SE(b_i)}$ | t-statistic for individual predictor $x_i$ |
| $\text{VIF}_i = \frac{1}{1 - R_i^2}$ | Variance Inflation Factor for predictor $x_i$ |
| $\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | Matrix formula for least squares estimates |
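The matrix formula in the last row can be checked on a tiny made-up design matrix (the numbers are arbitrary); it should agree with NumPy's least-squares solver:

```python
import numpy as np

# Design matrix: an intercept column plus two predictors (invented numbers).
X = np.array([
    [1., 1., 2.],
    [1., 2., 1.],
    [1., 3., 4.],
    [1., 4., 3.],
    [1., 5., 5.],
])
y = np.array([3., 4., 8., 9., 12.])

b_matrix = np.linalg.inv(X.T @ X) @ X.T @ y       # textbook formula
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # numerically stable solver

# Both routes give the same least-squares coefficients.
print(b_matrix)
```

In practice software uses the stable solver, not the explicit inverse, but the formula is what it computes.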
Adjusted $R^2$ vs. $R^2$
| Property | $R^2$ | Adjusted $R^2$ |
|---|---|---|
| When to use | Simple regression (one predictor) | Multiple regression (comparing models) |
| Adding a useless variable | Always increases or stays the same | Typically decreases (decreases whenever the new variable's $|t| < 1$) |
| Adding a useful variable | Increases | Increases |
| Range | 0 to 1 | Can be negative (rare; means model is worse than just using $\bar{y}$) |
Relationship: $R^2_{\text{adj}} \leq R^2$ always.
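As a sanity check, the adjusted $R^2$ formula can be computed by hand and compared against statsmodels (synthetic data; variable names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on x1 only; 'noise' is a useless predictor.
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({'x1': rng.normal(size=n), 'noise': rng.normal(size=n)})
df['y'] = 2 * df['x1'] + rng.normal(size=n)

m1 = smf.ols('y ~ x1', data=df).fit()
m2 = smf.ols('y ~ x1 + noise', data=df).fit()   # add the useless predictor

# Adjusted R² from the formula, for the two-predictor model (k = 2)
adj_by_hand = 1 - (1 - m2.rsquared) * (n - 1) / (n - 2 - 1)

print(f"R²: {m1.rsquared:.4f} -> {m2.rsquared:.4f}  (never decreases)")
print(f"adj R² by hand: {adj_by_hand:.4f}, statsmodels: {m2.rsquared_adj:.4f}")
```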
Multicollinearity Reference
| VIF | Interpretation | Action |
|---|---|---|
| 1 | No multicollinearity | None needed |
| 1–5 | Moderate | Generally acceptable |
| 5–10 | High | Investigate; be cautious interpreting individual coefficients |
| > 10 | Severe | Consider removing or combining correlated predictors |
Key distinction: Multicollinearity affects individual coefficient interpretation but does NOT affect overall model predictions.
Dummy Variables Quick Reference
| Situation | Number of Dummies | Reference Category |
|---|---|---|
| 2 categories (e.g., Male/Female) | 1 | The excluded category |
| 3 categories (e.g., Low/Med/High) | 2 | The excluded category |
| $k$ categories | $k - 1$ | The excluded category |
The dummy variable trap: Including all $k$ dummies creates perfect multicollinearity. Python/R handle this automatically; in Excel, you must create $k-1$ columns manually.
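With pandas, `get_dummies(drop_first=True)` is one way to get the $k-1$ columns automatically (the category labels here are made up):

```python
import pandas as pd

# A three-level categorical variable (invented labels).
df = pd.DataFrame({'dose': ['Low', 'Med', 'High', 'Low', 'High']})

all_k = pd.get_dummies(df['dose'])                       # k columns: the trap
k_minus_1 = pd.get_dummies(df['dose'], drop_first=True)  # k-1 columns

# Every row of all_k sums to 1, exactly duplicating the intercept column,
# which is the perfect multicollinearity the trap warns about.
print(all_k.columns.tolist())
print(k_minus_1.columns.tolist())  # the dropped level becomes the reference
```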
Residual Diagnostics (LINE)
| Condition | What to Check | Plot | Good Sign | Bad Sign |
|---|---|---|---|---|
| Linearity | Relationship is linear | Residuals vs. predicted | Random scatter | Curved pattern |
| Independence | Observations are independent | Study design | Independent sampling | Time series, clustering |
| Normality | Residuals ~ Normal | QQ-plot, histogram | Points on line | Curved QQ-plot |
| Equal variance | Constant spread | Residuals vs. predicted | Uniform width | Fan/funnel shape |
Interaction Terms
- What they capture: The effect of one variable depends on the level of another
- In the model: Add $x_1 \times x_2$ as a predictor; in Python, `y ~ x1 * x2` includes both main effects and the interaction
- Interpretation: When the interaction is significant, you cannot interpret main effects in isolation — the "effect" of $x_1$ changes depending on $x_2$
- Caution: Don't add interactions without theoretical justification; with $k$ predictors, there are $k(k-1)/2$ possible pairwise interactions
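A sketch of the formula expansion on synthetic data with a built-in interaction (the coefficients 1.0, 0.5, and 2.0 are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# The slope of x1 depends on x2: true interaction coefficient is 2.0.
y = 1.0 * x1 + 0.5 * x2 + 2.0 * x1 * x2 + rng.normal(size=n)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

# 'x1 * x2' expands to x1 + x2 + x1:x2 (main effects plus interaction).
model = smf.ols('y ~ x1 * x2', data=df).fit()
print(model.params.index.tolist())

# The effect of x1 at a given x2 is b_x1 + b_x1:x2 * x2, not b_x1 alone.
print(f"interaction estimate: {model.params['x1:x2']:.2f}")
```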
Model Building Strategy
| Approach | When to Use |
|---|---|
| Substantive (theory-driven) | Always the primary approach; include variables you have reason to believe matter |
| Forward selection | Exploratory supplement; starts simple, adds predictors |
| Backward elimination | Exploratory supplement; starts full, removes predictors |
| Rule of thumb | Need 10–15 observations per predictor to avoid overfitting |
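Forward selection by adjusted $R^2$ can be sketched in a few lines (synthetic data; this is an exploratory supplement, not a substitute for theory-driven modeling):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 150
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=['x1', 'x2', 'x3'])
df['y'] = 2 * df['x1'] + 1 * df['x2'] + rng.normal(size=n)  # x3 is pure noise

selected, remaining = [], ['x1', 'x2', 'x3']
best_adj = -np.inf
while remaining:
    # Try adding each remaining predictor; score by adjusted R².
    scores = {c: smf.ols(f"y ~ {' + '.join(selected + [c])}",
                         data=df).fit().rsquared_adj
              for c in remaining}
    best = max(scores, key=scores.get)
    if scores[best] <= best_adj:   # stop when adjusted R² stops improving
        break
    best_adj = scores[best]
    selected.append(best)
    remaining.remove(best)

print(selected)  # the genuine predictors should enter first
```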
Common Mistakes
| Mistake | Correction |
|---|---|
| Interpreting coefficients without "holding other variables constant" | Always include this phrase — it changes the meaning |
| Comparing coefficients across predictors on different scales | Use standardized coefficients or compare practical impact instead |
| Adding variables to maximize $R^2$ | Use adjusted $R^2$; build models based on theory |
| Ignoring multicollinearity | Calculate VIF; investigate values above 5 |
| Using $k$ dummies for $k$ categories | Use $k-1$ dummies; choose a reference category |
| Claiming causation | Say "associated with," not "causes"; acknowledge unmeasured confounders |
| Skipping residual diagnostics | Always check residuals vs. predicted, QQ-plot, and histogram |
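For the different-scales mistake, standardized coefficients can be obtained by z-scoring every variable and refitting (made-up data; the variable names and effect sizes are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({'income': rng.normal(50_000, 10_000, n),
                   'age': rng.normal(40, 12, n)})
df['y'] = 0.0001 * df['income'] + 0.2 * df['age'] + rng.normal(size=n)

z = (df - df.mean()) / df.std()   # z-score every column, including y

raw = smf.ols('y ~ income + age', data=df).fit()
std = smf.ols('y ~ income + age', data=z).fit()

# Raw slopes are on wildly different scales; standardized slopes are in
# "SDs of y per SD of x" and can be compared directly across predictors.
print(raw.params[['income', 'age']])
print(std.params[['income', 'age']])
```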
Connections
| Connection | Details |
|---|---|
| Ch.4 (Confounding) | Multiple regression partially controls for confounders by including them in the model; fulfills the Ch.4 promise of statistical control |
| Ch.17 (Effect sizes) | Partial coefficients are effect sizes; comparing them across predictors requires standardization |
| Ch.20 (ANOVA) | The F-test is the same signal-to-noise ratio; ANOVA is a special case of regression with only categorical predictors |
| Ch.22 (Simple regression) | Multiple regression extends simple regression; coefficients change because confounders are now controlled |
| Ch.24 (Logistic regression) | When the response is binary, logistic regression replaces linear; same multiple-predictor framework |
| AI/ML (Theme 3) | Multiple regression is the foundation; neural networks are regression with many more parameters and nonlinear functions |