Further Reading: Multiple Regression

Books

For Deeper Understanding

Michael H. Kutner, Christopher J. Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models, 5th edition (2005). Recommended in Chapter 22 for simple regression, this text truly shines in its treatment of multiple regression (Part II, Chapters 6-14). The coverage of multicollinearity diagnostics, model selection procedures, and influential observations is the most thorough at the intermediate level. Chapter 7 on multiple regression in matrix form provides the mathematical foundation for understanding why $\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ and how multicollinearity makes this computation unstable. Chapter 10 on model building and validation is particularly relevant for understanding the forward selection, backward elimination, and stepwise procedures discussed in this chapter.
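The matrix formulation can be previewed in a few lines of NumPy. This is a hypothetical sketch (not from Kutner et al.) that solves the normal equations for made-up data and shows how a near-duplicate predictor makes $\mathbf{X}^T\mathbf{X}$ ill-conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # nearly an exact copy of x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])    # design matrix with intercept
XtX = X.T @ X

# b = (X^T X)^{-1} X^T y, solved without forming the inverse explicitly
b = np.linalg.solve(XtX, X.T @ y)

# A huge condition number means tiny perturbations of the data produce
# large swings in the individual coefficients, even though their sum
# (here close to 2 + 3 = 5) stays well determined.
print("condition number of X^T X:", np.linalg.cond(XtX))
print("coefficients:", np.round(b, 2))
```

The individual estimates for the two collinear predictors are unstable, but their sum is pinned down precisely, which is exactly the pattern VIF diagnostics are designed to flag.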

Samprit Chatterjee and Ali S. Hadi, Regression Analysis by Example, 5th edition (2012). Also recommended in Chapter 22, this text's worked-example approach is especially valuable for multiple regression. Each chapter revolves around a real dataset — supervisor performance, cigarette data, fuel consumption — that makes the abstract concepts concrete. The chapter on multicollinearity (Chapter 9) uses the supervisor dataset to show exactly what happens to coefficients when correlated predictors are added and removed. The residual diagnostics chapters include side-by-side comparisons of good and problematic residual plots with real data.

Jacob Cohen, Patricia Cohen, Stephen G. West, and Leona S. Aiken, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd edition (2003). The definitive text for researchers in psychology, education, and the social sciences who use multiple regression as their primary analytical tool. The treatment of interactions (Chapter 7) and polynomial regression (Chapter 6) is the most accessible available. Cohen et al. are particularly strong on the interpretation of partial coefficients and the distinction between semi-partial and partial correlation — topics that deepen the "holding other variables constant" concept. The effect size discussion connects directly to the power analysis concepts from Chapter 17.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning (ISLR), 2nd edition (2021). Free online at statlearning.com. Chapter 3 covers multiple linear regression and provides the bridge between classical statistics and machine learning. The bias-variance tradeoff discussion in Chapter 2 explains why adjusted $R^2$ matters — it's a simple version of the same principle that drives cross-validation. The labs use R, but Python editions are also available. This is the natural next step for students interested in how the regression concepts in this chapter scale up to machine learning.

For the Conceptually Curious

Charles Wheelan, Naked Statistics: Stripping the Dread from the Data (2013). Wheelan's Chapter 12 on regression is a masterclass in making multiple regression intuitive. His examples — why controlling for education changes the race-income relationship, why neighborhoods with more Starbucks have higher real estate values (it's not the coffee) — are memorable and directly relevant to this chapter's theme of confounders. Previously recommended for Chapters 12, 13, 18, 20, and 22.

David Freedman, Statistical Models: Theory and Practice, 2nd edition (2009). Freedman was one of the most careful thinkers about what regression can and cannot do. This text includes penetrating critiques of the assumptions behind causal interpretation of regression coefficients. His discussion of "regression as description" vs. "regression as causal inference" is essential reading for anyone who wants to understand why "holding other variables constant" is simultaneously powerful and potentially misleading. More advanced than the chapter, but the first three chapters are accessible and deeply thought-provoking.

Judea Pearl and Dana Mackenzie, The Book of Why: The New Science of Cause and Effect (2018). Pearl revolutionized the study of causation in statistics. This popular-science book explains why multiple regression alone cannot establish causation — even with perfect data and infinite sample size. Pearl's causal diagrams (directed acyclic graphs, or DAGs) provide a framework for deciding which variables to control for and which to leave alone. The example of Simpson's Paradox that opened this chapter is one of Pearl's signature illustrations. Essential reading for anyone who wants to think rigorously about when "holding other variables constant" actually isolates a causal effect.

Articles and Papers

Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press. Recommended in Chapter 22 for its treatment of causal inference. The first three chapters are directly relevant to this chapter's discussion of controlling for confounders. Angrist and Pischke call multiple regression "the workhorse of applied econometrics" and provide a clear framework for understanding when regression coefficients have causal interpretations (randomized experiments, natural experiments, instrumental variables) and when they're purely descriptive.

Marquardt, D. W. (1970). "Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimation." Technometrics, 12(3), 591-612. The paper that formalized VIF as a diagnostic tool for multicollinearity. While the mathematics is advanced, the first few pages clearly motivate the problem: when predictors are correlated, the variance of coefficient estimates inflates, and VIF quantifies exactly how much. Understanding VIF's origin as a ratio of actual variance to ideal variance deepens the intuition behind the numbers.
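Marquardt's variance-inflation idea is straightforward to compute by hand: regress each predictor on all the others and take $\mathrm{VIF}_j = 1/(1 - R_j^2)$. The following sketch uses made-up data (not from the paper) in which two predictors share a common factor while a third is independent:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)
x1 = z + 0.3 * rng.normal(size=n)   # x1 and x2 share the factor z,
x2 = z + 0.3 * rng.normal(size=n)   # so they are highly correlated
x3 = rng.normal(size=n)             # x3 is independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(3)]
print([round(v, 2) for v in vifs])  # x1 and x2 inflated, x3 near 1
```

The inflated values for the correlated pair quantify Marquardt's point: their coefficient variances are several times what they would be with orthogonal predictors.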

Simpson, E. H. (1951). "The Interpretation of Interaction in Contingency Tables." Journal of the Royal Statistical Society, Series B, 13(2), 238-241. The original paper describing what became known as Simpson's Paradox. At just four pages, it's a concise demonstration of how aggregating data can reverse the direction of an association. The kidney stone treatment example used in Section 23.1 is a real-world instance first documented by Charig et al. (1986) in the British Medical Journal.
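The reversal is easy to verify with the success counts reported by Charig et al. (1986): open surgery (A) has the higher success rate within each stone-size group, yet the lower rate overall, because it was used far more often on the harder large-stone cases.

```python
# (successes, patients) for each treatment, by stone size,
# as reported by Charig et al. (1986)
A = {"small": (81, 87), "large": (192, 263)}
B = {"small": (234, 270), "large": (55, 80)}

def rate(successes, patients):
    return successes / patients

for size in ("small", "large"):
    print(size, "A:", round(rate(*A[size]), 3), "B:", round(rate(*B[size]), 3))

overall_A = rate(sum(s for s, _ in A.values()), sum(n for _, n in A.values()))
overall_B = rate(sum(s for s, _ in B.values()), sum(n for _, n in B.values()))
print("overall A:", round(overall_A, 3), "B:", round(overall_B, 3))
# A wins in every stratum; B wins in aggregate: Simpson's Paradox
```

Aggregating over stone size hides the fact that treatment assignment was confounded with case difficulty, which is precisely what stratifying (or adding stone size as a predictor) corrects.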

Berkson, J. (1946). "Limitations of the Application of Fourfold Table Analysis to Hospital Data." Biometrics Bulletin, 2(3), 47-53. An early paper on a related confounding phenomenon: Berkson's paradox, where conditioning on a collider (a variable caused by both the exposure and the outcome) can create a spurious association between otherwise unrelated variables. This is relevant to the chapter's caveat about controlling for the wrong variables — not all confounders should be controlled, and controlling for colliders can introduce bias. Accessible and historically important.
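Berkson's mechanism can be reproduced in a short simulation (mine, not the paper's): two variables that are independent by construction become strongly negatively associated once a common consequence is held constant.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(size=n)                  # independent of y by construction
y = rng.normal(size=n)
c = x + y + 0.5 * rng.normal(size=n)    # collider: caused by both x and y

def slope_on_x(predictors):
    """Coefficient on x from an OLS regression of y on the given predictors."""
    A = np.column_stack([np.ones(n)] + predictors)
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

print("y ~ x     :", round(slope_on_x([x]), 3))      # near zero
print("y ~ x + c :", round(slope_on_x([x, c]), 3))   # strongly negative
```

"Controlling for" the collider manufactures an association out of nothing, which is why the chapter's caveat about choosing control variables matters.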

Westfall, J. and Yarkoni, T. (2016). "Statistically Controlling for Confounding Constructs Is Harder than You Think." PLoS ONE, 11(3), e0152719. A modern paper demonstrating that statistical control via multiple regression is often much less effective than researchers assume. When the confounders are measured with error (which they almost always are), the regression coefficients for the variables of interest are biased. This paper is a wake-up call for anyone who believes that adding control variables to a regression solves the confounding problem.
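Westfall and Yarkoni's point can be illustrated with a toy simulation (mine, not theirs): x has no causal effect on y, both are driven by a confounder c, and controlling for a noisy measurement of c removes only part of the bias.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
c = rng.normal(size=n)             # true confounder
x = c + rng.normal(size=n)         # x has NO causal effect on y
y = c + rng.normal(size=n)
c_obs = c + rng.normal(size=n)     # confounder measured with error

def coef_on_x(controls):
    """Coefficient on x from an OLS regression of y on x plus controls."""
    A = np.column_stack([np.ones(n), x] + controls)
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

print("no control     :", round(coef_on_x([]), 3))       # ~0.50, confounded
print("noisy control  :", round(coef_on_x([c_obs]), 3))  # ~0.33, still biased
print("perfect control:", round(coef_on_x([c]), 3))      # ~0.00, unbiased
```

Only the (unattainable) error-free measurement of the confounder fully removes the spurious effect; the noisy proxy leaves substantial residual confounding.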

Online Resources

Penn State STAT 501: "Regression Methods" https://online.stat.psu.edu/stat501/

Lessons 5-12 cover multiple regression, multicollinearity, model building, and diagnostics in detail. The interactive examples let you add and remove predictors and watch coefficients change in real time — a powerful demonstration of the concepts in this chapter. The multicollinearity lesson (Lesson 12) is particularly well done.

StatQuest: "Multiple Regression, Clearly Explained" https://www.youtube.com/watch?v=zITIFTsivN8

Josh Starmer's step-by-step visual explanation of multiple regression builds from simple regression to multiple predictors with animations showing how the regression plane fits data in three dimensions. His follow-up videos on $R^2$ vs. adjusted $R^2$, the F-test, and multicollinearity form a complete visual complement to this chapter. Previously recommended for Chapters 13, 20, and 22.

StatQuest: "Ridge, Lasso, and Elastic Net Regression" https://www.youtube.com/watch?v=Q81RR3yKn30

For students interested in how machine learning handles multicollinearity and model selection. Ridge regression (L2 regularization) addresses multicollinearity by penalizing large coefficients. Lasso regression (L1 regularization) can automatically set unimportant coefficients to exactly zero, performing variable selection. These are direct extensions of the multiple regression framework.
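Ridge regression even has a closed form: add $\lambda$ to the diagonal of $\mathbf{X}^T\mathbf{X}$ before solving, i.e. $\mathbf{b}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$. A hypothetical sketch on made-up collinear data (lasso has no closed form and needs an iterative solver):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)     # nearly collinear with x1
y = x1 + x2 + 0.5 * rng.normal(size=n)  # true coefficients: (1, 1)
X = np.column_stack([x1, x2])           # no intercept, for simplicity

def ridge(X, y, lam):
    """b = (X^T X + lam * I)^{-1} X^T y; lam = 0 is ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = ridge(X, y, 0.0)
b_ridge = ridge(X, y, 1.0)    # the penalty stabilizes the collinear pair
print("OLS  :", np.round(b_ols, 2))
print("ridge:", np.round(b_ridge, 2))
```

The penalty shrinks the wildly unstable individual estimates toward each other while leaving their well-determined sum essentially unchanged.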

Seeing Theory: Regression Analysis https://seeing-theory.brown.edu/regression-analysis/index.html

The interactive regression visualization from Brown University, recommended in Chapter 22, also demonstrates multiple regression concepts. You can add predictors and watch $R^2$ change, seeing firsthand that $R^2$ never decreases as predictors are added — the motivation for adjusted $R^2$.
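The same behavior can be checked numerically. This hypothetical sketch fits a model with and without a pure-noise predictor and applies the formula $R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, which penalizes each extra parameter:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
noise = rng.normal(size=n)          # predictor unrelated to y

def r2_and_adj(predictors):
    """R^2 and adjusted R^2 for an OLS fit of y on the given predictors."""
    A = np.column_stack([np.ones(n)] + predictors)
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    p = len(predictors)
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

r2_1, adj_1 = r2_and_adj([x])
r2_2, adj_2 = r2_and_adj([x, noise])
print(f"one predictor: R2={r2_1:.4f}  adj={adj_1:.4f}")
print(f"plus noise   : R2={r2_2:.4f}  adj={adj_2:.4f}")
# R2 can only go up; adjusted R2 rises only if the new
# predictor earns its keep.
```

Because adding a column can never increase the residual sum of squares, $R^2$ never falls, while the adjusted version trades that mechanical gain against the extra degree of freedom spent.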

UCLA Statistical Consulting: "Regression Diagnostics" https://stats.oarc.ucla.edu/stata/webbooks/reg/chapter2/regressionwith-stataresiduals-diagnostics/

Although written for Stata, this comprehensive guide to residual diagnostics is software-agnostic in its explanation of what to look for and how to interpret diagnostic plots. The visual examples of good and bad residual patterns are among the best available online.

Towards Data Science: "Multicollinearity — How to Detect and Deal with It" https://towardsdatascience.com/multicollinearity-how-to-detect-and-deal-with-it

A practical guide to multicollinearity with Python code, VIF calculations, and strategies for addressing high multicollinearity. Includes concrete examples of how multicollinearity affects coefficient stability.

For the Ethically Engaged

Angwin, J., Larson, J., Mattu, S., and Kirchner, L. (2016). "Machine Bias." ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

The investigative report that ignited the national debate about algorithmic bias in criminal justice. The ProPublica team analyzed the COMPAS algorithm — the real-world version of the algorithm James studies in this chapter's Case Study 2. They found that the algorithm's false positive rate was nearly twice as high for Black defendants as for White defendants, even after controlling for criminal history. This article demonstrates exactly the kind of multiple regression analysis (with ethical implications) discussed in Section 23.13.

Dressel, J. and Farid, H. (2018). "The Accuracy, Fairness, and Limits of Predicting Recidivism." Science Advances, 4(1), eaao5580. A study showing that untrained humans, recruited through a simple online survey, predicted recidivism about as accurately as the COMPAS algorithm. This raises a fundamental question about James's analysis: if a simple model with a few variables predicts as well as the algorithm, what justifies the algorithm's complexity (and its racial disparities)?

Chouldechova, A. (2017). "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." Big Data, 5(2), 153-163. A mathematical proof that it is impossible for a risk assessment algorithm to simultaneously satisfy three intuitive fairness criteria (calibration, predictive parity, and error rate balance) when the base rates of the outcome differ across groups. This impossibility result has profound implications for James's analysis — the "race effect" may partly reflect an unavoidable mathematical tradeoff rather than algorithmic bias.