Further Reading: Correlation and Simple Linear Regression

Books

For Deeper Understanding

Michael H. Kutner, Christopher J. Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models, 5th edition (2005) The definitive reference for regression at the intermediate level. Part I covers simple linear regression in exhaustive detail — estimation, inference, diagnostics, remedial measures, and matrix formulation. Part II extends to multiple regression. This textbook is widely used in second-semester statistics and econometrics courses. Chapters 1-4 provide the most thorough treatment of simple linear regression you'll find in any single source.

Samprit Chatterjee and Ali S. Hadi, Regression Analysis by Example, 5th edition (2012) A worked-example-driven approach to regression that's more accessible than Kutner et al. Each chapter revolves around a real dataset, building the theory through application. The chapter on residual analysis is particularly strong — it shows residual plot patterns for every kind of model violation with real data, not just idealized diagrams.

Francis J. Anscombe, "Graphs in Statistical Analysis," The American Statistician, 27(1), 17-21 (1973) The original paper presenting Anscombe's Quartet. At just five pages, it's one of the most influential short papers in statistics. Anscombe's argument — that numerical summaries alone can never substitute for graphical inspection of data — was revolutionary at the time and remains relevant today, especially as automated machine learning tools generate models without human visual inspection.

Jacob Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd edition (1988) Chapter 3 covers the correlation coefficient in depth, including power analysis for testing $H_0: \rho = 0$. Cohen's effect size benchmarks for $r$ (small = 0.10, medium = 0.30, large = 0.50) are used across the social sciences. Previously recommended in Chapters 17 and 20.
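Cohen's power calculations for $r$ can be approximated in a few lines using the Fisher $z$ transformation. This is a sketch, not Cohen's exact tables: the function name is ours, and the normal approximation ignores the negligible chance of rejecting in the wrong tail.

```python
import math
from statistics import NormalDist

def power_for_correlation(rho, n, alpha=0.05):
    """Approximate power of the two-tailed test of H0: rho = 0,
    via the Fisher z transformation (normal approximation)."""
    z_rho = math.atanh(rho)               # Fisher z of the true correlation
    se = 1.0 / math.sqrt(n - 3)           # standard error of Fisher z
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Rejection in the "wrong" tail is negligible for rho this far from 0
    return 1 - NormalDist().cdf(z_crit - z_rho / se)

# Cohen's medium effect (rho = 0.30): about n = 85 gives 80% power
print(round(power_for_correlation(0.30, 85), 2))   # ≈ 0.8
```

The approximation agrees closely with Cohen's tabled sample sizes for his small, medium, and large benchmarks.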

For the Conceptually Curious

Charles Wheelan, Naked Statistics: Stripping the Dread from the Data (2013) Wheelan's treatment of regression (Chapters 11-12) is brilliant for building intuition. His examples — predicting the height of Shaquille O'Neal's kids, understanding the "hot hand" in basketball, and explaining why the Sports Illustrated Jinx isn't real — make regression to the mean genuinely entertaining. This is the book to read if the formulas feel overwhelming and you need the concepts explained in plain English. Previously recommended for Chapters 12, 13, 18, and 20.

Stephen M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900 (1986) Stigler's masterful history devotes several chapters to the development of regression and correlation. He traces the intellectual lineage from Legendre's method of least squares (1805) through Gauss's formalization (1809) to Galton's regression to the mean (1886) and Pearson's correlation coefficient (1896). Understanding that these tools were invented to solve specific practical problems — predicting astronomical positions, studying heredity — gives them a human dimension that formulas alone lack.

David Freedman, Robert Pisani, and Roger Purves, Statistics, 4th edition (2007) Freedman et al.'s treatment of regression (Chapters 10-12) is renowned for its conceptual clarity. Their emphasis on the distinction between regression for description vs. regression for causation is particularly strong. The "regression fallacy" chapter is the best introductory treatment of regression to the mean in any textbook.

Articles and Papers

Galton, F. (1886). "Regression Towards Mediocrity in Hereditary Stature." Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263. The paper that gave regression its name. Galton's observation that tall parents tend to have children closer to the average height — that heights "regress toward mediocrity" — led to the entire field of regression analysis. The paper is readable, surprisingly modern in its reasoning, and historically important. Available free online through JSTOR.

Pearson, K. (1896). "Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity, and Panmixia." Philosophical Transactions of the Royal Society of London, Series A, 187, 253-318. Karl Pearson's formalization of the correlation coefficient that bears his name. While mathematically demanding, the first several pages lay out the conceptual framework clearly. Pearson extended Galton's bivariate work to the general case and established the notation ($r$) still used today.

Vigen, T. (2015). Spurious Correlations. New York: Hachette. Tyler Vigen's collection of hilarious and alarming spurious correlations — U.S. spending on science vs. suicides by hanging, margarine consumption vs. divorce rate in Maine, Nicolas Cage films vs. pool drownings. The book (and the website tylervigen.com) is the most entertaining and memorable demonstration of why correlation does not imply causation. Use it to impress friends at dinner parties and alarm them simultaneously.

Matejka, J., and Fitzmaurice, G. (2017). "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics Through Simulated Annealing." Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 1290-1294. A modern extension of Anscombe's Quartet. The authors created "The Datasaurus Dozen" — 13 datasets (including one shaped like a dinosaur) that all have the same mean, standard deviation, and correlation as Anscombe's data. The paper demonstrates even more dramatically that summary statistics can be completely uninformative about the shape of the data.

Kahneman, D. and Tversky, A. (1973). "On the Psychology of Prediction." Psychological Review, 80(4), 237-251. Kahneman and Tversky's seminal paper on regression to the mean as a cognitive bias. They show that people consistently fail to account for regression to the mean, leading to systematic errors in prediction. Their example of Israeli flight instructors (who punished poor performance and rewarded good performance, then attributed subsequent changes to their feedback rather than to regression) is a classic.

Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton: Princeton University Press. For students interested in using regression for causal inference in the social sciences. Angrist and Pischke explain instrumental variables, regression discontinuity, and difference-in-differences — methods that go beyond simple correlation to approach causal claims with observational data. Advanced, but the first three chapters are accessible and provide an excellent framework for thinking about what regression can and cannot tell us about causation.

Online Resources

Penn State STAT 501: "Regression Methods" https://online.stat.psu.edu/stat501/

A free, comprehensive online course covering simple and multiple regression. Lessons 1-5 parallel this chapter's content with clear explanations, worked examples, and interactive components. The residual diagnostics lessons are particularly well done, with side-by-side comparisons of good and bad residual plots.

StatQuest: "Linear Regression, Clearly Explained" https://www.youtube.com/watch?v=nk2CQITm_eo

Josh Starmer's YouTube explanation of least squares regression uses step-by-step animations to show how the line minimizes residuals. His follow-up videos on $R^2$, p-values in regression, and residual analysis form a complete visual complement to this chapter. Previously recommended for Chapters 13 and 20.

Seeing Theory: Regression https://seeing-theory.brown.edu/regression-analysis/index.html

An interactive visualization from Brown University that lets you add data points to a scatterplot and watch the regression line update in real time. You can drag points, add outliers, and see how they affect the slope, intercept, and $R^2$. Excellent for building intuition about how least squares works and why outliers matter.
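The outlier sensitivity the tool demonstrates is easy to reproduce numerically. The data below are made up for illustration: five points lying near $y = 2x + 1$, then the same points plus one extreme outlier.

```python
def least_squares(xs, ys):
    """Slope and intercept minimizing the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return b1, my - b1 * mx

# A clean linear pattern (hypothetical data): y is roughly 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
print(least_squares(xs, ys))                    # slope ≈ 2, intercept ≈ 1

# One extreme outlier drags the fitted line toward itself
print(least_squares(xs + [6], ys + [0.0]))      # slope collapses toward 0
```

Because every residual is squared, a single distant point dominates the criterion, which is exactly what dragging a point in the visualization shows.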

Gapminder https://www.gapminder.org/tools/

Hans Rosling's interactive data visualization tool lets you explore correlations between hundreds of global variables — GDP and life expectancy, education and fertility, income and child mortality. The animated time-series scatterplots are the most compelling demonstration of statistical relationships in any online tool. Use it for your progressive project.

Tyler Vigen's Spurious Correlations https://tylervigen.com/spurious-correlations

A continuously updated collection of spurious correlations found by mining large databases. Each example comes with a scatterplot and a correlation coefficient. It's simultaneously funny and educational — the perfect reminder that even $r > 0.95$ doesn't mean anything without a plausible causal mechanism.

Connections to Future Chapters

Chapter 23 (Multiple Regression): Simple regression uses one predictor. Multiple regression uses many, allowing you to control for confounding variables (at least partially). The key new concept is interpreting coefficients "holding other variables constant" — which addresses many of the lurking variable concerns from this chapter. The $R^2$ increases as you add useful predictors, and the variability decomposition ($SS_T = SS_{\text{Reg}} + SS_{\text{Res}}$) extends seamlessly.
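The decomposition $SS_T = SS_{\text{Reg}} + SS_{\text{Res}}$ can be checked numerically on any least squares fit. A minimal sketch, using a small made-up dataset:

```python
# Verify SS_T = SS_Reg + SS_Res for a simple least squares fit
# (hypothetical data chosen for illustration)
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.3, 9.7]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx
fitted = [b0 + b1 * x for x in xs]

ss_t = sum((y - my) ** 2 for y in ys)                    # total variability
ss_reg = sum((f - my) ** 2 for f in fitted)              # explained by the line
ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # left over
print(abs(ss_t - (ss_reg + ss_res)) < 1e-9)              # True: decomposition holds
print(ss_reg / ss_t)                                     # this ratio is R^2
```

The identity holds exactly (up to floating-point error) for any least squares fit, and it is the same identity that carries over to multiple regression.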

Chapter 24 (Logistic Regression): When the response variable is binary (yes/no, 0/1), linear regression produces nonsensical predictions (probabilities below 0 or above 1). Logistic regression replaces the straight line with a sigmoid curve and predicts log-odds rather than raw values. The interpretation changes — from "for each unit increase in $x$, $y$ changes by $b_1$" to "for each unit increase in $x$, the odds of success multiply by $e^{b_1}$" — but the core idea of modeling a relationship between variables remains the same.
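The two contrasts in that paragraph, bounded predictions and multiplicative odds, can be sketched directly. The coefficients below are hypothetical, chosen only to illustrate the interpretation:

```python
import math

# Hypothetical fitted coefficients, for illustration only
b0, b1 = -3.0, 0.8

def predicted_probability(x):
    """Sigmoid of the linear predictor: stays between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x):
    p = predicted_probability(x)
    return p / (1 - p)

# A one-unit increase in x multiplies the odds by e^{b1}, at any x
print(odds(3) / odds(2))     # ≈ e^0.8 ≈ 2.23
print(math.exp(b1))          # ≈ 2.23
```

The ratio is constant because the log-odds are linear in $x$: $\ln(\text{odds}) = b_0 + b_1 x$, so a one-unit step adds $b_1$ to the log-odds and multiplies the odds by $e^{b_1}$.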

Chapter 25 (Communicating with Data): Regression results need to be communicated clearly to non-technical audiences. Chapter 25 covers how to present scatterplots, regression lines, and $R^2$ in reports and presentations. The slope interpretation template from this chapter ("for each one-unit increase in...") becomes the standard for writing results sections.

Chapter 26 (Statistics and AI): Machine learning prediction models are, at their core, extensions of regression. Linear regression minimizes squared error; neural networks minimize a loss function (often squared error or cross-entropy). The weights in a neural network are analogous to regression coefficients. Understanding regression — its assumptions, its limitations, its diagnostic tools — gives you the conceptual foundation for understanding any prediction model.

Historical Note: From Legendre to Galton — The Birth of Regression

The method of least squares was invented twice. The French mathematician Adrien-Marie Legendre published it first, in 1805, as a method for fitting lines to astronomical observations. The German mathematician Carl Friedrich Gauss claimed he had been using the method since 1795 but hadn't published it; he provided a formal justification in 1809 using probability theory. The priority dispute was bitter, but both contributed essential ideas: Legendre's was the practical algorithm, Gauss's was the theoretical foundation.

Neither Legendre nor Gauss used the word "regression." That came from Sir Francis Galton in 1886. Galton was studying heredity — specifically, the heights of parents and their children. He noticed that children of very tall parents tended to be tall but not as tall as their parents, and children of very short parents tended to be short but not as short. Heights "regressed toward mediocrity" (the population mean) across generations.

Galton's student Karl Pearson formalized the correlation coefficient in 1896, establishing the notation $r$ (for "regression") that we still use. The full framework — correlation, regression, least squares, residuals — was essentially complete by 1900.

The irony is that "regression" — the word — describes a specific phenomenon (regression to the mean), but "regression" — the statistical method — is used for any linear prediction problem, most of which have nothing to do with regression to the mean. The name stuck, though, and 140 years later, we still call it regression even when we're predicting house prices from square footage or exam scores from study hours.

Understanding this history enriches your appreciation of regression as more than a calculation. It's a way of thinking about relationships, prediction, and the fundamental challenge of separating signal from noise in data.