Exercises: Correlation and Simple Linear Regression
These exercises progress from conceptual understanding through hand calculations, Python implementation, residual diagnostics, interpretation, and applied analysis. Estimated completion time: 3.5 hours.
Difficulty Guide:

- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)
Part A: Conceptual Understanding ⭐
A.1. In your own words, explain why you should always create a scatterplot before calculating a correlation coefficient. Use Anscombe's Quartet to support your answer.
A.2. True or false (explain each):
(a) A correlation of $r = -0.85$ indicates a weaker relationship than $r = 0.60$.
(b) If $r = 0$, there is no relationship between $x$ and $y$.
(c) The correlation between height in inches and weight in pounds is different from the correlation between height in centimeters and weight in kilograms.
(d) If $x$ causes $y$, then $r$ must be large.
(e) The regression line always passes through the point $(\bar{x}, \bar{y})$.
(f) A residual of $-5$ means the model over-predicted by 5 units.
A.3. A journalist writes: "A new study found a correlation of $r = 0.72$ between hours spent on social media and symptoms of depression among teenagers. This proves that social media causes depression." Identify all the errors in this statement.
A.4. For each of the following, identify at least one lurking variable that could explain the correlation:
(a) Cities with more hospitals have higher death rates.
(b) Countries that consume more chocolate per capita win more Nobel Prizes.
(c) Students who eat breakfast more frequently have higher GPAs.
(d) States with more gun shops have lower crime rates.
A.5. Explain the difference between interpolation and extrapolation. Which is safer for making predictions? Why?
A.6. A basketball coach says: "Our star player scored 42 points last game — her career high! I expect her to keep scoring around 40 points." Using the concept of regression to the mean, explain why the coach's expectation is likely too optimistic.
Part B: Interpreting Scatterplots and Correlation ⭐
B.1. For each description below, estimate $r$ (choose from: $-0.95$, $-0.50$, $0$, $+0.50$, $+0.95$):
(a) A scatterplot of temperature vs. ice cream sales shows a clear upward trend with data points tightly clustered around a line.
(b) A scatterplot of shoe size vs. IQ shows no discernible pattern — just a cloud of points.
(c) A scatterplot of car age (years) vs. resale value shows a clear downward trend with moderate scatter.
(d) A scatterplot of height and weight for adults shows a general upward trend but with considerable scatter.
(e) A scatterplot of study hours vs. hours of sleep shows a tight downward trend.
B.2. Match each $r$ value to the most likely pair of variables:
| $r$ | Variables |
|---|---|
| $+0.95$ | (i) Height and shoe size for adult women |
| $+0.72$ | (ii) Temperature and ice cream sales |
| $-0.85$ | (iii) Number of absences and final exam score |
| $+0.03$ | (iv) Birth month and salary |
| $-0.40$ | (v) Age and flexibility |
B.3. A scatterplot shows a strong U-shaped relationship between age and happiness (happiness is high for young people, dips in middle age, then rises again). The correlation coefficient is $r = 0.05$. Explain why $r$ fails to capture the relationship.
B.4. Explain why the Pearson correlation coefficient is inappropriate for ordinal data. What alternative measure might you use? (Hint: think about what was discussed regarding ordinal variables in Chapter 2.)
Part C: Hand Calculations ⭐⭐
C.1. Calculate $r$ by hand for the following data:
| $x$ | 1 | 3 | 5 | 7 | 9 |
|---|---|---|---|---|---|
| $y$ | 10 | 18 | 24 | 30 | 38 |
Show all work using the formula $r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}$.
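After finishing the hand calculation, you can check your arithmetic with a short numpy sketch (this mirrors the formula above term by term; it is a check, not a substitute for showing your work):

```python
import numpy as np

x = np.array([1, 3, 5, 7, 9])
y = np.array([10, 18, 24, 30, 38])

# deviations from the means, as in the numerator and denominator of r
dx, dy = x - x.mean(), y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())
print(round(r, 4))
```

If your hand value disagrees with the printed value, recheck the deviation products first; that is where sign errors usually creep in.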
C.2. Using the same data from C.1:
(a) Calculate the slope $b_1$ and intercept $b_0$ of the least squares regression line.
(b) Write the regression equation.
(c) Calculate $R^2$ and interpret it.
(d) Predict $y$ when $x = 6$.
(e) Calculate the residual for the observation $(5, 24)$.
C.3. A researcher collects the following data on advertising spending (in thousands of dollars) and monthly revenue (in thousands of dollars):
| Ad Spending ($x$) | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| Revenue ($y$) | 50 | 55 | 65 | 70 | 80 |
(a) Create a scatterplot (sketch by hand or use Python).
(b) Calculate $r$.
(c) Find the regression equation $\hat{y} = b_0 + b_1 x$.
(d) Interpret the slope in context: "For each additional ______, the predicted ______ increases by ______."
(e) Interpret the intercept. Does it make sense in this context?
(f) Predict revenue for an ad spending of \$7,000. Is this interpolation or extrapolation?
(g) Predict revenue for an ad spending of \$25,000. Is this interpolation or extrapolation? How confident are you in this prediction?
C.4. Given the following summary statistics for two variables:
$n = 20$, $\bar{x} = 50$, $\bar{y} = 100$, $s_x = 10$, $s_y = 25$, $r = 0.80$
(a) Calculate the slope $b_1$ using $b_1 = r \cdot s_y / s_x$.
(b) Calculate the intercept $b_0$.
(c) Write the regression equation.
(d) How would the slope change if the correlation were $r = -0.80$ instead? What would that mean?
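The identities used in C.4 take only a few lines to verify; here is a sketch working purely from the given summary statistics:

```python
# Slope and intercept from summary statistics alone (no raw data needed)
n, xbar, ybar = 20, 50, 100
sx, sy, r = 10, 25, 0.80

b1 = r * sy / sx          # slope: r scaled by the ratio of standard deviations
b0 = ybar - b1 * xbar     # intercept: the line passes through (x-bar, y-bar)
print(b1, b0)             # 2.0 0.0
```

Note how the intercept formula is just the point-slope condition at $(\bar{x}, \bar{y})$, the same fact tested in A.2(e).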
Part D: Python Implementation ⭐⭐
D.1. The following data shows the average daily temperature (in degrees Fahrenheit) and the number of ice cream cones sold at a beach stand:
temp = [65, 70, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95]
cones = [105, 130, 140, 155, 170, 185, 195, 215, 235, 245, 260, 280]
(a) Create a scatterplot with labeled axes and title.
(b) Calculate $r$ using scipy.stats.pearsonr() and numpy.corrcoef(). Verify they give the same answer.
(c) Fit a regression line using scipy.stats.linregress() and add it to your scatterplot.
(d) Interpret the slope and $R^2$ in context.
(e) Create a residual plot. Does the linear model seem appropriate?
(f) Predict ice cream sales when the temperature is 83°F. How about 120°F? Which prediction do you trust more?
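A starter scaffold for D.1 (a sketch only: the axis labels, interpretations, and residual plot in parts (d)-(e) are yours to complete):

```python
import matplotlib
matplotlib.use("Agg")   # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

temp = [65, 70, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95]
cones = [105, 130, 140, 155, 170, 185, 195, 215, 235, 245, 260, 280]

# (b) two ways to compute r -- they should agree
r_scipy, p_value = stats.pearsonr(temp, cones)
r_numpy = np.corrcoef(temp, cones)[0, 1]

# (c) least squares fit
fit = stats.linregress(temp, cones)

# (a) + (c) scatterplot with the fitted line overlaid
plt.scatter(temp, cones)
xs = np.linspace(min(temp), max(temp), 100)
plt.plot(xs, fit.intercept + fit.slope * xs)
plt.xlabel("Temperature (°F)")
plt.ylabel("Cones sold")
plt.title("Ice cream sales vs. temperature")

# (f) one in-range and one far-out-of-range prediction to compare
pred_83 = fit.intercept + fit.slope * 83
pred_120 = fit.intercept + fit.slope * 120
```

When you answer (f), note that 83°F lies inside the observed range (65-95) while 120°F does not.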
D.2. Use seaborn.regplot() to create a scatterplot with regression line and 95% confidence band for the temperature/ice cream data from D.1. Describe what the confidence band tells you.
D.3. Use statsmodels OLS to fit the regression from D.1. From the summary output, identify:
(a) The slope and its p-value
(b) The 95% confidence interval for the slope
(c) $R^2$ and adjusted $R^2$
(d) The F-statistic and its p-value
D.4. Write Python code to:
(a) Generate Anscombe's Quartet (all four datasets)
(b) Verify that all four have approximately the same $r$, same regression line, and same $R^2$
(c) Create a 2x2 grid of scatterplots with regression lines
(d) Explain in a comment what this demonstrates
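A starting point for D.4: the sketch below hardcodes Anscombe's published (1973) values so it runs offline (seaborn also ships the quartet as a built-in dataset, but loading it requires a network fetch). Parts (c) and (d) are left for you.

```python
import numpy as np

# Anscombe's quartet, from the published paper; datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8]*7 + [19] + [8]*3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# (b) all four share r ≈ 0.816 and essentially the same line, y ≈ 3.00 + 0.50x
for name, (x, y) in quartet.items():
    r = np.corrcoef(x, y)[0, 1]
    b1, b0 = np.polyfit(x, y, 1)   # slope first, then intercept
    print(f"{name}: r = {r:.3f},  y-hat = {b0:.2f} + {b1:.2f}x")
```

For part (c), loop over `quartet.items()` inside a `plt.subplots(2, 2)` grid, drawing each scatterplot with its fitted line.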
Part E: Anchor Example Applications ⭐⭐
E.1. (Maya — Public Health) Maya has data on 30 counties showing the percentage of residents without health insurance ($x$) and the infant mortality rate per 1,000 live births ($y$). She finds $r = 0.68$, $b_1 = 0.15$, $b_0 = 3.2$.
(a) Interpret the slope in context.
(b) Interpret $R^2$ in context.
(c) A county with 15% uninsured residents has a predicted infant mortality rate of ______ .
(d) Maya's colleague says "This proves that expanding insurance coverage will reduce infant mortality." Is this statement justified? Identify at least two lurking variables.
(e) What kind of study design would be needed to establish that insurance coverage causes lower infant mortality?
E.2. (Alex — StreamVibe) Alex runs a regression of average session length (minutes, $y$) on the number of recommendations shown per page ($x$, ranging from 5 to 25). The regression equation is $\hat{y} = 10.5 + 1.8x$, with $R^2 = 0.45$.
(a) Interpret the slope.
(b) How many minutes of session length does the model predict for a page showing 15 recommendations?
(c) How many minutes for a page showing 50 recommendations? What's wrong with this prediction?
(d) $R^2 = 0.45$. What percentage of the variability in session length is not explained by recommendations shown?
(e) Name two factors that might explain the remaining 55% of variability.
E.3. (James — Criminal Justice) James examines the relationship between an algorithm's predicted risk score ($x$, 1-10) and the defendant's actual number of prior arrests ($y$).
(a) If $r = 0.82$, interpret this correlation in context.
(b) If the regression equation is $\hat{y} = -0.5 + 0.8x$, what does the slope mean?
(c) At risk score = 1, the predicted number of prior arrests is $\hat{y} = -0.5 + 0.8(1) = 0.3$. At risk score = 0, $\hat{y} = -0.5$. Is the intercept meaningful here?
(d) James finds that $r = 0.82$ overall, but when he looks at defendants under age 25, $r = 0.55$, and for defendants over 45, $r = 0.90$. What might explain this difference?
Part F: Residual Analysis and Model Diagnostics ⭐⭐⭐
F.1. For each residual plot pattern described below, identify the problem and suggest a solution:
(a) Residuals show a clear U-shaped pattern when plotted against predicted values.
(b) Residuals fan out (become more spread) as predicted values increase.
(c) Residuals show no pattern, but one point has a residual of $+25$ while all others are between $-5$ and $+5$.
(d) Residuals form two distinct horizontal clusters, one at $+10$ and one at $-10$.
F.2. A regression model predicting house prices from square footage gives the following residual statistics: mean = 0, standard deviation = \$15,000. A house with 2,000 square feet has a predicted price of \$350,000 and an actual price of \$390,000.
(a) What is the residual for this house?
(b) How many standard deviations of residuals is this from zero?
(c) Would you call this an outlier? Why or why not?
F.3. Write Python code to create a comprehensive residual diagnostic for any regression model. Your function should produce four plots: (1) residuals vs. predicted values, (2) histogram of residuals, (3) QQ-plot of residuals, (4) residuals vs. order of observation (to check for time-related patterns).
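One possible skeleton for F.3 (a sketch of the four-panel layout; refine the styling and add your own interpretation of each panel):

```python
import matplotlib
matplotlib.use("Agg")   # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def residual_diagnostics(y_actual, y_predicted):
    """Four standard residual plots for any fitted regression model."""
    resid = np.asarray(y_actual) - np.asarray(y_predicted)
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # (1) residuals vs. predicted values -- look for curvature or fanning
    axes[0, 0].scatter(y_predicted, resid)
    axes[0, 0].axhline(0, color="gray")
    axes[0, 0].set(xlabel="Predicted", ylabel="Residual")

    # (2) histogram of residuals -- look for rough symmetry
    axes[0, 1].hist(resid, bins="auto")
    axes[0, 1].set(xlabel="Residual", ylabel="Count")

    # (3) QQ-plot -- look for points near the reference line
    stats.probplot(resid, plot=axes[1, 0])

    # (4) residuals in observation order -- look for drift or cycles
    axes[1, 1].plot(resid, marker="o")
    axes[1, 1].axhline(0, color="gray")
    axes[1, 1].set(xlabel="Observation order", ylabel="Residual")

    fig.tight_layout()
    return fig
```

Try it on the D.1 data first, where the residuals should look healthy, before applying it to messier datasets.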
Part G: Regression to the Mean ⭐⭐⭐
G.1. A school identifies the 20 lowest-performing students on a standardized test and enrolls them in a tutoring program. On the retest, the average score improves by 8 points. The school celebrates the program's success.
(a) Explain how regression to the mean might account for some (or all) of this improvement.
(b) What research design would allow the school to determine whether the tutoring program genuinely helped?
(c) If the correlation between first and second test scores is $r = 0.70$, and both tests have mean 500 with $s = 100$, what retest score would regression to the mean predict for a student who scored 400 on the first test?
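The prediction in G.1(c) follows the pattern $\hat{y} = \bar{y} + r \cdot z_x \cdot s_y$. The sketch below illustrates it with a different first-test score (450, not the exercise's 400, so the answer is left to you) and assumes both tests share the stated mean and standard deviation:

```python
# Regression-to-the-mean prediction, illustrated with a score of 450
# (assumes both tests have mean 500 and SD 100, as stated in the problem)
mean, sd, r = 500, 100, 0.70
first_score = 450

z_first = (first_score - mean) / sd   # standardized first score: -0.5
predicted = mean + r * z_first * sd   # retest score pulled toward the mean
print(predicted)                      # 465.0
```

Notice that the predicted score recovers only a fraction $r$ of the distance back toward the mean; with $r = 1$ there would be no regression effect at all.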
G.2. A pharmaceutical company tests a new blood pressure medication. They recruit patients whose blood pressure measurements exceeded 150 mmHg on a screening visit. After 4 weeks of medication, the average blood pressure drops to 142 mmHg.
(a) Explain why regression to the mean makes this result less impressive than it appears.
(b) If a placebo group's average dropped to 147 mmHg over the same period, what is the estimated treatment effect after accounting for regression to the mean?
G.3. Explain the "Sports Illustrated Jinx" using the concept of regression to the mean. Why is it that athletes who appear on the cover often have worse performance afterward, even though the cover itself has no effect?
Part H: Correlation vs. Causation ⭐⭐⭐
H.1. For each of the following correlations, identify the most likely explanation (direct causation, reverse causation, common cause/confounding, or coincidence) and explain your reasoning:
(a) Higher education level is correlated with higher income.
(b) Cities with more fire trucks have more fires.
(c) Countries with higher per capita wine consumption have lower rates of heart disease.
(d) The number of films Nicolas Cage appeared in per year is correlated with the number of drownings in swimming pools.
(e) Hospitals that perform more surgeries have higher mortality rates.
H.2. A health website claims: "People who eat organic food are healthier, proving that organic food prevents disease." A study found $r = 0.45$ between organic food consumption and a health index score.
(a) List three confounding variables that could explain this correlation.
(b) Design an experiment that could test whether organic food causes better health outcomes.
(c) Why might such an experiment be impractical or unethical?
H.3. Consider the well-established correlation between smoking and lung cancer ($r$ is strongly positive).
(a) In the 1950s, the tobacco industry argued that the correlation was due to a confounding variable — perhaps a gene that predisposes people to both smoking and cancer. Why wasn't this argument sufficient to dismiss the causal claim?
(b) List at least three lines of evidence (beyond correlation) that epidemiologists used to establish causation. (Hint: think about Bradford Hill's criteria — temporal precedence, dose-response, consistency, biological plausibility, cessation effects.)
Part I: Comprehensive Analysis ⭐⭐⭐⭐
I.1. The following data shows the percentage of residents living below the poverty line and the crime rate (per 1,000 residents) for 20 neighborhoods:
poverty_pct = [5.2, 8.1, 12.4, 3.7, 18.6, 15.3, 7.5, 22.1, 10.8,
19.4, 6.3, 14.1, 20.7, 4.8, 16.5, 25.3, 9.2, 21.8,
13.3, 17.9]
crime_rate = [12, 18, 28, 8, 42, 35, 15, 52, 24,
45, 14, 30, 48, 10, 38, 58, 20, 50,
26, 40]
Conduct a complete regression analysis:
(a) Create a scatterplot.
(b) Calculate and interpret $r$.
(c) Fit the regression line and interpret slope and intercept.
(d) Report $R^2$ and interpret it.
(e) Create a residual plot and check the LINE conditions.
(f) Predict the crime rate for a neighborhood with 11% poverty. Is this interpolation or extrapolation?
(g) Predict the crime rate for a neighborhood with 40% poverty. Is this interpolation or extrapolation? How confident are you?
(h) Identify at least two lurking variables that could explain the correlation between poverty and crime.
(i) Write a one-paragraph summary of your findings, being careful to distinguish association from causation.
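A starter scaffold for parts (a)-(e) of I.1 (a sketch: the interpretations, the lurking-variable discussion, and the summary paragraph are yours to write):

```python
import matplotlib
matplotlib.use("Agg")   # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

poverty_pct = [5.2, 8.1, 12.4, 3.7, 18.6, 15.3, 7.5, 22.1, 10.8,
               19.4, 6.3, 14.1, 20.7, 4.8, 16.5, 25.3, 9.2, 21.8,
               13.3, 17.9]
crime_rate = [12, 18, 28, 8, 42, 35, 15, 52, 24,
              45, 14, 30, 48, 10, 38, 58, 20, 50,
              26, 40]

# (b)-(d) correlation, fit, and R^2
fit = stats.linregress(poverty_pct, crime_rate)
predicted = fit.intercept + fit.slope * np.array(poverty_pct)
residuals = np.array(crime_rate) - predicted
print(f"r = {fit.rvalue:.3f}, R^2 = {fit.rvalue**2:.3f}")
print(f"crime-hat = {fit.intercept:.2f} + {fit.slope:.2f} * poverty")

# (a) + (e) scatterplot with fitted line, and residual plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(poverty_pct, crime_rate)
xs = np.linspace(min(poverty_pct), max(poverty_pct), 100)
ax1.plot(xs, fit.intercept + fit.slope * xs)
ax1.set(xlabel="Poverty rate (%)", ylabel="Crime rate (per 1,000)")
ax2.scatter(predicted, residuals)
ax2.axhline(0, color="gray")
ax2.set(xlabel="Predicted crime rate", ylabel="Residual")
```

For (f) and (g), compare each poverty value against the observed range (3.7% to 25.3%) before deciding whether the prediction is interpolation or extrapolation.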
I.2. Download a real dataset of your choice (from Gapminder, the World Bank, or a similar source) that includes at least two numerical variables. Conduct a complete correlation and regression analysis including:
- Scatterplot with regression line
- Correlation coefficient and interpretation
- Regression equation with slope and intercept interpretations
- $R^2$ interpretation
- Residual diagnostics (all four plots from F.3)
- Discussion of correlation vs. causation
- Identification of lurking variables
- Assessment of extrapolation risk
Present your analysis as a mini-report with sections: Introduction, Methods, Results, Discussion.
Part J: Connecting the Dots ⭐⭐⭐
J.1. Explain the connection between $R^2$ in regression and $\eta^2$ in ANOVA (Chapter 20). How are they conceptually similar? How does the decomposition $SS_T = SS_{\text{Reg}} + SS_{\text{Res}}$ relate to $SS_T = SS_B + SS_W$?
J.2. In Chapter 13, you learned that a hypothesis test asks: "Is this result unlikely to have occurred by chance alone?" How does this apply to testing whether the slope $b_1$ is significantly different from zero? What are $H_0$ and $H_a$ for this test?
J.3. In Chapter 4, you learned that randomized experiments establish causation while observational studies can only establish association. How does this principle apply to interpreting regression results? Give an example where a regression slope has a causal interpretation and one where it does not.
J.4. A classmate says: "I ran a regression and got $R^2 = 0.95$, so my model must be correct." Using what you've learned about residual plots and Anscombe's Quartet, explain why this claim could be wrong.
Part K: Excel Practice ⭐⭐
K.1. Using the data from exercise C.3 in Excel:
(a) Create a scatterplot and add a linear trendline. Display the equation and $R^2$ on the chart.
(b) Use the CORREL, SLOPE, INTERCEPT, and RSQ functions to calculate $r$, $b_1$, $b_0$, and $R^2$. Verify they match the trendline values.
(c) Use the FORECAST.LINEAR function (or FORECAST) to predict revenue for an ad spending of \$7,000.
K.2. In the same Excel workbook, create a residual column by computing $y - \hat{y}$ for each observation. Create a scatterplot of residuals vs. predicted values. Does the residual plot suggest any problems with the linear model?