Exercises: Multiple Regression
These exercises progress from conceptual understanding through interpretation of output, Python implementation, diagnostics, and applied analysis. Estimated completion time: 3.5 hours.
Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)
Part A: Conceptual Understanding ⭐
A.1. In your own words, explain why a simple regression of $y$ on $x_1$ might give a different coefficient for $x_1$ than a multiple regression of $y$ on $x_1$ and $x_2$. Give a real-world example.
A.2. True or false (explain each):
(a) Adding a new predictor to a multiple regression model always increases $R^2$.
(b) Adding a new predictor to a multiple regression model always increases adjusted $R^2$.
(c) If the F-test is significant, every individual predictor in the model must be significant.
(d) A VIF of 8.5 means the predictor is useless and should be removed.
(e) For a categorical variable with 4 categories, you need 4 indicator (dummy) variables.
(f) The phrase "holding other variables constant" means we actually hold those variables fixed in an experiment.
A.3. A researcher fits two models to predict college GPA from high school data:
- Model 1: $\widehat{\text{GPA}} = 1.2 + 0.5 \times \text{SAT}$ (where SAT is in hundreds), $R^2 = 0.32$
- Model 2: $\widehat{\text{GPA}} = 0.8 + 0.3 \times \text{SAT} + 0.4 \times \text{HS\_GPA}$, $R^2 = 0.48$, Adj. $R^2 = 0.47$
(a) Why did the SAT coefficient drop from 0.5 to 0.3?
(b) What does this tell you about the relationship between SAT scores and high school GPA?
(c) Should the researcher prefer Model 1 or Model 2? Why?
A.4. Explain the difference between $R^2$ and adjusted $R^2$. Why does adjusted $R^2$ sometimes decrease when a new variable is added? Under what conditions would this happen?
A.5. What is Simpson's Paradox? How does multiple regression help address it?
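To build intuition for A.5, here is a minimal synthetic sketch (the numbers are invented for illustration): within each group the trend is positive, but pooling the groups reverses the sign.

```python
import numpy as np

# Hypothetical data: two groups, each with a clear positive trend
x_a, y_a = np.array([1, 2, 3]), np.array([10, 11, 12])
x_b, y_b = np.array([6, 7, 8]), np.array([2, 3, 4])

slope_a = np.polyfit(x_a, y_a, 1)[0]   # within group A: +1.0
slope_b = np.polyfit(x_b, y_b, 1)[0]   # within group B: +1.0

# Pooling the groups reverses the sign of the trend
x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])
slope_all = np.polyfit(x_all, y_all, 1)[0]   # negative

print(slope_a, slope_b, slope_all)
```

Including the group variable in a multiple regression recovers the within-group slope, which is the sense in which regression "addresses" the paradox.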
A.6. A study finds that communities with more fire stations have more fires. A politician concludes that "fire stations cause fires." Identify the confounding variable and explain how multiple regression could clarify the relationship.
Part B: Interpreting Multiple Regression Output ⭐⭐
B.1. A researcher predicts apartment rent (in dollars per month) from three variables. Here is the output:
                 coef    std err          t      P>|t|     [0.025     0.975]
----------------------------------------------------------------------------
Intercept      285.00      68.42      4.166      0.000     150.50     419.50
sqft             0.92       0.05     18.400      0.000       0.82       1.02
bedrooms       125.40      22.18      5.654      0.000      81.88     168.92
downtown         1.35       0.18      7.500      0.000       0.99       1.71
R² = 0.847 Adj. R² = 0.844 F = 284.5 p(F) < 0.001
(Note: downtown is the distance from downtown in miles.)
(a) Write the regression equation.
(b) Interpret the coefficient for sqft in context.
(c) Interpret the coefficient for bedrooms in context.
(d) A student says: "The coefficient for sqft (0.92) is much smaller than for bedrooms (125.40), so bedrooms are more important." What's wrong with this reasoning?
(e) What does the $R^2$ value tell you?
(f) The coefficient for downtown is positive. Does this make sense? What might be happening?
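Tables like the one above are what statsmodels prints from a fitted model. The sketch below uses made-up rent data with invented coefficients (they will not reproduce the table above); it only shows where such output comes from.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical rent data; coefficients chosen arbitrarily for illustration
rng = np.random.default_rng(0)
n = 150
df = pd.DataFrame({
    'sqft': rng.uniform(400, 1800, n),
    'bedrooms': rng.integers(1, 4, n),
    'downtown': rng.uniform(0.5, 15, n),   # distance in miles
})
df['rent'] = (300 + 1.1 * df['sqft'] + 100 * df['bedrooms']
              + 5 * df['downtown'] + rng.normal(0, 80, n))

model = smf.ols('rent ~ sqft + bedrooms + downtown', data=df).fit()
print(model.summary())                                   # coef, std err, t, p, CI table
print(model.rsquared, model.rsquared_adj, model.fvalue)  # R², Adj. R², F
```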
B.2. A health researcher predicts systolic blood pressure from:
                      coef    std err          t      P>|t|
-----------------------------------------------------------
Intercept           98.540      5.231     18.838      0.000
age                  0.452      0.058      7.793      0.000
bmi                  1.215      0.182      6.676      0.000
exercise_hours      -2.831      0.924     -3.064      0.003
C(smoker)[T.Yes]     4.667      1.845      2.530      0.013
R² = 0.583 Adj. R² = 0.571 F = 48.72 p(F) < 0.001
(a) Interpret each coefficient in context using the "holding other variables constant" template.
(b) What is the reference category for the smoking variable?
(c) Predict the blood pressure for a 45-year-old non-smoker with BMI 28 who exercises 3 hours per week.
(d) Why is the coefficient for exercise_hours negative? Does this make practical sense?
(e) What percentage of the variability in blood pressure is not explained by this model?
B.3. Compare these two regression outputs for predicting house price:
| | Simple Model | Multiple Model |
|---|---|---|
| Intercept | 150,000 | 85,000 |
| Square footage coefficient | 185 | 120 |
| Number of bathrooms | — | 22,500 |
| School district rating | — | 8,500 |
| $R^2$ | 0.62 | 0.81 |
| Adjusted $R^2$ | 0.62 | 0.80 |
(a) Why did the square footage coefficient decrease from 185 to 120?
(b) Interpret the school district rating coefficient.
(c) A homeowner adds a bathroom (increasing from 2 to 3). Based on this model, by how much would you predict the home's value to increase, holding square footage and school rating constant?
Part C: Adjusted $R^2$ and Model Comparison ⭐⭐
C.1. A researcher fits four models predicting employee salary:
| Model | Predictors | $R^2$ | Adj. $R^2$ | F-statistic | p(F) |
|---|---|---|---|---|---|
| 1 | Years experience | 0.48 | 0.47 | 88.3 | < 0.001 |
| 2 | Years exp. + Education level | 0.61 | 0.60 | 74.2 | < 0.001 |
| 3 | Years exp. + Education + Department | 0.68 | 0.66 | 48.1 | < 0.001 |
| 4 | Years exp. + Education + Dept. + Shoe size | 0.681 | 0.655 | 36.2 | < 0.001 |
(a) Which model would you recommend? Why?
(b) What happened when shoe size was added in Model 4? What does the change in adjusted $R^2$ tell you?
(c) If Department has 5 categories, how many dummy variables were included in Model 3?
(d) Why does the F-statistic decrease from Model 1 to Model 4 even though $R^2$ increases?
C.2. A dataset has $n = 30$ observations. A researcher fits a model with $k = 8$ predictors and gets $R^2 = 0.75$.
(a) Calculate adjusted $R^2$ using the formula: $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$
(b) Is the gap between $R^2$ and adjusted $R^2$ concerning? Why?
(c) Does this model follow the "10-15 observations per predictor" guideline?
(d) What would you recommend the researcher do?
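The adjusted $R^2$ formula from part (a) is easy to wrap in a helper function. To avoid giving away the answer to C.2, the check below plugs in Student A's values from exercise C.3 instead.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors:
    1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative check using Student A's values from exercise C.3:
print(round(adjusted_r2(0.55, n=100, k=3), 2))   # 0.54
```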
C.3. Two students each build a model to predict exam scores. Both have $n = 100$ observations.
- Student A uses 3 predictors and gets $R^2 = 0.55$, adjusted $R^2 = 0.54$.
- Student B uses 15 predictors and gets $R^2 = 0.62$, adjusted $R^2 = 0.56$.
Who has the better model? Justify your answer.
Part D: Multicollinearity ⭐⭐
D.1. A researcher studying home prices calculates VIF for each predictor:
| Variable | VIF |
|---|---|
| Square footage | 4.2 |
| Number of rooms | 12.8 |
| Number of bedrooms | 8.5 |
| Lot size (acres) | 2.1 |
| Year built | 1.8 |
(a) Which predictors have problematic multicollinearity?
(b) Why might "number of rooms" and "number of bedrooms" have high VIF?
(c) Suggest two strategies the researcher could use to address this problem.
D.2. Explain why including both temperature in Fahrenheit and temperature in Celsius as predictors would cause the model to fail. What type of multicollinearity is this?
D.3. A dataset has three predictors: $x_1$, $x_2$, and $x_3$. The correlation matrix is:
| | $x_1$ | $x_2$ | $x_3$ |
|---|---|---|---|
| $x_1$ | 1.00 | 0.92 | 0.15 |
| $x_2$ | 0.92 | 1.00 | 0.18 |
| $x_3$ | 0.15 | 0.18 | 1.00 |
(a) Which pair of predictors is likely to cause multicollinearity problems?
(b) If both $x_1$ and $x_2$ have large individual correlations with $y$, but in the multiple regression only one is significant, why might this happen?
(c) If you had to choose between $x_1$ and $x_2$, what criteria would you use?
Part E: Dummy Variables ⭐⭐
E.1. A model predicts starting salary (in thousands) using years of education, years of experience, and gender:
                     coef    std err          t      P>|t|
----------------------------------------------------------
Intercept          18.250      2.840      6.426      0.000
education           2.870      0.310      9.258      0.000
experience          1.540      0.185      8.324      0.000
C(gender)[T.Male]   3.180      0.920      3.457      0.001
(a) What is the reference category for gender?
(b) Interpret the gender coefficient.
(c) Predict the starting salary for a female with 16 years of education and 5 years of experience.
(d) Predict the starting salary for a male with 16 years of education and 5 years of experience.
(e) Does this model prove that gender causes salary differences? Why or why not?
E.2. A regression includes region as a predictor with four categories: North, South, East, West.
(a) How many dummy variables are needed?
(b) If "South" is the reference category, write out the dummy variable definitions.
(c) The coefficient for C(region)[T.West] is $-5,200$ with $p = 0.024$. Interpret this in context.
E.3. A researcher accidentally includes all 4 dummy variables (for a variable with 4 categories) plus an intercept in the model. What will happen? Why?
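You can see the mechanics behind E.2 and E.3 with pandas' `get_dummies` (statsmodels' `C()` handles this automatically; the region values below are just an illustrative sample):

```python
import pandas as pd

regions = pd.Series(['North', 'South', 'East', 'West', 'South', 'East'])

# drop_first=True keeps k-1 = 3 dummies; the dropped (alphabetically first)
# level becomes the reference category
dummies = pd.get_dummies(regions, prefix='region', drop_first=True)
print(dummies.columns.tolist())   # 3 columns

# With all 4 dummies plus an intercept, the dummy columns always sum to 1,
# exactly reproducing the intercept column: perfect collinearity
all_four = pd.get_dummies(regions, prefix='region')
print(all_four.sum(axis=1).unique())   # every row sums to 1
```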
Part F: F-Test and Individual t-Tests ⭐⭐
F.1. A model predicts test scores using study hours, sleep hours, and number of absences. The output shows:
- F-statistic = 42.8, p(F) < 0.001
- Individual p-values: study_hours = 0.001, sleep_hours = 0.312, absences = 0.008
(a) What does the significant F-test tell you?
(b) Based on the individual p-values, which predictors are significant at $\alpha = 0.05$?
(c) Should sleep_hours be removed from the model? What additional considerations might influence this decision?
F.2. A model has F-statistic = 1.23, p(F) = 0.304, but one individual predictor has p = 0.041.
(a) Is this situation possible? Why or why not?
(b) What would you conclude about this model?
(c) How might this happen?
F.3. Write the null and alternative hypotheses for:
(a) The overall F-test in a model with 4 predictors.
(b) The individual t-test for the second predictor ($x_2$) in that model.
(c) Explain the logical difference between these two tests.
Part G: Residual Diagnostics ⭐⭐
G.1. For each residual plot description below, identify the violated condition and suggest a remedy:
(a) The residuals vs. predicted values plot shows a clear U-shaped curve.
(b) The residuals vs. predicted values plot shows a "fan" shape — residuals spread out as predicted values increase.
(c) The QQ-plot of residuals shows heavy tails (points curve away from the line at both ends).
(d) The residuals show a clear increasing trend over time (for data collected sequentially).
G.2. A residual plot looks like random scatter around zero with no discernible pattern. What does this tell you about:
(a) The linearity assumption?
(b) The equal variance (homoscedasticity) assumption?
(c) The overall appropriateness of the model?
G.3. A student fits a multiple regression model, checks the residual plot, and sees a clear curved pattern. They respond by adding more predictors. Why is this the wrong approach? What should they do instead?
Part H: Interaction Terms ⭐⭐⭐
H.1. A model predicts job satisfaction (on a 1-100 scale) from salary (in thousands) and remote_work (0 = office, 1 = remote):
$$\widehat{\text{satisfaction}} = 30 + 0.5 \times \text{salary} + 8 \times \text{remote} + 0.3 \times (\text{salary} \times \text{remote})$$
(a) What is the predicted satisfaction for an office worker earning $60K?
(b) What is the predicted satisfaction for a remote worker earning $60K?
(c) What is the effect of a $10K salary increase for office workers?
(d) What is the effect of a $10K salary increase for remote workers?
(e) Explain in plain language what the interaction term tells us.
H.2. A researcher tests the interaction between exercise frequency and age on cholesterol level. The interaction term has $p = 0.003$.
(a) What does this significant interaction mean?
(b) Can the researcher still interpret the main effects (exercise and age) individually? Why or why not?
(c) How would you recommend the researcher present these results?
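In statsmodels' formula interface, an interaction like the one in H.2 is specified with `*`. A sketch on invented data where a real exercise-by-age interaction is built in (all coefficients and names here are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical cholesterol data with a true interaction term
rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    'exercise': rng.uniform(0, 10, n),
    'age': rng.uniform(20, 70, n),
})
df['cholesterol'] = (220 - 3 * df['exercise'] + 0.5 * df['age']
                     + 0.05 * df['exercise'] * df['age']
                     + rng.normal(0, 5, n))

# 'exercise * age' expands to exercise + age + exercise:age
model = smf.ols('cholesterol ~ exercise * age', data=df).fit()
print(model.params['exercise:age'], model.pvalues['exercise:age'])
```

With a significant interaction, a common presentation choice is to plot predicted cholesterol against exercise at a few representative ages rather than reporting main effects alone.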
Part I: Python Implementation ⭐⭐⭐
I.1. Using the following dataset, complete a full multiple regression analysis in Python:
import pandas as pd
import numpy as np

np.random.seed(123)
n = 80
data = pd.DataFrame({
    'study_hours': np.random.uniform(1, 15, n),
    'sleep_hours': np.random.normal(7, 1.5, n).clip(3, 12),
    'absences': np.random.poisson(3, n),
    'tutoring': np.random.choice(['Yes', 'No'], n, p=[0.3, 0.7])
})
data['exam_score'] = (
    40 +
    3.5 * data['study_hours'] +
    2.0 * data['sleep_hours'] +
    -4.0 * data['absences'] +
    5.0 * (data['tutoring'] == 'Yes').astype(int) +
    np.random.normal(0, 6, n)
).clip(0, 100)
(a) Fit a multiple regression model predicting exam_score from all four predictors.
(b) Report and interpret each coefficient using the "holding other variables constant" template.
(c) Report $R^2$, adjusted $R^2$, and the F-test result.
(d) Calculate VIF for the numerical predictors. Are there multicollinearity concerns?
(e) Create a residual plot and QQ-plot. Do the residuals satisfy the LINE conditions?
(f) Compare the model with and without tutoring. Does including it improve the model?
I.2. Using Maya's community health data (provided in Section 23.3 of the chapter):
(a) Fit the simple regression of ER rate on poverty rate. Record the slope and $R^2$.
(b) Fit the multiple regression of ER rate on poverty rate, uninsured percentage, and AQI.
(c) By how much did the poverty rate coefficient change? Calculate the percentage change.
(d) Test for an interaction between poverty rate and AQI. Is it significant?
(e) Create a complete set of residual diagnostics for the multiple regression model.
Part J: Applied Analysis ⭐⭐⭐
J.1. Maya's Policy Brief. Maya needs to advise the county health board on where to invest to reduce ER overcrowding. Based on her multiple regression results:
(a) Which factor has the largest practical impact on ER visit rates?
(b) A board member says: "Let's focus all our funding on reducing poverty." Based on the regression results, what would you advise?
(c) The model shows poverty rate, uninsured percentage, and AQI all contribute. What does this suggest about effective policy interventions?
(d) What are the limitations of using this regression model for policy decisions? (Consider: causation, unmeasured variables, ecological fallacy.)
J.2. Alex's Business Case. Alex presents StreamVibe leadership with the finding that Premium subscribers watch 7.85 hours more per week than Free subscribers, controlling for content diversity and recommendation accuracy.
(a) A VP says: "So if we convert Free users to Premium, each one will watch 8 more hours per week." What's wrong with this reasoning?
(b) What is the difference between the descriptive interpretation of the coefficient and a causal interpretation?
(c) How might Alex test whether upgrading users actually causes more engagement?
J.3. James's Ethical Analysis. James finds that the race effect on risk scores shrinks from 1.35 (simple regression) to 0.78 (multiple regression) when controlling for criminal history.
(a) A colleague argues: "The race effect dropped by 42%, so most of the disparity is explained by criminal history. The algorithm is mostly fair." Evaluate this argument.
(b) Another colleague argues: "But criminal history itself reflects systemic racism — more policing in Black neighborhoods, harsher sentencing. Controlling for it is controlling for the very bias we're trying to detect." Evaluate this argument.
(c) What additional variables might James want to include in his model?
(d) Can multiple regression alone determine whether the algorithm is "fair"? What else is needed?
Part K: Challenge Problems ⭐⭐⭐⭐
K.1. Building a Model from Scratch. Download a public dataset (suggestions: World Happiness Report, U.S. College Scorecard, or Gapminder) and build a multiple regression model from scratch.
Your analysis should include:
- A clear research question
- At least 3 predictors (including at least one categorical variable)
- Exploratory scatterplots and correlation matrix
- Simple regression for comparison
- Full multiple regression with statsmodels
- VIF analysis
- Residual diagnostics (all four LINE conditions)
- A written interpretation of results (1-2 paragraphs)
- A discussion of limitations (confounders, causation, generalizability)
K.2. The Overfitting Experiment. Use the following code to explore overfitting:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
n = 30  # Small sample

# True model: y depends on x1 and x2 only
df = pd.DataFrame({
    'x1': np.random.normal(0, 1, n),
    'x2': np.random.normal(0, 1, n)
})
df['y'] = 3 + 2 * df['x1'] - 1.5 * df['x2'] + np.random.normal(0, 2, n)

# Add 10 random noise variables (no real relationship to y)
for i in range(3, 13):
    df[f'x{i}'] = np.random.normal(0, 1, n)

# Fit models with increasing numbers of predictors
for k in [2, 4, 6, 8, 10, 12]:
    predictors = ' + '.join([f'x{i}' for i in range(1, k + 1)])
    model = smf.ols(f'y ~ {predictors}', data=df).fit()
    print(f"k={k:2d} predictors: R²={model.rsquared:.4f}, "
          f"Adj R²={model.rsquared_adj:.4f}, "
          f"F p-value={model.f_pvalue:.4f}")
(a) Run the code. What happens to $R^2$ as you add more (noise) predictors?
(b) What happens to adjusted $R^2$?
(c) At what point does adjusted $R^2$ start decreasing?
(d) What happens to the F-test p-value?
(e) Write a paragraph explaining what this experiment teaches about model building.
K.3. Standardized Coefficients. The partial regression coefficients from a regression model depend on the units of measurement. To compare the relative importance of predictors, researchers sometimes use standardized coefficients (beta weights), computed by standardizing all variables to have mean 0 and standard deviation 1 before fitting the model.
(a) Using any dataset with at least 3 numerical predictors, fit a multiple regression model.
(b) Standardize all variables using $z_i = (x_i - \bar{x}_i) / s_i$.
(c) Refit the model using the standardized variables. The coefficients are now the standardized regression coefficients.
(d) Which predictor has the largest standardized coefficient? What does this tell you about relative importance?
(e) Why are standardized coefficients useful for comparing predictors but unstandardized coefficients more useful for making predictions?
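A minimal sketch of steps (b)-(c) on invented data, where `x1` is built to have the larger standardized effect (all names and coefficients are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: x2 has a bigger raw coefficient scale than effect size
rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame({'x1': rng.normal(0, 1, n), 'x2': rng.normal(0, 10, n)})
df['y'] = 5 + 4 * df['x1'] + 0.1 * df['x2'] + rng.normal(0, 1, n)

raw = smf.ols('y ~ x1 + x2', data=df).fit()

# Standardize every variable to mean 0, sd 1, then refit
z = (df - df.mean()) / df.std()
std = smf.ols('y ~ x1 + x2', data=z).fit()
print(raw.params)   # unstandardized: depend on units
print(std.params)   # beta weights: unit-free, directly comparable
```

Note that the intercept of the standardized model is (numerically) zero, since every standardized variable has mean zero.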
Part L: Connections and Reflection ⭐
L.1. In Chapter 22, you learned about the decomposition $SS_{\text{Total}} = SS_{\text{Regression}} + SS_{\text{Residual}}$ and $R^2 = SS_{\text{Reg}} / SS_{\text{Total}}$. In Chapter 20 (ANOVA), you learned about $SS_{\text{Total}} = SS_{\text{Between}} + SS_{\text{Within}}$ and $\eta^2 = SS_B / SS_T$. Now in multiple regression, these ideas are extended to multiple predictors.
(a) How is multiple regression related to ANOVA conceptually?
(b) In what sense is ANOVA a special case of multiple regression?
(c) If you ran an ANOVA with one grouping variable (3 groups) and then ran a regression with 2 dummy variables for the same grouping variable, would you get the same F-statistic? Why?
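You can check the ANOVA-regression equivalence empirically on synthetic data (the group means and sample sizes below are invented): a one-way ANOVA and a regression on dummy variables for the same grouping produce the same F-statistic.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical scores for 3 groups of 20
rng = np.random.default_rng(9)
df = pd.DataFrame({
    'group': ['A'] * 20 + ['B'] * 20 + ['C'] * 20,
    'score': np.concatenate([rng.normal(70, 8, 20),
                             rng.normal(75, 8, 20),
                             rng.normal(80, 8, 20)]),
})

# One-way ANOVA
f_anova, p_anova = stats.f_oneway(*(df.loc[df.group == g, 'score'] for g in 'ABC'))

# Regression with C(group) expanding to 2 dummy variables
model = smf.ols('score ~ C(group)', data=df).fit()
print(f_anova, model.fvalue)   # identical up to floating-point rounding
```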
L.2. Reflect on the threshold concept "holding other variables constant."
(a) In your own words, explain what this phrase means.
(b) Give an example from your own life or interests where you would want to "hold other variables constant" to understand a relationship.
(c) Why is this concept harder than it first appears?
L.3. How does multiple regression connect to the larger theme that "AI models are regression on steroids" (Theme 3)? What specific concepts from this chapter would you see again in a machine learning course?