Quiz: Correlation and Simple Linear Regression
Test your understanding of scatterplots, correlation, regression, residuals, $R^2$, extrapolation, regression to the mean, and the correlation-causation distinction. Try to answer each question before revealing the answer.
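As a warm-up before the questions, here is a minimal Python sketch (using small, made-up numbers) of the core quantities the quiz covers: the Pearson correlation $r$, the least squares slope $b_1 = r \cdot \frac{s_y}{s_x}$ and intercept $b_0 = \bar{y} - b_1\bar{x}$, and the property that least squares residuals sum to zero.

```python
# Warm-up: the quiz's core quantities, computed by hand on a toy dataset.
# (The data below are made up purely for illustration.)
import math

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# Pearson correlation: average product of z-scores (with n - 1 in the denominator).
r = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

# Least squares slope and intercept: b1 = r * sy/sx, b0 = ybar - b1 * xbar.
b1 = r * sy / sx
b0 = mean_y - b1 * mean_x

# Residuals of a least squares line always sum to (essentially) zero.
residual_sum = sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))

print(round(r, 3), round(b1, 3), round(b0, 3), round(r ** 2, 3))
# → 0.999 1.966 0.187 0.998
```

Library routines such as `scipy.stats.pearsonr` or `numpy.polyfit` compute the same quantities; the hand-rolled version just keeps the formulas that appear throughout the quiz visible.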
1. Before computing a correlation coefficient, you should always:
(a) Check whether the data is normally distributed (b) Create a scatterplot (c) Run a hypothesis test (d) Remove outliers
Answer
**(b) Create a scatterplot.** Anscombe's Quartet demonstrates that four datasets with identical correlation coefficients ($r \approx 0.816$), identical regression lines, and identical $R^2$ values can have completely different patterns. Only a scatterplot reveals the true nature of the relationship — whether it's linear, curved, driven by outliers, or nonexistent. Always plot first.

2. A Pearson correlation coefficient of $r = -0.92$ indicates:
(a) A weak negative relationship (b) A strong positive relationship (c) A strong negative relationship (d) Almost no relationship
Answer
**(c) A strong negative relationship.** The sign of $r$ indicates the direction (negative = as $x$ increases, $y$ tends to decrease), and the magnitude $|r| = 0.92$ indicates a very strong linear association. Don't confuse the negative sign with "weak" — $r = -0.92$ represents a stronger relationship than $r = +0.60$.

3. The correlation between height measured in inches and weight measured in pounds is $r = 0.72$. If height is converted to centimeters and weight to kilograms, the correlation will be:
(a) Different, because the units changed (b) 0.72, because correlation has no units (c) Higher, because metric units are more precise (d) It depends on the conversion factors
Answer
**(b) 0.72, because correlation has no units.** The Pearson correlation coefficient is unitless — it's calculated from z-scores, which standardize both variables. Converting units is a linear transformation, and $r$ is invariant under linear transformations. Whether you measure height in inches, centimeters, or furlongs, the correlation doesn't change.

4. A study finds $r = 0.85$ between per capita chocolate consumption and Nobel Prize winners per 10 million population across countries. This means:
(a) Eating more chocolate causes countries to produce more Nobel laureates (b) Winning Nobel Prizes causes countries to eat more chocolate (c) There is a strong positive linear association, but we cannot determine causation (d) The correlation must be wrong because the relationship doesn't make sense
Answer
**(c) There is a strong positive linear association, but we cannot determine causation.** This is a famous spurious correlation. The most likely explanation is a common cause (confounding variable): national wealth is associated with both higher chocolate consumption and greater investment in education and research. The correlation is *real* — the numbers do co-vary — but the relationship is not causal. Buying more chocolate for your country will not produce more Nobel laureates.

5. In the regression equation $\hat{y} = 12.5 + 3.2x$, the slope of 3.2 means:
(a) When $x = 0$, $y = 3.2$ (b) For each one-unit increase in $x$, $\hat{y}$ increases by 3.2 units on average (c) The correlation between $x$ and $y$ is 3.2 (d) 3.2% of the variability in $y$ is explained by $x$
Answer
**(b) For each one-unit increase in $x$, $\hat{y}$ increases by 3.2 units on average.** The slope $b_1$ represents the predicted change in $y$ for each one-unit increase in $x$. The "on average" qualifier is important — not every observation will change by exactly 3.2 units. The slope describes the average trend across all observations. Note: the intercept 12.5 is the predicted value when $x = 0$, and neither slope nor intercept equals $r$ or $R^2$.

6. A regression of exam scores ($y$) on study hours ($x$) gives $\hat{y} = 45 + 6x$. The intercept of 45 means:
(a) The minimum possible exam score is 45 (b) A student who studies for 45 hours will pass (c) A student who studies 0 hours is predicted to score 45 (d) The average exam score is 45
Answer
**(c) A student who studies 0 hours is predicted to score 45.** The intercept $b_0$ is the predicted value of $y$ when $x = 0$. In this case, it suggests that a student with zero study hours would score about 45 on average — perhaps reflecting the baseline knowledge they bring to the exam. However, intercept interpretations should always be evaluated for practical meaning. If $x = 0$ is outside the range of the data, the intercept is just a mathematical anchor.

7. The least squares regression line minimizes:
(a) The sum of the residuals (b) The sum of the absolute values of the residuals (c) The sum of the squared residuals (d) The largest residual
Answer
**(c) The sum of the squared residuals.** "Least squares" means exactly what it says: the line is chosen to minimize $\sum e_i^2 = \sum (y_i - \hat{y}_i)^2$. We square the residuals for the same reason we square deviations in the variance formula: to prevent positive and negative residuals from canceling, and to penalize large errors more heavily. Note that the sum of (unsquared) residuals $\sum e_i$ is always zero for a least squares line — that's a property, not the criterion.

8. An $R^2$ value of 0.64 means:
(a) The correlation between $x$ and $y$ is 0.64 (b) 64% of the variability in $y$ is explained by the linear relationship with $x$ (c) The model correctly predicts 64% of the observations (d) The slope of the regression line is 0.64
Answer
**(b) 64% of the variability in $y$ is explained by the linear relationship with $x$.** $R^2$ is the coefficient of determination — the proportion of total variability in $y$ that the regression model explains. The remaining 36% is unexplained variability (captured in the residuals). Note: the correlation $r = \sqrt{0.64} = 0.80$ (taking the sign from the slope direction), not 0.64. Also, $R^2$ tells you nothing about *which observations* are correctly predicted.

9. A residual plot shows residuals fanning out as the predicted values increase (a funnel or megaphone shape). This indicates a violation of:
(a) Linearity (b) Independence (c) Normality (d) Equal variance (homoscedasticity)
Answer
**(d) Equal variance (homoscedasticity).** A fan shape in the residual plot means the spread of the residuals is not constant — it increases with the predicted value. This is called **heteroscedasticity** (non-constant variance). It means the model's predictions are less precise for larger values of $y$ than for smaller values. Common remedies include applying a log transformation to $y$ or using weighted least squares.

10. A regression model predicting house prices from square footage is fit using data for houses between 800 and 3,500 square feet. Using this model to predict the price of a 6,000-square-foot mansion is an example of:
(a) Interpolation (b) Extrapolation (c) Imputation (d) Transformation
Answer
**(b) Extrapolation.** Extrapolation is making predictions for $x$ values outside the range of the observed data. The model was built on homes from 800 to 3,500 sq ft. Predicting for 6,000 sq ft assumes the linear relationship holds far beyond the observed range — a dangerous assumption. Mansion-level homes may follow different pricing patterns (luxury features, land value, diminishing returns on size). Interpolation (predicting within the observed range) is much safer.

11. The correlation between parent height and adult child height is $r = 0.5$. A parent who is 2 standard deviations above the average height would have a child whose predicted height is:
(a) 2 standard deviations above average (b) 1 standard deviation above average (c) 0.5 standard deviations above average (d) Exactly average
Answer
**(b) 1 standard deviation above average.** This is regression to the mean in action. With $r = 0.5$, the predicted z-score for the child is $r \times z_{\text{parent}} = 0.5 \times 2 = 1.0$. The child is predicted to be above average, but *less extreme* than the parent. Since $|r| < 1$, predicted values are always pulled toward the mean. This is Galton's original insight and the origin of the term "regression."

12. Regression to the mean occurs because:
(a) There is a causal force pulling extreme values toward the average (b) People who score extremely tend to score lower on retesting due to fatigue (c) The correlation between two measurements is typically less than perfect ($|r| < 1$) (d) The measurement instrument loses accuracy for extreme values
Answer
**(c) The correlation between two measurements is typically less than perfect ($|r| < 1$).** Regression to the mean is a mathematical consequence of imperfect correlation, not a causal phenomenon. Some of what makes an observation extreme is "signal" (real ability, genetic factors) and some is "noise" (lucky day, measurement error, favorable conditions). The noise doesn't repeat, so the next observation is likely less extreme. This is why regression to the mean is a threshold concept: understanding it prevents misinterpreting a purely statistical phenomenon as a real effect.

13. Which of the following statements about the Pearson correlation coefficient is FALSE?
(a) $r$ is always between $-1$ and $+1$ (b) $r$ measures the strength of any relationship between two variables (c) $r$ has no units (d) The correlation between $x$ and $y$ equals the correlation between $y$ and $x$
Answer
**(b) $r$ measures the strength of any relationship between two variables.** This is false because $r$ only measures the strength of *linear* relationships. A perfect parabolic or sinusoidal relationship could have $r = 0$. This is one of the most important limitations of the Pearson correlation coefficient — it can completely miss strong nonlinear patterns. Always create a scatterplot to verify that the relationship is approximately linear before interpreting $r$.

14. In simple linear regression, the residual $e_i = y_i - \hat{y}_i$ represents:
(a) The predicted value of $y$ for observation $i$ (b) The distance between the observed and predicted values of $y$ (c) The slope of the regression line at point $i$ (d) The probability that the prediction is correct
Answer
**(b) The distance between the observed and predicted values of $y$.** The residual is how much the regression line "misses" for a particular observation. Positive residuals mean the observed value is above the line (the model under-predicted); negative residuals mean the observed value is below the line (the model over-predicted). Residuals always sum to zero for a least squares regression line, and analyzing their patterns is the primary diagnostic tool for assessing model fit.

15. The regression equation $\hat{y} = 100 - 2.5x$ has a slope of $-2.5$ and the $R^2 = 0.81$. Which of the following is true?
(a) $r = 0.81$ (b) $r = -0.81$ (c) $r = 0.90$ (d) $r = -0.90$
Answer
**(d) $r = -0.90$.** Since $R^2 = r^2$, we know $|r| = \sqrt{0.81} = 0.90$. But which sign? The slope is negative ($-2.5$), which means the relationship is negative — as $x$ increases, $y$ decreases. Therefore, $r$ must be negative: $r = -0.90$.

16. A researcher finds a strong correlation between the number of firefighters responding to a fire and the amount of property damage. She concludes that firefighters cause property damage. What is the most likely explanation?
(a) Reverse causation — property damage attracts more firefighters (b) Common cause — larger fires cause both more damage and more firefighters (c) Coincidence — the correlation is spurious (d) Direct causation — firefighters accidentally cause more damage
Answer
**(b) Common cause — larger fires cause both more damage and more firefighters.** The lurking variable is fire severity. Larger, more intense fires cause more property damage AND require more firefighters. The firefighters aren't causing the damage — they're responding to the same underlying factor that causes the damage. Reducing the number of firefighters would make things *worse*, not better. This is a classic confounding variable example.

17. Which of the following is NOT an assumption for regression inference (the LINE conditions)?
(a) Linearity (b) Independence (c) Equal sample sizes for $x$ and $y$ (d) Equal variance of residuals
Answer
**(c) Equal sample sizes for $x$ and $y$.** This isn't an assumption because in regression, each observation has both an $x$ and a $y$ value — the sample sizes are always equal by construction. The LINE conditions are: **L**inearity (the relationship is linear), **I**ndependence (observations are independent), **N**ormality (residuals are approximately normal), and **E**qual variance (residuals have constant spread across all $x$ values). These conditions must hold for inference (confidence intervals and hypothesis tests) to be valid.

18. A school district selects students who scored in the bottom 10% on a standardized test and provides them with tutoring. On the retest, their average score improves by 12 points. The most likely explanation is:
(a) The tutoring program was highly effective (b) The students worked harder because they felt singled out (c) Regression to the mean — extreme scores tend to be followed by less extreme scores (d) The test was easier the second time
Answer
**(c) Regression to the mean — extreme scores tend to be followed by less extreme scores.** When you select students specifically because they scored extremely low, some of that extreme score was due to bad luck (illness on test day, guessing wrong, random variation). On the retest, their luck is likely to be closer to average, so their scores improve — even without any intervention. This is the medical trap applied to education. A randomized controlled trial with a control group that receives no tutoring would be needed to determine the true tutoring effect, because the control group would also experience regression to the mean.

19. In the formula $b_1 = r \cdot \frac{s_y}{s_x}$, if $r = 0$, then $b_1 = 0$. In plain English, this means:
(a) There is a perfectly horizontal regression line through the data (b) The line passes through the origin (c) There is no data (d) The model predicts $\hat{y} = \bar{y}$ for every value of $x$
Answer
**(d) The model predicts $\hat{y} = \bar{y}$ for every value of $x$.** When $b_1 = 0$, the regression equation becomes $\hat{y} = b_0 + 0 \cdot x = b_0$. Since $b_0 = \bar{y} - b_1\bar{x} = \bar{y} - 0 = \bar{y}$, the regression line is horizontal at $\hat{y} = \bar{y}$. The variable $x$ provides no information for predicting $y$, so the best prediction is simply the mean of $y$. Both (a) and (d) are essentially saying the same thing — the regression line is flat — but (d) captures the full meaning.

20. The fundamental connection between ANOVA (Chapter 20) and regression is:
(a) Both use the F-distribution (b) Both decompose total variability into explained and unexplained components (c) Both require normally distributed data (d) Both compare exactly two groups