Exercises: Analysis of Variance (ANOVA)

These exercises progress from conceptual understanding through hand calculations, Python implementation, assumption checking, post-hoc analysis, and interpretation. Estimated completion time: 3.5 hours.

Difficulty Guide:

  • ⭐ Foundational (5-10 min each)
  • ⭐⭐ Intermediate (10-20 min each)
  • ⭐⭐⭐ Challenging (20-40 min each)
  • ⭐⭐⭐⭐ Advanced/Research (40+ min each)


Part A: Conceptual Understanding ⭐

A.1. In your own words, explain why running multiple two-sample t-tests is problematic when comparing more than two groups. Use a specific numerical example — if you have 6 groups, how many pairwise comparisons are needed, and what's the probability of at least one false positive at $\alpha = 0.05$?
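
For checking your arithmetic here, the family-wise error rate for $m$ comparisons, under the simplifying assumption of independent tests, is $1 - (1 - \alpha)^m$. A minimal helper (the function name is ours, not from the chapter):

```python
from math import comb

def familywise_error_rate(k_groups, alpha=0.05):
    """Number of pairwise comparisons among k groups, and the probability
    of at least one false positive, assuming independent tests (an
    approximation; pairwise tests on the same data are not independent)."""
    m = comb(k_groups, 2)  # number of pairwise comparisons
    return m, 1 - (1 - alpha) ** m

# Example with 3 groups: 3 comparisons
m, fwer = familywise_error_rate(3)
```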

A.2. Explain the difference between "between-group variability" and "within-group variability" using a non-statistical analogy. (For example, think about differences between cities vs. differences within a single city.)

A.3. True or false (explain each):

(a) A significant ANOVA result tells you which specific groups are different from each other.

(b) The F-statistic can be negative.

(c) If all group means are identical, $SS_{\text{Between}} = 0$.

(d) ANOVA can only be used when all groups have the same sample size.

(e) A large $\eta^2$ guarantees that the ANOVA is statistically significant.

(f) Tukey's HSD should be performed regardless of whether the ANOVA result is significant.

A.4. A researcher conducts a one-way ANOVA with $k = 5$ groups and gets $F(4, 45) = 1.03$, $p = 0.402$. She writes: "The five treatment groups produce the same outcomes." What's wrong with this conclusion?

A.5. Explain why ANOVA is called "analysis of variance" even though it's used to compare means. How does comparing variances help us draw conclusions about means?

A.6. A study finds $F(2, 57) = 4.12$, $p = 0.021$, $\eta^2 = 0.13$. Another study finds $F(2, 570) = 4.12$, $p = 0.017$, $\eta^2 = 0.014$. Both have the same F-statistic. Why do they have different $\eta^2$ values, and what does this tell you about interpreting effect sizes?
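
A useful identity for this question: because $F = (SS_B/df_B)/(SS_W/df_W)$ and $\eta^2 = SS_B/(SS_B + SS_W)$, the effect size can be recovered from $F$ and its degrees of freedom as $\eta^2 = F\,df_B / (F\,df_B + df_W)$. A short check (the function name is ours):

```python
def eta_squared_from_F(F, df_between, df_within):
    """Recover eta^2 from an F-statistic and its degrees of freedom.
    Follows from F = (SS_B/df_B) / (SS_W/df_W) and
    eta^2 = SS_B / (SS_B + SS_W)."""
    return (F * df_between) / (F * df_between + df_within)

eta_study1 = eta_squared_from_F(4.12, 2, 57)    # smaller study
eta_study2 = eta_squared_from_F(4.12, 2, 570)   # larger study
```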


Part B: Understanding the ANOVA Table ⭐

B.1. Fill in the missing values in this ANOVA table:

Source     SS   df    MS    F    p
Between     ?    3   120    ?
Within    960    ?     ?
Total       ?   39

How many groups were in this study? How many total observations?

B.2. Fill in the missing values:

Source     SS   df    MS     F    p
Between   450    ?     ?  4.50
Within      ?   45     ?
Total    2700   47

B.3. A one-way ANOVA with $k = 4$ groups and $N = 60$ observations yields $SS_B = 210$ and $SS_W = 840$.

(a) Compute $SS_T$, $df_B$, $df_W$, $MS_B$, $MS_W$, $F$, and $\eta^2$.

(b) Using the benchmarks from the chapter, would you call this a small, medium, or large effect?

(c) Verify that $df_T = df_B + df_W$.

B.4. An ANOVA table shows $MS_B = 50$ and $MS_W = 50$, giving $F = 1.0$. What does this mean in plain English? Is there likely a significant difference among the groups?
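
When checking your answers in this part, the relationships among the table entries can be collected in a small helper (a sketch; the example numbers are made up and do not come from any exercise):

```python
def fill_anova_table(ss_b, ss_w, k, N):
    """Complete a one-way ANOVA table from SS_B, SS_W,
    k groups, and N total observations."""
    df_b, df_w = k - 1, N - k
    ms_b, ms_w = ss_b / df_b, ss_w / df_w
    return {
        "SS_T": ss_b + ss_w,
        "df_B": df_b, "df_W": df_w,
        "MS_B": ms_b, "MS_W": ms_w,
        "F": ms_b / ms_w,
        "eta_sq": ss_b / (ss_b + ss_w),
    }

table = fill_anova_table(ss_b=120, ss_w=480, k=3, N=33)
```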


Part C: Hand Calculations ⭐⭐

C.1. Three teaching methods are compared using test scores from 15 students (5 per group):

Method A   Method B   Method C
      78         85         92
      82         88         89
      75         90         95
      80         82         91
      85         85         88

(a) Calculate the group means and the grand mean.

(b) Calculate $SS_B$, $SS_W$, and $SS_T$. Verify that $SS_T = SS_B + SS_W$.

(c) Calculate $MS_B$, $MS_W$, and the $F$-statistic.

(d) The critical value for $F(2, 12)$ at $\alpha = 0.05$ is 3.89. Is the result significant?

(e) Calculate $\eta^2$ and interpret it.
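
Once you have worked parts (a) through (e) by hand, you can check your final F-statistic with scipy (this confirms the end result, not your intermediate sums of squares):

```python
from scipy import stats

method_a = [78, 82, 75, 80, 85]
method_b = [85, 88, 90, 82, 85]
method_c = [92, 89, 95, 91, 88]

# f_oneway returns the F-statistic and its p-value
F, p = stats.f_oneway(method_a, method_b, method_c)
print(f"F = {F:.3f}, p = {p:.4f}")
```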

C.2. Consider these group means with $n = 10$ per group and $MS_W = 25$:

  • Scenario A: Group means = 50, 51, 52
  • Scenario B: Group means = 50, 55, 60

(a) Calculate $SS_B$ for each scenario.

(b) Calculate the F-statistic for each scenario.

(c) Explain in plain language why Scenario B produces a larger F.

C.3. Consider these two datasets, each with three groups of $n = 8$:

  • Dataset 1: Group means = 40, 50, 60; $MS_W = 20$
  • Dataset 2: Group means = 40, 50, 60; $MS_W = 200$

(a) Calculate $F$ for each dataset.

(b) Which dataset is more likely to produce a significant result? Why?

(c) Connect this to the "signal vs. noise" interpretation of the F-statistic.
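
To check your answers to C.2 and C.3, the F-statistic can be computed directly from the group means, the per-group $n$, and $MS_W$ (a sketch that assumes equal group sizes; the function name is ours):

```python
def f_from_means(means, n_per_group, ms_w):
    """F-statistic from group means (equal n) and a given MS_W."""
    k = len(means)
    grand = sum(means) / k
    ss_b = n_per_group * sum((m - grand) ** 2 for m in means)
    ms_b = ss_b / (k - 1)
    return ms_b / ms_w

F1 = f_from_means([40, 50, 60], n_per_group=8, ms_w=20)   # low-noise dataset
F2 = f_from_means([40, 50, 60], n_per_group=8, ms_w=200)  # high-noise dataset
```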


Part D: Assumptions and Diagnostics ⭐⭐

D.1. For each scenario, identify which ANOVA assumption is most likely violated and suggest an alternative approach:

(a) Patients are measured at three time points (baseline, 6 months, 12 months). The researcher uses one-way ANOVA to compare the means.

(b) A survey of income by education level (High School, Bachelor's, Master's, PhD) shows that the PhD group has a standard deviation 4 times larger than the High School group.

(c) Customer satisfaction scores are compared across 5 store locations, but 3 locations have $n = 100$ customers while 2 locations have $n = 8$ customers.

(d) A researcher compares anxiety levels across 4 therapy groups, but all participants in each group come from the same family.

D.2. A researcher reports: "Levene's test was significant ($p = 0.003$), so we could not use ANOVA." Is this conclusion correct? What alternatives are available?

D.3. Group sample sizes and standard deviations are:

Group   $n$    $s$
A       30     5.2
B       28     4.8
C       32    11.1
D       25     5.5

(a) What is the ratio of largest to smallest SD? Does this violate the rule of thumb?

(b) The group sizes are roughly equal. Does this help or hurt the robustness of ANOVA?

(c) What would you recommend: proceed with standard ANOVA, use Welch's ANOVA, or use a nonparametric test?
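
Summary tables like this can be screened with the largest-to-smallest SD ratio from the chapter's rule of thumb. A minimal sketch using illustrative numbers rather than the exercise's data:

```python
def sd_ratio(sds):
    """Ratio of the largest to the smallest group standard deviation."""
    return max(sds) / min(sds)

# Illustrative group SDs (not from the exercise)
ratio = sd_ratio([4.0, 5.5, 9.2])
flag = ratio > 2  # flags a potential equal-variance concern per the rule of thumb
```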


Part E: Post-Hoc Analysis ⭐⭐

E.1. An ANOVA with four groups yields $F(3, 76) = 5.82$, $p = 0.001$. The group means are:

Group   Mean
A       42.3
B       38.7
C       45.1
D       43.8

(a) Which pairs of groups do you expect Tukey's HSD to find significantly different? Explain your reasoning.

(b) Why can't you determine significance just by looking at the means?

(c) How many pairwise comparisons will Tukey's HSD evaluate?

E.2. A Tukey HSD output shows:

   group1  group2  meandiff  p-adj   lower   upper  reject
   Low     Medium    4.2     0.312   -2.1    10.5   False
   Low     High     12.8     0.001    6.5    19.1   True
   Medium  High      8.6     0.008    2.3    14.9   True

(a) Which specific pairs of groups are significantly different?

(b) Write a one-sentence interpretation of each comparison.

(c) What does the p-adj column represent, and why is it different from the p-value you'd get from a regular t-test?

E.3. A researcher runs an ANOVA with $F(4, 95) = 1.87$, $p = 0.122$, and then runs Tukey's HSD anyway, finding one "significant" pair ($p_{\text{adj}} = 0.041$). What's wrong with this approach?


Part F: Python Implementation ⭐⭐

F.1. Using the data below, conduct a complete ANOVA analysis in Python:

diet_a = [4.2, 3.8, 5.1, 4.5, 3.9, 4.8, 5.0, 4.3]
diet_b = [6.1, 5.5, 6.8, 5.9, 6.3, 5.7, 6.5, 6.0]
diet_c = [4.0, 4.5, 3.7, 4.2, 4.8, 3.5, 4.1, 4.6]
diet_d = [5.2, 4.8, 5.5, 5.0, 4.6, 5.3, 5.1, 4.9]

Include: (a) descriptive statistics, (b) Levene's test, (c) one-way ANOVA, (d) $\eta^2$, (e) Tukey's HSD (if appropriate), and (f) a written interpretation.
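
One possible skeleton for this analysis, to adapt rather than copy (the Tukey step assumes statsmodels is available, as in Part I):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

diet_a = [4.2, 3.8, 5.1, 4.5, 3.9, 4.8, 5.0, 4.3]
diet_b = [6.1, 5.5, 6.8, 5.9, 6.3, 5.7, 6.5, 6.0]
diet_c = [4.0, 4.5, 3.7, 4.2, 4.8, 3.5, 4.1, 4.6]
diet_d = [5.2, 4.8, 5.5, 5.0, 4.6, 5.3, 5.1, 4.9]
groups = [diet_a, diet_b, diet_c, diet_d]

# (a) Descriptive statistics
for name, g in zip("abcd", groups):
    print(f"diet_{name}: mean = {np.mean(g):.2f}, sd = {np.std(g, ddof=1):.2f}")

# (b) Levene's test for equal variances
_, p_levene = stats.levene(*groups)

# (c) One-way ANOVA
F, p_anova = stats.f_oneway(*groups)

# (d) eta^2 = SS_B / SS_T, computed from the raw data
all_vals = np.concatenate(groups)
grand = all_vals.mean()
ss_b = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
ss_t = ((all_vals - grand) ** 2).sum()
eta_sq = ss_b / ss_t

# (e) Tukey's HSD, run only if the ANOVA is significant
labels = np.repeat(["a", "b", "c", "d"], [len(g) for g in groups])
if p_anova < 0.05:
    print(pairwise_tukeyhsd(all_vals, labels, alpha=0.05))
```

Part (f), the written interpretation, should tie the printed numbers back to the research question in plain language.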

F.2. Write Python code that manually computes $SS_B$, $SS_W$, $SS_T$, $MS_B$, $MS_W$, and $F$ from raw data — without using scipy.stats.f_oneway(). Verify your results match f_oneway().

F.3. Modify the following code to handle unequal group sizes properly:

groups = [
    [10, 12, 15, 11, 13],           # n = 5
    [20, 22, 18, 25, 19, 21, 23],   # n = 7
    [14, 16, 15]                     # n = 3
]

Run the ANOVA and comment on any concerns about the results given the unequal sample sizes.
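
As a starting point, scipy's f_oneway accepts groups of unequal sizes directly when the list is unpacked with *. A minimal sketch (the commentary on the results is still yours to write):

```python
from scipy import stats

groups = [
    [10, 12, 15, 11, 13],           # n = 5
    [20, 22, 18, 25, 19, 21, 23],   # n = 7
    [14, 16, 15],                   # n = 3
]

# The * unpacks the list so each group is passed as a separate argument;
# f_oneway handles unequal n, but robustness to unequal variances
# suffers when sample sizes are this unbalanced.
F, p = stats.f_oneway(*groups)
print(f"F = {F:.3f}, p = {p:.4f}")
```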


Part G: Interpretation and Application ⭐⭐⭐

G.1. (Maya's extended analysis) Maya expands her study to include a fifth intervention (telemedicine consultations). The complete results are:

Program        $n$   Mean   SD
Vaccination     25   74.2   8.1
Nutrition       25   69.8   7.5
Fitness         25   83.4   6.9
Telemedicine    25   76.1   9.2
Control         25   58.3   8.8

ANOVA: $F(4, 120) = 34.67$, $p < 0.001$, $\eta^2 = 0.54$

(a) Interpret the F-statistic and p-value in context.

(b) Interpret $\eta^2 = 0.54$ in context. What does it mean that 54% of the variability is explained by program assignment?

(c) If you were advising Maya's public health department, what additional information would you want from the Tukey's HSD results before making a policy recommendation?

(d) Would it matter whether patients were randomly assigned to programs or chose them? How does study design affect your interpretation?

G.2. (Alex's extended analysis) Alex wants to compare not just subscription tiers but also test whether the new algorithm's effect varies by tier. This would require a two-way ANOVA (algorithm $\times$ tier). Without doing the calculation:

(a) What would the null hypotheses be for a two-way ANOVA? (Hint: there are three — one for each main effect and one for the interaction.)

(b) What would it mean if the interaction were significant?

(c) Why can't Alex answer this question with a one-way ANOVA?

G.3. A pharmaceutical company compares four drug dosages (placebo, 5mg, 10mg, 20mg) and reports $F(3, 196) = 3.12$, $p = 0.027$, $\eta^2 = 0.046$. The Tukey HSD shows that only the 20mg group differs significantly from placebo.

(a) The effect size is "small" by Cohen's benchmarks. Does this mean the drug is ineffective?

(b) How might a clinician interpret "small $\eta^2$ but significant F" differently from a policy maker?

(c) What additional information would you want to see before making a recommendation?


Part H: Connecting Concepts ⭐⭐⭐

H.1. Show that when $k = 2$ (two groups), the F-statistic from ANOVA equals the square of the two-sample t-statistic from Chapter 16: $F = t^2$. (Hint: start with $SS_B = n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2$ and simplify using $\bar{x} = (n_1\bar{x}_1 + n_2\bar{x}_2)/N$ for equal group sizes $n_1 = n_2 = n$.)

H.2. Explain the relationship between $\eta^2$ in ANOVA and $r^2$ in regression. (If you've read ahead to Chapter 22: show that for a regression with one categorical predictor using indicator variables, $R^2 = \eta^2$.)

H.3. A researcher applies both the Bonferroni correction and Tukey's HSD to the same data with $k = 5$ groups.

(a) How many comparisons does each method make?

(b) The Bonferroni approach uses $\alpha_{\text{individual}} = 0.05/10 = 0.005$ for each test. Why is Tukey's HSD generally less conservative?

(c) Under what circumstances might you prefer Bonferroni over Tukey?

H.4. (Effect size comparison across methods) For the same dataset, a researcher reports:

  • ANOVA: $\eta^2 = 0.15$
  • Two-sample t-test (Group 1 vs. 2): Cohen's $d = 0.40$
  • Two-sample t-test (Group 1 vs. 3): Cohen's $d = 0.82$

(a) How can $\eta^2$ be "large" while one of the pairwise comparisons has only a "small-to-medium" effect?

(b) Why is it important to report both the overall effect size and pairwise effect sizes?


Part I: Critical Thinking and Research Design ⭐⭐⭐⭐

I.1. (Power analysis for ANOVA) Maya is designing a new study comparing three intervention programs. She expects a medium effect size ($\eta^2 = 0.06$, corresponding to Cohen's $f = 0.25$).

(a) Using the power analysis concepts from Chapter 17, explain what factors affect the power of an ANOVA test.

(b) In Python, use statsmodels.stats.power.FTestAnovaPower() to determine the sample size per group needed for 80% power at $\alpha = 0.05$:

from statsmodels.stats.power import FTestAnovaPower

power_analysis = FTestAnovaPower()
# Note: for FTestAnovaPower, solve_power returns the TOTAL sample size,
# so divide by k_groups to get the per-group n.
n_total = power_analysis.solve_power(
    effect_size=0.25,  # Cohen's f for medium effect
    k_groups=3,
    alpha=0.05,
    power=0.80
)
print(f"Required n per group: {n_total / 3:.0f}")

(c) How does the required sample size change if Maya adds a fourth group?

I.2. A published study reports: "The three teaching methods produced significantly different exam scores ($F(2, 87) = 3.21$, $p = 0.045$). Method B was the most effective." The paper includes no post-hoc tests, no effect sizes, and no confidence intervals.

(a) List at least four problems with this reporting.

(b) The $p$-value is barely below 0.05. Referring to Chapter 17's discussion of "the garden of forking paths," what questions should you ask about how the researchers arrived at this analysis?

(c) Rewrite the results paragraph to meet the reporting standards from this chapter.

I.3. (Theme 5 — Multiple comparisons inflate false positives) Design a simulation study that demonstrates the multiple comparisons problem:

import numpy as np
from scipy import stats

np.random.seed(42)
n_simulations = 10000
n_per_group = 20
k_groups = 5
alpha = 0.05

# Count how often at least one pairwise t-test is significant
# when ALL groups come from the same distribution
false_positive_ttest = 0
false_positive_anova = 0

for _ in range(n_simulations):
    groups = [np.random.normal(50, 10, n_per_group) for _ in range(k_groups)]

    # Method 1: Pairwise t-tests
    found_sig = False
    for i in range(k_groups):
        for j in range(i+1, k_groups):
            _, p = stats.ttest_ind(groups[i], groups[j])
            if p < alpha:
                found_sig = True
                break
        if found_sig:
            break
    if found_sig:
        false_positive_ttest += 1

    # Method 2: ANOVA
    _, p_anova = stats.f_oneway(*groups)
    if p_anova < alpha:
        false_positive_anova += 1

print(f"Multiple t-tests false positive rate: "
      f"{false_positive_ttest/n_simulations:.3f}")
print(f"ANOVA false positive rate: "
      f"{false_positive_anova/n_simulations:.3f}")

Run this simulation and comment on the results. What does it demonstrate about why ANOVA is preferred?


Part J: Mixed Practice ⭐⭐

J.1. For each scenario, state whether you would use a t-test, ANOVA, chi-square test, or nonparametric test, and explain why:

(a) Comparing mean blood pressure across 4 age groups

(b) Comparing the proportion of voters supporting a candidate in 5 states

(c) Comparing mean salaries between male and female employees

(d) Comparing median home prices across 6 neighborhoods (data are heavily right-skewed)

(e) Determining whether test scores differ across 3 classrooms with 8 students each (scores appear non-normal)

J.2. A sports analyst reports the following for scoring averages across four basketball positions:

$F(3, 116) = 8.45$, $p < 0.001$, $\eta^2 = 0.18$

Tukey's HSD reveals:

  • Centers vs. Guards: $p = 0.001$, mean difference = 5.2
  • Centers vs. Forwards: $p = 0.543$, mean difference = 1.1
  • Centers vs. Wings: $p = 0.023$, mean difference = 3.8
  • Guards vs. Forwards: $p = 0.008$, mean difference = 4.1
  • Guards vs. Wings: $p = 0.412$, mean difference = 1.4
  • Forwards vs. Wings: $p = 0.089$, mean difference = 2.7

(a) Which specific positions score significantly differently from each other?

(b) Write a one-paragraph summary of these results suitable for a coaching report.

(c) Calculate how many of the 6 comparisons are significant. Does this pattern make sense given $\eta^2 = 0.18$?


Part K: Spaced Review from Earlier Chapters ⭐

K.1. (From Ch.6) ANOVA uses variance to compare means. In your own words, explain the connection between the sample variance formula $s^2 = \sum(x_i - \bar{x})^2 / (n-1)$ and the within-group mean square $MS_W = SS_W/(N-k)$. Why does $MS_W$ divide by $N - k$ instead of $N - 1$?
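
The identity behind this question can also be verified numerically: $MS_W$ is exactly the pooled (weighted-average) sample variance, $MS_W = \sum_i (n_i - 1) s_i^2 / (N - k)$. A small sketch with made-up data:

```python
import numpy as np

groups = [np.array([12.0, 15, 14, 13]),
          np.array([18.0, 20, 17, 19, 21])]
N = sum(len(g) for g in groups)
k = len(groups)

# MS_W from the pooled sums of squared deviations
ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_w = ss_w / (N - k)

# The same value as a weighted average of the group variances (ddof=1)
pooled = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (N - k)
```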

K.2. (From Ch.16) Show that for two groups, a one-way ANOVA and a two-sample t-test give the same p-value. Test this with the following data:

from scipy import stats

group1 = [12, 15, 14, 13, 16]
group2 = [18, 20, 17, 19, 21]

# Two-sample t-test
t_stat, p_ttest = stats.ttest_ind(group1, group2)

# One-way ANOVA
F_stat, p_anova = stats.f_oneway(group1, group2)

print(f"t-test: t = {t_stat:.4f}, p = {p_ttest:.6f}")
print(f"ANOVA:  F = {F_stat:.4f}, p = {p_anova:.6f}")
print(f"t² = {t_stat**2:.4f}, F = {F_stat:.4f}")

K.3. (From Ch.17) A study with $k = 3$ groups and $n = 15$ per group finds $\eta^2 = 0.04$. Using the effect size benchmarks from both Chapter 17 (Cohen's $d$) and this chapter ($\eta^2$), characterize this as a small, medium, or large effect. How does the conversion between $\eta^2$ and Cohen's $f$ work? ($f = \sqrt{\eta^2 / (1 - \eta^2)}$)
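
The hint's conversion is a one-liner you can verify in Python (the function name is ours):

```python
from math import sqrt

def cohens_f_from_eta_sq(eta_sq):
    """Convert eta^2 to Cohen's f: f = sqrt(eta^2 / (1 - eta^2))."""
    return sqrt(eta_sq / (1 - eta_sq))

f = cohens_f_from_eta_sq(0.04)
```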