Exercises: Comparing Two Groups

These exercises progress from conceptual understanding through applied two-group tests, condition checking, test selection, and Python/Excel implementation. Estimated completion time: 3.5 hours.

Difficulty Guide:

  • ⭐ Foundational (5-10 min each)
  • ⭐⭐ Intermediate (10-20 min each)
  • ⭐⭐⭐ Challenging (20-40 min each)
  • ⭐⭐⭐⭐ Advanced/Research (40+ min each)


Part A: Conceptual Understanding ⭐

A.1. Explain in your own words the difference between independent samples and paired (dependent) samples. Give one example of each from everyday life (not from the textbook).

A.2. A classmate says: "I compared the average test scores of my morning class and afternoon class using a paired t-test, because both groups took the same test." Explain why this is incorrect. What test should be used, and why?

A.3. True or false (explain each):

(a) If two groups have different sample sizes, you cannot use the two-sample t-test.

(b) The paired t-test uses $df = n_1 + n_2 - 2$, where $n_1$ and $n_2$ are the sizes of the two groups.

(c) Welch's t-test is preferred over the equal-variance (pooled) t-test because it works whether or not the population variances are equal.

(d) A statistically significant difference between two groups proves that the treatment caused the difference.

(e) The two-proportion z-test uses the pooled proportion in the standard error for the test statistic, but not for the confidence interval.

(f) If two individual confidence intervals (one for each group) overlap, the difference between the groups cannot be statistically significant.

A.4. Why does the standard error for the difference in two means add the variances of the two sample means rather than their standard deviations? Give an intuitive explanation.

A.5. A researcher measures the blood pressure of 30 patients before and after a new medication. She reports: "I used a two-sample t-test and found no significant difference." What mistake did she make, and how might it have affected her results?

A.6. Explain why the paired t-test is often more powerful than the independent-samples t-test when the data are genuinely paired. Use the concepts of "between-subject variability" and "within-subject variability" in your explanation.


Part B: Identifying the Right Test ⭐

B.1. For each scenario, identify: (i) whether the comparison involves means or proportions, (ii) whether the data are independent or paired, (iii) which test to use, and (iv) the null and alternative hypotheses.

(a) A pharmaceutical company randomly assigns 200 patients to receive Drug A and 200 to receive Drug B. They compare average recovery times.

(b) A tutoring center measures 40 students' math scores before and after an 8-week tutoring program.

(c) A political scientist compares the proportion of voters who support a ballot measure in two different counties.

(d) An ophthalmologist measures the intraocular pressure in the left eye and right eye of 50 patients.

(e) A company compares the proportion of defective items from Factory A vs. Factory B.

(f) A sports analyst compares the same 20 cities' marathon participation rates in 2019 vs. 2024.

(g) A psychologist compares reaction times of 60 subjects randomly assigned to a caffeine group or a placebo group.

(h) A restaurant measures customer satisfaction on the same 15 dishes using two different presentation styles.

B.2. Sam is comparing Daria's three-point shooting percentage this season (25/65 = 38.5%) to last season (60/200 = 30.0%). Should this be analyzed as: (a) A paired t-test on individual game data? (b) A two-proportion z-test on season totals? (c) Either, depending on how the data are structured?

Explain the tradeoffs of each approach. Under what circumstances would each be appropriate?


Part C: Two-Sample t-Tests (Independent Groups) ⭐⭐

C.1. A university wants to compare the average GPA of students in its honors program to students not in the honors program.

  • Honors: $n_1 = 45$, $\bar{x}_1 = 3.52$, $s_1 = 0.31$
  • Non-honors: $n_2 = 120$, $\bar{x}_2 = 3.01$, $s_2 = 0.58$

(a) State the hypotheses for a two-tailed test. (b) Check the conditions for a two-sample t-test. (c) Calculate the test statistic using Welch's formula. (d) Find the approximate p-value. (e) State your conclusion at $\alpha = 0.05$. (f) Construct a 95% CI for the difference in means. (g) Is this an observational study or an experiment? Can you conclude that the honors program causes higher GPAs?

C.2. An education researcher compares standardized reading scores between two school districts.

  • District A: $n_1 = 35$, $\bar{x}_1 = 78.3$, $s_1 = 12.4$
  • District B: $n_2 = 42$, $\bar{x}_2 = 82.1$, $s_2 = 10.8$

(a) Conduct a two-tailed Welch's t-test at $\alpha = 0.05$. (b) Construct a 95% CI for $\mu_A - \mu_B$. (c) Does the CI support your test conclusion? Explain.

C.3. Alex is testing a new feature on StreamVibe. Users randomly assigned to the feature spent an average of 52.3 minutes per session ($n = 180$, $s = 22.1$), while the control group spent 48.7 minutes ($n = 175$, $s = 19.8$).

(a) Conduct the appropriate test. (b) Construct a 95% CI for the difference. (c) If the minimum practically meaningful difference is 5 minutes, what does the CI tell you about practical significance?
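For checking hand calculations in Part C, the Welch computation from summary statistics can be sketched in Python. This is a minimal sketch using only the standard library; the function name `welch_from_summary` and the numbers passed to it are hypothetical and are not taken from any exercise. It returns the test statistic, the Welch-Satterthwaite degrees of freedom, and the standard error. The p-value and the critical value $t^*$ still have to come from a t table or statistical software.

```python
import math

def welch_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Return (t, df, se) for Welch's two-sample t-test from summary statistics."""
    v1 = sd1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = sd2 ** 2 / n2  # estimated variance of the second sample mean
    se = math.sqrt(v1 + v2)  # variances add; standard deviations do not
    t = (mean1 - mean2) / se
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, se

# Hypothetical summary statistics, not taken from any exercise
t_stat, df, se = welch_from_summary(25, 10.0, 2.0, 25, 9.0, 2.0)
print(f"t = {t_stat:.3f}, df = {df:.1f}, SE = {se:.3f}")
```

With equal sample sizes and equal standard deviations, as in this hypothetical call, the Welch degrees of freedom reduce to $n_1 + n_2 - 2$, which makes a convenient sanity check.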


Part D: Paired t-Tests ⭐⭐

D.1. A fitness program measures participants' resting heart rates (bpm) before and after a 12-week program:

Participant   Before   After   Difference
1             78       72      ?
2             82       78      ?
3             75       74      ?
4             88       80      ?
5             71       70      ?
6             85       76      ?
7             79       75      ?
8             90       82      ?
9             73       71      ?
10            86       78      ?

(a) Compute the differences (Before − After). Is a positive difference an improvement? (b) Calculate $\bar{d}$ and $s_d$. (c) State the hypotheses for a one-tailed test (testing whether heart rate decreased). (d) Compute the test statistic. (e) Find the p-value. (f) Construct a 95% CI for the mean decrease. (g) Write a conclusion in context.

D.2. A consumer group tests whether there's a price difference for identical grocery items at two competing stores. They record prices for 20 randomly selected items at both stores. The mean price difference (Store A − Store B) is $\bar{d} = \$0.23$ with $s_d = \$0.85$.

(a) Conduct a paired t-test at $\alpha = 0.05$. (b) Construct a 95% CI for the mean price difference. (c) Is the mean difference of \$0.23 practically significant for a typical grocery trip of 30 items?

D.3. A psychologist measures anxiety scores (on a 0-100 scale) for 15 patients before and after a therapy program. The summary statistics for the differences (Before − After) are: $\bar{d} = 8.3$, $s_d = 12.1$.

(a) Conduct a two-tailed paired t-test. (b) What concerns might you have about normality with $n = 15$? (c) How would you check the normality condition, and what would you do if it were violated?
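The paired calculations in Part D all reduce to a one-sample analysis of the differences. The sketch below shows that reduction with the standard library only. The function name `paired_t_from_data` and the small dataset are hypothetical and are not drawn from any exercise. The p-value is left to a t table or software with $df = n - 1$.

```python
import math
import statistics

def paired_t_from_data(before, after):
    """Return (mean_diff, sd_diff, t, df) for a paired t-test on before - after."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, with n - 1 in the denominator
    t = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t, n - 1

# Hypothetical measurements, not taken from any exercise
before = [10, 12, 9, 11, 13]
after = [8, 11, 9, 8, 10]

mean_diff, sd_diff, t_stat, df = paired_t_from_data(before, after)
print(f"mean diff = {mean_diff:.2f}, s_d = {sd_diff:.3f}, t = {t_stat:.3f}, df = {df}")
```

Note that only the differences enter the test statistic; the spread of the original Before and After columns never appears.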


Part E: Two-Proportion z-Tests ⭐⭐

E.1. A hospital compares surgical infection rates between two sterilization protocols.

  • Protocol A: 12 infections in 450 surgeries
  • Protocol B: 23 infections in 480 surgeries

(a) State the hypotheses. (b) Compute the pooled proportion. (c) Calculate the test statistic. (d) Find the p-value. (e) Construct a 95% CI for $p_A - p_B$. (f) Conclude at $\alpha = 0.05$.

E.2. Professor Washington examines whether the algorithm's false positive rate differs by race. Among defendants who did not re-offend:

  • White defendants: 38 out of 285 were flagged as high-risk (false positives)
  • Black defendants: 67 out of 215 were flagged as high-risk (false positives)

(a) Compute the false positive rates for each group. (b) Conduct a two-proportion z-test. (c) Construct a 95% CI for the difference in false positive rates. (d) Discuss the practical and ethical implications of this finding.

E.3. A public health department compares vaccination rates:

  • Urban area: 682 out of 850 adults vaccinated
  • Rural area: 541 out of 820 adults vaccinated

(a) Conduct a two-proportion z-test. (b) Construct a 95% CI for the difference. (c) Maya notes that the urban and rural areas differ in average age, income, and access to clinics. How does this affect the interpretation of the results?
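The Part E workflow (pooled standard error for the test statistic, unpooled standard error for the confidence interval) can be sketched with the standard library's `statistics.NormalDist`. The function name `two_proportion_z` and the counts used are hypothetical and are not taken from any exercise; the test shown is two-tailed.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2, conf_level=0.95):
    """Return (z, two-tailed p-value, CI for p1 - p2)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # Test statistic: pooled proportion, because H0 assumes p1 = p2
    p_pool = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled SE, because no equality is assumed
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    ci = (diff - z_star * se_unpooled, diff + z_star * se_unpooled)
    return z, p_value, ci

# Hypothetical counts, not taken from any exercise
z, p_value, ci = two_proportion_z(90, 300, 60, 300)
print(f"z = {z:.3f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```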


Part F: Condition Checking ⭐⭐

F.1. For each scenario, identify which condition(s) for the two-sample t-test might be violated:

(a) Comparing exam scores of students in two sections of the same course. One section has 8 students, and the exam scores are heavily right-skewed.

(b) Comparing daily sales at two store locations over the same 30-day period. Sales on one day may be correlated with sales on the next day at the same location.

(c) Comparing customer wait times at two restaurants. The sample from Restaurant A includes only Friday evening data, while Restaurant B data spans the entire week.

(d) Comparing average home prices in two neighborhoods, using data from 200 homes in each. Both distributions are right-skewed.

F.2. A researcher has $n_1 = 8$ observations from Group 1 and $n_2 = 10$ from Group 2. Both groups appear roughly symmetric with no outliers. Is a two-sample t-test appropriate? What additional information would help you decide?

F.3. Explain why the success-failure condition for the two-proportion z-test uses the pooled proportion rather than the individual sample proportions.
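As a companion to F.3, the success-failure check for the test can be written as a small helper. This is a sketch under stated assumptions: the function name `success_failure_ok` is hypothetical, the counts are invented, and the threshold of 10 expected successes and failures per group is one common textbook convention (some texts use 5).

```python
def success_failure_ok(x1, n1, x2, n2, threshold=10):
    """Check expected successes and failures in both groups under the pooled proportion."""
    p_pool = (x1 + x2) / (n1 + n2)
    expected = [
        n1 * p_pool, n1 * (1 - p_pool),  # group 1: expected successes, failures
        n2 * p_pool, n2 * (1 - p_pool),  # group 2: expected successes, failures
    ]
    return all(count >= threshold for count in expected), expected

# Hypothetical counts, not taken from any exercise
small_ok, small_expected = success_failure_ok(4, 50, 6, 50)
large_ok, large_expected = success_failure_ok(30, 100, 50, 100)
print(small_ok, [round(c, 1) for c in small_expected])
print(large_ok, [round(c, 1) for c in large_expected])
```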


Part G: Connecting Tests to Study Design ⭐⭐

G.1. For each of the following findings, state whether the conclusion is justified. If not, explain why.

(a) "Students who use the new study app scored 8 points higher on the exam than students who didn't. The two-sample t-test was significant at $\alpha = 0.05$. Therefore, the app improves exam scores."

(b) "We randomly assigned participants to a meditation group or a control group. The meditation group had significantly lower stress scores ($p = 0.003$). We conclude that meditation reduces stress."

(c) "Countries with higher education spending have lower crime rates. The two-sample t-test comparing high-spending vs. low-spending countries was significant. Therefore, increasing education spending will reduce crime."

G.2. Alex runs an A/B test and finds $p = 0.03$. A colleague says: "That's significant, but the difference is only 4.5 minutes. Is it worth changing the algorithm?" Discuss how statistical significance and practical significance differ in this context. What additional information would help Alex make the business decision?


Part H: Integrated Problems ⭐⭐⭐

H.1. A company tests whether a new website design increases the proportion of visitors who make a purchase. They randomly show visitors either the old design or the new design:

  • Old design: 84 purchases out of 1,200 visitors ($\hat{p}_1 = 0.070$)
  • New design: 112 purchases out of 1,250 visitors ($\hat{p}_2 = 0.0896$)

(a) Conduct a one-tailed two-proportion z-test ($H_a: p_{\text{new}} > p_{\text{old}}$). (b) Construct a 95% CI for the difference in purchase rates. (c) If the company averages \$50 in profit per purchase and gets 500,000 visitors per month, estimate the monthly profit impact of switching to the new design. (d) Given the CI from part (b), what is the range of plausible monthly profit impacts?
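Parts (c) and (d) of H.1 convert a difference in purchase rates into a monthly dollar figure. The arithmetic can be sketched as below. The function name `monthly_profit_change` and every number used (the interval endpoints, visitor count, and profit per purchase) are hypothetical and deliberately differ from the exercise.

```python
def monthly_profit_change(rate_difference, monthly_visitors, profit_per_purchase):
    """Extra monthly profit implied by a given difference in purchase rates."""
    extra_purchases = rate_difference * monthly_visitors
    return extra_purchases * profit_per_purchase

# Hypothetical CI for the difference in purchase rates, not from the exercise
ci_low, ci_high = 0.005, 0.025
visitors = 100_000        # hypothetical monthly visitors
profit_each = 40.0        # hypothetical profit per purchase, in dollars

low_profit = monthly_profit_change(ci_low, visitors, profit_each)
high_profit = monthly_profit_change(ci_high, visitors, profit_each)
print(f"Plausible monthly impact: ${low_profit:,.0f} to ${high_profit:,.0f}")
```

Applying the same function to both endpoints of the interval carries the statistical uncertainty through to the business estimate.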

H.2. A clinical trial compares the effectiveness of two weight-loss programs. Researchers randomly assign 30 participants to Program A and 30 to Program B, measuring weight loss (in pounds) after 12 weeks.

  • Program A: $\bar{x}_1 = 12.4$ lbs, $s_1 = 6.8$
  • Program B: $\bar{x}_2 = 9.1$ lbs, $s_2 = 7.2$

(a) Conduct a Welch's t-test. (b) Construct a 95% CI for the difference. (c) Would you recommend Program A over Program B based on these results alone? What else would you want to know?

H.3. An environmental scientist measures air quality (PM2.5 levels, μg/m³) at 15 locations in a city before and after a new emissions regulation takes effect.

Location   Before   After
1          35.2     28.4
2          42.1     38.7
3          28.6     25.1
4          51.3     40.2
5          33.8     30.5
6          45.9     39.1
7          38.4     34.8
8          29.7     27.3
9          47.5     41.2
10         36.1     31.8
11         41.8     36.4
12         55.2     44.1
13         31.4     29.6
14         44.3     38.5
15         39.6     35.2

(a) Why is this a paired design? (b) Compute the differences and summary statistics. (c) Check the conditions for a paired t-test. (d) Conduct the test. Did air quality improve significantly? (e) Construct a 95% CI for the average improvement. (f) A city official claims the regulation reduced PM2.5 by "at least 5 μg/m³ on average." Does your CI support this claim? (g) What potential confounders could explain the improvement besides the regulation?


Part I: Python Practice ⭐⭐

I.1. Use scipy.stats.ttest_ind() to compare the following two datasets:

group_a = [23, 28, 31, 25, 27, 30, 22, 29, 26, 33,
           24, 28, 35, 27, 31, 26, 29, 32, 25, 30]
group_b = [19, 24, 27, 21, 23, 26, 20, 25, 22, 28,
           18, 24, 30, 23, 27, 22, 25, 28, 21, 26]

(a) Conduct Welch's t-test. Report the t-statistic and p-value. (b) Construct a 95% CI for the difference in means. (c) Create side-by-side box plots of the two groups.
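As a starting point for I.1, the following sketch shows the general call pattern on hypothetical data (`sample_x` and `sample_y` are invented and deliberately differ from `group_a` and `group_b`). It assumes SciPy and NumPy are installed. Passing `equal_var=False` to `scipy.stats.ttest_ind()` requests Welch's test; the confidence interval is built by hand from the Welch standard error and degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical data, not the exercise data
sample_x = [12.1, 14.3, 13.8, 15.2, 12.9, 14.7, 13.5, 15.0]
sample_y = [11.0, 12.4, 10.9, 13.1, 11.8, 12.2, 10.5]

# equal_var=False selects Welch's t-test (two-tailed by default)
result = stats.ttest_ind(sample_x, sample_y, equal_var=False)

# Manual 95% CI for the difference in means
x, y = np.asarray(sample_x), np.asarray(sample_y)
v_x = x.var(ddof=1) / len(x)  # ddof=1 gives the sample variance
v_y = y.var(ddof=1) / len(y)
se = np.sqrt(v_x + v_y)
df = (v_x + v_y) ** 2 / (v_x ** 2 / (len(x) - 1) + v_y ** 2 / (len(y) - 1))
t_star = stats.t.ppf(0.975, df)
diff = x.mean() - y.mean()
ci = (diff - t_star * se, diff + t_star * se)

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print(f"95% CI for the difference: ({ci[0]:.3f}, {ci[1]:.3f})")
```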

I.2. Use scipy.stats.ttest_rel() to test whether scores improved in this before/after dataset:

before = [72, 68, 85, 78, 62, 90, 75, 81, 69, 77,
          83, 71, 88, 74, 66]
after  = [78, 74, 88, 82, 70, 92, 80, 85, 75, 82,
          87, 76, 91, 79, 73]

(a) Compute and plot the differences. (b) Check the normality of the differences using a histogram and Shapiro-Wilk test. (c) Conduct the paired t-test. Report results.
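For I.2, the sketch below illustrates the call pattern on hypothetical scores (`pre` and `post` are invented and are not the exercise data). It assumes SciPy 1.6 or later, where `ttest_rel()` and `ttest_1samp()` accept the `alternative` argument. The last lines confirm that a paired t-test is the same computation as a one-sample t-test on the differences.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after scores, not the exercise data
pre = np.array([50, 55, 47, 62, 58, 51, 60, 49])
post = np.array([54, 57, 50, 66, 59, 56, 63, 50])
diffs = post - pre  # positive values mean the score went up

# Shapiro-Wilk test of normality on the differences (not on the raw scores)
shapiro_stat, shapiro_p = stats.shapiro(diffs)

# One-tailed paired test: H_a is that the mean of (post - pre) exceeds 0
result = stats.ttest_rel(post, pre, alternative="greater")

# Equivalent one-sample test on the differences
check = stats.ttest_1samp(diffs, 0.0, alternative="greater")

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Paired t = {result.statistic:.3f}, one-tailed p = {result.pvalue:.5f}")
print(f"One-sample t on differences = {check.statistic:.3f}")
```

The order of the arguments matters for a one-tailed test: `ttest_rel(post, pre, ...)` analyzes `post - pre`, so "improved" corresponds to `alternative="greater"`.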

I.3. Use statsmodels.stats.proportion.proportions_ztest() to test:

  • Group 1: 145 successes out of 500
  • Group 2: 178 successes out of 520

(a) Conduct a two-tailed two-proportion z-test. (b) Compute the 95% CI for the difference in proportions manually (using the unpooled SE formula).


Part J: Excel Practice ⭐⭐

J.1. Given data in columns A (Group 1) and B (Group 2):

(a) Write the Excel formula for a Welch's two-tailed t-test. (b) Write formulas to compute the difference in means and the standard error. (c) Compute a 95% CI for the difference using Excel functions.

J.2. For a two-proportion z-test with:

  • Group 1: 45 successes in 300 trials
  • Group 2: 62 successes in 320 trials

Write the Excel formulas for: (a) The two sample proportions (b) The pooled proportion (c) The standard error (d) The z-statistic (e) The two-tailed p-value


Part K: Critical Thinking ⭐⭐⭐

K.1. A news headline reads: "Study Finds No Difference Between Organic and Conventional Produce in Nutrient Content (p = 0.48)." A friend concludes: "So organic food is no healthier." Identify at least three problems with this conclusion.

K.2. Two researchers study the same question — whether a tutoring program improves test scores. Researcher A uses an independent-samples design (50 tutored, 50 untutored students) and finds $p = 0.08$. Researcher B uses a paired design (30 students tested before and after tutoring) and finds $p = 0.002$. Explain how both results can be valid and why the paired design produced a more significant result despite having fewer participants.

K.3. A data scientist at a tech company runs 20 A/B tests simultaneously and finds that 3 of them are "statistically significant" at $\alpha = 0.05$. Should the company implement all three changes? Explain the multiple testing problem in this context and suggest a solution.
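The arithmetic behind the multiple testing problem in K.3 can be explored with a short sketch. It rests on two loudly stated assumptions: the tests are independent, and every null hypothesis is true. The function names and the choice of 10 tests are hypothetical; the Bonferroni correction shown is one of several possible adjustments, not the only valid answer.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """P(at least one false positive), assuming independent tests and all nulls true."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / num_tests

num_tests = 10  # hypothetical number of simultaneous tests
fwer = familywise_error_rate(num_tests)
expected_false_positives = num_tests * 0.05
per_test_alpha = bonferroni_threshold(num_tests)

print(f"Familywise error rate: {fwer:.3f}")
print(f"Expected false positives: {expected_false_positives:.2f}")
print(f"Bonferroni per-test threshold: {per_test_alpha:.4f}")
```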

K.4. Maya compares diabetes rates between an industrial community and a suburban community and finds a significant difference. A colleague argues: "This just reflects income differences, not pollution." How could Maya design a follow-up study to address this concern? What role does Chapter 4's distinction between observational and experimental studies play here?


Part L: Synthesis and Reflection ⭐⭐⭐⭐

L.1. The Complete Analysis. Sam has data on the Riverside Raptors' performance in the first half (games 1-20) and second half (games 21-41) of the season. He also has data on five other teams in the division.

(a) Design a study comparing the Raptors' first-half vs. second-half performance using the appropriate test. Justify your choice. (b) Design a study comparing the Raptors' overall performance to the division average. What test would you use? (c) Design a study comparing the Raptors' win percentage to the league's best team. What test would you use? (d) For each analysis, discuss whether a significant finding would support a causal interpretation.

L.2. Meta-Reflection. You now have three tools for comparing two groups. Write a short essay (300-500 words) explaining: (a) How these three tests are fundamentally similar (same logic, different details) (b) Why choosing the right test matters (give a concrete example of what goes wrong with the wrong choice) (c) How the study design (experimental vs. observational) limits the conclusions you can draw from any of these tests

L.3. Research Design. Design a study to test whether a new employee wellness program reduces sick days. (a) Describe an independent-samples design. What are its strengths and weaknesses? (b) Describe a paired (before-after) design. What are its strengths and weaknesses? (c) Which design would you recommend? Why? (d) What sample size considerations would you need to address? (Preview for Chapter 17.)