Exercises: Chi-Square Tests: Categorical Data Analysis
These exercises progress from conceptual understanding through goodness-of-fit tests, tests of independence, effect sizes, residual analysis, and Python implementation. Estimated completion time: 3.5 hours.
Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)
Part A: Conceptual Understanding ⭐
A.1. In your own words, explain how the chi-square test differs from the $t$-test and $z$-test. What type of data does each handle, and what kind of question does each answer?
A.2. Why do we divide $(O - E)^2$ by $E$ in the chi-square formula, rather than just summing $(O - E)^2$? Give a specific example where omitting this division would be misleading.
A.3. True or false (explain each):
(a) A chi-square statistic can be negative.
(b) The chi-square test of independence tells you which cells are responsible for the association.
(c) If all observed frequencies equal the expected frequencies, the chi-square statistic is 0.
(d) The expected frequency condition requires that all observed counts be at least 5.
(e) Cramer's V is independent of sample size.
(f) A two-category goodness-of-fit test gives the same p-value as a two-tailed one-proportion $z$-test.
A.4. Explain why the chi-square test is always one-tailed (right-tailed). Why can't a very small chi-square value be evidence against $H_0$?
A.5. A researcher finds $\chi^2 = 15.3$ with $p = 0.002$ for a test of independence. She concludes: "Race causes differences in sentencing outcomes." What is wrong with this conclusion? Identify at least two problems.
A.6. In one sentence each, state the key difference between:
(a) A goodness-of-fit test and a test of independence
(b) Observed frequencies and expected frequencies
(c) The chi-square statistic and Cramer's V
(d) A chi-square test of independence and a chi-square test of homogeneity
Part B: Expected Frequencies and the Chi-Square Statistic ⭐
B.1. A marketing team surveyed 300 customers about their preferred social media platform:
| Platform | Observed |
|---|---|
| | 95 |
| TikTok | 82 |
| Twitter/X | 48 |
| | 45 |
| | 30 |
(a) If the null hypothesis is that all platforms are equally preferred, what are the expected frequencies?
(b) Compute the chi-square statistic by hand.
(c) How many degrees of freedom does this test have?
(d) Using Python or a chi-square table, find the p-value. What is your conclusion at $\alpha = 0.05$?
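A hand calculation like the one in B.1 can be checked by building the statistic directly from the formula. The sketch below is one possible approach, assuming NumPy and SciPy are installed; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts from the B.1 table (n = 300, five platforms)
observed = np.array([95, 82, 48, 45, 30])

# Under H0 (all platforms equally preferred) each category expects n / k
k = observed.size
expected = np.full(k, observed.sum() / k)

# Per-category contributions (O - E)^2 / E, then their sum
contributions = (observed - expected) ** 2 / expected
chi2_stat = contributions.sum()

df = k - 1
p_value = chi2.sf(chi2_stat, df)  # right-tail area under the chi-square curve

print("expected:", expected)
print("contributions:", np.round(contributions, 3))
print(f"chi-square = {chi2_stat:.3f}, df = {df}, p = {p_value:.3g}")
```

Printing the per-category contributions alongside the total makes it easier to spot an arithmetic slip in a single row of the hand calculation.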
B.2. A genetics student crosses two pea plants and expects the offspring to show a 9:3:3:1 ratio of phenotypes. Out of 160 offspring, she observes:
| Phenotype | Observed |
|---|---|
| Round, Yellow | 86 |
| Round, Green | 35 |
| Wrinkled, Yellow | 26 |
| Wrinkled, Green | 13 |
(a) Calculate the expected frequencies based on the 9:3:3:1 ratio.
(b) Compute the chi-square statistic.
(c) With $df = 3$, find the p-value. Does the data support the expected genetic ratio?
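When the hypothesized distribution is not uniform, as with the 9:3:3:1 ratio in B.2, the expected counts have to be passed in explicitly. A minimal sketch, assuming SciPy's `chisquare` function with its `f_exp` argument:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([86, 35, 26, 13])   # counts from the B.2 table
ratio = np.array([9, 3, 3, 1])          # hypothesized 9:3:3:1 ratio

# Convert the ratio to proportions, then scale to the sample size
expected = observed.sum() * ratio / ratio.sum()

# chisquare uses df = k - 1 by default, which is 3 here
result = chisquare(f_obs=observed, f_exp=expected)

print("expected:", expected)
print(f"chi-square = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

Note that `chisquare` expects the observed and expected counts to have the same total, which holds automatically when the expected counts are proportions multiplied by $n$.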
B.3. Consider this contingency table:
| | Category A | Category B | Total |
|---|---|---|---|
| Group 1 | 40 | 60 | 100 |
| Group 2 | 60 | 40 | 100 |
| Total | 100 | 100 | 200 |
(a) Calculate all four expected frequencies.
(b) Compute the chi-square statistic.
(c) How many degrees of freedom?
(d) This is a $2 \times 2$ table. Verify that $\chi^2 = z^2$ by computing the two-proportion $z$-test statistic from Chapter 16 (Group 1's proportion in Category A vs. Group 2's proportion).
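The relationship $\chi^2 = z^2$ in B.3 can also be checked numerically. The sketch below assumes SciPy is available and uses a pooled-proportion $z$-test; note that `chi2_contingency` applies the Yates continuity correction to $2 \times 2$ tables by default, so it is switched off here to match the hand formula.

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Rows: Group 1, Group 2; columns: Category A, Category B
table = np.array([[40, 60],
                  [60, 40]])

# correction=False disables the Yates continuity correction
chi2_stat, p_chi2, df, expected = chi2_contingency(table, correction=False)

# Two-proportion z-test on the Category A proportions, pooled standard error
n1, n2 = table.sum(axis=1)
p1, p2 = table[0, 0] / n1, table[1, 0] / n2
p_pool = table[:, 0].sum() / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_z = 2 * norm.sf(abs(z))  # two-tailed p-value

print(f"chi-square = {chi2_stat:.4f}, z^2 = {z**2:.4f}")
print(f"p (chi-square) = {p_chi2:.5f}, p (z) = {p_z:.5f}")
```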
Part C: Goodness-of-Fit Tests ⭐⭐
C.1. A six-sided die is rolled 120 times with the following results:
| Face | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Count | 25 | 17 | 22 | 18 | 16 | 22 |
(a) State the null and alternative hypotheses.
(b) Calculate expected frequencies.
(c) Compute $\chi^2$ and find the p-value with $df = 5$.
(d) At $\alpha = 0.05$, is the die fair?
(e) Which face contributes most to $\chi^2$? What does this suggest?
C.2. A hospital emergency department records the day of the week for 700 patient arrivals:
| Day | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|---|
| Arrivals | 112 | 88 | 95 | 90 | 108 | 115 | 92 |
(a) Test whether ED arrivals are uniformly distributed across days of the week.
(b) If the test is significant, which days contribute most to the departure from uniformity?
(c) A colleague suggests that the hospital should expect more arrivals on weekends (Sat/Sun: 16% each) and fewer on weekdays (Mon-Fri: 13.6% each). Treating this proposed distribution as the null hypothesis, run a goodness-of-fit test to see whether the observed arrivals are consistent with it.
C.3. Maya is studying whether the distribution of blood types in her hospital's patient population matches the expected distribution for the U.S. population:
| Blood Type | Expected % | Observed (n = 500) |
|---|---|---|
| O+ | 37.4% | 205 |
| A+ | 35.7% | 162 |
| B+ | 8.5% | 55 |
| AB+ | 3.4% | 18 |
| O- | 6.6% | 28 |
| A- | 6.3% | 22 |
| B- | 1.5% | 7 |
| AB- | 0.6% | 3 |
(a) Compute expected frequencies for each blood type.
(b) Check the expected frequency condition. Are there any problems?
(c) If conditions are not met, suggest how to address the issue.
(d) Conduct the test (combining categories if necessary) and interpret the results.
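Checking the expected frequency condition in C.3 and merging sparse categories can be sketched as follows. The choice to merge B- with AB- is only one possible remedy and is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import chisquare

labels = ["O+", "A+", "B+", "AB+", "O-", "A-", "B-", "AB-"]
pct = np.array([37.4, 35.7, 8.5, 3.4, 6.6, 6.3, 1.5, 0.6])
observed = np.array([205, 162, 55, 18, 28, 22, 7, 3])

# Expected counts are the population percentages applied to n = 500
expected = observed.sum() * pct / 100
small = [lab for lab, e in zip(labels, expected) if e < 5]
print("expected counts below 5:", small)

# One possible remedy (an assumption, not the only option):
# merge the two rarest types, B- and AB-, into a single category
obs_merged = np.append(observed[:6], observed[6:].sum())
exp_merged = np.append(expected[:6], expected[6:].sum())

result = chisquare(f_obs=obs_merged, f_exp=exp_merged)
print(f"chi-square = {result.statistic:.3f}, df = {obs_merged.size - 1}, "
      f"p = {result.pvalue:.3f}")
```

Merging reduces the degrees of freedom by one for each category removed, so the merged test has $df = 6$ rather than 7.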
Part D: Tests of Independence ⭐⭐
D.1. A study examined the relationship between exercise frequency and self-reported stress level:
| | Low Stress | Moderate Stress | High Stress | Total |
|---|---|---|---|---|
| Frequent Exercise | 85 | 45 | 20 | 150 |
| Occasional Exercise | 55 | 70 | 25 | 150 |
| No Exercise | 30 | 55 | 65 | 150 |
| Total | 170 | 170 | 110 | 450 |
(a) Compute all expected frequencies.
(b) Verify that the conditions for a chi-square test are met.
(c) Compute the chi-square statistic.
(d) Find the p-value with the appropriate degrees of freedom.
(e) Compute Cramer's V and interpret the strength of association.
(f) Compute standardized residuals. Which cells show the strongest departures from independence?
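The steps in D.1 can be sketched with SciPy as shown below. Two assumptions are worth flagging: Cramer's V is computed as $\sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}$, and "standardized residuals" is taken to mean the Pearson residuals $(O - E)/\sqrt{E}$. If the chapter defines adjusted residuals instead, the last step would differ.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Frequent, Occasional, No exercise; columns: Low, Moderate, High stress
observed = np.array([[85, 45, 20],
                     [55, 70, 25],
                     [30, 55, 65]])

chi2_stat, p_value, df, expected = chi2_contingency(observed)

# Cramer's V = sqrt(chi2 / (n * min(r - 1, c - 1)))
n = observed.sum()
v = np.sqrt(chi2_stat / (n * (min(observed.shape) - 1)))

# Pearson residuals (O - E) / sqrt(E); assumed to match the chapter's definition
residuals = (observed - expected) / np.sqrt(expected)

print(f"all expected >= 5: {(expected >= 5).all()}")
print(f"chi-square = {chi2_stat:.2f}, df = {df}, p = {p_value:.3g}")
print(f"Cramer's V = {v:.3f}")
print(np.round(residuals, 2))
```

The same pattern applies to any $r \times c$ table in Parts D and I; only the `observed` array changes.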
D.2. Alex wants to know whether users who engage with StreamVibe's recommendation algorithm differ in their satisfaction ratings. Data from 400 users:
| | Satisfied | Neutral | Dissatisfied | Total |
|---|---|---|---|---|
| Uses Recommendations | 95 | 60 | 45 | 200 |
| Ignores Recommendations | 55 | 50 | 95 | 200 |
| Total | 150 | 110 | 140 | 400 |
(a) Conduct a complete chi-square test of independence.
(b) Calculate Cramer's V.
(c) Compute standardized residuals and interpret them.
(d) What business recommendation would Alex make based on these findings?
D.3. A political science researcher surveyed 600 registered voters about their party affiliation and their position on a proposed policy:
| | Support | Oppose | Undecided | Total |
|---|---|---|---|---|
| Party A | 120 | 60 | 20 | 200 |
| Party B | 50 | 110 | 40 | 200 |
| Independent | 70 | 80 | 50 | 200 |
| Total | 240 | 250 | 110 | 600 |
(a) Is party affiliation independent of position on the policy?
(b) Which party-position combinations show the largest departures from independence?
(c) How does this relate to the theme that "categorical data often describes people"?
Part E: Cramer's V and Effect Sizes ⭐⭐
E.1. Two studies examine the relationship between smoking status (smoker/non-smoker) and lung disease (yes/no):
- Study A: $n = 200$, $\chi^2 = 8.0$
- Study B: $n = 2{,}000$, $\chi^2 = 80.0$
(a) Compute Cramer's V for each study.
(b) What do you notice? What does this tell you about comparing chi-square statistics directly?
(c) Both studies have the same Cramer's V. Explain why the p-values would be very different.
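The comparison in E.1 can be reproduced from the summary statistics alone. In this sketch the helper `cramers_v` is a hypothetical name introduced for illustration, and the table is taken to be $2 \times 2$ (smoker/non-smoker by disease yes/no), so $df = 1$.

```python
import numpy as np
from scipy.stats import chi2

def cramers_v(chi2_stat, n, n_rows, n_cols):
    """Cramer's V from a chi-square statistic, sample size, and table shape."""
    return np.sqrt(chi2_stat / (n * min(n_rows - 1, n_cols - 1)))

# Both studies use a 2 x 2 table, so df = (2 - 1)(2 - 1) = 1
for name, n, stat in [("A", 200, 8.0), ("B", 2000, 80.0)]:
    v = cramers_v(stat, n, 2, 2)
    p = chi2.sf(stat, df=1)  # right-tail p-value
    print(f"Study {name}: V = {v:.3f}, p = {p:.3g}")
```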
E.2. A researcher reports: "We found a statistically significant association between major and career satisfaction ($\chi^2 = 14.2$, $df = 6$, $p = 0.027$)." The sample size was 2,500.
(a) Compute Cramer's V. (Hint: $df = (r-1)(c-1) = 6$ is consistent with a $4 \times 3$ table, among other shapes such as $7 \times 2$. Assume a $4 \times 3$ table, so that $\min(r-1, c-1) = 2$.)
(b) Is this a practically meaningful association? Justify your answer.
(c) What would you recommend the researcher include in their report alongside the p-value?
E.3. Match each Cramer's V value with the most appropriate scenario:
(a) $V = 0.85$: ______
(b) $V = 0.12$: ______
(c) $V = 0.35$: ______
Scenarios:
1. Knowing a customer's age group gives a moderate prediction of their product preference.
2. Gender almost perfectly predicts a person's name (in most cultures).
3. There's a slight tendency for coffee drinkers to prefer window seats at cafes.
Part F: Conditions and Troubleshooting ⭐⭐
F.1. For each scenario, determine whether the conditions for a chi-square test are met. If not, suggest a remedy.
(a) A $3 \times 4$ table with $n = 40$ where three cells have expected frequencies below 3.
(b) A survey where married couples were asked separately about their TV preferences, and both spouses' responses are included in the contingency table.
(c) A goodness-of-fit test with 8 categories where one expected count is 4.2 but all others exceed 20.
(d) A $2 \times 2$ table where one expected frequency is 3.8, although the data come from a true random sample of 50 people.
F.2. A researcher studying voting behavior creates this table:
| | Voted | Did Not Vote | Total |
|---|---|---|---|
| Urban | 85 | 15 | 100 |
| Suburban | 72 | 28 | 100 |
| Rural | 8 | 2 | 10 |
| Total | 165 | 45 | 210 |
(a) Calculate expected frequencies for all cells.
(b) Which cells have expected frequencies below 5?
(c) Propose two different ways to fix this problem. What are the trade-offs of each approach?
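For a contingency table like the one in F.2, the expected counts can be inspected before running any test. The sketch below uses `scipy.stats.contingency.expected_freq`; merging Suburban with Rural is shown purely as an illustration of how a remedy changes the expected counts, not as a recommended choice.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Rows: Urban, Suburban, Rural; columns: Voted, Did Not Vote
observed = np.array([[85, 15],
                     [72, 28],
                     [8, 2]])

expected = expected_freq(observed)
print(np.round(expected, 2))
print("cells (row, col) with E < 5:", np.argwhere(expected < 5).tolist())

# Illustrative remedy: merge the Suburban and Rural rows, then re-check
merged = np.vstack([observed[0], observed[1] + observed[2]])
print(np.round(expected_freq(merged), 2))
```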
Part G: Connecting to Prior Chapters ⭐⭐
G.1. (Connects to Ch.8) Return to the contingency table you built in Chapter 8's exercises. Now apply a chi-square test of independence to that table. Compare what you learned from descriptive analysis (conditional probabilities, marginal probabilities) to what the formal test reveals. Does the test confirm what you suspected from the descriptive analysis?
G.2. (Connects to Ch.14) Consider a goodness-of-fit test with only two categories: Success (observed = 65) and Failure (observed = 35), with $n = 100$. The hypothesized proportion of success is 0.50.
(a) Conduct the chi-square goodness-of-fit test.
(b) Conduct a two-tailed one-proportion $z$-test for the same data.
(c) Verify that $\chi^2 = z^2$ (within rounding error).
(d) Verify that both tests give the same p-value.
(e) When would you prefer the chi-square test over the $z$-test? When would you prefer the $z$-test?
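The equivalence explored in G.2 can be verified numerically. A minimal sketch, assuming SciPy and the standard one-proportion $z$ statistic with the null standard error $\sqrt{p_0(1-p_0)/n}$:

```python
import numpy as np
from scipy.stats import chisquare, norm

observed = np.array([65, 35])   # successes, failures
n = observed.sum()
p0 = 0.50                       # hypothesized proportion of success

# Goodness-of-fit test against expected counts n*p0 and n*(1 - p0)
gof = chisquare(observed, f_exp=n * np.array([p0, 1 - p0]))

# Two-tailed one-proportion z-test on the same data
p_hat = observed[0] / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_z = 2 * norm.sf(abs(z))

print(f"chi-square = {gof.statistic:.4f}, z^2 = {z**2:.4f}")
print(f"p (chi-square) = {gof.pvalue:.5f}, p (z) = {p_z:.5f}")
```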
G.3. (Connects to Ch.17) James found that bail decisions are associated with race ($\chi^2 = 25.48$, $df = 3$, $p < 0.001$, $V = 0.206$).
(a) Is this result statistically significant? Is it practically significant?
(b) Using the effect size benchmarks from Chapter 17 (adapted for Cramer's V), how would you characterize the strength of this association?
(c) James's sample is $n = 600$. If he repeated the study with $n = 6{,}000$ and found the same proportional pattern, what would happen to $\chi^2$? What would happen to Cramer's V?
(d) Write a paragraph interpreting these results as James might present them to a policy audience, incorporating both statistical and practical significance.
Part H: Python Implementation ⭐⭐⭐
H.1. Using Python, load a dataset of your choice (or use the sample data below) and conduct both types of chi-square test.
Sample data (student survey):
```python
import pandas as pd
import numpy as np

# Fix the seed so the simulated survey is reproducible
np.random.seed(42)
n = 400

# Each column is drawn independently with the listed category probabilities
data = pd.DataFrame({
    'major': np.random.choice(
        ['STEM', 'Business', 'Humanities', 'Social Science'],
        size=n, p=[0.30, 0.25, 0.20, 0.25]),
    'study_hours': np.random.choice(
        ['<10 hrs', '10-20 hrs', '>20 hrs'],
        size=n, p=[0.25, 0.50, 0.25]),
    'grade': np.random.choice(
        ['A', 'B', 'C', 'D/F'],
        size=n, p=[0.20, 0.35, 0.30, 0.15])
})
```
(a) Conduct a goodness-of-fit test to determine whether majors are equally distributed.
(b) Create a contingency table of major vs. grade using `pd.crosstab()`.
(c) Conduct a test of independence for major and grade.
(d) Compute Cramer's V and standardized residuals.
(e) Create a grouped bar chart visualizing the relationship.
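One way to approach parts (a) through (c) of H.1 is sketched below. To keep the block runnable on its own, it re-creates only the `major` and `grade` columns, so the random draws will not match the full three-column sample data exactly; treat the printed numbers as illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, chisquare

# Re-create two columns of the sample survey so the sketch runs on its own
np.random.seed(42)
n = 400
data = pd.DataFrame({
    "major": np.random.choice(
        ["STEM", "Business", "Humanities", "Social Science"],
        size=n, p=[0.30, 0.25, 0.20, 0.25]),
    "grade": np.random.choice(
        ["A", "B", "C", "D/F"],
        size=n, p=[0.20, 0.35, 0.30, 0.15]),
})

# (a) Goodness of fit: with no f_exp, chisquare tests equal frequencies
major_counts = data["major"].value_counts()
gof = chisquare(major_counts)

# (b)-(c) Contingency table and test of independence
table = pd.crosstab(data["major"], data["grade"])
chi2_stat, p_value, df, expected = chi2_contingency(table)

print(major_counts)
print(f"GOF: chi-square = {gof.statistic:.2f}, p = {gof.pvalue:.3f}")
print(table)
print(f"Independence: chi-square = {chi2_stat:.2f}, df = {df}, p = {p_value:.3f}")
```

For part (e), calling `table.plot(kind="bar")` on the crosstab is one way to draw a grouped bar chart, provided matplotlib is installed.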
H.2. Write a Python function that takes a contingency table (as a 2D NumPy array) and produces a complete formatted report including:
- Observed and expected frequency tables
- Chi-square statistic, degrees of freedom, and p-value
- Cramer's V with interpretation
- Standardized residuals with cells $|r| > 2$ flagged
- A visual heatmap of the standardized residuals
Test your function on James's bail data and Alex's StreamVibe data.
H.3. (Connects to Ch.18) Use a permutation test to replicate the chi-square test of independence for James's bail data:
(a) Pool all 600 observations and randomly shuffle the race labels 10,000 times.
(b) For each shuffled dataset, compute the chi-square statistic.
(c) Calculate the permutation p-value: the proportion of shuffled $\chi^2$ values that are $\geq$ the observed $\chi^2$.
(d) Compare the permutation p-value to the chi-square test p-value. Do they agree?
(e) When might the permutation approach be preferable to the chi-square approximation?
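The permutation procedure in H.3 can be sketched as follows. James's bail table is not reproduced in this exercise set, so the counts below are HYPOTHETICAL placeholders (four groups by two outcomes, $n = 600$) to be replaced with the table from the chapter. The helper `chi2_from_labels` is a name introduced here for illustration, and the number of shuffles is reduced from 10,000 to keep the sketch quick.

```python
import numpy as np

rng = np.random.default_rng(0)

def chi2_from_labels(rows, cols, n_rows, n_cols):
    """Chi-square statistic for two integer-coded label arrays."""
    # Cross-tabulate by mapping each (row, col) pair to a single cell index
    observed = np.bincount(rows * n_cols + cols,
                           minlength=n_rows * n_cols).reshape(n_rows, n_cols)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return ((observed - expected) ** 2 / expected).sum()

# HYPOTHETICAL placeholder counts (4 groups x 2 outcomes, n = 600);
# substitute the bail table from the chapter
table = np.array([[150, 50],
                  [130, 70],
                  [100, 50],
                  [40, 10]])
n_rows, n_cols = table.shape

# Expand the table into one (group, outcome) pair per observation
cell_rows, cell_cols = np.indices(table.shape)
rows = np.repeat(cell_rows.ravel(), table.ravel())
cols = np.repeat(cell_cols.ravel(), table.ravel())

observed_stat = chi2_from_labels(rows, cols, n_rows, n_cols)

# Shuffle the group labels; the outcome labels stay fixed
n_perm = 2000  # the exercise asks for 10,000; fewer here for speed
perm_stats = np.array([
    chi2_from_labels(rng.permutation(rows), cols, n_rows, n_cols)
    for _ in range(n_perm)
])

# Proportion of shuffled statistics at least as large as the observed one
p_perm = np.mean(perm_stats >= observed_stat)
print(f"observed chi-square = {observed_stat:.3f}, permutation p = {p_perm:.4f}")
```

Because shuffling preserves both sets of marginal totals, the expected counts are identical in every permutation; only the observed cell counts change.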
Part I: Real-World Applications ⭐⭐⭐
I.1. A pharmaceutical company tested a new drug against a placebo. Side effects were categorized as None, Mild, or Severe:
| | None | Mild | Severe | Total |
|---|---|---|---|---|
| Drug | 180 | 45 | 25 | 250 |
| Placebo | 200 | 35 | 15 | 250 |
| Total | 380 | 80 | 40 | 500 |
(a) Conduct a complete chi-square test of independence.
(b) Calculate Cramer's V.
(c) Calculate standardized residuals. Where does the drug differ most from placebo?
(d) A medical reviewer asks: "Is the drug safe?" How would you answer using both the p-value and the effect size?
(e) Would your conclusion change if the company tested at $\alpha = 0.01$ instead of $\alpha = 0.05$?
I.2. A school district collected data on school type and student absenteeism:
| | Low Absences (<5) | Moderate (5-15) | High (>15) | Total |
|---|---|---|---|---|
| Elementary | 245 | 85 | 20 | 350 |
| Middle School | 160 | 110 | 30 | 300 |
| High School | 140 | 120 | 90 | 350 |
| Total | 545 | 315 | 140 | 1000 |
(a) Conduct a chi-square test of independence.
(b) Compute Cramer's V.
(c) Use standardized residuals to identify which school-absenteeism combinations are most noteworthy.
(d) A school board member says: "This proves that attending high school causes higher absenteeism." Respond to this claim using what you know about causation vs. association (Theme 5).
Part J: Synthesis and Critical Thinking ⭐⭐⭐⭐
J.1. Category Design Matters. Consider James's bail data. Re-analyze the data using only two racial categories: White (n = 200) vs. Non-White (n = 400).
(a) Create the new $2 \times 2$ table.
(b) Conduct the chi-square test and compute Cramer's V.
(c) Compare the results to the original four-category analysis ($V = 0.206$). Is information lost? What specific patterns disappear when you collapse categories?
(d) Now add a fifth category by splitting "Other" into "Asian" (n = 35, bail granted: 31) and "Other" (n = 15, bail granted: 12). What happens to the expected frequency condition?
(e) Write a paragraph reflecting on how the choice of categories shapes what chi-square tests can reveal. Connect this to Theme 2 (the human stories behind the data).
J.2. Meta-Analysis. Five different studies examine the relationship between neighborhood type (Urban/Suburban/Rural) and health outcome (Healthy/Chronic Condition). Each reports:
| Study | $n$ | $\chi^2$ | $df$ | $p$ | $V$ |
|---|---|---|---|---|---|
| A | 200 | 3.8 | 2 | 0.150 | 0.14 |
| B | 500 | 12.5 | 2 | 0.002 | 0.16 |
| C | 150 | 2.1 | 2 | 0.350 | 0.12 |
| D | 1000 | 28.0 | 2 | <0.001 | 0.17 |
| E | 300 | 5.4 | 2 | 0.067 | 0.13 |
(a) Which studies reached statistical significance at $\alpha = 0.05$?
(b) Despite differing conclusions about significance, what do the Cramer's V values suggest about the consistency of the effect?
(c) Why do some studies reach significance while others don't, even though the effect sizes are similar?
(d) A policymaker reads Studies A, C, and E and concludes "neighborhood type doesn't affect health." A different policymaker reads Studies B and D and concludes "it does." Who's right? What would you advise?
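The $p$ and $V$ columns of the J.2 table can be recomputed from $n$, $\chi^2$, and $df$ alone, which makes the role of sample size easy to see. This sketch assumes each study uses a $3 \times 2$ table (three neighborhood types by two health outcomes), so $\min(r-1, c-1) = 1$.

```python
import numpy as np
from scipy.stats import chi2

# (study, n, chi-square) as reported in the J.2 table; every study has df = 2
studies = [("A", 200, 3.8), ("B", 500, 12.5), ("C", 150, 2.1),
           ("D", 1000, 28.0), ("E", 300, 5.4)]

df = 2
k = 1  # min(r - 1, c - 1) for a 3 x 2 table

for name, n, stat in studies:
    p = chi2.sf(stat, df)          # right-tail p-value
    v = np.sqrt(stat / (n * k))    # Cramer's V
    flag = "significant" if p < 0.05 else "not significant"
    print(f"Study {name}: n = {n:4d}, p = {p:.3f}, V = {v:.2f} ({flag})")
```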
J.3. Ethics of Classification. An AI company builds a content moderation algorithm that classifies posts into categories: Hate Speech, Misinformation, Spam, and Clean. They test whether the algorithm's classifications are independent of the poster's demographic group.
(a) What are the ethical implications if the chi-square test is significant (i.e., classification is not independent of demographic group)?
(b) What if the test is non-significant? Does that guarantee fairness?
(c) How might Cramer's V help in this context? What level of $V$ would be concerning?
(d) What standardized residual pattern would be most alarming? (Hint: think about which cell, that is, which combination of demographic group and classification, represents the worst outcome for affected users.)
(e) Connect this to Professor Washington's work in earlier chapters. How does this scenario relate to algorithmic bias?
Part K: Quick Checks ⭐
K.1. A chi-square test of independence yields $\chi^2 = 3.84$ with $df = 1$. Is this significant at $\alpha = 0.05$? (Hint: the critical value for $\chi^2$ with $df = 1$ at $\alpha = 0.05$ is 3.841.)
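Critical values and right-tail p-values like those in K.1 can be looked up without a printed table. A brief sketch, assuming SciPy's `chi2` distribution object:

```python
from scipy.stats import chi2

alpha, df = 0.05, 1
critical = chi2.ppf(1 - alpha, df)   # right-tail critical value
p_value = chi2.sf(3.84, df)          # right-tail area beyond the observed statistic

print(f"critical value = {critical:.4f}")
print(f"p-value = {p_value:.5f}")
```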
K.2. For a $5 \times 3$ contingency table, how many degrees of freedom does the test of independence have?
K.3. A goodness-of-fit test with 6 categories and $n = 120$ produces expected frequencies of 20 each. Are the conditions met?
K.4. A researcher reports Cramer's V = 0.42 for a $3 \times 4$ table. How would you describe the strength of this association?
K.5. Can you use a chi-square test to determine whether a student's GPA (measured as a continuous variable from 0.0 to 4.0) is related to their major? Why or why not? What modification would make this possible?