Exercises: Chi-Square Tests: Categorical Data Analysis

These exercises progress from conceptual understanding through goodness-of-fit tests, tests of independence, effect sizes, residual analysis, and Python implementation. Estimated completion time: 3.5 hours.

Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)

Part A: Conceptual Understanding ⭐

A.1. In your own words, explain how the chi-square test differs from the $t$-test and $z$-test. What type of data does each handle, and what kind of question does each answer?

A.2. Why do we divide $(O - E)^2$ by $E$ in the chi-square formula, rather than just summing $(O - E)^2$? Give a specific example where omitting this division would be misleading.

A.3. True or false (explain each):

(a) A chi-square statistic can be negative.

(b) The chi-square test of independence tells you which cells are responsible for the association.

(c) If all observed frequencies equal the expected frequencies, the chi-square statistic is 0.

(d) The expected frequency condition requires that all observed counts be at least 5.

(e) Cramer's V is independent of sample size.

(f) A two-category goodness-of-fit test gives the same p-value as a two-tailed one-proportion $z$-test.

A.4. Explain why the chi-square test is always one-tailed (right-tailed). Why can't a very small chi-square value be evidence against $H_0$?

A.5. A researcher finds $\chi^2 = 15.3$ with $p = 0.002$ for a test of independence. She concludes: "Race causes differences in sentencing outcomes." What is wrong with this conclusion? Identify at least two problems.

A.6. In one sentence each, state the key difference between:

(a) A goodness-of-fit test and a test of independence

(b) Observed frequencies and expected frequencies

(c) The chi-square statistic and Cramer's V

(d) A chi-square test of independence and a chi-square test of homogeneity

Part B: Expected Frequencies and the Chi-Square Statistic ⭐

B.1. A marketing team surveyed 300 customers about their preferred social media platform:

Platform	Observed
Instagram	95
TikTok	82
Twitter/X	48
Facebook	45
LinkedIn	30

(a) If the null hypothesis is that all platforms are equally preferred, what are the expected frequencies?

(b) Compute the chi-square statistic by hand.

(c) How many degrees of freedom does this test have?

(d) Using Python or a chi-square table, find the p-value. What is your conclusion at $\alpha = 0.05$?
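
Python note for part (d): one way to turn a chi-square statistic into a p-value is the survival function of the chi-square distribution. The snippet below is a minimal sketch that assumes SciPy is installed; the statistic and degrees of freedom are hypothetical placeholders, not the answer to B.1.

```python
from scipy import stats

# Hypothetical values for illustration only; substitute your own results.
chi2_stat = 7.5
df = 3

# The chi-square test is right-tailed, so the p-value is the area to the
# right of the statistic: P(X >= chi2_stat), i.e. the survival function.
p_value = stats.chi2.sf(chi2_stat, df)

# Critical value that cuts off the upper 5% of the distribution.
critical_value = stats.chi2.ppf(0.95, df)

print(f"p-value = {p_value:.4f}, critical value = {critical_value:.3f}")
```

Using `sf` rather than `1 - cdf` avoids loss of precision when the p-value is very small.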

B.2. A genetics student crosses two pea plants and expects the offspring to show a 9:3:3:1 ratio of phenotypes. Out of 160 offspring, she observes:

Phenotype	Observed
Round, Yellow	86
Round, Green	35
Wrinkled, Yellow	26
Wrinkled, Green	13

(a) Calculate the expected frequencies based on the 9:3:3:1 ratio.

(b) Compute the chi-square statistic.

(c) With $df = 3$, find the p-value. Does the data support the expected genetic ratio?
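
Python note: when the null hypothesis is a ratio rather than equal proportions, the expected counts have to be passed to SciPy explicitly. The sketch below uses a hypothetical 3:1 cross with 80 offspring (not the data in B.2) and assumes SciPy and NumPy are installed.

```python
import numpy as np
from scipy import stats

# Hypothetical monohybrid cross: 3:1 ratio expected, 80 offspring observed.
ratio = np.array([3, 1])
observed = np.array([58, 22])

# Convert the ratio to proportions, then scale by the sample size.
expected = observed.sum() * ratio / ratio.sum()   # 60 and 20

# chisquare() assumes equal expected counts unless f_exp is supplied.
result = stats.chisquare(f_obs=observed, f_exp=expected)

print(expected, result.statistic, result.pvalue)
```

The same pattern extends to any ratio: only `ratio` and `observed` change.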

B.3. Consider this contingency table:

	Category A	Category B	Total
Group 1	40	60	100
Group 2	60	40	100
Total	100	100	200

(a) Calculate all four expected frequencies.

(b) Compute the chi-square statistic.

(c) How many degrees of freedom?

(d) This is a $2 \times 2$ table. Verify that $\chi^2 = z^2$ by computing the two-proportion $z$-test statistic from Chapter 16 (Group 1's proportion in Category A vs. Group 2's proportion).
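
Python note for part (d): if you check your work with `scipy.stats.chi2_contingency`, be aware that it applies Yates' continuity correction to $2 \times 2$ tables by default, so its statistic will not equal $z^2$ unless you pass `correction=False`. The sketch below illustrates this on a hypothetical table (not the one in B.3) and assumes the pooled two-proportion $z$ statistic.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table (rows = groups, columns = outcomes).
table = np.array([[12, 28],
                  [25, 15]])

# Default behaviour: Yates' continuity correction is applied when df = 1.
chi2_yates, p_yates, dof, expected = stats.chi2_contingency(table)

# Uncorrected Pearson statistic, the one that matches the hand formula.
chi2_plain, p_plain, _, _ = stats.chi2_contingency(table, correction=False)

# Hand computation for comparison: sum of (O - E)^2 / E over all cells.
chi2_by_hand = ((table - expected) ** 2 / expected).sum()

# Pooled two-proportion z statistic for the first column.
n1, n2 = table.sum(axis=1)
p1, p2 = table[0, 0] / n1, table[1, 0] / n2
p_pool = table[:, 0].sum() / table.sum()
z = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

print(chi2_yates, chi2_plain, chi2_by_hand, z ** 2)
```

The corrected statistic is smaller than the uncorrected one, which is why the default output can appear to contradict a hand calculation.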

Part C: Goodness-of-Fit Tests ⭐⭐

C.1. A six-sided die is rolled 120 times with the following results:

Face	1	2	3	4	5	6
Count	25	17	22	18	16	22

(a) State the null and alternative hypotheses.

(b) Calculate expected frequencies.

(c) Compute $\chi^2$ and find the p-value with $df = 5$.

(d) At $\alpha = 0.05$, is the die fair?

(e) Which face contributes most to $\chi^2$? What does this suggest?

C.2. A hospital emergency department records the day of the week for 700 patient arrivals:

Day	Mon	Tue	Wed	Thu	Fri	Sat	Sun
Arrivals	112	88	95	90	108	115	92

(a) Test whether ED arrivals are uniformly distributed across days of the week.

(b) If the test is significant, which days contribute most to the departure from uniformity?

(c) A colleague suggests that the hospital should expect more arrivals on weekends (Sat/Sun: 16% each) and fewer on weekdays (Mon-Fri: 13.6% each). Run a goodness-of-fit test that uses this proposed distribution as the null hypothesis.

C.3. Maya is studying whether the distribution of blood types in her hospital's patient population matches the expected distribution for the U.S. population:

Blood Type	Expected %	Observed (n = 500)
O+	37.4%	205
A+	35.7%	162
B+	8.5%	55
AB+	3.4%	18
O-	6.6%	28
A-	6.3%	22
B-	1.5%	7
AB-	0.6%	3

(a) Compute expected frequencies for each blood type.

(b) Check the expected frequency condition. Are there any problems?

(c) If conditions are not met, suggest how to address the issue.

(d) Conduct the test (combining categories if necessary) and interpret the results.

Part D: Tests of Independence ⭐⭐

D.1. A study examined the relationship between exercise frequency and self-reported stress level:

	Low Stress	Moderate Stress	High Stress	Total
Frequent Exercise	85	45	20	150
Occasional Exercise	55	70	25	150
No Exercise	30	55	65	150
Total	170	170	110	450

(a) Compute all expected frequencies.

(b) Verify that the conditions for a chi-square test are met.

(c) Compute the chi-square statistic.

(d) Find the p-value with the appropriate degrees of freedom.

(e) Compute Cramer's V and interpret the strength of association.

(f) Compute standardized residuals. Which cells show the strongest departures from independence?
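
Python note for part (f): the sketch below computes residuals for a hypothetical $2 \times 3$ table (not the data in D.1). It assumes the residual is defined as $(O - E)/\sqrt{E}$; some texts use an adjusted version with an extra term in the denominator, so check the chapter's definition before relying on it.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table of counts, for illustration only.
observed = np.array([[30, 50, 20],
                     [45, 35, 25]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Residual for each cell under the assumed definition: (O - E) / sqrt(E).
residuals = (observed - expected) / np.sqrt(expected)

# Squared residuals are the per-cell contributions to chi-square.
contributions = residuals ** 2

# Flag cells whose residual exceeds 2 in absolute value.
flagged = np.abs(residuals) > 2

print(np.round(residuals, 2))
print(flagged)
print(contributions.sum(), chi2_stat)
```

Because the squared residuals sum to $\chi^2$, the last line prints the same number twice, which is a useful self-check.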

D.2. Alex wants to know whether users who engage with StreamVibe's recommendation algorithm differ in their satisfaction ratings. Data from 400 users:

	Satisfied	Neutral	Dissatisfied	Total
Uses Recommendations	95	60	45	200
Ignores Recommendations	55	50	95	200
Total	150	110	140	400

(a) Conduct a complete chi-square test of independence.

(b) Calculate Cramer's V.

(c) Compute standardized residuals and interpret them.

(d) What business recommendation would Alex make based on these findings?

D.3. A political science researcher surveyed 600 registered voters about their party affiliation and their position on a proposed policy:

	Support	Oppose	Undecided	Total
Party A	120	60	20	200
Party B	50	110	40	200
Independent	70	80	50	200
Total	240	250	110	600

(a) Is party affiliation independent of position on the policy?

(b) Which party-position combinations show the largest departures from independence?

(c) How does this relate to the theme that "categorical data often describes people"?

Part E: Cramer's V and Effect Sizes ⭐⭐

E.1. Two studies examine the relationship between smoking status (smoker/non-smoker) and lung disease (yes/no):

Study A: $n = 200$, $\chi^2 = 8.0$
Study B: $n = 2{,}000$, $\chi^2 = 80.0$

(a) Compute Cramer's V for each study.

(b) What do you notice? What does this tell you about comparing chi-square statistics directly?

(c) Both studies have the same Cramer's V. Explain why the p-values would be very different.
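
Python note: Cramer's V needs only the chi-square statistic, the sample size, and the table dimensions. The helper below is a small sketch; the function name `cramers_v` and the inputs are hypothetical and do not correspond to either study above.

```python
import math

def cramers_v(chi2_stat, n, n_rows, n_cols):
    """Cramer's V = sqrt(chi2 / (n * min(r - 1, c - 1)))."""
    k = min(n_rows - 1, n_cols - 1)
    return math.sqrt(chi2_stat / (n * k))

# Hypothetical inputs: chi-square of 20 from a 3x3 table with n = 500.
v = cramers_v(20.0, 500, 3, 3)
print(round(v, 3))  # prints 0.141
```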

E.2. A researcher reports: "We found a statistically significant association between major and career satisfaction ($\chi^2 = 14.2$, $df = 6$, $p = 0.027$)." The sample size was 2,500.

(a) Compute Cramer's V. (Hint: $df = (r-1)(c-1) = 6$ is consistent with either a $4 \times 3$ or a $7 \times 2$ table; assume $4 \times 3$ here, so $\min(r-1, c-1) = 2$.)

(b) Is this a practically meaningful association? Justify your answer.

(c) What would you recommend the researcher include in their report alongside the p-value?

E.3. Match each Cramer's V value with the most appropriate scenario:

(a) $V = 0.85$: ______

(b) $V = 0.12$: ______

(c) $V = 0.35$: ______

Scenarios:
1. Knowing a customer's age group gives a moderate prediction of their product preference.
2. Gender almost perfectly predicts a person's name (in most cultures).
3. There's a slight tendency for coffee drinkers to prefer window seats at cafes.

Part F: Conditions and Troubleshooting ⭐⭐

F.1. For each scenario, determine whether the conditions for a chi-square test are met. If not, suggest a remedy.

(a) A $3 \times 4$ table with $n = 40$ where three cells have expected frequencies below 3.

(b) A survey where married couples were asked separately about their TV preferences, and both spouses' responses are included in the contingency table.

(c) A goodness-of-fit test with 8 categories where one expected count is 4.2 but all others exceed 20.

(d) A $2 \times 2$ table where one expected frequency is 3.8, although the data come from a true random sample of 50 people.

F.2. A researcher studying voting behavior creates this table:

	Voted	Did Not Vote	Total
Urban	85	15	100
Suburban	72	28	100
Rural	8	2	10
Total	165	45	210

(a) Calculate expected frequencies for all cells.

(b) Which cells have expected frequencies below 5?

(c) Propose two different ways to fix this problem. What are the trade-offs of each approach?
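
Python note: checking the expected-frequency condition is easy to automate. The sketch below builds expected counts from the row and column totals of a hypothetical $3 \times 2$ table (not the voting table above); the helper name `expected_counts` is an illustration, not a library function.

```python
import numpy as np

def expected_counts(observed):
    """Expected counts under independence: row total * column total / n."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    return np.outer(row_totals, col_totals) / observed.sum()

# Hypothetical 3x2 table with one small group.
table = [[40, 10],
         [35, 15],
         [6, 4]]

expected = expected_counts(table)
too_small = expected < 5          # boolean mask of problem cells

print(np.round(expected, 2))
print("Cells below 5:", int(too_small.sum()))
```

Note that the condition is checked on the expected counts, not the observed ones.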

Part G: Connecting to Prior Chapters ⭐⭐

G.1. (Connects to Ch.8) Return to the contingency table you built in Chapter 8's exercises. Now apply a chi-square test of independence to that table. Compare what you learned from descriptive analysis (conditional probabilities, marginal probabilities) to what the formal test reveals. Does the test confirm what you suspected from the descriptive analysis?

G.2. (Connects to Ch.14) Consider a goodness-of-fit test with only two categories: Success (observed = 65) and Failure (observed = 35), with $n = 100$. The hypothesized proportion of success is 0.50.

(a) Conduct the chi-square goodness-of-fit test.

(b) Conduct a two-tailed one-proportion $z$-test for the same data.

(c) Verify that $\chi^2 = z^2$ (within rounding error).

(d) Verify that both tests give the same p-value.

(e) When would you prefer the chi-square test over the $z$-test? When would you prefer the $z$-test?

G.3. (Connects to Ch.17) James found that bail decisions are associated with race ($\chi^2 = 25.48$, $df = 3$, $p < 0.001$, $V = 0.206$).

(a) Is this result statistically significant? Is it practically significant?

(b) Using the effect size benchmarks from Chapter 17 (adapted for Cramer's V), how would you characterize the strength of this association?

(c) James's sample is $n = 600$. If he repeated the study with $n = 6{,}000$ and found the same proportional pattern, what would happen to $\chi^2$? What would happen to Cramer's V?

(d) Write a paragraph interpreting these results as James might present them to a policy audience, incorporating both statistical and practical significance.

Part H: Python Implementation ⭐⭐⭐

H.1. Using Python, load a dataset of your choice (or use the sample data below) and conduct both types of chi-square test.

Sample data (student survey):

```python
import pandas as pd
import numpy as np

np.random.seed(42)
n = 400

data = pd.DataFrame({
    'major': np.random.choice(
        ['STEM', 'Business', 'Humanities', 'Social Science'],
        size=n, p=[0.30, 0.25, 0.20, 0.25]),
    'study_hours': np.random.choice(
        ['<10 hrs', '10-20 hrs', '>20 hrs'],
        size=n, p=[0.25, 0.50, 0.25]),
    'grade': np.random.choice(
        ['A', 'B', 'C', 'D/F'],
        size=n, p=[0.20, 0.35, 0.30, 0.15])
})
```

(a) Conduct a goodness-of-fit test to determine whether majors are equally distributed.

(b) Create a contingency table of major vs. grade using pd.crosstab().

(c) Conduct a test of independence for major and grade.

(d) Compute Cramer's V and standardized residuals.

(e) Create a grouped bar chart visualizing the relationship.

H.2. Write a Python function that takes a contingency table (as a 2D NumPy array) and produces a complete formatted report including:
- Observed and expected frequency tables
- Chi-square statistic, degrees of freedom, and p-value
- Cramer's V with interpretation
- Standardized residuals with cells $|r| > 2$ flagged
- A visual heatmap of the standardized residuals

Test your function on James's bail data and Alex's StreamVibe data.

H.3. (Connects to Ch.18) Use a permutation test to replicate the chi-square test of independence for James's bail data:

(a) Pool all 600 observations and randomly shuffle the race labels 10,000 times.

(b) For each shuffled dataset, compute the chi-square statistic.

(c) Calculate the permutation p-value: proportion of shuffled $\chi^2$ values $\geq$ observed $\chi^2$.

(d) Compare the permutation p-value to the chi-square test p-value. Do they agree?

(e) When might the permutation approach be preferable to the chi-square approximation?
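
Python note: the shuffling procedure in parts (a) to (c) can be sketched as follows. James's bail data are not reproduced in this exercise set, so the example uses simulated, hypothetical data with integer-coded labels; the helper name `chi2_from_labels` is also an illustration. It runs 2,000 shuffles to keep the run time short, so increase `n_perm` to 10,000 for the exercise.

```python
import numpy as np
from scipy import stats

def chi2_from_labels(groups, outcomes):
    """Build the contingency table from two integer-coded arrays and
    return the uncorrected Pearson chi-square statistic."""
    table = np.zeros((groups.max() + 1, outcomes.max() + 1))
    np.add.at(table, (groups, outcomes), 1)
    return stats.chi2_contingency(table, correction=False)[0]

rng = np.random.default_rng(0)

# Hypothetical data: 3 groups of 100, outcome rate differs by group.
groups = np.repeat([0, 1, 2], 100)
outcomes = rng.binomial(1, np.repeat([0.5, 0.6, 0.7], 100))

observed_stat = chi2_from_labels(groups, outcomes)

# Shuffle the group labels to break any association, then recompute.
n_perm = 2000
perm_stats = np.array([
    chi2_from_labels(rng.permutation(groups), outcomes)
    for _ in range(n_perm)
])

# Proportion of shuffled statistics at least as large as the observed one.
p_perm = np.mean(perm_stats >= observed_stat)
print(observed_stat, p_perm)
```

Shuffling only one variable is enough: it keeps both sets of marginal totals fixed while destroying any link between the two variables.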

Part I: Real-World Applications ⭐⭐⭐

I.1. A pharmaceutical company tested a new drug against a placebo. Side effects were categorized as None, Mild, or Severe:

	None	Mild	Severe	Total
Drug	180	45	25	250
Placebo	200	35	15	250
Total	380	80	40	500

(a) Conduct a complete chi-square test of independence.

(b) Calculate Cramer's V.

(c) Calculate standardized residuals. Where does the drug differ most from placebo?

(d) A medical reviewer asks: "Is the drug safe?" How would you answer using both the p-value and the effect size?

(e) Would this analysis change if the company tested at $\alpha = 0.01$ instead of $\alpha = 0.05$?

I.2. A school district collected data on school type and student absenteeism:

	Low Absences (<5)	Moderate (5-15)	High (>15)	Total
Elementary	245	85	20	350
Middle School	160	110	30	300
High School	140	120	90	350
Total	545	315	140	1000

(a) Conduct a chi-square test of independence.

(b) Compute Cramer's V.

(c) Use standardized residuals to identify which school-absenteeism combinations are most noteworthy.

(d) A school board member says: "This proves that attending high school causes higher absenteeism." Respond to this claim using what you know about causation vs. association (Theme 5).

Part J: Synthesis and Critical Thinking ⭐⭐⭐⭐

J.1. Category Design Matters. Consider James's bail data. Re-analyze the data using only two racial categories: White (n = 200) vs. Non-White (n = 400).

(a) Create the new $2 \times 2$ table.

(b) Conduct the chi-square test and compute Cramer's V.

(c) Compare the results to the original four-category analysis ($V = 0.206$). Is information lost? What specific patterns disappear when you collapse categories?

(d) Now add a fifth category by splitting "Other" into "Asian" (n = 35, bail granted: 31) and "Other" (n = 15, bail granted: 12). What happens to the expected frequency condition?

(e) Write a paragraph reflecting on how the choice of categories shapes what chi-square tests can reveal. Connect this to Theme 2 (the human stories behind the data).

J.2. Meta-Analysis. Five different studies examine the relationship between neighborhood type (Urban/Suburban/Rural) and health outcome (Healthy/Chronic Condition). Each reports:

Study	$n$	$\chi^2$	$df$	$p$	$V$
A	200	3.8	2	0.150	0.14
B	500	12.5	2	0.002	0.16
C	150	2.1	2	0.350	0.12
D	1000	28.0	2	<0.001	0.17
E	300	5.4	2	0.067	0.13

(a) Which studies reached statistical significance at $\alpha = 0.05$?

(b) Despite differing conclusions about significance, what do the Cramer's V values suggest about the consistency of the effect?

(c) Why do some studies reach significance while others don't, even though the effect sizes are similar?

(d) A policymaker reads Studies A, C, and E and concludes "neighborhood type doesn't affect health." A different policymaker reads Studies B and D and concludes "it does." Who's right? What would you advise?

J.3. Ethics of Classification. An AI company builds a content moderation algorithm that classifies posts into categories: Hate Speech, Misinformation, Spam, and Clean. They test whether the algorithm's classifications are independent of the poster's demographic group.

(a) What are the ethical implications if the chi-square test is significant (i.e., classification is not independent of demographic group)?

(b) What if the test is non-significant? Does that guarantee fairness?

(c) How might Cramer's V help in this context? What level of $V$ would be concerning?

(d) What standardized residual pattern would be most alarming? (Hint: think about which cell — which combination of demographic group and classification — represents the worst outcome for affected users.)

(e) Connect this to Professor Washington's work in earlier chapters. How does this scenario relate to algorithmic bias?

Part K: Quick Checks ⭐

K.1. A chi-square test of independence yields $\chi^2 = 3.84$ with $df = 1$. Is this significant at $\alpha = 0.05$? (Hint: the critical value for $\chi^2$ with $df = 1$ at $\alpha = 0.05$ is 3.841.)

K.2. For a $5 \times 3$ contingency table, how many degrees of freedom does the test of independence have?

K.3. A goodness-of-fit test with 6 categories and $n = 120$ produces expected frequencies of 20 each. Are the conditions met?

K.4. A researcher reports Cramer's V = 0.42 for a $3 \times 4$ table. How would you describe the strength of this association?

K.5. Can you use a chi-square test to determine whether a student's GPA (measured as a continuous variable from 0.0 to 4.0) is related to their major? Why or why not? What modification would make this possible?