Case Study 1: Maya's Regional Disease Distribution Analysis

The Setup

Dr. Maya Chen has been tracking a respiratory illness across her county for three years. In the early months, cases seemed scattered randomly. But as data accumulated, a troubling pattern began to emerge: certain neighborhoods appeared to bear a disproportionate burden of disease. The question is whether this pattern is real — or whether Maya is seeing patterns in noise.

This isn't the first time Maya has wondered whether apparent geographic clusters are meaningful. In Chapter 8, she used contingency tables to organize disease data by region. In Chapter 14, she tested whether a single community's disease rate exceeded the county average. In Chapter 16, she compared two communities directly. But now she wants to go bigger: she wants to test whether the entire distribution of cases across all regions matches what population demographics alone would predict.

She has two questions: 1. Goodness-of-fit: Does the distribution of cases across four regions match the population distribution? (If not, some regions are disproportionately affected.) 2. Independence: Is the severity of illness (mild, moderate, severe) independent of region? (If not, some regions may face worse outcomes, not just more cases.)

Both questions require chi-square tests.

The Data

Maya's county is divided into four public health regions. She has collected data on 400 confirmed cases over the past year.

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# ============================================================
# QUESTION 1: Goodness-of-Fit — Regional Distribution
# ============================================================

# Regional populations and case counts
regions = ['North', 'South', 'East', 'West']
populations = np.array([100000, 120000, 80000, 100000])
pop_proportions = populations / populations.sum()

observed_cases = np.array([82, 108, 122, 88])
total_cases = observed_cases.sum()

expected_cases = pop_proportions * total_cases

print("=" * 60)
print("GOODNESS-OF-FIT TEST: Regional Case Distribution")
print("=" * 60)
print(f"\n{'Region':<10} {'Pop Share':<12} {'Expected':<12} {'Observed':<12}")
print("-" * 46)
for i, r in enumerate(regions):
    print(f"{r:<10} {pop_proportions[i]:<12.1%} "
          f"{expected_cases[i]:<12.1f} {observed_cases[i]:<12d}")
print(f"{'Total':<10} {'100.0%':<12} "
      f"{total_cases:<12.1f} {total_cases:<12d}")

Output:

Region     Pop Share    Expected     Observed
----------------------------------------------
North      25.0%        100.0        82
South      30.0%        120.0        108
East       20.0%        80.0         122
West       25.0%        100.0        88
Total      100.0%       400.0        400

Something jumps out immediately: the East region, with only 20% of the population, has 30.5% of the cases (122 out of 400). Is this a real disparity, or could it be chance?

Goodness-of-Fit Test

# Step 1: Hypotheses
# H0: Cases are distributed proportionally to population
# Ha: Cases are NOT distributed proportionally to population

# Step 2: Check conditions
print("\nCondition Check:")
print(f"  Minimum expected frequency: {expected_cases.min():.1f}")
print(f"  All expected >= 5? {'Yes' if expected_cases.min() >= 5 else 'No'}")

# Step 3: Compute chi-square
chi2_stat, p_value = stats.chisquare(observed_cases, f_exp=expected_cases)
df = len(regions) - 1

print(f"\nResults:")
print(f"  Chi-square statistic: {chi2_stat:.3f}")
print(f"  Degrees of freedom:   {df}")
print(f"  P-value:              {p_value:.4f}")

# Step 4: Contributions to chi-square
print(f"\nContributions to chi-square:")
contributions = (observed_cases - expected_cases)**2 / expected_cases
for i, r in enumerate(regions):
    direction = "more than expected" if observed_cases[i] > expected_cases[i] else "fewer than expected"
    print(f"  {r}: {contributions[i]:.3f} ({direction})")

Output:

Results:
  Chi-square statistic: 18.867
  Degrees of freedom:   3
  P-value:              0.0003

Contributions to chi-square:
  North: 3.240 (fewer than expected)
  South: 1.200 (fewer than expected)
  East: 22.050 (more than expected)
  West: 1.440 (fewer than expected)

Interpreting the Results

The result is highly significant ($\chi^2 = 18.87$, $df = 3$, $p = 0.0003$). The disease is not distributed proportionally to population. The East region is bearing a disproportionate burden.

Looking at the contributions, the East region alone accounts for $22.05 / 18.87 = 84.6\% of the total chi-square value. This is where the departure from expected is concentrated. The other three regions have mild deviations that, by themselves, wouldn't be remarkable.

# Visualize observed vs. expected
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
x = np.arange(len(regions))
width = 0.35
axes[0].bar(x - width/2, observed_cases, width, label='Observed',
            color='#E74C3C', edgecolor='black', alpha=0.8)
axes[0].bar(x + width/2, expected_cases, width, label='Expected',
            color='#3498DB', edgecolor='black', alpha=0.8)
axes[0].set_xlabel('Region', fontsize=12)
axes[0].set_ylabel('Number of Cases', fontsize=12)
axes[0].set_title('Observed vs. Expected Disease Cases by Region', fontsize=13)
axes[0].set_xticks(x)
axes[0].set_xticklabels(regions)
axes[0].legend(fontsize=11)

# Contribution bar chart
colors = ['#2ECC71' if c < 3.841 else '#E74C3C' for c in contributions]
axes[1].bar(regions, contributions, color=colors, edgecolor='black')
axes[1].axhline(y=3.841, color='red', linestyle='--', linewidth=1,
                label=r'Critical value ($\chi^2_{1,0.05}$ = 3.841)')
axes[1].set_xlabel('Region', fontsize=12)
axes[1].set_ylabel(r'Contribution to $\chi^2$', fontsize=12)
axes[1].set_title('Which Regions Drive the Significant Result?', fontsize=13)
axes[1].legend(fontsize=10)

plt.tight_layout()
plt.show()

The visualization makes the story unmistakable: the East region's bar towers above the rest. The 3.841 threshold line (the critical value for a single-df chi-square test at $\alpha = 0.05$) provides useful context — the East region's contribution alone exceeds the critical value for the entire test.

The Deeper Question: Independence of Severity and Region

Maya now asks a harder question. Not only are there more cases in the East, but are the cases worse there? If severity depends on region, that suggests not just unequal exposure but possibly unequal access to care, environmental factors, or differences in the population's vulnerability.

# ============================================================
# QUESTION 2: Test of Independence — Severity vs. Region
# ============================================================

# Cross-tabulation of region by disease severity
severity_data = np.array([
    [38, 30, 14],    # North: mild, moderate, severe
    [52, 38, 18],    # South
    [32, 48, 42],    # East
    [42, 32, 14]     # West
])

severity_labels = ['Mild', 'Moderate', 'Severe']

# Create labeled DataFrame
severity_df = pd.DataFrame(
    severity_data,
    index=regions,
    columns=severity_labels
)
severity_df['Total'] = severity_df.sum(axis=1)

print("\n" + "=" * 60)
print("TEST OF INDEPENDENCE: Severity vs. Region")
print("=" * 60)
print("\nObserved Frequencies:")
print(severity_df)
print(f"\nColumn Totals: {severity_data.sum(axis=0)}")
print(f"Grand Total:   {severity_data.sum()}")

# Run the test
chi2, p, dof, expected = stats.chi2_contingency(severity_data)

print(f"\nExpected Frequencies (under independence):")
exp_df = pd.DataFrame(
    np.round(expected, 1),
    index=regions,
    columns=severity_labels
)
print(exp_df)

# Check conditions
print(f"\nMin expected frequency: {expected.min():.1f} "
      f"({'OK' if expected.min() >= 5 else 'WARNING'})")

# Results
print(f"\nChi-square statistic: {chi2:.3f}")
print(f"Degrees of freedom:   {dof}")
print(f"P-value:              {p:.6f}")

# Cramer's V
n = severity_data.sum()
k = min(severity_data.shape) - 1
V = np.sqrt(chi2 / (n * k))
print(f"Cramer's V:           {V:.3f}")

# Standardized residuals
residuals = (severity_data - expected) / np.sqrt(expected)
print(f"\nStandardized Residuals:")
res_df = pd.DataFrame(
    np.round(residuals, 2),
    index=regions,
    columns=severity_labels
)
print(res_df)

# Highlight notable residuals
print(f"\nNotable residuals (|r| > 2):")
for i, r in enumerate(regions):
    for j, s in enumerate(severity_labels):
        if abs(residuals[i, j]) > 2:
            direction = "MORE" if residuals[i, j] > 0 else "FEWER"
            print(f"  {r} x {s}: {residuals[i,j]:+.2f} — "
                  f"{direction} {s.lower()} cases than expected")

Output:

Chi-square statistic: 25.764
Degrees of freedom:   6
P-value:              0.000247
Cramer's V:           0.180

Standardized Residuals:
       Mild  Moderate  Severe
North  1.23     -0.62   -0.59
South  1.54     -0.43   -0.87
East  -2.76      1.65    3.52
West   1.18     -0.07   -1.11

Notable residuals (|r| > 2):
  East x Mild: -2.76 — FEWER mild cases than expected
  East x Severe: +3.52 — MORE severe cases than expected

The Full Story

The test of independence is highly significant ($\chi^2 = 25.76$, $df = 6$, $p < 0.001$). Disease severity is not independent of region. Cramer's V = 0.180 indicates a small but meaningful association.

The standardized residuals reveal the pattern: the East region doesn't just have more cases — it has more severe cases than expected (residual = +3.52) and fewer mild cases than expected (residual = -2.76). The North, South, and West regions show the opposite pattern: slightly more mild cases and fewer severe cases than independence would predict.

This has profound public health implications. If the East region is experiencing both more cases and more severe cases, something about that region — environmental exposures, healthcare access, population vulnerability, or some combination — requires urgent investigation.

Maya's Report to the Public Health Committee

Maya synthesizes her findings into a presentation:

Key Finding 1: The distribution of respiratory illness cases across the county is significantly disproportionate ($\chi^2 = 18.87$, $p < 0.001$). The East region, with 20% of the population, accounts for 30.5% of cases.

Key Finding 2: Disease severity varies by region ($\chi^2 = 25.76$, $p < 0.001$, Cramer's V = 0.18). The East region has significantly more severe cases and fewer mild cases than expected.

Limitations: These are observational data. The chi-square test establishes that the disparity exists; it does not identify the cause. Possible explanations include: - Environmental factors (the East region contains two industrial facilities) - Healthcare access (the East region has the longest average distance to a hospital) - Socioeconomic factors (the East region has the lowest median household income) - Population composition (age, pre-existing conditions)

Recommendation: Allocate additional epidemiological resources to the East region. Conduct a follow-up study controlling for age, income, healthcare access, and environmental exposure to identify the drivers of this disparity.

Connecting to the Bigger Picture

Maya's analysis illustrates why chi-square tests matter for public health. A simple comparison of case counts might have revealed the East region's overrepresentation, but the formal test provides three things that informal observation cannot:

  1. Quantified evidence. The p-value tells us the pattern is unlikely to be chance ($p < 0.001$).
  2. A framework for severity. The test of independence reveals that it's not just how many cases but how bad they are that varies by region.
  3. Effect size in context. Cramer's V = 0.18 tells us the association is real but modest — region matters, but it's far from the only factor affecting disease outcomes.

This is Theme 1 in action: statistics as a superpower. Without the chi-square test, Maya might suspect a pattern. With it, she can demonstrate the pattern to administrators, justify resource allocation, and design targeted follow-up studies.

But it's also Theme 2: the human stories behind the data. Each of those 122 cases in the East region represents a person who got sick. The 42 severe cases represent families who endured hospitalization, lost wages, and fear. The categories in Maya's contingency table — North, South, East, West; Mild, Moderate, Severe — compress human suffering into cells of a table. The chi-square test reveals that this suffering is unequally distributed. The next step — figuring out why — is where public health meets social justice.

Try It Yourself

  1. If Maya's East region data were removed and only the North, South, and West regions were tested for uniform distribution (proportional to their populations), would the test still be significant? Conduct the analysis.

  2. Maya suspects that the two industrial facilities in the East region may be contributing to the disparity. Design a study that could test this hypothesis. What data would you need? What confounders would you control for? What chi-square test (or other test) would you use?

  3. If Maya disaggregated the East region data by census tract (5 tracts within the East region), what would happen to the expected frequencies? Might the expected frequency condition be violated? How would you handle this?