Case Study 1: Maya's Multi-Predictor Model for ER Visit Rates
The Setup
Dr. Maya Chen is back before the county health board, and this time she's prepared to do more than show a correlation.
In Chapter 22, Maya demonstrated a strong linear relationship between poverty rate and ER visit rates ($r = 0.98$, $R^2 = 0.96$). The board was impressed. But Maya knew the simple regression was telling an incomplete story.
"Poverty is clearly associated with ER overcrowding," she told the board last month. "But before we allocate $12 million in intervention funding based on a single-predictor model, we need to understand what about poverty drives ER utilization. Is it that people in poverty can't afford primary care? Is it environmental factors — poor air quality in economically disadvantaged areas? Is poverty itself the causal agent, or is it a proxy for other factors we could target more effectively?"
The board chair nodded: "So what do you recommend?"
"I recommend we build a model that includes the factors we think matter — poverty, insurance coverage, and air quality — and see what each one contributes independently."
This is the story of that analysis.
The Data
Maya has expanded her dataset to include two additional variables for each of the 25 communities in her health district:
- Poverty rate (% of households below the federal poverty line)
- Uninsured percentage (% of residents without health insurance)
- AQI (Air Quality Index — higher values indicate worse air quality)
- ER visit rate (annual visits per 1,000 residents)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# ============================================================
# MAYA'S FULL MULTIPLE REGRESSION ANALYSIS
# ============================================================
communities = pd.DataFrame({
'community': [f'Community {i+1}' for i in range(25)],
'poverty_rate': [8.2, 12.5, 15.3, 5.1, 22.6, 18.4, 9.7, 25.1, 11.3,
19.8, 6.4, 14.7, 21.3, 7.8, 16.9, 28.4, 10.2, 23.7,
13.6, 20.5, 4.3, 17.2, 26.8, 8.9, 15.8],
'er_rate': [145, 198, 225, 110, 310, 265, 155, 345, 175,
280, 120, 215, 295, 138, 240, 380, 165, 320,
200, 290, 95, 250, 365, 148, 230],
'uninsured_pct': [6.5, 10.2, 14.1, 3.8, 20.5, 16.8, 7.3, 23.2, 9.8,
18.1, 5.1, 12.9, 19.7, 6.2, 15.3, 25.8, 8.4, 21.6,
11.5, 18.9, 3.2, 15.0, 24.3, 7.1, 13.7],
'aqi': [42, 55, 63, 35, 78, 70, 48, 85, 50,
72, 38, 60, 75, 44, 65, 90, 46, 80,
58, 74, 32, 67, 88, 45, 62]
})
print("=" * 70)
print("MAYA'S COMMUNITY HEALTH DATA: FIRST 10 ROWS")
print("=" * 70)
print(communities.head(10).to_string(index=False))
print(f"\nn = {len(communities)} communities")
print("\nDescriptive Statistics:")
print(communities[['poverty_rate', 'uninsured_pct', 'aqi',
'er_rate']].describe().round(2))
Phase 1: Exploratory Analysis
Before fitting any models, Maya examines the relationships among all variables.
# ---- PHASE 1: EXPLORATORY ANALYSIS ----
print("\n" + "=" * 70)
print("PHASE 1: EXPLORATORY ANALYSIS")
print("=" * 70)
# Correlation matrix
print("\nCorrelation Matrix:")
corr = communities[['poverty_rate', 'uninsured_pct', 'aqi',
'er_rate']].corr()
print(corr.round(3))
# Visualize correlation matrix
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0,
fmt='.3f', square=True, ax=ax,
xticklabels=['Poverty', 'Uninsured', 'AQI', 'ER Rate'],
yticklabels=['Poverty', 'Uninsured', 'AQI', 'ER Rate'])
ax.set_title('Correlation Matrix: Community Health Variables',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Scatterplot matrix
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
predictors = ['poverty_rate', 'uninsured_pct', 'aqi']
labels = ['Poverty Rate (%)', 'Uninsured (%)', 'Air Quality Index']
for i, (pred, label) in enumerate(zip(predictors, labels)):
    axes[i].scatter(communities[pred], communities['er_rate'],
                    color='steelblue', edgecolors='navy', s=60, alpha=0.7)
    axes[i].set_xlabel(label, fontsize=11)
    axes[i].set_ylabel('ER Visit Rate (per 1,000)', fontsize=11)
    r_val = communities[pred].corr(communities['er_rate'])
    axes[i].set_title(f'r = {r_val:.3f}', fontsize=12, fontweight='bold')
plt.suptitle('ER Visit Rate vs. Each Predictor', fontsize=14,
fontweight='bold')
plt.tight_layout()
plt.show()
What Maya Sees
The correlation matrix reveals a critical issue:
| | Poverty | Uninsured | AQI | ER Rate |
|---|---|---|---|---|
| Poverty | 1.000 | 0.988 | 0.967 | 0.981 |
| Uninsured | 0.988 | 1.000 | 0.952 | 0.970 |
| AQI | 0.967 | 0.952 | 1.000 | 0.965 |
| ER Rate | 0.981 | 0.970 | 0.965 | 1.000 |
All three predictors are very highly correlated with each other and with ER visit rates. This is a classic case of multicollinearity. Poverty, lack of insurance, and poor air quality tend to cluster in the same communities — they're different manifestations of the same underlying disadvantage.
"This is going to make it hard to separate the individual effects," Maya notes. "But that's exactly why we need multiple regression — the simple correlations are misleading because they don't account for the overlap."
Phase 2: Simple vs. Multiple Regression
# ---- PHASE 2: SIMPLE VS. MULTIPLE REGRESSION ----
print("\n" + "=" * 70)
print("PHASE 2: SIMPLE VS. MULTIPLE REGRESSION")
print("=" * 70)
# Model 1: Poverty only (from Ch.22)
model1 = smf.ols('er_rate ~ poverty_rate', data=communities).fit()
print("\nMODEL 1: ER Rate ~ Poverty Rate")
print(f" Slope = {model1.params['poverty_rate']:.2f}")
print(f" R² = {model1.rsquared:.4f}")
print(f" Adj. R² = {model1.rsquared_adj:.4f}")
# Model 2: All three predictors
model2 = smf.ols('er_rate ~ poverty_rate + uninsured_pct + aqi',
data=communities).fit()
print("\nMODEL 2: ER Rate ~ Poverty + Uninsured + AQI")
print(model2.summary())
The Key Comparison
| | Model 1 (Poverty Only) | Model 2 (All Three) |
|---|---|---|
| Poverty rate coefficient | 11.38 | 5.81 |
| Uninsured % coefficient | — | 3.26 |
| AQI coefficient | — | 1.12 |
| $R^2$ | 0.962 | 0.979 |
| Adjusted $R^2$ | 0.960 | 0.976 |
Maya circles the poverty coefficient change: 11.38 to 5.81 — a 49% decrease.
"Look at this," she tells her research assistant. "When we only looked at poverty, we attributed 11.38 additional ER visits per percentage point of poverty. But almost half of that was actually due to the lack of insurance and air quality that accompanies poverty. The unique contribution of poverty — holding insurance and air quality constant — is 5.81."
This is the threshold concept in action: "holding other variables constant" reveals a different, more nuanced picture than the simple regression.
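The coefficient drop has an exact algebraic explanation, the omitted-variable identity for least squares: the simple-regression slope on one predictor equals its partial slope plus each omitted coefficient times the slope of that omitted predictor on the included one. A sketch with made-up numbers (nothing here is Maya's data) confirms the identity numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Illustrative data: x2 tracks x1, and y depends on both.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=n)
y = 5.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def slopes(X, y):
    """OLS slopes (intercept fitted, then dropped)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

b_simple = slopes(x1[:, None], y)[0]              # slope of y on x1 alone
b1, b2 = slopes(np.column_stack([x1, x2]), y)     # partial slopes
d21 = slopes(x1[:, None], x2)[0]                  # slope of x2 on x1

# Omitted-variable identity: the simple slope equals the partial slope
# of x1 plus the x2 effect routed through x1's correlation with x2.
print(f"simple slope:  {b_simple:.4f}")
print(f"b1 + b2 * d21: {b1 + b2 * d21:.4f}")
```

The identity is exact in-sample, which is why the 11.38 in Maya's simple model decomposes cleanly into a direct poverty effect plus effects carried by insurance and air quality.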
Phase 3: Multicollinearity Check
# ---- PHASE 3: MULTICOLLINEARITY ----
print("\n" + "=" * 70)
print("PHASE 3: MULTICOLLINEARITY DIAGNOSTICS")
print("=" * 70)
import statsmodels.api as sm

X = communities[['poverty_rate', 'uninsured_pct', 'aqi']]
# variance_inflation_factor expects the full design matrix, including
# the intercept; without the constant column it computes uncentered
# VIFs, which can be misleading.
X_design = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X_design.values, i + 1)
                   for i in range(X.shape[1])]
print("\nVariance Inflation Factors:")
print(vif_data.to_string(index=False))
Variable VIF
poverty_rate 8.74
uninsured_pct 7.21
aqi 4.35
"The VIF values are high," Maya acknowledges to the board. "Poverty and insurance coverage are strongly correlated — communities with high poverty tend to have high uninsured rates. This means our individual coefficient estimates are less precise than we'd like. But the overall model predictions are still reliable."
She adds a crucial caveat: "With VIF values this high, I'd be cautious about over-interpreting the exact size of each coefficient. The general pattern — all three factors contribute — is reliable. The precise partitioning of effects is less certain."
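Her caution has a quantitative basis: relative to the ideal case of uncorrelated predictors, a coefficient's standard error is inflated by a factor of $\sqrt{\mathrm{VIF}}$. Plugging in the VIF values from Maya's diagnostics:

```python
import math

# Reported VIF values from Maya's diagnostics table.
vifs = {'poverty_rate': 8.74, 'uninsured_pct': 7.21, 'aqi': 4.35}

for name, v in vifs.items():
    # The standard error of a coefficient is sqrt(VIF) times what it
    # would be if the predictors were uncorrelated.
    print(f"{name}: SE inflated by a factor of {math.sqrt(v):.2f}")
```

So the poverty and insurance coefficients are estimated with standard errors nearly three times larger than an uncorrelated design would give, which is exactly why the "precise partitioning of effects" is uncertain.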
Phase 4: Residual Diagnostics
# ---- PHASE 4: RESIDUAL DIAGNOSTICS ----
print("\n" + "=" * 70)
print("PHASE 4: RESIDUAL DIAGNOSTICS")
print("=" * 70)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 1. Residuals vs. Predicted
axes[0, 0].scatter(model2.fittedvalues, model2.resid,
color='steelblue', edgecolors='navy', s=60, alpha=0.7)
axes[0, 0].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[0, 0].set_xlabel('Predicted ER Visit Rate', fontsize=11)
axes[0, 0].set_ylabel('Residual', fontsize=11)
axes[0, 0].set_title('Residuals vs. Predicted Values', fontsize=12,
fontweight='bold')
# 2. QQ-plot
(theoretical, sample), (slope, intercept, r) = stats.probplot(
model2.resid, dist="norm"
)
axes[0, 1].scatter(theoretical, sample, color='steelblue',
edgecolors='navy', s=60, alpha=0.7)
axes[0, 1].plot(theoretical, slope * np.array(theoretical) + intercept,
'r--', linewidth=1.5)
axes[0, 1].set_xlabel('Theoretical Quantiles', fontsize=11)
axes[0, 1].set_ylabel('Sample Quantiles', fontsize=11)
axes[0, 1].set_title('Normal QQ-Plot of Residuals', fontsize=12,
fontweight='bold')
# 3. Histogram of residuals
axes[1, 0].hist(model2.resid, bins=8, color='steelblue',
edgecolor='navy', alpha=0.7)
axes[1, 0].set_xlabel('Residual', fontsize=11)
axes[1, 0].set_ylabel('Frequency', fontsize=11)
axes[1, 0].set_title('Distribution of Residuals', fontsize=12,
fontweight='bold')
# 4. Scale-Location plot (sqrt of standardized residuals vs predicted)
std_resid = model2.get_influence().resid_studentized_internal
axes[1, 1].scatter(model2.fittedvalues, np.sqrt(np.abs(std_resid)),
color='steelblue', edgecolors='navy', s=60, alpha=0.7)
axes[1, 1].set_xlabel('Predicted Values', fontsize=11)
axes[1, 1].set_ylabel('√|Standardized Residuals|', fontsize=11)
axes[1, 1].set_title('Scale-Location Plot', fontsize=12,
fontweight='bold')
plt.suptitle('Residual Diagnostics for Maya\'s Multiple Regression',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Shapiro-Wilk test for normality of residuals
stat, p_val = stats.shapiro(model2.resid)
print(f"\nShapiro-Wilk test on residuals: W = {stat:.4f}, p = {p_val:.4f}")
if p_val > 0.05:
    print("-> Fail to reject normality (residuals appear approximately normal)")
else:
    print("-> Evidence against normality of residuals")
Maya's Diagnostic Assessment
| LINE Condition | Assessment | Evidence |
|---|---|---|
| Linearity | Satisfied | Residuals vs. predicted shows no curved pattern |
| Independence | Assumed satisfied | Communities are separate geographic units |
| Normality | Approximately satisfied | QQ-plot is reasonably straight; Shapiro-Wilk not significant |
| Equal variance | Approximately satisfied | No fan or funnel shape in residual plot |
"The diagnostics look acceptable," Maya reports. "With only 25 communities, we don't have a lot of power to detect violations, but nothing raises a red flag."
Phase 5: Model Predictions
# ---- PHASE 5: MODEL PREDICTIONS ----
print("\n" + "=" * 70)
print("PHASE 5: MODEL PREDICTIONS")
print("=" * 70)
# Predict ER rate for specific community profiles
scenarios = pd.DataFrame({
'Scenario': [
'Low-poverty, insured, clean air',
'High-poverty, uninsured, poor air',
'Medium poverty, but good insurance',
'Medium poverty, but poor air quality'
],
'poverty_rate': [5, 25, 15, 15],
'uninsured_pct': [4, 22, 5, 12],
'aqi': [35, 85, 50, 80]
})
# Predict all scenarios in one call; extracting a single row with
# iterrows() and transposing yields an object-dtype frame, which the
# formula interface can misread as categorical data.
preds = model2.predict(scenarios[['poverty_rate', 'uninsured_pct', 'aqi']])
for (_, row), pred in zip(scenarios.iterrows(), preds):
    print(f"\n{row['Scenario']}:")
    print(f"  Poverty = {row['poverty_rate']}%, "
          f"Uninsured = {row['uninsured_pct']}%, "
          f"AQI = {row['aqi']}")
    print(f"  Predicted ER rate: {pred:.0f} visits per 1,000")
Key Predictions
| Scenario | Poverty | Uninsured | AQI | Predicted ER Rate |
|---|---|---|---|---|
| Best case | 5% | 4% | 35 | 96 per 1,000 |
| Worst case | 25% | 22% | 85 | 356 per 1,000 |
| Medium poverty, good insurance | 15% | 5% | 50 | 185 per 1,000 |
| Medium poverty, bad air | 15% | 12% | 80 | 241 per 1,000 |
The last two rows are particularly telling: two communities with the same poverty rate but different insurance coverage and air quality have very different predicted ER visit rates (185 vs. 241). This is the practical implication of a multi-factor model.
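The table gives point predictions only, but a board planning around capacity should also see the uncertainty. statsmodels can produce intervals directly via `model2.get_prediction(new_data).summary_frame()`; the sketch below computes a 95% prediction interval by hand on illustrative single-predictor data (all numbers made up) to show the ingredients:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 25

# Illustrative single-predictor data (not Maya's communities).
x = rng.uniform(5, 30, size=n)
y = 50 + 11 * x + rng.normal(scale=15, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
df = n - X.shape[1]
s2 = resid @ resid / df                # residual variance estimate
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 20.0])             # new observation: predictor = 20
y_hat = x0 @ beta
# 95% prediction interval: y_hat +/- t * sqrt(s2 * (1 + x0' (X'X)^-1 x0))
se_pred = np.sqrt(s2 * (1 + x0 @ XtX_inv @ x0))
t_crit = stats.t.ppf(0.975, df)
lo, hi = y_hat - t_crit * se_pred, y_hat + t_crit * se_pred
print(f"Predicted: {y_hat:.1f}, 95% PI: [{lo:.1f}, {hi:.1f}]")
```

The "+1" inside the square root is what distinguishes a prediction interval for a single new community from the much narrower confidence interval for the mean response.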
Phase 6: Policy Implications
# ---- PHASE 6: PRACTICAL IMPACT COMPARISON ----
print("\n" + "=" * 70)
print("PHASE 6: POLICY IMPACT ANALYSIS")
print("=" * 70)
# Compare the impact of realistic interventions
print("\nWhat would a realistic improvement in each factor do?")
print("-" * 60)
# Start from a "typical high-risk community"
baseline = {'poverty_rate': 20, 'uninsured_pct': 18, 'aqi': 75}
baseline_pred = model2.predict(pd.DataFrame([baseline])).values[0]
print(f"\nBaseline community: poverty={baseline['poverty_rate']}%, "
f"uninsured={baseline['uninsured_pct']}%, AQI={baseline['aqi']}")
print(f"Predicted ER rate: {baseline_pred:.0f}")
# Intervention 1: Reduce poverty by 5 percentage points
intervention1 = baseline.copy()
intervention1['poverty_rate'] -= 5
pred1 = model2.predict(pd.DataFrame([intervention1])).values[0]
print(f"\nIntervention 1: Reduce poverty by 5 pts → {pred1:.0f} "
f"(reduction of {baseline_pred - pred1:.0f} visits)")
# Intervention 2: Expand insurance by 5 percentage points
intervention2 = baseline.copy()
intervention2['uninsured_pct'] -= 5
pred2 = model2.predict(pd.DataFrame([intervention2])).values[0]
print(f"Intervention 2: Reduce uninsured by 5 pts → {pred2:.0f} "
f"(reduction of {baseline_pred - pred2:.0f} visits)")
# Intervention 3: Improve air quality by 15 AQI points
intervention3 = baseline.copy()
intervention3['aqi'] -= 15
pred3 = model2.predict(pd.DataFrame([intervention3])).values[0]
print(f"Intervention 3: Improve AQI by 15 pts → {pred3:.0f} "
f"(reduction of {baseline_pred - pred3:.0f} visits)")
# Combined intervention
combined = baseline.copy()
combined['poverty_rate'] -= 5
combined['uninsured_pct'] -= 5
combined['aqi'] -= 15
pred_combined = model2.predict(pd.DataFrame([combined])).values[0]
print(f"\nCombined intervention → {pred_combined:.0f} "
f"(reduction of {baseline_pred - pred_combined:.0f} visits)")
Maya's Recommendation to the Board
| Intervention | Estimated ER Rate Reduction |
|---|---|
| Reduce poverty by 5 percentage points | ~29 fewer visits per 1,000 |
| Expand insurance by 5 percentage points | ~16 fewer visits per 1,000 |
| Improve air quality by 15 AQI points | ~17 fewer visits per 1,000 |
| All three combined | ~62 fewer visits per 1,000 |
"The model suggests that a multi-pronged approach would be most effective," Maya tells the board. "If I had to prioritize one intervention, reducing poverty has the largest per-unit impact. But poverty is the hardest factor to change. Expanding insurance coverage and improving air quality are more directly actionable, and together they contribute almost as much as the poverty reduction alone."
The Caveats Maya Doesn't Skip
Maya is a careful researcher. She presents three critical caveats:
1. Causation. "This is observational data. We observed that communities with lower uninsured rates have lower ER visit rates. We did not randomly assign insurance coverage. There could be unmeasured confounders — community wealth, healthcare infrastructure, cultural factors — that we haven't captured. These estimates assume the relationships are causal, but they might not be."
2. Multicollinearity. "The VIF values for poverty and uninsured percentage are above 7. This means the exact split between their effects is uncertain. I'm confident that both matter, but the precise coefficients could shift with different data."
3. Ecological fallacy. "We're analyzing community-level data — averages across entire communities. What's true at the community level may not apply to individuals. A community with 20% poverty has a higher ER rate than one with 10% poverty, but we can't conclude that any individual's poverty status predicts their ER usage. Individual-level data would be needed for that."
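The ecological-fallacy caveat can be demonstrated with a short simulation: hypothetical individuals grouped into communities whose averages move together almost perfectly, even though within any one community the individual-level relationship is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical individual-level data for 5 communities: community
# means of the trait and the outcome rise in lockstep, but within
# each community, individual outcomes are unrelated to the trait.
group_means = [(5, 100), (10, 150), (15, 200), (20, 250), (25, 300)]
xs, ys = [], []
for mx, my in group_means:
    xs.append(mx + rng.normal(scale=1.0, size=200))
    ys.append(my + rng.normal(scale=30.0, size=200))

# Community-level correlation (on the 5 means) is essentially perfect...
mean_r = np.corrcoef([m[0] for m in group_means],
                     [m[1] for m in group_means])[0, 1]
# ...while the within-community correlations hover near zero.
within_rs = [np.corrcoef(a, b)[0, 1] for a, b in zip(xs, ys)]
print(f"community-level r: {mean_r:.3f}")
print(f"mean within-community r: {np.mean(within_rs):.3f}")
```

An analyst who saw only the community-level averages could wrongly conclude the trait predicts individual outcomes, which is exactly the inference Maya warns the board against.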
Discussion Questions
1. Maya's simple regression attributed 11.38 ER visits per poverty percentage point, while the multiple regression attributed only 5.81. Which number should the health board use for planning? Why?
2. The uninsured percentage was borderline significant ($p = 0.057$). Should Maya include it in her final model? What are the arguments for and against?
3. A board member says: "Why not just include every variable we can find — school quality, hospital proximity, demographics — and let the regression sort it out?" What would you tell them?
4. If the board wanted to prove that expanding insurance coverage reduces ER visits, what study design would you recommend? Why can't Maya's regression provide that proof?
5. How does this analysis illustrate the chapter's threshold concept of "holding other variables constant"? Where specifically did this concept change the conclusions?