Case Study 1: Maya's Emergency Department Wait Times — A Public Health Investigation

The Setup

Dr. Maya Chen has been hearing complaints. Patients arriving at Riverside County Medical Center's emergency department describe hours-long waits — some leaving without being seen, others delaying critical care. The hospital administration insists that wait times are "in line with national averages."

Maya knows that vague assurances aren't data. She decides to investigate systematically.

The Centers for Medicare and Medicaid Services (CMS) publishes national benchmarks for ED performance. One key metric: median time from arrival to seeing a provider should not exceed 28 minutes for patients with non-emergent conditions. But Maya is interested in average total ED visit duration — from arrival to departure — which includes triage, waiting, treatment, and discharge. For this metric, a commonly cited benchmark is 240 minutes (4 hours).

Maya's question: Does the average total ED visit time in Riverside County exceed 240 minutes?

The Data

Maya obtains a random sample of 42 ED visit records from the past three months (with patient identifiers removed per IRB protocol). Here are the total visit times in minutes:

185, 312, 228, 195, 267, 340, 210, 255, 198, 278,
245, 302, 232, 188, 265, 290, 215, 272, 248, 310,
220, 258, 295, 175, 280, 240, 305, 225, 270, 252,
318, 205, 285, 235, 262, 298, 192, 275, 250, 315,
230, 268

The Analysis

Step 1: Explore the Data

Before testing anything, Maya examines her data.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

visit_times = np.array([
    185, 312, 228, 195, 267, 340, 210, 255, 198, 278,
    245, 302, 232, 188, 265, 290, 215, 272, 248, 310,
    220, 258, 295, 175, 280, 240, 305, 225, 270, 252,
    318, 205, 285, 235, 262, 298, 192, 275, 250, 315,
    230, 268
])

# Summary statistics
n = len(visit_times)
x_bar = np.mean(visit_times)
s = np.std(visit_times, ddof=1)
median = np.median(visit_times)

print(f"n = {n}")
print(f"Mean = {x_bar:.1f} minutes")
print(f"Median = {median:.1f} minutes")
print(f"Std Dev = {s:.1f} minutes")
print(f"Min = {np.min(visit_times)}, Max = {np.max(visit_times)}")
print(f"Range = {np.max(visit_times) - np.min(visit_times)} minutes")

Output:

n = 42
Mean = 254.5 minutes
Median = 256.5 minutes
Std Dev = 41.7 minutes
Min = 175, Max = 340
Range = 165 minutes

The mean (254.5) and median (256.5) are nearly equal, suggesting the distribution is roughly symmetric. The standard deviation of 41.7 minutes indicates substantial variability: some patients are in and out in under 3 hours, while others spend well over 5 hours.

Step 2: Check Conditions

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram
axes[0].hist(visit_times, bins=12, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].axvline(240, color='red', linestyle='--', linewidth=2, label='Benchmark (240 min)')
axes[0].axvline(x_bar, color='green', linestyle='-', linewidth=2, label=f'Sample Mean ({x_bar:.1f})')
axes[0].set_xlabel('Total Visit Time (minutes)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of ED Visit Times')
axes[0].legend(fontsize=8)

# Boxplot
axes[1].boxplot(visit_times, vert=True)
axes[1].axhline(240, color='red', linestyle='--', label='Benchmark')
axes[1].set_title('Boxplot')
axes[1].set_ylabel('Minutes')
axes[1].legend(fontsize=8)

# QQ-plot
stats.probplot(visit_times, dist="norm", plot=axes[2])
axes[2].set_title('QQ-Plot')

plt.tight_layout()
plt.show()

# Shapiro-Wilk test
w_stat, p_shapiro = stats.shapiro(visit_times)
print(f"\nShapiro-Wilk test: W = {w_stat:.4f}, p = {p_shapiro:.4f}")

Output:

Shapiro-Wilk test: W = 0.9721, p = 0.3846

Assessment:

  1. Randomness: Maya obtained a random sample of ED records. ✓
  2. Independence: 42 visits is far less than 10% of all ED visits over three months, and the records were drawn at random. (Strictly speaking, visits that overlap in time share waiting-room congestion and are not perfectly independent, but for a random sample spread across three months this dependence is negligible.) ✓
  3. Normality: With $n = 42 \geq 30$, the CLT handles moderate non-normality. The histogram shows a roughly symmetric distribution, the QQ-plot is approximately linear, and the Shapiro-Wilk test shows no evidence of non-normality ($p = 0.38$). ✓

All conditions are met.
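As an optional numeric supplement to the visual checks (these statistics are not part of Maya's original workup), sample skewness and excess kurtosis summarize the shape that the histogram and QQ-plot display. Values near zero for both are consistent with an approximately normal shape:

```python
import numpy as np
from scipy import stats

# Same 42 visit times as above
visit_times = np.array([
    185, 312, 228, 195, 267, 340, 210, 255, 198, 278,
    245, 302, 232, 188, 265, 290, 215, 272, 248, 310,
    220, 258, 295, 175, 280, 240, 305, 225, 270, 252,
    318, 205, 285, 235, 262, 298, 192, 275, 250, 315,
    230, 268
])

# Skewness near 0 suggests symmetry; excess kurtosis near 0 suggests
# tails comparable to a normal distribution
print(f"Skewness = {stats.skew(visit_times):.3f}")
print(f"Excess kurtosis = {stats.kurtosis(visit_times):.3f}")
```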

Step 3: Conduct the t-Test

Maya's hypotheses:

$$H_0: \mu = 240 \text{ minutes} \quad (\text{visit times are at the benchmark})$$
$$H_a: \mu > 240 \text{ minutes} \quad (\text{visit times exceed the benchmark})$$
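It can also help to state the rejection region before touching the data. At $\alpha = 0.05$ with $n - 1 = 41$ degrees of freedom, the one-tailed critical value is:

```python
from scipy import stats

alpha = 0.05
df = 41  # n - 1

# Upper-tail critical value: reject H0 whenever the t statistic exceeds t_crit
t_crit = stats.t.ppf(1 - alpha, df)
print(f"Reject H0 if t > {t_crit:.3f}")  # t_crit ≈ 1.683
```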

# One-sample t-test (one-tailed: greater)
result = stats.ttest_1samp(visit_times, popmean=240, alternative='greater')

print(f"One-sample t-test:")
print(f"  t = {result.statistic:.4f}")
print(f"  p-value (one-tailed, greater) = {result.pvalue:.4f}")
print(f"  df = {n - 1}")

# Manual calculation for verification
se = s / np.sqrt(n)
t_manual = (x_bar - 240) / se
print(f"\nManual check:")
print(f"  SE = {s:.1f} / √{n} = {se:.2f}")
print(f"  t = ({x_bar:.1f} - 240) / {se:.2f} = {t_manual:.4f}")

Output:

One-sample t-test:
  t = 2.2477
  p-value (one-tailed, greater) = 0.0152
  df = 41

Manual check:
  SE = 41.7 / √42 = 6.44
  t = (254.5 - 240) / 6.44 = 2.2477
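As a further check, the p-value can be recovered directly from the t statistic using the t distribution's survival function (the upper-tail probability), confirming that ttest_1samp and the manual arithmetic agree:

```python
import numpy as np
from scipy import stats

# Same 42 visit times as above
visit_times = np.array([
    185, 312, 228, 195, 267, 340, 210, 255, 198, 278,
    245, 302, 232, 188, 265, 290, 215, 272, 248, 310,
    220, 258, 295, 175, 280, 240, 305, 225, 270, 252,
    318, 205, 285, 235, 262, 298, 192, 275, 250, 315,
    230, 268
])
n = len(visit_times)
se = np.std(visit_times, ddof=1) / np.sqrt(n)
t_stat = (np.mean(visit_times) - 240) / se

# P(T_41 > t): one-tailed p-value for the "greater" alternative
p_manual = stats.t.sf(t_stat, df=n - 1)
print(f"t = {t_stat:.4f}, p = {p_manual:.4f}")
```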

Step 4: Confidence Interval

# 95% confidence interval
t_star = stats.t.ppf(0.975, df=n-1)
margin = t_star * se
ci_lower = x_bar - margin
ci_upper = x_bar + margin

print(f"\n95% Confidence Interval:")
print(f"  t* = {t_star:.3f}")
print(f"  Margin of error = {margin:.1f} minutes")
print(f"  CI: ({ci_lower:.1f}, {ci_upper:.1f}) minutes")
print(f"  Does CI contain 240? {'Yes' if ci_lower <= 240 <= ci_upper else 'No'}")

# One-sided 95% lower bound (more relevant for this one-sided test)
t_star_one = stats.t.ppf(0.95, df=n-1)
lower_bound = x_bar - t_star_one * se
print(f"\n95% Lower Confidence Bound:")
print(f"  μ ≥ {lower_bound:.1f} minutes (with 95% confidence)")

Output:

95% Confidence Interval:
  t* = 2.020
  Margin of error = 13.0 minutes
  CI: (241.5, 267.5) minutes
  Does CI contain 240? No

95% Lower Confidence Bound:
  μ ≥ 243.6 minutes (with 95% confidence)
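As a one-line cross-check of the same arithmetic, scipy can produce the two-sided interval directly; t.interval takes the confidence level, degrees of freedom, center (loc), and standard error (scale):

```python
import numpy as np
from scipy import stats

# Same 42 visit times as above
visit_times = np.array([
    185, 312, 228, 195, 267, 340, 210, 255, 198, 278,
    245, 302, 232, 188, 265, 290, 215, 272, 248, 310,
    220, 258, 295, 175, 280, 240, 305, 225, 270, 252,
    318, 205, 285, 235, 262, 298, 192, 275, 250, 315,
    230, 268
])
n = len(visit_times)
x_bar = np.mean(visit_times)
se = np.std(visit_times, ddof=1) / np.sqrt(n)

# Equivalent to x_bar ± t* · SE
ci_lo, ci_hi = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)
print(f"95% CI: ({ci_lo:.1f}, {ci_hi:.1f}) minutes")
```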

Step 5: Conclusion

At $\alpha = 0.05$: Since $p = 0.015 < 0.05$, we reject $H_0$.

Statistical conclusion: There is statistically significant evidence at the 0.05 level that the average total ED visit time in Riverside County exceeds the national benchmark of 240 minutes ($t_{41} = 2.25$, $p = 0.015$).

Practical conclusion: The 95% confidence interval of (241.5, 267.5) minutes indicates that the true average visit time is plausibly about 1.5 to 27.5 minutes above the benchmark. The point estimate suggests visits average about 254 minutes, roughly 14 minutes longer than the benchmark.
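One standard way to put a number on practical significance (a supplement to the analysis above, previewing Chapter 17's theme) is Cohen's d: the excess over the benchmark measured in units of the sample standard deviation:

```python
import numpy as np

# Same 42 visit times as above
visit_times = np.array([
    185, 312, 228, 195, 267, 340, 210, 255, 198, 278,
    245, 302, 232, 188, 265, 290, 215, 272, 248, 310,
    220, 258, 295, 175, 280, 240, 305, 225, 270, 252,
    318, 205, 285, 235, 262, 298, 192, 275, 250, 315,
    230, 268
])

# One-sample Cohen's d against the 240-minute benchmark
d = (np.mean(visit_times) - 240) / np.std(visit_times, ddof=1)
print(f"Cohen's d = {d:.2f}")
```

Here d works out to about 0.35, a small-to-medium effect by Cohen's conventional labels, which foreshadows the committee's debate below about whether the excess matters in practice.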

The Human Story

Maya presents these findings to the hospital's Quality Improvement Committee. The reaction is mixed.

The CFO asks: "Is 14 extra minutes really a problem? That seems small."

Maya responds: "Fourteen minutes is the average. Some patients wait much longer; our data shows visit times as long as 340 minutes. And 14 minutes might not sound like much, but multiply it by the roughly 150 ED visits per day, and that's about 2,100 extra patient-minutes per day, or 35 additional patient-hours that our department is managing beyond what the benchmark expects. That's the equivalent of losing several nursing shifts."

The Chief of Emergency Medicine adds another layer: "The patients who leave without being seen aren't in this dataset at all. They walked out before completing their visit. If we could include them, the true picture might be even worse — or it might be better if they're the short-visit patients. Either way, there's selection bias in the data we can analyze."

A nurse manager raises a concern Maya hadn't considered: "Wait times aren't uniform throughout the day. At 3 AM, things are quiet. At 7 PM on a Friday, it's chaos. Does the average across all hours really tell us what we need to know?"

This is an excellent point — and it previews the kind of thinking that leads to more sophisticated analyses. A comparison of mean wait times during peak vs. off-peak hours would be a two-sample t-test (Chapter 16). An analysis of wait times across multiple time periods would use ANOVA (Chapter 20).
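To make the nurse manager's suggestion concrete, a peak vs. off-peak comparison would look roughly like the sketch below. The two samples are invented illustrative numbers, not Riverside data, and Welch's test is one reasonable choice when the groups may have unequal variances:

```python
import numpy as np
from scipy import stats

# Hypothetical visit times in minutes -- illustrative only, not real data
peak = np.array([310, 285, 340, 295, 320, 305, 290, 335, 315, 300])
off_peak = np.array([210, 195, 230, 205, 220, 185, 240, 215, 200, 225])

# Welch's two-sample t-test (does not assume equal variances)
t_stat, p_val = stats.ttest_ind(peak, off_peak, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_val:.2g}")
```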

Deeper Questions

Is the Benchmark the Right Comparison?

Maya used 240 minutes as the benchmark, but benchmarks have limitations:

  1. The benchmark is a national average. Maya's county is rural, with limited specialist availability. A more appropriate comparison might be the average for rural hospitals specifically.

  2. The benchmark doesn't account for case mix. If Riverside County's ED sees a higher proportion of complex cases, longer visit times might be appropriate, not problematic.

  3. The benchmark may be outdated. Healthcare delivery has changed, and benchmarks from even five years ago may not reflect current realities.

This illustrates a broader lesson: the choice of $\mu_0$ in a hypothesis test matters enormously. Testing against the wrong benchmark can lead to correct statistics but wrong conclusions.

The Confidence Interval Tells a Richer Story

Notice that the hypothesis test gives a binary answer: "Yes, wait times exceed 240 minutes." But the confidence interval tells a much richer story:

  • The true mean is probably between roughly 241 and 268 minutes
  • The excess over the benchmark could be as small as 1.5 minutes or as large as 27.5 minutes
  • This range matters for planning: if the excess is only a minute or two, the hospital might not need to take action; if it's closer to half an hour, urgent intervention is needed

The confidence interval should always accompany a hypothesis test. Together, they tell you whether an effect exists (the test) and how large it plausibly is (the interval).

What Happens Next

Based on Maya's analysis, the Quality Improvement Committee:

  1. Commissions a larger study stratified by time of day, day of week, and patient acuity level
  2. Requests data on patients who left without being seen (the "missing" data problem from Chapter 7)
  3. Compares wait times across similar rural hospitals to establish a more appropriate benchmark
  4. Sets a target: reduce average total visit time to below 240 minutes within 12 months

Maya's t-test didn't solve the problem. But it did something equally important: it turned a vague complaint ("wait times seem long") into a precise, evidence-based finding ("average visit times exceed the national benchmark by roughly 1.5 to 27.5 minutes, with 95% confidence"). That's the difference between frustration and action.

Discussion Questions

  1. The CFO asked if 14 extra minutes "really matters." How would you distinguish statistical significance from practical significance in this context? (This is the central theme of Chapter 17.)

  2. The nurse manager's point about time-of-day variation suggests the data might not be well-described by a single mean. What analyses would you recommend to investigate this further?

  3. The Chief of Emergency Medicine raised the issue of patients leaving without being seen. How does this create a form of selection bias? Would including these patients likely increase or decrease the estimated mean visit time?

  4. If Maya had found $p = 0.15$ instead of $p = 0.008$, could she conclude that wait times are at the benchmark? Explain why "fail to reject $H_0$" is different from "accept $H_0$."

  5. Maya's sample of 42 visits covers three months. What assumptions are we making about the stability of the process over time? Under what circumstances would those assumptions be violated?
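A postscript on question 4: the gap between "fail to reject" and "accept" is a question of statistical power. As a sketch, suppose (hypothetically) the true mean really were 15 minutes above the benchmark and the population standard deviation were about 42 minutes; the probability that a sample of 42 visits rejects $H_0$ at $\alpha = 0.05$ can be computed from the noncentral t distribution:

```python
import numpy as np
from scipy import stats

n = 42
df = n - 1
sigma = 42        # assumed population SD in minutes -- hypothetical
true_excess = 15  # assumed true mean minus benchmark -- hypothetical

t_crit = stats.t.ppf(0.95, df)                   # one-tailed rejection cutoff
nc = true_excess / (sigma / np.sqrt(n))          # noncentrality parameter
power = stats.nct.sf(t_crit, df, nc)             # P(reject H0 | this alternative)
print(f"Power ≈ {power:.2f}")
```

A power noticeably below 1 means a nonsignificant p-value can easily occur even when visit times genuinely exceed the benchmark, which is exactly why failing to reject $H_0$ is not the same as accepting it.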