Case Study 1: Maya's Hospital Readmission Prediction Model

The Setup

Dr. Maya Chen has a problem that keeps her up at night.

Every month, roughly one in six patients discharged from Riverside General Hospital is readmitted within 30 days. Each unplanned readmission costs the hospital an average of $15,000, strains already-stretched nursing staff, and — most importantly — means a patient suffered a setback that might have been prevented.

"If we could predict which patients are most likely to be readmitted," Maya told the chief medical officer, "we could target our follow-up resources. Home health visits. Medication counseling. Discharge planning. We can't afford to give these services to everyone, but if we focus on the highest-risk patients, we could reduce readmissions by 20-30%."

The CMO asked the natural question: "How do we predict who's at risk?"

Maya's answer: logistic regression.
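Logistic regression models the log-odds of the outcome as a linear function of the predictors, then maps those log-odds to a probability with the logistic (sigmoid) function. A minimal sketch of that mapping:

```python
import numpy as np

def sigmoid(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))   # 0.5: zero log-odds means a 50/50 chance
print(sigmoid(2.0))   # ~0.88: positive log-odds push the probability up
print(sigmoid(-2.0))  # ~0.12: negative log-odds push it down
```

Because the sigmoid is monotone, ranking patients by log-odds and ranking them by predicted probability give the same ordering; only the scale changes.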

The Data

Maya collected records for 600 patients discharged over the past year. For each patient, she recorded:

  • age: patient age in years
  • prior_admissions: number of hospital admissions in the past 12 months
  • num_medications: number of medications prescribed at discharge
  • length_of_stay: days spent in the hospital
  • has_pcp: whether the patient has a primary care physician (1 = yes, 0 = no)
  • charlson_index: Charlson Comorbidity Index (0-10 scale measuring illness severity)
  • readmitted: whether the patient was readmitted within 30 days (1 = yes, 0 = no)

To walk through the full analysis, the code below simulates a dataset with these variables, standing in for Maya's actual records:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, auc, accuracy_score)
import matplotlib.pyplot as plt
import seaborn as sns

# ============================================================
# MAYA'S READMISSION PREDICTION — COMPLETE WORKFLOW
# ============================================================

np.random.seed(2024)
n = 600

age = np.random.normal(65, 15, n).clip(18, 95).astype(int)
prior_admissions = np.random.poisson(1.2, n)
num_medications = np.random.poisson(7, n).clip(0, 20)
length_of_stay = np.random.exponential(4, n).clip(1, 30).round(0).astype(int)
has_pcp = np.random.binomial(1, 0.65, n)
charlson_index = np.random.poisson(2, n).clip(0, 10)

# True readmission model
log_odds = (-3.2
            + 0.015 * age
            + 0.40 * prior_admissions
            + 0.06 * num_medications
            + 0.03 * length_of_stay
            - 0.50 * has_pcp
            + 0.20 * charlson_index)
prob_readmit = 1 / (1 + np.exp(-log_odds))
readmitted = np.random.binomial(1, prob_readmit)

patients = pd.DataFrame({
    'age': age,
    'prior_admissions': prior_admissions,
    'num_medications': num_medications,
    'length_of_stay': length_of_stay,
    'has_pcp': has_pcp,
    'charlson_index': charlson_index,
    'readmitted': readmitted
})

print("=" * 60)
print("MAYA'S READMISSION DATA — OVERVIEW")
print("=" * 60)
print(f"\nTotal patients: {len(patients)}")
print(f"Readmission rate: {patients['readmitted'].mean():.1%}")
print(f"\nDescriptive statistics:")
print(patients.describe().round(2))

Step 1: Exploratory Analysis

Before fitting any model, Maya explores how each predictor relates to readmission.

# Readmission rates by key factors
print("\n--- Readmission rates by prior admissions ---")
print(patients.groupby('prior_admissions')['readmitted']
      .agg(['mean', 'count']).round(3))

print("\n--- Readmission rates by PCP status ---")
print(patients.groupby('has_pcp')['readmitted']
      .agg(['mean', 'count']).round(3))

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Age distribution by readmission status
for status, color, label in [(0, 'steelblue', 'Not readmitted'),
                               (1, 'coral', 'Readmitted')]:
    subset = patients[patients['readmitted'] == status]
    axes[0, 0].hist(subset['age'], bins=15, alpha=0.6,
                    color=color, label=label)
axes[0, 0].set_title('Age Distribution')
axes[0, 0].legend()

# Prior admissions
rates = patients.groupby('prior_admissions')['readmitted'].mean()
axes[0, 1].bar(rates.index, rates.values, color='navy', alpha=0.7)
axes[0, 1].set_title('Readmission Rate by Prior Admissions')
axes[0, 1].set_xlabel('Prior Admissions')
axes[0, 1].set_ylabel('Readmission Rate')

# PCP status
rates_pcp = patients.groupby('has_pcp')['readmitted'].mean()
axes[0, 2].bar(['No PCP', 'Has PCP'], rates_pcp.values,
               color=['coral', 'steelblue'], alpha=0.7)
axes[0, 2].set_title('Readmission Rate by PCP Status')
axes[0, 2].set_ylabel('Readmission Rate')

# Medications
for status, color, label in [(0, 'steelblue', 'Not readmitted'),
                               (1, 'coral', 'Readmitted')]:
    subset = patients[patients['readmitted'] == status]
    axes[1, 0].hist(subset['num_medications'], bins=15, alpha=0.6,
                    color=color, label=label)
axes[1, 0].set_title('Number of Medications')
axes[1, 0].legend()

# Length of stay
for status, color, label in [(0, 'steelblue', 'Not readmitted'),
                               (1, 'coral', 'Readmitted')]:
    subset = patients[patients['readmitted'] == status]
    axes[1, 1].hist(subset['length_of_stay'], bins=15, alpha=0.6,
                    color=color, label=label)
axes[1, 1].set_title('Length of Stay')
axes[1, 1].legend()

# Charlson Index
rates_charlson = patients.groupby('charlson_index')['readmitted'].mean()
axes[1, 2].bar(rates_charlson.index, rates_charlson.values,
               color='navy', alpha=0.7)
axes[1, 2].set_title('Readmission Rate by Charlson Index')
axes[1, 2].set_xlabel('Charlson Comorbidity Index')
axes[1, 2].set_ylabel('Readmission Rate')

plt.suptitle("Maya's Readmission Analysis: Exploratory Plots",
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Maya notices several patterns: readmission rates climb steadily with the number of prior admissions and the Charlson index. Patients without a primary care physician have noticeably higher readmission rates. The age and medication effects are more subtle.

Step 2: Fit the Logistic Regression Model

# ============================================================
# STATISTICAL MODEL (statsmodels — for inference)
# ============================================================

predictors = ['age', 'prior_admissions', 'num_medications',
              'length_of_stay', 'has_pcp', 'charlson_index']

X_sm = sm.add_constant(patients[predictors])
logit_model = sm.Logit(patients['readmitted'], X_sm).fit()

print("\n" + "=" * 60)
print("LOGISTIC REGRESSION RESULTS")
print("=" * 60)
print(logit_model.summary())

Step 3: Interpret the Results

# Odds ratios with 95% confidence intervals
print("\n" + "=" * 60)
print("ODDS RATIOS WITH 95% CONFIDENCE INTERVALS")
print("=" * 60)

odds_ratios = np.exp(logit_model.params)
ci = np.exp(logit_model.conf_int())
ci.columns = ['OR Lower', 'OR Upper']

results_table = pd.DataFrame({
    'Coef (b)': logit_model.params.round(4),
    'Odds Ratio': odds_ratios.round(3),
    'OR Lower': ci['OR Lower'].round(3),
    'OR Upper': ci['OR Upper'].round(3),
    'p-value': logit_model.pvalues.round(4)
})
print(results_table)

Maya presents the findings at the weekly clinical meeting:

"We fitted a logistic regression model predicting 30-day readmission from six patient characteristics. Here are the key findings:

Prior admissions is the strongest predictor. Each additional prior admission in the past year is associated with approximately a 49% increase in the odds of readmission (OR approximately 1.49, p < 0.001), holding all other factors constant. This aligns with the clinical literature — prior utilization is the strongest single predictor of future utilization.

Having a primary care physician is the strongest protective factor. Patients with a PCP have approximately 39% lower odds of readmission compared to patients without one (OR approximately 0.61, p < 0.01), all else being equal. This is our most actionable finding — if we can connect discharged patients with primary care, we may reduce readmissions.

The Charlson Comorbidity Index matters: each one-point increase is associated with about a 22% increase in odds (OR approximately 1.22, p < 0.01). Sicker patients are harder to keep out of the hospital.

Number of medications has a small positive association (OR approximately 1.06), consistent with polypharmacy as a risk factor. Age has a very small effect per year, and length of stay has a modest effect. Both should be interpreted cautiously."
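The percentage changes Maya quotes come from exponentiating the model coefficients: exp(b) is the odds ratio for a one-unit increase in a predictor, and 100 * (exp(b) - 1) is the percent change in odds. A quick check, using the coefficients that generated the simulated data (the fitted estimates will be close but not identical):

```python
import numpy as np

# Coefficients from the data-generating model above
coefs = {
    'prior_admissions': 0.40,
    'has_pcp': -0.50,
    'charlson_index': 0.20,
    'num_medications': 0.06,
}

for name, b in coefs.items():
    odds_ratio = np.exp(b)
    pct_change = 100 * (odds_ratio - 1)
    print(f"{name:<18} OR = {odds_ratio:.2f} ({pct_change:+.0f}% odds per unit)")
```

This reproduces the numbers in Maya's summary: exp(0.40) is about 1.49 (a 49% increase per prior admission), exp(-0.50) is about 0.61 (39% lower odds with a PCP), and exp(0.20) is about 1.22 per Charlson point.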

Step 4: Model Evaluation

# ============================================================
# PREDICTION MODEL (sklearn — for evaluation)
# ============================================================

X = patients[predictors]
y = patients['readmitted']

# Train-test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit sklearn logistic regression
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train, y_train)

# Predictions on test set
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

# --- Confusion Matrix ---
print("\n" + "=" * 60)
print("MODEL EVALUATION ON TEST SET")
print("=" * 60)

cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(f"                  Predicted Not   Predicted Yes")
print(f"Actually Not:        {cm[0,0]:>4}            {cm[0,1]:>4}")
print(f"Actually Yes:        {cm[1,0]:>4}            {cm[1,1]:>4}")

# --- Metrics ---
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
      target_names=['Not Readmitted', 'Readmitted']))

# --- Visualize Confusion Matrix ---
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not Readmitted', 'Readmitted'],
            yticklabels=['Not Readmitted', 'Readmitted'],
            ax=ax)
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12)
ax.set_title("Maya's Readmission Model: Confusion Matrix",
             fontsize=14)
plt.tight_layout()
plt.show()

Step 5: ROC Curve and AUC

# --- ROC Curve ---
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='navy', linewidth=2,
         label=f'Maya\'s Model (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--',
         label='Random Guessing (AUC = 0.50)')
plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
plt.ylabel('True Positive Rate (Sensitivity)', fontsize=12)
plt.title("ROC Curve: Readmission Prediction Model", fontsize=14)
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nAUC: {roc_auc:.3f}")
print(f"Interpretation: The model has {'acceptable' if roc_auc >= 0.7 else 'poor'} "
      f"discriminative ability.")

Step 6: Choosing the Right Threshold

By default, predict classifies a patient as high-risk only when the predicted probability of readmission is at least 0.5, but the hospital's situation demands a different balance.

"Missing a high-risk patient is more costly than providing unnecessary follow-up," Maya argues. "A missed readmission costs $15,000 and a patient setback. Providing follow-up care to a patient who wouldn't have been readmitted costs about $800. The cost ratio is nearly 20 to 1."

# Threshold analysis
print("\n" + "=" * 60)
print("THRESHOLD ANALYSIS")
print("=" * 60)

thresholds_to_test = [0.2, 0.3, 0.4, 0.5, 0.6]

for t in thresholds_to_test:
    y_pred_t = (y_prob >= t).astype(int)
    cm_t = confusion_matrix(y_test, y_pred_t)
    tn, fp, fn, tp = cm_t.ravel()

    sens = tp / (tp + fn) if (tp + fn) > 0 else 0
    spec = tn / (tn + fp) if (tn + fp) > 0 else 0
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    pct_flagged = (tp + fp) / len(y_test)

    # Cost analysis
    cost_fn = fn * 15000  # cost of missed readmissions
    cost_fp = fp * 800    # cost of unnecessary follow-up
    total_cost = cost_fn + cost_fp

    print(f"\nThreshold = {t}:")
    print(f"  Sensitivity: {sens:.3f}  |  Specificity: {spec:.3f}")
    print(f"  Precision:   {prec:.3f}  |  % Flagged:   {pct_flagged:.1%}")
    print(f"  Cost of missed readmissions: ${cost_fn:,.0f}")
    print(f"  Cost of unnecessary follow-up: ${cost_fp:,.0f}")
    print(f"  TOTAL COST: ${total_cost:,.0f}")

The Decision

Maya presents the threshold analysis to the CMO.

"At the default threshold of 0.5, we catch about 75-80% of readmissions but miss the rest. Lowering the threshold to 0.3 catches about 90%, but it also flags more patients who won't be readmitted — increasing follow-up costs. However, the math is clear: the savings from preventing readmissions far outweigh the cost of the extra follow-up visits."

The CMO asks the critical question: "Does the model work equally well for all patient groups?"

Step 7: Fairness Check

# Check performance by age group
print("\n" + "=" * 60)
print("FAIRNESS CHECK: PERFORMANCE BY AGE GROUP")
print("=" * 60)

patients_test = X_test.copy()
patients_test['y_true'] = y_test.values
patients_test['y_prob'] = y_prob
patients_test['y_pred'] = y_pred

patients_test['age_group'] = pd.cut(
    patients_test['age'],
    bins=[0, 50, 65, 80, 100],
    labels=['50 or younger', '51-65', '66-80', 'Over 80']
)

for group in patients_test['age_group'].cat.categories:
    subset = patients_test[patients_test['age_group'] == group]
    if len(subset) < 10:
        continue

    cm_g = confusion_matrix(subset['y_true'], subset['y_pred'])
    if cm_g.shape == (2, 2):
        tn, fp, fn, tp = cm_g.ravel()
        fpr_g = fp / (fp + tn) if (fp + tn) > 0 else 0
        fnr_g = fn / (fn + tp) if (fn + tp) > 0 else 0
        acc_g = (tp + tn) / len(subset)

        print(f"\n{group} (n={len(subset)}):")
        print(f"  Accuracy: {acc_g:.3f}")
        print(f"  False Positive Rate: {fpr_g:.3f}")
        print(f"  False Negative Rate: {fnr_g:.3f}")

Analysis and Takeaways

This case study demonstrates the complete logistic regression workflow for a real-world healthcare application:

1. The business case is clear. Hospital readmissions are costly and preventable. Logistic regression provides a principled way to identify high-risk patients and allocate scarce resources.

2. Interpretation matters. Odds ratios translate statistical output into clinical insight. The finding that PCP access is the strongest modifiable predictor directly suggests an intervention: connect discharged patients with primary care.

3. The threshold is a policy decision. The choice between a 0.3 and 0.5 threshold isn't a statistical question — it's a question about values. How much are we willing to spend on follow-up to prevent one readmission? The model gives us the numbers; humans make the call.

4. Fairness must be checked. A model with good overall performance can still underperform for specific groups. If the model has a higher false negative rate for elderly patients (missing more readmissions among the most vulnerable), that's an ethical problem that needs to be addressed.

5. The connection to AI is direct. Maya's model is the same type of model that large hospital systems deploy at scale. The difference is that commercial systems may use hundreds of predictors instead of six, and gradient-boosted trees instead of logistic regression. But the evaluation framework — confusion matrix, ROC curve, threshold optimization, fairness audit — is identical.
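To make the last point concrete: swapping logistic regression for a gradient-boosted tree model changes one line of the pipeline, while the evaluation step is untouched. A sketch using sklearn's GradientBoostingClassifier on freshly simulated data (not Maya's dataset):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: six standardized predictors, outcome from a logistic model
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
log_odds = -1.0 + X @ np.array([0.0, 0.8, 0.1, 0.1, -0.6, 0.4])
y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Only the model object changes; the evaluation framework is identical
aucs = {}
for model in [LogisticRegression(max_iter=1000),
              GradientBoostingClassifier(random_state=42)]:
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    aucs[type(model).__name__] = roc_auc_score(y_te, prob)
    print(f"{type(model).__name__}: AUC = {aucs[type(model).__name__]:.3f}")
```

The same substitution works for the confusion matrix, threshold analysis, and fairness audit above, since all of them consume only predicted probabilities and labels.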

Discussion Questions

  1. Should hospitals be required to disclose the algorithms they use for patient risk stratification? What are the arguments for and against transparency?

  2. Maya's model shows that patients without a primary care physician have higher readmission odds. Is the solution to "add PCP access" to the model's recommendations, or to address the systemic issue of why some patients lack primary care?

  3. If the model has a higher false negative rate for patients with rare diseases (who look unusual to the model), should it be deployed anyway? How would you address this limitation?

  4. The Charlson Comorbidity Index is a summary measure that combines multiple conditions. Would the model be better with the individual conditions as separate predictors? What tradeoff would this involve? (Hint: think about the model building strategies from Chapter 23.)