Case Study 2: James's Analysis of Bail Decisions by Race

The Setup

Professor James Washington has been studying the criminal justice system for over a decade. In Chapter 16, he used a two-proportion $z$-test to compare algorithm-generated recommendations and judge decisions, and to expose racial disparities in false positive rates. But those analyses compared only two groups at a time. Now he has a larger dataset and a more comprehensive question.

A metropolitan courthouse has provided James with anonymized data on 600 pretrial bail decisions over six months. Each record includes the bail outcome (granted or denied) and the defendant's race as recorded by the court (White, Black, Hispanic, or Other). James wants to answer a straightforward question: Are bail decisions independent of race?

If the answer is no — if bail outcomes are associated with race — then the system may be operating unfairly, regardless of whether the bias is intentional. And if the answer is yes — if bail is independent of race — that's important too, because it would suggest that whatever factors drive bail decisions, race isn't one of them (at least not directly).

This is exactly the kind of question a chi-square test of independence was built to answer.

The Data

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Bail decision data
observed = np.array([
    [142, 58],     # White:    142 granted, 58 denied
    [108, 92],     # Black:    108 granted, 92 denied
    [87, 63],      # Hispanic: 87 granted, 63 denied
    [43, 7]        # Other:    43 granted, 7 denied
])

race_labels = ['White', 'Black', 'Hispanic', 'Other']
decision_labels = ['Granted', 'Denied']

# Create a labeled DataFrame
df = pd.DataFrame(observed, index=race_labels, columns=decision_labels)
df['Total'] = df.sum(axis=1)
df['% Granted'] = (df['Granted'] / df['Total'] * 100).round(1)

print("=" * 60)
print("BAIL DECISIONS BY RACE")
print("=" * 60)
print(df)
print(f"\nOverall: {observed[:, 0].sum()}/{observed.sum()} "
      f"= {observed[:, 0].sum()/observed.sum():.1%} granted")

Output:

           Granted  Denied  Total  % Granted
White          142      58    200       71.0
Black          108      92    200       54.0
Hispanic        87      63    150       58.0
Other           43       7     50       86.0
Overall: 380/600 = 63.3% granted

The disparities are visible in the raw data. White defendants were granted bail 71% of the time, Black defendants 54%, Hispanic defendants 58%, and defendants in the "Other" category 86%. The question is whether these differences are large enough to be statistically significant — or whether they could arise from random variation in a system that's actually race-neutral.
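Before running the formal test, a quick simulation can build intuition for that question (a side sketch, not part of James's pipeline): if every defendant had the same 63.3% chance of being granted bail regardless of race, how much would group-level grant rates wobble by chance alone?

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

group_sizes = np.array([200, 200, 150, 50])  # White, Black, Hispanic, Other
p_overall = 380 / 600                        # 63.3% granted overall

# 10,000 simulated race-neutral courthouses: every group draws from the
# same 63.3% grant probability, so any rate differences are pure chance.
grants = rng.binomial(group_sizes, p_overall, size=(10_000, 4))
rates = grants / group_sizes

for label, r in zip(['White', 'Black', 'Hispanic', 'Other'], rates.T):
    lo, hi = np.percentile(r, [2.5, 97.5])
    print(f"{label:9s}: 95% of race-neutral simulations in [{lo:.1%}, {hi:.1%}]")
```

For groups of 200, chance alone rarely pushes a grant rate much beyond roughly seven percentage points from 63.3%, so the observed 54% for Black defendants already looks extreme before any formal test; the chi-square machinery below turns that impression into a p-value.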

The Complete Analysis

# ============================================================
# Step 1: Hypotheses
# ============================================================
print("\n" + "=" * 60)
print("STEP 1: HYPOTHESES")
print("=" * 60)
print("H0: Bail decisions are independent of race")
print("Ha: Bail decisions are NOT independent of race")

# ============================================================
# Step 2: Expected frequencies
# ============================================================
chi2, p, dof, expected = stats.chi2_contingency(observed)

print("\n" + "=" * 60)
print("STEP 2: EXPECTED FREQUENCIES")
print("=" * 60)
exp_df = pd.DataFrame(
    np.round(expected, 2),
    index=race_labels,
    columns=decision_labels
)
print(exp_df)
print("\nInterpretation: If bail were independent of race, 63.3% of")
print("EVERY racial group would be granted bail (proportional to")
print("the overall rate).")

# ============================================================
# Step 3: Check conditions
# ============================================================
print("\n" + "=" * 60)
print("STEP 3: CONDITIONS")
print("=" * 60)
print("  Random sampling: Data from a 6-month census of cases (not")
print("    a random sample, but represents the full population of")
print("    cases during this period)")
print("  Independence: Each defendant appears once")
print("  Expected frequencies >= 5:")
print(f"    Minimum expected: {expected.min():.2f}")
print(f"    All >= 5? {'Yes' if expected.min() >= 5 else 'No'}")

# ============================================================
# Step 4: Chi-square statistic
# ============================================================
print("\n" + "=" * 60)
print("STEP 4: CHI-SQUARE STATISTIC")
print("=" * 60)

# Cell-by-cell contributions
contributions = (observed - expected)**2 / expected
contrib_df = pd.DataFrame(
    np.round(contributions, 3),
    index=race_labels,
    columns=decision_labels
)
print("\nContributions to chi-square by cell:")
print(contrib_df)
print(f"\nTotal chi-square: {chi2:.3f}")

# ============================================================
# Step 5: P-value and decision
# ============================================================
print("\n" + "=" * 60)
print("STEP 5: P-VALUE AND DECISION")
print("=" * 60)
print(f"  Chi-square statistic: {chi2:.3f}")
print(f"  Degrees of freedom:   {dof}")
print(f"  P-value:              {p:.6f}")
decision = "REJECT H0" if p < 0.05 else "FAIL TO REJECT H0"
print(f"  Decision at alpha=0.05: {decision}")
if p < 0.05:
    print("\n  Bail decisions are NOT independent of race.")

# ============================================================
# Step 6: Effect size (Cramer's V)
# ============================================================
n = observed.sum()
k = min(observed.shape) - 1
V = np.sqrt(chi2 / (n * k))

print("\n" + "=" * 60)
print("STEP 6: EFFECT SIZE")
print("=" * 60)
print(f"  Cramer's V: {V:.3f}")
if V < 0.1:
    print("  Interpretation: Negligible association")
elif V < 0.2:
    print("  Interpretation: Small association")
elif V < 0.3:
    print("  Interpretation: Small-to-medium association")
elif V < 0.4:
    print("  Interpretation: Medium association")
else:
    print("  Interpretation: Large association")

# ============================================================
# Step 7: Standardized residuals
# ============================================================
residuals = (observed - expected) / np.sqrt(expected)

print("\n" + "=" * 60)
print("STEP 7: STANDARDIZED RESIDUALS")
print("=" * 60)
res_df = pd.DataFrame(
    np.round(residuals, 2),
    index=race_labels,
    columns=decision_labels
)
print(res_df)
print("\nNotable (|r| > 2):")
for i, r in enumerate(race_labels):
    for j, d in enumerate(decision_labels):
        if abs(residuals[i, j]) > 2:
            print(f"  {r} x {d}: {residuals[i,j]:+.2f}")
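The scipy results above can be cross-checked by hand (a verification sketch): expected counts come from the marginal totals, degrees of freedom from the table's shape, and the statistic from summing the cell contributions.

```python
import numpy as np
from scipy import stats

observed = np.array([[142, 58], [108, 92], [87, 63], [43, 7]])

# Expected count for each cell: (row total * column total) / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()
expected_manual = np.outer(row_totals, col_totals) / n

# Degrees of freedom: (rows - 1) * (columns - 1) = 3 * 1 = 3
r, c = observed.shape
dof_manual = (r - 1) * (c - 1)

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(np.allclose(expected_manual, expected))   # True
print(dof_manual == dof)                        # True
print(f"{chi2_manual:.3f} vs scipy's {chi2:.3f}")
```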

What the Residuals Reveal

The overall test is highly significant ($\chi^2 = 25.46$, $df = 3$, $p < 0.001$). But the standardized residuals tell us where the departures from independence are concentrated:

           Granted  Denied
White        +1.36   -1.79
Black        -1.66   +2.18
Hispanic     -0.82   +1.08
Other        +2.02   -2.65

Three cells have standardized residuals exceeding $\pm 2$:

  1. Black, Denied: +2.18. Black defendants were denied bail more often than expected under independence. If bail were race-neutral, about 73 Black defendants would have been denied; the actual number was 92 — nineteen more than expected.

  2. Other, Granted: +2.02. Defendants in the "Other" category were granted bail more often than expected.

  3. Other, Denied: -2.65. Correspondingly, "Other" defendants were denied bail far less often than expected.

Notice that White and Hispanic defendants have residuals within the $\pm 2$ range — their outcomes are closer to what independence would predict. The strongest departures from fairness appear at the extremes: Black defendants are disadvantaged, and "Other" defendants are advantaged, relative to the overall rate.
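The same departures can be read as raw headcounts (a small supplement, recomputing the expected table from scipy): subtracting expected from observed shows how many defendants each cell is off by, and in a two-column table the granted and denied excesses mirror each other within each row.

```python
import numpy as np
from scipy import stats

observed = np.array([[142, 58], [108, 92], [87, 63], [43, 7]])
_, _, _, expected = stats.chi2_contingency(observed)

# Excess counts: positive means more defendants than independence predicts
excess = observed - expected
for label, row in zip(['White', 'Black', 'Hispanic', 'Other'], excess):
    print(f"{label:9s}: {row[0]:+6.1f} granted, {row[1]:+6.1f} denied")
```

The Black row's excess of about +18.7 denials is the "nineteen more than expected" noted above.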

Visualizing the Disparity

# Grouped bar chart: bail rates by race
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel 1: Bail grant rate by race
grant_rates = observed[:, 0] / observed.sum(axis=1) * 100
overall_rate = observed[:, 0].sum() / n * 100

colors = ['#27AE60' if r > overall_rate else '#E74C3C' for r in grant_rates]
bars = axes[0].bar(race_labels, grant_rates, color=colors,
                   edgecolor='black', alpha=0.8)
axes[0].axhline(y=overall_rate, color='black', linestyle='--',
                linewidth=1.5, label=f'Overall rate ({overall_rate:.1f}%)')
axes[0].set_ylabel('Bail Granted (%)', fontsize=12)
axes[0].set_title('Bail Grant Rate by Race', fontsize=13)
axes[0].legend(fontsize=10)
axes[0].set_ylim(0, 100)

# Add percentage labels
for bar, rate in zip(bars, grant_rates):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                 f'{rate:.1f}%', ha='center', fontsize=11, fontweight='bold')

# Panel 2: Standardized residuals heatmap
im = axes[1].imshow(residuals, cmap='RdBu_r', vmin=-3, vmax=3, aspect='auto')
axes[1].set_xticks([0, 1])
axes[1].set_xticklabels(decision_labels, fontsize=11)
axes[1].set_yticks(range(len(race_labels)))
axes[1].set_yticklabels(race_labels, fontsize=11)
axes[1].set_title('Standardized Residuals\n(Red = more than expected, '
                   'Blue = fewer)', fontsize=12)

# Add text labels to heatmap
for i in range(len(race_labels)):
    for j in range(len(decision_labels)):
        text = axes[1].text(j, i, f'{residuals[i,j]:+.2f}',
                           ha='center', va='center', fontsize=12,
                           fontweight='bold',
                           color='white' if abs(residuals[i,j]) > 1.5
                           else 'black')

fig.colorbar(im, ax=axes[1], shrink=0.8, label='Standardized Residual')
plt.tight_layout()
plt.show()

The Ethical Dimension

James presents these findings at a criminal justice reform conference. His analysis is careful and measured:

"The chi-square test of independence provides strong evidence ($p < 0.001$) that bail decisions in this courthouse are associated with the defendant's race. Cramer's V of 0.206 indicates a small-to-medium effect — race matters, but it doesn't determine outcomes on its own.

The standardized residuals identify the specific patterns: Black defendants are denied bail at elevated rates relative to what racial neutrality would predict, while defendants classified as 'Other' experience a countervailing advantage. White and Hispanic defendants are closer to the overall average.

Three caveats are essential:

First, this is an observational analysis. The chi-square test establishes an association between race and bail outcomes. It does not establish that race caused the disparate outcomes. Charge severity, prior criminal history, employment status, and community ties all influence bail decisions and may correlate with race. A regression analysis controlling for these factors would be needed to isolate the independent effect of race.

Second, the category 'Other' combines diverse groups (Asian, Native American, multiracial, and others) into a single bin. The favorable outcomes for this group may reflect compositional effects — perhaps defendants in this category had less serious charges or stronger community ties. The aggregation obscures individual group experiences.

Third, Cramer's V = 0.206 means that knowing a defendant's race explains about 4% of the variation in bail decisions ($V^2 \approx 0.042$). The other 96% is driven by other factors. This doesn't diminish the finding — even a small racial bias in bail decisions has enormous cumulative consequences when applied to thousands of defendants per year — but it does mean that race alone is not destiny in this courthouse."

Connecting to Chapter 16

In Chapter 16, James used a two-proportion $z$-test to compare false positive rates between White and Black defendants for a predictive algorithm (13.3% vs. 31.2%, $z = -4.67$, $p < 0.001$). That test answered a focused question: do these two groups differ?

The chi-square test answers a broader question: is the entire distribution of outcomes independent of race, across all groups simultaneously? It's a generalization of the two-proportion test (in fact, for a $2 \times 2$ table, $\chi^2 = z^2$, provided no continuity correction is applied).

But the chi-square test also introduces something the $z$-test didn't provide: Cramer's V as an effect size measure, and standardized residuals for post-hoc identification of which cells drive the overall result. These tools make the chi-square framework more informative than running multiple pairwise $z$-tests — and they avoid the multiple comparisons problem that would arise from testing all ${4 \choose 2} = 6$ pairs separately.
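That $\chi^2 = z^2$ identity can be checked directly on the White/Black subtable (a verification sketch; `correction=False` matters, because scipy's default Yates continuity correction for $2 \times 2$ tables makes $\chi^2$ slightly smaller than $z^2$):

```python
import numpy as np
from scipy import stats

# 2x2 subtable: White and Black rows only
sub = np.array([[142, 58],
                [108, 92]])

# Two-proportion z-test by hand, with a pooled standard error
n1, n2 = sub.sum(axis=1)              # 200 and 200
p1 = sub[0, 0] / n1                   # 0.71 (White grant rate)
p2 = sub[1, 0] / n2                   # 0.54 (Black grant rate)
p_pool = sub[:, 0].sum() / (n1 + n2)  # 250/400 = 0.625
se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z = (p1 - p2) / se

# Chi-square on the same table, without Yates' continuity correction
chi2_sub, _, _, _ = stats.chi2_contingency(sub, correction=False)

print(f"z = {z:.3f}, z^2 = {z**2:.3f}, chi2 = {chi2_sub:.3f}")
# z = 3.512, z^2 = 12.331, chi2 = 12.331
```

This z differs from Chapter 16's $z = -4.67$ because that test compared the algorithm's false positive rates, not bail grant rates.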

The "Other" Category Problem

One of the most interesting aspects of this analysis is the "Other" category. With $n = 50$, it's the smallest group and it shows the most extreme pattern: 86% granted bail versus 63.3% overall. The standardized residual for "Other, Denied" ($-2.65$) is the single largest contributor to the chi-square statistic.

But what does "Other" mean? It's a catch-all category that combines:

  - Asian Americans
  - Native Americans
  - Pacific Islanders
  - Multiracial individuals
  - Anyone who doesn't fit into the court's other categories

Each of these groups may have very different experiences with the justice system. The high bail grant rate for "Other" could mean that Asian American defendants have favorable outcomes (perhaps due to socioeconomic factors), while Native American defendants within the same category face very different realities.

This is a textbook example of Theme 2: who gets counted, and how, shapes what statistics can reveal. The court's choice to use a single "Other" category makes certain analyses possible (the chi-square test works because $E_{Other,Denied} = 18.33 \geq 5$) while hiding other patterns (the experiences of individual groups within "Other").
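One way to gauge how much the "Other" category drives the overall result (an exploratory sketch, not part of James's published analysis) is to rerun the test with that row removed:

```python
import numpy as np
from scipy import stats

observed = np.array([[142, 58], [108, 92], [87, 63], [43, 7]])

# Retest on White / Black / Hispanic only (drops the "Other" row)
without_other = observed[:3]
chi2_3, p_3, dof_3, _ = stats.chi2_contingency(without_other)

print(f"3-group test: chi2 = {chi2_3:.3f}, df = {dof_3}, p = {p_3:.6f}")
# 3-group test: chi2 = 13.110, df = 2, p = 0.001423
```

The disparity among White, Black, and Hispanic defendants is significant on its own, so the "Other" row amplifies the overall chi-square but is not its sole source.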

James writes in his paper:

"The use of broad racial categories — and especially the 'Other' category — is a limitation common to criminal justice data. While necessary for statistical power, these categories obscure the diverse experiences of the communities they contain. Future research with more granular demographic data is needed to understand whether the apparently favorable outcomes for 'Other' defendants reflect a genuine advantage for certain subgroups or a statistical artifact of aggregation."

What Would Happen with More Data?

James wonders: if he collected 6,000 cases instead of 600 (keeping the same proportions), what would change?

# Scaling up by 10x
observed_large = observed * 10
chi2_large, p_large, dof_large, exp_large = stats.chi2_contingency(
    observed_large
)
n_large = observed_large.sum()
V_large = np.sqrt(chi2_large / (n_large * (min(observed_large.shape) - 1)))

print("With n = 6,000 (same proportions):")
print(f"  Chi-square: {chi2_large:.1f} (was {chi2:.1f})")
print(f"  P-value:    {p_large:.2e} (was {p:.6f})")
print(f"  Cramer's V: {V_large:.3f} (was {V:.3f})")

Output:

With n = 6,000 (same proportions):
  Chi-square: 254.6 (was 25.5)
  P-value:    6.49e-55 (was 0.000012)
  Cramer's V: 0.206 (was 0.206)

The chi-square statistic increases tenfold. The p-value drops to astronomically small levels. But Cramer's V stays exactly the same: 0.206. The strength of the association hasn't changed — we're just more certain about it. This is why Cramer's V is essential: it separates "how strong?" from "how sure?"
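The same logic runs in reverse (a sketch using scipy's noncentral chi-square, with the standard approximation that under the alternative the statistic is noncentral with $\lambda = n w^2$, where Cohen's $w = \sqrt{\chi^2 / n} \approx 0.206$ here): how small could the sample shrink before an association of this strength would routinely escape detection?

```python
import numpy as np
from scipy import stats

w = 0.206     # effect size (Cohen's w; equals Cramer's V when min(r,c)-1 = 1)
dof = 3
alpha = 0.05
crit = stats.chi2.ppf(1 - alpha, dof)    # rejection threshold, about 7.815

for n in [100, 200, 400, 600]:
    nc = n * w**2                        # noncentrality parameter
    power = stats.ncx2.sf(crit, dof, nc) # P(reject H0 | true effect of size w)
    print(f"n = {n:4d}: power = {power:.2f}")
```

At n = 600 the test almost never misses an effect this size; at n = 100 it would miss it roughly half the time. More data sharpens certainty about an association whose strength, as the scaled-up example shows, stays fixed at V = 0.206.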

Try It Yourself

  1. Collapse James's data into a $2 \times 2$ table (White vs. Non-White, Granted vs. Denied). Conduct the chi-square test. Compare the result to the four-category analysis. What information is lost?

  2. Compute the two-proportion $z$-test comparing Black and White bail grant rates (54% vs. 71%). Verify that $z^2 \approx \chi^2$ for the $2 \times 2$ subtable containing only these two groups. (Hint: pass correction=False to stats.chi2_contingency — by default scipy applies Yates' continuity correction to $2 \times 2$ tables, which breaks the exact identity.)

  3. James's colleague suggests that the significant result might disappear if charge severity is accounted for. Design a study that controls for charge severity. What additional data would you need? What test or analysis would you use? (Hint: you might need the methods in Chapters 23 or 24.)

  4. Write a one-paragraph summary of these findings suitable for a community advisory board. Avoid statistical jargon — use percentages, plain language, and focus on what the numbers mean for real people.