Case Study 2: James's Bail Algorithm — Effect Sizes and Fairness
The Setup
Professor James Washington has presented his Chapter 16 findings to the county's Criminal Justice Reform Commission. The overall comparison showed a small, barely significant difference between algorithmic and judge-based bail decisions ($p = 0.049$). Some commission members took this as good news: "The algorithm is basically the same as judges — and it's faster and cheaper."
But James knows that the overall comparison tells only part of the story. The disaggregated analysis — comparing false positive rates by race — revealed a much larger and more troubling pattern. Now he needs to present these findings in a way that communicates not just statistical significance, but the magnitude and human cost of the disparity.
The Complete Analysis
import numpy as np
from statsmodels.stats.power import NormalIndPower
# ============================================================
# JAMES'S BAIL ALGORITHM — FULL EFFECT SIZE ANALYSIS
# ============================================================
print("=" * 65)
print("CRIMINAL JUSTICE ALGORITHM AUDIT — EFFECT SIZE REPORT")
print("=" * 65)
# ---- Part 1: Overall Recidivism Comparison ----
print("\n--- PART 1: Overall Algorithm vs. Judge Comparison ---\n")
# Recidivism rates
alg_rearrest, alg_total = 89, 412
judge_rearrest, judge_total = 107, 388
p_alg = alg_rearrest / alg_total
p_judge = judge_rearrest / judge_total
diff_overall = p_alg - p_judge
# Cohen's h
h_overall = 2 * np.arcsin(np.sqrt(p_alg)) - 2 * np.arcsin(np.sqrt(p_judge))
# Confidence interval
se_overall = np.sqrt(p_alg*(1-p_alg)/alg_total + p_judge*(1-p_judge)/judge_total)
ci_low = diff_overall - 1.96 * se_overall
ci_high = diff_overall + 1.96 * se_overall
# Power
power_analysis = NormalIndPower()
power_overall = power_analysis.solve_power(
    effect_size=abs(h_overall), nobs1=alg_total,
    alpha=0.05, ratio=judge_total/alg_total,
    alternative='two-sided'
)
print(f"Algorithm recidivism: {p_alg:.3f} ({alg_rearrest}/{alg_total})")
print(f"Judge recidivism: {p_judge:.3f} ({judge_rearrest}/{judge_total})")
print(f"Difference: {diff_overall:.3f} ({diff_overall*100:.1f} percentage points)")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"Cohen's h: {abs(h_overall):.3f} — SMALL effect")
print(f"p-value: 0.049 (barely significant)")
print(f"Achieved power: {power_overall:.2f} ({power_overall*100:.0f}%)")
print(f"\nVerdict: Small effect, barely significant, underpowered ({power_overall*100:.0f}%).")
print(f" This result is fragile — a slightly different sample")
print(f" could easily produce p > 0.05.")
# ---- Part 2: Racial Disparity in False Positives ----
print(f"\n{'='*65}")
print("--- PART 2: Racial Disparity in False Positive Rates ---\n")
# False positive rates by race (from Ch.16 case study)
fp_white = 30 / 225 # 13.3%
fp_black = 86 / 275 # 31.3%
n_white, n_black = 225, 275
diff_racial = fp_black - fp_white
h_racial = 2 * np.arcsin(np.sqrt(fp_black)) - 2 * np.arcsin(np.sqrt(fp_white))
se_racial = np.sqrt(fp_white*(1-fp_white)/n_white + fp_black*(1-fp_black)/n_black)
ci_racial_low = diff_racial - 1.96 * se_racial
ci_racial_high = diff_racial + 1.96 * se_racial
# Power for the racial disparity test
power_racial = power_analysis.solve_power(
    effect_size=abs(h_racial), nobs1=n_white,
    alpha=0.05, ratio=n_black/n_white,
    alternative='two-sided'
)
print(f"White defendants FP rate: {fp_white:.3f} ({30}/{n_white})")
print(f"Black defendants FP rate: {fp_black:.3f} ({86}/{n_black})")
print(f"Difference: {diff_racial:.3f} ({diff_racial*100:.1f} percentage points)")
print(f"95% CI: ({ci_racial_low:.3f}, {ci_racial_high:.3f})")
print(f"Cohen's h: {abs(h_racial):.3f} — MEDIUM effect")
print(f"Relative risk: {fp_black/fp_white:.1f}x")
print(f"p-value: < 0.001 (highly significant)")
print(f"Achieved power: {power_racial:.2f} ({power_racial*100:.0f}%)")
print(f"\nVerdict: Medium effect, highly significant, well-powered (>99%).")
print(f" This result is robust — it would replicate reliably.")
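The p-values above are quoted from the Chapter 16 analysis rather than recomputed. As a sanity check, both can be reproduced directly from the raw counts with a pooled two-proportion z-test — a minimal sketch using statsmodels' `proportions_ztest`:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Overall comparison: algorithm (89/412) vs. judge (107/388) recidivism
z_overall, p_overall = proportions_ztest(
    count=np.array([89, 107]), nobs=np.array([412, 388])
)
print(f"Overall comparison: z = {z_overall:.2f}, p = {p_overall:.3f}")

# Racial disparity: Black (86/275) vs. white (30/225) false positive rates
z_fp, p_fp = proportions_ztest(
    count=np.array([86, 30]), nobs=np.array([275, 225])
)
print(f"Racial disparity:   z = {z_fp:.2f}, p = {p_fp:.1e}")
```

Both match the quoted values: roughly p = 0.049 for the overall comparison, and p far below 0.001 for the disparity.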
Comparing the Two Findings
James creates a side-by-side comparison that makes the contrast unmistakable:
print(f"\n{'='*65}")
print("SIDE-BY-SIDE COMPARISON")
print(f"{'='*65}\n")
comparisons = [
    ("Measure", "Overall Comparison", "Racial Disparity"),
    ("─" * 20, "─" * 25, "─" * 25),
    ("Difference", f"{abs(diff_overall)*100:.1f} pp", f"{diff_racial*100:.1f} pp"),
    ("Cohen's h", f"{abs(h_overall):.3f} (small)", f"{abs(h_racial):.3f} (medium)"),
    ("p-value", "0.049 (barely sig.)", "< 0.001 (highly sig.)"),
    ("Power", f"{power_overall*100:.0f}% (inadequate)", f"{power_racial*100:.0f}% (excellent)"),
    ("CI width", f"{(ci_high-ci_low)*100:.1f} pp", f"{(ci_racial_high-ci_racial_low)*100:.1f} pp"),
    ("Replicability", "Fragile", "Robust"),
    ("Practical impact", "Modest", "Substantial"),
]
for row in comparisons:
    print(f"  {row[0]:<20} {row[1]:<25} {row[2]:<25}")
The Human Cost
James then translates the effect sizes into human terms:
print(f"\n{'='*65}")
print("HUMAN IMPACT ANALYSIS")
print(f"{'='*65}\n")
# Excess false positives among Black defendants in the audit sample,
# relative to the FP rate observed for white defendants
excess_fp = int(round((fp_black - fp_white) * n_black))
print(f"Excess false positives among Black defendants: ~{excess_fp}")
print("(the count above what we'd expect if Black defendants had the same FP rate as white defendants)")
print()
# What this means
print("Each false positive represents a person who:")
print(" - Was incorrectly classified as high-risk")
print(" - May have been denied bail or given higher bail")
print(" - May have spent additional time in pretrial detention")
print(" - May have lost employment, housing, or custody during detention")
print(" - Incurred legal costs fighting the risk classification")
print()
# Scale projection
county_annual = 3200 # hypothetical annual cases
pct_black = 0.40 # hypothetical Black defendant share
black_annual = county_annual * pct_black
# Projected annual excess false positives
annual_excess_fp = int(round((fp_black - fp_white) * black_annual))
print(f"If the county processes {county_annual} cases/year:")
print(f" Black defendants: ~{int(black_annual)}")
print(f" Projected excess false positives per year: ~{annual_excess_fp}")
print(f" Over 5 years: ~{annual_excess_fp * 5}")
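The point estimate hides sampling uncertainty. As a follow-on sketch (reusing the 95% CI bounds printed in Part 2 and the same hypothetical caseload figures), the CI on the FP-rate gap can be propagated into the annual projection:

```python
# 95% CI for the FP-rate difference, as computed in Part 2
ci_gap_low, ci_gap_high = 0.109, 0.250
black_annual = 3200 * 0.40   # hypothetical annual Black defendant count

lo = ci_gap_low * black_annual
hi = ci_gap_high * black_annual
print(f"Projected excess false positives per year: {lo:.0f} to {hi:.0f}")
```

Even at the optimistic end of the interval, the projection exceeds a hundred excess false positives per year.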
James's Presentation to the Commission
James structures his presentation around three key points:
Point 1: The Overall Comparison Is Misleading
"The overall comparison shows a small, barely significant difference ($h = 0.14$, $p = 0.049$). Critically, our study had only about 50% power for this comparison — meaning we had roughly a coin flip's chance of detecting an effect of this size. If the true effect is this small, we'd need about 800 defendants per group — not the 400 we had — to reliably detect it.
"This means the overall comparison is essentially uninformative. A failure to find significance would not have meant the algorithm is equivalent to judges — it would have meant we didn't have enough data to tell."
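The 800-per-group figure can be checked with the same power machinery used in the audit script — a sketch assuming the observed $h = 0.14$ is the true effect and targeting the conventional 80% power:

```python
from statsmodels.stats.power import NormalIndPower

# Sample size per group needed to detect h = 0.14 at 80% power;
# leaving nobs1 unspecified tells solve_power to solve for it
n_needed = NormalIndPower().solve_power(
    effect_size=0.14, power=0.80, alpha=0.05,
    ratio=1.0, alternative='two-sided'
)
print(f"Required n per group: {n_needed:.0f}")   # ≈ 800
```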
Point 2: The Racial Disparity Is Both Statistically and Practically Significant
"The racial disparity in false positive rates is a different story entirely. The effect is medium-sized ($h = 0.43$), highly significant ($p < 0.001$), and our study had over 99% power to detect it. This result is robust — it would replicate in virtually any similar study.
"Black defendants are 2.3 times more likely to be incorrectly classified as high-risk. In our audit sample alone, that disparity accounts for roughly 50 excess false positives; projected across the county's full caseload, it amounts to roughly 230 per year — each one a person subjected to unnecessary restrictions based on their race."
Point 3: Effect Size Matters More Than Statistical Significance
"The most dangerous conclusion from this data would be: 'The algorithm is slightly better overall ($p = 0.049$), so let's keep using it.' That conclusion ignores the effect sizes.
"The overall benefit is small ($h = 0.14$). The racial cost is medium ($h = 0.43$). In effect size terms, the harm is three times larger than the benefit. When we further consider that the 'benefit' comparison is underpowered and fragile while the 'harm' comparison is robust, the picture is clear."
The Broader Context
print(f"\n{'='*65}")
print("EFFECT SIZE IN CONTEXT: CRIMINAL JUSTICE RESEARCH")
print(f"{'='*65}\n")
print("Comparison with published effect sizes in similar research:\n")
context_data = [
    ("ProPublica COMPAS analysis (2016)",
     "FP rate: 44.9% Black vs. 23.5% white", "h ≈ 0.46"),
    ("This study (James's analysis)",
     "FP rate: 31.3% Black vs. 13.3% white", "h ≈ 0.43"),
    ("Predictive policing effectiveness (meta-analysis)",
     "Crime reduction vs. traditional policing", "d ≈ 0.15"),
    ("Pretrial risk assessments (meta-analysis)",
     "Failure to appear rate reduction", "d ≈ 0.20"),
]
for study, finding, es in context_data:
    print(f"  {study}")
    print(f"    Finding: {finding}")
    print(f"    Effect size: {es}")
    print()
print("Key insight: The racial disparities (h ≈ 0.4-0.5) are consistently")
print("LARGER than the algorithms' benefits (d ≈ 0.15-0.20).")
print("The harms, measured by effect size, outweigh the gains.")
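The h values in the comparison can be reproduced from the published rates. A quick sketch with a small Cohen's h helper (the COMPAS rates are ProPublica's reported figures):

```python
import numpy as np

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# ProPublica's reported COMPAS false positive rates by race
h_compas = abs(cohens_h(0.449, 0.235))
print(f"COMPAS FP-rate disparity: h = {h_compas:.2f}")   # h ≈ 0.46
```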
The Impossibility Result
James also explains a mathematical result that puts the findings in theoretical perspective:
"Chouldechova (2017) proved that when base rates of recidivism differ between racial groups — which they do in American criminal justice, due to systemic factors — it is mathematically impossible for an algorithm to simultaneously achieve:
- Equal false positive rates across groups
- Equal false negative rates across groups
- Equal predictive accuracy across groups
"The algorithm can be 'fair' by one definition and 'unfair' by another. This isn't a flaw in the algorithm — it's a mathematical constraint. The question of which definition of fairness to prioritize is not a statistical question. It's an ethical one.
Effect sizes help frame this ethical decision. When we say the FP rate disparity has $h = 0.43$, we're quantifying the price of one definition of 'fairness.' Policymakers need this number — not just a p-value — to make informed decisions about which tradeoffs they're willing to accept."
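Chouldechova's constraint can be illustrated with a toy Bayes calculation (hypothetical base rates and error rates, chosen only for illustration): hold the false positive and false negative rates equal across two groups, and the positive predictive value must differ whenever the base rates differ.

```python
def ppv(base_rate, fpr, fnr):
    """P(recidivism | flagged high-risk) via Bayes' theorem."""
    tpr = 1 - fnr                                      # true positive rate
    flagged = tpr * base_rate + fpr * (1 - base_rate)  # P(flagged)
    return tpr * base_rate / flagged

# Identical error rates for both groups...
fpr, fnr = 0.20, 0.30
# ...but different (hypothetical) base rates of recidivism
ppv_low = ppv(0.25, fpr, fnr)
ppv_high = ppv(0.40, fpr, fnr)
print(f"PPV at 25% base rate: {ppv_low:.2f}")   # 0.54
print(f"PPV at 40% base rate: {ppv_high:.2f}")  # 0.70
```

Equalizing the first two of Chouldechova's criteria forces the third apart; no tuning of the classifier can close this gap while base rates differ.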
Discussion Questions
- James's overall comparison ($h = 0.14$, $p = 0.049$) is both small and barely significant. If the p-value had been 0.051 (not significant), would that change the practical implications? Should it?
- The racial FP rate disparity ($h = 0.43$) is a medium effect. What would it take for this effect to be considered "small enough" to be acceptable? Is there a threshold, or is any racial disparity in criminal justice unacceptable regardless of effect size?
- James notes that the overall comparison was underpowered (about 50% power). If the county had funded a larger study (say, $n = 1000$ per group), two things could happen: (a) the overall difference becomes clearly significant with a small effect, or (b) the overall difference turns out to be even smaller than estimated (winner's curse). Which do you think is more likely, and why?
- A county official suggests using the algorithm but adjusting the risk thresholds by race to equalize false positive rates. James objects that "race-aware" algorithms raise their own ethical concerns. What are the tradeoffs? How would you approach this decision?
- Connection to Bayes' theorem (Ch.9): The false positive rate ($P(\text{high-risk} | \text{no recidivism})$) is different from the positive predictive value ($P(\text{recidivism} | \text{high-risk})$). Which is more relevant for a defendant? Which is more relevant for a judge? How does this connect to the base rate fallacy?
- How should effect sizes be communicated to non-statistical audiences (like the county commission)? James chose to translate $h = 0.43$ into a concrete count of excess false positives. Is this more persuasive than the statistical number? More honest? Both?
Connection to Earlier Chapters
This case study integrates concepts from across the textbook:
| Chapter | Concept Applied |
|---|---|
| Ch.4 | Study design: quasi-experimental (algorithm deployed during some months, not others) limits causal claims |
| Ch.9 | Bayes' theorem: FP rate $\neq$ PPV; base rates differ by race due to systemic factors |
| Ch.12 | Confidence intervals: the CI for the racial disparity (10.9 to 25.0 pp) tells us the range of plausible harm |
| Ch.13 | P-values: $p = 0.049$ vs. $p < 0.001$ — same logic, very different evidentiary strength |
| Ch.16 | Two-proportion z-test: the test that produced the significance results |
| Ch.17 | Effect sizes and power: the tools that reveal what the p-values hide |
The progression from Chapter 1 (introducing James's research question) through Chapter 17 (answering it with nuance) demonstrates why statistical literacy matters. A naive reading of the p-values alone — "the algorithm is significantly better overall" — would lead to exactly the wrong conclusion. The effect size analysis reveals that the story is far more complicated, and far more consequential, than any single number can capture.