Case Study 2: James's Algorithmic Bail Study — Fairness Auditing with Two-Group Comparisons

The Setup

Professor James Washington has spent three years studying algorithmic decision-making in the criminal justice system. His latest research examines a bail recommendation algorithm deployed in a mid-sized U.S. county — one of hundreds of jurisdictions that have adopted predictive tools to help judges decide whether to release defendants before trial.

The algorithm, called RiskAssess, uses factors like prior criminal history, current charges, employment status, residential stability, and age to generate a risk score from 1 (lowest risk) to 10 (highest risk). Defendants scored 1-4 are recommended for release without bail. Those scored 5-7 are recommended for moderate bail. Those scored 8-10 are recommended for high bail or detention.
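The published cut-offs map directly to a simple decision rule. A minimal sketch of that mapping (the function name and structure are illustrative, not RiskAssess's actual implementation):

```python
def bail_recommendation(score: int) -> str:
    """Map a RiskAssess score (1-10) to the county's bail recommendation.

    Illustrative reconstruction of the published cut-offs, not vendor code.
    """
    if not 1 <= score <= 10:
        raise ValueError("RiskAssess scores range from 1 to 10")
    if score <= 4:
        return "release without bail"      # scores 1-4
    if score <= 7:
        return "moderate bail"             # scores 5-7
    return "high bail or detention"        # scores 8-10

print(bail_recommendation(3))   # release without bail
print(bail_recommendation(9))   # high bail or detention
```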

James's research question has two parts:

  1. Overall effectiveness: Does the algorithm produce different recidivism outcomes compared to traditional judicial discretion?
  2. Fairness: Does the algorithm's error rate differ by race?

The first question is a standard two-group comparison. The second is where the real stakes lie.

The Data

James obtained data from the county's court records (with identifiers removed per his IRB agreement). The dataset covers 800 defendants released on bail over an 18-month period:

  • Algorithm group (n = 412): Bail set based on RiskAssess recommendations
  • Judge group (n = 388): Bail set by judicial discretion alone (during months when the algorithm was not available)

import numpy as np
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

# Overall recidivism data
print("=" * 55)
print("PART 1: Overall Recidivism Comparison")
print("=" * 55)

# Re-arrested within 2 years
alg_rearrested = 89
alg_total = 412
judge_rearrested = 107
judge_total = 388

p_alg = alg_rearrested / alg_total
p_judge = judge_rearrested / judge_total

print(f"\nAlgorithm group: {alg_rearrested}/{alg_total} = {p_alg:.3f} "
      f"({p_alg*100:.1f}%)")
print(f"Judge group:     {judge_rearrested}/{judge_total} = {p_judge:.3f} "
      f"({p_judge*100:.1f}%)")
print(f"Difference:      {(p_alg - p_judge)*100:.1f} percentage points")

Part 1: Overall Effectiveness

The Analysis

# Two-proportion z-test
count = np.array([alg_rearrested, judge_rearrested])
nobs = np.array([alg_total, judge_total])

z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')

# Pooled proportion
p_pooled = (alg_rearrested + judge_rearrested) / (alg_total + judge_total)

# Standard error
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/alg_total + 1/judge_total))

print(f"\n--- Two-Proportion z-Test ---")
print(f"H₀: p_alg = p_judge")
print(f"Hₐ: p_alg ≠ p_judge")
print(f"Pooled proportion: {p_pooled:.4f}")
print(f"Standard error: {se:.4f}")
print(f"z = {z_stat:.4f}")
print(f"p-value = {p_value:.4f}")

# 95% CI for the difference (unpooled SE)
ci_low, ci_up = confint_proportions_2indep(
    alg_rearrested, alg_total,
    judge_rearrested, judge_total,
    method='wald'
)
print(f"95% CI for p_alg - p_judge: ({ci_low:.4f}, {ci_up:.4f})")
print(f"95% CI in percentage points: ({ci_low*100:.1f}%, {ci_up*100:.1f}%)")

Interpretation

The algorithm group has a recidivism rate of 21.6% compared to 27.6% for the judge group — a difference of 6.0 percentage points. The two-proportion z-test gives $z = -1.96$, $p = 0.049$.

This is statistically significant at $\alpha = 0.05$, but just barely. The 95% confidence interval for the difference runs from about −11.9 to −0.0 percentage points, which nearly includes zero.

James's first observation: "The algorithm appears to produce slightly lower recidivism rates overall. But the p-value of 0.049 is borderline, and the confidence interval barely excludes zero. This is suggestive but not overwhelming evidence. I'd want to see this replicated before drawing strong conclusions about overall effectiveness.

More importantly, overall effectiveness doesn't tell us whether the algorithm is fair. An algorithm could have a lower overall recidivism rate while still making systematically worse predictions for certain racial groups."
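Given how close $p = 0.049$ sits to the threshold, a quick robustness check is worth running. The chi-square test without continuity correction reproduces the two-proportion z-test exactly (the chi-square statistic equals $z^2$), while the Yates-corrected and Fisher exact versions give a more conservative read. This is a supplementary sketch, not part of James's reported analysis:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# 2x2 table: rows = algorithm / judge, cols = re-arrested / not re-arrested
table = np.array([[89, 412 - 89],
                  [107, 388 - 107]])

# Without continuity correction this is equivalent to the pooled z-test
chi2, p_plain, _, _ = chi2_contingency(table, correction=False)
print(f"chi-square (no correction): {chi2:.3f}, p = {p_plain:.4f}")  # chi2 = z^2

# More conservative variants for a borderline result
chi2_c, p_yates, _, _ = chi2_contingency(table, correction=True)
odds, p_fisher = fisher_exact(table)
print(f"Yates-corrected p = {p_yates:.4f}, Fisher exact p = {p_fisher:.4f}")
```

If the conservative variants land above 0.05, that only reinforces James's caution about the borderline evidence.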

Part 2: The Fairness Audit

This is the analysis that matters most.

Among defendants who were not re-arrested (i.e., whom the algorithm should have identified as low risk), James examines how often each group was incorrectly flagged as high risk (risk score ≥ 8). These are false positives — defendants the algorithm wrongly classified as dangerous.
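In confusion-matrix terms, the false positive rate conditions on the true outcome: FPR = FP / (FP + TN), computed separately within each group. A small helper makes the definition concrete (the counts below are hypothetical, for illustration only):

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN): share of true non-re-offenders flagged high risk."""
    return fp / (fp + tn)

# Illustrative only: 20 flagged among 100 defendants who were not re-arrested
print(false_positive_rate(fp=20, tn=80))  # 0.2
```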

The Data

print("\n" + "=" * 55)
print("PART 2: Fairness Audit — False Positive Rates by Race")
print("=" * 55)

# Among defendants who were NOT re-arrested (true negatives + false positives)
# How many were incorrectly flagged as high-risk?

# White defendants who were NOT re-arrested
white_not_rearrested = 285
white_flagged_high = 38  # incorrectly classified as high risk
white_fp_rate = white_flagged_high / white_not_rearrested

# Black defendants who were NOT re-arrested
black_not_rearrested = 215
black_flagged_high = 67  # incorrectly classified as high risk
black_fp_rate = black_flagged_high / black_not_rearrested

print(f"\nWhite defendants (not re-arrested): n = {white_not_rearrested}")
print(f"  Flagged high risk: {white_flagged_high} "
      f"(FP rate = {white_fp_rate:.3f} = {white_fp_rate*100:.1f}%)")
print(f"\nBlack defendants (not re-arrested): n = {black_not_rearrested}")
print(f"  Flagged high risk: {black_flagged_high} "
      f"(FP rate = {black_fp_rate:.3f} = {black_fp_rate*100:.1f}%)")
print(f"\nDifference: {(black_fp_rate - white_fp_rate)*100:.1f} "
      f"percentage points")

The Test

# Two-proportion z-test for false positive rates
count_fp = np.array([white_flagged_high, black_flagged_high])
nobs_fp = np.array([white_not_rearrested, black_not_rearrested])

z_fp, p_fp = proportions_ztest(count_fp, nobs_fp, alternative='two-sided')

# Pooled proportion
p_pooled_fp = (white_flagged_high + black_flagged_high) / \
              (white_not_rearrested + black_not_rearrested)

# Standard error
se_fp = np.sqrt(p_pooled_fp * (1 - p_pooled_fp) *
                (1/white_not_rearrested + 1/black_not_rearrested))

print(f"\n--- False Positive Rate Comparison ---")
print(f"H₀: FP_white = FP_black")
print(f"Hₐ: FP_white ≠ FP_black")
print(f"Pooled FP rate: {p_pooled_fp:.4f}")
print(f"Standard error: {se_fp:.4f}")
print(f"z = {z_fp:.4f}")
print(f"p-value = {p_fp:.6f}")

# 95% CI
ci_low_fp, ci_up_fp = confint_proportions_2indep(
    white_flagged_high, white_not_rearrested,
    black_flagged_high, black_not_rearrested,
    method='wald'
)
print(f"95% CI for FP_white - FP_black: ({ci_low_fp:.4f}, {ci_up_fp:.4f})")
print(f"95% CI in pp: ({ci_low_fp*100:.1f}%, {ci_up_fp*100:.1f}%)")

The Results

Metric                                  White defendants   Black defendants   Difference
Not re-arrested (n)                     285                215
Flagged high risk (false positives)     38                 67
False positive rate                     13.3%              31.2%              17.8 pp

Two-proportion z-test: z = −4.85, p < 0.001
95% CI for FP_white − FP_black: (−25.2%, −10.5%)

The false positive rate for Black defendants (31.2%) is more than twice the rate for white defendants (13.3%). This difference is enormous and highly statistically significant ($p < 0.001$).
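The absolute gap can be complemented by the disparity ratio (the relative risk of a false positive), with its own confidence interval. A sketch using the standard Katz log-scale Wald interval; this is an addition to James's reported analysis, not part of it:

```python
import numpy as np

# False positives among defendants who were not re-arrested
x_black, n_black = 67, 215
x_white, n_white = 38, 285

p_black, p_white = x_black / n_black, x_white / n_white
ratio = p_black / p_white  # relative risk of a false positive

# Katz log-scale Wald 95% CI for a risk ratio
se_log = np.sqrt(1/x_black - 1/n_black + 1/x_white - 1/n_white)
lo, hi = np.exp(np.log(ratio) + np.array([-1.96, 1.96]) * se_log)

print(f"Disparity ratio (Black FP rate / white FP rate): {ratio:.2f}")
print(f"95% CI for the ratio: ({lo:.2f}, {hi:.2f})")
```

Even the lower bound of the interval leaves Black non-re-offenders well over 1.5 times as likely to be flagged.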

What This Means

print("\n" + "=" * 55)
print("INTERPRETATION")
print("=" * 55)
print("""
Among defendants who did NOT go on to re-offend:
  - 13.3% of white defendants were incorrectly flagged as high risk
  - 31.2% of Black defendants were incorrectly flagged as high risk

This means a Black defendant who will NOT re-offend is 2.3 times
as likely to be incorrectly labeled as high risk as a white
defendant who will also NOT re-offend.

In human terms: if you are a Black defendant who would not have
committed another crime, the algorithm was more than twice as likely
to recommend that you be detained or face high bail — compared to
a white defendant in the exact same position.
""")

Part 3: Connecting the Statistics to the People

James maps the numbers back to real consequences.

print("=" * 55)
print("HUMAN IMPACT")
print("=" * 55)

# What does a false positive mean for a defendant?
print("""
For each false positive, a defendant who would NOT have re-offended:
  - May spend days, weeks, or months in jail awaiting trial
  - May lose their job (median jail stay = 2-3 weeks)
  - May lose housing
  - May lose custody of children
  - May plead guilty to avoid prolonged detention
  - Faces lasting consequences on criminal record

Among the 500 non-re-arrested defendants in this study:
""")

print(f"  White false positives: {white_flagged_high} defendants")
print(f"  Black false positives: {black_flagged_high} defendants")
print(f"\n  If the Black FP rate matched the white FP rate:")
expected_if_equal = round(white_fp_rate * black_not_rearrested)
excess = black_flagged_high - expected_if_equal
print(f"    Expected Black FPs: {expected_if_equal}")
print(f"    Excess due to disparity: {excess} defendants")
print(f"\n  These {excess} individuals were unnecessarily detained,")
print(f"  lost jobs, or faced consequences they should not have faced —")
print(f"  disproportionately because of their race.")

Part 4: Why Does the Disparity Exist?

The statistical test establishes that the disparity is real — far too large to be explained by chance. But why does the algorithm produce different error rates for different racial groups?

James identifies several potential sources:

1. Training Data Bias

The algorithm was trained on historical criminal justice data. If policing practices disproportionately target Black communities (e.g., more patrols, more stop-and-frisks), then Black individuals are more likely to have recorded criminal contacts — even for behavior that goes unrecorded in white communities. The algorithm learns these patterns and perpetuates them.

2. Proxy Variables

The algorithm doesn't use race directly. But it uses variables that are correlated with race: neighborhood (due to residential segregation), employment history (due to labor market discrimination), and prior contacts with police (due to differential policing). These serve as proxy variables for race.

Connection to Chapter 4: This is confounding in disguise. The algorithm uses legitimate-sounding predictors that are confounded with race. The variables aren't "wrong" individually — someone with more prior arrests is statistically more likely to be re-arrested. But the prior arrests themselves reflect unequal policing, creating a feedback loop.
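The proxy mechanism can be demonstrated with a toy simulation: race never enters the score, yet mean scores differ by race because a policing-driven predictor carries the correlation. Every parameter below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical population; race is recorded only so we can audit the result
is_black = rng.random(n) < 0.4

# Same underlying behavior in both groups...
base_contacts = rng.poisson(1.0, n)
# ...but differential policing records more of it for one group
recorded = rng.binomial(base_contacts, np.where(is_black, 0.8, 0.5))

# The "algorithm" scores on recorded contacts only; race is never an input
score = 1 + 2 * recorded

print(f"Mean score, Black defendants: {score[is_black].mean():.2f}")
print(f"Mean score, white defendants: {score[~is_black].mean():.2f}")
```

The score gap emerges entirely from the recording rate, which is exactly the feedback loop the chapter describes.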

3. Base Rate Differences

The re-arrest rate differs between groups (partly due to differential policing, partly due to socioeconomic factors). When base rates differ, an algorithm calibrated for the overall population will systematically over-predict for one group and under-predict for another. This is a mathematical consequence of the base rate differences — the algorithm can't simultaneously be "fair" by all definitions.
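This consequence can be made concrete with an analytic toy example: a score that is perfectly calibrated in both groups, applied with a single threshold, still produces unequal false positive rates when the score distributions (and hence base rates) differ. The numbers are invented for illustration:

```python
# Two score levels; P(re-arrest | score) = score in BOTH groups (calibration)
scores = [0.2, 0.6]

# The groups differ only in how many members receive each score
share_A = [0.7, 0.3]  # group A: mostly low scores (lower base rate)
share_B = [0.4, 0.6]  # group B: mostly high scores (higher base rate)

def fpr(shares, threshold=0.5):
    """FPR among true non-re-offenders when flagging scores above threshold."""
    flagged_neg = sum(w * (1 - s) for w, s in zip(shares, scores) if s > threshold)
    total_neg = sum(w * (1 - s) for w, s in zip(shares, scores))
    return flagged_neg / total_neg

print(f"FPR group A: {fpr(share_A):.3f}")  # 0.12 / 0.68 = 0.176
print(f"FPR group B: {fpr(share_B):.3f}")  # 0.24 / 0.56 = 0.429
```

By construction the score is calibrated in both groups, and the PPV of a flag is 0.6 in each, yet the higher-base-rate group's non-re-offenders face more than twice the false positive rate.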

Part 5: The Fairness Dilemma

James's findings parallel the real-world ProPublica analysis of the COMPAS algorithm (Angwin et al., 2016), which found nearly identical patterns in Broward County, Florida.

The troubling insight: there are multiple valid definitions of fairness, and they are mathematically incompatible when base rates differ between groups.

Fairness definition           Meaning                                                         RiskAssess result
Equal false positive rates    Same FP rate for all racial groups                              Violated (13.3% vs. 31.2%)
Predictive parity             Same positive predictive value (PPV) for all groups             Approximately met
Calibration                   Among those scored X, same recidivism rate regardless of race   Approximately met

RiskAssess achieves approximate predictive parity and calibration — but at the cost of dramatically unequal false positive rates. This is not a bug in the implementation; it's a mathematical impossibility result: when base rates differ, you cannot simultaneously achieve equal false positive rates, equal false negative rates, and predictive parity. Something has to give.

James's conclusion: "The algorithm is 'fair' by some definitions and deeply unfair by others. The definition of fairness you choose is not a statistical question — it's an ethical and political one. But the measurement of whether each definition is satisfied? That's statistics. And two-group comparison tests — like the two-proportion z-test — are the primary tool for that measurement.

Every time a jurisdiction adopts a predictive algorithm, someone should be running the tests in this case study. If they're not disaggregating outcomes by race, gender, age, and other protected categories, they're not doing fairness auditing. And without fairness auditing, they're not practicing responsible data science."

The Complete Analysis in One Block

import numpy as np
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

def fairness_audit(label, successes, totals, group_labels):
    """Run a two-proportion z-test as a fairness audit."""
    rates = [s/n for s, n in zip(successes, totals)]
    z, p = proportions_ztest(np.array(successes), np.array(totals))
    ci_low, ci_up = confint_proportions_2indep(
        successes[0], totals[0], successes[1], totals[1], method='wald'
    )

    print(f"\n{'─' * 50}")
    print(f"  {label}")
    print(f"{'─' * 50}")
    for i, gl in enumerate(group_labels):
        print(f"  {gl}: {successes[i]}/{totals[i]} = {rates[i]:.1%}")
    print(f"  Difference: {(rates[0]-rates[1])*100:.1f} pp")
    print(f"  z = {z:.3f}, p = {p:.4f}")
    print(f"  95% CI: ({ci_low*100:.1f}%, {ci_up*100:.1f}%)")
    sig = "Significant" if p < 0.05 else "Not significant"
    print(f"  → {sig} at α = 0.05")

# Overall comparison
fairness_audit(
    "Overall Recidivism: Algorithm vs. Judge",
    [89, 107], [412, 388],
    ["Algorithm", "Judge"]
)

# False positive rates by race
fairness_audit(
    "False Positive Rate by Race (Algorithm Group)",
    [38, 67], [285, 215],
    ["White (not rearrested)", "Black (not rearrested)"]
)

# Additional audit James runs: false negative rates by race
fairness_audit(
    "False Negative Rate by Race (Algorithm Group)",
    [14, 22], [55, 61],
    ["White (rearrested)", "Black (rearrested)"]
)

Discussion Questions

  1. The overall comparison (Part 1) found the algorithm slightly better than judges ($p = 0.049$). Does this mean the algorithm should be adopted? How does the fairness analysis (Part 2) change your answer?

  2. If the county decided to adjust the algorithm's threshold to equalize false positive rates across racial groups, what would happen to the overall recidivism rate? What tradeoffs would this create?

  3. James's study uses observational data (defendants weren't randomly assigned to algorithm vs. judge). What threats to validity does this create? How would you design a randomized experiment to test the algorithm, if one were ethical and practical?

  4. The ProPublica analysis (2016) and Northpointe's rebuttal illustrated that different definitions of fairness can conflict. In 300-400 words, argue for which definition of fairness should take priority in criminal justice applications, and why.

  5. How does this case study connect to the base rate fallacy from Chapter 9? If the base rate of re-arrest differs by race (partly due to differential policing), how does this affect the interpretation of false positive rates?

  6. James found the false positive rate for Black defendants is 31.2% compared to 13.3% for white defendants. Calculate the ratio (Black FP rate / White FP rate). How does this "disparity ratio" complement the absolute difference of 17.8 percentage points in telling the story?