Case Study 2: James's Algorithmic Reckoning — Ethics of Data-Driven Criminal Justice
The Setup
Professor James Washington is sitting in his office at 11 PM, staring at a spreadsheet that represents two years of his life.
He's about to publish the definitive analysis of Riverside County's RiskAssess algorithm — the predictive tool that assigns every arrested person a risk score from 1 to 10, which judges use to make bail and sentencing decisions. James has analyzed 14,847 cases from 2018 to 2023. His methods are rigorous. His sample is comprehensive. His conclusions are devastating.
And he knows that when this paper is published, it will change things in Riverside County. He just doesn't know if it will change them for the better.
Here's what James found:
The Overall Performance
import pandas as pd
import numpy as np
from scipy import stats
# ============================================================
# JAMES'S ANALYSIS: RiskAssess Algorithm Performance
# Five years of data (2018-2023), 14,847 cases
# ============================================================
np.random.seed(2026)
print("=" * 60)
print("OVERALL ALGORITHM PERFORMANCE")
print("=" * 60)
metrics = {
    'Total cases analyzed': '14,847',
    'Overall accuracy (correctly predicted outcome)': '78.3%',
    'R² (risk score vs. actual recidivism)': '0.85',
    'AUC-ROC': '0.87',
    'False positive rate (scored high risk, did not reoffend)': '29.1%',
    'False negative rate (scored low risk, did reoffend)': '12.4%'
}

for metric, value in metrics.items():
    print(f"  {metric}: {value}")
print("\nBy conventional machine learning standards, this is a")
print("GOOD model. Most practitioners would be satisfied.")
By most standards, this is a good model. An $R^2$ of 0.85 means the algorithm explains 85% of the variation in recidivism outcomes. An AUC of 0.87 is considered "good" to "excellent." If this were a medical diagnostic tool, it would likely be approved.
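For readers who want to see how a metric like AUC-ROC is actually computed, here is a minimal sketch using the rank-based definition. The scores and labels below are invented toy data for illustration, not James's dataset:

```python
import numpy as np

def auc_roc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen
    positive case outranks a randomly chosen negative case."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score against every negative score;
    # ties count as half a win.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Illustrative only: a perfectly separating score gives AUC = 1.0;
# an uninformative score gives 0.5.
labels = [0, 0, 0, 1, 1, 1]
perfect = [1, 2, 3, 7, 8, 9]
print(auc_roc(perfect, labels))  # 1.0
```

An AUC of 0.87, then, means that 87% of the time the model ranks a true reoffender above a non-reoffender.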
But James didn't stop at the overall numbers.
The Racial Disparity
print("\n" + "=" * 60)
print("PERFORMANCE BY RACE")
print("=" * 60)
racial_data = pd.DataFrame({
    'Metric': ['Cases', 'R²', 'AUC-ROC',
               'False positive rate (high risk, no reoffense)',
               'False negative rate (low risk, reoffended)',
               'Mean risk score (all defendants)',
               'Mean risk score (no reoffense)',
               'Avg. days detained pretrial'],
    'White': ['7,234', '0.91', '0.93',
              '23.1%', '8.7%', '4.2', '3.1', '12.3'],
    'Black': ['5,891', '0.73', '0.78',
              '44.2%', '17.8%', '6.1', '5.4', '31.7'],
    'Hispanic': ['1,722', '0.79', '0.81',
                 '37.5%', '14.2%', '5.3', '4.6', '22.8']
})

for _, row in racial_data.iterrows():
    print(f"\n  {row['Metric']}:")
    print(f"    White:    {row['White']}")
    print(f"    Black:    {row['Black']}")
    print(f"    Hispanic: {row['Hispanic']}")
The numbers are stark:
- False positive rate: 23.1% for white defendants, 44.2% for Black defendants. Nearly twice as many Black defendants are wrongly classified as high risk.
- Mean risk score for defendants who did NOT reoffend: 3.1 for white defendants, 5.4 for Black defendants. Black defendants who pose no actual risk are given scores more than two points higher.
- Average pretrial detention: 12.3 days for white defendants, 31.7 days for Black defendants.
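The disparities quoted above follow directly from the table; a quick check, using only the figures in James's summary table:

```python
# Figures from the performance-by-race table above
fpr_white, fpr_black = 0.231, 0.442          # false positive rates
detain_white, detain_black = 12.3, 31.7      # avg. days detained pretrial
score_white, score_black = 3.1, 5.4          # mean risk score, no reoffense

print(f"FPR ratio (Black/White):       {fpr_black / fpr_white:.2f}x")        # ~1.91x
print(f"Detention ratio (Black/White): {detain_black / detain_white:.2f}x")  # ~2.58x
print(f"Score gap (no reoffense):      {score_black - score_white:.1f} pts")  # 2.3
```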
The Human Cost
print("\n" + "=" * 60)
print("HUMAN COST ANALYSIS")
print("=" * 60)
# Black defendants scored 7+ who did NOT reoffend
black_high_risk = int(5891 * 0.32) # ~32% scored 7+
black_false_pos = int(black_high_risk * 0.442)
white_high_risk = int(7234 * 0.18) # ~18% scored 7+
white_false_pos = int(white_high_risk * 0.231)
print(f"\nBlack defendants scored 7+ (high risk): ~{black_high_risk}")
print(f" Of these, did NOT reoffend: ~{black_false_pos}")
print(f" Average excess detention: ~19 days each")
print(f" Total person-days of wrongful detention: "
f"~{black_false_pos * 19:,}")
print(f"\nWhite defendants scored 7+ (high risk): ~{white_high_risk}")
print(f" Of these, did NOT reoffend: ~{white_false_pos}")
print(f" Average excess detention: ~8 days each")
print(f" Total person-days of wrongful detention: "
f"~{white_false_pos * 8:,}")
print(f"\nDisparity ratio: Black false positive excess detention")
print(f"is approximately {(black_false_pos * 19) / max(white_false_pos * 8, 1):.1f}x "
f"higher than white")
print("\n" + "-" * 60)
print("Each of these numbers represents a person who:")
print(" - Lost days or weeks of their life to pretrial detention")
print(" - May have lost their job while detained")
print(" - May have lost housing while detained")
print(" - Had their family disrupted")
print(" - Was told, by a number on a screen, that they were")
print(" dangerous — when they were not")
print("-" * 60)
Why the Disparity Exists
James has identified three contributing factors to the algorithm's racial disparity:
Factor 1: Training Data Reflects Historical Policing
The algorithm was trained on arrest and conviction data from 2010-2018 — a period when Riverside County had a policy of intensified patrols in predominantly Black neighborhoods (the "hot-spot policing" initiative). This means:
- Black residents were arrested at higher rates, not necessarily because they committed more crimes, but because there were more police in their neighborhoods
- Higher arrest rates in the training data → higher predicted risk scores for people with demographic profiles similar to those arrested
- The algorithm learned to associate being policed with being dangerous
Connection to Theme 5 (Correlation vs. Causation): The algorithm treats the correlation between demographic factors and arrest rates as if it were a correlation with criminal behavior. But arrest rates are partly a function of policing intensity, which is a confounding variable. The algorithm confuses "was arrested" (which depends on policing decisions) with "committed a crime" (which is what we actually want to predict).
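The feedback loop described above can be made concrete with a toy simulation. Everything here is assumed for illustration, not Riverside County data: two synthetic neighborhoods with identical true offense rates, but a 3x difference in detection probability due to patrol intensity:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # synthetic residents per neighborhood

# ASSUMED for illustration: identical true offense rates in both
# neighborhoods, but heavier patrols make detection 3x likelier in B.
true_offense_rate = 0.05
p_detect = {'A (light patrol)': 0.10, 'B (heavy patrol)': 0.30}

arrest_rate = {}
for name, detect in p_detect.items():
    offended = rng.random(n) < true_offense_rate
    arrested = offended & (rng.random(n) < detect)
    arrest_rate[name] = arrested.mean()
    print(f"Neighborhood {name}: offense rate {offended.mean():.3f}, "
          f"arrest rate {arrest_rate[name]:.4f}")

# A model trained on arrests alone would score residents of B as
# roughly 3x riskier, despite identical underlying behavior.
```

The offense rates come out nearly identical, but the arrest rates differ threefold, and arrest rate is what the training data records.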
Factor 2: Proxy Variables
The algorithm doesn't use race directly as a predictor. But it uses variables that are strongly correlated with race:
- Zip code (residential segregation means zip code correlates with race)
- Employment history (structural barriers to employment affect Black Americans disproportionately)
- Prior contacts with police (more policing → more contacts, regardless of behavior)
- Neighborhood crime rate (see Factor 1)
These "proxy variables" effectively smuggle race into the model through the back door. The algorithm is technically race-neutral but functionally race-conscious.
Factor 3: Different Base Rates and Calibration
The algorithm is calibrated for the overall population. A risk score of 7 means a 55% probability of reoffending across all defendants. But:
- For white defendants scored 7: actual recidivism rate is 62%
- For Black defendants scored 7: actual recidivism rate is 38%
The same score means different things for different groups. This is a calibration failure that systematically over-predicts risk for Black defendants.
print("=" * 60)
print("CALIBRATION ANALYSIS")
print("=" * 60)
print("\nWhat does a risk score of 7 actually mean?")
print()
print(" Risk Score 7: 'High risk — 55% chance of reoffending'")
print(" Actual rate (all): 55% ← looks correct overall")
print(" Actual rate (white): 62% ← underpredicts for white")
print(" Actual rate (Black): 38% ← OVERPREDICTS for Black")
print()
print(" A Black defendant scored 7 is LESS likely to reoffend")
print(" than a white defendant scored 7.")
print(" Yet both face the same bail and sentencing guidelines.")
The Ethical Dilemma: What Should Be Done?
James has identified five possible responses. Each raises its own ethical questions.
Option 1: Keep the Algorithm As-Is
Argument for: The algorithm is still more accurate than unassisted judicial decisions. Studies show judges are also biased — they give harsher sentences before lunch and after losses by local sports teams. At least the algorithm is consistent.
Argument against: "More accurate than biased humans" is a low bar. The algorithm systematically disadvantages Black defendants, and its use carries the authority of mathematics — making the bias harder to identify and challenge.
Option 2: Remove All Proxy Variables
Argument for: Removing zip code, neighborhood crime rate, and other proxy variables would reduce racial disparities.
Argument against: Removing these variables would also reduce overall accuracy, potentially leading to more total errors (both false positives and false negatives). And it might not eliminate the disparity entirely, because the remaining variables might still correlate with race.
Option 3: Calibrate Separately by Race
Argument for: If a score of 7 means different things for different racial groups, calibrate the model separately so that a 7 means the same thing regardless of race. This would equalize false positive rates.
Argument against: This explicitly uses race in the model, which raises its own legal and ethical issues. It also means that two defendants with identical backgrounds (same employment, same prior record) but different races could receive different risk scores — which violates the principle of treating like cases alike.
Option 4: Add Human Oversight
Argument for: Use the algorithm as one input among many, not as the sole determinant. A judge reviews the risk score alongside the full case file and can override the algorithm.
Argument against: Research shows that humans tend to defer to algorithmic recommendations (automation bias). If judges rarely override the algorithm, the human oversight is cosmetic. And if judges override it selectively — say, for defendants of their own race — the oversight could introduce bias rather than correct it.
Option 5: Abolish the Algorithm
Argument for: If the tool cannot be made fair, it should not be used. Liberty is too important to delegate to a statistical model.
Argument against: The alternative — unstructured judicial discretion — has its own documented biases. The question isn't "is the algorithm perfect?" but "is it better than the alternative?" Abolishing it without a replacement might make things worse.
The Fairness Impossibility
James has come to understand a result from computer science that makes this problem even harder: it is mathematically impossible to simultaneously satisfy all reasonable definitions of fairness.
Specifically, you cannot have a model that simultaneously achieves:
- Equal calibration — a score of 7 means the same probability of reoffending for all groups
- Equal false positive rates — the same proportion of non-reoffenders are wrongly classified as high risk across all groups
- Equal false negative rates — the same proportion of reoffenders are wrongly classified as low risk across all groups
When the base rates differ between groups (and they usually do, for reasons ranging from genuine behavioral differences to differential policing), you can optimize for at most two of these three. This isn't a technical limitation — it's a mathematical theorem (Chouldechova, 2017; Kleinberg et al., 2016).
This means that every algorithm encodes a value judgment about which type of fairness matters most. And that value judgment should be made by communities, not by engineers.
print("=" * 60)
print("THE FAIRNESS IMPOSSIBILITY")
print("=" * 60)
print()
print("You CANNOT simultaneously achieve:")
print()
print(" 1. Equal calibration")
print(" (same score = same probability, for all groups)")
print()
print(" 2. Equal false positive rates")
print(" (same proportion of 'innocent' people wrongly flagged)")
print()
print(" 3. Equal false negative rates")
print(" (same proportion of 'guilty' people wrongly missed)")
print()
print("When base rates differ between groups, optimizing for")
print("one type of fairness necessarily sacrifices another.")
print()
print("This is not a bug. It is a mathematical theorem.")
print()
print("IMPLICATION: Every algorithm embeds a value judgment.")
print("The question is: who gets to make that judgment?")
Structured Debate
Debate: Should Riverside County continue using the RiskAssess algorithm?
Position A: Modify and Continue The algorithm, despite its flaws, provides valuable information. Removing it would return to a system of unchecked judicial discretion that has its own biases. The right approach is to fix the calibration, add human oversight, and monitor for disparities.
Position B: Suspend Pending Reform The algorithm should be suspended until it can be demonstrated to achieve acceptable fairness across racial groups. During the suspension, judges would use structured decision frameworks without algorithmic input. The burden of proof should be on the tool to demonstrate fairness, not on defendants to prove unfairness.
Position C: Abolish Permanently Statistical models should not be used to make decisions about human liberty. The illusion of objectivity that algorithms provide makes their biases harder to detect and challenge than human biases. The criminal justice system should rely on human judgment with appropriate training and accountability.
Your task: Choose a position, write a 300-word argument, and then write a 150-word response to the strongest objection to your position.
Discussion Questions
- The training data problem. If the algorithm was trained on data from a period of racially biased policing, can it ever be "fixed" — or is the bias baked in? What would it take to create an unbiased training dataset for criminal justice prediction?
- The proxy variable problem. Is it ethical to use zip code as a predictor in a criminal justice algorithm, knowing that residential segregation means zip code correlates with race? What about employment history? Prior police contacts?
- The fairness impossibility. Given that you can't achieve all three forms of fairness simultaneously, which should Riverside County prioritize: equal calibration, equal false positive rates, or equal false negative rates? Who should make this decision?
- The counterfactual. The algorithm has a 44% false positive rate for Black defendants. What is the false positive rate for judges without the algorithm? If we don't know, can we fairly evaluate the algorithm?
- Community voice. Should the communities most affected by the algorithm — particularly Black communities in Riverside County — have a formal role in deciding whether and how the algorithm is used? What would that look like?
- James's responsibility. James has the data showing racial disparity. Does he have an obligation to advocate for a specific policy, or should he present the data and let policymakers decide? Can a researcher be "neutral" when the data shows injustice?
Key Takeaways
- Algorithmic decision-making in criminal justice concentrates the ethical stakes of every concept in this chapter: Simpson's paradox (aggregate accuracy hiding group-level unfairness), the ecological fallacy (group-level risk scores applied to individuals), proxy variables (race entering through the back door), and correlation vs. causation (confusing arrest rates with crime rates)
- The fairness impossibility theorem means every algorithm embeds a value judgment — there is no "neutral" model
- Training data reflects historical patterns, including historical injustices — models trained on biased data can perpetuate and amplify bias
- "More accurate than the alternative" is a necessary but insufficient criterion for ethical deployment
- The question "who decides what's fair?" is at least as important as the question "what's fair?"