Case Study 2: James's Algorithm Risk Scores vs. Actual Recidivism

The Setup

Professor James Washington has been studying the predictive policing algorithm used in his county's criminal justice system since Chapter 1. In Chapter 16, he compared false positive rates across racial groups. In Chapter 19, he tested whether bail decisions were independent of race. In Chapter 20, he analyzed whether algorithm performance varied across defendant demographic groups.

Now he's asking the most fundamental question of all: Does the algorithm's risk score actually predict what it claims to predict?

The algorithm assigns each defendant a risk score from 1 (low risk) to 10 (high risk), intended to predict the likelihood of reoffending within two years. The county uses these scores to inform pretrial detention, bail, and sentencing decisions. A score of 7 or higher is flagged as "high risk," which can mean the difference between going home and going to jail.

James has two years of follow-up data: the risk score assigned at the time of arrest, and whether the defendant actually reoffended within two years. He can now evaluate the algorithm's predictive accuracy using the regression tools from this chapter.

But the technical question — "Does the model predict accurately?" — is inseparable from the ethical question: "Does the model predict accurately for everyone?"

The Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(2026)

# ============================================================
# JAMES'S ANALYSIS: RISK SCORE vs. ACTUAL RECIDIVISM
# Complete evaluation with racial disparity analysis
# ============================================================

# Aggregate data: for each risk score bin, what percentage
# actually reoffended within 2 years?
overall = pd.DataFrame({
    'risk_score': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'actual_recid_pct': [8, 12, 18, 24, 30, 38, 45, 55, 62, 72],
    'n_defendants': [450, 380, 320, 290, 260, 220, 180, 150, 120, 80]
})

# Same data disaggregated by race
by_race = pd.DataFrame({
    'risk_score': [1,2,3,4,5,6,7,8,9,10] * 2,
    'race': ['White']*10 + ['Black']*10,
    'actual_recid_pct': [
        # White defendants
        10, 15, 20, 27, 33, 40, 48, 57, 65, 75,
        # Black defendants
        6, 9, 15, 20, 26, 34, 41, 52, 58, 68
    ],
    'n_defendants': [
        # White
        250, 200, 170, 150, 130, 100, 75, 55, 35, 20,
        # Black
        200, 180, 150, 140, 130, 120, 105, 95, 85, 60
    ]
})

print("=" * 70)
print("JAMES'S ANALYSIS: ALGORITHM RISK SCORE vs. ACTUAL RECIDIVISM")
print("=" * 70)

# ---- PHASE 1: OVERALL CALIBRATION ----
print("\n" + "-" * 70)
print("PHASE 1: OVERALL CALIBRATION")
print("-" * 70)

# Correlation and regression
r_overall, p_overall = stats.pearsonr(overall['risk_score'],
                                       overall['actual_recid_pct'])
result_overall = stats.linregress(overall['risk_score'],
                                   overall['actual_recid_pct'])

print(f"\nOverall correlation: r = {r_overall:.4f}")
print(f"R² = {r_overall**2:.4f}")
print(f"p-value = {p_overall:.2e}")
print(f"\nRegression: Actual Recidivism% = "
      f"{result_overall.intercept:.2f} + "
      f"{result_overall.slope:.2f} × Risk Score")
print(f"\nInterpretation: For each 1-point increase in risk score,")
print(f"the actual recidivism rate increases by "
      f"{result_overall.slope:.1f} percentage points.")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Calibration plot
axes[0].scatter(overall['risk_score'], overall['actual_recid_pct'],
                s=overall['n_defendants']/3,
                color='steelblue', edgecolors='navy', alpha=0.8,
                zorder=5)
x_line = np.linspace(0.5, 10.5, 100)
axes[0].plot(x_line,
             result_overall.intercept + result_overall.slope * x_line,
             'r-', linewidth=2)
# Perfect-calibration reference: score s corresponds to 10*s percent
axes[0].plot([1, 10], [10, 100], 'k--', alpha=0.3,
             label='Perfect calibration')
axes[0].set_xlabel('Algorithm Risk Score (1-10)', fontsize=12)
axes[0].set_ylabel('Actual Recidivism Rate (%)', fontsize=12)
axes[0].set_title(f'Overall Calibration\n'
                  f'(r = {r_overall:.3f}, R² = {r_overall**2:.3f})',
                  fontsize=13)
axes[0].legend()
axes[0].set_xlim(0, 11)
axes[0].set_ylim(0, 85)
axes[0].grid(True, alpha=0.3)
axes[0].text(2, 70, 'Bubble size =\nsample size',
             fontsize=9, fontstyle='italic')

# Residual plot
predicted = (result_overall.intercept +
             result_overall.slope * overall['risk_score'])
residuals = overall['actual_recid_pct'] - predicted
axes[1].scatter(predicted, residuals, color='steelblue',
                edgecolors='navy', s=80, zorder=5)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[1].set_xlabel('Predicted Recidivism Rate')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# ---- PHASE 2: DISAGGREGATED ANALYSIS ----
print("\n" + "-" * 70)
print("PHASE 2: DISAGGREGATED BY RACE")
print("-" * 70)

white = by_race[by_race['race'] == 'White']
black = by_race[by_race['race'] == 'Black']

# Separate regressions
r_w, p_w = stats.pearsonr(white['risk_score'],
                           white['actual_recid_pct'])
result_w = stats.linregress(white['risk_score'],
                             white['actual_recid_pct'])

r_b, p_b = stats.pearsonr(black['risk_score'],
                           black['actual_recid_pct'])
result_b = stats.linregress(black['risk_score'],
                             black['actual_recid_pct'])

print(f"\nWhite defendants:")
print(f"  Regression: Recidivism% = {result_w.intercept:.2f} + "
      f"{result_w.slope:.2f} × Risk Score")
print(f"  r = {r_w:.4f}, R² = {r_w**2:.4f}")

print(f"\nBlack defendants:")
print(f"  Regression: Recidivism% = {result_b.intercept:.2f} + "
      f"{result_b.slope:.2f} × Risk Score")
print(f"  r = {r_b:.4f}, R² = {r_b**2:.4f}")

# The disparity table
print(f"\n{'Risk':>6}  {'White':>8}  {'Black':>8}  {'Difference':>12}")
print(f"{'Score':>6}  {'Recid%':>8}  {'Recid%':>8}  {'(W - B)':>12}")
print("-" * 42)
for _, row_w in white.iterrows():
    row_b = black[black['risk_score'] == row_w['risk_score']].iloc[0]
    diff = row_w['actual_recid_pct'] - row_b['actual_recid_pct']
    print(f"{row_w['risk_score']:>6}  "
          f"{row_w['actual_recid_pct']:>7.0f}%  "
          f"{row_b['actual_recid_pct']:>7.0f}%  "
          f"{diff:>+11.0f}pp")

avg_diff = np.mean(white['actual_recid_pct'].values -
                    black['actual_recid_pct'].values)
print(f"\nAverage disparity: {avg_diff:.1f} percentage points")

# Visualization: Side-by-side regression lines
fig, ax = plt.subplots(figsize=(10, 7))

ax.scatter(white['risk_score'], white['actual_recid_pct'],
           s=white['n_defendants']/2, color='#2196F3',
           edgecolors='navy', alpha=0.8, zorder=5, label='White')
ax.scatter(black['risk_score'], black['actual_recid_pct'],
           s=black['n_defendants']/2, color='#FF5722',
           edgecolors='darkred', alpha=0.8, zorder=5, label='Black')

x_line = np.linspace(0.5, 10.5, 100)
ax.plot(x_line, result_w.intercept + result_w.slope * x_line,
        '#2196F3', linewidth=2, linestyle='-')
ax.plot(x_line, result_b.intercept + result_b.slope * x_line,
        '#FF5722', linewidth=2, linestyle='-')

# Highlight the gap at risk score = 5
pred_w_5 = result_w.intercept + result_w.slope * 5
pred_b_5 = result_b.intercept + result_b.slope * 5
ax.annotate('', xy=(5, pred_w_5), xytext=(5, pred_b_5),
            arrowprops=dict(arrowstyle='<->', color='black', lw=2))
ax.text(5.3, (pred_w_5 + pred_b_5)/2,
        f'Gap: {pred_w_5 - pred_b_5:.0f}pp',
        fontsize=11, fontweight='bold')

ax.set_xlabel('Algorithm Risk Score (1-10)', fontsize=12)
ax.set_ylabel('Actual Recidivism Rate (%)', fontsize=12)
ax.set_title('Differential Calibration: Same Score, Different Meaning',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=12, title='Race', title_fontsize=12)
ax.set_xlim(0, 11)
ax.set_ylim(0, 85)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# ---- PHASE 3: IMPACT ANALYSIS ----
print("\n" + "-" * 70)
print("PHASE 3: IMPACT ANALYSIS")
print("-" * 70)

# At the "high risk" threshold of 7+:
white_high = white[white['risk_score'] >= 7]
black_high = black[black['risk_score'] >= 7]

n_white_high = white_high['n_defendants'].sum()
n_black_high = black_high['n_defendants'].sum()

# False positive rate for "high risk" threshold
# False positive = flagged as high risk but did NOT reoffend
white_fp = np.sum(white_high['n_defendants'] *
                   (100 - white_high['actual_recid_pct']) / 100)
black_fp = np.sum(black_high['n_defendants'] *
                   (100 - black_high['actual_recid_pct']) / 100)

white_fp_rate = white_fp / n_white_high * 100
black_fp_rate = black_fp / n_black_high * 100

print(f"\nAt 'High Risk' threshold (score >= 7):")
print(f"  White defendants flagged:  {n_white_high}")
print(f"  Black defendants flagged:  {n_black_high}")
print(f"\n  White false positive rate: {white_fp_rate:.1f}%")
print(f"  Black false positive rate: {black_fp_rate:.1f}%")
print(f"  Ratio: {black_fp_rate/white_fp_rate:.2f}x")

# Total excess false positives
excess_fp = black_fp - (n_black_high * white_fp_rate / 100)
print(f"\n  Excess false positives for Black defendants:")
print(f"  {excess_fp:.0f} people detained who would not have been")
print(f"  detained if they were white with the same risk score.")

The Analysis

Phase 1: The Algorithm "Works"

At first glance, the algorithm appears to perform well. The correlation between risk score and actual recidivism is extremely strong ($r \approx 0.99$, $R^2 \approx 0.99$). Higher risk scores are associated with dramatically higher actual recidivism rates, exactly as intended.

The regression line shows that each 1-point increase in risk score is associated with approximately 7 additional percentage points of actual recidivism. A defendant with a risk score of 1 has about an 8% chance of reoffending; a defendant with a score of 10 has about a 72% chance.
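The fitted line can be reproduced directly from the aggregate table in the listing; a minimal sketch that refits it and evaluates the prediction at a few scores:

```python
from scipy import stats

# Aggregate calibration data from James's analysis
risk_score = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
actual_recid_pct = [8, 12, 18, 24, 30, 38, 45, 55, 62, 72]

fit = stats.linregress(risk_score, actual_recid_pct)
# slope ~= 7.2: each 1-point score increase adds about 7 percentage points
for s in (1, 5, 10):
    pred = fit.intercept + fit.slope * s
    print(f"score {s:>2}: predicted recidivism = {pred:.0f}%")
```

Note that the line slightly under-predicts the extremes (about 4% at score 1 versus the observed 8%), a reminder that the fitted line smooths the binned data; the bins, not the line, are the observations.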

If you stopped at the overall analysis, you'd conclude: "The algorithm works. The risk scores are well-calibrated. Higher scores predict higher recidivism."

Phase 2: The Algorithm Doesn't Work Equally

When James disaggregates the data by race, a different picture emerges.

At every risk score level, white defendants have higher actual recidivism rates than Black defendants. The gap averages about 6 percentage points:

  • At risk score 5: white defendants reoffend at 33%, Black at 26%.
  • At risk score 7: white defendants reoffend at 48%, Black at 41%.
  • At risk score 9: white defendants reoffend at 65%, Black at 58%.

What this means in practice: A Black defendant assigned a risk score of 7 is less dangerous than a white defendant assigned the same score of 7. But both are treated identically by the system — both are flagged as "high risk," potentially denied bail, and recommended for longer sentences.

The algorithm over-predicts risk for Black defendants relative to their actual behavior. A Black defendant with a 26% chance of reoffending is given the same risk score as a white defendant with a 33% chance.
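One way to quantify "same score, different meaning" is to invert the two race-specific regression lines: given a Black defendant's score, find the white score that predicts the same actual recidivism rate. A sketch, using the disaggregated data from the listing:

```python
from scipy import stats

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
white_pct = [10, 15, 20, 27, 33, 40, 48, 57, 65, 75]
black_pct = [6, 9, 15, 20, 26, 34, 41, 52, 58, 68]

fit_w = stats.linregress(scores, white_pct)
fit_b = stats.linregress(scores, black_pct)

# Predicted actual-recidivism rate for a Black defendant scored 7
rate_b7 = fit_b.intercept + fit_b.slope * 7
# White score whose predicted rate matches it (invert the white line)
equiv_w = (rate_b7 - fit_w.intercept) / fit_w.slope
print(f"Black score 7 = {rate_b7:.0f}% actual risk "
      f"= white score {equiv_w:.1f}")
```

In other words, a Black defendant flagged "high risk" at score 7 carries roughly the actual risk of a white defendant scored 6, who goes unflagged.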

Phase 3: The Human Cost

At the "high risk" threshold (score >= 7), the differential calibration creates concrete harm:

  • Black defendants flagged as "high risk" have a higher false positive rate than white defendants at the same scores (about 47% vs. 43% in this data).
  • Applied to the 345 Black defendants flagged, that gap translates to roughly 13 excess false positives in this two-year cohort: Black individuals detained before trial who would not have been detained if they were white with identical risk scores.

Each of those false positives is a person who:

  • Sits in jail awaiting trial, potentially for months
  • May lose their job, their housing, or custody of their children
  • Is found not guilty or does not reoffend — the system was wrong about them
  • Suffers consequences that disproportionately fall on one racial group

The Regression to the Mean Connection

James notices something else. The algorithm was trained on historical data, and historical arrest rates are themselves biased — Black communities have been subject to more aggressive policing (more patrols, more stops, more arrests for the same behavior). An individual's "prior arrests" input to the algorithm reflects not just their behavior but the policing environment they lived in.

This is regression to the mean in a social context. If the training data is extreme (biased by over-policing), the algorithm's predictions are extreme. But actual recidivism — measured by any rearrest, regardless of the policing environment — regresses toward a less extreme reality. The algorithm's predictions are more extreme than the outcomes they predict, specifically for groups that experienced more extreme (biased) policing in the training data.
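The mechanism can be made concrete with a toy simulation (invented numbers, not James's data): give two groups the same true reoffense propensity, but record one group's outcomes through an over-policing lens that inflates its observed rate. A model calibrated on the recorded labels then over-predicts that group's actual behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_p = 0.30                      # same true propensity for both groups

# True behavior (identical for both groups by construction)
actual = rng.random(n) < true_p
# Group B's labels as recorded under over-policing: 30% inflation
recorded_b = rng.random(n) < true_p * 1.3

print(f"actual rate (both groups):        {actual.mean():.1%}")
print(f"recorded rate used to train on B: {recorded_b.mean():.1%}")
# A model fit to the recorded labels treats Group B as roughly 39% risk,
# about 9 points above its actual behavior -- bias in, bias out.
```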

James's Recommendation

"The overall calibration of the algorithm is excellent — $R^2 = 0.99$ — but this masks a critical problem. The same risk score predicts different outcomes for different racial groups. A risk score of 5 means a 33% recidivism rate for white defendants but only 26% for Black defendants.

"This differential calibration is not a quirk of the data. It's a predictable consequence of training on historically biased arrest data. The algorithm has learned that being Black is associated with higher arrest rates, and it uses race-correlated variables (neighborhood, employment stability, prior arrests in over-policed communities) as proxies.

"My recommendation: 1. The county should not use a single threshold (score >= 7) for all defendants. At minimum, calibration should be race-specific. 2. Better: commission an independent audit to identify which input variables are encoding racial bias and adjust the model accordingly. 3. Best: consider whether algorithmic risk assessment should play such a decisive role in pretrial detention decisions at all."

Theme 3 — AI and Algorithms Use Statistics:

Every algorithmic risk assessment tool — from criminal justice to credit scoring to medical triage — is a regression model at its core. It takes input variables, assigns weights (regression coefficients), and produces a predicted score. The statistics you learned in this chapter — correlation, regression lines, $R^2$, residual analysis — are the tools for evaluating whether these algorithms work.
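Schematically (the variable names and weights below are invented for illustration, not the county's actual model), such a tool is nothing more than a weighted sum of inputs clamped to a score range:

```python
def risk_score(prior_arrests: int, age: int, employment_years: float) -> int:
    """Toy linear scoring model: weighted inputs, clamped to 1-10.
    The weights play exactly the role of regression coefficients."""
    raw = 1.0 + 0.6 * prior_arrests - 0.05 * age - 0.3 * employment_years
    return max(1, min(10, round(raw)))

# Two hypothetical defendants differing only in prior arrests
print(risk_score(prior_arrests=2, age=30, employment_years=3))  # lower score
print(risk_score(prior_arrests=9, age=30, employment_years=3))  # higher score
```

Everything James did to the county's algorithm — correlation, residual analysis, disaggregation — applies to any model of this shape, whatever its inputs.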

But "works" has at least two meanings. "Works" in the overall-$R^2$-is-high sense is easy to demonstrate. "Works" in the works-fairly-for-everyone sense requires disaggregated analysis. James's case shows that an algorithm can be simultaneously accurate (high overall $R^2$) and unfair (differential calibration by race). Understanding this distinction is increasingly important as algorithmic decision-making spreads into healthcare, education, hiring, and lending.

Theme 6 — Ethical Data Practice:

James's analysis raises a fundamental question: Should we use regression models to make decisions about people's freedom?

The algorithm was built to be objective — to replace human judgment, which is known to be biased. And in some ways, it succeeds: it's more consistent than individual judges, and its overall predictions are accurate. But it inherits and amplifies the biases in its training data, producing disparate impacts that are hidden behind the veneer of mathematical objectivity.

The ethical obligation doesn't end when you fit the model and check the $R^2$. It requires asking: Who is affected by this model's errors? Are the errors distributed equitably? And what happens to the people the model gets wrong?

Discussion Questions

  1. James's overall model has $R^2 = 0.99$. Does this high $R^2$ prove the algorithm is fair? What additional analyses are needed beyond the overall fit?

  2. The algorithm was trained on arrest data, which reflects policing patterns. How does this create a feedback loop? (Hint: more policing in certain neighborhoods → more arrests → higher training data risk → algorithm assigns higher scores → justifies more policing.)

  3. One proposed solution is to remove race from the algorithm's inputs. Why might this be insufficient? (Hint: what about variables correlated with race, like neighborhood and zip code?)

  4. Compare the regression line for white defendants and Black defendants. If the county wanted to use a single threshold for "high risk," what would a fair threshold look like for each group?

  5. James found that the algorithm predicts well overall but poorly for subgroups. This is related to Simpson's Paradox (which you'll encounter in Chapter 27). How is this case similar to Simpson's Paradox?

  6. Connect this case to the regression-to-the-mean concept from Section 22.9. How does the historical bias in training data relate to the idea that extreme observations (over-policing-driven arrest rates) don't perfectly predict future outcomes?