Case Study 1: James and the COMPAS Algorithm — A Full Statistical Audit

The Setup

Professor James Washington has been building toward this moment for the entire textbook.

In Chapter 1, he was introduced as a criminal justice researcher concerned about predictive policing algorithms. In Chapter 9, he calculated the positive predictive value of a risk score system and found differential PPV by race. In Chapter 13, he set up a hypothesis test for differential false positive rates. In Chapter 16, he performed a formal two-proportion z-test comparing error rates across racial groups. In Chapter 25, he wrote a policy memo communicating his findings to the County Criminal Justice Reform Commission.

Now, in this chapter — the flagship AI chapter — James is going to perform a complete statistical audit of the COMPAS algorithm, bringing together everything he knows about sampling, Bayes' theorem, hypothesis testing, regression, and the critical evaluation framework we've built.

This case study is the most comprehensive integration exercise in the textbook. It ties together concepts from at least ten chapters.

Background: What Is COMPAS?

The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system assigns risk scores from 1 to 10 to criminal defendants, predicting the likelihood that they will reoffend within two years. The scores influence decisions about bail, pretrial detention, and sentencing at multiple points in the criminal justice system.

In 2016, ProPublica published an investigation titled "Machine Bias," analyzing COMPAS scores for over 7,000 defendants in Broward County, Florida. Their findings ignited a national debate about algorithmic fairness that continues today.

James has obtained the same publicly available dataset and is conducting his own independent analysis.

James's Statistical Audit: Step by Step

Step 1: Evaluate the Training Data (STATS Checklist — S and T)

James starts where any good statistician would: with the data.

Who created COMPAS? Northpointe (now Equivant), a private company. The proprietary nature of the algorithm means that the exact training data, features, and model architecture are not publicly available. James notes this immediately: transparency is limited.

What data did ProPublica use? ProPublica obtained COMPAS scores for 7,214 defendants arrested in Broward County, Florida, between 2013 and 2014. They linked these scores to criminal records over the following two years to determine who actually reoffended.

Is this sample representative? James considers the question carefully:

  • Broward County is one county in one state. The results may not generalize to other jurisdictions with different demographics, policing practices, or judicial cultures.
  • The dataset includes only defendants who were arrested — not the broader population. People who committed crimes but were never arrested are not in the data. People who were arrested but not charged may be included or excluded depending on data linkage.
  • "Reoffending" is measured as "being arrested again within two years." But arrest is not the same as committing a crime. If policing is more intensive in certain neighborhoods (which it is), then members of those communities are more likely to be re-arrested for the same behavior — inflating their apparent recidivism rate.

James's Note: "The outcome variable — re-arrest within two years — is itself contaminated by the same biases we're trying to measure. We're using a biased proxy (arrest) for the outcome we actually care about (reoffending). This is the same proxy variable problem as the healthcare algorithm from Section 26.5."

Step 2: Examine the Accuracy Metrics (STATS Checklist — A)

James computes the confusion matrix for the COMPAS algorithm, defining "high risk" as a score of 5 or above and "positive" as actual recidivism:

Overall Performance:

                      Actually Reoffended      Did Not Reoffend
Predicted High Risk   1,369 (True Positive)    1,282 (False Positive)
Predicted Low Risk      990 (False Negative)   2,681 (True Negative)

From this, James calculates:

  • Sensitivity (True Positive Rate): 1,369 / (1,369 + 990) = 58.0%
  • Specificity (True Negative Rate): 2,681 / (2,681 + 1,282) = 67.7%
  • Overall Accuracy: (1,369 + 2,681) / 6,322 = 64.1%
  • Positive Predictive Value (PPV): 1,369 / (1,369 + 1,282) = 51.6%
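These calculations are easy to reproduce. A minimal Python sketch using the counts from the table above (the variable names are ours, chosen for clarity):

```python
# Metrics from the overall COMPAS confusion matrix.
tp, fp, fn, tn = 1369, 1282, 990, 2681

sensitivity = tp / (tp + fn)              # true positive rate
specificity = tn / (tn + fp)              # true negative rate
accuracy    = (tp + tn) / (tp + fp + fn + tn)
ppv         = tp / (tp + fp)              # positive predictive value

print(f"Sensitivity: {sensitivity:.1%}")  # 58.0%
print(f"Specificity: {specificity:.1%}")  # 67.7%
print(f"Accuracy:    {accuracy:.1%}")     # 64.1%
print(f"PPV:         {ppv:.1%}")          # 51.6%
```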

"Sixty-four percent accuracy," James tells his seminar. "That means the algorithm is wrong more than one out of every three times. And its PPV of 51.6% means that among defendants labeled 'high risk,' barely more than half actually reoffended. That's close to a coin flip."

Step 3: Break Down by Race (The Core Finding)

James now splits the confusion matrix by race — the analysis that made ProPublica's investigation so consequential.

Black Defendants (n = 2,708, summing the four cells below):

                      Actually Reoffended   Did Not Reoffend
Predicted High Risk           805                  532
Predicted Low Risk            381                  990

  • Sensitivity: 805 / (805 + 381) = 67.9%
  • Specificity: 990 / (990 + 532) = 65.0%
  • False Positive Rate: 532 / (990 + 532) = 35.0%
  • PPV: 805 / (805 + 532) = 60.2%

White Defendants (n = 1,751, summing the four cells below):

                      Actually Reoffended   Did Not Reoffend
Predicted High Risk           349                  209
Predicted Low Risk            302                  891

  • Sensitivity: 349 / (349 + 302) = 53.6%
  • Specificity: 891 / (891 + 209) = 81.0%
  • False Positive Rate: 209 / (891 + 209) = 19.0%
  • PPV: 349 / (349 + 209) = 62.5%
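The per-group rates can be computed the same way. A small helper function (the names are illustrative, not from any particular library) makes the disparity explicit:

```python
# Group-level error rates from the race-specific confusion matrices above.
def rates(tp, fp, fn, tn):
    """Return (sensitivity, specificity, false positive rate, PPV)."""
    return (tp / (tp + fn), tn / (tn + fp), fp / (fp + tn), tp / (tp + fp))

black = rates(tp=805, fp=532, fn=381, tn=990)
white = rates(tp=349, fp=209, fn=302, tn=891)

fpr_black, fpr_white = black[2], white[2]
print(f"FPR (Black): {fpr_black:.1%}")        # 35.0%
print(f"FPR (white): {fpr_white:.1%}")        # 19.0%
print(f"Ratio: {fpr_black / fpr_white:.2f}")  # 1.84 -- "nearly twice"
```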

Step 4: Interpret the Disparities

James highlights the key findings:

The False Positive Rate Disparity (ProPublica's Focus):

Among defendants who did NOT reoffend:

  • 35.0% of Black defendants were falsely labeled high risk
  • 19.0% of white defendants were falsely labeled high risk

Black defendants who would not reoffend were nearly twice as likely to be incorrectly flagged as high risk. In human terms: to the extent that high-risk scores drive pretrial decisions, Black defendants who posed no actual risk were nearly twice as likely as comparable white defendants to be kept in jail awaiting trial.

The Predictive Value Comparison (Northpointe's Defense):

Among defendants labeled high risk:

  • 60.2% of Black defendants actually reoffended
  • 62.5% of white defendants actually reoffended

The PPV is approximately equal across groups. Northpointe argued this means the algorithm is "calibrated" — a high-risk score means roughly the same thing regardless of race.

Both are true. Both are incomplete.

"Here's the mathematical reality," James explains. "The base rate of recidivism differs between groups in this dataset. Approximately 51.4% of Black defendants were rearrested within two years, compared to 39.2% of white defendants. When base rates differ, a theorem by Chouldechova (2017) proves that no imperfectly accurate algorithm can simultaneously equalize false positive rates AND predictive values across groups. It's not a bug in COMPAS. It's a mathematical impossibility."
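The impossibility rests on an algebraic identity linking the four quantities: FPR = [p / (1 - p)] * [(1 - PPV) / PPV] * TPR, where p is the group's base rate. If two groups share TPR and PPV but differ in p, their FPRs must differ. A quick numerical check against the Step 3 tables (whose implied base rates differ somewhat from the dataset-wide figures James quotes):

```python
# Chouldechova's identity: FPR = p/(1-p) * (1-PPV)/PPV * TPR,
# where p is the group's base rate of recidivism.
def fpr_from_identity(p, tpr, ppv):
    return p / (1 - p) * (1 - ppv) / ppv * tpr

# Base rates implied by the race-specific confusion matrices above.
p_black = (805 + 381) / (805 + 532 + 381 + 990)
p_white = (349 + 302) / (349 + 209 + 302 + 891)

print(fpr_from_identity(p_black, tpr=805/1186, ppv=805/1337))  # ~0.350
print(fpr_from_identity(p_white, tpr=349/651,  ppv=349/558))   # ~0.190
```

The identity reproduces the false positive rates computed directly from the tables, which is exactly why equal PPV plus unequal base rates forces unequal FPR.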

Step 5: The Deeper Statistical Issue

James pushes the analysis further.

"The base rate difference itself needs scrutiny. Is it because Black defendants actually reoffend more? Or because they're policed more heavily and therefore re-arrested more? If a police department patrols Black neighborhoods more intensively — and the data says they do — then Black individuals are more likely to be caught for minor offenses that would go undetected in less-policed neighborhoods."

"This means the base rate difference is itself partly an artifact of biased policing. And COMPAS, trained on this data, learns and perpetuates this bias."

In statistical terms: the outcome variable (re-arrest) is confounded with the treatment variable (race/policing intensity). The algorithm can't distinguish between "this person is genuinely more likely to reoffend" and "this person is more likely to be caught reoffending because of where they live and how they're policed."
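A toy simulation makes the mechanism concrete. The numbers below are hypothetical, not estimates from any dataset: both groups share the same true reoffense rate, and differential detection alone produces different observed re-arrest rates:

```python
# Hypothetical simulation: identical true behavior, unequal policing.
import random

random.seed(0)
true_rate = 0.40                 # same underlying reoffense rate in both groups
p_detect = {"A": 0.9, "B": 0.5}  # group A is policed more intensively
n = 100_000

observed = {}
for group, detect in p_detect.items():
    # A re-arrest is recorded only if the person reoffends AND is caught.
    rearrests = sum(
        1 for _ in range(n)
        if random.random() < true_rate and random.random() < detect
    )
    observed[group] = rearrests / n
    print(f"Group {group}: observed re-arrest rate = {observed[group]:.1%}")
# Rates come out around 36% vs 20%, despite identical true rates.
```

An algorithm trained on `observed` re-arrests would score group A as higher risk even though, by construction, the two groups behave identically.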

Connection to Chapter 4: This is the confounding variable problem from Chapter 4 (Section 4.6), playing out with devastating real-world consequences. The confounding variable is policing intensity, which is associated with both race (through neighborhood-level policing decisions) and the outcome (re-arrest). Without controlling for policing intensity — and without the ability to randomly assign policing levels (which would be both impractical and unethical) — we cannot separate the effect of individual risk from the effect of systemic over-policing.

Step 6: What Should We Do?

James presents three options to his seminar:

Option A: Abandon algorithmic risk assessment entirely. Return to fully human decision-making. The concern: human judges have their own biases, which are less transparent and less auditable than algorithmic ones. At least with an algorithm, we can measure the disparity.

Option B: Use the algorithm but equalize false positive rates. Adjust the threshold scores differently for different racial groups so that the false positive rate is the same. The concern: this would require using race explicitly in the algorithm (race-conscious scoring), which raises legal and ethical issues. It would also mean that the predictive value of a "high risk" score would differ by race.
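Mechanically, Option B amounts to choosing a separate cutoff per group. A hypothetical sketch (the score lists are made up for illustration): for each group, find the smallest cutoff whose false positive rate among non-reoffenders is at or below a shared target:

```python
# Hypothetical sketch of Option B: per-group cutoffs that cap each
# group's false positive rate at the same target.
def cutoff_for_fpr(nonreoffender_scores, target_fpr):
    """Smallest cutoff t (scores 1-10) with FPR <= target among non-reoffenders."""
    n = len(nonreoffender_scores)
    for t in range(1, 11):
        fpr = sum(s >= t for s in nonreoffender_scores) / n
        if fpr <= target_fpr:
            return t
    return 11  # flag no one

# Made-up score distributions for non-reoffenders in two groups.
group_a = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]
group_b = [1, 1, 2, 2, 3, 3, 4, 5, 6, 7]

print(cutoff_for_fpr(group_a, 0.20))  # 8
print(cutoff_for_fpr(group_b, 0.20))  # 6
```

Equalizing FPR forces different cutoffs precisely because the score distributions differ, which is the race-conscious scoring the option's critics object to.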

Option C: Use the algorithm as one input among many, with mandatory human review and transparent override procedures. Keep the algorithm but require judges to document their reasoning and forbid automated decisions. The concern: research shows that decision-makers tend to anchor on algorithmic recommendations — even when told they're advisory.

"Notice," James says, "that none of these options is purely technical. The choice between them is an ethical and political choice about what kind of fairness we value most. Statistics can quantify the tradeoffs. It can't tell us which tradeoff is right."

Discussion Questions

  1. James found that re-arrest is a biased proxy for reoffending. What would be a better outcome variable? Is it possible to measure "actual reoffending" without the bias of differential policing?

  2. If you were a judge in Broward County, how would you use a COMPAS score after reading James's analysis? Would you use it at all?

  3. The Chouldechova impossibility result means that no algorithm can satisfy all fairness criteria simultaneously. Does this mean we should give up on algorithmic fairness, or does it mean we need to be more explicit about which type of fairness we're prioritizing?

  4. James notes that human judges also have biases. Is a biased algorithm preferable to a biased human? Under what conditions? (Consider: measurability, consistency, accountability, and the ability to audit.)

  5. This case study connects concepts from at least eight earlier chapters (Chapters 1, 4, 9, 13, 16, 17, 22, 25). Choose two of these connections and explain how the earlier concept directly applies to the COMPAS analysis.

Key Statistical Concepts Applied

Concept                                  Chapter of Origin   Application in COMPAS Audit
Sampling bias                            Ch. 4               Training data reflects historical policing patterns
Confounding variable                     Ch. 4               Policing intensity confounds race and re-arrest
Conditional probability                  Ch. 9               P(reoffend | high risk) vs. P(high risk | did not reoffend)
Bayes' theorem / PPV                     Ch. 9               PPV depends on base rate of recidivism
Hypothesis testing                       Ch. 13              Testing whether error rates differ by race
Two-proportion z-test                    Ch. 16              Comparing false positive rates across groups
Statistical vs. practical significance   Ch. 17              Disparity size matters, not just p-value
Proxy variables                          Ch. 22              Re-arrest as proxy for reoffending; cost as proxy for need
Communication                            Ch. 25              Presenting findings to policymakers
STATS checklist                          Ch. 26              Systematic evaluation framework
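The Bayes' theorem row can be made concrete: PPV is fully determined by sensitivity, specificity, and the base rate. Using the overall figures from Step 2:

```python
# Bayes' theorem recovers PPV from sensitivity, specificity, and base rate.
sens = 1369 / 2359   # P(high risk | reoffended)
spec = 2681 / 3963   # P(low risk | did not reoffend)
base = 2359 / 6322   # P(reoffend) ~= 37.3%

ppv = sens * base / (sens * base + (1 - spec) * (1 - base))
print(f"PPV: {ppv:.1%}")  # 51.6%, matching the direct count 1,369 / 2,651
```

Rerunning this with a lower base rate shows why the same algorithm applied in a jurisdiction with less recidivism would label even more innocent people "high risk" for every true positive.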