Key Takeaways: Statistics and AI: Being a Critical Consumer of Data

One-Sentence Summary

AI and machine learning systems are built on the same statistical foundations you've learned throughout this course — training data is a sample, overfitting is a model complexity problem, algorithmic bias is a sampling and confounding problem, and the STATS checklist (Source, Training data, Accuracy metrics, Testing, Significance and Size) gives you a practical framework for critically evaluating any AI or data-driven claim you encounter.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
| --- | --- | --- |
| Machine learning as applied statistics | ML algorithms learn patterns from data using regression, classification, and clustering — all statistical techniques | Understanding this demystifies AI and gives you the tools to evaluate it |
| Training data as a sample | The data used to train an AI system functions as a sample, subject to all forms of sampling bias | Biased training data produces biased AI, regardless of algorithm sophistication |
| Overfitting | A model captures noise rather than signal, performing well on training data but poorly on new data | The gap between training and test performance is the key diagnostic |
| Algorithmic bias | Systematic unfairness arising from biased data, biased labels, or proxy variables | AI can encode, automate, and scale human prejudice |
| Prediction vs. inference | Prediction asks "what will happen?"; inference asks "why?" | ML excels at prediction but often can't explain causal mechanisms |

AI/ML as Applied Statistics

| ML Technique | Statistical Equivalent | Chapter Reference |
| --- | --- | --- |
| Supervised learning (regression) | Linear/multiple regression | Ch.22, Ch.23 |
| Supervised learning (classification) | Logistic regression, Naive Bayes | Ch.24, Ch.9 |
| Unsupervised learning (clustering) | Finding natural groupings in scatterplots | Ch.5 |
| Feature engineering | Creating new variables from existing ones | Ch.7 |
| Training/test split | Cross-validation, replication | Ch.13, Ch.17 |
| Regularization | Parsimony, adjusted $R^2$ | Ch.23 |
| Collaborative filtering | Nearest-neighbor regression/prediction | Ch.22 |

Training Data Bias Checklist

| Bias Type (Ch.4) | AI/ML Equivalent | Red Flag Question |
| --- | --- | --- |
| Selection bias | Non-representative training data | "Does the training data look like the population the system will serve?" |
| Nonresponse bias | Missing populations in data | "Who is absent from this dataset?" |
| Survivorship bias | Training only on outcomes that were observed | "Are we only seeing the 'winners'?" |
| Convenience sample | Using easily available data | "Was this data chosen because it was easy to get, or because it's representative?" |
| Response/measurement bias | Labeling errors, proxy variables | "Are the labels accurate? Is the measured outcome what we actually care about?" |

Overfitting Summary

| Indicator | Meaning |
| --- | --- |
| Training $R^2$ much higher than test $R^2$ | Model memorized training data noise |
| Many features relative to observations | High risk of fitting noise |
| Perfect fit on training data | Almost certainly overfit |
| Performance degrades on new data | Model didn't learn generalizable patterns |

Solutions: held-out test sets, cross-validation, regularization, reducing the number of features, and collecting more data.
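The training-vs-test gap in the table above is easy to demonstrate with a toy simulation. This is a minimal sketch with made-up data (a linear relationship plus noise): a degree-5 polynomial forced through all six training points fits them perfectly but fails badly on fresh data, while a simple least-squares line generalizes far better.

```python
import random

random.seed(0)

# Hypothetical data: true relationship y = 2x + noise
def make_data(n):
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 3)) for x in xs]

train = make_data(6)
test = make_data(100)

# Overfit model: the unique degree-5 polynomial through all 6 training
# points (Lagrange interpolation), i.e. a perfect fit on training data
def interpolate(points, x):
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Simple model: ordinary least-squares line
def fit_line(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x, _ in points)
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

poly = lambda x: interpolate(train, x)
line = fit_line(train)

print(f"overfit polynomial  train MSE: {mse(poly, train):.4f}  test MSE: {mse(poly, test):.1f}")
print(f"simple line         train MSE: {mse(line, train):.4f}  test MSE: {mse(line, test):.1f}")
```

The polynomial's training error is exactly zero ("perfect fit on training data"), yet its test error dwarfs the line's: the gap between training and test performance, not training performance alone, is the diagnostic.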

Algorithmic Bias: Three Landmark Cases

| Case | What Happened | Root Statistical Cause |
| --- | --- | --- |
| Amazon hiring algorithm | Penalized resumes containing the word "women's" and downgraded graduates of women's colleges | Confounding: gender correlated with hiring outcomes in biased historical data |
| Healthcare algorithm | Under-identified Black patients' needs | Proxy variable: used cost (biased by access barriers) as a proxy for need |
| COMPAS recidivism | Black defendants nearly twice as likely to be falsely labeled high risk | Base rate differences + measurement bias (re-arrest as a proxy for reoffending) |

COMPAS Fairness Framework

| Fairness Criterion | Who Championed It | What It Means |
| --- | --- | --- |
| Equal false positive rates | ProPublica | Among those who won't reoffend, the same proportion should be falsely flagged across groups |
| Equal predictive values | Northpointe | Among those flagged high risk, the same proportion should actually reoffend across groups |
| Mathematical impossibility | Chouldechova (2017) | When base rates differ, both criteria cannot be satisfied simultaneously |
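The impossibility result can be illustrated with a small calculation (all numbers here are hypothetical, chosen for round arithmetic): give a classifier identical error rates for two groups, and differing base rates force the predictive values apart.

```python
# Hypothetical classifier with the SAME true positive rate and false
# positive rate for both groups -- only the base rates differ
def ppv(base_rate, tpr, fpr):
    # P(actually reoffends | flagged high risk), by Bayes' theorem
    flagged_true = tpr * base_rate
    flagged_false = fpr * (1 - base_rate)
    return flagged_true / (flagged_true + flagged_false)

tpr, fpr = 0.70, 0.30          # identical error rates across groups
ppv_a = ppv(0.50, tpr, fpr)    # group A: 50% base rate
ppv_b = ppv(0.30, tpr, fpr)    # group B: 30% base rate

print(f"PPV group A: {ppv_a:.2f}")  # 0.70
print(f"PPV group B: {ppv_b:.2f}")  # 0.50
```

Equalizing the error rates (ProPublica's criterion) made the predictive values unequal (Northpointe's criterion), and vice versa; no threshold choice escapes this when base rates differ.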

Implication: Choosing a definition of fairness is an ethical decision, not a technical one.

Big Data Fallacies

| Fallacy | Reality |
| --- | --- |
| "More data = more accurate" | Biased big data is a precisely estimated wrong answer |
| "More variables = more signal" | More variables = more spurious correlations (the multiple comparisons problem) |
| "The pattern speaks for itself" | Correlations require interpretation; they can be spurious, confounded, or unstable |
| "It's statistically significant, so it must matter" | With big data, even trivially small effects are significant; practical significance is what counts |
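The multiple comparisons problem is easy to simulate. In this sketch (the sample sizes and seed are arbitrary), 50 variables of pure random noise still produce dozens of "statistically significant" pairwise correlations at the conventional p < .05 threshold.

```python
import random
from itertools import combinations

random.seed(42)

# 50 variables of pure noise, 30 observations each: no real relationships exist
n_obs, n_vars = 30, 50
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# |r| > 0.36 is roughly the two-sided p < .05 cutoff for n = 30
n_pairs = n_vars * (n_vars - 1) // 2
spurious = sum(1 for a, b in combinations(data, 2) if abs(corr(a, b)) > 0.36)
print(f"{n_pairs} pairs tested, {spurious} 'significant' correlations in pure noise")
```

With 1,225 pairwise tests, roughly 5% come out "significant" by chance alone; a big-data pattern found by searching many variables needs exactly this kind of skepticism.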

Prediction vs. Inference

| Question Type | Goal | Best Tool | Limitation |
| --- | --- | --- | --- |
| "What will happen?" | Prediction | Machine learning | Can't explain why |
| "Why does it happen?" | Inference | Statistical modeling | May be less accurate for prediction |
| "What should we do?" | Causal inference | Experiments (Ch.4) | Requires controlled manipulation |

The STATS Checklist (New Technique)

S — Source: Who made this claim? What are their incentives? Is this peer-reviewed, commercial, or social media?

T — Training Data (or Sample): What data was used? Is it representative? Who's included/excluded?

A — Accuracy Metrics: How is performance measured? Are the right metrics used? Is accuracy misleading due to base rates?

T — Testing: Was the model validated on new data? Is there independent replication?

S — Significance and Size: Is the effect statistically significant? Is it practically meaningful? Is the effect size reported?
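The base-rate caution in the A step can be made concrete with Bayes' theorem (Ch.9). In this hypothetical scenario, a "95% accurate" classifier (sensitivity = specificity = 0.95) screens for something with a 1% base rate, and most of its positive flags turn out to be false alarms:

```python
# Hypothetical screening scenario: "95% accurate" sounds reliable,
# but the positive predictive value depends on the base rate
sensitivity, specificity, prevalence = 0.95, 0.95, 0.01

true_pos = sensitivity * prevalence                # correctly flagged
false_pos = (1 - specificity) * (1 - prevalence)   # falsely flagged
ppv = true_pos / (true_pos + false_pos)            # Bayes' theorem (Ch.9)

print(f"P(actually positive | flagged positive) = {ppv:.1%}")  # 16.1%
```

An advertised accuracy of 95% is compatible with five out of six positive flags being wrong, which is why the checklist asks for sensitivity, specificity, and PPV rather than a single accuracy figure.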

LLMs: What You Need to Know

| Property | Statistical Explanation |
| --- | --- |
| Sound confident | Authoritative language patterns are common in training data |
| Hallucinate | Generate statistically likely text, not verified facts |
| Reflect biases | Absorb biases present in training text |
| Good at patterns | Trained on statistical patterns across enormous text corpora |
| Not truth engines | Predict likely text, not accurate text |

Common Mistakes

| Mistake | Correction |
| --- | --- |
| "The AI is 95% accurate, so it's reliable" | Accuracy depends on base rates; ask for sensitivity, specificity, and PPV |
| "It was trained on millions of examples" | Sample size doesn't fix sample bias |
| "The algorithm is objective" | Algorithms encode the biases of their training data |
| "The AI found a pattern, so it must be real" | Could be overfitting, spurious correlation, or confounding |
| "More data is always better" | More biased data is a more precise wrong answer |
| "The AI proves X causes Y" | AI finds correlations, not causes; inference requires experiments |

Connections

| Connection | Details |
| --- | --- |
| Ch.4 (Sampling bias) | Training data IS a sample; all sampling biases apply |
| Ch.9 (Bayes' theorem) | PPV calculations for AI classifiers use the same Bayesian logic |
| Ch.13 (Hypothesis testing) | Testing whether algorithmic outcomes differ across groups |
| Ch.17 (Effect sizes) | Big data makes everything significant; practical significance matters more |
| Ch.22 (Regression) | ML regression is statistical regression at scale |
| Ch.25 (Communication) | The misleading techniques you learned to avoid as a producer, you now detect as a consumer |
| Ch.27 (Ethics) | Algorithmic bias flows into the broader ethics of data practice |