Key Takeaways: Statistics and AI: Being a Critical Consumer of Data

One-Sentence Summary

AI and machine learning systems are built on the same statistical foundations you've learned throughout this course — training data is a sample, overfitting is a model complexity problem, algorithmic bias is a sampling and confounding problem, and the STATS checklist (Source, Training data, Accuracy metrics, Testing, Significance and Size) gives you a practical framework for critically evaluating any AI or data-driven claim you encounter.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
| --- | --- | --- |
| Machine learning as applied statistics | ML algorithms learn patterns from data using regression, classification, and clustering — all statistical techniques | Understanding this demystifies AI and gives you the tools to evaluate it |
| Training data as a sample | The data used to train an AI system functions as a sample, subject to all forms of sampling bias | Biased training data produces biased AI, regardless of algorithm sophistication |
| Overfitting | A model captures noise rather than signal, performing well on training data but poorly on new data | The gap between training and test performance is the key diagnostic |
| Algorithmic bias | Systematic unfairness arising from biased data, biased labels, or proxy variables | AI can encode, automate, and scale human prejudice |
| Prediction vs. inference | Prediction asks "what will happen?"; inference asks "why?" | ML excels at prediction but often can't explain causal mechanisms |

AI/ML as Applied Statistics

| ML Technique | Statistical Equivalent | Chapter Reference |
| --- | --- | --- |
| Supervised learning (regression) | Linear/multiple regression | Ch.22, Ch.23 |
| Supervised learning (classification) | Logistic regression, Naive Bayes | Ch.24, Ch.9 |
| Unsupervised learning (clustering) | Finding natural groupings in scatterplots | Ch.5 |
| Feature engineering | Creating new variables from existing ones | Ch.7 |
| Training/test split | Cross-validation, replication | Ch.13, Ch.17 |
| Regularization | Parsimony, adjusted $R^2$ | Ch.23 |
| Collaborative filtering | Nearest-neighbor regression/prediction | Ch.22 |

Training Data Bias Checklist

| Bias Type (Ch.4) | AI/ML Equivalent | Red Flag Question |
| --- | --- | --- |
| Selection bias | Non-representative training data | "Does the training data look like the population the system will serve?" |
| Nonresponse bias | Missing populations in data | "Who is absent from this dataset?" |
| Survivorship bias | Training only on outcomes that were observed | "Are we only seeing the 'winners'?" |
| Convenience sample | Using easily available data | "Was this data chosen because it was easy to get, or because it's representative?" |
| Response/measurement bias | Labeling errors, proxy variables | "Are the labels accurate? Is the measured outcome what we actually care about?" |

Overfitting Summary

| Indicator | Meaning |
| --- | --- |
| Training $R^2$ much higher than test $R^2$ | Model memorized training data noise |
| Many features relative to observations | High risk of fitting noise |
| Perfect fit on training data | Almost certainly overfit |
| Performance degrades on new data | Model didn't learn generalizable patterns |

Solutions: held-out test sets, cross-validation, regularization, reducing the number of features, and collecting more data.
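The training-vs-test gap in the table above is easy to demonstrate with a toy simulation. This is a minimal sketch with made-up data (a linear relationship plus noise): a degree-5 polynomial forced through all six training points fits them perfectly but fails badly on fresh data, while a simple least-squares line generalizes far better.

```python
import random

random.seed(0)

# Hypothetical data: true relationship y = 2x + noise
def make_data(n):
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 3)) for x in xs]

train = make_data(6)
test = make_data(100)

# Overfit model: the unique degree-5 polynomial through all 6 training
# points (Lagrange interpolation), i.e. a perfect fit on training data
def interpolate(points, x):
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Simple model: ordinary least-squares line
def fit_line(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x, _ in points)
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

poly = lambda x: interpolate(train, x)
line = fit_line(train)

print(f"overfit polynomial  train MSE: {mse(poly, train):.4f}  test MSE: {mse(poly, test):.1f}")
print(f"simple line         train MSE: {mse(line, train):.4f}  test MSE: {mse(line, test):.1f}")
```

The polynomial's training error is exactly zero ("perfect fit on training data"), yet its test error dwarfs the line's: the gap between training and test performance, not training performance alone, is the diagnostic.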

Algorithmic Bias: Three Landmark Cases

| Case | What Happened | Root Statistical Cause |
| --- | --- | --- |
| Amazon hiring algorithm | Penalized resumes containing the word "women's" and downgraded graduates of women's colleges | Confounding: gender correlated with hiring outcomes in biased historical data |
| Healthcare algorithm | Under-identified Black patients' needs | Proxy variable: used cost (biased by access barriers) as a proxy for need |
| COMPAS recidivism | Black defendants nearly twice as likely to be falsely labeled high risk | Base rate differences + measurement bias (re-arrest as a proxy for reoffending) |

COMPAS Fairness Framework

| Fairness Criterion | Who Championed It | What It Means |
| --- | --- | --- |
| Equal false positive rates | ProPublica | Among those who won't reoffend, the same proportion should be falsely flagged across groups |
| Equal predictive values | Northpointe | Among those flagged high risk, the same proportion should actually reoffend across groups |
| Mathematical impossibility | Chouldechova (2017) | When base rates differ, both criteria cannot be satisfied simultaneously |
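The impossibility result can be illustrated with a small calculation (all numbers here are hypothetical, chosen for round arithmetic): give a classifier identical error rates for two groups, and differing base rates force the predictive values apart.

```python
# Hypothetical classifier with the SAME true positive rate and false
# positive rate for both groups -- only the base rates differ
def ppv(base_rate, tpr, fpr):
    # P(actually reoffends | flagged high risk), by Bayes' theorem
    flagged_true = tpr * base_rate
    flagged_false = fpr * (1 - base_rate)
    return flagged_true / (flagged_true + flagged_false)

tpr, fpr = 0.70, 0.30          # identical error rates across groups
ppv_a = ppv(0.50, tpr, fpr)    # group A: 50% base rate
ppv_b = ppv(0.30, tpr, fpr)    # group B: 30% base rate

print(f"PPV group A: {ppv_a:.2f}")  # 0.70
print(f"PPV group B: {ppv_b:.2f}")  # 0.50
```

Equalizing the error rates (ProPublica's criterion) made the predictive values unequal (Northpointe's criterion), and vice versa; no threshold choice escapes this when base rates differ.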

Implication: Choosing a definition of fairness is an ethical decision, not a technical one.

Big Data Fallacies

| Fallacy | Reality |
| --- | --- |
| "More data = more accurate" | Biased big data is a precisely estimated wrong answer |
| "More variables = more signal" | More variables = more spurious correlations (the multiple comparisons problem) |
| "The pattern speaks for itself" | Correlations require interpretation; they can be spurious, confounded, or unstable |
| "It's statistically significant, so it must matter" | With big data, even trivially small effects are significant; practical significance is what counts |
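The multiple comparisons problem is easy to simulate. In this sketch (the sample sizes and seed are arbitrary), 50 variables of pure random noise still produce dozens of "statistically significant" pairwise correlations at the conventional p < .05 threshold.

```python
import random
from itertools import combinations

random.seed(42)

# 50 variables of pure noise, 30 observations each: no real relationships exist
n_obs, n_vars = 30, 50
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# |r| > 0.36 is roughly the two-sided p < .05 cutoff for n = 30
n_pairs = n_vars * (n_vars - 1) // 2
spurious = sum(1 for a, b in combinations(data, 2) if abs(corr(a, b)) > 0.36)
print(f"{n_pairs} pairs tested, {spurious} 'significant' correlations in pure noise")
```

With 1,225 pairwise tests, roughly 5% come out "significant" by chance alone; a big-data pattern found by searching many variables needs exactly this kind of skepticism.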

Prediction vs. Inference

| Question Type | Goal | Best Tool | Limitation |
| --- | --- | --- | --- |
| "What will happen?" | Prediction | Machine learning | Can't explain why |
| "Why does it happen?" | Inference | Statistical modeling | May be less accurate for prediction |
| "What should we do?" | Causal inference | Experiments (Ch.4) | Requires controlled manipulation |

The STATS Checklist (New Technique)

S — Source: Who made this claim? What are their incentives? Is this peer-reviewed, commercial, or social media?

T — Training Data (or Sample): What data was used? Is it representative? Who's included/excluded?

A — Accuracy Metrics: How is performance measured? Are the right metrics used? Is accuracy misleading due to base rates?

T — Testing: Was the model validated on new data? Is there independent replication?

S — Significance and Size: Is the effect statistically significant? Is it practically meaningful? Is the effect size reported?
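The base-rate caution in the A step can be made concrete with Bayes' theorem (Ch.9). In this hypothetical scenario, a "95% accurate" classifier (sensitivity = specificity = 0.95) screens for something with a 1% base rate, and most of its positive flags turn out to be false alarms:

```python
# Hypothetical screening scenario: "95% accurate" sounds reliable,
# but the positive predictive value depends on the base rate
sensitivity, specificity, prevalence = 0.95, 0.95, 0.01

true_pos = sensitivity * prevalence                # correctly flagged
false_pos = (1 - specificity) * (1 - prevalence)   # falsely flagged
ppv = true_pos / (true_pos + false_pos)            # Bayes' theorem (Ch.9)

print(f"P(actually positive | flagged positive) = {ppv:.1%}")  # 16.1%
```

An advertised accuracy of 95% is compatible with five out of six positive flags being wrong, which is why the checklist asks for sensitivity, specificity, and PPV rather than a single accuracy figure.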

LLMs: What You Need to Know

| Property | Statistical Explanation |
| --- | --- |
| Sound confident | Authoritative language patterns are common in training data |
| Hallucinate | Generate statistically likely text, not verified facts |
| Reflect biases | Absorb biases present in training text |
| Good at patterns | Trained on statistical patterns across enormous text corpora |
| Not truth engines | Predict likely text, not accurate text |

Common Mistakes

| Mistake | Correction |
| --- | --- |
| "The AI is 95% accurate, so it's reliable" | Accuracy depends on base rates; ask for sensitivity, specificity, and PPV |
| "It was trained on millions of examples" | Sample size doesn't fix sample bias |
| "The algorithm is objective" | Algorithms encode the biases of their training data |
| "The AI found a pattern, so it must be real" | Could be overfitting, spurious correlation, or confounding |
| "More data is always better" | More biased data is a more precise wrong answer |
| "The AI proves X causes Y" | AI finds correlations, not causes; inference requires experiments |

Connections

| Connection | Details |
| --- | --- |
| Ch.4 (Sampling bias) | Training data IS a sample; all sampling biases apply |
| Ch.9 (Bayes' theorem) | PPV calculations for AI classifiers use the same Bayesian logic |
| Ch.13 (Hypothesis testing) | Testing whether algorithmic outcomes differ across groups |
| Ch.17 (Effect sizes) | Big data makes everything significant; practical significance matters more |
| Ch.22 (Regression) | ML regression is statistical regression at scale |
| Ch.25 (Communication) | The misleading techniques you learned to avoid as a producer, you now detect as a consumer |
| Ch.27 (Ethics) | Algorithmic bias flows into the broader ethics of data practice |