Key Takeaways: Statistics and AI: Being a Critical Consumer of Data
One-Sentence Summary
AI and machine learning systems are built on the same statistical foundations you've learned throughout this course — training data is a sample, overfitting is a model complexity problem, algorithmic bias is a sampling and confounding problem, and the STATS checklist (Source, Training data, Accuracy metrics, Testing, Significance and Size) gives you a practical framework for critically evaluating any AI or data-driven claim you encounter.
Core Concepts at a Glance

| Concept | Definition | Why It Matters |
| --- | --- | --- |
| Machine learning as applied statistics | ML algorithms learn patterns from data using regression, classification, and clustering, all of which are statistical techniques | Understanding this demystifies AI and gives you the tools to evaluate it |
| Training data as a sample | The data used to train an AI system functions as a sample, subject to all forms of sampling bias | Biased training data produces biased AI, regardless of algorithm sophistication |
| Overfitting | A model captures noise rather than signal, performing well on training data but poorly on new data | The gap between training and test performance is the key diagnostic |
| Algorithmic bias | Systematic unfairness arising from biased data, biased labels, or proxy variables | AI can encode, automate, and scale human prejudice |
| Prediction vs. inference | Prediction asks "what will happen?"; inference asks "why?" | ML excels at prediction but often can't explain causal mechanisms |
AI/ML as Applied Statistics
| ML Technique | Statistical Equivalent | Chapter Reference |
| --- | --- | --- |
| Supervised learning (regression) | Linear/multiple regression | Ch.22, Ch.23 |
| Supervised learning (classification) | Logistic regression, Naive Bayes | Ch.24, Ch.9 |
| Unsupervised learning (clustering) | Finding natural groupings in scatterplots | Ch.5 |
| Feature engineering | Creating new variables from existing ones | Ch.7 |
| Training/test split | Cross-validation, replication | Ch.13, Ch.17 |
| Regularization | Parsimony, adjusted $R^2$ | Ch.23 |
| Collaborative filtering | Nearest-neighbor regression/prediction | Ch.22 |
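The first row of this mapping can be made concrete. The sketch below (illustrative data, numpy only) "trains" a supervised regression model two ways: by solving the normal equations directly, and with a standard library fitting routine. Both recover the same coefficients, because supervised regression is least squares.

```python
import numpy as np

# Illustrative data: y depends linearly on x (true slope 2, intercept 1).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)

# "Train" the model by solving the normal equations X'X b = X'y.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)   # [intercept, slope]

# The same fit via a standard least-squares routine.
slope, intercept = np.polyfit(x, y, 1)

print(beta)                # close to [1, 2]
print(intercept, slope)    # identical to beta, up to floating-point error
```

Swapping in a "fancier" fitting routine changes the optimization, not the statistical logic: the model is still estimating coefficients from a sample.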
Training Data Bias Checklist
| Bias Type (Ch.4) | AI/ML Equivalent | Red Flag Question |
| --- | --- | --- |
| Selection bias | Non-representative training data | "Does the training data look like the population the system will serve?" |
| Nonresponse bias | Missing populations in the data | "Who is absent from this dataset?" |
| Survivorship bias | Training only on outcomes that were observed | "Are we only seeing the 'winners'?" |
| Convenience sample | Using easily available data | "Was this data chosen because it was easy to get, or because it's representative?" |
| Response/measurement bias | Labeling errors, proxy variables | "Are the labels accurate? Is the measured outcome what we actually care about?" |
Overfitting Summary
| Indicator | Meaning |
| --- | --- |
| Training $R^2$ much higher than test $R^2$ | Model memorized training-data noise |
| Many features relative to observations | High risk of fitting noise |
| Perfect fit on training data | Almost certainly overfit |
| Performance degrades on new data | Model didn't learn generalizable patterns |
Solutions: Held-out test sets, cross-validation, regularization, reducing the number of features, collecting more data
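The training-versus-test gap can be demonstrated in a few lines. This sketch (made-up sine-plus-noise data) fits a modest and an extravagant polynomial on 15 training points and scores both on held-out data; the high-degree model, with one coefficient per observation, fits the training noise and falls apart on new data.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    # Signal sin(3x) plus measurement noise.
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(0, 0.2, n)

x_train, y_train = make_data(15)    # small training sample
x_test,  y_test  = make_data(200)   # held-out test set

def held_out_mse(degree):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on training data only
    return np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

# A degree-3 fit generalizes; degree 14 (one coefficient per training
# point) interpolates the noise exactly and degrades badly on test data.
print(held_out_mse(3))
print(held_out_mse(14))
```

The diagnostic is exactly the one in the table: a large gap between training and test error, not training error alone.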
Algorithmic Bias: Three Landmark Cases
| Case | What Happened | Root Statistical Cause |
| --- | --- | --- |
| Amazon hiring algorithm | Penalized resumes containing the word "women's" and downgraded graduates of women's colleges | Confounding: gender correlated with hiring outcomes in biased historical data |
| Healthcare algorithm | Under-identified Black patients' health needs | Proxy variable: used healthcare cost (biased by access barriers) as a proxy for need |
| COMPAS recidivism | Black defendants were nearly twice as likely to be falsely labeled high risk | Base rate differences plus measurement bias (re-arrest used as a proxy for reoffending) |
COMPAS Fairness Framework
| Fairness Criterion | Who Championed It | What It Means |
| --- | --- | --- |
| Equal false positive rates | ProPublica | Among those who won't reoffend, the same proportion should be falsely flagged across groups |
| Equal predictive values | Northpointe | Among those flagged high risk, the same proportion should actually reoffend across groups |
| Mathematical impossibility | Chouldechova (2017) | When base rates differ, both criteria cannot be satisfied simultaneously |
Implication: Choosing a definition of fairness is an ethical decision, not a technical one.
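The impossibility result can be verified with nothing more than Bayes' theorem. In the sketch below (toy numbers, not the actual COMPAS figures), two groups face a classifier with identical sensitivity and false positive rate; because their base rates differ, the positive predictive values cannot match.

```python
def ppv(base_rate, tpr, fpr):
    # Bayes' theorem: P(will reoffend | flagged high risk)
    flagged_true = base_rate * tpr          # reoffenders correctly flagged
    flagged_false = (1 - base_rate) * fpr   # non-reoffenders falsely flagged
    return flagged_true / (flagged_true + flagged_false)

tpr, fpr = 0.70, 0.20          # identical error rates for both groups
ppv_a = ppv(0.30, tpr, fpr)    # group A: 30% base rate
ppv_b = ppv(0.50, tpr, fpr)    # group B: 50% base rate

print(round(ppv_a, 2), round(ppv_b, 2))   # 0.6 0.78
```

Equalizing the predictive values instead would force the error rates apart; the arithmetic, not anyone's intentions, creates the trade-off.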
Big Data Fallacies
| Fallacy | Reality |
| --- | --- |
| "More data = more accurate" | Biased big data is a precisely estimated wrong answer |
| "More variables = more signal" | More variables mean more spurious correlations (the multiple comparisons problem) |
| "The pattern speaks for itself" | Correlations require interpretation; they can be spurious, confounded, or unstable |
| "It's statistically significant, so it must matter" | With big data, even trivially small effects are significant; practical significance is what counts |
Prediction vs. Inference
Question Type
Goal
Best Tool
Limitation
"What will happen?"
Prediction
Machine learning
Can't explain why
"Why does it happen?"
Inference
Statistical modeling
May be less accurate for prediction
"What should we do?"
Causal inference
Experiments (Ch.4)
Requires controlled manipulation
The STATS Checklist (New Technique)
S — Source: Who made this claim? What are their incentives? Is this peer-reviewed, commercial, or social media?
T — Training Data (or Sample): What data was used? Is it representative? Who's included/excluded?
A — Accuracy Metrics: How is performance measured? Are the right metrics used? Is accuracy misleading due to base rates?
T — Testing: Was the model validated on new data? Is there independent replication?
S — Significance and Size: Is the effect statistically significant? Is it practically meaningful? Is the effect size reported?
LLMs: What You Need to Know
Property
Statistical Explanation
Sound confident
Authoritative language patterns are common in training data
Hallucinate
Generate statistically likely text, not verified facts
Reflect biases
Absorb biases present in training text
Good at patterns
Trained on statistical patterns across enormous text corpora
Not truth engines
Predict likely text, not accurate text
Common Mistakes
Mistake
Correction
"The AI is 95% accurate, so it's reliable"
Accuracy depends on base rates; ask for sensitivity, specificity, and PPV
"It was trained on millions of examples"
Sample size doesn't fix sample bias
"The algorithm is objective"
Algorithms encode the biases of their training data
"The AI found a pattern, so it must be real"
Could be overfitting, spurious correlation, or confounding
"More data is always better"
More biased data is a more precise wrong answer
"The AI proves X causes Y"
AI finds correlations, not causes; inference requires experiments
Connections
Connection
Details
Ch.4 (Sampling bias)
Training data IS a sample; all sampling biases apply
Ch.9 (Bayes' theorem)
PPV calculations for AI classifiers use the same Bayesian logic
Ch.13 (Hypothesis testing)
Testing whether algorithmic outcomes differ across groups
Ch.17 (Effect sizes)
Big data makes everything significant; practical significance matters more
Ch.22 (Regression)
ML regression is statistical regression at scale
Ch.25 (Communication)
The misleading techniques you learned to avoid as a producer, you now detect as a consumer
Ch.27 (Ethics)
Algorithmic bias flows into the broader ethics of data practice
We use cookies to improve your experience and show relevant ads. Privacy Policy