Quiz: Statistics and AI: Being a Critical Consumer of Data
Test your understanding of how AI and machine learning relate to statistics, the dangers of biased training data and overfitting, algorithmic bias as a statistical phenomenon, big data fallacies, the prediction vs. inference distinction, and critical evaluation of AI claims. Try to answer each question before revealing the answer.
1. Machine learning is best described as:
(a) A completely new approach to data analysis unrelated to statistics
(b) A set of computational techniques that extend and apply statistical methods to learn patterns from data
(c) A method for generating random data
(d) A replacement for hypothesis testing
Answer
**(b) A set of computational techniques that extend and apply statistical methods to learn patterns from data.** Machine learning uses the same fundamental logic as statistical analysis: take a sample (training data), find patterns (fit a model), and make predictions about new data (inference/prediction). Supervised learning encompasses regression and classification — methods you already know from Chapters 22-24. The math can be more complex, but the conceptual foundation is statistical.

2. Training data in machine learning is most analogous to:
(a) A population in statistics
(b) A sample in statistics
(c) A confidence interval
(d) A hypothesis
Answer
**(b) A sample in statistics.** Training data is a sample from the population of all possible observations the AI system will encounter. Just like any statistical sample, it can suffer from selection bias, nonresponse bias, survivorship bias, and convenience sampling. An AI system trained on a biased sample will produce biased results, regardless of how sophisticated the algorithm is. This is the central insight of Section 26.3.

3. An AI model achieves $R^2 = 0.97$ on its training data but $R^2 = 0.42$ on new test data. This is most likely an example of:
(a) Algorithmic bias
(b) Underfitting
(c) Overfitting
(d) Big data
Answer
**(c) Overfitting.** The large gap between training performance ($R^2 = 0.97$) and test performance ($R^2 = 0.42$) is the hallmark of overfitting. The model has learned the noise and idiosyncrasies of the training data rather than the underlying signal. It memorized the training set rather than learning generalizable patterns. A well-fit model would show similar performance on both training and test data.

4. Amazon's hiring algorithm penalized resumes containing the word "women's" because:
(a) The algorithm was intentionally programmed to discriminate
(b) Gender was a confounding variable in the historical hiring data
(c) The algorithm didn't have enough training data
(d) The algorithm was underfitting
Answer
**(b) Gender was a confounding variable in the historical hiring data.** Amazon's tech workforce was historically predominantly male, so the training data reflected this imbalance. Gender was correlated with the outcome (getting hired) not because of qualifications but because of historical discrimination. The algorithm found this correlation and encoded it as a pattern. This is a confounding variable problem (Chapter 4, Section 4.6) — gender was associated with both the candidate pool (input) and hiring decisions (output) due to systemic factors, not job-relevant qualifications.

5. The COMPAS recidivism algorithm was found to have equal predictive values across racial groups (Northpointe's claim) AND unequal false positive rates across racial groups (ProPublica's finding). This was possible because:
(a) One of them was wrong
(b) They measured different data
(c) When base rates differ between groups, it's mathematically impossible to equalize both predictive values and error rates simultaneously
(d) The sample size was too small to detect the difference
Answer
**(c) When base rates differ between groups, it's mathematically impossible to equalize both predictive values and error rates simultaneously.** This is a mathematical impossibility result proven by Chouldechova (2017). When the base rate of the outcome (recidivism) differs between groups, no algorithm can simultaneously equalize false positive rates, false negative rates, AND predictive values. Both ProPublica and Northpointe were reporting accurate statistics — they were just measuring different definitions of fairness. This connects to the conditional probability lesson from Chapter 9: P(high risk | did not reoffend) and P(will reoffend | high risk) are different conditional probabilities.

6. Google Flu Trends eventually over-predicted flu cases primarily because:
(a) Google didn't have enough search data
(b) The system found spurious correlations among billions of search terms, and the relationship between search behavior and flu was unstable over time
(c) Flu viruses changed too quickly for any model
(d) The system was underfitting
Answer
**(b) The system found spurious correlations among billions of search terms, and the relationship between search behavior and flu was unstable over time.** With billions of search terms available, Google Flu Trends inevitably found terms that correlated with flu rates by chance (the multiple comparisons problem from Chapter 17). Additionally, the relationship between searches and actual flu cases changed — people searched for flu symptoms due to media coverage and concern, not just illness. This illustrates two big data fallacies: more variables mean more spurious correlations, and correlations found in one time period may not hold in another.

7. Which of the following is the BEST question to ask when a company claims their AI system is "95% accurate"?
(a) "How many engineers worked on it?"
(b) "What programming language is it written in?"
(c) "Was that accuracy measured on the training data or on a separate test set, and what are the sensitivity, specificity, and PPV at the relevant base rate?"
(d) "Is 95% a prime number?"
Answer
**(c) "Was that accuracy measured on the training data or on a separate test set, and what are the sensitivity, specificity, and PPV at the relevant base rate?"** As we learned from Maya's analysis (Section 26.3), overall accuracy can be deeply misleading. A system that always predicts the majority class can have high accuracy with zero usefulness. The PPV depends critically on the base rate (Bayes' theorem, Chapter 9). And accuracy on training data tells you nothing about how the system will perform in the real world — you need performance on held-out or external data.

8. The distinction between prediction and inference is best described as:
(a) Prediction uses data; inference uses theory
(b) Prediction asks "what will happen?" while inference asks "why does it happen?"
(c) Prediction is done by computers; inference is done by humans
(d) There is no meaningful distinction
Answer
**(b) Prediction asks "what will happen?" while inference asks "why does it happen?"** Prediction focuses on accurately forecasting outcomes. Inference focuses on understanding causal relationships and the mechanisms behind those outcomes. Machine learning tends to prioritize prediction (often using complex, hard-to-interpret models), while traditional statistics tends to prioritize inference (using simpler, interpretable models with clear coefficients and p-values). A model can be excellent at prediction without providing any insight into why things happen — and confusing the two can lead to poor decisions.

9. An AI dermatology system was trained on images that were 85% from light-skinned patients. This is most directly a problem of:
(a) Overfitting
(b) Underfitting
(c) Selection bias in the training data
(d) Too much data
Answer
**(c) Selection bias in the training data.** The training data over-represents light-skinned patients and under-represents dark-skinned patients. This means the model has seen many more examples of what skin conditions look like on lighter skin, and will likely have lower sensitivity (miss more conditions) for darker-skinned patients. This is selection bias (Chapter 4, Section 4.3) applied to training data — the sample is not representative of the population the system will serve.

10. LLMs (like ChatGPT or Claude) sometimes generate false information that sounds convincing ("hallucinate") because:
(a) They intentionally lie to confuse users
(b) They generate text based on statistically likely word patterns, not factual verification
(c) They have not been trained on enough data
(d) They are programmed to make mistakes
Answer
**(b) They generate text based on statistically likely word patterns, not factual verification.** LLMs are statistical models of language. They predict the most likely next word or token given the preceding context. If the pattern "According to a 2023 study in The Lancet..." is common in the training data, the model can generate plausible-sounding but fabricated study details. The model's confidence comes from the statistical frequency of language patterns, not from any connection to truth. Understanding this helps you evaluate LLM output more critically.

11. In the bias-variance tradeoff, overfitting corresponds to:
(a) High bias, low variance
(b) Low bias, high variance
(c) High bias, high variance
(d) Low bias, low variance
Answer
**(b) Low bias, high variance.** An overfit model has low bias because it captures even subtle patterns in the training data (including noise). But it has high variance because it's highly sensitive to the specific training set — train it on a slightly different sample and you'll get a substantially different model. A good model balances bias and variance: complex enough to capture real patterns (not too much bias) but simple enough to generalize to new data (not too much variance).

12. Maya calculated that the AI chest X-ray system's PPV dropped from 63.2% at 12% prevalence to 28% at 3% prevalence. This dramatic change demonstrates:
(a) The AI system broke down at lower prevalence
(b) The fundamental dependence of predictive value on base rates (Bayes' theorem)
(c) That lower-prevalence settings have worse equipment
(d) That the AI was overfitting
Answer
**(b) The fundamental dependence of predictive value on base rates (Bayes' theorem).** This is Bayes' theorem in action (Chapter 9, Section 9.8). The same sensitivity and specificity produce very different PPV values depending on the prevalence (base rate). When the condition is rare (3% prevalence), even a specific test generates many false positives relative to true positives. The AI system didn't change — the population context changed. This is why Maya's statistical training was essential: without understanding base rates, the hospital might deploy the system in a setting where 72% of its positive flags are wrong.

13. A hiring AI doesn't use race as an input variable, but it uses zip code, which is correlated with race due to residential segregation. Zip code in this context is functioning as:
(a) A confounding variable
(b) A proxy variable
(c) A random variable
(d) An independent variable
Answer
**(b) A proxy variable.** A proxy variable is one that stands in for (is correlated with) another variable. Because zip code is correlated with race due to historical residential segregation, using zip code as a predictor effectively introduces race into the model even though race isn't explicitly included. This is one of the most insidious forms of algorithmic bias — the model can produce racially biased outcomes while technically being "race-blind." It's also related to confounding (Chapter 4), where zip code is associated with both the outcome (the hiring decision) and the sensitive characteristic (race).

14. Which statement about big data is TRUE?
(a) More data always leads to better conclusions
(b) Big data eliminates the need for representative sampling
(c) With big data, even tiny, practically meaningless effects can be statistically significant
(d) Big data makes overfitting less likely
Answer
**(c) With big data, even tiny, practically meaningless effects can be statistically significant.** With millions of observations, the standard error becomes very small, which means even tiny differences produce small p-values. This is the statistical vs. practical significance distinction from Chapter 17. A correlation of r = 0.003 can be statistically significant with 50 million observations, but it explains essentially none of the variation. Big data doesn't eliminate the need for representative sampling (biased big data gives precise wrong answers), doesn't always lead to better conclusions (spurious correlations multiply), and can actually make overfitting more likely when paired with many variables.

15. The STATS checklist stands for:
(a) Statistics, Testing, Analysis, Training, Sampling
(b) Source, Training Data, Accuracy Metrics, Testing, Significance and Size
(c) Sample, Treatment, Alternative, Test Statistic, Significance
(d) Statistics, Technology, AI, Truth, Science
Answer
**(b) Source, Training Data, Accuracy Metrics, Testing, Significance and Size.** The STATS checklist is the practical evaluation tool introduced in Section 26.9:

- **S**ource: Who made this claim? What are their incentives?
- **T**raining Data: What data was used? Is it representative?
- **A**ccuracy Metrics: How is performance measured? Is the metric appropriate?
- **T**esting: Was the model validated on new data?
- **S**ignificance and Size: Is the effect statistically significant AND practically meaningful?

16. A model that always predicts "no cancer" in a population where 2% of people have cancer would have what accuracy?
(a) 0%
(b) 2%
(c) 50%
(d) 98%
Answer
**(d) 98%.** If only 2% of people have cancer, a model that always says "no cancer" would be correct 98% of the time. This demonstrates why accuracy alone is a misleading metric, especially when one class is much more common than the other (class imbalance). The model has 98% accuracy but 0% sensitivity — it would miss every single cancer case. This is why Maya insists on examining sensitivity, specificity, and PPV, not just overall accuracy.

17. Collaborative filtering (the technique behind many recommendation algorithms) is most similar to which statistical concept?
(a) Hypothesis testing
(b) Nearest-neighbor prediction (a form of regression)
(c) Chi-square test
(d) ANOVA
Answer
**(b) Nearest-neighbor prediction (a form of regression).** Collaborative filtering works by finding users who are similar to you (your "nearest neighbors" in terms of viewing/purchasing history) and predicting your preferences based on theirs. This is fundamentally a prediction/regression task: given information about similar users, predict the value (rating/preference) you would assign to an item you haven't yet encountered. It's a non-parametric form of regression that uses similarity rather than a linear equation.

18. The healthcare algorithm that used cost as a proxy for need resulted in:
(a) Overfitting to the training data
(b) Systematic under-identification of Black patients' healthcare needs because Black patients historically received less spending due to access barriers
(c) Random errors distributed equally across all groups
(d) An algorithm that was too simple to capture patterns
Answer
**(b) Systematic under-identification of Black patients' healthcare needs because Black patients historically received less spending due to access barriers.** The algorithm's designers intended to predict healthcare need, but they used healthcare cost as a proxy. Because Black patients historically received less healthcare spending than white patients with the same conditions (due to insurance disparities, access barriers, and systemic factors), the algorithm learned that Black patients had lower costs and therefore lower "need." This is a measurement problem — the proxy variable (cost) was differentially related to the true outcome (need) across racial groups. Fixing this bias would have increased the percentage of Black patients flagged for extra care from 17.7% to 46.5%.

19. A company claims: "Our AI analyzed 100 million data points and found that people who eat breakfast are more productive at work." The most important critique of this claim is:
(a) 100 million data points is too few
(b) The large sample size almost guarantees statistical significance, so the real question is whether the effect size is practically meaningful, and correlation doesn't imply that eating breakfast causes productivity
(c) AI can't study nutrition
(d) Productivity can't be measured
Answer
**(b) The large sample size almost guarantees statistical significance, so the real question is whether the effect size is practically meaningful, and correlation doesn't imply that eating breakfast *causes* productivity.** With 100 million observations, even trivially small correlations will be statistically significant (Chapter 17). The key questions are: (1) How large is the effect? (2) Is this a correlation or a causal relationship? People who eat breakfast might differ from people who don't in many other ways — income, job type, sleep habits, general health — all of which are confounding variables (Chapter 4). The AI found a pattern, but the causal claim requires an experiment, not an observational data analysis.

20. Which of the following best summarizes the main message of this chapter?
(a) AI systems are too dangerous to use and should be banned
(b) AI systems are always accurate because they use mathematics
(c) Statistical thinking provides the tools to critically evaluate AI systems — understanding their strengths, limitations, and potential for bias — making you a better consumer and citizen
(d) Only computer scientists should evaluate AI systems