Exercises: Statistics and AI: Being a Critical Consumer of Data

These exercises progress from conceptual understanding of AI as applied statistics through critical evaluation of real-world claims, algorithmic bias analysis, and a full evaluation of an AI system using the STATS checklist. Estimated completion time: 3.5 hours.

Difficulty Guide:

  • (star) Foundational (5-10 min each)
  • (star)(star) Intermediate (10-20 min each)
  • (star)(star)(star) Challenging (20-40 min each)
  • (star)(star)(star)(star) Advanced/Research (40+ min each)


Part A: Conceptual Understanding (star)

A.1. In your own words, explain what it means to say "machine learning is applied statistics." Give two specific examples of ML techniques that correspond to statistical methods from earlier chapters.

A.2. True or false (explain each):

(a) An AI system trained on 10 million data points is always more reliable than one trained on 10,000 data points.

(b) Training data in machine learning serves the same conceptual role as a sample in statistical analysis.

(c) If an AI system is 95% accurate overall, it is 95% accurate for every subgroup in the population.

(d) A model that perfectly fits the training data will always make the best predictions on new data.

(e) Machine learning algorithms are objective because they are based on mathematics, not human judgment.

(f) The COMPAS algorithm was found to be unfair by every possible definition of fairness simultaneously.

A.3. Explain the difference between overfitting and underfitting. Which produces better performance on training data? Which produces better performance on new (test) data?

A.4. What is the bias-variance tradeoff? Why can't a model have both zero bias and zero variance?

A.5. Explain why a recommendation algorithm (like StreamVibe's) is fundamentally a prediction model. What statistical technique from earlier chapters is it most similar to?

A.6. Why do LLMs sometimes "hallucinate" (generate false information that sounds convincing)? Explain this using the concept of statistical patterns in language.


Part B: Training Data as a Sample (star)

B.1. A voice assistant (like Siri or Alexa) was primarily trained on recordings of native English speakers from the United States. Identify the type of sampling bias this represents and predict how it would affect the system's performance for:

(a) Non-native English speakers

(b) Speakers of British English

(c) Speakers from rural Appalachia

(d) Children

B.2. A credit scoring AI was trained on ten years of loan data from a major bank. During those ten years, the bank's loan officers had discretion over who received loans.

(a) Explain why this training data might contain selection bias.

(b) If the loan officers had (consciously or unconsciously) denied loans to qualified applicants from certain neighborhoods, how would this affect the AI system?

(c) Connect this to the concept of survivorship bias from Chapter 4.

B.3. A social media platform trains a content moderation AI on posts that human moderators flagged as violating community standards. The human moderators are all based in the United States.

(a) What types of content might the AI be poorly equipped to evaluate?

(b) How does this connect to the concept of response bias from Chapter 4?

(c) Propose a way to reduce this bias while keeping the system operational.

B.4. For each of the following AI training datasets, identify the most likely type of sampling bias:

(a) Self-driving car AI trained mostly on data from California highways

(b) Job interview AI trained on recordings of successful candidates

(c) Disease prediction AI trained on data from patients who visited a doctor

(d) Sentiment analysis AI trained on English-language product reviews


Part C: Overfitting and Model Validation (star)(star)

C.1. A data scientist builds a model to predict student GPA from 35 variables (study hours, sleep, social media use, commute time, etc.) using data from 120 students. The model achieves $R^2 = 0.94$ on the training data.

(a) Why should you be suspicious of this $R^2$ value?

(b) If the model were tested on a new sample of 120 students, would you expect $R^2$ to increase, decrease, or stay the same? Why?

(c) What is the ratio of variables to observations? What problem does this suggest?

(d) Suggest two strategies the data scientist could use to reduce overfitting.
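A quick simulation makes the concern in C.1 concrete (a sketch, not part of the exercise: both predictors and outcome here are pure noise, so any apparent fit is overfitting by construction). With 35 predictors and only 120 observations, ordinary least squares produces a sizeable training $R^2$ even when there is no real relationship, and the fit collapses on fresh data.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 120, 35  # observations and predictors, matching exercise C.1

# Pure noise: the predictors carry no information about the outcome.
X_train = rng.normal(size=(n, p))
y_train = rng.normal(size=n)
X_test = rng.normal(size=(n, p))
y_test = rng.normal(size=n)

def fit_ols(X, y):
    """Ordinary least squares with an intercept, via lstsq."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r_squared(X, y, beta):
    pred = np.column_stack([np.ones(len(y)), X]) @ beta
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

beta = fit_ols(X_train, y_train)
train_r2 = r_squared(X_train, y_train, beta)
test_r2 = r_squared(X_test, y_test, beta)

print(f"training R^2: {train_r2:.2f}")  # sizeable, despite zero real signal
print(f"test R^2:     {test_r2:.2f}")   # collapses on new data
```

The gap between the two numbers is exactly what holding out a test set is designed to reveal.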

C.2. A company claims their AI predicts employee turnover with 89% accuracy. You ask for details and learn:

  • The model was trained and tested on the same dataset
  • The dataset includes 500 employees, of whom 50 left the company
  • They used a model with 200 features

(a) Why is training and testing on the same data problematic?

(b) With only 50 positive cases (employees who left) and 200 features, what overfitting risk exists?

(c) If a model simply predicted "will not leave" for every employee, what would its accuracy be? How does this compare to the claimed 89%?

(d) What additional metrics would you want to see besides accuracy?
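The base-rate comparison in (c) can be checked in a few lines, using only the counts stated in the exercise:

```python
# Counts from exercise C.2: 500 employees, 50 of whom left.
n_total = 500
n_left = 50
n_stayed = n_total - n_left

# A trivial model that predicts "will not leave" for everyone
# is correct for every employee who stayed.
baseline_accuracy = n_stayed / n_total
print(f"baseline accuracy: {baseline_accuracy:.0%}")  # 90%

# The claimed 89% is *below* this do-nothing baseline. With classes
# this imbalanced, accuracy alone says nothing about how many actual
# leavers the model catches; precision and recall on the "left" class
# are far more informative.
claimed_accuracy = 0.89
print(f"claimed accuracy beats baseline: {claimed_accuracy > baseline_accuracy}")
```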

C.3. Explain why holding out a test set (data the model never sees during training) is analogous to the scientific principle of replication. How does this relate to the replication crisis discussed in Chapter 13?


Part D: Algorithmic Bias (star)(star)

D.1. The Amazon hiring algorithm (Section 26.5) penalized resumes containing the word "women's." Explain how this outcome can be understood as a confounding variable problem (Chapter 4). Draw a diagram showing the confounding relationship.

D.2. The healthcare algorithm from Section 26.5 used healthcare costs as a proxy for healthcare needs. Explain why this proxy variable introduced racial bias. What would have been a better outcome variable to predict?

D.3. A bank uses an AI system for loan approvals. The system doesn't use race as an input variable, but it does use zip code, education level, and employment history. Explain how the system could still produce racially biased outcomes even without explicitly using race.

D.4. Consider two definitions of fairness for the COMPAS algorithm:

  • Fairness Definition A: Among defendants who did NOT reoffend, the false positive rate should be equal across racial groups.
  • Fairness Definition B: Among defendants scored as high risk, the actual reoffending rate should be equal across racial groups.

(a) Which definition did ProPublica focus on?

(b) Which definition did Northpointe focus on?

(c) Why can't both definitions be satisfied simultaneously when base rates differ?

(d) If you were a judge using COMPAS scores, which definition of fairness would matter more to you? Why? If you were a defendant, would your answer change?

D.5. A university admissions algorithm is trained on data from admitted students who performed well (graduated with a GPA above 3.0). A critic argues this data is biased because it only includes students who were admitted in the first place.

(a) What type of bias does this represent?

(b) How might this bias affect which students the algorithm recommends admitting?

(c) What data would you need to properly evaluate whether the algorithm identifies students who would succeed?


Part E: Big Data Fallacies (star)(star)

E.1. Google Flu Trends failed despite having access to billions of search queries. Explain how this failure illustrates:

(a) The multiple comparisons problem (Chapter 17)

(b) Overfitting (Section 26.4)

(c) The instability of correlations over time (Chapter 22)
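The multiple comparisons failure in (a) is easy to reproduce in miniature (a sketch, not the actual Google Flu Trends methodology): search thousands of noise "query" series for the one that best tracks a target series, and the winner looks impressive in-sample while predicting nothing out of sample.

```python
import numpy as np

rng = np.random.default_rng(7)
n_weeks, n_queries = 100, 5000

# Simulated flu incidence and thousands of unrelated "search query" series.
flu = rng.normal(size=n_weeks)
queries = rng.normal(size=(n_queries, n_weeks))

# Pick the query most correlated with flu over the first 50 weeks...
train = slice(0, 50)
test = slice(50, 100)
train_corrs = np.array([np.corrcoef(q[train], flu[train])[0, 1] for q in queries])
best = np.argmax(np.abs(train_corrs))

# ...then see how that same query does on the next 50 weeks.
test_corr = np.corrcoef(queries[best][test], flu[test])[0, 1]
print(f"best in-sample |r|:          {abs(train_corrs[best]):.2f}")  # looks strong
print(f"same query, out-of-sample r: {test_corr:.2f}")
```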

E.2. A marketing company claims: "We analyzed 50 million customer records and found that people who buy organic milk are 23% more likely to purchase premium streaming services."

(a) Apply the STATS checklist to this claim.

(b) With 50 million records, even tiny effects will be statistically significant. How does this relate to the concept of practical significance from Chapter 17?

(c) Suggest at least three confounding variables that could explain this correlation.

(d) Should a streaming service target organic milk buyers based on this finding? Why or why not?

E.3. "With enough data, the patterns speak for themselves." Critique this statement using at least three concepts from this chapter.


Part F: Prediction vs. Inference (star)(star)

F.1. For each scenario, identify whether the primary goal is prediction or inference, and explain why:

(a) A hospital wants to identify which patients are likely to be readmitted within 30 days.

(b) A researcher wants to understand whether a new drug reduces blood pressure and by how much.

(c) A streaming service wants to recommend shows that users will enjoy.

(d) An economist wants to estimate the effect of minimum wage increases on employment.

(e) An insurance company wants to set premium prices for individual customers.

F.2. A machine learning model predicts house prices with an $R^2$ of 0.93 using 150 features. A simple regression model with 5 features achieves $R^2 = 0.81$.

(a) Which model is better for prediction? Why?

(b) Which model is better for understanding what drives house prices? Why?

(c) If you were advising a real estate investor on which neighborhoods to invest in, which model would be more useful? Explain.

(d) If you were advising a city planner on how to increase housing values, which model would be more useful? Explain.

F.3. An AI system finds that people who set their alarm for 5:00 AM have higher incomes. A wellness influencer uses this to claim that "waking up at 5 AM makes you richer."

(a) Is this a prediction finding or an inference finding?

(b) Identify at least three confounding variables.

(c) What type of study (Chapter 4) would be needed to establish a causal claim?


Part G: Misinformation and Data Literacy (star)(star)(star)

G.1. Apply the STATS checklist to the following claim from a news article:

"A groundbreaking study shows that AI can predict divorce with 79% accuracy by analyzing couples' text messages. Researchers at [University] analyzed 10,000 text message exchanges from 500 couples over two years."

Evaluate each component of the STATS checklist. What questions would you want answered before accepting this claim?

G.2. A pharmaceutical company runs an AI analysis of its drug trial data and reports: "Our AI analysis identified a subgroup of patients for whom the drug is 340% more effective than the overall population."

(a) Why should the phrase "identified a subgroup" raise red flags?

(b) How does this relate to the multiple comparisons problem and p-hacking from Chapter 17?

(c) What would be needed to validate this subgroup finding?

G.3. You see a social media post that reads: "EXPOSED: Government data proves vaccines cause [condition]. The rate of [condition] in vaccinated children is 12% higher than in unvaccinated children."

Using your statistical knowledge, identify at least four questions you would need answered before evaluating this claim. Reference specific chapters and concepts.

G.4. A company sends you a report claiming their AI-powered tutoring software improved student test scores by 15 points. The report includes a bar chart comparing average scores of students who used the software (mean = 78) versus students who didn't (mean = 63).

(a) What critical information is missing from this report?

(b) Could you draw a causal conclusion from this data? Why or why not?

(c) Design a study (using principles from Chapter 4) that could test whether the software actually causes improvement.


Part H: Integration and Critical Thinking (star)(star)(star)

H.1. Maya is evaluating the AI chest X-ray system from Section 26.3. The hospital administrator argues: "Maya, 92% accuracy is better than most of our radiologists. We should adopt this immediately."

Write Maya's response (200-300 words), addressing:

  • Why 92% accuracy is misleading in this context
  • The PPV calculation and what it means for clinical practice
  • The training data representativeness concern
  • A recommendation for how to proceed
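A back-of-envelope PPV calculation Maya might include (a sketch: the exercise states only "92% accuracy", so taking sensitivity = specificity = 0.92 and a 1% disease prevalence are illustrative assumptions):

```python
# Illustrative assumptions: sensitivity = specificity = 0.92,
# disease prevalence 1% in the screened population.
sensitivity = 0.92
specificity = 0.92
prevalence = 0.01

# Positive predictive value via Bayes' rule: P(disease | positive flag).
true_pos = sensitivity * prevalence
false_pos = (1 - specificity) * (1 - prevalence)
ppv = true_pos / (true_pos + false_pos)

print(f"PPV: {ppv:.1%}")  # ~10%: most positive flags are false alarms
```

Under these assumptions, roughly nine out of ten patients the system flags do not have the disease, which is the point Maya's response should drive home.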

H.2. James is presenting the COMPAS analysis to a group of judges. One judge says: "Professor Washington, the algorithm has the same predictive value for Black and white defendants at each score level. That means it's fair. End of story."

Write James's response (200-300 words), explaining:

  • Why equal predictive value doesn't mean equal treatment
  • What the false positive rate disparity means in human terms
  • Why this is a mathematical impossibility result, not a fixable bug
  • What the judges should consider when using these scores

H.3. Alex's manager at StreamVibe says: "Our recommendation algorithm works great — engagement is up 23% since we deployed it." Alex suspects there might be problems the engagement numbers don't capture. Write Alex's email to the manager (200 words) raising concerns about:

  • What "engagement" measures vs. what it misses
  • Potential filter bubbles and their societal implications
  • Whether the algorithm works equally well for all user groups
  • A suggestion for additional metrics to track

H.4. Sam reads a sports analytics article claiming: "Our AI model proves that three-point shooting is overrated — teams would win more games by shooting more mid-range jumpers." The article is based on an analysis of 15,000 games.

(a) Is "proves" appropriate language? What would be more accurate?

(b) Could this be a case of confusing prediction with inference?

(c) What confounding variables might explain this finding?

(d) What would an experiment look like that could test this causal claim? Is such an experiment feasible?


Part I: STATS Checklist Application (star)(star)(star)

I.1. Find a real news article, social media post, or product claim that involves AI or a data-driven system. Apply the full STATS checklist. Write a 300-word evaluation that:

  • Identifies what the claim is
  • Evaluates each element of the STATS checklist
  • Provides an overall assessment of the claim's credibility
  • Suggests what additional information would strengthen or weaken the claim

I.2. Choose one of the following AI applications and write a 400-word critical evaluation using concepts from this chapter:

(a) Facial recognition for law enforcement

(b) AI-generated art and copyright

(c) Algorithmic hiring/resume screening

(d) Predictive policing

(e) AI-powered medical diagnosis

Your evaluation should address: training data quality, potential biases, the prediction vs. inference distinction, who benefits and who is harmed, and what transparency measures should be required.


Part J: Synthesis (star)(star)(star)(star)

J.1. Write a "Statistical Consumer's Guide to AI" (500-700 words) aimed at a general audience — someone who reads the news but has never taken a statistics course. Your guide should explain, in plain language, the three most important things a non-statistician should understand about AI claims. Use at least three specific examples from this chapter.

J.2. You've been asked to advise a hospital considering the adoption of an AI diagnostic tool. Write a one-page memo (400-500 words) that:

  • Lists the five most important questions the hospital should ask the AI vendor
  • Explains why each question matters (using statistical concepts)
  • Recommends a phased adoption plan
  • Addresses what to do if the vendor can't answer these questions

J.3. Design a simple study to test whether an AI hiring tool is biased. Specify:

  • The null and alternative hypotheses
  • What data you would need to collect
  • What statistical test you would use (reference a specific chapter)
  • How you would define "bias" for the purposes of this study
  • What sample size considerations matter (Chapter 17)
  • What ethical considerations are involved (preview of Chapter 27)