datasets with millions or billions of observations — carries an almost magical aura. More data sounds like it should always be better. More data means smaller standard errors, tighter confidence intervals, and more power to detect effects. All of which is true — *when the data is representative*. → Chapter 26: Statistics and AI: Being a Critical Consumer of Data
$P(\text{illness} \mid \text{smoker})$ = "Among smokers, what's the probability of illness?" - $P(\text{smoker} \mid \text{illness})$ = "Among those with illness, what's the probability of being a smoker?" - $P(\text{pass} \mid \text{studied})$ = "Among students who studied, what's the probability of passing?" → Chapter 9: Conditional Probability and Bayes' Theorem
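The direction of conditioning can be sketched with made-up counts (all numbers here are invented for illustration, not from the chapter): the only thing that changes is the denominator.

```python
# Invented counts for illustration: 1,000 people cross-classified
smokers, nonsmokers = 400, 600
ill_smokers, ill_nonsmokers = 80, 30

# P(illness | smoker): the denominator is the smokers
p_ill_given_smoker = ill_smokers / smokers

# P(smoker | illness): the denominator is everyone with the illness
p_smoker_given_ill = ill_smokers / (ill_smokers + ill_nonsmokers)

print(p_ill_given_smoker)            # 0.2
print(round(p_smoker_given_ill, 3))  # 0.727
```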
"Holding other variables constant"
the conceptual leap of controlling for confounders statistically. This connects to the entire course: confounding (Ch. 4), correlation vs. causation (Ch. 22), and the logic of statistical control. Students who master this idea understand why observational data with regression can approximate (but no → Chapter 23: Multiple Regression — Instructor Notes
"P-value explained properly"
fully delivered in Sections 13.5-13.6 - ✅ **"What 'statistically significant' means"** — fully delivered in Section 13.7 - 🔄 **"Daria's shooting analysis"** — partially resolved (formal test, $z = 1.22$, $p = 0.111$, fail to reject at $\alpha = 0.05$; full two-sample test framework in Chapter 16; po → Chapter 13: Hypothesis Testing: Making Decisions with Data
"The probability of WHAT given WHAT?"
Identify which direction the conditional goes. - [ ] **"What's the base rate?"** — How common is this event *before* considering the evidence? - [ ] **"What's the alternative?"** — You must compare competing explanations, not evaluate one in isolation. - [ ] **"Was independence assumed?"** — If prob → Key Takeaways: Conditional Probability and Bayes' Theorem
"This treatment cured 90% of patients"
but the 10% who died weren't tracked, or the patients who were too sick to participate were excluded from the study. - **"These schools have a 100% college acceptance rate"** — because students who wouldn't get in were counseled out before applying. → Chapter 27: Lies, Damn Lies, and Statistics: Ethical Data Practice
the proportion of variability in $y$ explained by $x$. It's the regression analogue of $\eta^2$ from ANOVA (Chapter 20). Same concept, same formula, different context. → Chapter 22: Correlation and Simple Linear Regression
(a) Means:
Player A: $(20+22+19+21+20+23+18+21)/8 = 164/8 = 20.5$ - Player B: $(15+18+20+22+25+20+28+12)/8 = 160/8 = 20.0$ - Player C: $(30+10+25+5+35+15+20+20)/8 = 160/8 = 20.0$ → Quiz: Numerical Summaries — Center, Spread, and Shape
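The arithmetic above is easy to check in Python; adding the population standard deviation shows why players with (nearly) identical means can still be very different. The values are the quiz's own data:

```python
from statistics import mean, pstdev

players = {
    "A": [20, 22, 19, 21, 20, 23, 18, 21],
    "B": [15, 18, 20, 22, 25, 20, 28, 12],
    "C": [30, 10, 25, 5, 35, 15, 20, 20],
}
for name, scores in players.items():
    # Same (or nearly the same) center, very different spread
    print(name, mean(scores), round(pstdev(scores), 2))
```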
Created the National Commission for the Protection of Human Subjects - **1979: Belmont Report** — Established three core principles: - **Respect for persons** — Individuals must be treated as autonomous agents; those with diminished autonomy deserve additional protection - **Beneficence** — Research → Chapter 27: Lies, Damn Lies, and Statistics: Ethical Data Practice
How many observational units are in your dataset? - How many variables are there? How many are numerical? How many are categorical? - Are there any missing values? Which columns have them? - Does pandas correctly identify the variable types, or are some categorical variables stored as numbers? - Wha → Chapter 3: Your Data Toolkit: Python, Excel, and Jupyter Notebooks
Test name: ___________________________________ - Test statistic formula: ___________________________________ - Observed test statistic value: ___________ - Degrees of freedom (if applicable): ___________ - p-value: ___________ → Appendix E: Templates and Worksheets
6. CI Formula:
Formula: point estimate +/- (critical value) x (standard error) - Standard error formula: ___________________________________ - Standard error value: ___________ - Critical value (z* or t*): ___________ - Degrees of freedom (if t): ___________ - Margin of error: ___________ → Appendix E: Templates and Worksheets
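The worksheet formula can be sketched in a few lines of Python (summary numbers invented; with $s$ estimated, the critical value comes from the t-distribution):

```python
import math
from scipy import stats

# Hypothetical summary statistics, not from the book: n = 25, xbar = 50, s = 10
n, xbar, s = 25, 50.0, 10.0

se = s / math.sqrt(n)                  # standard error
t_star = stats.t.ppf(0.975, df=n - 1)  # critical value for 95% confidence
moe = t_star * se                      # margin of error
ci = (xbar - moe, xbar + moe)          # point estimate +/- margin of error
print(ci)
```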
Effect size measure: _______________ Value: ___________ - Is the effect practically meaningful? _________________________________________ - 95% CI for the parameter: ( ___________ , ___________ ) → Appendix E: Templates and Worksheets
pandas thinks it's a number, but it's actually a nominal categorical variable. You can't calculate the "average zip code" (as we saw in Chapter 2's case study on electronic health records). Maya makes a mental note not to include it in any numerical summaries. → Case Study: Exploring Public Health Data with pandas — Dr. Chen's Flu Surveillance
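In pandas, the usual fix is to store identifier-like columns as strings (or a categorical dtype). A minimal sketch with invented zip codes; note that numeric storage has also silently dropped a leading zero:

```python
import pandas as pd

# Invented values: read as integers, 02116 has already lost its leading zero
df = pd.DataFrame({"zip": [2116, 90210, 10001]})

# Store as 5-character strings so no one averages them by accident
df["zip"] = df["zip"].astype(str).str.zfill(5)
print(df["zip"].tolist())   # ['02116', '90210', '10001']
```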
The value of a standardized decision framework - P-values have clear meaning when used correctly - The problem is misuse, not the tool itself - No replacement would be immune to misuse → Exercises: Hypothesis Testing: Making Decisions with Data
extending the two-group comparison from Chapter 16 to three or more groups. The bootstrap and permutation ideas from this chapter can also be applied to multi-group comparisons, providing a useful robustness check on ANOVA results. → Further Reading: The Bootstrap and Simulation-Based Inference
Appendix Types Selected:
Statistical tables (Category A + stats focus) - Python code reference (CODE_LANGUAGE = Python) - Environment setup guide (CODE_LANGUAGE ≠ none) - Data sources guide (Category A + B with data analysis) - Templates & worksheets (Category C) - FAQ / Troubleshooting (Category A + C) - Key studies summar → Introductory Statistics: Making Sense of Data in the Age of AI
Apply the addition and multiplication rules
these are the computational workhorses of probability that students will use through Chapter 10 and beyond. 2. **Distinguish between independent and mutually exclusive events** — students frequently confuse these, and the confusion persists if not addressed directly. 3. **Construct and interpret two → Chapter 8: Probability Foundations — Instructor Notes
Arguments against:
The model systematically over-predicts risk for Black defendants - No individual should be punished because of *statistical patterns* in their demographic group - The training data reflects historical policing patterns, which themselves reflect structural racism - A tool that is 85% accurate overall → Chapter 27: Lies, Damn Lies, and Statistics: Ethical Data Practice
Arguments for using the algorithm:
It's more consistent than individual judges, who also have biases - It provides a structured framework for decisions that were previously subjective - The overall accuracy is high - Not using data means falling back on intuition, which has its own biases → Chapter 27: Lies, Damn Lies, and Statistics: Ethical Data Practice
just because it's made of digits doesn't make it numerical 2. **Treating ordinal as continuous without acknowledging the simplification** — the average of 1-5 ratings is common but technically approximate 3. **Ignoring the data dictionary** — coded values (77 = "Don't know") can corrupt calculations → Key Takeaways: Types of Data and the Language of Statistics
that even excellent tests produce false alarms when the underlying condition is rare. And you've seen that the base rate fallacy, the tendency to ignore prior probabilities, is one of the most common reasoning errors humans make. → Chapter 9: Conditional Probability and Bayes' Theorem
Bayesian updating
the idea that probability is not fixed but changes with new evidence. This is a genuine paradigm shift. Students who internalize Bayesian thinking start seeing evidence differently — they ask "How should this new information update my belief?" rather than "Does this prove or disprove my belief?" → Chapter 9: Conditional Probability and Bayes' Theorem — Instructor Notes
two modes, two peaks. The data doesn't have a single center; it has two. Bimodal distributions often mean your data contains two distinct groups behaving differently (morning visitors and afternoon visitors). → Chapter 5: Exploring Data: Graphs and Descriptive Statistics
binary outcomes
the response variable has exactly two possible values. And here's the problem: the regression models you learned in Chapters 22 and 23 don't work for binary outcomes. If you try to force a straight line through yes/no data, you'll get predictions that are impossible — probabilities below 0 or above 1. → Chapter 24: Logistic Regression: When the Outcome Is Yes or No
bounded below at zero
The **log-normal distribution** often fits right-skewed positive data better than the normal - **Power-law distributions** describe phenomena where extreme values are far more common than the normal model predicts - Assuming normality when the data isn't normal leads to **systematic errors** in pred → Case Study 2: When Normality Fails — Income, Wealth, and Power-Law Distributions
students should leave this chapter always asking "how big is the effect?" 3. **Conduct a basic power analysis** — understanding power demystifies sample size decisions in research design. → Chapter 17: Power and Effect Sizes — Instructor Notes
Calculate and interpret p-values
the most misunderstood concept in introductory statistics. Spend more time here than on any other single concept. 2. **State null and alternative hypotheses** — the framework that structures every subsequent inference chapter. 3. **Distinguish between Type I and Type II errors** — understanding the → Chapter 13: Hypothesis Testing — Instructor Notes
Calculate and interpret standard deviation
this is the measure students will use most frequently for the rest of the course (standard error, test statistics, confidence intervals all depend on it). 2. **Use the five-number summary and box plots** — box plots appear in nearly every subsequent chapter for visual comparison. 3. **Apply the Empi → Chapter 6: Numerical Summaries — Instructor Notes
not even a little bit. We never take derivatives or integrals. - ❌ **Prior statistics courses** — this book starts from zero. - ❌ **Programming experience** — we teach you Python from scratch in Chapter 3. - ❌ **A scientific calculator** — Python and Excel will handle all computation. - ❌ **A "math → Prerequisites: Are You Ready?
arguably the single most important theorem in all of statistics. It's the bridge from probability to inference, and it'll explain why everything we've learned about the normal distribution matters even more than you currently think. → Chapter 10: Probability Distributions and the Normal Curve
Ch.10 section 10.9
`stats.probplot()`, Ch.10 section 10.9 QRPs, see *questionable research practices* quartile, **Ch.6 section 6.4** questionable research practices (QRPs), **Ch.27 section 27.5** → Index
Ch.11 section 11.2
of the mean, **Ch.11 section 11.2** - of the proportion, **Ch.11 section 11.5** sampling variability, **Ch.11 section 11.2** SAT/ACT score distributions, Ch.10 case-study-01 scatterplot, Ch.5 section 5.9, **Ch.22 section 22.2** scipy.stats, see individual function names seaborn, **Ch.5 section 5.2** → Index
Are observations independent? - Create histograms or QQ-plots for each group to check normality - Run Levene's test for equal variances → Chapter 20: Analysis of Variance (ANOVA)
Checklist:
Are we still conversational? Using "you" and "I"? Contractions? ✓ - Are we still leading with stories and concrete examples before abstractions? ✓ (Blood pressure, Daria's three-pointers, StreamVibe watch time) - Are we acknowledging math anxiety without condescending? ✓ ("Take a breath — I'm showin → Chapter 10: Probability Distributions and the Normal Curve
a formula-based method for analyzing categorical data. Where this chapter used simulation to test group differences, the chi-square test uses a clever comparison of observed vs. expected frequencies. Key resources to preview: → Further Reading: The Bootstrap and Simulation-Based Inference
the paired vs. independent distinction is the primary decision point, and students frequently get it wrong. 2. **Conduct and interpret a two-sample t-test** — the most commonly used test in published research. 3. **Construct confidence intervals for the difference between two groups** — the CI for t → Chapter 16: Comparing Two Groups — Instructor Notes
Choose the threshold:
Consider the relative costs of false positives vs. false negatives - The threshold is a *values* decision, not just a statistical one → Key Takeaways: Logistic Regression
Classify variables as categorical or numerical
this determines which statistical methods are appropriate for the rest of the course. 2. **Distinguish between populations and samples** — a foundational distinction for all of inference. 3. **Read and interpret data tables and data dictionaries** — practical skill students need immediately for the → Chapter 2: Types of Data and the Language of Statistics — Instructor Notes
Cluster sampling
the sections are the clusters, and everyone within selected clusters is surveyed. > 2. **Convenience sampling** — the researcher is surveying whoever happens to be at that location at that time. The sample is not random and likely overrepresents frequent mall shoppers. > 3. A stratified sample guara → Chapter 4: Designing Studies: Sampling and Experiments
essentially, it finds users who are similar to you (who watched and liked similar shows) and recommends what *they* watched next. Statistically, this is nearest-neighbor regression: predicting your rating for an unwatched show based on the ratings of your "neighbors." > > **Step 4: Rank and serve.** → Chapter 26: Statistics and AI: Being a Critical Consumer of Data
Color key:
🔵 Light blue: Foundation — start here - 🟠 Orange: Critical bridge chapters — don't skip these - 🟢 Green: Core methods - 🔴 Pink: Capstone and reflection → How to Use This Book
Common actions to log:
Removed duplicate rows - Dropped rows with missing values in column(s) ___ - Imputed missing values in ___ using ___ method - Recoded variable ___ (original values -> new values) - Created new variable ___ from ___ - Removed outliers in ___ (criteria: ___) - Fixed inconsistent entries in ___ (e.g., → Appendix E: Templates and Worksheets
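Several of those log entries map directly onto pandas one-liners. A minimal sketch with toy data (the column names and values are invented):

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real dataset
df = pd.DataFrame({"age": [23, 23, -99, 41],
                   "group": ["a", "a", "b", "b"]})

df = df.drop_duplicates()                   # removed duplicate rows
df["age"] = df["age"].replace(-99, np.nan)  # recoded placeholder -99 -> NaN
df = df.dropna(subset=["age"])              # dropped rows missing age
df["decade"] = df["age"] // 10              # created new variable from age
print(len(df))   # 2
```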
Common wrong interpretations:
"There's a 95% chance mu is between 42 and 58." (Wrong — mu is fixed, not random.) - "95% of the data falls in this interval." (Wrong — that describes the data, not the parameter.) - "If we sampled again, there's a 95% chance the new sample mean would be in this interval." (Wrong — this confuses the → Appendix F: FAQ and Troubleshooting
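What 95% confidence does mean shows up clearly in a simulation: across many repeated samples, about 95% of the intervals capture the fixed mu. A quick sketch (all parameters invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma, n, reps = 50, 10, 25, 2000
t_star = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    moe = t_star * sample.std(ddof=1) / np.sqrt(n)
    covered += (sample.mean() - moe) <= mu <= (sample.mean() + moe)

print(covered / reps)   # close to 0.95
```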
Communication
[ ] My notebook tells a coherent story from start to finish - [ ] My executive summary/policy brief is written for a non-technical audience - [ ] I've translated statistical results into plain language - [ ] My project is well-organized with clear section headers - [ ] I've proofread for clarity, gr → Capstone Rubric
the two-sample t-test, the paired t-test, and the two-proportion z-test. You'll finally be able to answer Alex's big question: "Did the new recommendation algorithm actually increase watch time compared to the old one?" And Professor Washington's: "Is the algorithm's false positive rate different fo → Chapter 15: Inference for Means
Conduct a one-sample t-test for a population mean
this is the workhorse procedure for inference about means. 2. **Understand when to use z vs. t** — the t-distribution accounts for the additional uncertainty of estimating sigma. 3. **Verify conditions for t-procedures** — randomness, approximate normality (or large sample), and independence. → Chapter 15: Inference for Means — Instructor Notes
Conduct a one-sample z-test for a proportion
this applies the hypothesis testing framework from Ch. 13 to a specific and common scenario. 2. **Verify conditions for inference about proportions** — the success-failure condition (np >= 10 and n(1-p) >= 10) is essential and often neglected. 3. **Interpret results in real-world context** — student → Chapter 14: Inference for Proportions — Instructor Notes
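A sketch of that workflow with made-up numbers (testing H0: p = 0.5 with 120 successes in n = 200; note that the standard error uses p0, the null value):

```python
import math
from scipy import stats

n, x, p0 = 200, 120, 0.5

# Success-failure condition: n*p0 >= 10 and n*(1 - p0) >= 10
assert n * p0 >= 10 and n * (1 - p0) >= 10

p_hat = x / n
se = math.sqrt(p0 * (1 - p0) / n)            # SE under the null hypothesis
z = (p_hat - p0) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-sided

print(round(z, 2), round(p_value, 4))
```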
Conduct and interpret a one-way ANOVA
the procedural skill, including reading an ANOVA table. 3. **Perform post-hoc pairwise comparisons** — finding a significant F-test is just the beginning; post-hoc tests tell you which groups differ. → Chapter 20: ANOVA — Instructor Notes
Conduct the test and build a CI.
Compute the z-test statistic and p-value - Construct both a Wald CI and a Wilson CI - Compare the two CIs — do they differ substantially? → Chapter 14: Inference for Proportions
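Both intervals can be built by hand from their textbook formulas. A sketch with invented counts (12 successes in 50 trials); the Wilson interval pulls its center toward 0.5, which matters most for small n or extreme p-hat:

```python
import math
from scipy import stats

def wald_ci(x, n, conf=0.95):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    p = x / n
    moe = z * math.sqrt(p * (1 - p) / n)
    return (p - moe, p + moe)

def wilson_ci(x, n, conf=0.95):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    p = x / n
    center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

print(wald_ci(12, 50))
print(wilson_ci(12, 50))
```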
understanding why correlation doesn't prove causation requires grasping confounding. This is a genuine threshold concept: once students see confounders everywhere, they can't unsee them. Some students cross this threshold quickly; others need the entire semester. → Chapter 4: Designing Studies — Instructor Notes
Content Blocks Activated:
Category A: Mathematical formulation, code implementation, worked examples, debugging walkthroughs, comparison tables - Category B: Research study breakdowns, debate/discussion frameworks, ethical analysis - Category C: Action checklists, self-assessment tools, scenario walkthroughs → Introductory Statistics: Making Sense of Data in the Age of AI
contingency table
a grid showing the frequency of each combination of categories. You calculated joint probabilities (cell count / grand total), marginal probabilities (row or column total / grand total), and conditional probabilities (cell count / row or column total). > > Back then, contingency tables were tools fo → Chapter 19: Chi-Square Tests: Categorical Data Analysis
Contingency tables
the topic of this section — are built from two categorical variables. They show how many observations fall into each combination of categories. Remember: categorical variables classify observations into groups. That classification is exactly what makes probability calculations from contingency table → Chapter 8: Probability: The Foundation of Inference
$\bar{d} = 3.17$, $SE_d = 0.748$, $t = 4.24$, $p = 0.0007$ - Conclusion: Strong evidence of improvement. ✓ → Chapter 16: Comparing Two Groups
cross-sectional
a snapshot of many groups at one point in time. Approach (b) is **longitudinal** — following the same individuals over time. The longitudinal approach is better for studying interventions because you can compare each family's asthma outcomes *before and after* receiving the air purifier, using each → Chapter 4: Designing Studies: Sampling and Experiments
D
d)
Descriptive: "What is the average drink price of the four coffee shops in this dataset?" (Just summarizing the data you have.) - Inferential: "Based on this sample, are independent coffee shops in this city more expensive than chain shops, on average?" (Generalizing from 4 shops to all shops in the → Quiz: Types of Data and the Language of Statistics
Data Handling
[ ] I've inspected the data and reported its basic properties - [ ] I've created a data dictionary - [ ] I've addressed missing values with documented reasoning - [ ] I've handled outliers with documented reasoning - [ ] I've created any needed derived variables → Capstone Rubric
Age -99 and 999: Set to NaN (clearly placeholder values, not real ages) - Negative watch times: Set to NaN (impossible values) - Extreme watch times (>24 hours): Flagged but NOT removed. Rationale: the data might represent cumulative watch time over the study period, not a single day. Will investiga → Case Study: Alex's StreamVibe Cleaning Log — A Step-by-Step Template
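That pattern (recode impossible values, flag suspicious ones) looks roughly like this in pandas; the values are invented stand-ins for Alex's data:

```python
import numpy as np
import pandas as pd

watch = pd.DataFrame({"age": [34, -99, 999, 52],
                      "watch_hours": [2.5, -1.0, 30.0, 6.0]})

# Impossible values -> NaN
watch["age"] = watch["age"].where(watch["age"].between(0, 120), np.nan)
watch["watch_hours"] = watch["watch_hours"].where(watch["watch_hours"] >= 0, np.nan)

# Suspicious-but-possible values: flag, don't delete
watch["extreme_watch"] = watch["watch_hours"] > 24

print(int(watch["age"].isna().sum()), int(watch["extreme_watch"].sum()))  # 2 1
```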
Decomposing variability
the insight that total variation can be split into "explained" (between-group) and "unexplained" (within-group) components. This idea extends naturally to regression (Ch. 22-23), where R-squared is the proportion of variation explained by the model. → Chapter 20: ANOVA — Instructor Notes
Degrees of freedom for chi-square tests:
Goodness-of-fit: df = k - 1 (where k = number of categories) - Test of independence: df = (r - 1)(c - 1) (where r = rows, c = columns) → Appendix A: Statistical Tables
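The same lookups can be done in scipy, e.g. critical values at alpha = 0.05 for a 5-category goodness-of-fit test and a 3×4 test of independence:

```python
from scipy import stats

k = 5          # categories (goodness-of-fit)
r, c = 3, 4    # rows, columns (independence)

df_gof = k - 1               # 4
df_ind = (r - 1) * (c - 1)   # 6

print(round(stats.chi2.ppf(0.95, df_gof), 3))   # 9.488
print(round(stats.chi2.ppf(0.95, df_ind), 3))   # 12.592
```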
department selectivity
Women applied more heavily to selective departments (low admission rates for everyone) - Men applied more heavily to less selective departments (high admission rates for everyone) - Aggregating across departments mixed the effect of *who applied where* with the effect of *how each department treated → Chapter 27: Lies, Damn Lies, and Statistics: Ethical Data Practice
Describe the properties of the normal distribution
students need to internalize symmetry, the 68-95-99.7 rule, and the role of mean and standard deviation as parameters. 3. **Assess normality using QQ-plots** — a practical diagnostic skill used whenever t-tests, ANOVA, or regression require normality assumptions. → Chapter 10: Probability Distributions and the Normal Curve — Instructor Notes
descriptive statistics
we're summarizing the data we have. But generalizing to "all American adults" would be **inferential statistics** — we're reaching beyond our sample to make a claim about the population. The quality of that inference depends on how well our 500 respondents represent the U.S. adult population (which → Chapter 3: Your Data Toolkit: Python, Excel, and Jupyter Notebooks
Design principles:
One main idea per slide - Minimal text — let the visuals do the work - Every chart must have a clear takeaway stated in the slide title (e.g., "Customers who receive same-day shipping return 40% fewer items" — not "Return rate by shipping method") → Capstone Project 2: Business Analytics Report
the "prosecutor's fallacy" is one of the most dangerous statistical errors, and it appears in medicine, law, and everyday reasoning. 2. **Apply Bayes' theorem to update probabilities** — this is the mathematical foundation of learning from evidence. 3. **Construct tree diagrams** — tree diagrams mak → Chapter 9: Conditional Probability and Bayes' Theorem — Instructor Notes
Distribution thinking
seeing data as a distribution rather than individual numbers. This chapter is where the shift begins. A student who thinks "the data has a right-skewed distribution with a center around 45 and a spread of about 20" is thinking statistically in a way that a student who only sees individual data point → Chapter 5: Graphs and Descriptive Statistics — Instructor Notes
entire shapes with centers, spreads, peaks, tails, and outliers. Not "the average is 44%" but "the distribution of shooting percentages is symmetric and unimodal, centered around 44%, with a spread from about 15% to 70%." Not "the average age is 38" but "the age distribution is bimodal, with peaks i → Chapter 5: Exploring Data: Graphs and Descriptive Statistics
Domain 1: Education
**Civil Rights Data Collection (CRDC)**: U.S. Department of Education data on school discipline, access to advanced courses, teacher quality, and resource allocation — broken down by race, gender, and disability status. - **National Center for Education Statistics (NCES)**: Graduation rates, test sc → Capstone Project 3: Social Justice Data Audit
Domain 2: Criminal Justice
**Stanford Open Policing Project**: Traffic stop data from multiple states, including driver demographics and stop outcomes. - **The Sentencing Project / U.S. Sentencing Commission**: Federal sentencing data with demographic variables. - **Local police department open data**: Many cities publish arr → Capstone Project 3: Social Justice Data Audit
Domain 3: Employment and Hiring
**Bureau of Labor Statistics / Current Population Survey**: Employment rates, wages, and occupational data by demographics. - **EEOC charge data**: Discrimination complaint data by type and basis. - **Glassdoor or PayScale salary data** (publicly available subsets). → Capstone Project 3: Social Justice Data Audit
Domain 4: Housing and Lending
**Home Mortgage Disclosure Act (HMDA) data**: Mortgage application outcomes by race, income, and geography. - **HUD Fair Housing complaints**: Discrimination complaint data. - **Zillow / Redfin open data**: Housing prices and neighborhood demographics. → Capstone Project 3: Social Justice Data Audit
Dr. Maya Chen
public health epidemiologist tracking disease outbreak patterns across communities (CDC/WHO-style data) 2. **Alex Rivera** — marketing data analyst at StreamVibe testing whether a new recommendation algorithm increases watch time (A/B testing, tech industry) 3. **Professor James Washington** — crimi → Introductory Statistics: Making Sense of Data in the Age of AI
E
Each user is randomly assigned
the hash function is effectively random with respect to user characteristics - **Each user stays in the same group** — the hash of a given user ID always produces the same number, so users don't bounce between layouts between sessions - **The assignment is invisible to users** — they don't know they → Case Study: A/B Testing in Tech — Designing Experiments at Scale
a score of 7 means the same probability of reoffending for all groups 2. **Equal false positive rates** — the same proportion of non-reoffenders are wrongly classified as high risk across all groups 3. **Equal false negative rates** — the same proportion of reoffenders are wrongly classified as low → Case Study 2: James's Algorithmic Reckoning — Ethics of Data-Driven Criminal Justice
Ethics
[ ] I've addressed data provenance and consent - [ ] I've considered who might be harmed - [ ] I've discussed representation and missing voices - [ ] I've considered potential misuse of findings - [ ] My ethical discussion is specific to my project, not generic → Capstone Rubric
Evaluate the model:
Confusion matrix at the chosen threshold - Accuracy, sensitivity, specificity, precision, F1 score - ROC curve and AUC → Key Takeaways: Logistic Regression
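All of those metrics fall out of the four confusion-matrix counts. A sketch with invented counts at one threshold:

```python
# Hypothetical counts: true/false positives and negatives
tp, fp, fn, tn = 40, 10, 20, 130

accuracy    = (tp + tn) / (tp + fp + fn + tn)
sensitivity = tp / (tp + fn)      # true positive rate (recall)
specificity = tn / (tn + fp)      # true negative rate
precision   = tp / (tp + fp)      # positive predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy, precision, round(f1, 3))   # 0.85 0.8 0.727
```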
Every variable is either categorical or numerical
and getting this right determines which tools and analyses are appropriate. 2. **Numbers aren't always numerical variables.** Zip codes, ID numbers, and phone numbers are categorical despite being made of digits. 3. **Parameters describe populations; statistics describe samples.** Most real-world an → Chapter 2: Types of Data and the Language of Statistics
Example business questions (choose your own):
Which customer segments are most profitable, and what distinguishes high-value customers from low-value ones? - Does the new marketing campaign lead to significantly higher conversion rates compared to the control group? Is the difference large enough to justify the cost? - What factors best predict → Capstone Project 2: Business Analytics Report
Examples of independent samples:
Patients randomly assigned to a drug group vs. a placebo group - Students at School A vs. students at School B - Users who see Algorithm A vs. users who see Algorithm B (Alex's A/B test!) - Crime outcomes under algorithm-based bail vs. judge-based bail (James's study!) → Chapter 16: Comparing Two Groups
Examples:
Is the cure rate higher for Drug A than Drug B? (Maya's world) - Is the click-through rate different for two website designs? (Alex's world) - Is the recidivism rate different for algorithm-recommended vs. judge-recommended bail decisions? (James's world!) → Chapter 16: Comparing Two Groups
exercises.md
practice problems at four difficulty levels - **quiz.md** — self-assessment with answers and explanations - **case-study-01.md** — extended real-world application - **case-study-02.md** — additional deep-dive case study - **key-takeaways.md** — one-page summary card - **further-reading.md** — annota → How to Use This Book
the "big idea" is that you can learn about the population by cleverly reusing your sample. 2. **Construct bootstrap confidence intervals** — the practical skill that students can apply immediately. 3. **Compare simulation-based and formula-based approaches** — understanding when and why the two appr → Chapter 18: Bootstrap and Simulation-Based Inference — Instructor Notes
Explain when nonparametric methods are needed
the decision to use a nonparametric test is based on assumption violations, and students need to diagnose these. 2. **Conduct a Wilcoxon rank-sum test** — the most common nonparametric alternative to the two-sample t-test. 3. **Compare parametric and nonparametric approaches** — students should unde → Chapter 21: Nonparametric Methods — Instructor Notes
Scatterplots of $y$ vs. each $x$ - Correlation matrix among predictors (watch for multicollinearity) - Descriptive statistics → Key Takeaways: Multiple Regression
the error rates within each racial group. Northpointe is reporting **predictive values** — the accuracy rates within each risk category. And here's the thing that breaks people's brains: **it's mathematically impossible for both metrics to be equal across racial groups when the base rates differ.**" → Chapter 26: Statistics and AI: Being a Critical Consumer of Data
Each confirmatory workup costs approximately $800-$1,200 in specialist visits and lab tests. - Total cost of false-positive follow-ups: approximately 2,100 false positives × $1,000 each ≈ $2.1 million. - Cost per true case detected: about $265,000 (total program cost divided by 8 cases found).
Save your work to Google Drive frequently. - Re-run your notebook from the top when you reconnect (Colab doesn't preserve variables between sessions). → Appendix C: Environment Setup Guide
Flag any ambiguous variables
ones where the classification isn't clear-cut. Write a sentence explaining why you chose the classification you did. > 6. **Identify the data structure:** Is your dataset cross-sectional or longitudinal? How do you know? > > **Example:** If you chose the World Happiness Report: > - Observational uni → Chapter 2: Types of Data and the Language of Statistics
Flip sign
In extreme cases, the coefficient can reverse direction entirely. A variable that appeared to *increase* $y$ in simple regression might *decrease* $y$ once confounders are controlled. This is Simpson's Paradox in regression form. → Chapter 23: Multiple Regression: The Real World Has More Than One Variable
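A tiny simulation (all relationships invented) makes the flip concrete: here x truly lowers y, but because x travels with a confounder z that raises y, the simple regression slope comes out positive:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)              # confounder
x = z + 0.5 * rng.normal(size=n)    # predictor entangled with z
y = -1.0 * x + 3.0 * z              # x truly *decreases* y

simple_slope = np.polyfit(x, y, 1)[0]     # misleadingly positive

X = np.column_stack([np.ones(n), x, z])   # control for z
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(simple_slope > 0, round(coefs[1], 4))   # True -1.0
```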
Follow the style guide:
Conversational tone — write like you're explaining to a friend - Lead with intuition, then formulas - Use inclusive language and diverse examples - Keep code examples under 15-20 lines - Always explain what code does in plain English 3. **Respect the citation honesty system:** - **Tier 1:** Only for → Contributing to Introductory Statistics: Making Sense of Data in the Age of AI
df1 = k - 1 (number of groups minus 1) - df2 = N - k (total observations minus number of groups) → Appendix A: Statistical Tables
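For example, with k = 4 groups and N = 40 observations, df1 = 3 and df2 = 36, and scipy can supply the corresponding critical value:

```python
from scipy import stats

k, N = 4, 40
df1, df2 = k - 1, N - k    # 3 and 36

f_crit = stats.f.ppf(0.95, df1, df2)   # critical value at alpha = 0.05
print(round(f_crit, 2))
```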
Format and style requirements:
Use a professional business report format with clear section headers - Open with a one-paragraph executive summary stating the question, key finding, and top recommendation - Use bullet points and numbered lists for readability - Include 3-4 well-designed visualizations embedded in the report (not s → Capstone Project 2: Business Analytics Report
four different public health intervention programs
a vaccination-focused campaign, a nutrition education program, a community fitness initiative, and a standard-care control — and she wants to know: do the programs produce different health outcomes? That's not a two-group question. It's a four-group question. → Chapter 20: Analysis of Variance (ANOVA)
they systematically miss patterns — but **low variance** — they give similar predictions regardless of which specific training data you use. - **Complex models** (like a polynomial with 50 terms) have **low bias** — they can capture intricate patterns — but **high variance** — they're highly sensiti → Chapter 26: Statistics and AI: Being a Critical Consumer of Data
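The variance half of the tradeoff can be seen by refitting models on fresh noisy samples and watching how much a single prediction wobbles. A sketch under an invented data-generating setup, comparing a line with a degree-9 polynomial:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y_true = np.sin(2 * np.pi * x)

def prediction_spread(degree, trials=200):
    """Std. dev. of the fitted prediction at x = 0.5 across refits."""
    preds = []
    for _ in range(trials):
        y = y_true + rng.normal(0, 0.3, size=x.size)  # fresh noisy sample
        preds.append(np.polyval(np.polyfit(x, y, degree), 0.5))
    return np.std(preds)

simple, complex_ = prediction_spread(1), prediction_spread(9)
print(simple < complex_)   # the flexible model's predictions vary more
```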
holding other variables constant
the idea that a regression coefficient tells you the effect of one predictor *while controlling for all the others*. > > **Why does this matter for communication?** Because one of the most common misinterpretations of regression is ignoring the "all else equal" clause. When you write "for each addit → Chapter 25: Communicating with Data: Telling Stories with Numbers
Hospital discharge data
matched to voter registration records - **Web browsing histories** — matched to social media profiles - **Genome data** — relatives' DNA can identify "anonymous" donors - **Location data** — just four spatiotemporal points can uniquely identify 95% of people → Chapter 27: Lies, Damn Lies, and Statistics: Ethical Data Practice
How to use this table:
For P(Z <= z): Read the value directly. - For P(Z > z): Compute 1 - P(Z <= z). - For P(-z < Z < z): Compute 2 * P(Z <= z) - 1. - For P(Z < -z): By symmetry, P(Z < -z) = P(Z > z) = 1 - P(Z <= z). → Appendix A: Statistical Tables
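All four lookups reduce to one function call plus arithmetic. A minimal sketch using `scipy.stats.norm` (the value 1.96 is just a familiar example):

```python
from scipy.stats import norm

z = 1.96
p_le = norm.cdf(z)          # P(Z <= z): read directly
p_gt = 1 - p_le             # P(Z > z)
p_between = 2 * p_le - 1    # P(-z < Z < z)
p_lt_neg = 1 - p_le         # P(Z < -z), by symmetry
print(round(p_between, 3))  # close to 0.95, as expected for z = 1.96
```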
Hypotheses:
$H_0$: The two populations have the same distribution (the values from one group are equally likely to be larger or smaller than values from the other) - $H_a$: The values from one group tend to be systematically larger (or smaller) than the other → Chapter 21: Nonparametric Methods: When Assumptions Fail
the decision of how to handle missing values can change results, and students need to understand the tradeoffs. 2. **Document data cleaning decisions for reproducibility** — this is a professional practice that separates careful analysis from sloppy analysis. 3. **Use pandas for common cleaning task → Chapter 7: Data Wrangling — Instructor Notes
sampling bias, response bias, and confounding are concepts students will use for the rest of the course and their lives. 3. **Evaluate whether a study design supports causal conclusions** — the highest-order objective in this chapter. → Chapter 4: Designing Studies — Instructor Notes
Parents received a phone call saying their newborn may have a serious metabolic disorder. - They were told to bring the baby in for confirmatory testing (blood draws, specialist appointments). - The waiting period for confirmatory results was 5-10 days. → Case Study 1: Medical Screening — When a Positive Test Doesn't Mean What You Think
Explain why statistics matters in your life and career, regardless of your major - Tell the difference between descriptive and inferential statistics - Start seeing statistical reasoning in the news, conversations, and decisions around you → Chapter 1: Why Statistics Matters (and Why You Might Actually Enjoy This)
the idea that participants should know they're in a study and agree to participate. This principle emerged from horrific historical abuses: the Tuskegee syphilis study (where Black men with syphilis were deliberately left untreated for decades), Nazi medical experiments, and others. > > Today, any s → Chapter 4: Designing Studies: Sampling and Experiments
the predicted value of $y$ when $x = 0$ - $b_1$ is the **slope** — the predicted change in $y$ for each one-unit increase in $x$ - $x$ is the explanatory (predictor) variable → Chapter 22: Correlation and Simple Linear Regression
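The intercept and slope can be estimated with `scipy.stats.linregress`. The (x, y) values below are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Made-up data with a roughly linear pattern
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.1, 5.9, 8.2, 9.8])

result = stats.linregress(x, y)
b0, b1 = result.intercept, result.slope   # y-hat = b0 + b1 * x
print(round(b0, 2), round(b1, 2))
```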
Interpret coefficients:
Exponentiate each coefficient: $e^{b_i}$ = odds ratio - "For each one-unit increase in $x_i$, the odds of the outcome are multiplied by $e^{b_i}$, holding all other variables constant" - Check p-values and 95% CIs for the odds ratios → Key Takeaways: Logistic Regression
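The exponentiation step is one line. The coefficient value below is hypothetical, picked so the resulting odds ratio is easy to read:

```python
import math

# Hypothetical logistic regression coefficient for predictor x_i
b = 0.405                    # change in log-odds per one-unit increase in x_i
odds_ratio = math.exp(b)     # e^b: the odds are multiplied by this factor
print(round(odds_ratio, 2))  # about 1.5: each unit of x_i multiplies the odds by ~1.5
```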
Interpret individual predictors:
Coefficient: "For each one-unit increase in $x_i$, predicted $y$ changes by $b_i$, **holding all other variables constant**" - t-test / p-value: Is this specific predictor significant? - 95% CI: Plausible range for the true effect → Key Takeaways: Multiple Regression
[ ] I've interpreted every result in context (not just "significant" or "not significant") - [ ] I've correctly interpreted confidence intervals and p-values - [ ] I've discussed limitations and confounders - [ ] I've been careful with causal language - [ ] I've synthesized results into an overall c → Capstone Rubric
Interpretation:
Poverty rate and AQI are statistically significant predictors of ER visit rates after controlling for the other variables. - The uninsured percentage has a p-value of 0.057 — just barely above the $\alpha = 0.05$ threshold. This is a borderline result. The coefficient suggests a real effect, but we → Chapter 23: Multiple Regression: The Real World Has More Than One Variable
Intervention strategies:
**The courtroom analogy.** In a criminal trial, the jury does not calculate "the probability the defendant is innocent." They evaluate "how likely is this evidence, assuming the defendant is innocent?" If the evidence would be very unlikely under innocence (small p-value), they reject the innocence → Common Student Struggles and Intervention Strategies
Investigation examples:
Do suspension rates differ significantly by race after controlling for school size and poverty level? - Is there an association between the percentage of students of color in a school and access to AP courses? - Do students from different income backgrounds have significantly different loan repaymen → Capstone Project 3: Social Justice Data Audit
Day 1: Chapters 1-2. Statistical claims in headlines activity. Variable classification exercise. - Day 2: Chapter 3. Guided Jupyter lab: load data, explore with `.head()`, `.describe()`. Excel parallel demo. → 10-Week Quarter / Accelerated Syllabus
Key features:
28 chapters covering the complete introductory statistics curriculum - Conversational, intuition-first approach — formulas serve understanding, not the other way around - Python and Excel/Google Sheets examples side by side - Progressive portfolio project — leave with a real data analysis you can sh → Introductory Statistics: Making Sense of Data in the Age of AI
`watch_time_min` and `sessions` have identical missing counts (743) — they're missing together, which makes sense (if a user has no watch data, sessions would also be missing) - `satisfaction_score` has 22.4% missing — above the 20% threshold. This variable may not be reliable enough for primary ana → Case Study: Alex's StreamVibe Cleaning Log — A Step-by-Step Template
**Layer 1 (what she says aloud):** "Poverty is correlated with ER overcrowding. But when we look deeper, the real drivers are insurance access and primary care availability. Communities with similar poverty levels have very different ER rates depending on how many doctors they have." → Case Study 1: Maya's Public Health Brief for the City Council
Look back
trace the arc of what you've learned across all eight parts of this textbook 2. **Look around** — see where Maya, Alex, James, and Sam ended up 3. **Look forward** — map the roads that branch out from here, depending on where your curiosity leads → Chapter 28: Your Statistical Journey Continues
M
making decisions under uncertainty
it's the most practical course you'll take regardless of your major 2. **Descriptive statistics** summarizes what you have; **inferential statistics** reaches beyond your data to the bigger picture 3. Every statistical investigation follows **four pillars:** question → data → analysis → interpretati → Chapter 1: Why Statistics Matters (and Why You Might Actually Enjoy This)
mean, median, and mode — each answer the question "What's the typical value?" in different ways. The mean uses every value but is sensitive to outliers. The median is resistant to outliers. The mode identifies the most common value. → Chapter 6: Numerical Summaries: Center, Spread, and Shape
Measures of spread
range, IQR, variance, and standard deviation — quantify how much values differ from each other. **Standard deviation** is the most important: it measures the typical distance of values from the mean. → Chapter 6: Numerical Summaries: Center, Spread, and Shape
Minimum
the smallest value > 2. **Q1** — the first quartile (25th percentile) > 3. **Median (Q2)** — the middle value (50th percentile) > 4. **Q3** — the third quartile (75th percentile) > 5. **Maximum** — the largest value → Chapter 6: Numerical Summaries: Center, Spread, and Shape
using random sampling to approximate quantities that are difficult to compute analytically. The name comes from the Monte Carlo Casino in Monaco, because the methods rely on random chance (like gambling) to produce reliable answers. → Chapter 18: The Bootstrap and Simulation-Based Inference
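A classic illustration of the idea: estimating π by throwing random points at a unit square and counting how many land inside the quarter circle. This is a sketch of the general technique, not an example from the chapter:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible
n = 100_000

# A point (x, y) with x, y ~ Uniform(0, 1) lands inside the quarter
# circle when x^2 + y^2 <= 1; that happens with probability pi/4.
inside = sum(
    1 for _ in range(n)
    if random.random() ** 2 + random.random() ** 2 <= 1
)
pi_estimate = 4 * inside / n
print(pi_estimate)  # close to 3.14159
```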
Among users who watched a recommendation, 40% browsed for 30+ minutes: $P(\text{browsed} \mid \text{watch}) = 0.40$. - Among users who didn't watch, 10% browsed that long: $P(\text{browsed} \mid \text{didn't watch}) = 0.10$. → Chapter 9: Conditional Probability and Bayes' Theorem
nonparametric methods
distribution-free alternatives to the $t$-test and ANOVA that make fewer assumptions about the data. Some nonparametric tests are closely related to chi-square tests; for example, the Kruskal-Wallis test (a nonparametric alternative to one-way ANOVA) is essentially a chi-square test applied to ranke → Further Reading: Chi-Square Tests: Categorical Data Analysis
variables where the values are numbers on a meaningful scale (ages, incomes, test scores, watch times). And understanding histograms is the gateway to one of the most powerful ideas in all of statistics: distribution thinking. → Chapter 5: Exploring Data: Graphs and Descriptive Statistics
O
observational data
people weren't randomly assigned to be vaccinated or not. Maybe vaccinated people tend to be younger, healthier, or have better access to healthcare in general. The association between vaccination and lower hospitalization is real in this data, but calling it *causal* would require a controlled stud → Case Study: Exploring Public Health Data with pandas — Dr. Chen's Flu Surveillance
why this chapter matters 2. **"In this chapter, you will learn to..."** — concrete skills 3. **Learning path annotations** — 🏃 Fast Track and 🔬 Deep Dive guidance 4. **Main content sections** — concepts, examples, code, and practice 5. **Project checkpoint** — apply it to your portfolio 6. **Practic → How to Use This Book
ordinal
the categories have a natural order (free < basic < premium) that reflects increasing levels of service. `user_id` is **nominal** — it's a label with no meaningful order. Both are categorical variables, but the distinction matters because ordinal variables preserve rank information that nominal vari → Chapter 3: Your Data Toolkit: Python, Excel, and Jupyter Notebooks
P
P(A|B) ≠ P(B|A)
and why confusing the two is called the **prosecutor's fallacy**. You've seen how this confusion can have real consequences in medicine, criminal justice, and everyday reasoning. → Chapter 9: Conditional Probability and Bayes' Theorem
P-hacking
exploring many analyses and reporting only the significant ones — inflates the false positive rate far beyond the nominal $\alpha$. It is one of the primary causes of the replication crisis. → Chapter 13: Hypothesis Testing: Making Decisions with Data
the disease prevalence of 1%. But here's the thing: that 1% is itself an estimate. Maya's county might have a different prevalence than the national average. The inference tools in this chapter let you estimate the *actual* prevalence in a specific population and test whether it differs from the ass → Chapter 14: Inference for Proportions
pre-registration
publicly committing to your hypotheses and analysis plan before collecting data — has become a cornerstone of credible science. When a study is pre-registered, you know the researchers didn't explore dozens of paths and cherry-pick the one that "worked." > > **The ethical principle:** A p-value is o → Chapter 13: Hypothesis Testing: Making Decisions with Data
Probability as long-run frequency
the conceptual shift from certainty to probabilistic thinking. Many students think a probability of 0.7 means "it will happen" and 0.3 means "it won't." The idea that probability describes the long-run behavior of a random process (not a prediction about a single event) is a threshold concept. → Chapter 8: Probability Foundations — Instructor Notes
probability is not fixed; it changes with evidence
is the conceptual shift that separates casual probability thinking from the kind of reasoning that actually works in the real world. Every confidence interval (Chapter 12), hypothesis test (Chapter 13), and regression model (Chapters 22-24) you'll encounter builds on this foundation. → Chapter 9: Conditional Probability and Bayes' Theorem
Potential health improvements for ~2,500 residents in three communities - $4.2 million in federal remediation funding - Environmental justice for communities historically ignored - Long-term reduction in healthcare costs - Regulatory accountability → Case Study 1: Maya's Public Health Data Dilemma — Privacy vs. Public Good
randomly assign treatments so that confounders balance out across groups. But what about observational data, where you *can't* randomly assign? > > Multiple regression offers a partial solution: it lets you **statistically control** for confounders by including them in the model. It's not as strong → Chapter 23: Multiple Regression: The Real World Has More Than One Variable
ranks
first, second, third, and so on — and analyzing the ranks instead of the raw values. This simple trick sidesteps the normality assumption entirely. It also makes the methods naturally resistant to outliers, because whether that extreme value is 100 or 1,000,000, it gets the same rank: the highest on → Chapter 21: Nonparametric Methods: When Assumptions Fail
Reading a box plot:
Box length = IQR (spread of the middle 50%) - Median position in box = symmetry or skew - Whisker lengths = range of non-outlier data - Dots beyond whiskers = potential outliers - Comparing box plots side by side = comparing distributions → Key Takeaways: Numerical Summaries — Center, Spread, and Shape
Reading the CI:
Contains zero → no significant difference - Entirely positive → Group 1 plausibly higher - Entirely negative → Group 1 plausibly lower - Width → precision of the estimate → Key Takeaways: Comparing Two Groups
this is a threshold concept that shows how data can tell opposite stories at different levels of aggregation. 3. **Apply ethical frameworks to data analysis** — from collection through reporting, every step involves ethical choices. → Chapter 27: Ethical Data Practice — Instructor Notes
Recommended Data Sources:
**CDC WONDER** (wonder.cdc.gov): Mortality data, birth data, environmental health data. Example datasets include cause-of-death by county, infant mortality rates, cancer incidence. - **Behavioral Risk Factor Surveillance System (BRFSS)**: The largest continuously conducted health survey in the world → Capstone Project 1: Public Health Data Investigation
Recommended datasets (all free and public):
**CDC BRFSS** — health behaviors and outcomes across U.S. states - **Gapminder** — life expectancy, GDP, and population across countries and decades - **U.S. College Scorecard** — college costs, graduation rates, and earnings - **World Happiness Report** — national happiness scores and contributing → How to Use This Book
Red flags from `.describe()`:
`watch_time_min` has a minimum of **-5.2** (impossible — can't watch negative minutes) - `watch_time_min` maximum is **15,840 minutes** = 264 hours = 11 days straight. Possible bot or data error. - `age` minimum is **-99** (impossible — likely a placeholder for "unknown") - `age` maximum is **999** → Case Study: Alex's StreamVibe Cleaning Log — A Step-by-Step Template
a subtle but powerful idea. Students scoring in the top 10% on one exam will, on average, score lower on the next exam — not because they got worse, but because extreme scores tend to be partly due to chance. This concept explains many real-world phenomena (sports "slumps," the "sophomore jinx") and → Chapter 22: Correlation and Simple Linear Regression — Instructor Notes
[ ] My notebook runs from top to bottom without errors (Restart and Run All) - [ ] All data files are included or download instructions are provided - [ ] All imports are at the top - [ ] Code is commented - [ ] Random seeds are set → Capstone Rubric
Required elements (choose at least two):
**Alternative test:** If you used a parametric test, also run the nonparametric equivalent (or vice versa). Do the conclusions change? - **Subgroup analysis:** Does the disparity vary across subgroups? (e.g., does a racial disparity in sentencing look different for drug offenses vs. violent offenses → Capstone Project 3: Social Justice Data Audit
Required elements:
Clearly define the groups being compared and the metric of interest - State null and alternative hypotheses - Choose the appropriate test: - Two-sample t-test (for comparing means of two independent groups) - Paired t-test (for before/after or matched comparisons) - Two-proportion z-test (for compar → Capstone Project 2: Business Analytics Report
Requirements:
Open with the problem and why it matters - Summarize key findings in plain language (no jargon, no formulas) - Include 2-3 well-designed visualizations that support your narrative - Clearly state what the data shows and, equally important, what it does *not* show - End with actionable recommendation → Capstone Project 1: Public Health Data Investigation
Resampling
the insight that you can learn about the population by cleverly reusing the sample. This is a modern computational approach that was impossible before computers. Students who grasp this idea have a deeper understanding of what inference is actually doing. → Chapter 18: Bootstrap and Simulation-Based Inference — Instructor Notes
you can't have a negative F (it's a ratio of two variance estimates, which are never negative) - It starts at 0 and has a long right tail - As $df_2$ gets large, the distribution becomes more concentrated around $F = 1$ - It was named in honor of Ronald A. Fisher, the statistician who developed ANOVA in the 1920s → Chapter 20: Analysis of Variance (ANOVA)
the t-test's ability to give approximately correct results even when assumptions aren't perfectly met. The guidelines were: for $n \geq 30$, the CLT handles most non-normality. For $15 \leq n < 30$, check for outliers and strong skew. For $n < 15$, you really need approximate normality. > > Now we f → Chapter 21: Nonparametric Methods: When Assumptions Fail
Rules of thumb:
For small datasets (< 50 observations): 5-7 bins - For medium datasets (50-300): 8-15 bins - For large datasets (300+): 15-25 bins - A popular formula: number of bins ≈ √n (the square root of the number of observations) → Chapter 5: Exploring Data: Graphs and Descriptive Statistics
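The square-root rule is simple enough to sketch directly (the sample sizes below are arbitrary examples):

```python
import math

# Square-root rule: number of bins is approximately sqrt(n)
for n in (40, 150, 400):
    bins = round(math.sqrt(n))
    print(n, bins)  # 40 -> 6 bins, 150 -> 12, 400 -> 20
```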
Run `.info()`
Check data types and non-null counts - [ ] **Run `.describe()`** — Look for impossible min/max, suspicious means, large standard deviations - [ ] **Run `.value_counts()`** on categorical columns — Check for inconsistent categories - [ ] **Check for duplicates** — `df.duplicated().sum()` - [ ] **Coun → Key Takeaways: Data Wrangling — Cleaning and Preparing Real Data
the very thing we couldn't derive a formula for. The spread of this distribution tells you how much the sample median varies from sample to sample. And that gives you everything you need to build a confidence interval. → Chapter 18: The Bootstrap and Simulation-Based Inference
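The bootstrap for a median can be sketched in a few lines. The skewed sample here is simulated for illustration; the resampling loop and percentile interval are the general recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=10, size=60)  # skewed data; no formula for the median's SE

# Resample with replacement many times, recording the median each time
boot_medians = [
    np.median(rng.choice(sample, size=len(sample), replace=True))
    for _ in range(5000)
]

# Percentile bootstrap 95% confidence interval for the population median
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(round(lo, 1), round(hi, 1))
```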
sampling variability
the natural variation that occurs because different random samples contain different individuals. We've been aware of this concept since Chapter 1, when we noticed that a product's 4.2-star rating based on 47 reviews was more trustworthy than a 4.5-star rating based on 3 reviews. Now we're going to → Chapter 11: Sampling Distributions and the Central Limit Theorem
Set up and use a Jupyter notebook
students need this working by the end of class to stay on pace. 2. **Perform basic pandas operations** (load, view, filter, sort) — these are the building blocks for every subsequent lab. 3. **Navigate between Python and spreadsheet approaches** — students should know both exist and when each is app → Chapter 3: Your Data Toolkit — Instructor Notes
Setup (from Washington's expanded dataset):
The base rate of re-offense in the studied population is 20%: $P(\text{re-offense}) = 0.20$. - The algorithm flags 75% of people who will re-offend as "high risk": $P(\text{high risk} \mid \text{re-offense}) = 0.75$ (the algorithm's sensitivity). - The algorithm flags 22% of people who will NOT re-o → Chapter 9: Conditional Probability and Bayes' Theorem
Setup:
Overall, 15% of users who are shown a recommendation click and watch it. This is the **prior**: $P(\text{watch}) = 0.15$. - Among users who watched, 60% had previously watched a movie in the same genre. This is the **likelihood**: $P(\text{same genre} \mid \text{watch}) = 0.60$. - Among users who di → Chapter 9: Conditional Probability and Bayes' Theorem
Shape
Is it symmetric? Skewed? How many peaks? 2. **Center** — Where is the "middle" of the data? 3. **Spread** — How far does the data stretch? 4. **Unusual features** — Outliers? Gaps? Clusters? → Chapter 5: Exploring Data: Graphs and Descriptive Statistics
Shrink
This is the most common case. The simple regression coefficient was "bloated" because it included the effects of correlated omitted variables. Adding those variables deflates the original coefficient to its "true" partial effect. (This happened with Maya's poverty rate: 11.4 → 5.81.) → Chapter 23: Multiple Regression: The Real World Has More Than One Variable
Similar distribution shapes
the two populations should have roughly the same shape, just shifted horizontally (if you want to interpret the test as comparing medians; otherwise, it's a general "stochastic dominance" test) 4. **At least ordinal data** — the observations need to be rankable → Chapter 21: Nonparametric Methods: When Assumptions Fail
Simpson's paradox
data can tell opposite stories at different levels of aggregation. This is deeply counterintuitive and challenges students' trust in simple summaries. Once understood, it permanently changes how students think about aggregated data. → Chapter 27: Ethical Data Practice — Instructor Notes
skewed right
a long tail to the right is pulling the mean above the median. > 3. Standard deviation measures the **typical distance** of values from the mean. It tells you how spread out the data is around the center. > 4. For **bell-shaped, symmetric** distributions: about **68%** of data falls within 1 SD of t → Chapter 6: Numerical Summaries: Center, Spread, and Shape
spurious correlation
two variables that track each other over time purely by coincidence. Both happened to increase over the same period. The correlation is real (the numbers genuinely co-vary), but the relationship is meaningless. → Chapter 22: Correlation and Simple Linear Regression
they strip away the original units and put everything on the same "standard deviations from the mean" scale. You'll use z-scores throughout this course — they're the foundation of hypothesis testing in Chapter 13. → Chapter 6: Numerical Summaries: Center, Spread, and Shape
State hypotheses:
$H_0$: The variable follows the specified distribution - $H_a$: The variable does not follow the specified distribution 2. **Calculate expected frequencies:** $E_i = n \times p_i$ where $p_i$ is the hypothesized proportion for category $i$ 3. **Check conditions:** All expected counts $\geq 5$ 4. **C → Key Takeaways: Chi-Square Tests: Categorical Data Analysis
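The four steps can be sketched with `scipy.stats.chisquare`. The die-rolling counts below are hypothetical, with H₀ saying each face is equally likely ($p_i = 1/6$):

```python
from scipy.stats import chisquare

# Hypothetical fairness check: 120 die rolls
observed = [25, 17, 15, 23, 24, 16]
n = sum(observed)                # 120
expected = [n * (1 / 6)] * 6     # E_i = n * p_i = 20 per face; all >= 5, condition met

stat, p = chisquare(observed, f_exp=expected)
print(round(stat, 2), round(p, 3))  # large p: no evidence against fairness
```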
statistic
a number calculated from a sample of shots (the 65 she's taken so far). It's his best estimate of the parameter, but it's not exactly right. If Daria took another 65 shots, she might shoot 35% or 41%. The statistic *varies* from sample to sample; the parameter does not. → Chapter 2: Types of Data and the Language of Statistics
Statistical Analysis
[ ] I've used at least three distinct statistical methods - [ ] Each method is appropriate for the data type and question - [ ] I've verified conditions/assumptions for each method - [ ] I've calculated and interpreted effect sizes - [ ] I've distinguished between statistical and practical significa → Capstone Rubric
Statistical thinking
seeing the world through the lens of variation and uncertainty. This is a gradual shift that begins here and deepens throughout the course. Don't expect it to click fully in Chapter 1; plant the seed. → Chapter 1: Why Statistics Matters — Instructor Notes
Statistics is about decisions under uncertainty
not just formulas and calculations. Frame this as the course's central idea from day one. 2. **Descriptive vs. inferential statistics** — students need this distinction immediately; it structures the entire course. 3. **AI systems run on statistics** — this motivates the course for students who thin → Chapter 1: Why Statistics Matters — Instructor Notes
StatQuest: "Chi-Square Tests" (YouTube)
visual walkthrough of the goodness-of-fit and independence tests - **Khan Academy: "Chi-Square Distribution" (khanacademy.org)** — the distribution behind the test - **StatKey: Chi-Square Test module** — you can compare the chi-square test to a simulation-based version, connecting Chapter 18's ideas → Further Reading: The Bootstrap and Simulation-Based Inference
StatQuest: "Confidence Intervals" (YouTube)
Josh Starmer's explanation of what "95% confident" really means - **OnlineStatBook: Confidence Interval Simulation** (https://onlinestatbook.com/stat_sim/conf_interval/index.html) — build confidence intervals interactively and watch the coverage probability → Further Reading: Sampling Distributions and the Central Limit Theorem
StatQuest: "Hypothesis Testing" (YouTube)
Josh Starmer's explanation of the logic behind hypothesis tests - **Seeing Theory — Hypothesis Testing** (https://seeing-theory.brown.edu/frequentist-inference/) — interactive visualization of p-values and rejection regions - **Wheelan, *Naked Statistics*, Chapter 9** — accessible introduction to hy → Further Reading: Confidence Intervals: Estimating with Uncertainty
StatQuest: "One-Proportion Z-Test" (YouTube)
focused walkthrough of the proportion test - **Khan Academy: "Hypothesis Test for a Proportion" (khanacademy.org)** — multiple worked examples - **Seeing Theory: Hypothesis Testing module** — interactive p-value visualization for proportions → Further Reading: Hypothesis Testing: Making Decisions with Data
StatQuest: "One-Way ANOVA" (YouTube)
clear visual walkthrough of the $F$-test - **Khan Academy: "ANOVA" (khanacademy.org)** — step-by-step introduction to the decomposition of variability - **SciPy documentation: `scipy.stats.f_oneway`** — the Python function for one-way ANOVA → Further Reading: Chi-Square Tests: Categorical Data Analysis
StatQuest: "Statistical Power" (YouTube)
clear visual explanation of what power is and why it matters - **Khan Academy: "Effect Size" (khanacademy.org)** — Cohen's d and its interpretation - **Seeing Theory: Power module** — interactive power curve visualization → Further Reading: Comparing Two Groups
StatQuest: "Student's t-test" (YouTube)
focused walkthrough of the one-sample t-test - **Khan Academy: "One-sample t-test" (khanacademy.org)** — multiple worked examples - **Seeing Theory: Frequentist Inference module** — interactive t-test visualization → Further Reading: Inference for Proportions
StatQuest: "Two-Sample t-Test" (YouTube)
clear walkthrough of the independent-samples t-test - **Khan Academy: "Paired t-Test" (khanacademy.org)** — multiple worked examples of before-and-after designs - **Seeing Theory: Frequentist Inference module** — interactive visualization of two-sample comparisons → Further Reading: Inference for Means
*Random sample?* Yes — Maya used a random sample from the county's health records. - *Independence?* The county has 500,000 adults. Is $120 \leq 0.10 \times 500{,}000 = 50{,}000$? Yes, easily. - *Nearly normal or large $n$?* $n = 120 \geq 30$, so the CLT guarantees the sampling distribution of $\bar → Chapter 12: Confidence Intervals: Estimating with Uncertainty
Of the 100 with disease: 99% test positive → **99 true positives**, 1 false negative. - Of the 99,900 without disease: 2% test positive → **1,998 false positives**, 97,902 true negatives. → Chapter 9: Conditional Probability and Bayes' Theorem
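The punchline of the natural-frequency table is one division. Carrying the counts above forward:

```python
# Counts from the 100,000-person natural-frequency breakdown above
true_pos = 99       # 99% of the 100 people with the disease
false_pos = 1_998   # 2% of the 99,900 people without it

# P(disease | positive test): true positives over all positives
ppv = true_pos / (true_pos + false_pos)
print(round(ppv, 3))  # about 0.047 -- under 5%, despite a 99% sensitive test
```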
CDC's Behavioral Risk Factor Surveillance System (BRFSS) - Gapminder (global health and economics) - U.S. College Scorecard (education outcomes) - World Happiness Report - NOAA Climate Data Online → Introductory Statistics: Making Sense of Data in the Age of AI
they fit the model on one portion and evaluate it on another. If you'd tested your Chapter 22 regression model on a held-out sample and the $R^2$ dropped dramatically, that would have been a sign of overfitting. → Chapter 26: Statistics and AI: Being a Critical Consumer of Data
What are my incentives? Am I under pressure to find certain results? > 2. **The decision-maker** — Who is using this analysis? What decisions will they make? > 3. **The subjects** — Whose data is being analyzed? Did they consent? Can they be harmed? > 4. **The affected community** — Who will be impa → Chapter 27: Lies, Damn Lies, and Statistics: Ethical Data Practice
this is the most important threshold concept in the entire course. It is the bridge from probability to inference. Once students truly understand the CLT, confidence intervals (Ch. 12) and hypothesis tests (Ch. 13) make logical sense. Without it, those chapters are just recipes to follow. → Chapter 11: Sampling Distributions and the Central Limit Theorem — Instructor Notes
The courtroom analogy:
$H_0$ = presumption of innocence - Data = prosecution's evidence - p-value = how convincing the evidence is - $\alpha$ = "beyond a reasonable doubt" threshold - Reject $H_0$ = guilty verdict - Fail to reject $H_0$ = not guilty (NOT the same as innocent) → Key Takeaways: Hypothesis Testing: Making Decisions with Data
The data type is the same
continuous numerical, ratio level — but the **operational definition** changes what the numbers mean. This is why data dictionaries are essential: two teams working with "watch time" could be measuring fundamentally different things. → Case Study: Classifying Data at Scale — When Every Click Becomes Data
one of the most misunderstood concepts in all of science. Getting this right transforms statistical reasoning. Getting it wrong leads to the kind of errors documented in the replication crisis. Plan to spend 15-20 minutes specifically on what the p-value does NOT mean. → Chapter 13: Hypothesis Testing — Instructor Notes
They tend to agree when:
Sample sizes are moderate to large ($n \geq 20$ per group) - The data are approximately normal or at least symmetric - There are no extreme outliers - The data are on an interval or ratio scale → Chapter 21: Nonparametric Methods: When Assumptions Fail
They tend to disagree when:
Sample sizes are small ($n < 15$) and the data are skewed - Heavy outliers are present (these inflate the parametric test's standard error) - The data are ordinal (means may not be meaningful) - The distributions have very different shapes across groups → Chapter 21: Nonparametric Methods: When Assumptions Fail
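Running both tests side by side on small, skewed, simulated samples is an easy way to see the comparison in practice. The data below are generated for illustration, not taken from the chapter:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Small, right-skewed samples: the setting where the two tests can disagree
group_a = rng.exponential(scale=5, size=12)
group_b = rng.exponential(scale=5, size=12) + 4  # shifted upward

t_stat, t_p = stats.ttest_ind(group_a, group_b)     # parametric (two-sample t)
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)  # rank-based (Mann-Whitney U)
print(round(t_p, 3), round(u_p, 3))
```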
Thinking in odds
the shift from probability to odds/log-odds is a genuine conceptual leap. Students who can move fluidly between probability, odds, and log-odds have the foundation for more advanced modeling. → Chapter 24: Logistic Regression — Instructor Notes
which is what Alex has been waiting for with her A/B test, and what Professor Washington needs for his algorithm audit. → Chapter 14: Inference for Proportions
they simply don't have enough participants to reliably detect the effects they're looking for. An underpowered study is like trying to spot a bird with binoculars that are out of focus. The bird might be right there, but you'll never see it. → Chapter 17: Power, Effect Sizes, and What "Significant" Really Means
Use a spreadsheet when:
You're doing quick, one-off calculations on small data (under ~1,000 rows) - You need to manually enter or edit data - You're sharing results with someone who doesn't know Python - You want to quickly eyeball data by scrolling through it → Chapter 3: Your Data Toolkit: Python, Excel, and Jupyter Notebooks
Use both when:
You want to check whether the formula-based and bootstrap results agree (they should for standard statistics under good conditions) - You're learning statistics and want to build intuition about sampling distributions → Chapter 18: The Bootstrap and Simulation-Based Inference
Use formula-based methods when:
You're computing a CI or test for a mean or proportion - The sample is large enough for the CLT - The data are reasonably normal (or you have $n \geq 30$) - You want a quick answer without programming → Chapter 18: The Bootstrap and Simulation-Based Inference
Use Python when:
Your dataset has more than ~1,000 rows - You need to reproduce your analysis later (or share exact steps) - You're doing anything that requires multiple steps (filter, then calculate, then graph) - You'll need to do the same analysis again on new data - You need statistical tests beyond basic averag → Chapter 3: Your Data Toolkit: Python, Excel, and Jupyter Notebooks
Use simulation-based methods when:
You need inference for a non-standard statistic (median, ratio, correlation, etc.) - Your data are non-normal and your sample is moderate-sized - You want to avoid distributional assumptions - The formula-based conditions are questionable → Chapter 18: The Bootstrap and Simulation-Based Inference
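The agreement check described under "Use both when" can be sketched in a few lines. This is a minimal illustration with made-up data (assuming NumPy and SciPy), comparing a formula-based t-interval with a percentile-bootstrap interval for a mean:

```python
# Sketch: formula-based 95% CI vs. percentile-bootstrap 95% CI for a mean.
# Under good conditions (large-ish normal sample) they should roughly agree.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(50, 10, size=100)   # simulated sample
n = len(x)

# Formula-based t-interval
se = x.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
formula_ci = (x.mean() - t_crit * se, x.mean() + t_crit * se)

# Percentile bootstrap: resample with replacement, record each mean
boot_means = [rng.choice(x, size=n, replace=True).mean() for _ in range(5000)]
boot_ci = (np.percentile(boot_means, 2.5), np.percentile(boot_means, 97.5))

print("formula:  ", formula_ci)
print("bootstrap:", boot_ci)
```

When the two intervals diverge noticeably, that itself is diagnostic: it suggests the formula-based conditions are questionable.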
V
Visualization
[ ] Every graph has a title, axis labels, and legend (if needed) - [ ] I've used appropriate graph types for each variable type - [ ] My graphs clearly communicate their intended message - [ ] I've included a variety of visualization types - [ ] My visualizations are integrated with my narrative → Capstone Rubric
W
What "95% confidence" really means
it's about the process, not the specific interval. This interpretation issue is the most commonly tested misconception on AP Statistics exams and in intro stats courses nationwide. → Chapter 12: Confidence Intervals — Instructor Notes
What a p-value is NOT:
It is NOT the probability that the null hypothesis is true - It is NOT the probability that the result happened by chance - It is NOT the probability that you'll get the same result if you repeat the study → Chapter 25: Communicating with Data: Telling Stories with Numbers
What makes this book different:
**Conversational tone** — like learning from a friend who happens to be great at explaining things - **Real-world examples** from healthcare, sports, technology, criminal justice, and everyday life - **Python and Excel** side by side — learn the tools you'll actually use - **Progressive portfolio pr → Introductory Statistics
What research tells us:
A 2012 study in *Pediatrics* found that parents who received false-positive newborn screening results experienced significantly elevated anxiety and depression levels even after the results were cleared. - A follow-up study found that some parents continued to perceive their children as "vulnerable" → Case Study 1: Medical Screening — When a Positive Test Doesn't Mean What You Think
When normality DOES matter:
Small sample sizes ($n < 30$): With small samples, the CLT hasn't kicked in, so the shape of the data matters more - Extreme outliers: Even robust procedures break down when there are extreme outliers - Prediction intervals: If you're predicting *individual* outcomes (not averages), you need the und → Chapter 10: Probability Distributions and the Normal Curve
When normality DOESN'T matter much:
Large sample sizes ($n > 30$ or so): The CLT rescues you - Mild skewness: A little skewness usually doesn't cause problems - Sample means and proportions: Even if individual observations aren't normal, their averages tend to be → Chapter 10: Probability Distributions and the Normal Curve
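The CLT rescue described above is easy to watch happen. A sketch (simulated data, not from the chapter): exponential observations are strongly skewed, but means of samples of $n = 30$ are already close to symmetric:

```python
# Sketch: the CLT at work. Individual exponential observations have
# skewness 2, but means of n = 30 observations are much closer to symmetric.
import numpy as np

rng = np.random.default_rng(7)
means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)

# Sample skewness of the 10,000 simulated sample means
centered = means - means.mean()
skew_of_means = (centered**3).mean() / means.std()**3
print(f"skewness of the sample means ≈ {skew_of_means:.2f}")
```

The theoretical skewness of these means is $2/\sqrt{30} \approx 0.37$, a fraction of the raw distribution's skewness of 2, and it keeps shrinking as $n$ grows.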
When pie charts don't work:
You have more than 5-6 categories (the slices become impossible to compare) - Categories have similar proportions (can you tell the difference between 22% and 24% slices? Neither can anyone else) - The data don't represent parts of a whole - You need precise comparisons (bar charts are always bett → Chapter 5: Exploring Data: Graphs and Descriptive Statistics
When pie charts work:
You have a small number of categories (3-5) - You want to show parts of a whole (must sum to 100%) - One or two categories dominate, and that dominance is the main story - Your audience is non-technical and familiar with pie charts → Chapter 5: Exploring Data: Graphs and Descriptive Statistics
$\bar{x}$ = sample mean - $\mu_0$ = hypothesized population mean (from $H_0$) - $s$ = sample standard deviation - $n$ = sample size - $df = n - 1$ (degrees of freedom) → Key Takeaways: Inference for Means
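The pieces listed above assemble into the one-sample t-statistic $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$. A sketch with hypothetical summary numbers (assuming SciPy for the p-value):

```python
# Sketch: building the one-sample t-statistic from its pieces.
# x_bar, mu_0, s, and n are made-up summary statistics.
import math
from scipy import stats

x_bar, mu_0, s, n = 52.3, 50.0, 8.1, 36
df = n - 1                                # degrees of freedom

t = (x_bar - mu_0) / (s / math.sqrt(n))   # signal / noise
p = 2 * stats.t.sf(abs(t), df)            # two-sided p-value

print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")
```

The numerator is the signal (how far the sample mean sits from the hypothesized mean); the denominator is the noise (the standard error of the mean).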
When $b_0 + b_1 x$ is a large positive number (say, +10), $e^{-10}$ is tiny, so $P \approx \frac{1}{1+0} \approx 1$ - When $b_0 + b_1 x$ is a large negative number (say, -10), $e^{10}$ is huge, so $P \approx \frac{1}{1+22026} \approx 0$ - When $b_0 + b_1 x = 0$, $e^0 = 1$, so $P = \frac{1}{1+1} = 0. → Chapter 24: Logistic Regression: When the Outcome Is Yes or No
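The three cases above can be verified in a couple of lines. A sketch of the logistic curve $P = 1/(1 + e^{-(b_0 + b_1 x)})$, evaluated at the linear-part values the text uses:

```python
# Sketch: the logistic function at a large positive, large negative,
# and zero value of the linear part b0 + b1*x.
import math

def logistic(linear_part):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-linear_part))

print(logistic(10))    # ≈ 0.99995  (large positive → near 1)
print(logistic(-10))   # ≈ 0.000045 (large negative → near 0)
print(logistic(0))     # = 0.5      (zero → exactly one half)
```

No matter how extreme the input, the output stays pinned between 0 and 1, which is exactly why logistic regression uses this curve for yes/no outcomes.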
Lead with the conclusion, not the methodology - Use plain language (no jargon, no formulas, no p-values in the main text) - Present numbers in context ("Black applicants were denied mortgages at 1.8 times the rate of White applicants with similar incomes and credit scores" — not "the chi-square test → Capstone Project 3: Social Justice Data Audit
Y
Yes
| **Yes** — same person, same location, matched pair | **Paired t-test** | The pairing captures within-pair change |
| **No** — different people, different units, no matching | **Two-sample t-test** | The groups are independent | → Chapter 16: Comparing Two Groups
Your business question must:
Be specific enough to guide analysis but broad enough to require multiple techniques - Have clear implications for a business decision - Involve at least one comparison between groups or one predictive relationship - Be answerable with the data available (don't promise what the data can't deliver) → Capstone Project 2: Business Analytics Report
Your investigation question must:
Involve a clear comparison between groups defined by a protected or socially relevant characteristic (race, gender, income, disability, geography, etc.) - Be answerable with the data available — don't claim to measure what the data doesn't contain - Be framed neutrally: you're investigating whether → Capstone Project 3: Social Justice Data Audit
Your research question must:
Be specific and answerable with the data you have - Involve at least one numerical variable and at least one categorical variable - Be relevant to a real public health concern - Require more than descriptive statistics to answer (i.e., it should call for inference) → Capstone Project 1: Public Health Data Investigation
each observation's distance from the mean, measured in standard deviations. So the correlation coefficient is the average product of paired z-scores. > > That's all it is. The standard deviation from Chapter 6 is doing the heavy lifting inside the correlation formula. → Chapter 22: Correlation and Simple Linear Regression
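The claim that $r$ is the average product of paired z-scores can be verified numerically. A sketch with made-up data (assuming NumPy), using the sample standard deviation and the matching $n-1$ divisor:

```python
# Sketch: Pearson's r equals the sum of paired z-score products
# divided by n - 1 (when z-scores use the sample SD, ddof=1).
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(size=50)   # y is noisily related to x

# z-scores: each observation's distance from the mean, in SD units
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

r_from_z = (zx * zy).sum() / (len(x) - 1)
r_builtin = np.corrcoef(x, y)[0, 1]

print(r_from_z, r_builtin)   # the two values match
```

That the hand-built value matches `np.corrcoef` confirms the point: the standard deviation from Chapter 6 really is doing the heavy lifting inside the correlation formula.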