Glossary

1270 terms from Introduction to Data Science: From Curiosity to Code

# A B C D E F G H I J K L M N O P Q R S T U V W Y Z

#

"Add Anaconda3 to my PATH environment variable"
The installer recommends against this, but for beginners, checking this box can make things easier. If you're unsure, leave it unchecked (the Anaconda Prompt will still work). - **"Register Anaconda3 as my default Python"** — Check this box. → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook
"Any population"
it doesn't matter if the population is uniform, skewed, bimodal, or any other shape. The magic works regardless. → Chapter 21: Distributions and the Normal Curve — The Shape That Shows Up Everywhere
"as extreme as or more extreme than"
We're not asking "what's the probability of getting exactly 63 heads?" We're asking "what's the probability of getting 63 or more (or 37 or fewer)?" This is because 64, 65, 66... heads would be even more evidence against the null. → Chapter 23: Hypothesis Testing — Making Decisions with Data (and What P-Values Actually Mean)
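The coin example can be checked numerically. This is our own sketch, not the book's code: it sums the exact binomial probabilities over both tails for 63 heads in 100 fair flips.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """P(exactly k heads in n fair flips)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 100
# "As extreme or more extreme" on both tails: 63 or more heads,
# or (by symmetry) 37 or fewer.
p_value = (sum(binom_pmf(k, n) for k in range(63, n + 1))
           + sum(binom_pmf(k, n) for k in range(0, 38)))
print(round(p_value, 4))
```

The result is comfortably below the conventional 0.05 threshold, which is why 63 heads counts as strong evidence against a fair coin.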
"As n increases"
with small samples (n=2, n=5), the bell shape is approximate. By n=30, it's usually quite good. This is why "n=30" is often cited as a rule of thumb for when the CLT kicks in. → Chapter 21: Distributions and the Normal Curve — The Shape That Shows Up Everywhere
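A quick simulation (our own sketch, using a deliberately skewed exponential population) shows the skewness of sample means fading as n grows toward 30.

```python
import random
import statistics

random.seed(1)

def skewness(xs):
    """Simple moment-based skewness estimate."""
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return statistics.mean(((x - m) / s) ** 3 for x in xs)

def sample_means(n, trials=3000):
    # Population: exponential with mean 1 (heavily right-skewed).
    return [statistics.mean(random.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

skew_n2 = skewness(sample_means(2))    # still clearly skewed
skew_n30 = skewness(sample_means(30))  # much closer to symmetric
print(round(skew_n2, 2), round(skew_n30, 2))
```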
"assuming the null hypothesis is true"
This is the crucial caveat. The p-value is computed in a hypothetical world where the null is true. It doesn't tell you the probability that the null *is* true. → Chapter 23: Hypothesis Testing — Making Decisions with Data (and What P-Values Actually Mean)
"Become a data scientist in 30 days!"
Real data science competence takes months to years. Anyone promising mastery in 30 days is selling dreams. - **"No prerequisites required" for advanced topics.** If a deep learning course says you don't need to know Python, be skeptical. You need foundations before you build higher. → Chapter 36: What's Next: Career Paths, Continuous Learning, and the Road to Intermediate Data Science
"σ / √n"
the standard deviation of sample means (called the **standard error**) gets smaller as n grows. Larger samples produce more precise estimates. → Chapter 21: Distributions and the Normal Curve — The Shape That Shows Up Everywhere
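A small simulation (our own sketch, with a hypothetical population) confirms that the standard error shrinks roughly like σ/√n.

```python
import random
import statistics

random.seed(42)
# Hypothetical population with sigma ≈ 10.
population = [random.gauss(50, 10) for _ in range(100_000)]

def standard_error(n, trials=1000):
    """Empirical std dev of sample means for samples of size n."""
    means = [statistics.mean(random.sample(population, n))
             for _ in range(trials)]
    return statistics.stdev(means)

se_5 = standard_error(5)     # theory: 10 / sqrt(5)  is about 4.47
se_50 = standard_error(50)   # theory: 10 / sqrt(50) is about 1.41
print(round(se_5, 2), round(se_50, 2))
```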
"the probability of observing..."
It's about the data, not about the hypothesis. The p-value is a property of the *data given the null*, not a property of the *null given the data*. → Chapter 23: Hypothesis Testing — Making Decisions with Data (and What P-Values Actually Mean)
(a)
Data: Country-level dataset, aggregated to continent means - Aesthetics: x = continent, y = mean life expectancy, color = continent - Geom: Bar - Scale: y-axis linear starting at 0; categorical x-axis; distinct colors per continent - Coordinates: Cartesian - Faceting: None → Chapter 14 Exercises: The Grammar of Graphics
(a) Daily work:
A data analyst's typical day involves pulling data from databases using SQL, creating dashboards and reports, computing business metrics, and presenting findings to stakeholders. The work is primarily descriptive — answering "what happened?" and "how are we doing?" → Chapter 36 Quiz: Reflection and Career Planning
(a) Ethical issues:
**Harm to users:** The algorithm may be promoting content that causes anxiety, outrage, political polarization, and decreased wellbeing. Optimizing for engagement is not the same as optimizing for user value — users can be "engaged" by content that makes them angry or upset. → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
(b)
Data: Country-level dataset, one row per country - Aesthetics: x = GDP per capita, y = CO2 per capita, size = population - Geom: Point (circle) - Scale: x and y linear (or log for GDP); size proportional to population - Coordinates: Cartesian - Faceting: None → Chapter 14 Exercises: The Grammar of Graphics
(b) Most relevant skills from this book:
For data analyst: pandas data wrangling (Chapters 7-12), visualization (Chapters 14-18), descriptive statistics (Chapter 19), and communication skills (Chapter 31). The biggest gap is SQL, which the book didn't cover in depth. - For data scientist: everything the analyst needs, plus hypothesis testing. → Chapter 36 Quiz: Reflection and Career Planning
(c)
Data: Country-level vaccination rates - Aesthetics: x = vaccination rate (binned), y = count - Geom: Bar (histogram bars) - Scale: x-axis linear; y-axis linear (count) - Coordinates: Cartesian - Faceting: By WHO region (6 panels) → Chapter 14 Exercises: The Grammar of Graphics
(c) Alternative approaches:
Optimize for "time well spent" rather than "time spent" — measure user satisfaction, not just engagement - Include content diversity metrics in the optimization function to prevent filter bubbles - Down-weight content that is flagged as divisive or misleading - Add friction to sharing → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
1. Data team (Monday):
**Format:** Jupyter notebook shared via repository, discussed in meeting - **Level of detail:** Full — methodology, code, statistical tests, limitations - **Include:** Reproducible code, data sources, confidence intervals, alternative models tested - **Leave out:** Policy recommendations → Chapter 31 Quiz: Communicating Results: Reports, Presentations, and the Art of the Data Story
1. Is preprocessing inside the pipeline?
[ ] Scaling, encoding, and imputation are all pipeline steps - [ ] No `fit_transform` on the full dataset before splitting → Case Study 2: The Data Leakage Disaster — A Cautionary Tale
1. Who benefits and who is harmed?
Benefits: Students who are correctly identified as at-risk and receive helpful advising. The university (higher retention rates, better outcomes). - Potential harms: Students falsely identified as at-risk may feel stigmatized, treated as less capable, or resentful of mandatory requirements. → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
12 slices
far too many for a pie chart. The smallest slices are indistinguishable. 3. **Gradient fills** add visual complexity without encoding data. 4. **Overlapping labels** make several categories unreadable. 5. **Drop shadow and textured background** are chartjunk. 6. **Pie chart is wrong for this data.** → Case Study 1: Redesigning a Government Report for Accessibility
14
multiplication before addition. 2. `(2 + 3) * 4` = `5 * 4` = **20** --- parentheses override precedence. 3. `10 - 6 / 2` = `10 - 3.0` = **7.0** --- division before subtraction; note the result is a float because `/` always returns float. 4. `2 ** 3 + 1` = `8 + 1` = **9** --- exponentiation before addition. → Answers to Selected Exercises
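These worked answers can be verified directly in Python:

```python
# Precedence: ** before * and /, which come before + and -;
# parentheses override everything.
print(2 + 3 * 4)      # 14
print((2 + 3) * 4)    # 20
print(10 - 6 / 2)     # 7.0  (/ always returns a float)
print(2 ** 3 + 1)     # 9
```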
2. Basic Inspection
Shape (rows x columns) - Column names and types - First and last 5 rows - Unique values for categorical columns → Chapter 6: Your First Data Analysis — Loading, Exploring, and Asking Questions of Real Data
2. Is the data representative?
Using family income and zip code as predictors means the model will disproportionately flag low-income and minority students as "at-risk." These students may face real barriers, but the model may be capturing socioeconomic disadvantage rather than individual academic risk. → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
2. School principal (Wednesday):
**Format:** 8-10 slide presentation with one-page executive summary handout - **Level of detail:** Moderate — key findings with supporting charts, recommendations - **Include:** Specific, actionable recommendations (e.g., "redirect resources from X to Y") with estimated impact - **Leave out:** Code → Chapter 31 Quiz: Communicating Results: Reports, Presentations, and the Art of the Data Story
3-6 charts
Use **consistent colors and scales** - Include **filters** for different user needs - Add **data source and last-updated date** → Key Takeaways: Communicating Results: Reports, Presentations, and the Art of the Data Story
[ ] If the data has natural groups (patients, users, companies), all observations from the same group are in the same split - [ ] No duplicate or near-duplicate rows exist across train and test → Case Study 2: The Data Leakage Disaster — A Cautionary Tale
3. Parents (Thursday):
**Format:** 4-5 slides with large, simple charts, plus a one-page handout to take home - **Level of detail:** High-level — big-picture trends and what they mean for children - **Include:** What the school is doing in response and how parents can help - **Leave out:** All statistical terminology → Chapter 31 Quiz: Communicating Results: Reports, Presentations, and the Art of the Data Story
3. Summary Statistics
Overall coverage: count, min, max, mean, median - Coverage by region - Coverage by vaccine type → Chapter 6: Your First Data Analysis — Loading, Exploring, and Asking Questions of Real Data
3. What are the failure modes?
False positives: Students wrongly labeled as at-risk are subjected to mandatory advising they do not need, potentially experiencing it as patronizing or stigmatizing. - False negatives: Students who are actually at risk but do not match the model's profile (e.g., wealthy students with personal problems) may be missed. → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
4. Could it be misused?
The risk labels could be used for purposes beyond advising — e.g., admissions committees could use them to screen out "risky" applicants. Insurers could use them to adjust financial aid. Faculty could treat labeled students differently. → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
4. Data Quality Assessment
Missing values by column - Any values outside expected ranges - Consistency checks on categorical columns → Chapter 6: Your First Data Analysis — Loading, Exploring, and Asking Questions of Real Data
4. Is the performance realistic?
[ ] The model's AUC or accuracy is in a plausible range for the problem domain - [ ] Performance doesn't drop significantly when moving from cross-validation to production - [ ] If performance seems "too good to be true," investigate → Case Study 2: The Data Leakage Disaster — A Cautionary Tale
4. Network and apply.
Attend local meetups or virtual data science communities - Contribute to open-source data analysis projects - Apply broadly --- "entry-level" job postings often list aspirational requirements, not strict minimums - Be prepared to discuss your portfolio projects in detail → Appendix E: Frequently Asked Questions
5. Is it transparent?
Can students see their risk score? Can they understand why they were flagged? Can they challenge the classification? If the system operates in secret, students cannot advocate for themselves. → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
`catplot()`
Categorical plots. Box, violin, swarm, strip, bar, count, point plots. Use when asking "How does Y vary across categories?" → Key Takeaways: Statistical Visualization with seaborn
`describe()`
summary statistics for all numeric columns: → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
`displot()`
Distribution plots. Histograms, KDEs, ECDFs, rug plots. Use when asking "What does the distribution of X look like?" → Key Takeaways: Statistical Visualization with seaborn
`dtypes`
the data type of each column: → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
`head()`
the first few rows (default: 5): → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
`if`/`elif`/`else` statements
how to make your code take different paths based on data values (like categorizing vaccination rates as "low," "medium," or "high") - **`for` loops** — how to repeat an operation for every item in a collection (like computing a statistic for each country) - **Functions** — how to package reusable logic → Chapter 3: Python Fundamentals I — Variables, Data Types, and Expressions
`info()`
a concise summary of the DataFrame: → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
`loc`
uses labels (index values and column names) - **`iloc`** — uses integer positions (like list indices) → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
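A minimal sketch with a hypothetical DataFrame (the countries and ISO-code index labels are made up) shows the two accessors reaching the same cell:

```python
import pandas as pd

# Hypothetical data; the ISO-code index labels are made up.
df = pd.DataFrame(
    {"country": ["Chad", "Peru", "Laos"],
     "coverage_pct": [58.0, 83.0, 68.0]},
    index=["TCD", "PER", "LAO"],
)

print(df.loc["PER", "coverage_pct"])  # label-based lookup
print(df.iloc[1, 1])                  # same cell by integer position
```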
`range_color=[-30, 30]`
centers the diverging colormap at zero and caps at 30 points. Counties won by more than 30 points all appear the same saturated color. - **`size="total_votes"`** — larger dots for more populous counties, so the visual weight reflects the number of votes, not just geographic area. → Case Study 2: Election Night Live — Building an Interactive Results Tracker
`read_csv`
One-line CSV loading with automatic type detection and `NaN` for missing values. Replaces `csv.DictReader` + loop + manual type conversion. → Key Takeaways: Introduction to pandas
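A minimal sketch, using an in-memory string in place of a real CSV file (the data is hypothetical), shows the automatic type detection and the `NaN` for the empty field:

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = """country,year,coverage_pct
Chad,2019,58.0
Peru,2019,
Laos,2019,68.0
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)                        # year -> int64, coverage_pct -> float64
print(df["coverage_pct"].isna().sum())  # the empty field became NaN
```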
`relplot()`
Relational plots. Scatter and line plots. Use when asking "How are X and Y related?" → Key Takeaways: Statistical Visualization with seaborn
`shape`
the dimensions, as a tuple of (rows, columns): → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation

A

A brief description of the notebook's purpose
one or two sentences explaining what question the notebook addresses, what data it uses, or what project it belongs to. This helps readers (and future-you) decide whether this is the notebook they're looking for. → Chapter 2 Quiz: Setting Up Your Toolkit
A detailed project specification
what the finished product should contain 3. **A milestone checklist** — how to break the work into manageable pieces 4. **A rubric** — how the project will be evaluated (or how you can evaluate yourself) 5. **Examples of what "done" looks like** — concrete descriptions of successful capstone projects → Chapter 35: Capstone Project: A Complete Data Science Investigation
A few more common patterns:
**Sum of squared values:** $\sum x_i^2$ means square each value, then add them up. - **Sum of squared differences from the mean:** $\sum (x_i - \bar{x})^2$. This is the numerator in the variance formula. It measures how spread out the data is. - **Double summation:** $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$ means sum over every entry of an $m \times n$ grid of values. → Appendix A: Math Foundations Refresher
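The sum-of-squared-differences pattern translates directly to code; the values here are hypothetical:

```python
# Hypothetical values.
xs = [4, 7, 7, 10]
x_bar = sum(xs) / len(xs)                          # the mean, 7.0
squared_diffs = sum((x - x_bar) ** 2 for x in xs)  # sum of (x_i - x_bar)^2
variance = squared_diffs / (len(xs) - 1)           # sample variance
print(squared_diffs, variance)
```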
A local meetup
find one on Meetup.com, Eventbrite, or your local tech community calendar 2. **An online community** — Reddit, Discord, Slack, or a forum relevant to your interests 3. **A conference or virtual event** — upcoming data science conferences, hackathons, or workshops → Chapter 36 Exercises: Planning Your Future in Data Science
Abstract
a brief summary (like the executive summary, but with more technical detail) 2. **Introduction** — background, research question, and significance 3. **Data and Methods** — data sources, cleaning steps, analytical methods, tools used 4. **Results** — findings presented with tables, charts, and statistics → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
Accessibility
[ ] The palette is colorblind-safe (or a second encoding differentiates groups). - [ ] Text has sufficient contrast against the background. - [ ] Alt text is provided for web or document use. - [ ] Font sizes are readable at the intended display size. → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
accumulator
starting with zero and adding to it on each iteration: → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
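A minimal example of the pattern (the values are hypothetical):

```python
values = [3, 1, 4, 1, 5]  # hypothetical data

total = 0        # the accumulator starts at zero
for v in values:
    total += v   # each iteration adds one item to the running total
print(total)
```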
accuracy
the fraction of predictions that are correct. For regression, `.score()` returned R-squared. Same method name, different metric. → Chapter 27: Logistic Regression and Classification — Predicting Categories
Acquire
recording the source of data 3. **Communicate** — presenting findings in a readable format 4. **Clean** — standardizing text data by removing whitespace and converting to consistent case → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
active title
a title that states the finding, not just the topic. → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Africa and Europe don't overlap
their confidence intervals are completely separate. This strongly suggests that the true means are genuinely different. (We'll formalize this idea in Chapter 23 with hypothesis testing.) → Chapter 22: Sampling, Estimation, and Confidence Intervals — How to Learn About Millions from a Handful
After deployment:
[ ] Am I monitoring the model's performance over time, including subgroup performance? - [ ] Is there a feedback mechanism for people to report problems? - [ ] Is there a plan to retrain or retire the model if it becomes harmful? → Chapter 32: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
After groupby:
Use `.reset_index()` to flatten a multi-index into regular columns - Use `.sort_values()` to rank groups - Use `.unstack()` as an alternative to `pivot_table` for reshaping grouped results → Key Takeaways: Reshaping and Transforming Data
alt text
a text description that screen readers can read aloud to blind users. Good alt text for a chart: → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Alternative approaches:
Use objective performance metrics (sales numbers, code commits, project completion rates) instead of subjective reviews. - Blind the training data by removing demographic information AND potential proxies (names, photos, university names). - Audit the model's predictions across demographic groups before deployment. → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Always fix axis ranges
prevents disorienting rescaling. 2. **Always set `animation_group`** — enables smooth entity tracking. 3. **Keep frame count reasonable** — 10-30 frames, not 100. 4. **Fix `range_color` for choropleth animations** — ensures consistent color meaning. → Key Takeaways: Interactive Visualization — plotly, Dashboard Thinking
Americas and W Pacific overlap substantially
we can't confidently say their true means are different. → Chapter 22: Sampling, Estimation, and Confidence Intervals — How to Learn About Millions from a Handful
An "All" option in the dropdown
when no specific region is selected, all data is shown. This is the default that provides the big picture. - **Side-by-side charts** — the map and scatter share the same row, giving geographic and statistical views simultaneously. - **A vertical reference line** on the trend chart shows which year is selected. → Chapter 17: Interactive Visualization — plotly, Dashboard Thinking
An overfit weather model
one that tries to predict based on dozens of local, short-lived atmospheric features — might have low bias (it captures real phenomena) but high variance (its predictions are unstable, sensitive to small measurement errors). On days when its inputs are accurate, it's brilliant. → Case Study 1: The Weather Forecaster's Dilemma — Simple vs. Complex Models
Anaconda
a free distribution that bundles Python, Jupyter, and hundreds of data science libraries into one installer - **Python** — the programming language we'll use throughout this book - **Jupyter Notebook** — the interactive environment where we write and run code alongside explanatory text → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook
Annotated key data points
the ones that support your message - [ ] **Reference lines** where relevant (targets, benchmarks, thresholds) - [ ] **Event markers** if the data spans a period with significant events - [ ] **No chartjunk** — unnecessary decoration, 3D effects, or gridlines removed → Key Takeaways: Communicating Results: Reports, Presentations, and the Art of the Data Story
Annotations:
At month 13: "Campaign launched →" with an arrow pointing to the spike - At the peak (month 13-14): "Peak: 80,000 visits/month" - At month 24: "Current: 60,000 — still 20% above pre-campaign baseline" → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Annually:
Take a course or read a book on a topic adjacent to your current skills. - Update your portfolio with a new project. → Appendix E: Frequently Asked Questions
anonymization
removing personally identifying information (names, addresses, social security numbers) from datasets. The assumption is that if you cannot identify individuals, you cannot harm them. → Chapter 32: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
ANOVA
comparing means across 3+ groups. 2. **Chi-square test** — both variables are categorical. 3. **One-sample t-test** — comparing a sample mean to a known value. 4. **Paired t-test** — the same subjects measured twice (before and after), so observations are not independent. 5. **Two-sample t-test** — comparing the means of two independent groups. → Chapter 23 Exercises: Hypothesis Testing
Anscombe's Quartet
four datasets that have nearly identical summary statistics. Each dataset has the same mean of x, the same mean of y, the same variance of x, the same variance of y, the same correlation between x and y, and the same linear regression line. If you only looked at the numbers, you would conclude these four datasets are essentially identical. → Chapter 14: The Grammar of Graphics — Why Visualization Matters and How to Think About Charts
Answer the question they asked
they want *recommendations for reducing churn*, not just a model. End with actionable business insights, not just accuracy metrics. 2. **Communicate clearly** — write narrative Markdown throughout; include a summary at the top so the reviewer can get the gist in 60 seconds. 3. **Show judgment.** → Chapter 34 Quiz: Building Your Portfolio
Ask
Section 6.1 (formulating questions about the WHO data) 2. **Acquire** — Section 6.2 (loading the CSV file) 3. **Clean** — Section 6.5 (identifying data quality issues — though we didn't fully clean the data yet) 4. **Explore** — Sections 6.3 and 6.4 (inspecting structure and computing statistics) → Chapter 6 Exercises: Your First Data Analysis
Assists and rebounds are negatively correlated
players who grab many rebounds tend not to dish many assists, and vice versa. This is a position effect (centers rebound, guards assist). → Case Study 2: Finding the Story in NBA Statistics with seaborn
Audience:
[ ] I know who my audience is and what they need to decide - [ ] I have adapted my language, depth, and format to their level - [ ] I have removed or translated all jargon → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
Automatic type detection
`year` is an integer, `coverage_pct` is a float, text columns are objects 3. **Missing values become `NaN`** — not empty strings that crash your math. `NaN` (Not a Number) is pandas's sentinel for missing data. It participates safely in computations: `NaN + 5` is `NaN`, not an error. → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
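The `NaN` behavior described here can be seen with plain Python floats, since pandas uses the same IEEE NaN:

```python
import math

nan = float("nan")          # the same IEEE NaN pandas uses for missing data
print(nan + 5)              # nan: missing values propagate rather than crash
print(math.isnan(nan + 5))  # True
print(nan == nan)           # False: NaN never equals anything, even itself
```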
Avoid regex when:
A string method solves the problem just as well (simpler is better) - The pattern would be unreadable (more than ~30 characters — consider breaking it up) - You're trying to parse a structured format like HTML or JSON (use a proper parser) - You're trying to match natural language meaning, not structure → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning

B

baseline model
the simplest possible model that gives you a reference point. → Chapter 25: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Battery factory:
Population: All 12,000 batteries produced that day. - Sample: The 50 tested batteries. - Parameter: True average lifetime of all 12,000 batteries. - Statistic: Average lifetime of the 50 tested batteries. → Answers to Selected Exercises
Before deployment:
[ ] Can I explain why the model makes specific predictions? - [ ] Have I documented the model's limitations and failure modes? - [ ] Is there a process for people to challenge or appeal the model's decisions? - [ ] Have I considered how the system could be misused? → Chapter 32: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Before you begin:
[ ] Have I clearly defined the problem I am solving, and is that problem worth solving? - [ ] Who will be affected by the results? Have I considered impacts on marginalized or vulnerable groups? - [ ] Does the data I am using represent the population I am making claims about? → Chapter 32: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Better alternatives:
`"viridis"` — perceptually uniform, colorblind-safe, works in grayscale - `"plasma"` — similar properties, different aesthetic - `"cividis"` — specifically designed for colorblind accessibility → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
between-country variability
the fact that countries differ from each other. It does *not* capture: → Case Study 2: Estimating Global Vaccination Coverage from Incomplete Data
Blood pressure study:
Population: All patients (present and future) who could take this medication. - Sample: The 80 patients studied. - Parameter: True average blood pressure effect of the medication. - Statistic: Average effect observed in the 80 patients. → Answers to Selected Exercises
boolean expression
the test. Python evaluates it and gets either `True` or `False`. - The colon `:` at the end of the `if` line is required. Forget it, and Python will complain. - The next line is **indented** by four spaces. This indentation isn't decorative — it tells Python that this line *belongs to* the `if` bloc → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
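A minimal sketch of the syntax (the variable name and threshold values are hypothetical):

```python
rate = 45  # hypothetical vaccination coverage percentage

if rate < 50:            # the boolean expression, then the required colon
    category = "low"     # indented four spaces: belongs to the if block
elif rate < 80:
    category = "medium"
else:
    category = "high"

print(category)
```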
Boolean indexing
Filtering rows using a True/False mask (`df[df["col"] > value]`). The pandas replacement for loop-with-if-statement. → Key Takeaways: Introduction to pandas
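A minimal sketch with hypothetical data, showing the mask and the filtered result:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"country": ["Chad", "Peru", "Laos"],
                   "coverage_pct": [58.0, 83.0, 68.0]})

mask = df["coverage_pct"] > 60  # a Series of True/False values
high = df[mask]                 # keeps only the rows where the mask is True
print(high["country"].tolist())
```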
bootstrap
one of the most powerful and elegant ideas in modern statistics. → Chapter 22: Sampling, Estimation, and Confidence Intervals — How to Learn About Millions from a Handful
Build the preprocessing pipeline
Mistake: preprocessing outside the pipeline, causing data leakage. → Chapter 30 Quiz: The Machine Learning Workflow

C

Calculating monthly and quarterly totals
previously done with error-prone SUM formulas across tabs - **Comparing products** — previously done by manual tallying - **Identifying trends** — previously done by eyeballing - **Producing charts** — previously done by fighting with Excel chart formatting - **Answering ad hoc questions** → Case Study 2: From Spreadsheet Chaos to Notebook Clarity — A Business Analyst's Migration Story
Calendar features:
`day_of_week`: Monday through Sunday (categorical) - `month`: January through December (categorical) - `is_holiday`: Whether tomorrow is a federal holiday (binary) - `is_weekend`: Whether tomorrow is Saturday or Sunday (binary) → Case Study 1: End-to-End — From Raw Data to Deployed Prediction
Capstone work session.
**Lab:** Evaluate models with multiple metrics. Build a complete pipeline. Write executive summary. Conduct ethical audit. **Capstone workshop.** - **Assignment:** Capstone project due end of week 14. Chapters 31--33 quiz. → 15-Week University Semester Syllabus
Causal
it's asking whether sitting in the front row *causes* better grades. (It probably doesn't — motivated students choose to sit up front, and motivation, not location, drives the grades. This is a classic confounding variable situation.) 2. **Predictive** — it's asking what we'd *expect* for an unobserved case. → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
cell
the basic building block of a Jupyter notebook. Everything you do in a notebook happens inside cells. Right now you have one empty cell, and it's waiting for you to type something. → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook
cells
either **code cells** (for Python) or **Markdown cells** (for formatted text). The **kernel** is the engine that executes your code. The **notebook server** runs in the background. - **Code cells:** You write Python code and run it with Shift+Enter. Jupyter displays the output immediately below the cell. → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook
Chaining tools:
`.query()` for filtering (cleaner in chains than bracket notation) - `.assign()` for new columns (returns new DataFrame, doesn't modify in place) - `.pipe(func)` for custom functions that take a DataFrame and return a DataFrame → Key Takeaways: Reshaping and Transforming Data
Chapter 7: Introduction to pandas
DataFrames, Series, and the grammar of data manipulation - **Chapter 8: Cleaning Messy Data** — Professional techniques for handling the problems you spotted manually - **Chapter 9: Reshaping and Transforming** — Merging datasets, pivoting tables, grouping and aggregating → Chapter 6: Your First Data Analysis — Loading, Exploring, and Asking Questions of Real Data
Chapter introduction
What you'll learn, why it matters, and what you need to have completed first. 2. **Core content** — The main teaching material, with worked examples, code walkthroughs, and visualizations. 3. **Project checkpoint** — A task that adds to your progressive public health analysis project. 4. **Key takeaways** → How to Use This Book
Chart Plan:
Question: How have vaccination rates changed over time for three countries with different trajectories? - Chart type: Multi-panel line chart (3 panels) - Data: Time series for three countries - Audience: Explanatory — for a policy brief → Chapter 15: matplotlib Foundations — Building Charts from the Ground Up
Check consistency
do the cross-validation results match the final test set results? → Case Study 2: Comparing Three Models — Which Predicts Vaccination Best?
classification tree
it predicts a category (approve, review, or deny). Decision trees can also do **regression** — predicting a continuous number, like the loan amount to offer — but we'll focus on classification in this chapter because it connects directly to the logistic regression work you did in Chapter 27. → Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss
Code:
[ ] Random seeds are set for all stochastic operations - [ ] The analysis runs top-to-bottom without manual intervention - [ ] File paths use relative paths (not absolute paths like `/Users/alex/data/`) → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Cohen's d
the standardized effect size. It divides the difference in means by the pooled standard deviation. → Chapter 23 Quiz: Hypothesis Testing
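A sketch of the computation with hypothetical scores (the groups and numbers are made up):

```python
import statistics

# Hypothetical scores for two groups.
group_a = [82, 85, 88, 90, 84]
group_b = [78, 80, 83, 79, 81]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
n_a, n_b = len(group_a), len(group_b)

# Pooled standard deviation, then Cohen's d.
pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
d = (mean_a - mean_b) / pooled_sd
print(round(d, 2))
```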
colormap
a gradient palette from yellow through orange to red. matplotlib has many colormaps: `"viridis"` (default, perceptually uniform), `"Blues"`, `"coolwarm"` (diverging), etc. - **`fig.colorbar(scatter, ...)`**: Adds a color legend showing what the colors mean. - **`edgecolors="gray"`**: Adds a thin gray edge around each point. → Chapter 15: matplotlib Foundations — Building Charts from the Ground Up
comment
text that Python ignores. Comments are notes to yourself.) → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook
commit
a snapshot of all the files at that point in time. Each commit has a unique identifier, a timestamp, an author, and a message describing what changed. → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Common Chartjunk to Remove:
3D effects on 2D charts - Gradient fills that encode nothing - Background images or textures - Excessive or heavy gridlines - Decorative borders and shadows - Redundant legends (single-group charts) - Different colors for bars in a single-variable bar chart → Key Takeaways: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Common colormaps:
Sequential: `"viridis"`, `"Blues"`, `"YlOrRd"`, `"Greens"` - Diverging: `"coolwarm"`, `"RdBu"`, `"PiYG"` - Categorical: use explicit color lists → Key Takeaways: matplotlib Foundations
Common student struggles:
Confusing data science with programming. Students assume they need to become expert coders before they can "do" data science. Emphasize that code is a means, not the end. - Struggling to formulate specific, answerable questions. Students propose vague questions like "What is happening with vaccinati → Teaching Notes for All 36 Chapters
conda environment file
a simple text file that specifies exactly which packages and versions to install. She calls it `ph-data-env.yml`: → Case Study 1: Setting Up a Data Science Environment for a University Research Lab
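A hedged sketch of what such a file might contain — the case study's actual package list isn't reproduced here, so these names and versions are illustrative:

```yaml
# ph-data-env.yml (illustrative contents)
name: ph-data-env
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pandas=2.1
  - matplotlib=3.8
  - jupyterlab=4.0
```

Anyone in the lab can then recreate the same environment with `conda env create -f ph-data-env.yml`.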
confidence intervals
ranges that tell you how precise your estimates are. And in Chapter 23, we'll use them for **hypothesis testing** — deciding whether a pattern is real or just noise. The Central Limit Theorem you just learned is the engine that powers both. → Chapter 21: Distributions and the Normal Curve — The Shape That Shows Up Everywhere
confounding variable
temperature, or more broadly, summer weather — causes *both*. Hot weather makes people buy more ice cream. Hot weather also makes people swim more, which increases the opportunity for drowning. Ice cream and drowning are correlated not because one causes the other, but because they share a common ca → Case Study 1: Ice Cream and Drowning — The Classic Confounding Story (And Its Modern Equivalents)
Consider the people in the data
especially when personal information is involved. - **Document your collection methods** so others can evaluate your approach. → Chapter 13: Getting Data from the Web — APIs, Web Scraping, and Building Your Own Datasets
Continue learning
the ethical landscape is evolving - **Accept uncertainty** — ethical dilemmas rarely have clear answers → Key Takeaways: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Converting between them:
Proportion to percentage: multiply by 100. So $0.73 \rightarrow 73\%$. - Percentage to proportion: divide by 100. So $85\% \rightarrow 0.85$. → Appendix A: Math Foundations Refresher
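The two conversions as tiny Python functions (a minimal sketch; the function names are mine, not from the appendix):

```python
def to_percentage(proportion):
    # Proportion to percentage: multiply by 100
    return proportion * 100

def to_proportion(percentage):
    # Percentage to proportion: divide by 100
    return percentage / 100
```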
Correct: (A)
**(A)** is correct. The `.str` accessor gracefully handles missing values by propagating `NaN` (displayed as `None` or `NaN`) without raising errors. This is one of its key advantages over writing a manual loop. - **(B)** would happen if you tried to call `.lower()` directly on `None` in regular Pyt → Chapter 10 Quiz: Working with Text Data
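The propagation behavior is easy to see in a minimal sketch (the values below are invented):

```python
import pandas as pd

s = pd.Series(["Apple", None, "BANANA"])

# No error: the missing value simply passes through as NaN
lowered = s.str.lower()
```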
Correct: (B)
**(A)** is too narrow — machine learning is one tool within data science, not the whole field. A project that never builds an ML model can still be data science (e.g., a descriptive analysis or a controlled experiment). - **(B)** captures the interdisciplinary nature, the focus on answering question → Chapter 1 Quiz: What Is Data Science? (And What It Isn't)
Correct: (C)
**(A)** is structured — rows, columns, numeric values. - **(B)** is structured — a relational table with a defined schema. - **(C)** is unstructured — scanned images of handwritten text have no predefined schema, no rows or columns. Extracting information requires OCR and possibly handwriting recogn → Chapter 1 Quiz: What Is Data Science? (And What It Isn't)
Correct: (D)
**(A)** works technically — `pandas.DataFrame(...)` is valid Python — but virtually nobody writes it this way. You'd have to type `pandas` in full every time. - **(B)** is the universal convention used by the pandas documentation, tutorials, books, and the overwhelming majority of data scientists. T → Chapter 7 Quiz: Introduction to pandas
correlation matrix
a table (and often a heatmap) that shows every pairwise relationship at a glance. → Chapter 24: Correlation, Causation, and the Danger of Confusing the Two
Craft:
[ ] I have practiced my presentation out loud (if presenting) - [ ] I have cut everything that does not serve the narrative - [ ] I have had someone else review my work for clarity - [ ] The document/deck/notebook can be understood without me present to explain it → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
Critical parameters:
`str.contains(pat, case=False)` — case-insensitive search - `str.contains(pat, na=False)` — treat NaN as False (essential for filtering) - `str.replace(old, new, regex=False)` — literal replacement (no regex interpretation) → Key Takeaways: Working with Text Data
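The three parameters in action, on an invented medication column:

```python
import pandas as pd

meds = pd.Series(["Aspirin 100mg", "ibuprofen 200mg", None])

# case=False matches regardless of capitalization; na=False turns the
# missing value into False so the mask is safe to filter with
mask = meds.str.contains("aspirin", case=False, na=False)
matches = meds[mask]

# regex=False treats "mg" as a literal string, not a regex pattern
spaced = meds.str.replace("mg", " mg", regex=False)
```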
CSV
the universal plain-text format. You mastered encoding (`encoding='latin-1'`), delimiters (`sep=';'`), header management (`skiprows`, `header`), selective loading (`usecols`), and type control (`dtype`). → Chapter 12: Getting Data from Files — CSVs, Excel, JSON, and Databases
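A minimal sketch combining several of those parameters; `io.StringIO` stands in for a real file (with a file on disk you would also pass `encoding="latin-1"` when the bytes aren't UTF-8):

```python
import io
import pandas as pd

# Simulated semicolon-delimited file contents (illustrative data)
raw = "country;rate;notes\nFrance;78.2;ok\nBrazil;81.5;ok\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                         # delimiter control
    usecols=["country", "rate"],     # selective loading
    dtype={"country": "string"},     # type control
)
```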

D

Data Analyst:
Focuses on answering specific business questions with existing data - Primary tools: SQL, Excel, Tableau, basic Python or R - Outputs: dashboards, reports, ad-hoc analyses - Typical question: "What were our sales by region last quarter?" → Appendix E: Frequently Asked Questions
Data Engineer:
Focuses on building and maintaining the infrastructure that makes data available - Primary tools: SQL, Python, cloud platforms (AWS, GCP), Apache Spark, Airflow - Outputs: data pipelines, warehouses, ETL systems - Typical question: "How do we move 50 million rows of transaction data from production → Appendix E: Frequently Asked Questions
Data is accessible
without the data file, the notebook can't run. 2. **Cells run in order** — out-of-order execution creates hidden state that can't be replicated. 3. **Dependencies are documented** — missing libraries cause import errors. 4. **Random seeds are set** — without seeds, random processes produce different → Chapter 6 Exercises: Your First Data Analysis
data literacy
the ability to read, interpret, and reason with data — becomes essential. Data literacy is for data what reading comprehension is for text. It's not about technical skills; it's about understanding what numbers and charts are actually *saying*. → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
Data Scientist:
Focuses on building models, conducting statistical investigations, and discovering patterns - Primary tools: Python or R, SQL, machine learning libraries - Outputs: models, statistical analyses, notebooks, research findings - Typical question: "Can we predict which customers will churn, and what dri → Appendix E: Frequently Asked Questions
Data they already had:
**Point-of-sale (POS) transaction records:** Every sale at every location, going back three years. This included the item sold, the price, the time of day, the payment method, and whether it was dine-in, takeout, or (at the two locations that already had it) drive-through. - **Staffing schedules:** → Case Study 1: From Spreadsheets to Strategy — How a Local Coffee Chain Used Data Science to Survive a Pandemic
Data they needed to find:
**Neighborhood demographics:** Population density, median household income, age distribution, and percentage of residents working from home (a new and suddenly crucial metric in 2020). - **Delivery radius data:** How far could they reasonably deliver from each location? What other coffee and food de → Case Study 1: From Spreadsheets to Strategy — How a Local Coffee Chain Used Data Science to Survive a Pandemic
Data they wished they had but didn't:
**Customer home addresses:** They knew where people *bought* coffee but not where they *lived*. Without this, they couldn't estimate delivery demand directly. - **Competitor data:** They had no idea what other coffee shops were doing — who was closing, who was opening for delivery, who was offering → Case Study 1: From Spreadsheets to Strategy — How a Local Coffee Chain Used Data Science to Survive a Pandemic
data-ink ratio
the proportion of the ink on the page that represents data versus the ink used for borders, backgrounds, gridlines, and decorations. → Chapter 14: The Grammar of Graphics — Why Visualization Matters and How to Think About Charts
Data:
[ ] The data source is documented (URL, download instructions, date accessed) - [ ] Raw data is never modified — processing steps create new files - [ ] If data is too large for git, download instructions are in the README → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Database
2 million rows updated nightly requires efficient querying, not reloading entire files. SQL lets the pipeline extract exactly what's needed. 2. **Excel** — CEOs expect Excel. They can open, sort, filter, and add comments without any technical tools. 3. **JSON** — Nested configuration (settings withi → Chapter 12 Exercises: Getting Data from Files
Databases/SQL
the enterprise format. You wrote your first SQL queries (SELECT, WHERE, JOIN, GROUP BY) and used `pd.read_sql()` to bring database data into pandas. → Chapter 12: Getting Data from Files — CSVs, Excel, JSON, and Databases
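The pattern can be sketched end-to-end with an in-memory SQLite database (the table and rows are invented for illustration):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rates (country TEXT, rate REAL)")
conn.executemany("INSERT INTO rates VALUES (?, ?)",
                 [("France", 78.2), ("Brazil", 81.5), ("Chad", 40.1)])

# SELECT + WHERE in SQL; the result arrives as a pandas DataFrame
df = pd.read_sql("SELECT country, rate FROM rates WHERE rate > 50", conn)
conn.close()
```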
DataFrame
a two-dimensional table, like a spreadsheet or a SQL table - **Series** — a one-dimensional column, like a single column from a spreadsheet → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
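The relationship between the two in a minimal sketch (illustrative data):

```python
import pandas as pd

# A DataFrame is a two-dimensional table...
df = pd.DataFrame({"country": ["France", "Brazil", "Chad"],
                   "rate": [78.2, 81.5, 40.1]})

# ...and selecting one column gives a one-dimensional Series
rates = df["rate"]
```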
Decision heuristic:
Need to look things up by name? **Dictionary** - Need an ordered collection you will modify? **List** - Need to track unique values or test membership? **Set** - Need a fixed, unchangeable group of values? **Tuple** → Chapter 5: Working with Data Structures: Dictionaries, Files, and Thinking in Data
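One example per branch of the heuristic (the values are invented):

```python
# Look things up by name -> dictionary
capitals = {"France": "Paris", "Brazil": "Brasilia"}

# Ordered collection you will modify -> list
readings = [20.1, 20.4, 19.8]
readings.append(21.0)

# Unique values / membership tests -> set (duplicates collapse)
countries_seen = {"France", "Brazil", "France"}

# Fixed, unchangeable group -> tuple
coordinate = (48.8566, 2.3522)
```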
Decorative gridlines
light gridlines help; heavy, numerous gridlines distract. - **3D effects** — adding depth to a 2D bar chart distorts bar lengths and adds no information. - **Gradient fills** — making bars fade from dark to light adds visual complexity without encoding data. - **Background images** — a photo behind → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
default parameter
if you call `format_as_percentage(42.356)`, it uses 1 decimal place. But you can override it: `format_as_percentage(42.356, 2)` gives `"42.36%"`. → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
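A sketch of what such a function might look like (the exact body in Chapter 4 may differ; only the call behavior is taken from the entry above):

```python
def format_as_percentage(value, decimals=1):
    """Format a number as a percentage string.

    `decimals=1` is the default parameter: used when the caller
    doesn't supply a second argument.
    """
    return f"{value:.{decimals}f}%"
```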
Define the problem and metric
Mistake: jumping into modeling without a clear question or choosing the wrong evaluation metric (e.g., accuracy for an imbalanced problem). → Chapter 30 Quiz: The Machine Learning Workflow
Descriptive
It asks "what happened?" using historical data. No prediction or causal claim involved. 2. **Predictive** — It asks about a future outcome for a specific patient. The goal is forecasting, not explaining *why*. 3. **Causal** — The word "cause" is a giveaway, but even rephrased ("Did the new flow incr → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Design for balance:
Use decentralized architecture (phones exchange anonymous tokens, not identities) - Implement automatic data deletion after 14-21 days - Make participation voluntary, not mandatory - Use differential privacy for aggregate analysis - Prohibit use of contact tracing data for law enforcement - Establis → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
dictionary
a mapping from measurement names to values. > - The countries in South America? That is a **set** — a collection where uniqueness matters and order does not. > - A row of CSV data? That is a **list** — an ordered sequence of fields. > - A GPS coordinate? That is a **tuple** — a fixed pair of values → Chapter 5: Working with Data Structures: Dictionaries, Files, and Thinking in Data
diminishing returns of sample size
you need to *quadruple* your sample size to cut the standard error in half. → Chapter 22: Sampling, Estimation, and Confidence Intervals — How to Learn About Millions from a Handful
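The arithmetic behind that claim, using the standard error formula $SE = \sigma / \sqrt{n}$ (the numbers are illustrative):

```python
import math

def standard_error(sigma, n):
    # SE = sigma / sqrt(n)
    return sigma / math.sqrt(n)

se_small = standard_error(15, 100)  # n = 100
se_large = standard_error(15, 400)  # n quadrupled -> SE halved
```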
Distribution thinking
seeing data as a shape, not just a single number — is the mindset shift that makes everything else in statistics click. Every time someone gives you an "average," your new reflex should be: "What's the shape? What's the spread? Is the average even a good summary?" → Key Takeaways: Descriptive Statistics
distributions
the mathematical shapes that describe how probabilities are spread across outcomes. We'll meet the normal curve (the bell curve), learn why it shows up everywhere, and discover the Central Limit Theorem — the reason that your sampling variability simulation at the end of this chapter produced bell-s → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
diversity
each tree needs to be different enough to capture different aspects of the data. → Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss
docstrings
they describe what the function does. They're not comments — they're actually stored by Python and can be accessed with `help(format_as_percentage)`. Writing docstrings is a professional habit worth starting now. → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
Documentation
the README, commit messages, and inline comments — is communication with your future self and your collaborators. Write as if the reader has never seen your project before, because they have not. → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Documentation:
[ ] The README explains what the project does, how to set it up, and how to run it - [ ] Key decisions and assumptions are documented (in the notebook or README) - [ ] Results are clearly labeled and connected to the code that produced them → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
domain knowledge
expertise in whatever field you're working in — to ask the right questions, interpret the results correctly, and know when something doesn't make sense. → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
Drop rows with missing values
this would reduce the dataset from 183 to 147 countries and systematically exclude low-income countries, biasing the analysis toward wealthier nations where data infrastructure is stronger. > 2. **Impute with regional medians** — this assumes countries within a WHO region have similar healthcare wor → Case Study 1: A Model Capstone: Complete Vaccination Rate Analysis
DRY
**Don't Repeat Yourself**. The idea is simple: if you find yourself writing the same code more than once, something is wrong. You should write it once, give it a name, and then reuse it. → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
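A before/after sketch of the principle (the cleaning logic here is invented for illustration):

```python
# Repeated: the same cleaning logic appears twice, so a future fix
# can easily land in one copy and miss the other
name_a = " France ".strip().title()
name_b = " brazil ".strip().title()

# DRY: write it once, give it a name, reuse it
def clean_name(raw):
    return raw.strip().title()

name_a = clean_name(" France ")
name_b = clean_name(" brazil ")
```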
Duplicate records
rows that appear more than once, sometimes identically, sometimes with slight variations. The same patient shows up twice because they were entered at two different clinics. The same country appears under both "Cote d'Ivoire" and "Ivory Coast." → Chapter 8: Cleaning Messy Data: Missing Values, Duplicates, Type Errors, and the 80% of the Job
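Both flavors of duplicate show up in this minimal sketch (invented records):

```python
import pandas as pd

df = pd.DataFrame({"patient": ["Ana", "Ana", "Ana", "Ben"],
                   "clinic":  ["North", "North", "South", "North"]})

# Drops rows that are identical across every column
exact = df.drop_duplicates()

# Keeps the first row per patient, even when other columns vary
per_patient = df.drop_duplicates(subset="patient")
```

Slight variations like "Cote d'Ivoire" vs. "Ivory Coast" need standardizing (e.g. with `replace`) before `drop_duplicates` can catch them.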
During analysis:
[ ] Have I checked for representation gaps in the data? Which groups are underrepresented or absent? - [ ] If I am using proxy variables, could any of them serve as proxies for protected attributes? - [ ] Have I tested my model's performance across subgroups, not just overall? - [ ] Am I optimizing → Chapter 32: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice

E

Elena
Public health analyst exploring COVID vaccination rates across demographics and regions, discovering disparities and communicating findings to policymakers 2. **Marcus** — Small business owner analyzing sales data to understand seasonal patterns, customer segments, and product promotion strategy 3. → Introduction to Data Science: From Curiosity to Code
Encoding
[ ] The chart type is appropriate for the data type and question. - [ ] Position and length are used for the most important comparisons. - [ ] Color is used purposefully, not decoratively. - [ ] No variable is encoded with two channels redundantly. → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Endpoints:
`/games?date=YYYY-MM-DD` — returns games for a specific date - `/boxscore/{game_id}` — returns player stats for a game - `/standings` — returns current standings → Case Study 1: Building a Sports Stats Pipeline — Priya Automates NBA Data Collection
Environment:
[ ] Dependencies are recorded in `requirements.txt` or `environment.yml` - [ ] Library versions are pinned (exact versions, not just package names) - [ ] The environment can be recreated from scratch on a clean machine → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
error rate equality
the idea that innocent people of all races should be equally likely to be wrongly flagged. Northpointe chose to prioritize **predictive parity** — the idea that a score of "7" should mean the same thing regardless of race. → Case Study 1: When Algorithms Discriminate — Bias in Hiring, Lending, and Criminal Justice
Ethical framework application:
**Who benefits?** The company (reduced churn). **Who is harmed?** Users who want to cancel but cannot easily do so (financial harm, frustration, erosion of trust). - **Was there consent?** No — users did not agree to be part of this test. - **Is it transparent?** No — the design intentionally obscur → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Ethical reflection should engage genuinely:
Who is represented in your data, and who is invisible? - How could your findings be misinterpreted or misused? - What assumptions have you embedded in your analysis choices? - What responsibility do you have as the person presenting these results? → Chapter 35 Exercises: Capstone Project Milestones
Ethics:
[ ] I have not cherry-picked findings - [ ] I have disclosed limitations - [ ] I have not let my visualizations create misleading impressions - [ ] My communication is honest about what the data can and cannot support → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
Europe has the narrowest interval
both because it has the most countries (n=53) and the smallest standard deviation. We're quite precise about Europe's average. → Chapter 22: Sampling, Estimation, and Confidence Intervals — How to Learn About Millions from a Handful
Evaluate on the held-out test set
Mistake: reporting only accuracy, or using the test set more than once. → Chapter 30 Quiz: The Machine Learning Workflow
Evidence:
Rural vaccination rates declined by an average of 11 percentage points between 2019 and 2022, compared to 3 points in urban areas, creating a growing rural-urban gap. - Among rural counties, those with community health clinics maintained rates 8 points higher than demographically similar counties wi → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
Excel
the business world's format. You learned sheet selection (`sheet_name`), handling messy layouts (`skiprows`, `header`), and loading all sheets into a dictionary (`sheet_name=None`). → Chapter 12: Getting Data from Files — CSVs, Excel, JSON, and Databases
Exercise 1.10
*Reformulating vague questions* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.12
*Matching lifecycle stages to activities* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.13
*Stakeholder translation* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.14
*Data science in the headlines* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.15
*When data science goes wrong* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.16
*The data you don't have* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.17
*Structured and unstructured in the same project* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.19
*The lifecycle is a lie (sort of)* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.20
*Cross-domain transfer* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.21
*Building a flawed argument* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.22
*Designing a question for each type* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.23
*Data science origin story* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.24
*Interview a data practitioner* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.25
*Design your own anchor example* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.3
*The lifecycle, from memory* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.4
*Three flavors of questions* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.5
*Structured vs. unstructured* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.6
*Domain knowledge matters* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.7
*Data literacy for everyone* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.8
*Lifecycle in action* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 1.9
*Is this data science?* → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Exercise 10.10
*Text replacement pipeline* → Chapter 10 Exercises: Working with Text Data
Exercise 10.11
*Regex capture groups* → Chapter 10 Exercises: Working with Text Data
Exercise 10.13
*Building a text normalization function* → Chapter 10 Exercises: Working with Text Data
Exercise 10.14
*Cleaning medication names* → Chapter 10 Exercises: Working with Text Data
Exercise 10.15
*Parsing semi-structured log entries* → Chapter 10 Exercises: Working with Text Data
Exercise 10.16
*Survey response categorization* → Chapter 10 Exercises: Working with Text Data
Exercise 10.17
*Address parsing and standardization* → Chapter 10 Exercises: Working with Text Data
Exercise 10.18
*Finding patterns in social media text* → Chapter 10 Exercises: Working with Text Data
Exercise 10.19
*Regex crossword* (3-star) → Chapter 10 Exercises: Working with Text Data
Exercise 10.2
*The `.str` accessor* → Chapter 10 Exercises: Working with Text Data
Exercise 10.20
*Building a data validation function* (3-star) → Chapter 10 Exercises: Working with Text Data
Exercise 10.21
*Reverse engineering regex* (4-star) → Chapter 10 Exercises: Working with Text Data
Exercise 10.22
*Complete text cleaning pipeline* (4-star) → Chapter 10 Exercises: Working with Text Data
Exercise 10.23
*Combining text cleaning with data types (Ch. 3, 8, 10)* → Chapter 10 Exercises: Working with Text Data
Exercise 10.24
*Text cleaning meets groupby (Ch. 9, 10)* → Chapter 10 Exercises: Working with Text Data
Exercise 10.25
*Regex and missing data (Ch. 8, 10)* → Chapter 10 Exercises: Working with Text Data
Exercise 10.3
*Regex as a mini-language* → Chapter 10 Exercises: Working with Text Data
Exercise 10.4
*When NOT to use regex* → Chapter 10 Exercises: Working with Text Data
Exercise 10.5
*Regex building blocks* → Chapter 10 Exercises: Working with Text Data
Exercise 10.6
*Cleaning company names* → Chapter 10 Exercises: Working with Text Data
Exercise 10.7
*Extracting numbers from descriptions* → Chapter 10 Exercises: Working with Text Data
Exercise 10.8
*Filtering with `.str.contains()`* → Chapter 10 Exercises: Working with Text Data
Exercise 10.9
*Splitting compound columns* → Chapter 10 Exercises: Working with Text Data
Exercise 11.1
*Why dates need parsing* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.10
*Rolling vs. expanding* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.11
*Extracting date components for analysis* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.12
*date_range for scheduling* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.14
*Sales time series analysis* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.15
*Pandemic timeline analysis* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.16
*Handling multiple date formats* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.17
*Finding temporal patterns* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.18
*Event-driven time series* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.19
*Complete time series pipeline* (3-star) → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.20
*Multi-series comparison* (3-star) → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.21
*Handling missing dates in a time series* (4-star) → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.22
*Year-over-year analysis* (4-star) → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.23
*Combining text cleaning and date parsing (Ch. 10, 11)* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.24
*Merging time series data (Ch. 9, 11)* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.25
*Data cleaning pipeline with dates (Ch. 8, 10, 11)* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.3
*Resampling vs. groupby* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.4
*Rolling windows explained* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.5
*Timedelta vs. DateOffset* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.6
*Parsing a messy date column* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.8
*Building and using a DatetimeIndex* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 11.9
*Resampling daily to weekly and monthly* → Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
Exercise 12.11
*JSON with record_path* → Chapter 12 Exercises: Getting Data from Files
Exercise 12.12
*SQL SELECT and WHERE* → Chapter 12 Exercises: Getting Data from Files
Exercise 12.13
*The mystery file* ⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.14
*Choosing the right format* ⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.15
*The encoding detective* ⭐⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.16
*Multi-format integration* ⭐⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.18
*Build a data loading report* ⭐⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.19
*SQL vs. pandas comparison* ⭐⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.20
*The format converter* ⭐⭐⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.21
*Chunked loading for large files* ⭐⭐⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.22
*The multi-source weather project* ⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.23
*Review: Boolean indexing (Chapter 7)* ⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.24
*Review: String methods (Chapter 10)* ⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.25
*Review: Merging (Chapter 9) meets new loading* ⭐⭐⭐ → Chapter 12 Exercises: Getting Data from Files
Exercise 12.5
*When to use a database* → Chapter 12 Exercises: Getting Data from Files
Exercise 12.6
*CSV with encoding issues* → Chapter 12 Exercises: Getting Data from Files
Exercise 12.7
*Semicolon-delimited CSV* → Chapter 12 Exercises: Getting Data from Files
Exercise 12.9
*Excel multi-sheet loading* → Chapter 12 Exercises: Getting Data from Files
Exercise 13.1
*The request-response cycle* → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.10
*BeautifulSoup basics* → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.11
*HTML table scraping* → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.14
*Building a multi-city weather collector* ⭐⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.15
*Debugging a scraper* ⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.17
*robots.txt analysis* ⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.18
*Complete data pipeline* ⭐⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.19
*Scraping vs. API comparison* ⭐⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.20
*Retry logic with exponential backoff* ⭐⭐⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.21
*Caching for repeated API calls* ⭐⭐⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.22
*Ethical scenario analysis* ⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.23
*Review: Data types (Chapter 7)* ⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.24
*Review: Merge patterns (Chapter 9)* ⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.25
*Review: Dates (Chapter 11) + APIs* ⭐⭐⭐ → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.6
*Making a GET request* → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.8
*Handling errors gracefully* → Chapter 13 Exercises: Getting Data from the Web
Exercise 13.9
*Parsing JSON responses into DataFrames* → Chapter 13 Exercises: Getting Data from the Web
Exercise 14.1
*Identifying components* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.10
*Progressive project chart plans* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.11
*Spot the manipulation* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.13
*Exploratory vs. explanatory* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.14
*When the grammar breaks down* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.15
*The ethics of emphasis* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.16
*Design for a different audience* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.17
*The visualization taxonomy* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.18
*Anscombe's lesson revisited* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.19
*Cross-cultural chart design* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.2
*Changing one component* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.20
*Your vaccination data chart plans* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.21
*Critique a real public health chart* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.22
*The pie chart debate* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.23
*Is "data-ink ratio" always right?* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.24
*Visualization and trust* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.25
*Your visualization philosophy* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.3
*Grammar decomposition* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.5
*The "it depends" exercise* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.6
*Marcus's menu analysis* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.8
*Three views of one dataset* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 14.9
*Redesign a default chart* → Chapter 14 Exercises: The Grammar of Graphics
Exercise 15.1
*Your first line chart* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.10
*Highlighting a single bar* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.11
*Side-by-side histograms* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.13
*Small multiples for time series* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.14
*Vaccination rates by region* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.15
*Before and after: the Tufte makeover* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.16
*Saving in multiple formats* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.17
*Annotated policy chart* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.18
*Scatter plot with size encoding* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.19
*The complete workflow* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.2
*Bar chart of course enrollments* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.20
*Recreate a published chart* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.21
*Dynamic figure sizing* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.22
*Bar chart of vaccination rates by region* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.23
*Line chart of vaccination trends* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.24
*Scatter plot of GDP vs. vaccination rate* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.25
*Histogram of vaccination rate distribution* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.3
*Scatter plot of study hours vs. exam score* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.4
*Histogram of exam scores* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.5
*Horizontal bar chart* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.6
*Multi-line chart with legend* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.7
*Removing chart clutter* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.8
*Color-coded scatter plot* → Chapter 15 Exercises: matplotlib Foundations
Exercise 15.9
*Annotation practice* → Chapter 15 Exercises: matplotlib Foundations
Exercise 16.1
*Figure-level vs. axes-level* → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.12
*Theme and palette exploration* → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.14
*Weather data analysis* ⭐⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.15
*Student grades exploration* ⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.16
*E-commerce analysis* ⭐⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.17
*The complete exploration workflow* ⭐⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.18
*Publication-ready figure* ⭐⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.19
*Custom FacetGrid mapping* ⭐⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.2
*When to use which categorical plot* → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.20
*Comparing distributions rigorously* ⭐⭐⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.21
*seaborn vs. matplotlib comparison* ⭐⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.22
*Data cleaning before visualization* ⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.23
*Reshaping for visualization* ⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.24
*Grammar of Graphics revisited* ⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.25
*From question to visualization* ⭐⭐⭐ → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.5
*Correlation heatmap interpretation* → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.6
*Distribution exploration* → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.7
*Categorical comparisons* → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.8
*Regression exploration* → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 16.9
*Correlation heatmap* → Chapter 16 Exercises: Statistical Visualization with seaborn
Exercise 17.1
*Static vs. interactive* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.10
*Animated choropleth* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.11
*Faceted interactive plot* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.12
*Custom tooltips and formatting* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.13
*HTML export comparison* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.14
*COVID-style time series dashboard* ⭐⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.15
*Interactive box plot exploration* ⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.16
*Building a simple Dash dashboard* ⭐⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.17
*Dashboard with slider and dropdown* ⭐⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.18
*Comparing plotly templates* ⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.19
*Multi-view coordinated exploration* ⭐⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.2
*plotly.express vs. plotly.graph_objects* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.20
*The complete interactive report* ⭐⭐⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.21
*plotly.graph_objects deep dive* ⭐⭐⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.22
*Dash with multiple callbacks* ⭐⭐⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.23
*Data preparation for plotly* ⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.24
*seaborn vs. plotly side-by-side* ⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.25
*From question to interactive visualization* ⭐⭐⭐ → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.3
*Choropleth requirements* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.4
*Dashboard callback model* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.5
*Animation best practices* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.6
*Interactive scatter plot* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.7
*Interactive line chart* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 17.9
*Animated scatter plot* → Chapter 17 Exercises: Interactive Visualization — plotly, Dashboard Thinking
Exercise 18.1
*Pre-attentive features* → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.10
*Truncated axis demonstration* → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.11
*Pie chart replacement* → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.12
*Aspect ratio exploration* → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.14
*The design checklist in practice* → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.15
*Redesign a government chart* ⭐⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.16
*Accessibility audit* ⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.17
*The misleading chart gallery* ⭐⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.18
*Visualization for different audiences* ⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.19
*Annotation practice* ⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.20
*Complete redesign portfolio* ⭐⭐⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.21
*Design principles debate* ⭐⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.22
*Ethics scenario analysis* ⭐⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.23
*Tool selection and design* ⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.24
*End-to-end visualization workflow* ⭐⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.25
*Teaching visualization principles* ⭐⭐⭐ → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.3
*Cleveland and McGill hierarchy* → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.5
*Color palette types* → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.7
*Identifying misleading techniques* → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.8
*Colorblind-safe redesign* → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 18.9
*Before/after: Removing chartjunk* → Chapter 18 Exercises: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Exercise 19.10
*Visualizing center and spread* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.11
*Outlier detection function* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.12
*Grouped descriptive statistics* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.13
*Standard deviation intuition builder* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.14
*Comparing describe() across subgroups* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.15
*The effect of sample size on statistics* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.16
*Robust vs. non-robust statistics* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.17
*Misleading averages in the news* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.18
*Simpson's Paradox and descriptive statistics* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.19
*Elena's full vaccination analysis* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.20
*Comparing two groups rigorously* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.21
*Building a descriptive statistics dashboard* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.22
*Historical data investigation* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.23
*Percentile-based analysis* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.27
*Coefficient of variation* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.28
*Recreating Anscombe's Quartet* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.29
*The Datasaurus Dozen* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.3
*Choosing the right measure* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.30
*Design your own misleading statistic* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.4
*Five-number summary interpretation* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.6
*Z-score interpretation* → Chapter 19 Exercises: Descriptive Statistics
Exercise 19.9
*Complete descriptive profile* → Chapter 19 Exercises: Descriptive Statistics
Exercise 2.10
*Mixing code and Markdown* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.11
*The cell creation drill* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.12
*The delete and undo drill* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.13
*Cell type switching* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.15
*The well-organized notebook* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.16
*Naming and navigation* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.17
*The Restart & Run All test* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.19
*Elena's quick calculation* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.2
*Kernels and notebooks* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.20
*Priya's three-point comparison* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.21
*Jordan's grade comparison* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.22
*Build a Markdown reference card* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.23
*The debugging challenge* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.24
*Exploring the Help system* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.25
*Your own analysis notebook* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.4
*The out-of-order problem* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.6
*Hello, Data Science* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.7
*Python as a calculator* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 2.8
*A data science calculation* → Chapter 2 Exercises: Setting Up Your Toolkit
Exercise 20.1
*Three interpretations* → Chapter 20 Exercises: Probability Thinking
Exercise 20.11
*The birthday problem (extended)* → Chapter 20 Exercises: Probability Thinking
Exercise 20.12
*Monty Hall extended* → Chapter 20 Exercises: Probability Thinking
Exercise 20.13
*Medical test Bayes simulation* → Chapter 20 Exercises: Probability Thinking
Exercise 20.14
*Sampling variability experiment* → Chapter 20 Exercises: Probability Thinking
Exercise 20.15
*Simulate P(at least one) problems* → Chapter 20 Exercises: Probability Thinking
Exercise 20.16
*Random walk simulation* → Chapter 20 Exercises: Probability Thinking
Exercise 20.17
*Monte Carlo estimation of pi* → Chapter 20 Exercises: Probability Thinking
Exercise 20.18
*Simulating rare events* → Chapter 20 Exercises: Probability Thinking
Exercise 20.19
*Elena's sampling strategy* → Chapter 20 Exercises: Probability Thinking
Exercise 20.20
*The false positive paradox in real life* → Chapter 20 Exercises: Probability Thinking
Exercise 20.21
*Probability in risk communication* → Chapter 20 Exercises: Probability Thinking
Exercise 20.22
*Simulating a real-world process* → Chapter 20 Exercises: Probability Thinking
Exercise 20.23
*Bayesian updating sequence* → Chapter 20 Exercises: Probability Thinking
Exercise 20.24
*The coupon collector problem* → Chapter 20 Exercises: Probability Thinking
Exercise 20.25
*Two-envelope problem* → Chapter 20 Exercises: Probability Thinking
Exercise 20.26
*Simulating the St. Petersburg paradox* → Chapter 20 Exercises: Probability Thinking
Exercise 20.27
*Simpson's Paradox revisited with probability* → Chapter 20 Exercises: Probability Thinking
Exercise 20.28
*Probability calibration* → Chapter 20 Exercises: Probability Thinking
Exercise 20.29
*Benford's Law simulation* → Chapter 20 Exercises: Probability Thinking
Exercise 20.30
*Your own Monte Carlo* → Chapter 20 Exercises: Probability Thinking
Exercise 20.4
*The gambler's fallacy* → Chapter 20 Exercises: Probability Thinking
Exercise 20.5
*Conditional probability intuition* → Chapter 20 Exercises: Probability Thinking
Exercise 20.6
*Bayes' theorem in words* → Chapter 20 Exercises: Probability Thinking
Exercise 20.9
*Coin flip simulator* → Chapter 20 Exercises: Probability Thinking
Exercise 21.1
*Discrete vs. continuous* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.10
*Distribution fitting* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.12
*Binomial simulation vs. formula* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.14
*Z-score probability calculator* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.15
*Standard error exploration* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.16
*Comparing distributions visually* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.17
*Real data normality check* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.18
*Inverse CDF problems* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.19
*Elena's complete distribution analysis* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.20
*When the CLT breaks down* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.21
*Normal approximation to the binomial* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.22
*Distribution mixture* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.23
*Log-normal distribution* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.24
*Comparing populations with normal models* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.25
*The multivariate normal* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.26
*Bootstrap distribution* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.27
*Power of normality tests* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.28
*Build a distribution explorer* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.29
*The German Tank Problem* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.30
*Your own distribution analysis* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.4
*Z-score interpretation* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.5
*CLT in plain English* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.6
*Matching distributions* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.8
*Binomial in context* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 21.9
*scipy.stats exploration* → Chapter 21 Exercises: Distributions and the Normal Curve
Exercise 22.1
*Population vs. sample identification* → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.10
*Comparing two groups with CIs* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.11
*The Literary Digest in numbers* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.12
*Interpreting real-world CIs* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.13
*Sensitivity of CI to outliers* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.14
*Simulate the sampling distribution* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.15
*Bootstrap from scratch* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.16
*Visualize CI coverage* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.17
*Effect of sample size on CI width* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.18
*Stratified vs. simple random sampling simulation* ⭐⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.19
*Bootstrap for a correlation coefficient* ⭐⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.2
*Bias identification* ⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.20
*The paradox of precision* ⭐⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.21
*Designing a sampling plan* ⭐⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.22
*When CIs mislead: the issue of multiple comparisons* ⭐⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.23
*Bayesian vs. frequentist confidence* ⭐⭐⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.24
*Project extension: Regional comparisons with CIs* ⭐⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.25
*The ethics of uncertainty communication* ⭐⭐⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.3
*Standard error reasoning* ⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.4
*Confidence interval interpretation* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.5
*Confidence level trade-offs* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.6
*When the population is small* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.7
*Sampling strategies for the real world* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.8
*Computing confidence intervals by hand* → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 22.9
*Sample size determination* ⭐⭐ → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Exercise 23.10
*Power calculation* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.11
*Before-and-after comparison* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.12
*Interpreting non-significance* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.13
*Reporting results correctly* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.14
*Simulate the null distribution* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.15
*Permutation test implementation* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.16
*Power simulation* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.17
*Multiple testing simulation* ⭐⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.18
*Complete hypothesis test workflow* ⭐⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.19
*Effect of sample size on p-values* ⭐⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.2
*P-value interpretation* ⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.20
*The significance filter* ⭐⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.21
*Designing a study* ⭐⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.22
*Critiquing a published study* ⭐⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.23
*The ASA statement on p-values* ⭐⭐⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.24
*Project extension: Comprehensive analysis* ⭐⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.25
*Reflection: The limits of testing* ⭐⭐⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.3
*Type I and Type II errors* ⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.4
*Statistical vs. practical significance* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.5
*The logic of "failing to reject"* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.6
*Choosing the right test* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.7
*Multiple testing reasoning* ⭐⭐ → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.8
*Two-sample t-test by hand* → Chapter 23 Exercises: Hypothesis Testing
Exercise 23.9
*Chi-square test interpretation* → Chapter 23 Exercises: Hypothesis Testing
Exercise 24.1
*Interpreting correlation coefficients* → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.10
*Correlation matrix interpretation* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.11
*Partial correlation* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.12
*Correlation does not imply no confounding* ⭐⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.13
*When r = 0 doesn't mean no relationship* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.14
*Correlation exploration* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.15
*Simulating confounding* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.16
*Simpson's paradox simulation* ⭐⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.17
*Anscombe's quartet extended* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.18
*Correlation significance testing* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.19
*Project: Full correlation analysis* ⭐⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.2
*Spotting confounders* ⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.20
*News article analysis* ⭐⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.21
*Designing an RCT* ⭐⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.22
*The chocolate-Nobel Prize paper* ⭐⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.23
*Ethical implications of causal claims* ⭐⭐⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.24
*Correlation in your daily life* ⭐⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.25
*The limits of observational data* ⭐⭐⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.3
*Three explanations for every correlation* ⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.4
*Evaluating causal claims* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.5
*Simpson's paradox identification* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.6
*Pearson vs. Spearman* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.7
*The hierarchy of evidence* ⭐⭐ → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 24.9
*Computing and interpreting correlations* → Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two
Exercise 25.1
*Models as simplifications* ⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.10
*Framing a prediction problem* → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.11
*When models mislead* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.12
*Choosing features wisely* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.13
*The bias-variance tradeoff in practice* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.14
*Interpreting train-test results* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.15
*Train-test split in practice* → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.16
*Visualizing overfitting* → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.17
*Comparing train and test scores* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.18
*The effect of training set size* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.19
*Baseline comparison* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.2
*Prediction vs. explanation* ⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.20
*Random state exploration* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.21
*The model simplification paradox* → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.22
*Ethical analysis: predictive policing* ⭐⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.23
*The "just predict the average" challenge* ⭐⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.24
*Design your own modeling problem* ⭐⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.25
*Reflection: what changes now?* ⭐⭐⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.3
*Supervised vs. unsupervised* ⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.4
*Identifying features and targets* ⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.5
*The cardinal rule* ⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.6
*Overfitting vs. underfitting* ⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.7
*The bias-variance tradeoff in everyday life* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.8
*Baseline thinking* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 25.9
*The Goldilocks zone* ⭐⭐ → Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Exercise 26.1
*Interpreting slope and intercept* ⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.10
*Baseline comparison* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.11
*Feature engineering* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.12
*Extrapolation danger* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.13
*Simple linear regression from scratch* → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.14
*Full workflow with train-test split* → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.15
*Multiple regression comparison* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.16
*Residual analysis* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.17
*Log transformation* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.18
*Feature scaling and coefficient comparison* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.19
*Overfitting with too many features* ⭐⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.2
*Understanding residuals* ⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.20
*Predicted vs. actual plot* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.21
*The interpretation trap* → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.22
*When to stop adding features* ⭐⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.23
*Model comparison report* ⭐⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.24
*Real-world regression critique* ⭐⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.25
*Teaching linear regression* ⭐⭐⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.3
*Interpreting R-squared* ⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.4
*Least squares intuition* ⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.5
*Multiple regression coefficients* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.6
*When linear regression fails* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.7
*Multicollinearity* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.8
*Choosing features for a regression model* → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 26.9
*Reading a regression report* ⭐⭐ → Chapter 26 Exercises: Linear Regression — Your First Predictive Model
Exercise 27.1
*Regression vs. classification* ⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.10
*The cost of errors in healthcare* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.11
*Comparing models with different error profiles* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.12
*Probability calibration* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.13
*Basic logistic regression* → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.14
*Using predict_proba* → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.15
*Threshold exploration* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.16
*Confusion matrix visualization* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.17
*Handling class imbalance* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.18
*Complete vaccination classification* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.19
*Comparing regression and classification approaches* ⭐⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.2
*The sigmoid function* ⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.20
*Feature scaling effect* ⭐⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.21
*The threshold as a policy decision* → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.22
*Ethical implications of classification in criminal justice* ⭐⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.23
*Model comparison essay* ⭐⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.24
*When probabilities beat predictions* ⭐⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.25
*Building a complete classification report* ⭐⭐⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.3
*Reading a confusion matrix* ⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.4
*Precision vs. recall* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.5
*The threshold effect* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.6
*Class imbalance awareness* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.7
*Why linear regression fails for classification* ⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.8
*Designing a classification system* → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 27.9
*Interpreting logistic regression output* ⭐⭐ → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Exercise 28.10
*Depth vs. accuracy curve* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.11
*Random forest comparison* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.12
*The effect of n_estimators* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.14
*Permutation importance* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.15
*Marcus's bakery decisions* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.16
*The interpretability trade-off* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.17
*Priya's NBA predictor* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.18
*Jordan's grade predictor* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.19
*Elena's policy tree* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.20
*Model comparison table* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.21
*The bias-variance trade-off in trees* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.23
*Design a model selection strategy* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.24
*Out-of-bag evaluation* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.25
*Beyond random forests: gradient boosting preview* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.3
*Information gain calculation* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.5
*Bagging and bootstrap* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.7
*Trees vs. linear models* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 28.9
*Visualize and interpret* → Chapter 28 Exercises: Decision Trees and Random Forests
Exercise 29.11
*ROC curves for model comparison* → Chapter 29 Exercises: Evaluating Models
Exercise 29.12
*Cross-validation comparison* → Chapter 29 Exercises: Evaluating Models
Exercise 29.13
*Learning curve diagnosis* → Chapter 29 Exercises: Evaluating Models
Exercise 29.14
*Regression evaluation* → Chapter 29 Exercises: Evaluating Models
Exercise 29.15
*Elena's vaccination model evaluation* → Chapter 29 Exercises: Evaluating Models
Exercise 29.16
*Marcus's sales prediction* → Chapter 29 Exercises: Evaluating Models
Exercise 29.17
*The metric mismatch* → Chapter 29 Exercises: Evaluating Models
Exercise 29.18
*Jordan's grading fairness* → Chapter 29 Exercises: Evaluating Models
Exercise 29.19
*Threshold selection in practice* → Chapter 29 Exercises: Evaluating Models
Exercise 29.2
*Confusion matrix from scratch* → Chapter 29 Exercises: Evaluating Models
Exercise 29.20
*Metric selection matrix* → Chapter 29 Exercises: Evaluating Models
Exercise 29.21
*Cross-validation edge cases* → Chapter 29 Exercises: Evaluating Models
Exercise 29.22
*The full evaluation workflow* → Chapter 29 Exercises: Evaluating Models
Exercise 29.23
*Designing an evaluation plan* → Chapter 29 Exercises: Evaluating Models
Exercise 29.24
*Precision-recall curves* → Chapter 29 Exercises: Evaluating Models
Exercise 29.25
*Custom scoring functions* → Chapter 29 Exercises: Evaluating Models
Exercise 29.3
*Precision vs. recall scenarios* → Chapter 29 Exercises: Evaluating Models
Exercise 29.6
*Cross-validation purpose* → Chapter 29 Exercises: Evaluating Models
Exercise 29.7
*Regression metrics comparison* → Chapter 29 Exercises: Evaluating Models
Exercise 29.8
*Confusion matrix in code* → Chapter 29 Exercises: Evaluating Models
Exercise 29.9
*Classification report interpretation* → Chapter 29 Exercises: Evaluating Models
Exercise 3.1
*Variables as labels* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.11
*Type conversion chain* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.12
*Comparison expressions* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.13
*f-string formatting* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.14
*Augmented assignment* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.2
*Type identification* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.20
*Debug this: multiple errors* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.22
*Temperature conversion* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.23
*Data summary report* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.24
*Course grade calculation* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.25
*Cleaning messy strings* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.26
*Data type detective* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.27
*Building a data dictionary* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.28
*Floating-point exploration* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.29
*Data science lifecycle revisited (from Chapter 1)* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.3
*Operator precedence* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.30
*Jupyter workflow (from Chapter 2)* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.4
*Assignment vs. comparison* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.6
*Immutability of strings* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.8
*Arithmetic with data* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 3.9
*String methods practice* → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Exercise 30.10
*Model comparison pipeline* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.13
*Missing value handling in pipelines* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.14
*Examining grid search results* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.15
*Elena's complete pipeline* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.16
*Marcus's sales prediction pipeline* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.18
*Deployment scenario* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.19
*Cost-sensitive grid search* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.20
*End-to-end workflow design* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.21
*Pipeline vs. manual workflow debate* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.22
*Feature engineering within pipelines* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.23
*Reflecting on Part V* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.24
*Nested cross-validation* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.25
*Building a reusable ML template* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.3
*Parameter naming conventions* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.6
*Reproducibility checklist* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.7
*Your first pipeline* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 30.8
*ColumnTransformer practice* → Chapter 30 Exercises: The Machine Learning Workflow
Exercise 31.1
*Findings vs. insights* → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.10
*Narrative arc construction* ⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.11
*Dashboard design critique* ⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.12
*Translating numbers* ⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.13
*Building a narrative notebook* ⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.14
*Designing a slide deck outline* ⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.15
*Annotated visualization* ⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.16
*Communication audit* ⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.17
*Three versions of one story* ⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.18
*Ethical communication dilemma* ⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.19
*From notebook to narrative* ⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.2
*Audience identification* ⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.20
*The anti-dashboard* ⭐⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.21
*Stakeholder simulation* ⭐⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.22
*Cross-format communication plan* ⭐⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.23
*Presentation rehearsal and feedback* ⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.24
*Communication failure post-mortem* ⭐⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.25
*The one-page portfolio piece* ⭐⭐⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.3
*The Pyramid Principle* ⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.4
*Narrative arc identification* ⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.5
*Spotting communication mistakes* ⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.6
*Rewriting for a different audience* ⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.7
*Writing assertive slide titles* ⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.8
*Annotating a chart* ⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 31.9
*Writing an executive summary* ⭐⭐ → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Exercise 32.1
*Identifying where bias enters the pipeline* → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.10
*Auditing a dataset* ⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.11
*The A/B test ethics* ⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.12
*Disparate impact analysis* ⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.13
*COMPAS deep dive* ⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.14
*Facial recognition policy* ⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.15
*Cambridge Analytica analysis* ⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.16
*Ethical audit of the vaccination project* ⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.17
*The trolley problem of data science* ⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.18
*Designing an ethical data science practice* ⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.19
*The cost of not being biased* ⭐⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.2
*Understanding fairness definitions* ⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.20
*Writing a data ethics statement* ⭐⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.21
*Historical harms and data* ⭐⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.22
*The future of data ethics* ⭐⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.23
*Ethical case debate* ⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.24
*Ethical impact assessment* ⭐⭐⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.25
*Reflection: your ethical compass* ⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.3
*Proxy discrimination* ⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.4
*Privacy and anonymization* ⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.5
*Informed consent evaluation* ⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.6
*The hiring algorithm* ⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.7
*The predictive policing dilemma* ⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.8
*Privacy vs. public health* ⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 32.9
*The credit scoring question* ⭐⭐ → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Exercise 33.1
*Why reproducibility matters* → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.10
*Viewing git diffs* ⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.11
*Undoing mistakes in git* ⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.12
*Simulating a merge conflict* ⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.13
*Pull request simulation* ⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.14
*Code review practice* ⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.15
*Writing a CONTRIBUTING.md* ⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.16
*Full project setup* ⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.17
*Reproducibility forensics* ⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.18
*Git history detective* ⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.19
*The reproducibility report card* ⭐⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.20
*Environment debugging* ⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.21
*Reproducibility across programming languages* ⭐⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.22
*Docker for reproducibility* ⭐⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.23
*Data versioning* ⭐⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.24
*Team workflow design* ⭐⭐⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.25
*The reproducibility pledge* ⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.3
*Identifying reproducibility problems* ⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.4
*The .gitignore file* ⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.5
*Initialize a repository* ⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.6
*Branching and merging* ⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.7
*Creating a virtual environment* ⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.8
*Writing a README* ⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 33.9
*Setting random seeds* ⭐⭐ → Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Exercise 34.1
*Spotting the generic portfolio* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.10
*The before/after notebook transformation* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.11
*Write a project description for your resume* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.13
*LinkedIn profile upgrade* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.14
*The interest inventory* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.16
*The range assessment* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.17
*Data source scavenger hunt* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.18
*The STAR-D practice* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.19
*Take-home simulation* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.20
*Behavioral question bank* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.21
*SQL practice for interviews* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.22
*The six-month portfolio plan* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.23
*Peer review exchange* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.25
*The honest self-assessment* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.3
*Reading like a hiring manager* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.5
*Distinguishing portfolio from homework* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.6
*Write your project introduction* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.7
*Curate your visualizations* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.8
*Write your project README* → Chapter 34 Exercises: Building Your Portfolio
Exercise 34.9
*Set up your GitHub profile* → Chapter 34 Exercises: Building Your Portfolio
Exercise 35.1
*Define your question* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.10
*Statistical analysis* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.11
*Build and evaluate models* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.12
*Sensitivity analysis* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.13
*Write the introduction* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.14
*Write the conclusions* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.15
*Write the limitations and ethical reflection* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.16
*Polish all visualizations* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.18
*Write the final README* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.19
*Conduct peer review* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.2
*Inventory your resources* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.20
*Final submission and reflection* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.3
*Set up the repository* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.4
*Load and inspect all data* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.5
*Clean the data with documented decisions* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.6
*Merge and prepare the analytical dataset* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.7
*Exploratory visualization set* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.8
*Discover something surprising* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 35.9
*Summarize exploration findings* → Chapter 35 Exercises: Capstone Project Milestones
Exercise 36.1
*The honest skills audit* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.10
*The six-month roadmap* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.11
*The accountability structure* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.12
*The portfolio gap analysis* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.13
*Writing your professional story* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.14
*The learning resource evaluation* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.15
*The conference and community plan* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.16
*Letter to your future self* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.17
*Letter to a future student* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.18
*The one-year vision* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.19
*Gratitude inventory* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.2
*What did you learn that surprised you?* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.3
*The skills you didn't know you'd need* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.4
*Revisiting Chapter 1* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.5
*Job posting analysis* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.6
*Career path comparison* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.7
*The informational interview plan* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.8
*Day-in-the-life simulation* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 36.9
*The three-skill priority* → Chapter 36 Exercises: Planning Your Future in Data Science
Exercise 4.1
*Basic conditional* ⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.10
*While loop: countdown* ⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.11
*While loop: find first match* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.14
*Function with conditional* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.15
*Function with a loop* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.16
*Multiple return values* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.17
*Default parameters* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.18
*Functions calling functions* ⭐⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.19
*Trace the output* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.22
*Trace a function call* ⭐⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.23
*Data quality report* ⭐⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.24
*Running total with threshold alert* ⭐⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.25
*Letter grade converter* ⭐⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.26
*Pseudocode first* ⭐⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.27
*Data validation pipeline* ⭐⭐⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.28
*Types and conditionals* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.29
*F-strings in loops* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.3
*Nested conditionals* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.30
*The data science lifecycle in code* ⭐⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.4
*Boolean expressions* ⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.5
*Conditional with string data* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.7
*Accumulator pattern* ⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.8
*Count with condition* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 4.9
*Loop with string formatting* ⭐⭐ → Chapter 4 Exercises: Control Flow, Functions, and Thinking Like a Programmer
Exercise 5.1
*Choosing the right structure* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.10
*Dictionary from two lists* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.11
*Counting with dictionaries* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.12
*Write and read a CSV* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.14
*Building a frequency table* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.15
*Set-based data cleaning* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.16
*Weather data processing* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.17
*Inverting a mapping* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.18
*Grade distribution analysis* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.19
*JSON data exploration* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.2
*Mutable vs. immutable* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.20
*Marcus's sales analysis* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.21
*Data structure trade-offs* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.22
*Designing a mini-database* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.23
*From spreadsheet to code* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.24
*Comparing file formats* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.26
*Lifecycle with dictionaries (Chapter 1 + Chapter 5)* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.27
*Functions with data structures (Chapter 4 + Chapter 5)* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.28
*Type conversion meets data structures (Chapter 3 + Chapter 5)* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.29
*Conditionals inside comprehensions (Chapter 4 + Chapter 5)* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.3
*Dictionary access patterns* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.30
*Jupyter notebook narrative (Chapter 2 + Chapter 5)* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.31
*Beyond built-in: collections module* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.32
*Real data challenge* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.4
*Comprehension anatomy* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.5
*Reading the error message* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.6
*The file reading pattern* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.7
*Set operations in plain English* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.8
*Building a data record* → Chapter 5 Exercises: Working with Data Structures
Exercise 5.9
*Filtering a list of dictionaries* → Chapter 5 Exercises: Working with Data Structures
Exercise 6.1
*EDA as conversation* → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.11
*Mode (most common value)* → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.12
*Percentile calculation* → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.13
*Filter and summarize* → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.14
*Year-over-year comparison* → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.15
*Data quality detective* (2-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.16
*Pandemic impact analysis* (2-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.17
*Regional deep dive* (3-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.18
*Vaccine comparison* (2-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.19
*Top and bottom performers* (2-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.20
*Build a complete summary report* (3-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.21
*Text-based bar chart* (3-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.22
*Correlation by eye* (3-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.23
*Coverage trajectory* (3-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.24
*Write your own data dictionary* (3-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.25
*EDA question generator* (4-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.26
*Weather data exploration* (2-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.27
*Bookshelf analysis* (2-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.28
*Classroom grades* (3-star) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.29
*Data science lifecycle in action* (1-star, Ch.1 review) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.3
*Missing value categories* → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.30
*Functions for reuse* (2-star, Ch.4 review) → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.5
*Notebook narrative vs. code dump* → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.6
*Reproducibility checklist* → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.7
*Flexible data loader* → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.8
*Count unique values* → Chapter 6 Exercises: Your First Data Analysis
Exercise 6.9
*Safe numeric extraction* → Chapter 6 Exercises: Your First Data Analysis
Exercise 7.1
*The threshold concept* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.10
*Creating computed columns* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.12
*apply() with a custom function* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.13
*describe() interpretation* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.14
*Boolean indexing with multiple conditions* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.15
*Weather station data* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.16
*Book sales analysis* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.17
*Vaccination data exploration* (3-star) → Chapter 7 Exercises: Introduction to pandas
Exercise 7.18
*Comparing pure Python and pandas* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.19
*Debugging challenge* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.2
*Series vs. DataFrame* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.20
*Custom summary function* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.21
*Before-and-after comparison essay* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.22
*Build a mini-analysis pipeline* (4-star) → Chapter 7 Exercises: Introduction to pandas
Exercise 7.23
*Data grammar translation* (3-star) → Chapter 7 Exercises: Introduction to pandas
Exercise 7.24
*Investigating a pattern* (4-star) → Chapter 7 Exercises: Introduction to pandas
Exercise 7.25
*Build and analyze a classroom dataset* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.26
*Exploring unfamiliar data* (3-star) → Chapter 7 Exercises: Introduction to pandas
Exercise 7.5
*Boolean indexing mechanics* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.6
*Method chaining readability* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.7
*Build a DataFrame from scratch* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.8
*Selecting and filtering* → Chapter 7 Exercises: Introduction to pandas
Exercise 7.M1
*Vocabulary bridge* (Chapter 3 + 7) → Chapter 7 Exercises: Introduction to pandas
Exercise 7.M2
*Functions in two worlds* (Chapter 4 + 7) → Chapter 7 Exercises: Introduction to pandas
Exercise 7.M3
*Data structure evolution* (Chapter 5 + 7) → Chapter 7 Exercises: Introduction to pandas
Exercise 7.M4
*EDA revisited* (Chapter 6 + 7) → Chapter 7 Exercises: Introduction to pandas
Exercise 7.M5
*The big picture* (Chapter 1 + 7) → Chapter 7 Exercises: Introduction to pandas
Exercise 8.1
*The five types of mess* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.10
*to_numeric with errors* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.11
*Missing value audit function* ⭐⭐ → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.12
*Cleaning a price column* ⭐⭐ → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.13
*Standardizing country names* ⭐⭐ → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.14
*Detecting implausible values* ⭐⭐ → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.15
*Group-based imputation* ⭐⭐⭐ → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.16
*Complete cleaning pipeline* ⭐⭐⭐ → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.19
*The disappearing rows* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.21
*The NaN equality trap* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.22
*The imputation experiment* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.23
*Design a cleaning strategy* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.25
*Sensitivity analysis* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.26
*Building a cleaning function* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.27
*Your cleaning instincts* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.28
*What surprised you?* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.29
*Fuzzy deduplication* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.3
*Cleaning as analysis* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.30
*Automated data quality report* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.4
*Why not just dropna()?* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.5
*Data type consequences* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.6
*Outlier or insight?* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 8.9
*Duplicate detection* → Chapter 8 Exercises: Cleaning Messy Data
Exercise 9.1
*Join type selection* → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.11
*Method chain construction* → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.12
*Diagnosing a key explosion* → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.13
*Fixing a type mismatch* → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.14
*Complex melt with multiple id_vars* → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.15
*Merging messy country data* (2-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.16
*Reshaping survey data* (2-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.17
*Sales analysis pipeline* (3-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.18
*Handling duplicates in a merge* (3-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.19
*Weather data reshaping* (2-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.2
*Wide vs. long identification* → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.20
*Vaccination equity analysis* (3-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.21
*Self-join: Comparing countries within regions* (3-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.22
*Building a complete report* (4-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.23
*Performance awareness* (3-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.24
*Chained debugging* (3-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.25
*Multi-level groupby with unstack* (4-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.26
*Movie ratings analysis* (2-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.27
*Fitness tracker data* (3-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.28
*Combining election data* (3-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.29
*Full pipeline from loading to analysis* (3-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.3
*Split-apply-combine in plain English* → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.30
*Debugging a multi-step analysis* (2-star) → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.4
*Predicting merge output* → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.5
*Method chaining readability* → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.7
*Basic merge practice* → Chapter 9 Exercises: Reshaping and Transforming Data
Exercise 9.9
*GroupBy aggregations* → Chapter 9 Exercises: Reshaping and Transforming Data
Explainability
the ability to explain why a model made a particular prediction — is both a technical goal and an ethical requirement. → Chapter 32: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice

F

F1 score
the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). → Chapter 29: Evaluating Models — Accuracy, Precision, Recall, and Why "Good" Depends on the Question
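A quick sketch of the computation, with illustrative precision and recall values (assumed here, not from the book):

```python
# Illustrative values (assumed, not from the book)
precision = 0.80
recall = 0.60

# F1 is the harmonic mean: it punishes imbalance between the two
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.6857 — below the arithmetic mean of 0.70
```

The harmonic mean sits below the arithmetic mean whenever precision and recall differ, which is exactly why F1 rewards balanced models.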
false positive rate
and it's exactly what the significance level controls. → Chapter 23: Hypothesis Testing — Making Decisions with Data (and What P-Values Actually Mean)
Faster to write
you express intent, not mechanics - **Faster to run** — pandas operates on entire columns at once using optimized C code under the hood - **Easier to read** — even someone unfamiliar with pandas can guess what `groupby("region")["coverage_pct"].mean()` does - **Safer** — pandas handles type conversions → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
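A minimal sketch of that one-liner, using an invented toy DataFrame (the column names are borrowed from the entry; the values are made up):

```python
import pandas as pd

# Tiny invented example — not data from the book
df = pd.DataFrame({
    "region": ["Africa", "Africa", "Europe", "Europe"],
    "coverage_pct": [70.0, 80.0, 90.0, 94.0],
})

# One expression replaces a loop, an accumulator dict, and a division
means = df.groupby("region")["coverage_pct"].mean()
print(means["Africa"], means["Europe"])  # 75.0 92.0
```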
feature importance
a score for each feature indicating how much it contributed to the model's predictions. This is computed by measuring how much each feature reduces impurity across all the trees in the forest. → Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss
Features (known at application time):
`credit_score`: Applicant's credit score (300-850) - `annual_income`: Self-reported annual income - `debt_to_income`: Monthly debt payments divided by monthly income - `loan_amount`: Amount requested - `employment_years`: Years at current employer - `loan_purpose`: Reason for the loan (home improvement, …) → Case Study 1: Should We Approve the Loan? A Decision Tree for Credit Risk
File 1: `player_stats.csv`
Per-game statistics for 150 players across 30 teams. → Case Study 1: Combining Player Stats and Team Records — Priya Merges NBA Datasets
File 2: `team_records.csv`
Season records for all 30 teams. → Case Study 1: Combining Player Stats and Team Records — Priya Merges NBA Datasets
file drawer problem
the non-significant results are stuck in researchers' file drawers, invisible to the scientific community. → Case Study 2: The Replication Crisis — When Significant Results Disappear
Fixes:
Reduce alpha: `scatter_kws={"alpha": 0.1}` - Use a 2D histogram or hexbin: `plt.hexbin(df["x"], df["y"], gridsize=30)` - Use KDE: `sns.kdeplot(data=df, x="x", y="y")` → Chapter 16: Statistical Visualization with seaborn
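A hedged sketch applying two of these fixes to synthetic overplotted data (the data here is invented; the book's `df["x"]`/`df["y"]` columns are stood in for by arrays):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic overplotted data (10,000 correlated points)
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x + rng.normal(scale=0.5, size=10_000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y, alpha=0.1)        # fix 1: low alpha reveals density
hb = ax2.hexbin(x, y, gridsize=30)  # fix 2: hexbin bins points into cells
fig.tight_layout()
```

With 10,000 points, the low-alpha scatter shows density through ink buildup, while the hexbin encodes count per cell directly.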
For Black defendants who did NOT reoffend:
44.9% were incorrectly classified as medium or high risk (false positive rate) → Case Study 1: When Algorithms Discriminate — Bias in Hiring, Lending, and Criminal Justice
For Black defendants who DID reoffend:
28.0% were incorrectly classified as low risk → Case Study 1: When Algorithms Discriminate — Bias in Hiring, Lending, and Criminal Justice
For white defendants who did NOT reoffend:
23.5% were incorrectly classified as medium or high risk → Case Study 1: When Algorithms Discriminate — Bias in Hiring, Lending, and Criminal Justice
For white defendants who DID reoffend:
47.7% were incorrectly classified as low risk (false negative rate) → Case Study 1: When Algorithms Discriminate — Bias in Hiring, Lending, and Criminal Justice
Forgetting `fig.tight_layout()` before save/show
labels get cut off - [ ] **Bar chart y-axis not starting at zero** -- visually misleading - [ ] **Overlapping x-axis labels** -- fix with rotation or horizontal bars - [ ] **Rainbow colors on bars that represent the same variable** -- use one color - [ ] **Title says the topic, not the finding** → Key Takeaways: matplotlib Foundations
Formulas are best when:
The answer needs to be exact (not approximate) - You need to compute the answer quickly (simulation takes time) - You want to understand *why* the answer is what it is (formulas reveal structure) - You need to communicate the logic to others (formulas are more transparent than code) → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
Four career paths
data analyst, data scientist, ML engineer, and data engineer — with honest descriptions of daily work, required skills, and compensation - **The skills gap** between introductory and intermediate data science, including SQL, deep learning, NLP, A/B testing, cloud computing, and software engineering → Chapter 36: What's Next: Career Paths, Continuous Learning, and the Road to Intermediate Data Science
function
a command that tells Python to do something. In this case, it tells Python to display whatever is inside the parentheses. - `"Hello, world!"` is a **string** — a piece of text. The quotation marks tell Python "this is text, not code." - When you ran the cell, the notebook sent `print("Hello, world!")` → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook

G

GDP per capita
wealthier countries can afford both higher health spending and better vaccination infrastructure. (2) **Government effectiveness** — well-functioning governments both allocate more to health and implement programs effectively. (3) **Education levels** — more educated populations both demand more healthcare → Chapter 24 Quiz: Correlation, Causation, and the Danger of Confusing the Two
GDPR requirements:
Users must explicitly consent to their social media data being used for credit decisions (consent must be specific, informed, and freely given) - The company must explain the logic of the decision-making process (Article 22) - Users have the right to contest automated decisions - The data must be adequate, relevant, and limited to what is necessary → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Generalization
performing well on new data — is the whole point. Everything else (the splits, the baselines, the complexity tuning) is in service of generalization. → Chapter 25: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
geometric object
often shortened to "geom" — is the visual mark that represents data on the chart. Common geometric objects include: → Chapter 14: The Grammar of Graphics — Why Visualization Matters and How to Think About Charts
git
a version control system that tracks every change to every file in a project, lets you go back to any previous version, and enables multiple people to work simultaneously without conflicts. Alongside git, you will learn about **virtual environments** (which capture the exact software versions your code depends on) → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
GitHub
Your portfolio home base. Learn GitHub Pages if you want a free personal website. - **Kaggle Datasets** — For finding interesting datasets (use the datasets section, not just competitions). - **Google Dataset Search** — A search engine specifically for datasets. - **data.gov / data.gov.uk / EU Open Data Portal** → Further Reading: Building Your Portfolio
Good scope for a portfolio project:
Can be completed in 15-25 hours of focused work - Uses one to three data sources - Requires meaningful cleaning but not months of it - Has a clear question that can be answered with the data available - Produces three to eight polished visualizations - Fits in a single notebook with clear narrative → Chapter 34: Building Your Portfolio: Projects That Get You Hired
greedy
it matches as much as possible. We'll learn about greedy versus lazy matching later in this chapter. → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
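A minimal illustration of greedy overshoot, using an invented string (this example is a sketch, not one from the book):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy .* grabs as much as possible: ONE match spanning both tags
matches = re.findall(r"<.*>", html)
print(matches)  # ['<b>bold</b> and <i>italic</i>']
```

The engine runs `.*` to the end of the string and backtracks only far enough to find the last `>`, swallowing everything in between.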
grouping
are the structural transformations at the heart of data wrangling. They don't change the *values* in your data; they change its *shape*. And until you're comfortable with them, you'll be stuck: staring at data that has all the information you need but isn't arranged in a way that lets you use it. → Chapter 9: Reshaping and Transforming Data — Merge, Join, Pivot, Melt, and GroupBy

H

Headline
one sentence capturing the key insight 2. **Context** — 2-3 sentences on why the analysis was done 3. **Key findings** — 3-5 bullet points (insights, not raw findings) 4. **Recommendation** — specific, actionable next step 5. **Caveats** — honest limitations → Key Takeaways: Communicating Results: Reports, Presentations, and the Art of the Data Story
Highlighted datasets:
COVID-19 case surveillance (millions of rows --- good for large-data practice) - BRFSS (Behavioral Risk Factor Surveillance System) --- annual survey of 400,000+ adults - WONDER (Wide-ranging ONline Data for Epidemiologic Research) --- mortality and population data → Appendix D: Data Sources Guide
Historical demand:
`demand_yesterday`: Yesterday's actual demand (MWh) - `demand_last_week`: Demand 7 days ago - `demand_last_year`: Demand on the same date last year - `avg_demand_7day`: Rolling 7-day average demand → Case Study 1: End-to-End — From Raw Data to Deployed Prediction
Honesty
[ ] Bar chart y-axes start at zero. - [ ] The time range is representative, not cherry-picked. - [ ] Dual y-axes are avoided (or clearly labeled and justified). - [ ] Area encodings are proportional to values, not to radii or heights. - [ ] Missing context (sample size, uncertainty, baseline) is provided → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
How the web works
at least enough to understand HTTP requests and responses. - **APIs** — the structured, polite way to ask a server for data. - **Web scraping** — the sometimes-necessary, sometimes-controversial alternative when no API exists. - **Ethics and legality** — because just because you *can* access data do → Chapter 13: Getting Data from the Web — APIs, Web Scraping, and Building Your Own Datasets
How to write well:
Start with the question, not the code. Your reader should understand what you're investigating in the first paragraph. - Use visualizations as anchors for the narrative. A good blog post alternates between text and charts, with each chart accompanied by interpretation. - Show your code, but not all → Chapter 34: Building Your Portfolio: Projects That Get You Hired
hyperparameters
settings that you choose before training and that affect the model's behavior. For a random forest: `n_estimators`, `max_depth`, `max_features`, `min_samples_leaf`. For logistic regression: `C` (regularization strength). For a decision tree: `max_depth`, `min_samples_split`. → Chapter 30: The Machine Learning Workflow — Pipelines, Validation, and Putting It All Together
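A sketch showing where those settings live in scikit-learn, using the random forest names from the entry (the specific values here are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are chosen BEFORE training and shape model behavior.
# Values below are illustrative assumptions, not tuned choices.
model = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_depth=5,          # cap tree depth to limit overfitting
    max_features="sqrt",  # features considered at each split
    min_samples_leaf=3,   # minimum samples required in a leaf
    random_state=42,
)
print(model.get_params()["n_estimators"])  # 200
```

Nothing is learned yet at this point; the learned parameters only appear after calling `model.fit(X, y)`.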

I

Identify the message
What question should this chart answer? 2. **Audit the encodings** — Is each visual element the best choice for its variable? 3. **Check accessibility** — Colorblind-safe? High contrast? Alt text? 4. **Check honesty** — Zero-based bars? Full time range? Fair scales? 5. **Remove chartjunk** — Can any element be removed without losing information? → Key Takeaways: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
if statement
the simplest form of a **conditional**. Let's break down every piece: → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
In your community:
Are potholes more common in some neighborhoods than others? Why? - How does air quality in your city compare to similar cities? - Are there food deserts (areas without nearby grocery stores) near you? → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
In your daily life:
Is the "express" checkout lane at the grocery store actually faster? - Does the weather affect your mood? - How has the cost of your grocery basket changed over the past year? - Do you actually sleep better on weekends, or does it just feel that way? → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
In your interests:
Does home-field advantage matter more in some sports than others? - Are sequels generally rated lower than original movies? - Has the length of popular songs changed over the past 50 years? - Do books that win literary prizes actually sell more copies? → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
In your work or school:
Does class size affect student performance in your department? - Does the day of the week affect productivity at your workplace? - Do some types of marketing emails get more engagement than others? → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
Inconsistent categories
the same thing spelled different ways. "Male," "male," "M," "m," "MALE." "New York," "NY," "N.Y.," "new york," "New York City." You'd be amazed how many ways people can spell the same word. → Chapter 8: Cleaning Messy Data: Missing Values, Duplicates, Type Errors, and the 80% of the Job
independent
one doesn't affect the other — multiply: P(A and B) = P(A) × P(B). → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
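The multiplication rule can be checked by simulation; a minimal sketch with two fair coin flips (an invented example, not from the book):

```python
import random

random.seed(0)  # deterministic run for reproducibility

# Two fair, independent coin flips: P(both heads) = 0.5 * 0.5 = 0.25
trials = 100_000
both_heads = sum(
    random.random() < 0.5 and random.random() < 0.5
    for _ in range(trials)
)
print(both_heads / trials)  # close to 0.25
```

With 100,000 trials the simulated frequency lands within about a percentage point of the exact 0.25 the formula gives.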
Independent framing
the title and context are original, not referencing a textbook assignment; (2) **Decision justification** — the choices are explained with reasoning, not attributed to instructions; (3) **Awareness of consequences** — the note about Sub-Saharan Africa shows the author understands the analytical implications → Chapter 34 Exercises: Building Your Portfolio
index
pandas's way of labeling each row. By default, the index is just sequential integers starting at 0, like list indices. → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
infinite loop
a loop whose condition never becomes `False`:

> ```python
> count = 1
> while count <= 5:
>     print(f"Count is {count}")
>     # Oops! Forgot to update count!
> ```
>
> This prints "Count is 1" forever (or until you interrupt it). In Jupyter, you'll see the cell keep running with a `[*]` that never turns into a number. → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
Institutional Review Board (IRB)
a committee that evaluates whether the benefits of the research justify the risks to participants and whether participants are adequately informed. → Chapter 32: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Interpreting mean vs. median:
Mean = Median: roughly symmetric distribution - Mean > Median: right-skewed (pulled up by high outliers) - Mean < Median: left-skewed (pulled down by low outliers) → Key Takeaways: Your First Data Analysis
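A tiny invented example of the right-skew case, where one high outlier drags the mean above the median:

```python
# One large outlier pulls the mean up; the median barely moves
values = [30, 32, 33, 35, 200]

mean = sum(values) / len(values)           # 66.0
median = sorted(values)[len(values) // 2]  # 33 (middle of an odd-length list)
print(mean, median, mean > median)  # 66.0 33 True -> right-skewed
```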
IQR Fence Method:
Lower fence = Q1 - 1.5 * IQR - Upper fence = Q3 + 1.5 * IQR - Values beyond the fences are flagged as outliers - Robust — based on the median and quartiles → Key Takeaways: Descriptive Statistics
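A sketch of the fence rule on a small invented dataset (NumPy's default percentile interpolation is assumed):

```python
import numpy as np

# Invented data with one obvious outlier
data = np.array([10, 12, 13, 14, 15, 16, 18, 95])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95]
```

Because the fences are built from quartiles, the outlier itself barely influences where the fences sit, which is what "robust" means here.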

J

JSON
the web's format. You learned to load flat JSON with `pd.read_json()`, flatten nested structures with `pd.json_normalize()`, and use `record_path` and `meta` to handle arrays inside records. → Chapter 12: Getting Data from Files — CSVs, Excel, JSON, and Databases
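A minimal sketch of flattening nested JSON (the records below are invented to show the dot-separated column names `json_normalize` produces):

```python
import pandas as pd

# Invented nested records — not data from the book
records = [
    {"country": "Kenya", "stats": {"year": 2022, "coverage_pct": 81.0}},
    {"country": "Chile", "stats": {"year": 2022, "coverage_pct": 94.0}},
]

# Nested dicts become dot-separated columns: "stats.year", "stats.coverage_pct"
df = pd.json_normalize(records)
print(df.shape)  # (2, 3)
```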
Just right:
"I analyzed 10 years of USDA crop yield data to investigate whether organic farming productivity has been catching up to conventional farming." - "I scraped 5,000 job postings for data science positions to identify the most in-demand skills by city and company size." - "I built a model to predict wh → Chapter 34: Building Your Portfolio: Projects That Get You Hired

K

Key activities:
Install Anaconda and create your project notebook - Complete all Chapter 3 and 4 coding exercises --- these fundamentals must be solid - Write helper functions for the progressive project - If you get stuck on installation, consult Appendix C (Setup Guide) and Appendix E (FAQ) → Self-Paced Learning Guide
Key concepts from this chapter:
**Exploratory data analysis (EDA)** is the process of systematically examining a dataset to discover patterns, spot anomalies, check assumptions, and generate hypotheses. It's a conversation with your data. - **Data loading** with Python's `csv.DictReader` gives you a list of dictionaries — one per → Chapter 6: Your First Data Analysis — Loading, Exploring, and Asking Questions of Real Data
Key distinction:
`Timedelta(days=30)` = exactly 30 days (calendar-unaware) - `DateOffset(months=1)` = one calendar month (handles varying month lengths) → Key Takeaways: Working with Dates, Times, and Time Series Data
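The distinction is easiest to see at a month boundary; a small sketch (dates chosen to land in a leap-year February):

```python
import pandas as pd

start = pd.Timestamp("2024-01-31")

# Exactly 30 days, calendar-unaware: overshoots February entirely
print(start + pd.Timedelta(days=30))    # 2024-03-01

# One calendar month: clips to the last valid day of February
print(start + pd.DateOffset(months=1))  # 2024-02-29 (leap year)
```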
Key findings:
GDP per capita explained 45% of variance alone, but healthcare worker density (physicians + nurses per capita) added 12 percentage points of explanatory power - Sub-Saharan Africa showed the widest within-region variation, suggesting country-level factors dominate over regional ones - The relationsh → Chapter 35: Capstone Project: A Complete Data Science Investigation
Key function vocabulary:
**Domain:** The set of valid inputs. For $f(x) = \sqrt{x}$, the domain is $x \geq 0$ (you cannot take the square root of a negative number in the real numbers). - **Range:** The set of possible outputs. - **Monotonic:** A function that only goes up (increasing) or only goes down (decreasing), never → Appendix A: Math Foundations Refresher
Key principles:
"Publicly accessible" does not mean "freely usable" - Legal and ethical are separate questions — something can be legal but unethical - Scale matters — what's fine for 50 data points may be problematic for 50,000 - Always prefer APIs over scraping - When in doubt, slow down and investigate → Key Takeaways: Getting Data from the Web
Key rules you probably remember:
**Addition and subtraction** are performed left to right: $10 - 3 + 2 = 9$. - **Multiplication and division** are performed before addition and subtraction: $2 + 3 \times 4 = 14$, not 20. - **Parentheses** override everything: $(2 + 3) \times 4 = 20$. - **Exponents** are performed before multiplicat → Appendix A: Math Foundations Refresher
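Python's arithmetic follows the same precedence rules, so each one can be verified directly in a cell:

```python
# Left to right for same-precedence + and -
print(10 - 3 + 2)   # 9
# Multiplication before addition
print(2 + 3 * 4)    # 14
# Parentheses override everything
print((2 + 3) * 4)  # 20
# Exponents before multiplication
print(2 * 3 ** 2)   # 18
```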
KeyError
column names are case-sensitive. Fix: `df["country"]`. 2. **ValueError** — use `&` instead of `and`, and wrap conditions in parentheses. Fix: `df[(df["coverage_pct"] > 90) & (df["year"] == 2022)]`. 3. **KeyError** — multiple columns need double brackets. Fix: `df[["country", "region"]]`. 4. **SettingWithCopyWarning** → Chapter 7 Exercises: Introduction to pandas

L

label
the index values and column names: → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
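The excerpt cuts off before the book's example; as a minimal stand-in (with an invented DataFrame), label-based selection with `.loc` uses index values and column names rather than positions:

```python
import pandas as pd

# Invented example: country names as the index (the labels)
df = pd.DataFrame(
    {"coverage_pct": [81.0, 94.0]},
    index=["Kenya", "Chile"],
)

# .loc selects by LABEL: row label "Kenya", column label "coverage_pct"
print(df.loc["Kenya", "coverage_pct"])  # 81.0
```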
layout
what charts and controls appear on the page. 2. Defines **callbacks** — functions that run when the user interacts with a control. 3. Runs a local web server that serves the dashboard in a browser. → Chapter 17: Interactive Visualization — plotly, Dashboard Thinking
lazy
it matches as little as possible: → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
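A quick illustration of greedy versus lazy matching (the HTML string is made up):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy .* matches as much as possible: first < to last >
greedy = re.findall(r"<.*>", html)
# Lazy .*? matches as little as possible: each tag separately
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```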
Left join
the student list is your primary dataset. You want every student, with scores where available. (b) **Inner join** — you only want complete records that exist in both systems. (c) **Outer join** — you need the full picture from both sides to identify gaps. (d) **Left join** — your customer list is pr → Chapter 9 Exercises: Reshaping and Transforming Data
left-skewed
pulled down by a tail of low values. In practical terms, this tells us that *most* countries have fairly high vaccination coverage, but a smaller group of countries has much lower rates, dragging the average down. That's an important insight. → Chapter 6: Your First Data Analysis — Loading, Exploring, and Asking Questions of Real Data
Level 2 (Evidence):
"Stores that adopted the new checkout system saw 12% higher sales." - "The effect was strongest in high-traffic stores (18% increase) and weaker in low-traffic stores (5%)." - "Customer survey data suggests faster checkout is the primary driver." → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Level 3 (Details):
"We analyzed three years of sales data across 200 stores." - "The analysis controlled for store size, location, and seasonal trends." → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
likelihood
P(B) = P(positive test) = P(positive|disease) * P(disease) + P(positive|no disease) * P(no disease) - P(A|B) = P(disease | positive test) — the **posterior** (what we want) → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
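Plugging illustrative numbers into the formula (1% prevalence, 90% sensitivity, 5% false-positive rate — all hypothetical):

```python
p_disease = 0.01              # prior: P(disease)
p_pos_given_disease = 0.90    # likelihood: P(positive | disease)
p_pos_given_healthy = 0.05    # false-positive rate: P(positive | no disease)

# Total probability of a positive test, P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: posterior P(disease | positive)
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # 0.154 — far lower than most intuitions expect
```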
Likely ethical
government data is public by design, the purpose is public benefit, and no personal harm results. Still, check robots.txt and ToS. 2. **It depends** — competitive intelligence is common, but hourly scraping could violate ToS and strain the competitor's servers. The legality varies by jurisdiction. 3 → Chapter 13 Exercises: Getting Data from the Web
Limitations should be specific:
Not "the data may have errors" but "the WHO vaccination data relies on country self-reporting, and countries with weaker health information systems may undercount doses administered" - Not "more data would be helpful" but "individual-level vaccination data (rather than country-level aggregates) woul → Chapter 35 Exercises: Capstone Project Milestones
List
ordered and allows duplicates. 2. Country-to-code mapping: **Dictionary** --- fast lookup by name. 3. Unique vaccine manufacturers: **Set** --- automatic deduplication, order irrelevant. 4. Latitude/longitude pair: **Tuple** --- fixed, immutable pair that can serve as a dictionary key. 5. Patient re → Answers to Selected Exercises
logistic regression
a model specifically designed for classification. Despite its name (it has "regression" in it), logistic regression is a classification algorithm. It predicts the *probability* that an observation belongs to a particular category, and then uses that probability to make a classification decision. → Chapter 27: Logistic Regression and Classification — Predicting Categories
Look at the caret
it points to where Python first noticed the problem. 3. **Check for missing quotes, parentheses, or colons.** 4. **Check the line *above*** — sometimes the error is on the previous line, but Python doesn't notice until the next line. → Chapter 3: Python Fundamentals I — Variables, Data Types, and Expressions
Looking back:
Chapter 19 gave us the tools to describe individual variables (means, standard deviations) - Chapters 22-23 gave us tools to estimate parameters and test hypotheses about one or two variables - This chapter extends the toolkit to *relationships between variables* and introduces the critical distinct → Chapter 24: Correlation, Causation, and the Danger of Confusing the Two
Looking forward:
Chapter 25 introduces formal modeling — using one variable to *predict* another - Chapters 26-28 build regression and classification models that quantify relationships while controlling for confounders - Chapter 32 revisits the ethical implications of causal claims → Chapter 24: Correlation, Causation, and the Danger of Confusing the Two
loop variable
it automatically takes on each value in the sequence, one at a time. On the first pass (or **iteration**), `country` is `"Brazil"`. On the second iteration, it's `"India"`. And so on, until the list is exhausted. → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
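A minimal loop showing the variable taking each value in turn:

```python
countries = ["Brazil", "India", "Nigeria"]

# `country` is the loop variable: each iteration binds it to the next item
for country in countries:
    print(country)

# After the loop ends, the variable keeps its last value
print(country)  # Nigeria
```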
Lower the threshold
flag more countries, accept more false alarms. → Chapter 27: Logistic Regression and Classification — Predicting Categories

M

Machine Learning Engineer:
Focuses on deploying models into production systems - Primary tools: Python, Docker, Kubernetes, ML frameworks (TensorFlow, PyTorch) - Outputs: production-grade ML services, APIs - Typical question: "How do we serve this recommendation model to 10 million users with 50ms response time?" → Appendix E: Frequently Asked Questions
MAR (Missing at Random)
the missingness is related to equipment availability, not to the actual temperature or humidity values. The data is probably missing on days when a sensor malfunctioned or the station was offline, which is likely unrelated to the weather conditions themselves. → Chapter 6 Quiz: Your First Data Analysis
Matchup features:
`win_pct_diff`: home_win_pct - away_win_pct - `net_rating_diff`: (home offensive rating - home defensive rating) - (away offensive rating - away defensive rating) → Case Study 2: Predicting Game Outcomes — Priya's Random Forest for NBA
mean
technically the *arithmetic mean* — is the one you already know. Add up all the values, divide by how many there are. Done. → Chapter 19: Descriptive Statistics — Center, Spread, Shape, and the Stories Numbers Tell
median
the middle value when you sort all the data — gives you a more robust sense of what's "typical." Computing it requires sorting: → Chapter 6: Your First Data Analysis — Loading, Exploring, and Asking Questions of Real Data
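A pure-Python sketch of the sort-then-pick procedure (the values are made up):

```python
values = [62, 95, 88, 71, 99, 45, 83]  # hypothetical coverage percentages

ordered = sorted(values)
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]  # odd count: take the middle value
else:
    # even count: average the two middle values
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median)  # 83
```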
Message
[ ] The chart answers a specific question. - [ ] The title states what the chart shows (and ideally, the key finding). - [ ] A non-expert can understand the main message in 10 seconds. → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Message:
[ ] I have identified my key insight (not just findings) - [ ] I lead with the conclusion, not the methodology - [ ] I have answered the question "so what?" - [ ] I have included a call to action or clear recommendation → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
method chaining
stringing multiple operations together in a single expression, where each operation feeds its result to the next. → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
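A small sketch of a chain, with each step feeding its result to the next (data invented):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "India", "Nigeria", "Norway"],
    "region": ["Americas", "Asia", "Africa", "Europe"],
    "coverage_pct": [88, 85, 62, 97],
})

# filter, then sort, then take the first two — one readable expression
result = (
    df[df["coverage_pct"] > 80]
      .sort_values("coverage_pct")
      .head(2)
)
print(result["country"].tolist())  # ['India', 'Brazil']
```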
methods
functions attached to the string that transform or inspect it. Here are the ones you'll use most often in data science: → Chapter 3: Python Fundamentals I — Variables, Data Types, and Expressions
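A few of the workhorses in action:

```python
city = "  Minneapolis, MN  "

print(city.strip())              # "Minneapolis, MN" — surrounding whitespace removed
print(city.strip().lower())      # "minneapolis, mn"
print(city.strip().split(", "))  # ['Minneapolis', 'MN'] — split into a list
print(city.replace("MN", "Minnesota").strip())  # "Minneapolis, Minnesota"
```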
Midterm review.
**Lecture 2:** **MIDTERM EXAMINATION** (Chapters 1--18 concepts through Chapter 13 code; Chapters 14--15 conceptual only). - **Lab:** matplotlib: Figure/Axes interface. Line plots, bar charts, scatter plots, histograms. Customization. Subplots. - **Assignment:** Chapter 14--15 exercises. Project Mil → 15-Week University Semester Syllabus
Minimum
the smallest value 2. **Q1** — the 25th percentile 3. **Median** — the 50th percentile 4. **Q3** — the 75th percentile 5. **Maximum** — the largest value → Chapter 19: Descriptive Statistics — Center, Spread, Shape, and the Stories Numbers Tell
Misleading title
"soared" is editorializing; the increase from Q3 to Q4 is about 12%. (3) **No context** — is this growth normal? How does it compare to the same quarter last year? → Chapter 18 Quiz: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Missing values
cells that should contain data but don't. The vaccination count is blank. The patient's age is recorded as "unknown." The sensor reading for 3 AM is absent because the sensor was rebooting. → Chapter 8: Cleaning Messy Data: Missing Values, Duplicates, Type Errors, and the 80% of the Job
MNAR
the very conditions you want to measure are the ones most likely to be missing. → Chapter 6 Quiz: Your First Data Analysis
Model A: "Nobody has the disease"
Predicts "healthy" for all 10,000 people - Gets 9,990 right (the truly healthy ones) and 10 wrong (the sick ones it missed) - Accuracy: 9,990 / 10,000 = **99.9%** - People it helped: **zero** → Chapter 29: Evaluating Models — Accuracy, Precision, Recall, and Why "Good" Depends on the Question
Model B
recall of 80% vs. 50%. It catches 80% of customers who will churn. 2. **Model A** — precision of 60% vs. 45%. Of the customers it flags, 60% actually churn (vs. 45% for Model B). 3. Suppose 100 customers, 20 will churn. Model A: catches 10 churners (recall 50%), flags 10/0.6 ≈ 17 total. Cost = 17 * → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
Model B: An actual screening test
Correctly identifies 8 of the 10 sick people (misses 2) - Correctly identifies 9,800 of the 9,990 healthy people (falsely alarms 190) - Accuracy: (8 + 9,800) / 10,000 = **98.08%** - People it helped: **8 who got early treatment** → Chapter 29: Evaluating Models — Accuracy, Precision, Recall, and Why "Good" Depends on the Question
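The accuracy arithmetic for both models, spelled out:

```python
# Model A: predicts "healthy" for all 10,000 people
acc_a = 9_990 / 10_000
# Model B: 8 true positives + 9,800 true negatives
acc_b = (8 + 9_800) / 10_000

print(acc_a, acc_b)  # 0.999 vs 0.9808 — A "wins" on accuracy yet helps no one
```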
Monte Carlo simulation
estimating probabilities by running random experiments many times. It's one of the most powerful techniques in data science. → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
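A minimal sketch: estimating the chance of 63 or more heads in 100 fair flips (the hypothesis-testing example from Chapter 23) by brute-force simulation:

```python
import random

random.seed(42)  # make the run reproducible

trials = 20_000
hits = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(100))  # one experiment
    if heads >= 63:
        hits += 1

estimate = hits / trials
print(estimate)  # close to the exact binomial probability, roughly 0.006
```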
Month 1: _______________
Primary learning goal: - Resource: - Time commitment: ___ hours/week - Deliverable by end of month: → Chapter 36 Exercises: Planning Your Future in Data Science
Month 1: Skill Building + Portfolio Expansion
Start an intensive SQL course (e.g., chapters 1-10 of *Practical SQL*). Complete one chapter per day. - Begin learning Tableau (free trial + public gallery for practice). - Build a second portfolio project that demonstrates SQL and visualization skills — perhaps analyzing a dataset entirely in SQL, → Chapter 36 Quiz: Reflection and Career Planning
Month 2: _______________
Primary learning goal: - Resource: - Time commitment: ___ hours/week - Deliverable by end of month: → Chapter 36 Exercises: Planning Your Future in Data Science
Month 2: Active Application + Interview Prep
Complete SQL course. Take a practice SQL assessment to benchmark skills. - Build a third portfolio project in a domain relevant to companies you're targeting. - Apply to 15-20 positions per week, tailoring cover letters to each. - Practice behavioral interview answers using STAR-D framework with pro → Chapter 36 Quiz: Reflection and Career Planning
Month 3: _______________
Primary learning goal: - Resource: - Time commitment: ___ hours/week - Deliverable by end of month: → Chapter 36 Exercises: Planning Your Future in Data Science
Month 3: Intensify + Refine
Based on interview feedback, address any consistent skill gaps. - Write a blog post about one of your portfolio projects to increase visibility. - Continue applications (15-20/week). Reach out to at least three people for informational interviews. - Practice SQL interview questions (window functions → Chapter 36 Quiz: Reflection and Career Planning
Month 4: _______________
Primary learning goal: - Resource: - Time commitment: ___ hours/week - Deliverable by end of month: → Chapter 36 Exercises: Planning Your Future in Data Science
Month 4: Final Push
Continue applications and interviews. By now you should have had several phone screens and hopefully some technical interviews. - Refine your portfolio based on what generates the most interview interest. - Follow up with all networking contacts. - If no offers yet, assess: is the issue the resume/p → Chapter 36 Quiz: Reflection and Career Planning
Month 5: _______________
Primary learning goal: - Resource: - Time commitment: ___ hours/week - Deliverable by end of month: → Chapter 36 Exercises: Planning Your Future in Data Science
Month 6: Review and Recalibrate
Self-assessment: What did I learn? What gaps remain? - Portfolio update: What new projects can I add? - Career progress: What applications, interviews, or connections have I made? - Next six months: What comes after this roadmap? → Chapter 36 Exercises: Planning Your Future in Data Science
Monthly (1--2 hours):
Read one in-depth article or blog post about a technique you have not used. - Try a Kaggle competition or work on a personal project. → Appendix E: Frequently Asked Questions
Monthly Goals:
Month 1: [specific goal + specific resource] - Month 2: [specific goal + specific resource] - Month 3: [specific goal + specific resource] - Month 4: [specific goal + specific resource] - Month 5: [specific goal + specific resource] - Month 6: [specific goal + specific resource] → Chapter 36: What's Next: Career Paths, Continuous Learning, and the Road to Intermediate Data Science
multi-index
an index with two levels (region and year). You can access specific values with: → Chapter 9: Reshaping and Transforming Data — Merge, Join, Pivot, Melt, and GroupBy
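A sketch of two-level access, using invented region/year data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["Africa", "Africa", "Europe", "Europe"],
    "year":   [2021, 2022, 2021, 2022],
    "coverage_pct": [58, 62, 94, 95],
}).set_index(["region", "year"])  # two-level (multi-) index

# One value via a (region, year) tuple
print(df.loc[("Africa", 2022), "coverage_pct"])  # 62
# All years for one region via the outer level alone
print(df.loc["Europe"])
```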
Multiple choice
tests recall and conceptual understanding - **True/False** — tests common misconceptions - **Short answer** — tests your ability to explain concepts in your own words - **Code reading** — shows you code and asks "what does this produce?" or "what's wrong with this?" → How to Use This Book

N

NameError
`pritn` is not a recognized name. It's a typo for `print`. Should be: `print("This should work")` → Chapter 2 Exercises: Setting Up Your Toolkit
natural experiments
events that change a country's wealth for reasons unrelated to health — to estimate the causal effect of income on health. → Case Study 2: Does Money Buy Health? Disentangling GDP, Spending, and Outcomes
natural frequencies
actual counts of people or events. → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
No rate limiting
1,000 rapid requests will likely trigger a 429 (Too Many Requests) response, and after that, `response.json()` might return an error message without a `name` key, causing a `KeyError`. 2. **No error handling** — if any request fails (network error, timeout, server error), the script crashes. 3. **No → Chapter 13 Quiz: Getting Data from the Web
No, 95% is not impressive
it matches the baseline exactly. The model might simply be predicting "will not cancel" for everyone, which requires no machine learning at all. It hasn't demonstrated any ability to identify the 5% who will cancel. → Chapter 25 Quiz: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
normal distribution
the bell curve. You've seen it mentioned a hundred times. Now you'll understand *why* it's everywhere. The answer involves one of the most beautiful results in mathematics: the **Central Limit Theorem**. And the way we'll discover it is by running a simulation that will make you say "wait, THAT happ → Chapter 21: Distributions and the Normal Curve — The Shape That Shows Up Everywhere
Not data science
This is **data engineering**. Building pipelines is essential infrastructure, but it is not itself asking or answering a question. (It supports stage 2, data collection, but it is not *doing* data science.) 2. **Not data science** — This is **mathematical statistics / statistical theory**. There is → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
notebook server
the engine that runs Jupyter in the background. **Don't close this window.** If you close it, Jupyter stops working. Just minimize it and forget about it. → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook
nothing
no Python, no statistics, no SQL. It is written for: → Introduction to Data Science

O

object-oriented interface
the `Figure` and `Axes` approach — rather than the simpler but less flexible `pyplot` shortcut. The object-oriented interface gives you full control over every element of your chart, and it's what you'll need for professional-quality work. → Chapter 15: matplotlib Foundations — Building Charts from the Ground Up
One idea per paragraph
**Translate every number** into something meaningful - **Cut mercilessly** — every sentence must earn its place - **Use formatting** — bold key findings, use bullets, add headers → Key Takeaways: Communicating Results: Reports, Presentations, and the Art of the Data Story
Outliers and implausible values
data points that are technically present but seem wrong. A patient listed as 250 years old. A temperature reading of -999 (a common placeholder for "missing" in scientific data). A sales figure that's ten thousand times the average. → Chapter 8: Cleaning Messy Data: Missing Values, Duplicates, Type Errors, and the 80% of the Job

P

Paired t-test
the same countries are measured before and after. 2. Differences: [6, 3, 7, 2, 8, 1, 7, 2, 7, 6]. Mean = 4.9, SD ≈ 2.60. 3. SE = SD/√10 ≈ 0.823. t = 4.9/0.823 ≈ 5.96. With df = 9, p < 0.001. 4. The increase is statistically significant. However, we can't conclude the campaign *caused* the increase → Chapter 23 Exercises: Hypothesis Testing
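The arithmetic is easy to verify with the standard library (the sample SD of these differences works out to about 2.60):

```python
import math
import statistics

diffs = [6, 3, 7, 2, 8, 1, 7, 2, 7, 6]  # before/after differences, n = 10

mean = statistics.mean(diffs)       # 4.9
sd = statistics.stdev(diffs)        # sample SD (n - 1 denominator), ~2.60
se = sd / math.sqrt(len(diffs))     # standard error of the mean difference
t = mean / se                       # paired t-statistic, df = n - 1 = 9

print(round(mean, 2), round(sd, 2), round(se, 3), round(t, 2))
```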
parameter
a variable that represents the input the function expects. When you call the function, you'll pass in actual data, and it will be assigned to this parameter name. - The colon `:` and indented block work just like `if` and `for` — everything indented is the function's body. - **`return total, average → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
Part I
You download the data and take your first look. What's in here? What are the columns? What questions could we ask? - **Part II** — You clean the data, handle missing values, reshape it, and merge in additional sources. - **Part III** — You create visualizations that reveal patterns — and learn to sp → Preface
Part I is fully linear
beginners need to build skills in a specific order. - **Later parts have more flexibility** — once you have the foundations, you can explore topics that interest you most. → How to Use This Book
Part I: Welcome to Data Science (Chapters 1--6)
Establishes Python fundamentals and the mindset of data analysis using pure Python, creating productive frustration that motivates the pandas library. - **Part II: Data Wrangling (Chapters 7--13)** --- Introduces pandas, data cleaning, reshaping, text/date handling, and data acquisition from files a → Instructor Guide: Overview
Perception
Does the human visual system process this chart effectively? 2. **Accessibility** — Can everyone in the audience understand this chart? 3. **Ethics** — Does this chart represent the data honestly? → Key Takeaways: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Perceptual non-uniformity
equal data differences produce unequal perceived color differences, so some data variations appear larger than they are. (2) **Colorblind inaccessibility** — the red-green transitions are invisible to deuteranopic viewers (~8% of men). (3) **False boundaries** — sharp hue transitions create perceive → Chapter 18 Quiz: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Perceptual principles
how the human visual system processes charts, and how to design for it. 2. **Accessibility** — how to ensure your visualizations work for people with color vision deficiency, low vision, and screen readers. 3. **Ethics** — how to avoid (and detect) misleading visualization techniques. → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
perfect classifier
it achieves 100% true positive rate with 0% false positive rate. AUC = 1.0. This rarely happens in practice. 2. A **random classifier** — no better than flipping a coin. AUC = 0.5. The model provides no useful information. 3. A **good classifier** — it achieves high true positive rates with relative → Chapter 29 Exercises: Evaluating Models
pipeline
a sequence of transformations where data flows through each step. This mirrors how data scientists describe their work in plain English: → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
Play-by-play data from the NBA's stats API
Structured. Each row is an event (foul, shot, turnover) with timestamps, player IDs, and score at the time of the event. 2. **Referee assignment records** — Structured. Which referees officiated each game. Available from official NBA data. 3. **Game video footage** — Unstructured. Priya might review → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
point estimate
a single number that represents our best guess. → Chapter 22: Sampling, Estimation, and Confidence Intervals — How to Learn About Millions from a Handful
Polish
[ ] Chartjunk has been removed. - [ ] Axis labels include units. - [ ] The legend is clear and necessary. - [ ] The data source is cited. - [ ] The figure size and resolution are appropriate for the medium. → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Political poll:
Population: All registered voters in Ohio. - Sample: The 1,200 polled voters. - Parameter: True proportion of all Ohio registered voters who support the measure. - Statistic: Proportion in the sample who support it. → Answers to Selected Exercises
Possible explanations for the score gap:
**Bias in peer evaluations:** Research consistently shows that identical behaviors are perceived as "leadership" in men and "bossiness" in women. Peer evaluations may reflect gender stereotypes. - **Opportunity gap:** If women are less likely to be assigned to high-visibility projects, they may have → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Pre-merge checklist:
[ ] Check both key columns have the same dtype (`df['key'].dtype`) - [ ] Check for duplicate keys (`df['key'].duplicated().sum()`) - [ ] Check for key overlap (`left['key'].isin(right['key']).sum()`) - [ ] Check for NaN in key columns (`df['key'].isna().sum()`) - [ ] Strip whitespace and standardize → Key Takeaways: Reshaping and Transforming Data
pre-registration
publicly recording the study design, hypotheses, sample size, and analysis plan before collecting any data. → Case Study 1: Does This Drug Work? A Clinical Trial Analysis
Prepare and split data
Mistake: not holding out a test set, or using the test set during development. → Chapter 30 Quiz: The Machine Learning Workflow
Priya's design decisions:
She chose a line chart because this is temporal data with a continuous trend. - She set the y-axis from 15 to 50 rather than 0 to 100 because the full range would flatten the trend into near-invisibility. For a line chart (which encodes position, not length), this is appropriate. - She annotated two → Case Study 2: The Sports Page Goes Digital — Priya's NBA Shot Charts
Probability basics:
Probability is a number between 0 (impossible) and 1 (certain) - Three interpretations: classical (counting outcomes), frequentist (long-run proportion), subjective (degree of belief) → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
Problem: Classify emails as spam or not spam
Target: Spam or not spam (category) - Features: Number of exclamation marks, presence of suspicious words, sender reputation, email length → Chapter 25: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Problem: Predict house prices
Target: Sale price (dollars) - Features: Square footage, bedrooms, bathrooms, lot size, neighborhood, year built → Chapter 25: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Progressive project milestones
Graded as a portfolio of incremental submissions. Early milestones should be graded generously to build confidence; later milestones demand higher polish and rigor. → Instructor Guide: Overview
prosecutor's fallacy
confusing P(evidence | innocent) with P(innocent | evidence). These are different conditional probabilities, and confusing them is exactly the error we explored with Bayes' theorem in Chapter 20. → Case Study 1: The Prosecutor's Fallacy — When Probability Goes to Court
PyCon
The annual Python conference. Affordable, welcoming to beginners, with excellent tutorials and talks. Many are recorded and available free online. - **csv,conf** — A conference about data, with an emphasis on practical data work. Smaller, community-oriented, very welcoming. - **Local conferences** — → Chapter 36: What's Next: Career Paths, Continuous Learning, and the Road to Intermediate Data Science

Q

Quarterly:
Learn one new tool or library and apply it to a real problem. - Attend a meetup, webinar, or conference talk (many are free and virtual). → Appendix E: Frequently Asked Questions
query parameters
key-value pairs that specify what data you want: → Chapter 13: Getting Data from the Web — APIs, Web Scraping, and Building Your Own Datasets
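In `requests` these are usually passed as a `params` dictionary; the mechanics can be seen with the standard library alone (the endpoint and parameter names here are invented):

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only
base_url = "https://api.example.com/v1/indicators"
params = {"country": "BRA", "indicator": "vaccination_coverage", "year": 2022}

url = base_url + "?" + urlencode(params)
print(url)
# https://api.example.com/v1/indicators?country=BRA&indicator=vaccination_coverage&year=2022
```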
Quick decision rule:
Need to look up by name? Use a **dictionary**. - Need an ordered, changeable collection? Use a **list**. - Need unique values or fast membership checks? Use a **set**. - Need a fixed, unchangeable sequence? Use a **tuple**. → Key Takeaways: Working with Data Structures
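The four choices side by side (all values invented):

```python
# List: ordered, changeable, allows duplicates
readings = [98.6, 99.1, 98.6]

# Dictionary: fast lookup by name
code_for = {"Brazil": "BRA", "India": "IND"}

# Set: unique values, fast membership checks — the duplicate collapses
manufacturers = {"Pfizer", "Moderna", "Pfizer"}

# Tuple: fixed, immutable — usable as a dictionary key
location = (44.98, -93.27)

print(code_for["Brazil"], len(manufacturers), location[0])
```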

R

Raise the threshold
be more selective, accept more misses. → Chapter 27: Logistic Regression and Classification — Predicting Categories
Random Forest for predictions
when we need to identify which countries need intervention - Use the **Decision Tree for communication** — when we need to explain the key risk factors to policymakers - The two models agree on the most important features (GDP per capita and physicians per 1,000), which gives us confidence in bo → Case Study 2: Comparing Three Models — Which Predicts Vaccination Best?
randomized controlled experiment
the same design used in medical trials. You randomly assign some people to get the treatment and others to get a placebo, and you compare the outcomes. But in data science, experiments aren't always possible. You can't randomly assign countries to have high GDP to see if it increases vaccination rat → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
Rate limiting
the server needs to know who's making requests so it can enforce usage limits (e.g., "1,000 requests per hour per user"). 2. **Access control** — some data is only available to authorized users. 3. **Accountability** — if someone misuses the API, the provider needs to know who did it. → Chapter 13: Getting Data from the Web — APIs, Web Scraping, and Building Your Own Datasets
raw string
it tells Python not to interpret backslashes as escape sequences. Always use raw strings for regex patterns. Always. It will save you from mysterious bugs. → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
Recall
measures what fraction of actual positives the model catches, which is critical when positive cases are rare and important (fraud, disease). (2) **F1-score** — the harmonic mean of precision and recall, providing a balanced measure that is only high when both precision and recall are reasonable. Bot → Chapter 27 Quiz: Logistic Regression and Classification — Predicting Categories
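Each metric is one line of arithmetic. With hypothetical confusion-matrix counts:

```python
# Hypothetical classifier results
tp, fp, fn = 8, 190, 2  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of the flagged cases, how many were real?
recall = tp / (tp + fn)     # of the real cases, how many were caught?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Note how the very low precision drags F1 down even though recall is high — exactly the behavior the harmonic mean is designed to have.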
Recommendations:
Audit the leadership score for gender bias specifically - Consider using structured evaluation criteria rather than subjective peer ratings - Test the promotion algorithm with and without the leadership score to measure its contribution to the gender gap - If the feature cannot be debiased, consider → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Titanic survival data (classification practice) - House prices (regression practice) - Iris flower dataset (clustering and classification) - New York City Airbnb listings (exploratory analysis) → Appendix D: Data Sources Guide
Red-green color pair
inaccessible to deuteranopic viewers. Red and green are the first two colors assigned to what are presumably the two most important regions. 2. **Six different colors for a single variable** — the bars represent different regions but the same metric (a value). Using different colors implies the colo → Chapter 18 Quiz: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
regex
a pattern-matching language that has been part of computing since the 1960s. Regular expressions work in Python, JavaScript, Java, Ruby, SQL, grep, sed, and dozens of other tools. Learning regex once means you can use it everywhere. → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
remote repository
a copy of the repository hosted on a server that everyone can access. → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Remove the legend box border
the legend labels are sufficient. 4. **Make sure the y-axis starts at zero** — for bar charts, this is non-negotiable. 5. **Add a descriptive title** that states the finding, not just the topic. Not "Vaccination Rates by Region" but "Sub-Saharan Africa Lags 30 Points Behind the Global Average in Vac → Chapter 14: The Grammar of Graphics — Why Visualization Matters and How to Think About Charts
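These fixes are each one line of matplotlib (data invented; rendered off-screen):

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["Africa", "Europe"], [62, 95], label="Coverage (%)")

ax.legend(frameon=False)  # remove the legend box border
ax.set_ylim(bottom=0)     # bar charts must start at zero
ax.set_title("Europe Leads Africa by 33 Points in Coverage")  # state the finding
```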
reproducibility
when he runs the same analysis next month, the code is already written, not a series of forgotten clicks; (2) **automation** — he can write a script that processes his sales data every week without manual work; and (3) **scale** — as his business grows, Excel will struggle with larger datasets, but → Chapter 2 Exercises: Setting Up Your Toolkit
Respect the citation honesty system:
Tier 1: Only for sources you can verify exist - Tier 2: Attributed but unverified claims - Tier 3: Clearly labeled illustrative examples 4. **Maintain voice consistency** with the existing chapters 5. **Test code examples** if modifying any code (Python 3.12+) 6. **Submit a pull request** with a cle → Contributing to Introduction to Data Science
return a new string
they don't modify the original. Strings in Python are **immutable**: once created, they cannot be changed. If you want the uppercase version, you need to save it:

```python
city = "Minneapolis"
city_upper = city.upper()
print(city_upper)  # "MINNEAPOLIS"
```

Or reassign: → Chapter 3: Python Fundamentals I — Variables, Data Types, and Expressions
Rolling vs. Expanding:
Rolling = fixed window, moves through data (recent trend) - Expanding = window grows from start to current point (all-time statistic) → Key Takeaways: Working with Dates, Times, and Time Series Data
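The contrast in five values:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

roll = s.rolling(window=3).mean()  # fixed 3-value window sliding through the data
expand = s.expanding().mean()      # growing window: mean of everything so far

print(roll.tolist())    # [nan, nan, 2.0, 3.0, 4.0]
print(expand.tolist())  # [1.0, 1.5, 2.0, 2.5, 3.0]
```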
Rubric:
(a) 2 points: correctly identifies the causal vs. descriptive gap. - (b) 4 points: 2 points per plausible, well-explained alternative. - (c) 4 points: describes a study with a comparison group and random assignment (or a strong quasi-experimental design). Loses points for vague designs ("just study → Chapter 1 Quiz: What Is Data Science? (And What It Isn't)
Run the analysis both ways
with and without the missing rows — and report whether the conclusions differ. 2. **Create a "Not Reported" category** for race/ethnicity, treating it as its own group rather than deleting it. 3. **Use ZIP code as a supplementary indicator** — while not a substitute for individual race/ethnicity dat → Case Study 1: The Messy Reality of Hospital Records

S

Safeguards:
Use victim-reported crime data (calls for service) rather than arrest data - Exclude minor offenses (drug possession, loitering) that are enforcement-dependent - Cap the amount of additional policing the model can direct to any single area - Regular audits of the model's racial and geographic impact → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Sample size
larger samples give more power. (2) **Effect size** — larger effects are easier to detect. (3) **Significance level** — higher α (e.g., 0.10 vs. 0.01) gives more power but at the cost of more false positives. → Chapter 23 Quiz: Hypothesis Testing
Sample the data
use `df.sample(5000)` to plot a random subset. Fast, but loses some information. (2) **Aggregate** — use `groupby` to compute means or medians by group, reducing 500K rows to hundreds. Changes the chart type from scatter to bar or grouped scatter. (3) **Use `px.density_heatmap()`** — bin the data in → Chapter 17 Quiz: Interactive Visualization — plotly, Dashboard Thinking
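A sketch of the first two strategies on a hypothetical 500K-row DataFrame (the column names here are illustrative, not from the chapter):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=500_000),
    "value": rng.normal(size=500_000),
})

# Strategy 1: plot a random subset instead of all 500K points
subset = df.sample(5000, random_state=0)

# Strategy 2: aggregate -- reduce 500K rows to one summary row per group
means = df.groupby("group")["value"].mean()

print(len(subset))  # 5000
print(len(means))   # 3
```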
sampling bias
a systematic tendency for your sample to differ from the population in ways that affect your conclusions. → Chapter 22: Sampling, Estimation, and Confidence Intervals — How to Learn About Millions from a Handful
Sampling variability
each random sample selects different countries, producing different results. This is the fundamental randomness of sampling. (b) The distribution would be approximately **bell-shaped** (normal), centered around the true population mean. This is a preview of the Central Limit Theorem (Chapter 21). (c → Chapter 20 Quiz: Probability Thinking
SAUCE test
that you can apply to any data claim you encounter: → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
Save, document, and deploy
Mistake: not saving the fitted pipeline, not recording library versions, or not documenting assumptions and limitations. → Chapter 30 Quiz: The Machine Learning Workflow
SE Asia has the widest interval
with only 11 countries, there's substantial uncertainty. The mean could plausibly be anywhere from about 56% to 80%. → Chapter 22: Sampling, Estimation, and Confidence Intervals — How to Learn About Millions from a Handful
Section 10: References (no word count)
All data sources with URLs and access dates - Any external references cited in the analysis - Tools and libraries used (with versions) → Chapter 35: Capstone Project: A Complete Data Science Investigation
Section 1: Title and Abstract (200-300 words)
A descriptive title (not "Capstone Project" but something specific and interesting) - A brief abstract summarizing the question, data, methods, and key findings - This should be readable by someone with no data science background → Chapter 35: Capstone Project: A Complete Data Science Investigation
Section 9: Ethical Reflection (300-500 words)
Who is represented in your data, and who might be missing? - Could your findings be misused? By whom, and how? - What responsibilities do you have as the analyst? - Were there ethical tensions in the analysis itself? (e.g., privacy, consent, representation) → Chapter 35: Capstone Project: A Complete Data Science Investigation
Select and compare models
Mistake: trying only one model, or comparing models on a single train/test split instead of cross-validation. → Chapter 30 Quiz: The Machine Learning Workflow
self-join
the enrollments table is joined with itself, using different aliases (`e1` and `e2`). The `WHERE` clause filters so that `e1` only has CS101 records and `e2` only has CS201 records, while the `ON` clause ensures they're the same student. → Case Study 2: Querying a University Database — Jordan Discovers SQL
Series
A one-dimensional labeled array. A single column of a DataFrame. Has an index, values, and a name. → Key Takeaways: Introduction to pandas
Set random seeds everywhere
different results each run without them. 2. **Record library versions** — different scikit-learn versions may produce different results. 3. **Document the data source and date** — data may change over time. 4. **Save the trained pipeline** — so you can reload and verify without retraining. 5. **Incl → Chapter 30 Exercises: The Machine Learning Workflow
Shape arithmetic:
Melting a table with R rows and C value columns produces R x C rows - Pivoting reduces rows by the number of unique values in the `columns` parameter → Key Takeaways: Reshaping and Transforming Data
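The arithmetic can be verified on a small wide table (illustrative data, loosely echoing the WHO coverage format):

```python
import pandas as pd

wide = pd.DataFrame({
    "country": ["Kenya", "Peru", "Laos"],  # R = 3 rows
    "mcv1_2021": [89, 84, 79],             # C = 2 value columns
    "mcv1_2022": [90, 86, 81],
})

# Melting: R x C = 3 x 2 = 6 rows in the long table
long = wide.melt(id_vars="country", var_name="year_col", value_name="coverage")
print(len(long))  # 6

# Pivoting back: rows shrink to the number of unique index values
back = long.pivot(index="country", columns="year_col", values="coverage")
print(back.shape)  # (3, 2)
```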
Shift+Enter
**(C)** Press **Ctrl+N** - **(D)** Press **Insert** on your keyboard → Chapter 2 Quiz: Setting Up Your Toolkit
Simulation is best when:
The problem is complex and hard to solve analytically - You want to check your analytical answer - You need to explore "what if" scenarios - You want to build intuition before learning the formula → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
Simulation is your Swiss Army knife:
Generate random outcomes, count successes, divide by total - Works for problems that would be difficult to solve with formulas - Monte Carlo simulation is used throughout professional data science → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
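The recipe — generate, count, divide — in miniature, for a problem that also has a known analytical answer to check against:

```python
import numpy as np

rng = np.random.default_rng(42)

# P(at least one six in four rolls of a fair die), by simulation
n_trials = 100_000
rolls = rng.integers(1, 7, size=(n_trials, 4))   # generate random outcomes
successes = (rolls == 6).any(axis=1).sum()       # count successes
estimate = successes / n_trials                  # divide by total

exact = 1 - (5 / 6) ** 4   # analytical answer, about 0.518
print(round(estimate, 3))
```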
SIR model
one of the most fundamental tools in epidemiology. SIR stands for: - **S**usceptible: can catch the disease - **I**nfected: currently infected and can spread it - **R**ecovered: recovered and now immune → Case Study 2: Simulating Disease Spread — Elena Models Outbreak Probability
Situation
establishes the baseline (things were good in 2019). 2. **Complication** — introduces the problem (pandemic caused a decline; outbreaks occurred). 3. **Resolution** — presents the analysis finding (the decline was geographically concentrated; clinic closures explain it). 4. **Call to Action** — reco → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Slide 1: Title Slide
Title: "Protecting Our Children: Data-Driven Recommendations for Improving Vaccination Rates in [County]" - Visual: Clean title with county logo, presenter name, date - Speaker notes: "Good morning. I'm here today to share what our analysis of vaccination data tells us about where we are, where we'r → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Source 1: Coverage estimates
One CSV file downloaded from the WHO portal. It arrives in *wide format*, with each vaccine-year combination as a separate column: → Case Study 2: Reshaping Global Health Data for a WHO Report
Source 2: Population data
Total under-5 population per country (for weighting averages). → Case Study 2: Reshaping Global Health Data for a WHO Report
Source 3: Economic indicators
GDP per capita for correlation analysis. → Case Study 2: Reshaping Global Health Data for a WHO Report
spurious
it reflects their shared cause, not a direct link between them. → Chapter 24: Correlation, Causation, and the Danger of Confusing the Two
spurious correlations
statistical artifacts of looking at enough variables over enough time. → Chapter 24: Correlation, Causation, and the Danger of Confusing the Two
Standardization
Making equivalent values look the same ("NYC" and "New York City" should become a single value) 2. **Extraction** — Pulling structured information out of unstructured text (getting the number "250" out of "250mL bottle") 3. **Searching** — Finding rows that match certain patterns (all entries contai → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
Start with a title cell
Markdown with heading, author, date, purpose 2. **Use section headings** — `##` for major sections, `###` for subsections 3. **Explain before you compute** — Markdown cell before code, interpretation after 4. **Name files descriptively** — `sales-analysis-jan-2024.ipynb`, not `Untitled3.ipynb` 5. ** → Key Takeaways: Setting Up Your Toolkit
Step 1: Frame the problem
**Target:** Vaccination rate (a continuous number from 0 to 100) - **Features:** GDP per capita, healthcare spending per capita, education index, urbanization rate - **Type:** Supervised learning, regression (because the target is continuous) - **Success metric:** How close are our predictions to ac → Chapter 25: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Step 1: State the hypotheses
H₀: μ_high = μ_low (no difference in population means) - H₁: μ_high ≠ μ_low (there is a difference) → Chapter 23: Hypothesis Testing — Making Decisions with Data (and What P-Values Actually Mean)
Stratified
stratify by urban/suburban/rural and sample within each stratum. This ensures representation of each school type and produces more precise estimates when strata differ. 2. **Systematic** — select every 100th item (or some regular interval). This is efficient on an assembly line and gives good covera → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Strip whitespace
remove leading/trailing spaces with `.str.strip()` 2. **Standardize case** — convert to lowercase (or uppercase) with `.str.lower()` 3. **Remove or standardize punctuation** — remove unnecessary dots, commas, etc. 4. **Collapse whitespace** — replace multiple spaces with a single space 5. **Map know → Chapter 10 Quiz: Working with Text Data
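The steps above chain naturally in pandas; a sketch on made-up city strings (the variant map is illustrative):

```python
import pandas as pd

cities = pd.Series(["  NYC ", "new york city", "N.Y.C.", "New  York City"])

cleaned = (
    cities
    .str.strip()                            # 1. strip whitespace
    .str.lower()                            # 2. standardize case
    .str.replace(".", "", regex=False)      # 3. remove punctuation
    .str.replace(r"\s+", " ", regex=True)   # 4. collapse whitespace
    .replace({"nyc": "new york city"})      # 5. map known variants
)
print(cleaned.tolist())  # four copies of "new york city"
```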
Structured
Rows and columns with defined types. Challenge: missing values, inconsistent coding across departments (e.g., "M" vs. "Male"), and privacy restrictions. 2. **Unstructured** — Free-form text with no fixed schema. Challenge: extracting meaning from natural language — sarcasm, misspellings, varying len → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Student C is underfitting
using a model that's too simple to capture the real patterns. → Chapter 25: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
Students should be able to work offline
the campus Wi-Fi is unreliable in some buildings. → Case Study 1: Setting Up a Data Science Environment for a University Research Lab
Supervised learning comes in two flavors:
**Regression:** Predicting a continuous number (price, temperature, vaccination rate) - **Classification:** Predicting a category (spam or not spam, disease type, high vs. low vaccination) → Chapter 25: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
survivorship bias
a concept they'll learn formally later, but one they can already recognize. → Case Study 2: Querying a University Database — Jordan Discovers SQL
SyntaxError
the closing quotation mark is missing. Should be: `print("Welcome to data science!")` → Chapter 2 Exercises: Setting Up Your Toolkit

T

Target variable:
`defaulted`: Whether the borrower defaulted within 2 years (1 = yes, 0 = no) → Case Study 1: Should We Approve the Loan? A Decision Tree for Credit Risk
Target:
`home_win`: 1 if the home team won, 0 if the away team won → Case Study 2: Predicting Game Outcomes — Priya's Random Forest for NBA
Technical insights:
Net rating difference and winning percentage difference are the two most important predictors — together accounting for 50% of the model's decisions. - Rest days have a measurable but small effect — about 6% of predictive weight. - The model is 68-69% accurate across seasons, matching professional p → Case Study 2: Predicting Game Outcomes — Priya's Random Forest for NBA
Test edge cases
earliest year, latest year, each region individually. (2) **Check for empty results** — some region-year combinations may have no data; the callback should handle this gracefully (e.g., show an empty chart with a message rather than crashing). (3) **Add validation in the callback** — check if the fi → Chapter 17 Quiz: Interactive Visualization — plotly, Dashboard Thinking
Test on a small string first
use `re.findall()` on a single example 2. **Build incrementally** — start with the simplest pattern that matches *something*, then add complexity 3. **Check for unescaped special characters** — `.`, `$`, `(`, `)`, `*`, `+`, `?` all need `\` for literal matching 4. **Check greedy vs. lazy** — if matc → Key Takeaways: Working with Text Data
Test tooltips exhaustively
hover over edge cases (small countries, missing data, outliers) to ensure they display correctly. - **Choose export format based on audience** — HTML for exploration, static images for reports, Dash for recurring dashboards. - **Start simple** — one excellent interactive chart is better than a medio → Case Study 1: An Interactive Global Health Dashboard
text normalization
the process of making equivalent text values consistent. → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
The administrative data is systematically inflated
by about 16-19 percentage points based on the three survey comparisons. This could be due to denominator problems, double-counting, or reporting incentives. → Case Study 2: Estimating Global Vaccination Coverage from Incomplete Data
The author's name and date
so readers know who created the analysis and when. This matters for accountability and for understanding whether the analysis might be outdated. → Chapter 2 Quiz: Setting Up Your Toolkit
The caveats
what should the reader be cautious about? "This analysis is based on observational data and cannot establish that mobile clinics directly cause higher vaccination rates. A randomized pilot study would strengthen the evidence." → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
The context
two to three sentences establishing why this analysis was done. "Following the 2020-2021 decline in childhood vaccination rates, the Department of Health requested an analysis of county-level trends and intervention effectiveness." → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
The dartboard analogy:
High bias, low variance: tight cluster, far from bullseye - Low bias, high variance: scattered around bullseye - Low bias, low variance: tight cluster near bullseye (the goal) → Key Takeaways: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
The data:
WHO COVID-19 Vaccination Data (194 countries, 2021-2023) - World Bank Development Indicators (GDP per capita, population, education levels) - WHO Global Health Expenditure Database (healthcare spending, workforce density) - Optional: additional sources you've identified during the course → Chapter 35: Capstone Project: A Complete Data Science Investigation
The Decision Rule:
Symmetric data → report mean + standard deviation - Skewed or outlier-heavy data → report median + IQR - Always report the five-number summary - Always, always, always *plot your data* → Chapter 19: Descriptive Statistics — Center, Spread, Shape, and the Stories Numbers Tell
The headline
one sentence that captures the key finding. "Mobile clinics are the most cost-effective way to increase rural vaccination rates." → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
The key findings
three to five bullet points, each stating an insight (not just a raw result). Use plain language. Avoid jargon. Include numbers but make them meaningful. - "Rural vaccination rates dropped 12% between 2019 and 2022, compared to 3% in urban areas." - "Counties with mobile clinic programs maintained rates 8 p → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
The pairing rule:
Symmetric data: **mean + standard deviation** - Skewed data: **median + IQR** → Key Takeaways: Descriptive Statistics
The recommendation
what should the reader do? Be specific. "We recommend a pilot expansion of the mobile clinic program to 15 underserved rural counties, at an estimated annual cost of $2.3 million." → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
The rules:
**Complement:** P(not A) = 1 - P(A) - **Addition (OR):** P(A or B) = P(A) + P(B) - P(A and B) - **Multiplication (AND, independent):** P(A and B) = P(A) * P(B) - **Conditional:** P(A|B) = P(A and B) / P(B) → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
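The rules checked on a concrete example — one roll of a fair die, with A = "even" {2, 4, 6} and B = "greater than 3" {4, 5, 6}:

```python
p_a = 3 / 6
p_b = 3 / 6
p_a_and_b = 2 / 6   # A and B = {4, 6}

p_not_a = 1 - p_a                   # complement: 0.5
p_a_or_b = p_a + p_b - p_a_and_b    # addition: 4/6 = {2, 4, 5, 6}
p_a_given_b = p_a_and_b / p_b       # conditional: 2/3

print(p_not_a, p_a_or_b, p_a_given_b)
```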
thinking in vectors rather than loops
is the mental shift that unlocks pandas. When you catch yourself reaching for a `for` loop to process DataFrame data, stop and ask: "Is there a vectorized way to do this?" Almost always, the answer is yes. → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
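The shift, side by side on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# Loop thinking (avoid): process values one at a time
doubled_loop = []
for p in df["price"]:
    doubled_loop.append(p * 2)

# Vector thinking (prefer): one operation on the whole column
df["doubled"] = df["price"] * 2

print(df["doubled"].tolist())  # [20.0, 40.0, 60.0]
```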
Three immediate policies:
**Data inventory and access control:** Document what data the company collects, where it is stored, who can access it, and what it is used for. Restrict access to need-to-know basis. - **Privacy-by-design:** All new products and features must include a privacy review before launch. Default to collec → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Three things the project does well
be specific. "Good charts" is less helpful than "The chart comparing vaccination rates by income group is clear, well-labeled, and immediately communicates the main finding." 3. **Three things that could be improved** — be constructive. "Your limitations section is weak" is less helpful than "Your l → Chapter 35: Capstone Project: A Complete Data Science Investigation
Three types of missing data:
**MCAR (Missing Completely at Random):** No pattern to the missingness. Least problematic. - **MAR (Missing at Random):** Missingness is related to an observed variable. Manageable with care. - **MNAR (Missing Not at Random):** Missingness is related to the missing value itself. Most problematic — c → Key Takeaways: Your First Data Analysis
Three types of questions:
**Descriptive:** What happened? What does the data look like? (Start here.) - **Predictive:** What is likely to happen? (Requires modeling — Part V.) - **Causal:** What would happen if we changed something? (Requires experimental design — Chapter 24.) → Key Takeaways: Your First Data Analysis
Tips for self-study:
**Set a schedule and stick to it.** Consistency beats intensity. One hour a day, five days a week, is better than a single 10-hour marathon on Saturday. - **Type every code example.** Don't copy-paste. The act of typing forces your brain to engage with every character, and you'll catch details you'd → How to Use This Book
Title and axis labels
mandatory for every chart. 2. **One key finding** in the title or subtitle — the reader should know the main message without studying the chart. 3. **Reference lines or annotations** for the most important comparison. 4. **Data labels** on the most important data points (not all of them). 5. **Sourc → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Too big:
"I'm going to build a real-time stock price prediction system with a streaming data pipeline." - "I want to analyze every tweet ever posted about climate change." - "I'm building a recommendation engine for every movie on Netflix." → Chapter 34: Building Your Portfolio: Projects That Get You Hired
Too small:
"I loaded a dataset and made a bar chart." - "I calculated the mean and standard deviation of this column." → Chapter 34: Building Your Portfolio: Projects That Get You Hired
Too vague
"interesting" isn't specific enough to guide analysis. 2. **Good for EDA** — specific column, specific filter, specific statistic. 3. **Unanswerable with this data** — "why" requires causal information (conflict, infrastructure) not in the dataset. 4. **Good for EDA** — you could compute the standar → Chapter 6 Exercises: Your First Data Analysis
top to bottom
stops at the first `True` condition. - `elif` and `else` are optional. You can have just `if`. - You can have as many `elif` branches as you need. - Only **one** branch executes in an `if`/`elif`/`else` chain. → Key Takeaways: Control Flow, Functions, and Thinking Like a Programmer
True
any nonzero number is truthy. 2. `bool(0)` = **False** --- zero is falsy. 3. `bool(-1)` = **True** --- negative numbers are nonzero, therefore truthy. 4. `bool("")` = **False** --- empty string is falsy. 5. `bool(" ")` = **True** --- a space character makes the string non-empty. 6. `bool("0")` = **T → Answers to Selected Exercises
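These answers can be checked directly:

```python
print(bool(1))    # True  -- nonzero number
print(bool(0))    # False -- zero is falsy
print(bool(-1))   # True  -- negative but nonzero
print(bool(""))   # False -- empty string
print(bool(" "))  # True  -- a space makes the string non-empty
print(bool("0"))  # True  -- non-empty string, even though it looks like zero
```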
Tune hyperparameters
Mistake: hand-tuning one parameter at a time instead of searching systematically, or searching too fine a grid (overfitting hyperparameters). → Chapter 30 Quiz: The Machine Learning Workflow
Type errors
columns where the data type doesn't match what it should be. A column of numbers that pandas reads as text because one cell contains "N/A." A date column stored as a string. A ZIP code that lost its leading zero because Excel treated it as a number. → Chapter 8: Cleaning Messy Data: Missing Values, Duplicates, Type Errors, and the 80% of the Job
type inference
it examines the values and converts them to appropriate Python/NumPy types (int64, float64, object). The Python concept is **type conversion** (Ch.3) — pandas just does it for you automatically. > > > **From Chapter 4:** The `apply()` method takes a function as an argument. What is this p → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
TypeError
you can't use `+` to combine a string (`"The average is: "`) with a number (`average`). Fix options: `print("The average is:", average)` (using comma) or `print("The average is: " + str(average))` (converting number to string). → Chapter 2 Exercises: Setting Up Your Toolkit
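Both fixes in one runnable snippet (the value 87.5 is a placeholder):

```python
average = 87.5

# print("The average is: " + average)  # TypeError: str + float

# Fix 1: let print() join the pieces (comma inserts a space)
print("The average is:", average)

# Fix 2: convert the number to a string first
print("The average is: " + str(average))
```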

U

University mental health survey:
Population: All students at the university. - Sample: The 400 randomly selected students. - Parameter: True proportion of all students who use the mental health center. - Statistic: Proportion in the sample who use the center. → Answers to Selected Exercises
Use a decision tree when:
Interpretability is the top priority — your audience needs to understand *why* the model makes each prediction - You're building a preliminary model to explore which features matter - The dataset is small and a complex model would overfit - You need to explain the model to non-technical stakeholders → Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss
Use a histogram when:
You want to see exact counts per bin - Your audience is not statistically sophisticated (histograms are universally understood) - You are presenting a single distribution → Chapter 16: Statistical Visualization with seaborn
Use a random forest when:
Accuracy is the top priority and you can sacrifice some interpretability - You have a medium-to-large dataset - You want a model that's robust to small changes in the data - You want reliable feature importance scores - You're in a competitive setting (random forests are strong default models for ma → Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss
Use a regular loop when:
You need multiple steps per iteration - You need to handle errors or edge cases - The logic is complex enough that a comprehension would be hard to read → Chapter 5: Working with Data Structures: Dictionaries, Files, and Thinking in Data
Use comprehensions when:
The transformation is simple (one expression, optionally one filter) - The result is a new list or dictionary - The code is still readable on one or two lines → Chapter 5: Working with Data Structures: Dictionaries, Files, and Thinking in Data
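A sketch of the "one expression, optionally one filter" shape, with the loop equivalent for comparison:

```python
temps_f = [32, 68, 104]

# Loop version
temps_c_loop = []
for f in temps_f:
    temps_c_loop.append((f - 32) * 5 / 9)

# Comprehension: one expression...
temps_c = [(f - 32) * 5 / 9 for f in temps_f]

# ...optionally one filter
above_freezing = [f for f in temps_f if f > 32]

print(temps_c)         # [0.0, 20.0, 40.0]
print(above_freezing)  # [68, 104]
```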
Use ECDF when:
You want to compare distributions without the visual confusion of overlapping curves - You need to read off percentiles directly (the y-axis gives cumulative probability) - You want a representation that does not depend on bin width or bandwidth choices → Chapter 16: Statistical Visualization with seaborn
Use KDE when:
You want to compare multiple distributions on the same axes (overlapping KDEs are clearer than overlapping histograms) - You want to emphasize the smooth shape of the distribution - You have enough data points (at least 50-100) for a reliable density estimate → Chapter 16: Statistical Visualization with seaborn
Use regex when:
You need to match a *pattern* rather than a fixed string ("any sequence of digits") - You need to *extract* part of a string (capture groups with `.str.extract()`) - You need to match with *flexibility* (one word OR another, optional characters) - You need *anchoring* (must start with, must end with → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
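A minimal extraction example — pulling the number out of strings like "250mL bottle" with a capture group (sample data is illustrative):

```python
import pandas as pd

items = pd.Series(["250mL bottle", "1000mL jug", "50mL vial"])

# (\d+) captures one or more digits immediately before "mL"
volumes = items.str.extract(r"(\d+)mL")[0].astype(int)
print(volumes.tolist())  # [250, 1000, 50]
```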
Use rug when:
You want to show individual observations alongside another distribution plot - Your dataset is small to moderate (under a few hundred points) - You want to verify that the KDE or histogram is not masking gaps or clusters → Chapter 16: Statistical Visualization with seaborn
Use simple string methods when:
You're doing case conversion (`.str.lower()`, `.str.upper()`) - You're stripping whitespace (`.str.strip()`) - You're replacing a known, fixed substring (`.str.replace("old", "new", regex=False)`) - You're splitting on a simple delimiter (`.str.split(",")`) - You're checking for a known, fixed subst → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
Use something else when:
You have a very large dataset (>100K samples) and need fast training — consider gradient boosting (XGBoost, LightGBM) - You need a linear model for inference or theoretical reasons — stick with logistic regression - Your data has a strong linear structure — tree-based models can struggle with simple → Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss

V

Vectorized operations
Applying an operation to an entire column at once (`df["col"] * 2`) rather than looping through values one by one. Faster, safer, more readable. → Key Takeaways: Introduction to pandas
Version Control:
[ ] The project is in a git repository - [ ] All changes are committed with descriptive messages - [ ] The `.gitignore` excludes data files, environments, and secrets → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Voluntary response bias
only people with strong feelings (very happy or very unhappy) bother filling out cards. The 85% likely overestimates satisfaction because moderately satisfied people don't fill out cards, but the few angry ones often do, creating an odd mix. 2. **Voluntary response / self-selection bias** — people w → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals

W

Web scraping
extracting data from websites programmatically - **Surveys and experiments** that you design and conduct yourself → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
Weekly (15--30 minutes):
Skim one or two data science newsletters. Recommended: *Data Science Weekly*, *The Batch* (by Andrew Ng), or *Towards Data Science* (on Medium). - Browse the front page of [/r/datascience](https://www.reddit.com/r/datascience/) on Reddit. → Appendix E: Frequently Asked Questions
What context is missing
baselines, denominators, comparisons? → Case Study 2: The Gallery of Misleading Charts — A Forensic Analysis
What he used:
Spotify Web API audio features (tempo, energy, danceability, valence, acousticness, instrumentalness, speechiness, loudness) for tracks appearing on the Billboard Hot 100, 2004-2024 - Billboard Hot 100 chart data (song titles, artists, peak position, weeks on chart) - He collected data for approxima → Case Study 2: An Alternative Capstone: Analyzing Your Own Dataset
What I'm working on:
Analyzing global vaccination rate disparities using WHO data - Building interactive dashboards with Plotly - Learning SQL and cloud data tools → Chapter 34: Building Your Portfolio: Projects That Get You Hired
What not to do:
Do not try to learn every new framework that appears on Hacker News. Most will be irrelevant to your work. - Do not feel inadequate because someone on Twitter is discussing techniques you have not learned. Everyone's knowledge has gaps. - Do not confuse reading about data science with doing data sci → Appendix E: Frequently Asked Questions
What she found:
Census tract-level demographic data from the American Community Survey (ACS), including median household income, racial composition, educational attainment, and housing tenure (rent vs. own) - Zillow Home Value Index data at the ZIP code level - Building permit data from the city's open data portal, → Case Study 2: An Alternative Capstone: Analyzing Your Own Dataset
What she would change:
The county map as a scatter plot was functional but looked amateurish compared to a real choropleth with county boundaries. For the next election, she would use GeoJSON county boundaries with `px.choropleth_mapbox()`. - The 15-minute refresh cycle felt slow on election night. A true real-time dashbo → Case Study 2: Election Night Live — Building an Interactive Results Tracker
What the numbers show:
Routes serving downtown offices are carrying about half the riders they did in 2019 - Routes serving hospitals, schools, and retail areas are back to about 95% of pre-pandemic ridership - On the most recovered routes, buses are overcrowded during peak hours → Case Study 2: Writing for Different Audiences — Technical Report vs. Blog Post vs. Slide Deck
What to do about multicollinearity:
**For prediction:** Often nothing. Multicollinearity doesn't affect prediction accuracy much — it affects coefficient interpretation. If you only care about making good predictions, you can often ignore it. → Chapter 26: Linear Regression — Your First Predictive Model
What to document:
The seed value you used - Any library-specific seeding (some libraries have their own random number generators) → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
What to write about:
Take one of your portfolio projects and write it up as a narrative blog post. Not a tutorial ("how to build a random forest") but an investigation story ("What I learned about global vaccination disparities by building three different models"). - Write about a concept you struggled to understand. "A → Chapter 34: Building Your Portfolio: Projects That Get You Hired
What worked:
The tooltip design was critical. Board members and casual readers both praised the ability to hover over a county and see exact numbers without navigating to a separate table. - The demographic scatter was the most shared chart on social media. People found the education-voting correlation striking → Case Study 2: Election Night Live — Building an Interactive Results Tracker
What you'll do:
Combine and refine all the work you've done across chapters 1-34 - Fill any analytical gaps (sections you skipped, analyses you started but didn't finish) - Add new analysis where needed to tell a complete story - Polish everything into a single, cohesive narrative notebook - Write an executive summ → Chapter 35: Capstone Project: A Complete Data Science Investigation
When deletion is appropriate:
The missing values are a small percentage of your data (often cited as less than 5%) - The data is **missing completely at random** (MCAR) — the reason for missingness has nothing to do with the value itself or any other variable - You have enough data that losing some rows won't affect your analysi → Chapter 8: Cleaning Messy Data: Missing Values, Duplicates, Type Errors, and the 80% of the Job
When deletion is dangerous:
The missingness is *not* random. If low-income patients are more likely to have missing insurance information, dropping those rows silently removes low-income patients from your analysis. Your results now describe only the people with complete records — which may not represent the population you car → Chapter 8: Cleaning Messy Data: Missing Values, Duplicates, Type Errors, and the 80% of the Job
When to branch:
New feature or analysis - Experimental work - Bug fixes - Each team member's parallel work → Key Takeaways: Reproducibility and Collaboration: Git, Environments, and Working with Teams
When to recommend each format:
**CSV** when simplicity and compatibility matter, and the data is a single flat table. - **Excel** when sharing with non-technical stakeholders or when the data naturally has multiple related sheets. - **JSON** when the data is hierarchical or comes from a web API. - **Database** when the data is la → Chapter 12: Getting Data from Files — CSVs, Excel, JSON, and Databases
When to use each:
- **Two-tailed:** Default choice. Use when you don't have a strong directional prediction, or when an effect in either direction would be interesting.
- **One-tailed:** Use only when you have a clear directional hypothesis *stated before looking at the data*, and an effect in the other direction would not be interesting → Chapter 23: Hypothesis Testing — Making Decisions with Data (and What P-Values Actually Mean)
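A rough sketch of the one- vs. two-tailed difference, using the 63-heads-in-100-tosses example from the "as extreme as or more extreme than" entry and a normal approximation (no continuity correction, so the numbers are approximate):

```python
from statistics import NormalDist

# 63 heads in 100 tosses; null hypothesis says the coin is fair (p = 0.5).
n, k, p0 = 100, 63, 0.5
mean = n * p0                          # 50
sd = (n * p0 * (1 - p0)) ** 0.5       # 5
z = (k - mean) / sd                   # 2.6

p_one_tailed = 1 - NormalDist().cdf(z)   # P(63 or more heads)
p_two_tailed = 2 * p_one_tailed          # extreme in EITHER direction
print(f"z = {z:.2f}")
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```

The two-tailed p-value is exactly double the one-tailed one here, which is why a one-tailed test chosen *after* seeing the data is a way of quietly halving your p-value.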
When to use KDE vs. histograms:
- Use histograms when you want to see the exact count in each bin and when your audience is less technical.
- Use KDE when you want to compare distributions across groups (overlapping KDEs are easier to read than overlapping histograms).
- Use both together when exploring data for yourself. → Chapter 16: Statistical Visualization with seaborn
When to use which:
- **Standard deviation** when your data is roughly symmetric and doesn't have extreme outliers.
- **IQR** when your data is skewed or has outliers. → Chapter 19: Descriptive Statistics — Center, Spread, Shape, and the Stories Numbers Tell
Where to blog:
- **Medium** (and specifically its data science publications like *Towards Data Science*) has the largest built-in audience for data science content.
- **dev.to** is popular among developers and data practitioners.
- **A personal website** using GitHub Pages, Jekyll, Hugo, or a simple site builder gives you full control → Chapter 34: Building Your Portfolio: Projects That Get You Hired
Where to set seeds:
- At the top of every notebook or script
- In every function that uses randomness
- As the `random_state` parameter in scikit-learn functions → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
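A minimal sketch of why seeding works, using Python's built-in `random` module (the same idea applies to `numpy.random.seed` and scikit-learn's `random_state`):

```python
import random

random.seed(42)          # set once at the top of the notebook or script
first_run = [random.randint(1, 100) for _ in range(5)]

random.seed(42)          # same seed -> the identical "random" sequence
second_run = [random.randint(1, 100) for _ in range(5)]

print(first_run == second_run)   # True: the run is reproducible
```

Anyone who runs this code with seed 42 gets the same five numbers, which is the whole point for reproducibility.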
which type of error is more costly?
In **medical screening**, missing a sick person (false negative) could be fatal. *Prioritize recall.* You'd rather have false alarms than missed diagnoses. → Chapter 29: Evaluating Models — Accuracy, Precision, Recall, and Why "Good" Depends on the Question
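The precision/recall trade-off above can be made concrete with confusion-matrix counts. This sketch uses hypothetical screening numbers, not data from the book:

```python
# Hypothetical screening test: 100 sick patients, 900 healthy ones.
tp, fp = 90, 40   # sick correctly flagged; healthy wrongly flagged
fn, tn = 10, 860  # sick missed (the costly error here); healthy cleared

precision = tp / (tp + fp)   # of flagged patients, how many are actually sick?
recall    = tp / (tp + fn)   # of sick patients, how many did we catch?

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here recall is 0.90: we catch 90% of sick patients at the cost of 40 false alarms, which is the trade a screening context usually accepts.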
word boundary
the position between a word character (`\w`) and a non-word character. It matches a *position*, not a character. This is why `r"\bcat\b"` matches "cat" as a whole word but not "catfish" or "concatenate." Note: outside of regex, `\b` does mean backspace in regular Python strings, which is another reason to always write regex patterns as raw strings → Chapter 10 Quiz: Working with Text Data
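A quick demonstration of both points above — the whole-word matching and the raw-string pitfall:

```python
import re

pattern = r"\bcat\b"   # raw string: \b is a word boundary

print(bool(re.search(pattern, "the cat sat")))    # True: whole word
print(bool(re.search(pattern, "catfish")))        # False: no boundary after "cat"
print(bool(re.search(pattern, "concatenate")))    # False: "cat" is mid-word

# Without the r prefix, \b is the backspace character, so nothing matches:
print(re.search("\bcat\b", "the cat sat"))        # None
```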
Works for ANY shape
uniform, skewed, bimodal, anything
2. **n >= 30 is usually enough** (more for heavily skewed data)
3. **Standard error = sigma / sqrt(n)** — larger samples give more precise means → Key Takeaways: Distributions and the Normal Curve
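A small simulation sketch of points 1 and 3, using a heavily skewed population (an exponential distribution with mean 10, so sigma = 10 — an illustrative choice, not the book's example):

```python
import random
from statistics import mean, stdev

random.seed(0)
population_sigma = 10.0   # for an exponential with mean 10, sigma is also 10

def sample_mean(n):
    # Draw n values from the skewed population and average them.
    return mean(random.expovariate(1 / 10.0) for _ in range(n))

n = 30
means = [sample_mean(n) for _ in range(2000)]   # 2000 sample means

observed_se = stdev(means)                      # spread of the sample means
predicted_se = population_sigma / n ** 0.5      # sigma / sqrt(n)
print(f"observed SE ~ {observed_se:.2f}, predicted SE = {predicted_se:.2f}")
```

Despite the skewed population, the spread of the 2000 sample means lands close to sigma / sqrt(n), which is the CLT doing its work at n = 30.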
Write and run code cells
use `print()`, do arithmetic, understand cell output
- [ ] **Write and run Markdown cells** — create headings, bold, italic, lists, links
- [ ] **Switch between cell types** — use the dropdown or M/Y shortcuts
- [ ] **Use keyboard shortcuts** — at minimum: Shift+Enter, Esc, Enter, A, B, D-D, M, Y → Key Takeaways: Setting Up Your Toolkit
Writing insights:
The decision tree was more useful for storytelling than the random forest. The tree's simple rules ("if the home team has a better record AND a better net rating, they'll probably win") are easy to explain in an article. The random forest's improved accuracy came at the cost of narrative clarity. → Case Study 2: Predicting Game Outcomes — Priya's Random Forest for NBA

Y

Yes
the variable is still in memory from when you previously ran cell 3. Python doesn't "know" that the cell was deleted.
2. **NameError** — after a kernel restart, all variables are cleared. Cell 3 no longer exists to recreate `patient_count`.
3. Best practices: (a) periodically do Kernel → Restart & Run All → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
Your brain's probability bugs:
- Gambler's fallacy (thinking past outcomes affect future independent events)
- Base rate neglect (ignoring priors)
- Underestimating AND probabilities and OR probabilities (birthday problem) → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies

Z

Z-Score Method:
- z = (x - mean) / standard deviation
- |z| > 2: unusual
- |z| > 3: very unusual
- Less robust — based on the mean, which outliers themselves distort → Key Takeaways: Descriptive Statistics
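A minimal sketch of the z-score rule above on a tiny made-up sample (note how the outlier itself inflates the mean and standard deviation, which is the "less robust" caveat in action):

```python
from statistics import mean, stdev

values = [12, 14, 13, 15, 14, 13, 12, 40]   # 40 looks like an outlier
mu, sd = mean(values), stdev(values)        # both are pulled up by the 40

z_scores = [(x - mu) / sd for x in values]
unusual = [x for x, z in zip(values, z_scores) if abs(z) > 2]
print(unusual)
```

With a more extreme outlier, the inflated standard deviation can shrink every |z| below 2 and hide the outlier entirely, which is why the IQR method is preferred for skewed data.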