Exercises: Data Wrangling — Cleaning and Preparing Real Data

These exercises progress from concept checks to hands-on cleaning tasks and real-world judgment calls. Estimated completion time: 3 hours.

Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)


Part A: Conceptual Understanding ⭐

A.1. Explain the difference between MCAR, MAR, and MNAR in your own words. For each type, give a real-world example not used in the chapter.

A.2. A dataset of 1,000 student exam scores has 50 missing values (5%). A classmate says, "It's only 5%, so let's just delete those rows and move on." Under what circumstances would this be a good strategy? Under what circumstances would it be a bad strategy — even though the percentage is small?

A.3. What does the phrase "garbage in, garbage out" mean in the context of data analysis? Give a specific example of how a single data entry error could lead to a wrong conclusion.

A.4. Explain the difference between data cleaning and data wrangling. Are they the same thing? Which is broader?

A.5. Why is it important to create a "missing data flag" when you impute values? What information would you lose if you simply replaced the missing values without tracking which ones were originally missing?

A.6. What are the three rules of tidy data? Give an example of a dataset that violates one of these rules and show how you would fix it.
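Hint: for one common violation (variable values stored in column headers), pd.melt() is the standard fix. A minimal sketch with made-up data:

```python
import pandas as pd

# Messy: the year is data, but it lives in the column headers
wide = pd.DataFrame({'country': ['A', 'B'], '2022': [10, 20], '2023': [12, 22]})

# Tidy: one variable per column, one observation per row
tidy = wide.melt(id_vars='country', var_name='year', value_name='cases')
print(tidy)
```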

A.7. A friend says, "I just used Find and Replace in Excel to change all the blank cells to zero." Explain why this might be a terrible idea, even if it makes the spreadsheet "look clean."
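Hint: a quick pandas experiment shows what blank-to-zero does to a summary statistic (toy numbers, purely illustrative):

```python
import pandas as pd

scores = pd.Series([80, 90, None, 85])
print(scores.mean())            # pandas skips NaN: 85.0
print(scores.fillna(0).mean())  # blank treated as zero: 63.75
```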


Part B: Identifying Data Quality Issues ⭐⭐

B.1. Examine the following small dataset of patient records. Identify at least five data quality issues.

patient_id  age  gender  blood_pressure  temperature  diagnosis
P001        34   Male    120/80          98.6         Flu
P002        28   female  135/90          101.2        flu
P003        -5   M       110/70          99.1         Flu
P004        67   Male    140/95          9999         Influenza
P005        45   Female                  100.4        Flu
P006        28   female  135/90          101.2        flu
P001        34   Male    120/80          98.6         Flu
P007        892  MALE    125/82          37.2         Flu

B.2. For each issue you identified in B.1, classify it as one of: (a) missing data, (b) duplicate, (c) inconsistent formatting, (d) impossible/out-of-range value, (e) data type issue, or (f) tidy data violation.

B.3. The following pandas output shows .value_counts() for a "state" column in a U.S. survey dataset:

California     234
california      87
CA              56
CALIFORNIA      12
Calif           8
Calif.          5
ca              3
Cali            2

(a) How many total respondents are from California?
(b) Write a Python code snippet that would standardize all of these to a single value.
(c) After standardizing, would any missing data remain? How might you handle entries like "Cali" — are you confident that's California and not some other place?

B.4. You run df.describe() on a dataset and get the following output for an "income" column:

count    4500.000000
mean     8234.567890
std      45231.123456
min      -99999.000000
25%       32000.000000
50%       55000.000000
75%       82000.000000
max      999999.000000

Identify at least three red flags in this output. For each, explain what might have caused it and how you would investigate further.
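Hint: each red flag can be chased down with a boolean filter. A sketch on a toy stand-in for the survey data (the real df would be loaded from file):

```python
import pandas as pd

# Fabricated values echoing the suspicious extremes in the describe() output
df = pd.DataFrame({'income': [32000, -99999, 55000, 999999, 82000]})

# Sentinel codes for "missing" often hide at the extremes
print(df[df['income'] < 0])            # candidate missing-value code
print(df[df['income'] >= 999999])      # candidate placeholder / data-entry cap
print((df['income'] == -99999).sum())  # how many rows carry the sentinel
```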


Part C: Hands-On Cleaning with Python ⭐⭐

C.1. Write Python code to accomplish each of the following tasks (use pandas):

(a) Count the number of missing values in every column of a DataFrame called df.
(b) Drop all rows where the column email is missing.
(c) Fill missing values in the column salary with the median salary.
(d) Create a new column salary_was_missing that equals 1 where salary was originally missing and 0 otherwise.
(e) Remove rows that are exact duplicates.

C.2. Given this DataFrame:

import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'charlie', 'DIANA', 'Eve ', ' frank'],
    'department': ['Sales', 'sales', 'SALES', 'Marketing', 'marketing', 'MARKETING'],
    'salary': [55000, 62000, 58000, 71000, None, 65000]
}
df = pd.DataFrame(data)

(a) Standardize the name and department columns so all values are lowercase with no leading/trailing whitespace.
(b) Fill the missing salary with the mean salary of that person's department (after standardizing department names).
(c) Create a new column salary_level that categorizes salaries as "Low" (< 60000), "Medium" (60000-70000), and "High" (> 70000).

C.3. You have a column called date_of_birth with the following entries:

'1990-03-15', '03/15/1990', 'March 15 1990', '15-03-1990', None, '1990'

(a) Which of these can be parsed by pd.to_datetime() with default settings?
(b) Write code to convert as many of these as possible to datetime format, using errors='coerce' to handle unparseable values.
(c) Explain what errors='coerce' does and why it's useful during data cleaning.
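Hint: the effect of errors='coerce' is easy to see on a throwaway example (unrelated to the dates above):

```python
import pandas as pd

parsed = pd.to_datetime(['2024-01-15', 'not a date'], errors='coerce')
print(parsed)  # the unparseable entry becomes NaT, pandas' "not a time"
```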

C.4. Write a Python function called assess_missing_data that takes a DataFrame as input and prints:
- The total number of rows and columns
- The number and percentage of missing values per column
- A list of columns with more than 10% missing values (as a warning)

def assess_missing_data(df):
    # Your code here
    pass

Part D: Binning and Feature Engineering ⭐⭐

D.1. You have a dataset with a continuous variable age ranging from 0 to 95. Create the following bins using pd.cut():

(a) Age groups: "Child" (0-12), "Teen" (13-17), "Young Adult" (18-34), "Middle Aged" (35-54), "Senior" (55+)
(b) Equal-width bins with 5 bins
(c) Equal-frequency bins (quartiles) using pd.qcut()

For each, explain when you would choose that binning strategy.
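Hint: the general pd.cut() pattern, shown on a toy age list. Note that bins are right-inclusive by default, so include_lowest=True is needed to keep age 0:

```python
import pandas as pd

ages = pd.Series([0, 13, 25, 40, 70])
groups = pd.cut(
    ages,
    bins=[0, 12, 17, 34, 54, 120],
    labels=['Child', 'Teen', 'Young Adult', 'Middle Aged', 'Senior'],
    include_lowest=True,  # otherwise age 0 falls outside the first bin (0, 12]
)
print(groups.tolist())
```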

D.2. Given a dataset with columns exam1, exam2, exam3, and homework_avg:
(a) Create a new column final_grade calculated as: 25% exam1 + 25% exam2 + 30% exam3 + 20% homework_avg
(b) Create a letter grade column using bins: A (90-100), B (80-89), C (70-79), D (60-69), F (0-59)
(c) Explain why this is an example of feature engineering.

D.3. A dataset has separate columns for systolic_bp and diastolic_bp (blood pressure). Create:
(a) A bp_category variable using standard medical categories:
    - Normal: systolic < 120 AND diastolic < 80
    - Elevated: systolic 120-129 AND diastolic < 80
    - High: systolic >= 130 OR diastolic >= 80
(b) Explain why this new variable might be more useful for analysis than the two separate numeric columns.
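Hint: np.select() is one common way to turn a set of ordered conditions into a category column. A sketch with fabricated readings (the cutoffs follow the exercise):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'systolic_bp': [118, 125, 125, 140],
                   'diastolic_bp': [75, 78, 85, 90]})

conditions = [
    (df['systolic_bp'] < 120) & (df['diastolic_bp'] < 80),   # Normal
    (df['systolic_bp'] <= 129) & (df['diastolic_bp'] < 80),  # Elevated
]
# Conditions are checked in order; anything left over gets the default
df['bp_category'] = np.select(conditions, ['Normal', 'Elevated'], default='High')
print(df)
```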


Part E: Missing Data Strategy ⭐⭐⭐

E.1. For each of the following scenarios, identify the most likely missing data mechanism (MCAR, MAR, or MNAR) and recommend a handling strategy. Justify your answer.

(a) A weather station's temperature sensor occasionally malfunctions during storms, producing missing readings.
(b) In a survey about exercise habits, respondents who exercise very little tend to leave the "hours of exercise per week" field blank.
(c) A hospital database has missing zip codes for some patients. Investigation reveals that the missing zip codes are concentrated in the emergency department, where intake staff are too busy to collect non-essential information.
(d) An online retail dataset has missing product reviews. Products with very few sales are more likely to have no reviews.
(e) In a school database, some students' standardized test scores are missing because they were absent on test day due to random illness.

E.2. A researcher has a dataset of 10,000 survey responses about mental health. The key variable — "Have you ever been diagnosed with depression?" — has 25% missing values. The researcher decides to impute the missing values with the mode (which is "No").

(a) Why is this a particularly dangerous imputation choice for this variable?
(b) What type of missing data mechanism is most likely at work here?
(c) What would be a better approach?
(d) How would the imputation affect the estimated prevalence of depression in the sample?

E.3. You're analyzing two approaches to handling missing data in a dataset with 2,000 rows and 15% missing values in the key variable.

Approach A: Delete all rows with missing values (listwise deletion)
Approach B: Impute with the column median

For each approach:
(a) What happens to your sample size?
(b) What happens to the mean if the data is MNAR and high values are more likely to be missing?
(c) What happens to the standard deviation?
(d) Which approach would you recommend, and under what assumptions?
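Hint: a small simulation makes the MNAR effects concrete. This sketch fabricates income-like data, removes the top 15% of values (MNAR by construction), and compares the two approaches:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
true = pd.Series(rng.normal(60000, 15000, 2000))

# MNAR: the high values are the ones that go missing
observed = true.mask(true > true.quantile(0.85))

deleted = observed.dropna()                   # Approach A: listwise deletion
imputed = observed.fillna(observed.median())  # Approach B: median imputation

print(f"true mean {true.mean():.0f}, A mean {deleted.mean():.0f}, B mean {imputed.mean():.0f}")
print(f"true std  {true.std():.0f}, A std  {deleted.std():.0f}, B std  {imputed.std():.0f}")
```

Both approaches underestimate the mean here; imputation also shrinks the spread, because every missing value collapses onto a single number.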


Part F: Ethics and Judgment ⭐⭐⭐

F.1. Alex Rivera is analyzing StreamVibe data to compare watch time between users who received a new recommendation algorithm (treatment) and those who didn't (control). During data cleaning, he discovers that 18% of treatment-group users have missing watch-time data, compared to only 3% of control-group users.

(a) Is this missing data likely MCAR? Why or why not?
(b) If Alex deletes rows with missing watch time, how would this affect his comparison of the two groups?
(c) Suggest a better approach.
(d) Why does this scenario illustrate Theme 6 (ethical data practice)?

F.2. Professor Washington is examining a predictive policing algorithm's arrest data. He discovers that the "race" variable has the following pattern of missing data:

Race Category   % Missing
White           5%
Black           8%
Hispanic        22%
Asian           12%
Other/Unknown   45%

(a) What does the pattern of missingness suggest about the data collection process?
(b) If Washington deletes all rows with missing race data, whose experiences would be most underrepresented in his analysis?
(c) How could this affect his conclusions about racial bias in the algorithm?
(d) What would you recommend he do instead?

F.3. A health insurance company is using data to predict which customers will file expensive claims. During data cleaning, they decide to:
- Remove all records from customers over age 80 (calling them "outliers")
- Impute missing income data with the national median
- Drop a column recording disability status because it has 40% missing values

For each decision, explain:
(a) How it could bias the predictions
(b) Who might be harmed by this bias
(c) What an ethical alternative would look like


Part G: Comprehensive Scenarios ⭐⭐⭐⭐

G.1. You receive the following raw dataset of customer orders:

order_id  customer  email            date        amount     category       city
101       Smith, J  jsmith@mail.com  2024-01-15  45.99      Electronics    new york
102       JONES, A                   2024-01-15  -12.50     electronics    New York
103       smith, j  jsmith@mail.com  01/16/2024  78.00      Home & Garden  NYC
104       Davis, M  mdavis@mail.com  2024-01-17  999999.99                 Chicago
105       Brown, K  kbrown@mail.com  2024-01-18  23.50      Food           chicago
101       Smith, J  jsmith@mail.com  2024-01-15  45.99      Electronics    new york

(a) Write a complete data cleaning plan identifying every issue you see.
(b) For each issue, specify the pandas code you would use to fix it.
(c) Write a cleaning log documenting your decisions and justifications.
(d) What assumptions are you making? What risks do your cleaning decisions introduce?

G.2. Maya's county health department receives flu surveillance data from 12 clinics. After merging the files, she has 15,000 patient records with the following missing data profile:

Column              % Missing  Suspected Mechanism
age                 3%         MCAR (random data entry errors)
gender              1%         MCAR
race/ethnicity      18%        MAR (higher in urgent care clinics)
vaccination_status  12%        MNAR (unvaccinated patients less likely to know/report)
insurance_type      8%         MAR (higher in emergency patients)
symptom_onset_date  22%        MAR (patients seen days later may not remember exact date)
hospitalized        2%         MCAR
temperature         5%         MCAR (equipment issues)

Write a detailed cleaning plan for Maya that includes:
(a) Your recommended strategy for each variable (with justification)
(b) Which variables, if any, should be dropped entirely
(c) What sensitivity analyses you would recommend (i.e., "check whether conclusions change if you use a different strategy")
(d) How to document the cleaning decisions in a way that satisfies public health reporting requirements

G.3 (Research Extension). The concept of "tidy data" was formalized by Hadley Wickham in a 2014 paper. Research and answer the following:

(a) Find Wickham's paper (hint: published in Journal of Statistical Software). What are the three rules of tidy data as he defines them?
(b) Wickham describes five common "messy data" patterns. List at least three.
(c) Why does tidy data make analysis easier? Give a specific example using pandas groupby().
(d) Is tidy data always the best format? When might "wide" format be preferable? (Think about how you might want to present data in a report vs. how you need it for analysis.)