Key Takeaways: Data Wrangling — Cleaning and Preparing Real Data
One-Sentence Summary
Data cleaning is the unglamorous foundation of all analysis — handling missing values, fixing inconsistencies, removing duplicates, transforming variables, and documenting every decision so your results are transparent and reproducible.
Core Concepts at a Glance
| Concept | Definition | Why It Matters |
|---|---|---|
| Missing data mechanisms | MCAR, MAR, MNAR — why data is missing | Determines which handling strategies are appropriate |
| Imputation | Replacing missing values with estimates | Preserves sample size but introduces assumptions |
| Tidy data | Each variable a column, each observation a row, each value a cell | The standard structure for analysis-ready data |
| Reproducibility | Others can follow your steps to the same result | The foundation of trustworthy, transparent analysis |
| Feature engineering | Creating new variables from existing ones | Captures patterns that raw variables might hide |
Decision Guide: Handling Missing Data
```text
Missing data detected in a column
│
├── How much is missing?
│   ├── < 5%   → Simple methods usually fine
│   ├── 5-20%  → Think carefully about the mechanism
│   └── > 20%  → Consider dropping the column or using advanced methods
│
├── What TYPE of variable?
│   ├── Numerical + symmetric distribution → Mean imputation
│   ├── Numerical + skewed distribution    → Median imputation
│   └── Categorical                        → Mode imputation
│
├── What's the MECHANISM?
│   ├── MCAR → Deletion or imputation both OK
│   ├── MAR  → Group-based imputation preferred
│   └── MNAR → No simple fix; document the limitation
│
└── ALWAYS: Create a missing-data flag + document your decision
```
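The flag-then-impute branch of the tree can be sketched in pandas. The toy DataFrame and column names here are hypothetical, chosen only to illustrate: a skewed numeric column gets the median, a categorical column gets the mode, and the flag is created *before* any value is filled in.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: 'age' is numeric and skewed, 'sex' is categorical.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29, 88, np.nan],
    "sex": ["F", "M", "F", np.nan, "M", "F"],
})

# ALWAYS: flag missingness first, so the information survives imputation.
df["age_was_missing"] = df["age"].isna()

# Skewed numeric column → median; categorical → mode.
df["age"] = df["age"].fillna(df["age"].median())
df["sex"] = df["sex"].fillna(df["sex"].mode()[0])
```

The flag column costs almost nothing and lets a later model or audit check whether missingness itself carries signal.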
Missing Data Mechanisms
| Mechanism | What It Means | Example | Safe to Delete? |
|---|---|---|---|
| MCAR | Missingness unrelated to any variable | Coffee spilled on random forms | Yes (remaining data still representative) |
| MAR | Missingness predictable from other variables | Older patients more likely to have missing vaccination records | Risky — introduces bias toward younger patients |
| MNAR | Missingness related to the missing value itself | High-income people skip income question | No — the remaining data systematically misrepresents the population |
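For the MAR case, group-based imputation fills each gap from the group the row belongs to rather than from the whole column. A minimal sketch, with a hypothetical `age_group`/`bmi` dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical MAR pattern: bmi missingness depends on age_group,
# so impute within each group rather than with the overall median.
df = pd.DataFrame({
    "age_group": ["young", "young", "older", "older", "older"],
    "bmi": [22.0, np.nan, 27.0, 29.0, np.nan],
})

# groupby(...).transform("median") returns a Series aligned with df,
# holding each row's group median — ideal as a fill value.
df["bmi"] = df["bmi"].fillna(
    df.groupby("age_group")["bmi"].transform("median")
)
```

Each missing `bmi` is filled with its own group's median (22.0 for "young", 28.0 for "older"), which respects the structure that made the data MAR in the first place.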
Imputation Methods
| Method | How | Pros | Cons |
|---|---|---|---|
| Listwise deletion | Drop rows with any NaN | Simple; no assumptions | Loses data; biased if not MCAR |
| Mean imputation | Replace NaN with column mean | Simple; preserves the mean | Reduces variance; biased for skewed data |
| Median imputation | Replace NaN with column median | Better for skewed data; resistant to outliers | Reduces variance |
| Mode imputation | Replace NaN with most common value | Works for categorical data | May overrepresent the dominant category |
| Group-based | Impute using group mean/median | Respects data structure | Requires a meaningful grouping variable |
| Flagging | Impute + create a "was_missing" indicator | Preserves information about missingness | Extra column; pattern might not be used |
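The mean-vs-median trade-off in the table is easy to see on skewed data. With hypothetical incomes containing one large outlier, the mean fill is dragged far above the typical value while the median fill is not:

```python
import pandas as pd
import numpy as np

# Hypothetical skewed incomes with one outlier and one missing value.
incomes = pd.Series([30_000, 35_000, 40_000, 45_000, 500_000, np.nan])

dropped = incomes.dropna()                       # listwise deletion: 5 values left
mean_filled = incomes.fillna(incomes.mean())     # mean = 130,000, inflated by the outlier
median_filled = incomes.fillna(incomes.median()) # median = 40,000, resistant to it
```

The mean-imputed row gets 130,000 (an "income" larger than four of the five observed values), while the median-imputed row gets 40,000, which is why the decision guide routes skewed numeric columns to the median.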
Data Quality Checklist
Use this checklist every time you start a new data analysis:
- [ ] Run `.info()` — check data types and non-null counts
- [ ] Run `.describe()` — look for impossible min/max values, suspicious means, large standard deviations
- [ ] Run `.value_counts()` on categorical columns — check for inconsistent categories
- [ ] Check for duplicates — `df.duplicated().sum()`
- [ ] Count missing values — `df.isna().sum()` and percentages
- [ ] Check for placeholder values — 999, -1, -99, 0 in places where they shouldn't appear
- [ ] Verify date columns are datetime type, not strings
- [ ] Verify zip codes, phone numbers, and IDs are stored as strings, not numbers
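The checklist translates almost line-for-line into code. A minimal sketch on a hypothetical three-row dataset that deliberately contains a duplicate, a 999 placeholder, and a string date column:

```python
import pandas as pd
import numpy as np

# Hypothetical messy data exercising several checklist items.
df = pd.DataFrame({
    "zip": ["02134", "02134", "10001"],       # kept as strings (leading zero!)
    "visit_date": ["2024-01-05", "2024-01-05", "2024-02-10"],
    "temp_f": [98.6, 98.6, 999.0],            # 999 is a placeholder, not a fever
})

df.info()                                      # types and non-null counts
print(df.describe())                           # spot the impossible max of 999
print("duplicate rows:", df.duplicated().sum())
print("missing per column:\n", df.isna().sum())

# Placeholder check: 999 is impossible for body temperature → treat as missing.
df.loc[df["temp_f"] > 110, "temp_f"] = np.nan

# Date column should be datetime, not strings.
df["visit_date"] = pd.to_datetime(df["visit_date"])
```

Keeping `zip` as strings preserves the leading zero that a numeric dtype would silently drop.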
The Data Cleaning Pipeline
Follow this order for every project:
```text
1. ASSESS            → .info(), .describe(), .value_counts(), .duplicated()
        ↓
2. DEDUPLICATE       → .drop_duplicates()
        ↓
3. FIX TYPES         → pd.to_datetime(), .astype()
        ↓
4. STANDARDIZE TEXT  → .str.lower(), .str.strip(), .replace()
        ↓
5. FIX IMPOSSIBLE VALUES → Set out-of-range values to NaN
        ↓
6. HANDLE MISSING DATA   → Flag first, then impute or drop
        ↓
7. CREATE NEW VARIABLES  → Binning, calculations, recoding
        ↓
8. VERIFY            → Final .info(), .describe(), .isna().sum()
        ↓
9. DOCUMENT          → Write the cleaning log
        ↓
10. SAVE             → .to_csv('cleaned_data.csv')
```
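Strung together, the pipeline is only a dozen lines. This sketch uses a hypothetical four-row dataset (the column names and values are invented) and hits steps 2 through 6 and 10; a real project would add the assessment, verification, and cleaning-log steps around it:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with a duplicate row, string dates,
# inconsistent name casing, and an impossible weight.
raw = pd.DataFrame({
    "name": ["Ann ", "ann", "Bob", "Bob"],
    "dob": ["1990-01-01", "1990-01-01", "1985-06-15", "1985-06-15"],
    "weight_kg": [61.0, 61.0, -5.0, -5.0],
})

df = raw.copy()
df = df.drop_duplicates()                           # 2. deduplicate
df["dob"] = pd.to_datetime(df["dob"])               # 3. fix types
df["name"] = df["name"].str.strip().str.lower()     # 4. standardize text
df.loc[df["weight_kg"] <= 0, "weight_kg"] = np.nan  # 5. impossible values → NaN
df["weight_missing"] = df["weight_kg"].isna()       # 6. flag before imputing
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].median())
df.to_csv("cleaned_data.csv", index=False)          # 10. save
```

Note the order matters: impossible values must be converted to NaN *before* the missing-data step, or the -5 kg weight would be imputed around instead of removed.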
Tidy Data Rules
| Rule | Violation Example | Fix |
|---|---|---|
| Each variable → its own column | "120/80" in a blood_pressure column | Split into systolic and diastolic columns |
| Each observation → its own row | Patient row with Jan_Temp, Feb_Temp, Mar_Temp columns | Reshape to long format: Patient, Month, Temperature |
| Each value → its own cell | "flu, pneumonia" in a diagnosis cell | Separate rows or separate columns |
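The first two fixes in the table can be sketched with `pd.melt` and `.str.split`. The patient data here is hypothetical and trimmed to two months for brevity:

```python
import pandas as pd

# Wide hypothetical data violating "each observation → its own row".
wide = pd.DataFrame({
    "patient": ["A", "B"],
    "Jan_Temp": [98.6, 99.1],
    "Feb_Temp": [98.4, 98.9],
})

# Reshape to long/tidy format: one row per patient-month reading.
tidy = pd.melt(wide, id_vars="patient",
               value_vars=["Jan_Temp", "Feb_Temp"],
               var_name="month", value_name="temp_f")
tidy["month"] = tidy["month"].str.replace("_Temp", "")

# Fix "each variable → its own column": split "120/80" into two columns.
bp = pd.DataFrame({"blood_pressure": ["120/80", "135/90"]})
bp[["systolic", "diastolic"]] = (
    bp["blood_pressure"].str.split("/", expand=True).astype(int)
)
```

After the melt, each patient contributes one row per month, so a plot or groupby over months becomes a one-liner.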
Python Quick Reference
```python
import pandas as pd
import numpy as np

# --- Assessment ---
df.info()                             # Types, non-null counts
df.describe()                         # Summary stats (spot impossibles)
df.isna().sum()                       # Missing count per column
df.isna().mean() * 100                # Missing percentage per column
df.duplicated().sum()                 # Count duplicate rows
df['col'].value_counts()              # Category frequencies

# --- Cleaning ---
df.drop_duplicates()                  # Remove duplicate rows
df.dropna()                           # Drop rows with any NaN
df.dropna(subset=['col'])             # Drop rows where 'col' is NaN
df['col'].fillna(value)               # Fill NaN with value
df['col'].fillna(df['col'].median())  # Fill with median
df['col'].str.lower().str.strip()     # Standardize text
df['col'].replace({'old': 'new'})     # Map values

# --- Transformation ---
pd.cut(df['col'], bins=..., labels=...)       # Equal-width bins
pd.qcut(df['col'], q=4, labels=...)           # Equal-frequency bins
pd.to_datetime(df['col'])                     # Convert to datetime
df.loc[condition, 'col'] = new_value          # Conditional assignment

# --- Reshaping ---
pd.melt(df, id_vars=..., value_vars=...)      # Wide → long
df.pivot(index=..., columns=..., values=...)  # Long → wide

# REMEMBER: Most methods return a NEW object.
# Always assign the result back: df['col'] = df['col'].fillna(0)
```
Excel Quick Reference
| Task | Method |
|---|---|
| Count blanks | `=COUNTBLANK(range)` |
| Remove extra spaces | `=TRIM(cell)` |
| Remove non-printable chars | `=CLEAN(cell)` |
| Standardize case | `=LOWER(cell)`, `=UPPER(cell)`, `=PROPER(cell)` |
| Replace values | Ctrl+H (Find & Replace) |
| Highlight blanks | Conditional Formatting > Blanks |
| Highlight duplicates | Conditional Formatting > Duplicate Values |
| Remove duplicates | Data > Remove Duplicates |
| Restrict future entries | Data > Data Validation |
| Handle errors | `=IFERROR(formula, "")` |
Key Terms
| Term | Definition |
|---|---|
| Missing data (NA/NaN) | Values absent from the dataset — blank cells, null values |
| MCAR | Missing Completely at Random — missingness unrelated to any variable |
| MAR | Missing at Random — missingness predictable from other observed variables |
| MNAR | Missing Not at Random — missingness related to the missing value itself |
| Imputation | Replacing missing values with estimated values |
| Data cleaning | Detecting and correcting errors, inconsistencies, and quality issues |
| Data wrangling | Transforming raw data into analysis-ready format (broader than cleaning) |
| Tidy data | Each variable a column, each observation a row, each value a cell |
| Recoding | Creating a new variable with different values or categories |
| Binning | Dividing a continuous variable into discrete intervals |
| Feature engineering | Creating new variables from existing ones |
| Data quality | Accuracy, completeness, consistency, and fitness for purpose |
| Reproducibility | Others can follow your steps and get the same results |
| Cleaning log | Documentation of all data cleaning decisions and justifications |
| Sensitivity analysis | Testing whether conclusions change under different cleaning choices |
Common Pandas Pitfalls
| Pitfall | Wrong | Right |
|---|---|---|
| Not saving the result | `df['col'].fillna(0)` | `df['col'] = df['col'].fillna(0)` |
| Missing `.str` accessor | `df['col'].lower()` | `df['col'].str.lower()` |
| Modifying a view | `df_sub = df[mask]` then modify | `df_sub = df[mask].copy()` then modify |
| Hidden whitespace in values | `df['col'].replace('M', 'Male')` misses entries with stray spaces | Strip first: `df['col'].str.strip().replace('M', 'Male')` |
| NaN turning ints into floats | Filling and leaving the column as float | `df['col'] = df['col'].fillna(0).astype(int)` |
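Three of the pitfalls above can be demonstrated in a few lines. The column names here are hypothetical; the point is the assign-back, the `.astype(int)` after filling, and the explicit `.copy()`:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"n_visits": [1.0, np.nan, 3.0], "site": ["a", "b", "a"]})

# Pitfall: fillna returns a NEW Series — assign it back or nothing changes.
df["n_visits"] = df["n_visits"].fillna(0)

# Pitfall: the NaN forced the column to float; restore int after filling.
df["n_visits"] = df["n_visits"].astype(int)

# Pitfall: filtering can return a view — take an explicit copy before editing,
# so the edit cannot warn about (or leak back into) the original frame.
site_a = df[df["site"] == "a"].copy()
site_a["site"] = "A"
```

After the `.copy()`, `site_a` is an independent frame: uppercasing its `site` column leaves `df` untouched.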
The One Thing to Remember
If you forget everything else from this chapter, remember this:
Every data cleaning decision is a judgment call that shapes your conclusions. There is no "neutral" cleaning — every choice to delete, impute, recode, or transform changes the data and the story it tells. The ethical move is to make your choices transparently, document them in a cleaning log, and test whether different choices would lead to different conclusions. Reproducibility isn't a luxury — it's the price of being trustworthy.