Case Study: When Cleaning Decisions Changed the COVID-19 Count
The Setup
In early 2020, as COVID-19 began spreading across the United States, every county health department in the country faced the same urgent question: How many cases do we have?
The answer should have been simple. It wasn't.
County health departments — like Maya Chen's — received case reports from hospitals, clinics, testing sites, and labs. These reports arrived as spreadsheets, faxes, electronic records, and even phone calls. Some labs reported positive tests. Some reported both positive and negative tests. Some reported tests by the patient's name, others by a case ID, and others by an anonymized code.
Merging all of that into a single, accurate case count required data cleaning. And the cleaning decisions that individual health departments made — decisions that most people would consider boring, technical, and invisible — turned out to have enormous consequences for public policy, resource allocation, and public trust.
This is a case study about how the most "mundane" step in data analysis became the most important.
The Duplication Problem
The first major cleaning challenge was duplicates — and they were everywhere.
A patient might get tested at a drive-through site on Monday, then go to their doctor on Wednesday for confirmation. That's two positive tests for one case. A lab might report results to the county health department and to the state, which then sends the report back down to the county. That's the same case counted twice through different channels.
Some early county-level case counts were inflated by 10-25% due to duplicate records. In a county tracking 5,000 cases, that's 500-1,250 phantom cases. Those inflated numbers triggered emergency orders, hospital surge plans, and supply chain requests — all based on overestimates.
The Deduplication Dilemma
Removing duplicates sounds straightforward — just use `.drop_duplicates()`, right? But think about what you'd deduplicate on:
- Patient name? Names have typos, middle names appear sometimes, and married names change. "John Smith" and "J. Smith" and "John M. Smith" might all be the same person — or three different people.
- Date of birth? Helps, but isn't unique. In a city of 500,000, hundreds of people share each birthday, and a dozen or more can share the exact same date of birth.
- Test date? A patient tested multiple times should only count as one case, but each test is a legitimate test. Deduplicating on test date would merge different patients tested on the same day.
- Combination of name + DOB + zip code? Better, but still imperfect. And what about patients who moved, or whose information was entered differently at different testing sites?
Health departments that used aggressive deduplication (strict matching rules) removed real cases along with the duplicates — undercounting. Those that used conservative deduplication (loose matching) left duplicates in — overcounting. There was no perfect answer, only trade-offs.
Connection to Chapter 7: This is the `.drop_duplicates(subset=['patient_id'])` problem from Section 7.4, but at a scale where getting it wrong affects millions of people. The `subset` and `keep` parameters aren't just coding choices — they're epidemiological decisions.
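The trade-off between strict and loose matching can be seen in a few lines of pandas. This is a minimal sketch with hypothetical records (the names, columns, and values are invented for illustration): a strict key lets a near-duplicate survive, while a loose key merges two genuinely different patients.

```python
import pandas as pd

# Hypothetical case reports: the same person reported through two channels
# (with slightly different names), plus a different patient who happens to
# share a date of birth and zip code.
reports = pd.DataFrame({
    "name":      ["John Smith", "John M. Smith", "Jane Doe"],
    "dob":       ["1970-03-01", "1970-03-01",    "1970-03-01"],
    "zip":       ["33101",      "33101",         "33101"],
    "test_date": ["2020-05-04", "2020-05-06",    "2020-05-04"],
})

# Strict key (exact name match): the John / John M. duplicate survives -> overcount.
strict = reports.drop_duplicates(subset=["name", "dob", "zip"])

# Loose key (DOB + zip only): Jane Doe is merged into John Smith -> undercount.
loose = reports.drop_duplicates(subset=["dob", "zip"])

print(len(strict))  # 3 — a duplicate remains
print(len(loose))   # 1 — a real case was removed
```

Neither key is "correct"; the choice of `subset` is exactly the judgment call described above.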
The "What Counts as a Case?" Problem
Even after deduplication, another cleaning decision loomed: what counts as a COVID-19 case?
- A positive PCR lab test? Almost everyone agreed on this.
- A positive rapid antigen test? Early in the pandemic, many counties didn't count these. Later, most did. Changing the definition mid-stream meant the case counts from January and June weren't comparable — like changing the rules of the game at halftime.
- A "probable" case based on symptoms and known exposure, but without a test? The CDC recommended counting these, but not all counties did.
- A positive antibody test (which shows past infection, not current)? Some labs reported these alongside active-case tests, and some health departments initially included them in their case counts before realizing the error and correcting — which made it look like cases had suddenly dropped.
Each of these is a recoding decision — how do you classify this observation? And each choice changed the count.
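A recoding decision like this can be made concrete in code. The sketch below uses invented test records and an invented `count_cases` helper (not from any real surveillance system) to show how flipping two policy switches changes the count from the same underlying data:

```python
import pandas as pd

# Hypothetical test results spanning the definitions discussed above.
tests = pd.DataFrame({
    "test_type": ["pcr", "antigen", "antibody", "none", "pcr"],
    "result":    ["positive", "positive", "positive", None, "negative"],
    "symptomatic_with_exposure": [False, False, False, True, False],
})

def count_cases(df, include_antigen, include_probable):
    # Confirmed: positive PCR always counts; positive antigen only if policy says so.
    confirmed_types = ["pcr"] + (["antigen"] if include_antigen else [])
    confirmed = df["test_type"].isin(confirmed_types) & (df["result"] == "positive")
    # Probable: symptoms plus known exposure, without a positive test.
    probable = df["symptomatic_with_exposure"] & (df["result"] != "positive")
    # Antibody positives never count as active cases under either policy.
    return int(confirmed.sum() + (probable.sum() if include_probable else 0))

print(count_cases(tests, include_antigen=False, include_probable=False))  # 1
print(count_cases(tests, include_antigen=True,  include_probable=True))   # 3
```

Same five records, and the count triples depending on the definition — which is exactly why January and June numbers weren't comparable when the rules changed mid-stream.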
The Florida Controversy
In May 2020, the Florida Department of Health drew national attention when data manager Rebekah Jones alleged that she was asked to manipulate the state's COVID-19 dashboard data. While the full story is complex and contested, one undisputed fact stands out: different data cleaning and reporting choices — which tests to include, how to count "probable" cases, whether to use the date of test or the date of report — produced substantially different numbers from the same underlying data.
The state reported data one way. Jones argued it should be reported another way. Both sides were making data cleaning decisions — and those decisions had starkly different implications for whether Florida's case trajectory appeared to be rising or falling.
Connection to Theme 6 (Ethical Data Practice): This is the principle from Section 7.10 made vivid: when data cleaning decisions are not transparent and documented, they become vulnerable to accusations of manipulation — even when the analysts are acting in good faith. A public cleaning log, showing exactly which records were included and excluded and why, is the best defense against both actual bias and perceived bias.
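The date-of-test versus date-of-report choice mentioned above is easy to demonstrate. In this sketch (with invented dates standing in for a lab backlog), the same four cases produce two different-looking curves depending on which date column you group by:

```python
import pandas as pd

# Hypothetical cases: tests done early, but three results reported late
# in one batch because of a lab backlog.
cases = pd.DataFrame({
    "test_date":   ["2020-05-01", "2020-05-01", "2020-05-02", "2020-05-02"],
    "report_date": ["2020-05-01", "2020-05-08", "2020-05-08", "2020-05-08"],
})

by_test   = cases.groupby("test_date").size()
by_report = cases.groupby("report_date").size()

# Counting by test date spreads the cases evenly; counting by report
# date makes May 8 look like a sudden spike.
print(by_test.to_dict())    # {'2020-05-01': 2, '2020-05-02': 2}
print(by_report.to_dict())  # {'2020-05-01': 1, '2020-05-08': 3}
```

Neither series is fabricated; they are two defensible summaries of the same records, which is why the choice must be documented.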
The Missing Data Disaster
COVID-19 data also had enormous missing data problems. Consider the variable "race/ethnicity":
- In the early months of the pandemic, many states had race/ethnicity data missing for 40-80% of COVID-19 cases.
- This wasn't random. Testing sites in lower-income communities of color often had fewer resources for data collection. Patients in overwhelmed emergency departments had their demographic information partially or completely skipped.
- The result: early analyses of COVID-19's racial impact were based on the 20-60% of cases that did have race data — a sample that was almost certainly not representative.
If you analyzed only the complete cases (listwise deletion), you might underestimate the disproportionate impact on communities of color. If you imputed race with the mode (the most common race in the area), you'd wash out the actual racial disparities.
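Both failure modes can be shown with a toy example. The numbers below are invented (a sketch, not real surveillance data): group "B" cases are disproportionately hiding in the missing records, and mode imputation erases them:

```python
import pandas as pd

# Hypothetical: 100 cases, race recorded for only half.
# Suppose (unknown to the analyst) many of the missing cases are group B.
cases = pd.DataFrame({
    "race": ["A"] * 40 + ["B"] * 10 + [None] * 50,
})

# Listwise deletion: the disparity is estimated only from complete cases.
complete = cases.dropna()
print(complete["race"].value_counts(normalize=True).to_dict())  # {'A': 0.8, 'B': 0.2}

# Mode imputation: every missing case becomes the majority group,
# washing out whatever disparity the missing records actually held.
imputed = cases.fillna(cases["race"].mode()[0])
print(imputed["race"].value_counts(normalize=True).to_dict())   # {'A': 0.9, 'B': 0.1}
```

Neither result is trustworthy on its own, because the missingness is not random — which is what motivates the bounding approach described next.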
The APM Research Lab's Approach
The APM Research Lab, which tracked COVID-19 racial data across all 50 states, took a different approach. Instead of imputing or deleting, they:
- Reported the percentage of cases with known race/ethnicity alongside the racial breakdown
- Flagged states where missing race data exceeded 30%
- Published their methodology transparently, so anyone could evaluate their decisions
- Provided both "worst case" and "best case" scenarios for racial disparities based on different assumptions about the missing data
This is sensitivity analysis in action — and it's exactly the kind of transparent documentation described in Section 7.10.
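The worst-case/best-case bounding idea can be sketched with simple arithmetic. These counts are hypothetical (not APM Research Lab figures); the point is the method of bracketing an estimate under extreme assumptions about the missing records:

```python
# Hypothetical counts for one state.
known_group = 120   # cases known to belong to the group of interest
known_other = 380   # cases known to belong to other groups
missing     = 500   # cases with race/ethnicity unknown

total = known_group + known_other + missing

# Best case: none of the missing cases belong to the group.
lower_bound = known_group / total
# Worst case: all of the missing cases belong to the group.
upper_bound = (known_group + missing) / total

print(f"group's share of cases: between {lower_bound:.0%} and {upper_bound:.0%}")
# group's share of cases: between 12% and 62%
```

A 50-point-wide interval is an honest answer: with this much missing data, the true disparity simply cannot be pinned down more precisely without further assumptions.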
The Lessons
Lesson 1: Scale Makes Everything Harder
Cleaning a 500-row dataset for a homework assignment is manageable. Cleaning millions of records arriving daily from hundreds of sources, in different formats, with life-or-death stakes, is a fundamentally different challenge. But the principles are the same: identify the issues, choose a strategy, document your decisions, and test whether different choices lead to different conclusions.
Lesson 2: "Clean" Is Not Objective
Two analysts looking at the same raw COVID-19 data could produce legitimately different cleaned datasets — and therefore different case counts — based on different but defensible cleaning decisions. This doesn't mean one is right and the other is wrong. It means data cleaning involves judgment, and reasonable people can disagree.
The ethical obligation is transparency: show your work, explain your choices, and acknowledge the alternatives.
Lesson 3: Missing Data Is a Social Issue
The pattern of missing race/ethnicity data in COVID-19 records wasn't random — it reflected structural inequities in healthcare infrastructure, staffing, and technology. Communities that needed the most attention had the least complete data. When you clean data, you need to ask: whose data is most likely to be missing, and what does that mean for my conclusions?
This is Theme 2 (human stories behind the data) at its most urgent: missing data means missing people, and missing people means invisible suffering.
Lesson 4: Cleaning Logs Save Lives (and Careers)
The counties and states that documented their data processing decisions — which records were included, which were excluded, how duplicates were handled, how "case" was defined — were able to defend their numbers, correct errors publicly, and maintain public trust. Those that didn't document their decisions faced accusations of manipulation, even when their choices were defensible.
A cleaning log isn't just good practice. In high-stakes contexts, it's essential infrastructure.
Discussion Questions
- A county health department discovers that 15% of COVID-19 case records are duplicates. They remove them, and the case count drops significantly. A journalist writes a headline: "County Admits Overcounting COVID Cases by 15%." Is this headline accurate? What nuance is it missing?
- The CDC recommended counting "probable" cases (based on symptoms and exposure) alongside "confirmed" cases (based on positive tests). Some states did this; others didn't. How does this inconsistency in coding rules across states complicate national-level analysis? What's the data cleaning equivalent?
- A state health department has race/ethnicity data missing for 55% of COVID-19 cases. A researcher proposes imputing the missing values using the racial demographics of each patient's zip code. Evaluate this approach: what are its strengths, weaknesses, and potential biases?
- Reframe the Florida data controversy as a data cleaning problem. What specific cleaning decisions were at stake? How could transparency (a public cleaning log) have changed the public's understanding of the situation?
- How does this case study illustrate the idea that "every data cleaning decision is a study design decision in disguise" (from Section 7.10)?
Connection to the Progressive Project
As you clean your own dataset in the Chapter 7 project checkpoint, keep the COVID-19 case in mind. Your stakes are lower, but the principles are identical:
- Document every decision. If someone asked you "why did you drop those rows?" you should have a written answer.
- Think about who's missing. Are the rows you're deleting systematically different from the rows you're keeping?
- Test your sensitivity. Does your conclusion change if you use a different imputation method? If so, acknowledge that in your analysis.
- Be transparent about what you don't know. It's better to say "vaccination data was 10% missing and likely MNAR" than to pretend your dataset is complete.