Case Study: Data Types in Electronic Health Records — When Classification Is Life or Death
The Setup
In 2009, the United States embarked on one of the largest data standardization projects in history. The HITECH Act (Health Information Technology for Economic and Clinical Health) committed over $35 billion in incentives to encourage hospitals and clinics to adopt Electronic Health Records (EHRs) — digital systems that replace paper charts with structured databases.
By 2023, over 96% of U.S. hospitals had adopted certified EHR systems. That means nearly every time you visit a doctor, your information — diagnoses, medications, lab results, vital signs, demographic information — is entered into a database. Billions of rows of data, across hundreds of thousands of healthcare facilities, describing hundreds of millions of patients.
This is the world Dr. Maya Chen works in. Her disease surveillance system draws on EHR data from hospitals across her state. And the way variables are classified in those systems has direct consequences for public health decisions.
The Data Classification Challenge
An EHR system for a single hospital might contain over 1,000 distinct variables per patient. Here's a simplified slice:
| Variable | Example Values | Type | Notes |
|---|---|---|---|
| Patient MRN (Medical Record Number) | 00847291 | Nominal | Unique ID; leading zeros matter |
| Date of Birth | 1987-03-15 | Continuous (date) | Used to calculate age |
| Sex at Birth | Male, Female, Unknown | Nominal | Distinct from gender identity |
| Race/Ethnicity | 21 possible categories (OMB standards) | Nominal | Self-reported; allows multiple selections |
| Primary Diagnosis (ICD-10) | J11.1 (Influenza with pneumonia) | Nominal | Over 70,000 possible codes |
| Blood Pressure — Systolic | 120 mmHg | Continuous (ratio) | Measured, not counted |
| Blood Pressure — Diastolic | 80 mmHg | Continuous (ratio) | Measured, not counted |
| Heart Rate | 72 bpm | Discrete (ratio) | Counted beats per minute |
| Temperature | 98.6°F | Continuous (interval) | Measured; 0°F ≠ no temperature |
| Pain Score | 0-10 scale | Ordinal | Patient self-report; distances unequal |
| Lab Value: Hemoglobin A1c | 6.2% | Continuous (ratio) | Measured; 0% = no glycated hemoglobin |
| Medication Prescribed | Tamiflu 75mg capsule | Nominal | Categorical; specific drug name |
| Smoking Status | Current, Former, Never, Unknown | Ordinal/Nominal | Could be ordinal (by exposure) or nominal |
| Number of Previous Admissions | 0, 1, 2, ... | Discrete (ratio) | Counted; 0 = no prior admissions |
| Triage Category | 1 (Resuscitation) to 5 (Non-urgent) | Ordinal | Lower number = more critical |
| Insurance Type | Medicare, Medicaid, Private, Uninsured | Nominal | Administrative category |
| Length of Stay (days) | 0, 1, 2, ... , 365 | Discrete/Continuous | Often treated as continuous despite integer values |
Notice how many classification decisions are embedded in this single table. Each one reflects choices made by standards committees, hospital administrators, and software developers — and each one shapes what analysis is possible.
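One practical way to keep these decisions explicit is a machine-readable data dictionary. The sketch below is a deliberately tiny, hypothetical dictionary (the variable names and the `check_summary` helper are illustrations, not any real EHR schema): it records each variable's level of measurement and checks whether a requested summary statistic is valid at that level.

```python
# Hypothetical mini data dictionary: variable name -> level of measurement.
DATA_DICTIONARY = {
    "mrn": "nominal",           # stored as a string so leading zeros survive
    "zip_code": "nominal",
    "pain_score": "ordinal",
    "triage_category": "ordinal",
    "heart_rate": "ratio",
}

# Which summary statistics are valid at each level of measurement.
VALID_SUMMARIES = {
    "nominal":  {"mode", "frequency"},
    "ordinal":  {"mode", "frequency", "median"},
    "interval": {"mode", "frequency", "median", "mean"},
    "ratio":    {"mode", "frequency", "median", "mean"},
}

def check_summary(variable, statistic):
    """Return True if `statistic` is a valid summary for `variable`."""
    level = DATA_DICTIONARY[variable]
    return statistic in VALID_SUMMARIES[level]

# Averaging a nominal variable like zip code is flagged as invalid,
# while counting frequencies is allowed.
assert not check_summary("zip_code", "mean")
assert check_summary("zip_code", "frequency")
assert check_summary("pain_score", "median")
assert not check_summary("pain_score", "mean")
```

An analysis pipeline that consults such a dictionary before computing a statistic would have caught both of the errors described in the stories that follow.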
Three Stories About What Can Go Wrong
Story 1: The Zip Code Average
In 2014, a hospital system ran an analysis to identify which communities had the highest rates of diabetes-related emergency visits. An analyst calculated the "average zip code" of diabetic patients and mapped the result to a location.
The number was meaningless. Arithmetically, a zip code of 48221 (Detroit) and a zip code of 90210 (Beverly Hills) average to roughly 69215 (a rural area in Nebraska), a location with no connection to either community. Zip codes are labels, not quantities: the analyst treated a nominal variable as numerical, and the result was geographic nonsense.
The correct approach was to count the frequency of visits per zip code — an operation that's valid for nominal data — and then map those frequencies. When the team redid the analysis correctly, they found that three specific zip codes accounted for 40% of diabetes-related ER visits. This finding led to targeted outreach programs in those communities.
Lesson: The wrong variable classification doesn't just produce wrong numbers — it hides the real story. The frequency analysis revealed a pattern; the averaging obscured it.
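The contrast between the invalid and valid operations can be sketched in a few lines (the visit data here is hypothetical; `collections.Counter` does the frequency counting that nominal data supports):

```python
from collections import Counter

# Hypothetical ER visits, each tagged with a zip code stored as a string
# (so leading zeros would survive).
visits = ["48221", "48221", "90210", "48221", "90210", "48221"]

# Invalid: treating nominal codes as numbers yields a "mean zip code"
# that points at no real place.
bogus_mean = sum(int(z) for z in visits) / len(visits)

# Valid: frequency counts per category — the operation nominal data supports.
visit_counts = Counter(visits)

assert visit_counts["48221"] == 4
assert visit_counts.most_common(1)[0][0] == "48221"
```

The frequency table, not the average, is what surfaced the three high-burden zip codes in the redone analysis.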
Story 2: Pain Scores and the Opioid Crisis
Pain is one of the most important variables in healthcare — and one of the hardest to classify.
The standard 0-10 pain scale is ordinal: a patient reports their pain level, and a nurse records the number. But for decades, many hospital quality metrics treated pain scores as continuous numerical data, calculating ward averages and tracking trends as if the numbers were measurements rather than self-reported categories.
This matters because treating pain scores as continuous led hospitals to set targets like "reduce average pain score from 4.2 to 3.0." Clinicians responded by prescribing more opioid pain medication to bring the numbers down. Multiple analyses, including a 2017 study in the New England Journal of Medicine, identified this numeric-target-chasing as one of several institutional factors that contributed to the opioid epidemic.
The underlying statistical error: treating ordinal data as if the distances between values were equal and meaningful. Is the gap between pain level 2 and pain level 4 really the same as the gap between 6 and 8? A patient who reports "4" might mean something very different from another patient who also reports "4." Averaging these numbers and setting numeric targets created a false precision that had devastating consequences.
Lesson: The level of measurement constrains what operations are valid. Treating ordinal data as continuous can produce misleading summaries that drive harmful decisions.
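A small numerical illustration (with hypothetical ward scores) shows why the choice of summary matters: the mean lands on a value that describes almost nobody in a polarized ward, while the median and a count of severe reports respect the ordinal scale.

```python
from statistics import mean, median

# Hypothetical ward pain reports on the ordinal 0-10 scale: most patients
# are comfortable, a few are in severe pain.
ward_scores = [0, 0, 1, 2, 2, 3, 8, 9, 10, 10, 10]

# The mean assumes equal spacing between ordinal levels — a shaky assumption.
avg = mean(ward_scores)  # 5: a "middling" ward that contains no such patient

# The median and the frequency of severe reports are valid ordinal summaries.
mid = median(ward_scores)
severe = sum(1 for s in ward_scores if s >= 7)

assert avg == 5
assert mid == 3
assert severe == 5
```

A target of "reduce the average below 4" could be met here by medicating the comfortable majority slightly, while doing nothing for the five patients in severe pain — which is precisely the perverse incentive the story describes.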
Story 3: Race, Categories, and Missing People
When Dr. Chen analyzes disease outbreak data, she often breaks results down by race and ethnicity. But the racial categories available in EHR systems have changed over time, and they vary across systems.
Before 1997, the standard U.S. racial categories (set by the Office of Management and Budget) were: White, Black, Asian/Pacific Islander, American Indian/Alaska Native. In 1997, "Asian" and "Native Hawaiian/Pacific Islander" were split into separate categories, and respondents were allowed to select more than one race for the first time.
This seemingly small change in nominal variable categories had huge analytical consequences:
- Before 1997: Researchers studying health disparities between Asian Americans and Pacific Islanders could not do so — the data didn't distinguish them. Native Hawaiians, who face significantly different health challenges than East Asian Americans, were statistically invisible.
- After 1997: The split revealed that Native Hawaiians and Pacific Islanders had substantially higher rates of diabetes, obesity, and cardiovascular disease than Asian Americans as an aggregate — patterns that had been hidden by the combined category for decades.
- Multiracial individuals: Before "select all that apply" was an option, people who identified with multiple races were forced into a single category — or into "Other." This meant that research on multiracial health outcomes was essentially impossible.
Dr. Chen sees a modern version of this problem with the "Hispanic/Latino" ethnicity field. In most EHR systems, "Hispanic/Latino" is treated as an ethnicity that can overlay any racial category — so a patient might be recorded as both "White" and "Hispanic." But many patients and healthcare workers treat it as a racial category, leading to inconsistent data entry. When Dr. Chen tries to analyze flu outcomes by race and ethnicity, she has to navigate these inconsistencies in every dataset.
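Analysts facing this inconsistency often harmonize the two fields into a single reporting variable before analysis. The function below is one hypothetical harmonization rule (an illustration, not an official OMB or EHR-vendor algorithm): it routes "Hispanic/Latino" to the same group whether it was entered in the ethnicity field or, incorrectly, in the race field.

```python
def combined_group(race, ethnicity):
    """Collapse separate race and ethnicity fields into one reporting group.

    Hypothetical rule: any record flagged Hispanic/Latino in either field
    lands in "Hispanic/Latino (any race)"; otherwise the race field is used.
    """
    if ethnicity == "Hispanic/Latino" or race == "Hispanic/Latino":
        return "Hispanic/Latino (any race)"
    if race is None:
        return "Unknown"
    return race

# A patient recorded as White race + Hispanic ethnicity and one recorded
# with "Hispanic/Latino" in the race field end up in the same group.
assert combined_group("White", "Hispanic/Latino") == "Hispanic/Latino (any race)"
assert combined_group("Hispanic/Latino", None) == "Hispanic/Latino (any race)"
assert combined_group("White", "Not Hispanic/Latino") == "White"
assert combined_group(None, None) == "Unknown"
```

Note that any such rule is itself a classification decision: it makes some comparisons possible and erases others, which is exactly the lesson of this story.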
Lesson: The categories we create for nominal variables determine what stories the data can tell — and whose stories get erased. Data classification is never purely technical; it's also social and political.
Connection to This Chapter
These three stories illustrate core lessons from Chapter 2:
| Chapter 2 Concept | EHR Illustration |
|---|---|
| Nominal vs. numerical | Zip code averaging error (Story 1) |
| Ordinal vs. continuous | Pain score targets driving opioid prescribing (Story 2) |
| Categories shape stories | Racial categories hiding health disparities (Story 3) |
| Data dictionaries | Without one, zip codes look numerical and pain scores look continuous |
| Human stories behind data | Every misclassification affects real patients and real communities |
Discussion Questions
1. A hospital administrator wants to track "patient satisfaction" and is debating between a 1-5 Likert scale and an open-ended text response. What are the advantages and disadvantages of each approach, in terms of data type and analyzability?
2. The pain score example shows how treating ordinal data as continuous led to real-world harm. Can you think of another situation (inside or outside healthcare) where treating ordinal data as numerical might produce misleading results?
3. Dr. Chen needs to report flu outcomes by race and ethnicity. A colleague suggests combining all racial categories into just three groups (White, Black, Other) to simplify the analysis. What information would be lost? When, if ever, might this simplification be justified?
4. EHR systems typically record "Sex at Birth" (Male/Female) and increasingly also "Gender Identity" (with more categories). Why are these separate variables? How does this illustrate that classification decisions reflect evolving social understanding?
5. ICD-10 diagnosis codes (like J11.1 for "Influenza with pneumonia") are nominal variables with over 70,000 possible values. What challenges does this create for analysis? How might a researcher simplify this variable while preserving useful information?
Mini-Project
Find a real data dictionary for a publicly available health dataset. Good options include:
- CDC BRFSS Codebook (search "BRFSS codebook [year]")
- CMS Medicare Provider Data (data.cms.gov)
- WHO Global Health Observatory Data Dictionary
Choose five variables from the data dictionary. For each one:
1. Record the official variable name, description, and coded values
2. Classify it using the Chapter 2 system (nominal, ordinal, discrete, continuous)
3. Identify any coded values that could be misinterpreted (e.g., 77 = "Don't know")
4. Note any cases where the classification is debatable
Write a one-paragraph reflection on what surprised you about the data dictionary.
Sources:
- Office of the National Coordinator for Health Information Technology (2023). Non-federal Acute Care Hospital Electronic Health Record Adoption. HealthIT.gov.
- Office of Management and Budget (1997). Revisions to the Standards for the Classification of Federal Data on Race and Ethnicity.
- Darnall, B. D., et al. (2017). Patient-centered prescribing of opioids. New England Journal of Medicine, 376(17).
- Institute of Medicine (2009). Race, Ethnicity, and Language Data: Standardization for Health Care Quality Improvement.