Case Study: Data Types in Electronic Health Records — When Classification Is Life or Death
The Setup
In 2009, the United States embarked on one of the largest data standardization projects in history. The HITECH Act (Health Information Technology for Economic and Clinical Health) committed over $35 billion in incentives to encourage hospitals and clinics to adopt Electronic Health Records (EHRs) — digital systems that replace paper charts with structured databases.
By 2023, over 96% of U.S. hospitals had adopted certified EHR systems. That means nearly every time you visit a doctor, your information — diagnoses, medications, lab results, vital signs, demographic information — is entered into a database. Billions of rows of data, across hundreds of thousands of healthcare facilities, describing hundreds of millions of patients.
This is the world Dr. Maya Chen works in. Her disease surveillance system draws on EHR data from hospitals across her state. And the way variables are classified in those systems has direct consequences for public health decisions.
The Data Classification Challenge
An EHR system for a single hospital might contain over 1,000 distinct variables per patient. Here's a simplified slice:
| Variable | Example Values | Type | Notes |
|---|---|---|---|
| Patient MRN (Medical Record Number) | 00847291 | Nominal | Unique ID; leading zeros matter |
| Date of Birth | 1987-03-15 | Continuous (date) | Used to calculate age |
| Sex at Birth | Male, Female, Unknown | Nominal | Distinct from gender identity |
| Race/Ethnicity | 21 possible categories (OMB standards) | Nominal | Self-reported; allows multiple selections |
| Primary Diagnosis (ICD-10) | J11.1 (Influenza with pneumonia) | Nominal | Over 70,000 possible codes |
| Blood Pressure — Systolic | 120 mmHg | Continuous (ratio) | Measured, not counted |
| Blood Pressure — Diastolic | 80 mmHg | Continuous (ratio) | Measured, not counted |
| Heart Rate | 72 bpm | Discrete (ratio) | Counted beats per minute |
| Temperature | 98.6°F | Continuous (interval) | Measured; 0°F ≠ no temperature |
| Pain Score | 0-10 scale | Ordinal | Patient self-report; distances unequal |
| Lab Value: Hemoglobin A1c | 6.2% | Continuous (ratio) | Measured; 0% = no glycated hemoglobin |
| Medication Prescribed | Tamiflu 75mg capsule | Nominal | Categorical; specific drug name |
| Smoking Status | Current, Former, Never, Unknown | Ordinal/Nominal | Could be ordinal (by exposure) or nominal |
| Number of Previous Admissions | 0, 1, 2, ... | Discrete (ratio) | Counted; 0 = no prior admissions |
| Triage Category | 1 (Resuscitation) to 5 (Non-urgent) | Ordinal | Lower number = more critical |
| Insurance Type | Medicare, Medicaid, Private, Uninsured | Nominal | Administrative category |
| Length of Stay (days) | 0, 1, 2, ... , 365 | Discrete/Continuous | Often treated as continuous despite integer values |
Notice how many classification decisions are embedded in this single table. Each one reflects choices made by standards committees, hospital administrators, and software developers — and each one shapes what analysis is possible.
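One practical way to keep these decisions explicit is a machine-readable data dictionary. The sketch below is a deliberately tiny, hypothetical dictionary (the variable names and the `check_summary` helper are illustrations, not any real EHR schema): it records each variable's level of measurement and checks whether a requested summary statistic is valid at that level.

```python
# Hypothetical mini data dictionary: variable name -> level of measurement.
DATA_DICTIONARY = {
    "mrn": "nominal",           # stored as a string so leading zeros survive
    "zip_code": "nominal",
    "pain_score": "ordinal",
    "triage_category": "ordinal",
    "heart_rate": "ratio",
}

# Which summary statistics are valid at each level of measurement.
VALID_SUMMARIES = {
    "nominal":  {"mode", "frequency"},
    "ordinal":  {"mode", "frequency", "median"},
    "interval": {"mode", "frequency", "median", "mean"},
    "ratio":    {"mode", "frequency", "median", "mean"},
}

def check_summary(variable, statistic):
    """Return True if `statistic` is a valid summary for `variable`."""
    level = DATA_DICTIONARY[variable]
    return statistic in VALID_SUMMARIES[level]

# Averaging a nominal variable like zip code is flagged as invalid,
# while counting frequencies is allowed.
assert not check_summary("zip_code", "mean")
assert check_summary("zip_code", "frequency")
assert check_summary("pain_score", "median")
assert not check_summary("pain_score", "mean")
```

An analysis pipeline that consults such a dictionary before computing a statistic would have caught both of the errors described in the stories that follow.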
Three Stories About What Can Go Wrong
Story 1: The Zip Code Average
In 2014, a hospital system ran an analysis to identify which communities had the highest rates of diabetes-related emergency visits. An analyst calculated the "average zip code" of diabetic patients and mapped the result to a location.
The number was meaningless. Arithmetically, a zip code of 48221 (Detroit) and a zip code of 90210 (Beverly Hills) average to roughly 69215 (a rural area in Nebraska), a location with no connection to either community. Zip codes are labels, not quantities: the analyst treated a nominal variable as numerical, and the result was geographic nonsense.
The correct approach was to count the frequency of visits per zip code — an operation that's valid for nominal data — and then map those frequencies. When the team redid the analysis correctly, they found that three specific zip codes accounted for 40% of diabetes-related ER visits. This finding led to targeted outreach programs in those communities.
Lesson: The wrong variable classification doesn't just produce wrong numbers — it hides the real story. The frequency analysis revealed a pattern; the averaging obscured it.
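The contrast between the invalid and valid operations can be sketched in a few lines (the visit data here is hypothetical; `collections.Counter` does the frequency counting that nominal data supports):

```python
from collections import Counter

# Hypothetical ER visits, each tagged with a zip code stored as a string
# (so leading zeros would survive).
visits = ["48221", "48221", "90210", "48221", "90210", "48221"]

# Invalid: treating nominal codes as numbers yields a "mean zip code"
# that points at no real place.
bogus_mean = sum(int(z) for z in visits) / len(visits)

# Valid: frequency counts per category — the operation nominal data supports.
visit_counts = Counter(visits)

assert visit_counts["48221"] == 4
assert visit_counts.most_common(1)[0][0] == "48221"
```

The frequency table, not the average, is what surfaced the three high-burden zip codes in the redone analysis.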
Story 2: Pain Scores and the Opioid Crisis
Pain is one of the most important variables in healthcare — and one of the hardest to classify.
The standard 0-10 pain scale is ordinal: a patient reports their pain level, and a nurse records the number. But for decades, many hospital quality metrics treated pain scores as continuous numerical data, calculating ward averages and tracking trends as if the numbers were measurements rather than self-reported categories.
This matters because treating pain scores as continuous led hospitals to set targets like "reduce average pain score from 4.2 to 3.0." Clinicians responded by prescribing more opioid pain medication to bring the numbers down. Multiple analyses, including a 2017 study in the New England Journal of Medicine, identified this numeric-target-chasing as one of several institutional factors that contributed to the opioid epidemic.
The underlying statistical error: treating ordinal data as if the distances between values were equal and meaningful. Is the gap between pain level 2 and pain level 4 really the same as the gap between 6 and 8? A patient who reports "4" might mean something very different from another patient who also reports "4." Averaging these numbers and setting numeric targets created a false precision that had devastating consequences.
Lesson: The level of measurement constrains what operations are valid. Treating ordinal data as continuous can produce misleading summaries that drive harmful decisions.
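A small numerical illustration (with hypothetical ward scores) shows why the choice of summary matters: the mean lands on a value that describes almost nobody in a polarized ward, while the median and a count of severe reports respect the ordinal scale.

```python
from statistics import mean, median

# Hypothetical ward pain reports on the ordinal 0-10 scale: most patients
# are comfortable, a few are in severe pain.
ward_scores = [0, 0, 1, 2, 2, 3, 8, 9, 10, 10, 10]

# The mean assumes equal spacing between ordinal levels — a shaky assumption.
avg = mean(ward_scores)  # 5: a "middling" ward that contains no such patient

# The median and the frequency of severe reports are valid ordinal summaries.
mid = median(ward_scores)
severe = sum(1 for s in ward_scores if s >= 7)

assert avg == 5
assert mid == 3
assert severe == 5
```

A target of "reduce the average below 4" could be met here by medicating the comfortable majority slightly, while doing nothing for the five patients in severe pain — which is precisely the perverse incentive the story describes.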
Story 3: Race, Categories, and Missing People
When Dr. Chen analyzes disease outbreak data, she often breaks results down by race and ethnicity. But the racial categories available in EHR systems have changed over time, and they vary across systems.
Before 1997, the standard U.S. racial categories (set by the Office of Management and Budget) were: White, Black, Asian/Pacific Islander, American Indian/Alaska Native. In 1997, "Asian" and "Native Hawaiian/Pacific Islander" were split into separate categories, and respondents were allowed to select more than one race for the first time.
This seemingly small change in nominal variable categories had huge analytical consequences:
- Before 1997: Researchers studying health disparities between Asian Americans and Pacific Islanders could not do so — the data didn't distinguish them. Native Hawaiians, who face significantly different health challenges than East Asian Americans, were statistically invisible.
- After 1997: The split revealed that Native Hawaiians and Pacific Islanders had substantially higher rates of diabetes, obesity, and cardiovascular disease than Asian Americans as an aggregate — patterns that had been hidden by the combined category for decades.
- Multiracial individuals: Before "select all that apply" was an option, people who identified with multiple races were forced into a single category — or into "Other." This meant that research on multiracial health outcomes was essentially impossible.
Dr. Chen sees a modern version of this problem with the "Hispanic/Latino" ethnicity field. In most EHR systems, "Hispanic/Latino" is treated as an ethnicity that can overlay any racial category — so a patient might be recorded as both "White" and "Hispanic." But many patients and healthcare workers treat it as a racial category, leading to inconsistent data entry. When Dr. Chen tries to analyze flu outcomes by race and ethnicity, she has to navigate these inconsistencies in every dataset.
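Analysts facing this inconsistency often harmonize the two fields into a single reporting variable before analysis. The function below is one hypothetical harmonization rule (an illustration, not an official OMB or EHR-vendor algorithm): it routes "Hispanic/Latino" to the same group whether it was entered in the ethnicity field or, incorrectly, in the race field.

```python
def combined_group(race, ethnicity):
    """Collapse separate race and ethnicity fields into one reporting group.

    Hypothetical rule: any record flagged Hispanic/Latino in either field
    lands in "Hispanic/Latino (any race)"; otherwise the race field is used.
    """
    if ethnicity == "Hispanic/Latino" or race == "Hispanic/Latino":
        return "Hispanic/Latino (any race)"
    if race is None:
        return "Unknown"
    return race

# A patient recorded as White race + Hispanic ethnicity and one recorded
# with "Hispanic/Latino" in the race field end up in the same group.
assert combined_group("White", "Hispanic/Latino") == "Hispanic/Latino (any race)"
assert combined_group("Hispanic/Latino", None) == "Hispanic/Latino (any race)"
assert combined_group("White", "Not Hispanic/Latino") == "White"
assert combined_group(None, None) == "Unknown"
```

Note that any such rule is itself a classification decision: it makes some comparisons possible and erases others, which is exactly the lesson of this story.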
Lesson: The categories we create for nominal variables determine what stories the data can tell — and whose stories get erased. Data classification is never purely technical; it's also social and political.
Connection to This Chapter
These three stories illustrate core lessons from Chapter 2:
| Chapter 2 Concept | EHR Illustration |
|---|---|
| Nominal vs. numerical | Zip code averaging error (Story 1) |
| Ordinal vs. continuous | Pain score targets driving opioid prescribing (Story 2) |
| Categories shape stories | Racial categories hiding health disparities (Story 3) |
| Data dictionaries | Without one, zip codes look numerical and pain scores look continuous |
| Human stories behind data | Every misclassification affects real patients and real communities |
Discussion Questions
1. A hospital administrator wants to track "patient satisfaction" and is debating between a 1-5 Likert scale and an open-ended text response. What are the advantages and disadvantages of each approach, in terms of data type and analyzability?
2. The pain score example shows how treating ordinal data as continuous led to real-world harm. Can you think of another situation (inside or outside healthcare) where treating ordinal data as numerical might produce misleading results?
3. Dr. Chen needs to report flu outcomes by race and ethnicity. A colleague suggests combining all racial categories into just three groups (White, Black, Other) to simplify the analysis. What information would be lost? When, if ever, might this simplification be justified?
4. EHR systems typically record "Sex at Birth" (Male/Female) and increasingly also "Gender Identity" (with more categories). Why are these separate variables? How does this illustrate that classification decisions reflect evolving social understanding?
5. ICD-10 diagnosis codes (like J11.1 for "Influenza with pneumonia") are nominal variables with over 70,000 possible values. What challenges does this create for analysis? How might a researcher simplify this variable while preserving useful information?
Mini-Project
Find a real data dictionary for a publicly available health dataset. Good options include:
- CDC BRFSS Codebook (search "BRFSS codebook [year]")
- CMS Medicare Provider Data (data.cms.gov)
- WHO Global Health Observatory Data Dictionary
Choose five variables from the data dictionary. For each one:
1. Record the official variable name, description, and coded values
2. Classify it using the Chapter 2 system (nominal, ordinal, discrete, continuous)
3. Identify any coded values that could be misinterpreted (e.g., 77 = "Don't know")
4. Note any cases where the classification is debatable
Write a one-paragraph reflection on what surprised you about the data dictionary.
Sources:
- Office of the National Coordinator for Health Information Technology (2023). Non-federal Acute Care Hospital Electronic Health Record Adoption. HealthIT.gov.
- Office of Management and Budget (1997). Revisions to the Standards for the Classification of Federal Data on Race and Ethnicity.
- Darnall, B. D., et al. (2017). Patient-centered prescribing of opioids. New England Journal of Medicine, 376(17).
- Institute of Medicine (2009). Race, Ethnicity, and Language Data: Standardization for Health Care Quality Improvement.