Key Takeaways: Types of Data and the Language of Statistics

One-Sentence Summary

Every variable is either categorical (labels and groups) or numerical (measurable quantities), and correctly classifying your variables before analysis is the single most important step you can take to avoid meaningless results.

Core Concepts at a Glance

Concept	Definition	Why It Matters
Observational unit	The individual entity each row of data describes	Defines what your dataset is "about"; must be identified first
Variable	A characteristic that varies across observational units	Each column captures one measurable or classifiable property
Categorical variable	Values are labels or group names	Use frequency counts, bar charts, mode — NOT averages
Numerical variable	Values are measurable quantities	Use averages, histograms, standard deviation — full arithmetic
Data dictionary	Documentation describing every variable in a dataset	Prevents misclassification, ensures reproducibility
Parameter	A number describing a population (usually unknown)	The "truth" you're trying to estimate
Statistic	A number describing a sample (calculated from data)	Your best estimate of the unknown parameter

The Classification System

                        Variable
                       /         \
              Categorical       Numerical
              /        \        /        \
         Nominal    Ordinal  Discrete  Continuous
         (labels)  (ranked)  (counted)  (measured)

Type	Order?	Equal spacing?	Arithmetic?	Quick examples
Nominal	No	No	No	Blood type, zip code, diagnosis
Ordinal	Yes	No	Limited	Pain scale, letter grades, Likert ratings
Discrete	Yes	Yes	Yes	Number of siblings, goals scored
Continuous	Yes	Yes	Yes	Height, weight, temperature, time
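
The "Limited" arithmetic entry for ordinal data can be illustrated with a short sketch. The severity labels and ratings below are made up for illustration, and the `rank` mapping is a hypothetical helper: it records order only, so sorting and taking a median are reasonable, while averaging the ranks would assume equal spacing that the scale does not promise.

```python
from statistics import median_low

# Hypothetical ordinal scale, listed from lowest to highest.
levels = ["mild", "moderate", "severe"]
rank = {label: position for position, label in enumerate(levels)}

# Made-up ratings for five patients.
ratings = ["moderate", "mild", "severe", "mild", "moderate"]

# Order is meaningful, so sorting by rank is a valid operation.
ordered = sorted(ratings, key=rank.get)

# The median only needs order; median_low returns an actual observed rank,
# which is then mapped back to its label.
middle = levels[median_low(rank[r] for r in ratings)]

print(ordered)
print(middle)
```

Using `median_low` rather than `median` avoids interpolating between two ranks, which would again treat the gaps between categories as equal.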

Quick Decision Flowchart

Step 1: Does the variable record a category/label, or a quantity?
- Category → Categorical
- Quantity → Numerical

Step 2 (Categorical): Do the categories have a natural order?
- No → Nominal
- Yes → Ordinal

Step 2 (Numerical): Is the variable counted (whole numbers only) or measured (any value)?
- Counted → Discrete
- Measured → Continuous
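
The questions above can be sketched as a small function. This is a minimal illustration rather than a standard routine: the name `classify_variable` and its parameters are hypothetical, and the answers to each question still have to come from your own judgment about what the variable means.

```python
def classify_variable(records_quantity, has_natural_order=False, is_counted=False):
    """Walk the decision flowchart and return the variable type.

    All three arguments are answers supplied by the analyst; this
    hypothetical helper only encodes the order of the questions.
    """
    # Step 1: does the variable record a category/label or a quantity?
    if not records_quantity:
        # Step 2 (categorical): do the categories have a natural order?
        return "ordinal" if has_natural_order else "nominal"
    # Step 2 (numerical): counted in whole numbers, or measured?
    return "discrete" if is_counted else "continuous"


print(classify_variable(False))                          # e.g. blood type
print(classify_variable(False, has_natural_order=True))  # e.g. pain scale
print(classify_variable(True, is_counted=True))          # e.g. number of siblings
print(classify_variable(True))                           # e.g. height
```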

The Numbers Trap

Not all numbers are numerical variables. If arithmetic (averaging, subtracting) produces a meaningless result, the variable is categorical — regardless of whether the values are digits.

Looks numerical, but ISN'T	Why
Zip codes (90210)	"Average zip code" is meaningless
Phone numbers	Can't add two phone numbers
Social Security numbers	Labels, not quantities
Jersey numbers	Player #24 isn't "twice" player #12
Coded responses (1 = Yes, 2 = No)	Codes are labels in disguise
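
The coded-response row can be made concrete with a short sketch. The responses below are invented for illustration. Python will compute an average of the codes without complaint, which is exactly why the check has to come from the analyst and not from the software.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses coded as 1 = Yes, 2 = No.
coded = [1, 2, 2, 1, 2]
labels = {1: "Yes", 2: "No"}

# The arithmetic runs, but the result does not describe any respondent.
meaningless_average = mean(coded)

# Decoding first makes the categorical nature explicit; counting is valid.
decoded = [labels[code] for code in coded]
counts = Counter(decoded)

print(meaningless_average)
print(counts)
```

The frequency count (3 "No", 2 "Yes") answers a real question about the data; the average of the codes does not.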

Levels of Measurement Hierarchy

Ratio     → ratios meaningful (height, income)
Interval  → differences meaningful (temperature °F, calendar year)
Ordinal   → order meaningful (pain scale, rankings)
Nominal   → equality only (blood type, zip code)

Each level up unlocks more valid operations. You can always treat higher-level data as lower-level (ratio as ordinal), but never lower as higher (nominal as ratio).
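
One way to see why ratios are meaningful for ratio data but not for interval data is to change units and check whether the ratio survives. The sketch below uses the standard Fahrenheit-to-Celsius and inch-to-centimeter conversions; the specific values and the helper function names are chosen only for illustration.

```python
def fahrenheit_to_celsius(f):
    """Standard conversion; note the shift by 32 (no true zero)."""
    return (f - 32) * 5 / 9


def inches_to_cm(inches):
    """Standard conversion; a pure rescaling, so zero stays zero."""
    return inches * 2.54


# Interval scale: the ratio depends on the unit, so "twice as hot" is not meaningful.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)

# Interval scale: ratios of differences DO survive the unit change.
diff_ratio_f = (80 - 40) / (60 - 40)
diff_ratio_c = (fahrenheit_to_celsius(80) - fahrenheit_to_celsius(40)) / (
    fahrenheit_to_celsius(60) - fahrenheit_to_celsius(40)
)

# Ratio scale: "twice as tall" holds in any unit.
ratio_in = 72 / 36
ratio_cm = inches_to_cm(72) / inches_to_cm(36)

print(round(ratio_f, 2), round(ratio_c, 2))
print(round(diff_ratio_f, 2), round(diff_ratio_c, 2))
print(round(ratio_in, 2), round(ratio_cm, 2))
```

The temperature ratio works out to 2 in Fahrenheit but 6 in Celsius, while the height ratio is 2 in both units.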

Population vs. Sample, Parameter vs. Statistic

	Population	Sample
What is it?	Everyone you want to study	The subset you actually observe
Descriptive number	Parameter (unknown)	Statistic (known)
Analogy	The bullseye	Where your dart lands
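
The bullseye analogy can be simulated. In the sketch below the "population" is 10,000 simulated heights, an assumption made purely so that the parameter can be computed and compared with the statistic. In real studies the parameter is usually unknown, and only the sample value is available.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated heights in centimeters.
population = [random.gauss(170, 10) for _ in range(10_000)]
parameter = mean(population)  # the bullseye (normally unknown)

# The subset actually observed.
sample = random.sample(population, 100)
statistic = mean(sample)  # where the dart lands

print(f"parameter: {parameter:.2f}")
print(f"statistic: {statistic:.2f}")
```

Changing the seed or drawing a new sample moves the statistic, while the parameter stays fixed for a given population.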

Cross-Sectional vs. Longitudinal

	Cross-Sectional	Longitudinal
Analogy	Photograph	Time-lapse video
Time points	One	Multiple
Best for	Comparing groups at one moment	Tracking change over time
Causal claims	Weak	Stronger (but still not guaranteed)

Key Connections

Forward Connection	Why It Matters
Chapter 3 (Data Toolkit)	Python's `dtypes` tells you what Python thinks the type is — you need to verify
Chapter 5 (Graphs)	Variable type determines graph choice: bar chart (categorical) vs. histogram (numerical)
Chapter 6 (Summaries)	Mean/SD for numerical; mode/frequency for categorical
Chapters 14-16 (Inference)	Test choice depends on variable type: z-test (proportions) vs. t-test (means)
Chapter 19 (Chi-Square)	Designed specifically for categorical data analysis
Chapter 22 (Regression)	Requires numerical outcome variable; categorical predictors need special handling
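
The idea that variable type drives the choice of summary can be sketched as a small dispatcher. The function name `summarize`, the `kind` argument, and the example values are hypothetical. The point is only that the analyst declares the statistical type and the code then picks a summary that matches it.

```python
from collections import Counter
from statistics import mean, stdev


def summarize(values, kind):
    """Return a summary suited to the declared statistical type."""
    if kind == "numerical":
        # Full arithmetic is valid: mean and sample standard deviation.
        return {"mean": mean(values), "sd": stdev(values)}
    if kind == "categorical":
        # Only counting is valid: mode and frequency table.
        counts = Counter(values)
        return {"mode": counts.most_common(1)[0][0], "frequencies": dict(counts)}
    raise ValueError(f"unknown kind: {kind!r}")


# Made-up values for illustration.
print(summarize([2, 4, 6], "numerical"))
print(summarize(["flu", "cold", "flu"], "categorical"))
```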

Anchor Example Updates

Person	What You Learned About Their Data
Dr. Maya Chen	Flu surveillance data: mix of nominal (diagnosis, zip code), ordinal (severity), and numerical (age, days to recovery) variables
Alex Rivera	StreamVibe data: "watch time" definition matters; genre classification is complex; engagement tiers are ordinal, not numerical
Prof. Washington	Risk scores: ordinal or numerical depending on construction; racial categories are nominal with deep consequences
Sam Okafor	Basketball stats: position (nominal), draft round (ordinal), points/game (ratio); shooting percentage is a statistic estimating a parameter

Common Mistakes to Avoid

- Averaging zip codes, ID numbers, or codes: being made of digits does not make a variable numerical
- Treating ordinal as continuous without acknowledging the simplification: averaging 1-5 ratings is common, but it assumes the gaps between adjacent ratings are equal, which an ordinal scale does not guarantee
- Ignoring the data dictionary: coded values (77 = "Don't know") can corrupt calculations
- Confusing Python's dtype with statistical type: Python sees digits; you see meaning
- Forgetting that classification decisions shape analysis: who gets counted, who gets categorized, and what gets measured are not neutral choices
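
Two of these mistakes, trusting the dtype and ignoring coded values, can be shown in a few lines. This sketch assumes the pandas library is installed; the column names and values are invented, and 77 stands in for a "Don't know" code as it might appear in a data dictionary.

```python
import pandas as pd

# Hypothetical survey extract; 77 in days_active means "Don't know".
df = pd.DataFrame({
    "zip_code": [90210, 10001, 60614],
    "days_active": [3, 77, 5],
})

# pandas infers a numeric dtype for zip_code because the values are digits.
inferred_numeric = pd.api.types.is_numeric_dtype(df["zip_code"])

# Zip codes are nominal labels, so store them as text instead.
df["zip_code"] = df["zip_code"].astype(str)

# Averaging with the sentinel included inflates the result.
naive_mean = df["days_active"].mean()

# Masking the sentinel turns it into a missing value, which mean() skips.
clean_mean = df["days_active"].mask(df["days_active"] == 77).mean()

print(inferred_numeric)
print(round(naive_mean, 2), clean_mean)
```

In practice it is often safer to read label-like columns as text when loading the file (for example with the `dtype` argument of `pandas.read_csv`), since converting from integers afterward cannot restore leading zeros that were already dropped.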