Key Takeaways: Types of Data and the Language of Statistics
One-Sentence Summary
Every variable is either categorical (labels and groups) or numerical (measurable quantities), and correctly classifying your variables before analysis is the single most important step you can take to avoid meaningless results.
Core Concepts at a Glance
| Concept | Definition | Why It Matters |
|---|---|---|
| Observational unit | The individual entity each row of data describes | Defines what your dataset is "about"; must be identified first |
| Variable | A characteristic that varies across observational units | Each column captures one measurable or classifiable property |
| Categorical variable | Values are labels or group names | Use frequency counts, bar charts, mode; NOT averages |
| Numerical variable | Values are measurable quantities | Use averages, histograms, standard deviation; full arithmetic applies |
| Data dictionary | Documentation describing every variable in a dataset | |
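As a minimal sketch of why the categorical/numerical split matters, the snippet below summarizes each column with the tools listed above: counts and mode for a categorical variable, mean and standard deviation for a numerical one. The dataset, the column names, the tiny data dictionary, and the `summarize` helper are all hypothetical, invented only for illustration.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical dataset: the observational unit (each row) is one student.
rows = [
    {"major": "Biology", "height_cm": 170.0},
    {"major": "History", "height_cm": 165.0},
    {"major": "Biology", "height_cm": 180.0},
    {"major": "Physics", "height_cm": 175.0},
]

# Hypothetical, minimal data dictionary: variable name -> variable type.
data_dictionary = {"major": "categorical", "height_cm": "numerical"}


def summarize(rows, name, var_type):
    """Return a summary that is appropriate for the variable's type."""
    values = [row[name] for row in rows]
    if var_type == "categorical":
        counts = Counter(values)
        # Frequency counts and mode; labels are never averaged.
        return {"counts": dict(counts), "mode": counts.most_common(1)[0][0]}
    if var_type == "numerical":
        return {"mean": mean(values), "stdev": stdev(values)}
    raise ValueError(f"unknown variable type: {var_type!r}")


summaries = {
    name: summarize(rows, name, var_type)
    for name, var_type in data_dictionary.items()
}
for name, summary in summaries.items():
    print(name, summary)
```

Driving the choice of summary from the data dictionary, rather than from how the values happen to be stored, is one way to keep the classification step explicit.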
Step 1: Does the variable record a category/label, or a quantity?
- Category → Categorical
- Quantity → Numerical
Step 2 (Categorical): Do the categories have a natural order?
- No → Nominal
- Yes → Ordinal
Step 2 (Numerical): Is the variable counted (whole numbers only) or measured (any value within a range, including fractions)?
- Counted → Discrete
- Measured → Continuous
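The two-step decision above can be sketched as a small function. This is only an illustration: the function name, its parameters, and the example variables are hypothetical, and answering the yes/no questions for a real variable still requires judgment about what the values mean.

```python
def classify_variable(is_quantity, has_natural_order=False, is_counted=False):
    """Walk the two-step decision and return (type, subtype).

    is_quantity:       Step 1. True if values are quantities, False if labels.
    has_natural_order: Step 2 for categorical variables.
    is_counted:        Step 2 for numerical variables (counted vs. measured).
    """
    if not is_quantity:
        subtype = "ordinal" if has_natural_order else "nominal"
        return ("categorical", subtype)
    subtype = "discrete" if is_counted else "continuous"
    return ("numerical", subtype)


# Hypothetical example variables, classified by answering the questions.
examples = {
    "blood type": classify_variable(is_quantity=False),
    "pain scale": classify_variable(is_quantity=False, has_natural_order=True),
    "number of siblings": classify_variable(is_quantity=True, is_counted=True),
    "height": classify_variable(is_quantity=True),
}

for name, (var_type, subtype) in examples.items():
    print(f"{name}: {var_type} ({subtype})")
```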
The Numbers Trap
Not all numbers are numerical variables. If arithmetic (averaging, subtracting) on the values produces a meaningless result, the variable is categorical, even if its values are written with digits.
| Looks numerical, but ISN'T | Why |
|---|---|
| Zip codes (90210) | "Average zip code" is meaningless |
| Phone numbers | Can't add two phone numbers |
| Social Security numbers | Labels, not quantities |
| Jersey numbers | Player #24 isn't "twice" player #12 |
| Coded responses (1 = Yes, 2 = No) | Codes are labels in disguise |
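To make the trap concrete, here is a short sketch using zip codes. The list of zip codes is hypothetical sample data. Python will compute an average of the digits without complaint, which is exactly why the classification has to come from the analyst and not from the software.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample: zip codes stored as strings, because they are labels.
zip_codes = ["90210", "10001", "90210", "02139"]

# Forcing them into integers already damages the data:
# the leading zero of "02139" is lost.
as_integers = [int(z) for z in zip_codes]
lost_leading_zero = str(as_integers[3]) != zip_codes[3]

# The arithmetic runs, but the result is not a quantity of anything.
meaningless_average = mean(as_integers)

# The categorical tools (frequency counts, mode) give a usable answer.
counts = Counter(zip_codes)
most_common_zip = counts.most_common(1)[0][0]

print("lost leading zero:", lost_leading_zero)
print("average of zip codes (meaningless):", meaningless_average)
print("counts:", dict(counts))
print("most common zip code:", most_common_zip)
```

Storing such identifiers as strings is a simple guard: it preserves leading zeros and makes accidental averaging fail loudly instead of silently.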
Levels of Measurement Hierarchy
Ratio → ratios meaningful (height, income)
Interval → differences meaningful (temperature °F, calendar year)
Ordinal → order meaningful (pain scale, rankings)
Nominal → equality only (blood type, zip code)
Each level up unlocks more valid operations. You can always treat higher-level data as lower-level (ratio as ordinal), but never lower as higher (nominal as ratio).
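The hierarchy can be sketched as an ordered list in which each level inherits the operations of the levels below it. The names `LEVELS`, `NEW_OPERATIONS`, `valid_operations`, and `can_treat_as` are hypothetical helpers for illustration, not part of any library.

```python
# Levels ordered from lowest to highest.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# The operation each level adds on top of the levels below it.
NEW_OPERATIONS = {
    "nominal": {"equality"},
    "ordinal": {"order"},
    "interval": {"differences"},
    "ratio": {"ratios"},
}


def valid_operations(level):
    """All operations meaningful at `level`, including inherited ones."""
    rank = LEVELS.index(level)
    operations = set()
    for lower in LEVELS[: rank + 1]:
        operations |= NEW_OPERATIONS[lower]
    return operations


def can_treat_as(actual, target):
    """True if data measured at `actual` may be analyzed as `target`.

    Moving down the hierarchy is always allowed; moving up never is.
    """
    return LEVELS.index(target) <= LEVELS.index(actual)


print(sorted(valid_operations("interval")))
print(can_treat_as("ratio", "ordinal"))   # ratio data used as ordinal
print(can_treat_as("nominal", "ratio"))   # nominal data used as ratio
```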