Key Takeaways: Types of Data and the Language of Statistics
One-Sentence Summary
Every variable is either categorical (labels and groups) or numerical (measurable quantities), and correctly classifying your variables before analysis is the single most important step you can take to avoid meaningless results.
Core Concepts at a Glance

| Concept | Definition | Why It Matters |
| --- | --- | --- |
| Observational unit | The individual entity each row of data describes | Defines what your dataset is "about"; must be identified first |
| Variable | A characteristic that varies across observational units | Each column captures one measurable or classifiable property |
| Categorical variable | Values are labels or group names | Use frequency counts, bar charts, and the mode, NOT averages |
| Numerical variable | Values are measurable quantities | Use averages, histograms, and standard deviation (full arithmetic) |
| Data dictionary | Documentation describing every variable in a dataset | |

Step 1: Does the variable record a category/label, or a quantity?
- Category → Categorical
- Quantity → Numerical
Step 2 (Categorical): Do the categories have a natural order?
- No → Nominal
- Yes → Ordinal
Step 2 (Numerical): Is the variable counted (whole numbers only) or measured (any value)?
- Counted → Discrete
- Measured → Continuous
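
The two-step procedure above can be sketched as a small function. This is only an illustration: the function name, its parameters, and the "number of siblings" example are assumptions made for this sketch, not something from the chapter.

```python
def classify_variable(records_quantity: bool, *, ordered: bool = False,
                      counted: bool = False) -> str:
    """Classify a variable using the two-step procedure.

    records_quantity: Step 1, True if the variable records a quantity.
    ordered:          Step 2 (categorical), True if categories have a natural order.
    counted:          Step 2 (numerical), True if values are counted, not measured.
    """
    if records_quantity:
        # Numerical branch: counted -> discrete, measured -> continuous
        return "discrete" if counted else "continuous"
    # Categorical branch: ordered -> ordinal, unordered -> nominal
    return "ordinal" if ordered else "nominal"


print(classify_variable(False))                # e.g. blood type
print(classify_variable(False, ordered=True))  # e.g. pain scale
print(classify_variable(True, counted=True))   # e.g. number of siblings
print(classify_variable(True))                 # e.g. height
```

The answers to the Step 1 and Step 2 questions still have to come from you; the code only records the decision path.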
The Numbers Trap
Not all numbers are numerical variables. If arithmetic (averaging, subtracting) produces a meaningless result, the variable is categorical — regardless of whether the values are digits.

| Looks numerical, but ISN'T | Why |
| --- | --- |
| Zip codes (90210) | "Average zip code" is meaningless |
| Phone numbers | Can't add two phone numbers |
| Social Security numbers | Labels, not quantities |
| Jersey numbers | Player #24 isn't "twice" player #12 |
| Coded responses (1 = Yes, 2 = No) | Codes are labels in disguise |

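
To see the trap in action, here is a minimal sketch using only Python's standard library. The zip codes are invented for illustration. Python computes the average without complaint, which is exactly the problem: nothing stops you from doing arithmetic on labels.

```python
from collections import Counter
from statistics import mean

# Hypothetical zip codes for five survey respondents (illustrative values only).
zip_codes = [90210, 10001, 60614, 90210, 73301]

# The arithmetic runs without error, but the result describes nothing real.
meaningless_average = mean(zip_codes)

# Treat the values as labels instead: store them as text and count them.
zip_labels = [str(z).zfill(5) for z in zip_codes]
counts = Counter(zip_labels)
most_common_zip, frequency = counts.most_common(1)[0]

print("Average zip code (meaningless):", meaningless_average)
print("Most common zip code:", most_common_zip, "appears", frequency, "times")
```

Storing the codes as text makes the intent explicit and also preserves leading zeros, which an integer would silently drop.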
Levels of Measurement Hierarchy
Ratio → ratios meaningful (height, income)
Interval → differences meaningful (temperature °F, calendar year)
Ordinal → order meaningful (pain scale, rankings)
Nominal → equality only (blood type, zip code)
Each level up unlocks more valid operations. You can always treat higher-level data as lower-level (ratio as ordinal), but never lower as higher (nominal as ratio).
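
One way to picture the hierarchy is as an ordered list in which each level inherits every operation of the levels below it. The names `LEVELS`, `valid_operations`, and `can_treat_as` are hypothetical helpers written for this sketch.

```python
LEVELS = ["nominal", "ordinal", "interval", "ratio"]  # lowest to highest

# The operation that first becomes meaningful at each level.
UNLOCKED_AT = {
    "nominal": "equality",
    "ordinal": "order",
    "interval": "differences",
    "ratio": "ratios",
}


def valid_operations(level: str) -> list:
    """Operations meaningful at `level`: its own plus those of every lower level."""
    idx = LEVELS.index(level)
    return [UNLOCKED_AT[name] for name in LEVELS[: idx + 1]]


def can_treat_as(actual: str, target: str) -> bool:
    """Higher-level data may be treated as lower-level, never the reverse."""
    return LEVELS.index(actual) >= LEVELS.index(target)


print(valid_operations("interval"))
print(can_treat_as("ratio", "ordinal"))
print(can_treat_as("nominal", "ratio"))
```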
Population vs. Sample, Parameter vs. Statistic

| | Population | Sample |
| --- | --- | --- |
| What is it? | Everyone you want to study | The subset you actually observe |
| Descriptive number | Parameter (unknown) | Statistic (known) |
| Analogy | The bullseye | Where your dart lands |

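
The distinction can be simulated in a few lines. In this sketch the population is made up (simulated heights, not real data), so the parameter can be computed directly, something that is rarely possible in practice. The population size, sample size, and seed are arbitrary choices for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# A simulated population of 10,000 heights in cm (not real data).
population = [random.gauss(170, 10) for _ in range(10_000)]
parameter = mean(population)   # the bullseye: usually unknown in practice

# In a real study we only observe a sample and compute a statistic from it.
sample = random.sample(population, 50)
statistic = mean(sample)       # where the dart lands

print(f"Parameter (population mean): {parameter:.2f}")
print(f"Statistic (sample mean):     {statistic:.2f}")
```

Changing the seed gives a different sample and a different statistic, while the parameter of a given population stays fixed.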
Cross-Sectional vs. Longitudinal

| | Cross-Sectional | Longitudinal |
| --- | --- | --- |
| Analogy | Photograph | Time-lapse video |
| Time points | One | Multiple |
| Best for | Comparing groups at one moment | Tracking change over time |
| Causal claims | Weak | Stronger (but still not guaranteed) |

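
The difference shows up in how the same records are sliced. The sketch below uses invented records (person, year, weekly watch hours); the field names and values are assumptions made for illustration.

```python
# Hypothetical records: (person, year, weekly_watch_hours). Values are invented.
records = [
    ("A", 2021, 5.0), ("A", 2022, 6.5), ("A", 2023, 8.0),
    ("B", 2021, 3.0), ("B", 2022, 2.5), ("B", 2023, 4.0),
]

# Cross-sectional view: every unit at a single time point (the photograph).
cross_section = {person: hours for person, year, hours in records if year == 2023}

# Longitudinal view: each unit followed across time (the time-lapse video).
longitudinal = {}
for person, year, hours in sorted(records, key=lambda r: (r[0], r[1])):
    longitudinal.setdefault(person, []).append((year, hours))

# Change within a unit can only be computed from the longitudinal view.
change = {p: series[-1][1] - series[0][1] for p, series in longitudinal.items()}

print(cross_section)
print(change)
```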
Key Connections

| Forward Connection | Why It Matters |
| --- | --- |
| Chapter 3 (Data Toolkit) | Python's dtypes report what Python thinks each type is; you still need to verify it yourself |
| Chapter 5 (Graphs) | Variable type determines graph choice: bar chart (categorical) vs. histogram (numerical) |
| Chapter 6 (Summaries) | Mean/SD for numerical; mode/frequency for categorical |
| Chapters 14-16 (Inference) | Test choice depends on variable type: z-test (proportions) vs. t-test (means) |
| Chapter 19 (Chi-Square) | Designed specifically for categorical data analysis |
| Chapter 22 (Regression) | Requires numerical outcome variable; categorical predictors need special handling |

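
As a rough preview of how variable type drives the choice of summary, the sketch below lets a small data dictionary decide which statistics to compute. The column names, values, and the `summarize` helper are hypothetical; it uses only the standard library rather than the tools introduced in later chapters.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical data dictionary: column name -> statistical type.
data_dictionary = {"position": "categorical", "points_per_game": "numerical"}

# Invented values for illustration.
data = {
    "position": ["G", "F", "G", "C", "G"],
    "points_per_game": [12.0, 8.5, 20.0, 10.5, 14.0],
}


def summarize(column: str) -> dict:
    """Pick the summary from the declared statistical type, not the stored type."""
    values = data[column]
    if data_dictionary[column] == "numerical":
        return {"mean": mean(values), "sd": stdev(values)}
    counts = Counter(values)
    return {"mode": counts.most_common(1)[0][0], "frequencies": dict(counts)}


print(summarize("position"))
print(summarize("points_per_game"))
```

The design choice here is that the analyst declares the type once, in the data dictionary, and every later summary respects that declaration.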
Anchor Example Updates

| Person | What You Learned About Their Data |
| --- | --- |
| Dr. Maya Chen | Flu surveillance data: mix of nominal (diagnosis, zip code), ordinal (severity), and numerical (age, days to recovery) variables |
| Alex Rivera | StreamVibe data: "watch time" definition matters; genre classification is complex; engagement tiers are ordinal, not numerical |
| Prof. Washington | Risk scores: ordinal or numerical depending on construction; racial categories are nominal with deep consequences |
| Sam Okafor | Basketball stats: position (nominal), draft round (ordinal), points/game (ratio); shooting percentage is a statistic estimating a parameter |

Common Mistakes to Avoid
- Averaging zip codes, ID numbers, or codes: being made of digits does not make a value numerical
- Treating ordinal as continuous without acknowledging the simplification: averaging 1-5 ratings is common, but the result is only an approximation
- Ignoring the data dictionary: coded values (77 = "Don't know") can corrupt calculations
- Confusing Python's dtype with statistical type: Python sees digits; you see meaning
- Forgetting that classification decisions shape analysis: who gets counted, who gets categorized, and what gets measured are not neutral choices
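
The coded-value mistake is easy to demonstrate. In this sketch, a hypothetical survey column records hours of sleep, and 77 is the data dictionary's code for "Don't know". The variable and its values are invented for illustration.

```python
from statistics import mean

# Hypothetical column: hours of sleep last night.
# Per the (assumed) data dictionary, 77 means "Don't know", not 77 hours.
DONT_KNOW = 77
sleep_hours = [7, 6, 77, 8, 77, 7]

# Naive calculation: treats the code as if it were a real measurement.
naive_mean = mean(sleep_hours)

# Corrected calculation: drop the coded values before averaging.
valid_hours = [h for h in sleep_hours if h != DONT_KNOW]
clean_mean = mean(valid_hours)

print(f"Naive mean (corrupted by codes): {naive_mean:.2f}")
print(f"Mean of valid responses:         {clean_mean:.2f}")
print(f"Responses coded 'Don't know':    {sleep_hours.count(DONT_KNOW)}")
```

Report how many responses were excluded alongside the cleaned mean, since dropping them is itself a classification decision.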