Exercises: Types of Data and the Language of Statistics

These exercises progress from concept checks to challenging applications. Estimated completion time: 1.5 hours.

Difficulty Guide:

- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)

Part A: Conceptual Understanding ⭐

A.1. Explain the difference between a categorical variable and a numerical variable in your own words. Why does it matter which one you're working with?

A.2. For each of the following, state whether it is nominal, ordinal, discrete, or continuous. Explain your reasoning.

a) Military rank (Private, Corporal, Sergeant, Lieutenant, Captain)
b) Number of pets owned
c) Social Security number
d) Temperature in Celsius
e) Letter grade on an exam (A, B, C, D, F)
f) Time to run a mile (in minutes)
g) Country of birth
h) Number of text messages sent yesterday

A.3. A classmate says, "Zip codes are numerical because they're made of numbers." Explain why this is incorrect and what the classmate is confusing.

A.4. What is an observational unit? Why is it important to identify the observational unit before beginning an analysis?

A.5. In your own words, explain the difference between a parameter and a statistic. Give one example of each that involves Dr. Maya Chen's public health work.

A.6. What is a data dictionary, and why is it important? List at least three pieces of information that a good data dictionary includes for each variable.

A.7. Explain the difference between cross-sectional and longitudinal data. Give one example of each from a field you're interested in.

Part B: Applied Analysis ⭐⭐

B.1. Consider the following dataset from Sam Okafor's basketball analytics work:

| Player | Position | Height (in) | Points/Game | Free Throw % | Draft Round | Games Played |
|---|---|---|---|---|---|---|
| Daria K. | Guard | 68 | 18.4 | 82.1 | 1st | 58 |
| Marcus J. | Forward | 79 | 12.7 | 74.5 | 2nd | 62 |
| Tomas R. | Center | 83 | 9.2 | 61.3 | Undrafted | 45 |
| Aliya M. | Guard | 66 | 21.1 | 88.7 | 1st | 60 |
| Chen W. | Forward | 80 | 14.3 | 76.2 | 2nd | 55 |

a) What is the observational unit?
b) Classify each variable (Player, Position, Height, Points/Game, Free Throw %, Draft Round, Games Played) as categorical or numerical, and then as nominal, ordinal, discrete, or continuous.
c) For each variable, identify its level of measurement (nominal, ordinal, interval, or ratio).
d) Which variables could Sam meaningfully average? Which could he not?

B.2. Alex Rivera is examining StreamVibe data that includes the following variables: User ID, Age, Subscription Plan (Free/Basic/Premium), Device Type (Mobile/Tablet/Desktop/Smart TV), Number of Profiles on Account, Hours Watched This Month, Content Rating Given (1-5 stars), and Date of Last Login.

a) Classify each variable.
b) Alex's boss asks, "What's the average Device Type?" Explain why this question doesn't make sense.
c) Alex's boss then asks, "What's the average Content Rating?" Is this question more reasonable? Discuss the nuance.

B.3. Dr. Maya Chen collects the following variables for a disease outbreak study: Patient Age, Sex (Male/Female/Non-binary), Zip Code, Diagnosis Code (ICD-10), Number of Symptoms Reported, Days from Exposure to Symptom Onset, Severity (Mild/Moderate/Severe/Critical), and Lab Test Result (Positive/Negative/Inconclusive).

a) Classify each variable.
b) Build a mini data dictionary (table with columns: Variable Name, Type, Valid Values, Notes) for these variables.
c) Dr. Chen wants to compare the average "number of symptoms reported" between patients classified as "Mild" vs. "Severe." Which variable is the grouping variable, and what type is it? Which is the outcome variable, and what type is it?

B.4. Professor Washington examines a dataset about defendants in a criminal justice system. The dataset contains: Defendant ID, Age, Race/Ethnicity, Charge Type, Prior Convictions (count), Algorithm Risk Score (1-10), Judge's Bail Decision (Released/Bail Set/Detained), and Neighborhood Crime Category (Low/Medium/High).

a) Classify each variable as nominal, ordinal, discrete, or continuous.
b) The algorithm risk score is calculated by a mathematical formula that weighs several factors. Given this, would you treat it as ordinal or as numerical (interval/ratio)? Justify your answer.
c) "Prior Convictions" counts the number of times someone has been convicted before. Is this a good variable to use in a risk algorithm? What concerns might Professor Washington raise?

B.5. A researcher hands you a spreadsheet with 2,000 rows and these column headers: ID, STATE, INCOME, EDU_LEVEL, AGE_GROUP, NUMKIDS, HEALTH_RATING, BMI, SMOKER, YEAR.

Without seeing the data, make your best guess about each variable's type. For any variable that could go either way, explain what additional information you'd need to classify it.

Part C: Skills Practice ⭐⭐

C.1. For each of the following statements, determine whether the arithmetic described is meaningful. If it's not, explain why.

a) "The average jersey number on our team is 34.5."
b) "On average, patients recovered 3.2 days faster with the new treatment."
c) "The average zip code in our sample is 48221."
d) "The average satisfaction rating is 4.1 out of 5."
e) "The average hair color is brown."
f) "Patients in Group A had a 15% higher survival rate than patients in Group B."

C.2. Consider the following Python output:

>>> df.dtypes
student_id          int64
name               object
gpa               float64
major              object
credits_earned      int64
grad_year           int64
honors             object

a) For each variable, determine the correct statistical data type (nominal, ordinal, discrete, or continuous). Note any cases where Python's data type doesn't match the correct statistical classification.
b) Which variables would you need to convert or reclassify before analysis? What changes would you make?

C.3. In a spreadsheet, you notice that a column called "Phone Number" is formatted as a number — so the entry 2125551234 appears as 2,125,551,234 with comma separators. Explain what went wrong and how you would fix it. What about a "Zip Code" column where 02138 appears as 2138?
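The two symptoms described in C.3 can be reproduced outside a spreadsheet. The short Python sketch below is only an illustration: the variable names are hypothetical, and a spreadsheet applies these conversions automatically instead of through explicit calls. It shows what happens to each identifier once it is treated as a number. Deciding how to prevent the problem remains part of the exercise.

```python
# Identifiers as they were originally typed (text, not quantities).
phone_text = "2125551234"
zip_text = "02138"

# Treating the phone number as a number invites numeric formatting.
phone_as_number = int(phone_text)
phone_displayed = f"{phone_as_number:,}"  # thousands separators added

# Treating the zip code as a number discards the leading zero.
zip_as_number = int(zip_text)
zip_displayed = str(zip_as_number)

print(phone_displayed)  # 2,125,551,234
print(zip_displayed)    # 2138
```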

C.4. Create a small dataset (at least 6 rows and 5 columns) about a topic you're interested in. The dataset must include:

- At least one nominal variable
- At least one ordinal variable
- At least one discrete numerical variable
- At least one continuous numerical variable

Write a data dictionary for your dataset.
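If you keep your dataset in Python, a data dictionary can live alongside it as plain data. The sketch below is one possible layout, not a required format: the field names mirror the columns suggested in B.3 (Variable Name, Type, Valid Values, Notes), while the variable `coffee_size`, its values, and the helper `invalid_entries` are hypothetical and exist only for illustration.

```python
# One entry per variable; fields mirror the B.3 columns.
# "coffee_size" is a made-up variable used only to show the layout.
data_dictionary = [
    {
        "variable_name": "coffee_size",
        "type": "categorical, ordinal",
        "valid_values": ["Small", "Medium", "Large"],
        "notes": "Ordered by cup volume.",
    },
]


def invalid_entries(values, entry):
    """Return the observed values that the dictionary entry does not allow."""
    return [v for v in values if v not in entry["valid_values"]]


# Hypothetical observations, including one typo.
observed = ["Small", "Large", "Medum", "Medium"]
print(invalid_entries(observed, data_dictionary[0]))  # ['Medum']
```

Writing the valid values down explicitly is what makes a check like this possible; without them, a typo such as "Medum" would quietly become a fourth category.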

C.5. For each pair below, identify which dataset is cross-sectional and which is longitudinal. Explain how you know.

a) Dataset 1: Blood pressure readings for 500 patients taken during a single clinic visit in January 2026. Dataset 2: Blood pressure readings for 200 patients taken every 6 months from 2020 to 2026.

b) Dataset 1: Average SAT scores by state for the year 2025. Dataset 2: Average SAT scores for the state of California from 2010 to 2025.

Part D: Synthesis & Critical Thinking ⭐⭐⭐

D.1. The following table shows how two variables are actually coded in the CDC's Behavioral Risk Factor Surveillance System (BRFSS):

| Variable | Code | Meaning |
|---|---|---|
| `_RFHLTH` | 1 | Good or Better Health |
| `_RFHLTH` | 2 | Fair or Poor Health |
| `_RFHLTH` | 9 | Don't know/Refused |
| `PHYSHLTH` | 1-30 | Number of days physical health not good |
| `PHYSHLTH` | 88 | None (no bad days) |
| `PHYSHLTH` | 77 | Don't know/Not sure |
| `PHYSHLTH` | 99 | Refused |

a) What type of variable is `_RFHLTH`? What type is `PHYSHLTH`?
b) A careless analyst might calculate the average of `PHYSHLTH` and get a number around 30. Why would this be wrong? (Hint: think about what the codes 77, 88, and 99 represent.)
c) How does this example illustrate why data dictionaries are essential?
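After attempting part (b), you can check your reasoning numerically. The sketch below uses ten invented `PHYSHLTH` responses (not real BRFSS records), chosen so that the careless average lands at 30. It compares that figure with the average obtained after applying the codebook: 88 becomes zero days, and 77 and 99 are set aside as non-answers. Excluding non-answers is an assumption made here for simplicity; whether it is appropriate depends on the analysis.

```python
from statistics import mean

# Invented responses for illustration only (not real BRFSS data).
responses = [88, 2, 5, 77, 1, 3, 14, 99, 4, 7]

# Careless approach: average the raw codes as if all were day counts.
naive_mean = mean(responses)

# Codebook-aware approach, using the codes from the table above.
NONE_CODE = 88            # "None": zero bad days
NON_ANSWERS = {77, 99}    # "Don't know/Not sure" and "Refused"

days = [0 if r == NONE_CODE else r for r in responses if r not in NON_ANSWERS]
recoded_mean = mean(days)

print(naive_mean)    # 30
print(recoded_mean)  # 4.5
```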

D.2. Sam Okafor wants to compare players across different positions (Guard, Forward, Center). He plans to use average points per game for each position.

a) What's the observational unit for this comparison?
b) Could the difference in average points per game across positions be misleading? What other variables might explain the differences? (This connects to the correlation vs. causation theme from Chapter 1.)
c) What additional variables would you want in the dataset to make the comparison fairer?
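Sam's planned calculation can be sketched with the five players from the B.1 table. This is a minimal illustration, assuming those five rows are the whole dataset; the variable names are hypothetical. The group sizes are printed next to each average so you can weigh them when answering part (b).

```python
from collections import defaultdict
from statistics import mean

# (player, position, points per game) copied from the B.1 table.
players = [
    ("Daria K.", "Guard", 18.4),
    ("Marcus J.", "Forward", 12.7),
    ("Tomas R.", "Center", 9.2),
    ("Aliya M.", "Guard", 21.1),
    ("Chen W.", "Forward", 14.3),
]

# Collect points per game under each position (the grouping variable).
points_by_position = defaultdict(list)
for _name, position, ppg in players:
    points_by_position[position].append(ppg)

position_means = {pos: mean(vals) for pos, vals in points_by_position.items()}
group_sizes = {pos: len(vals) for pos, vals in points_by_position.items()}

for pos, avg in position_means.items():
    print(f"{pos}: {avg:.2f} points/game across {group_sizes[pos]} player(s)")
```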

D.3. Alex Rivera discovers that StreamVibe's data team has been recording user satisfaction as both a numerical score (1-100) and an ordinal category (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied). The two don't always agree — a user might give a score of 65 but select "Satisfied."

a) Why might these two measurements disagree?
b) Which would you use for analysis? Does it depend on what question you're answering?
c) Design a simple data dictionary entry for each version of this variable.

D.4. Professor Washington argues that the very act of creating categories in criminal justice data is a political decision, not just a technical one. The category "race/ethnicity," for example, has changed across U.S. Census decades (there was no "Hispanic" category before 1970; "multiracial" wasn't an option until 2000).

a) How does changing the categories of a nominal variable change what the data can reveal?
b) If a researcher combines "Asian" and "Pacific Islander" into one category to get a larger sample size, what information is lost?
c) Connect this to the chapter's theme: "the human stories behind the data."

Part M: Mixed Practice (Interleaved with Chapter 1) ⭐⭐

M.1. For each scenario below, identify: (i) the population, (ii) the sample, (iii) one parameter, and (iv) one statistic.

a) A university surveys 400 of its 12,000 students about study habits and finds they study an average of 14.2 hours per week.
b) A factory tests 50 batteries from a production run of 10,000 and finds that 3 are defective.
c) Dr. Chen tracks 2,000 flu patients in a county of 500,000 residents and finds that 12% required hospitalization.

M.2. In Chapter 1, you learned about the four pillars of statistical investigation: Ask, Collect, Analyze, Interpret. How does the process of classifying variables (Chapter 2) relate to each pillar? Which pillar is most affected by getting your variable types wrong?

M.3. A news headline reads: "Survey: 73% of Americans Support Stricter Environmental Regulations." Using concepts from both Chapter 1 and Chapter 2:

a) Is this descriptive or inferential statistics?
b) What is the observational unit in the survey?
c) What type of variable is "support for stricter environmental regulations"? Is it nominal, ordinal, or something else? (Consider: does the survey just ask yes/no, or might it use a scale?)
d) Is 73% a parameter or a statistic?

M.4. Return to Sam Okafor's scenario from Chapter 1. Daria Williams shot 31% from three-point range last season (56 makes on 180 attempts) and 38% this season (25 makes on 65 attempts). Classify each of these variables:

a) "Season" (last season vs. this season)
b) "Number of three-pointers attempted"
c) "Three-point shooting percentage"
d) "Result of each shot" (made vs. missed)

Then: is the shooting percentage a parameter or a statistic? Explain.
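The percentages quoted in M.4 follow directly from the makes and attempts, which is worth keeping in mind when you classify "three-point shooting percentage." A quick check, using only the numbers stated in the exercise (the variable names are hypothetical):

```python
# Makes and attempts as stated in M.4.
seasons = {
    "last season": (56, 180),
    "this season": (25, 65),
}

# Shooting percentage = 100 * makes / attempts.
percentages = {
    label: 100 * makes / attempts
    for label, (makes, attempts) in seasons.items()
}

for label, pct in percentages.items():
    print(f"{label}: {pct:.1f}%")
# last season: 31.1%
# this season: 38.5%
```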

M.5. Consider two datasets about the same topic:

Dataset A: A snapshot of every country's GDP, population, and life expectancy in 2025.
Dataset B: GDP, population, and life expectancy for 50 countries measured every year from 2000 to 2025.

a) Which is cross-sectional? Which is longitudinal?
b) Which dataset would be better for answering "Does GDP growth lead to improved life expectancy over time?" Why?
c) Which pillar of statistical investigation (from Chapter 1) does the choice between these two designs most affect?

Part E: Research & Extension ⭐⭐⭐⭐

E.1. Find the data dictionary (or codebook) for one of the suggested portfolio datasets (CDC BRFSS, Gapminder, U.S. College Scorecard, World Happiness Report, or NOAA Climate Data). Choose five variables and:

a) Record their official variable names, descriptions, and types
b) Classify each as nominal, ordinal, discrete, or continuous
c) Note any cases where the classification is ambiguous and explain your reasoning
d) Identify any coded values (like 77 = "Don't know") that a careless analyst might misinterpret

Write a 1-page summary of what you found.

E.2. The debate over whether Likert scale data (1-5 or 1-7 ratings) should be treated as ordinal or numerical has generated dozens of research papers. Find one such paper or authoritative source and summarize:

a) The argument for treating Likert data as ordinal
b) The argument for treating it as numerical
c) Under what conditions each approach is more appropriate
d) Your own position, with justification

E.3. Research the concept of "measurement error." How does it relate to the distinction between discrete and continuous variables? Can a continuous variable be measured precisely enough to be treated as discrete? Write a paragraph with an example from health care or sports analytics.

Solutions

Selected solutions are in appendices/answers-to-selected.md.