Quiz: Types of Data and the Language of Statistics

Test your understanding before moving on. Target: 70% or higher to proceed confidently.

Section 1: Multiple Choice (1 point each)

1. A researcher records the eye color (brown, blue, green, hazel) of 200 participants. Eye color is:

A) Nominal categorical
B) Ordinal categorical
C) Discrete numerical
D) Continuous numerical

Answer

**A)** Nominal categorical. *Why A:* Eye color categories have no inherent order — brown is not "more" or "higher" than blue. The values are labels, and arithmetic is meaningless. *Why not B:* There's no natural ranking among eye colors. *Why not C/D:* Eye color is not a quantity. *Reference:* Section 2.3

2. "Number of siblings" is an example of:

A) A nominal variable
B) An ordinal variable
C) A discrete numerical variable
D) A continuous numerical variable

Answer

**C)** A discrete numerical variable. *Why C:* Number of siblings is counted in whole numbers (0, 1, 2, 3, ...). You can't have 2.7 siblings. It's a meaningful quantity — you can calculate an average. *Why not A/B:* It's a quantity, not a category. *Why not D:* You count siblings; you don't measure them on a continuous scale. *Reference:* Section 2.3

3. Which of the following is NOT a numerical variable?

A) Annual income in dollars
B) Age in years
C) Social Security number
D) Height in centimeters

Answer

**C)** Social Security number. *Why C:* SSNs are identification labels, not quantities. Averaging two SSNs produces a meaningless number. Despite being composed of digits, SSNs are categorical (nominal). *Why not A:* Income is a measured quantity (ratio level). *Why not B:* Age is a measured quantity (ratio level). *Why not D:* Height is a measured quantity (ratio level). *Reference:* Section 2.2

4. A hotel asks guests to rate their stay as "Poor," "Fair," "Good," "Very Good," or "Excellent." This variable is:

A) Nominal — the categories are just labels
B) Ordinal — the categories have a meaningful order
C) Discrete — the responses can be numbered 1 through 5
D) Continuous — satisfaction exists on a spectrum

Answer

**B)** Ordinal — the categories have a meaningful order. *Why B:* There's a clear ranking: Excellent > Very Good > Good > Fair > Poor. But the "distance" between adjacent categories isn't necessarily equal. Is the gap between "Poor" and "Fair" the same as between "Good" and "Very Good"? We don't know. *Why not A:* These categories DO have an order — nominal variables don't. *Why not C:* Even if we assign numbers (1-5), the underlying data represents ordered categories, not counted quantities. *Why not D:* The responses are distinct categories, not a continuous measurement. *Reference:* Section 2.3
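A minimal Python sketch of what "ordered, but not evenly spaced" allows in practice. The responses below are invented for illustration; the only thing taken from the question is the five rating labels. Comparisons and the median rely on order alone, so they are reasonable for ordinal data, while a mean of the rank numbers would silently assume equal gaps.

```python
from statistics import median_low

# Ordered categories from the hotel survey, worst to best.
LEVELS = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
RANK = {label: position for position, label in enumerate(LEVELS)}

# Hypothetical responses, invented for illustration.
responses = ["Good", "Excellent", "Fair", "Good", "Very Good", "Poor", "Good"]

# Order comparisons are meaningful for ordinal data.
best = max(responses, key=RANK.get)
worst = min(responses, key=RANK.get)

# The median only needs an ordering, so it is a valid ordinal summary.
median_rank = median_low(RANK[r] for r in responses)
median_label = LEVELS[median_rank]

print(best, worst, median_label)
```

The rank numbers 0 through 4 are only a sorting aid here; they are not a claim that "Fair" is exactly one unit better than "Poor".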

5. In a dataset of patients at a clinic, each row represents one patient. The "one patient" is the:

A) Variable
B) Statistic
C) Observational unit
D) Parameter

Answer

**C)** Observational unit. *Why C:* The observational unit is the individual entity that each row describes. If each row is a patient, then "patient" is the observational unit. *Why not A:* Variables are columns (characteristics measured), not rows. *Why not B:* A statistic is a number computed from sample data. *Why not D:* A parameter is a number describing a population. *Reference:* Section 2.1

6. A researcher surveys 500 out of 20,000 employees at a company and finds that the average commute time is 28 minutes. The number 28 minutes is a:

A) Parameter
B) Statistic
C) Variable
D) Population

Answer

**B)** Statistic. *Why B:* The 28 minutes is calculated from a sample (500 employees), not the entire population (20,000 employees). Statistics are computed from samples. *Why not A:* A parameter would be the true average commute time for ALL 20,000 employees — which is unknown. *Why not C:* "Commute time" is the variable; 28 minutes is a summary number. *Why not D:* The population is the group of 20,000 employees, not the number 28. *Reference:* Section 2.4
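The distinction can be sketched with a simulation. Everything numeric below is an assumption made for illustration: the commute times are randomly generated, not real company data. In real studies the parameter is usually unknown; it is computable here only because the whole population is simulated.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: commute times (minutes) for all 20,000 employees.
population = [random.uniform(5, 51) for _ in range(20_000)]
parameter = mean(population)   # describes the whole population

# The researcher only sees 500 of them.
sample = random.sample(population, 500)
statistic = mean(sample)       # describes only the sample

print(f"parameter = {parameter:.2f} min, statistic = {statistic:.2f} min")
```

The two printed numbers should be close but not identical, which is the usual relationship between a statistic and the parameter it estimates.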

7. Which variable type allows you to make meaningful statements like "A is twice as much as B"?

A) Nominal
B) Ordinal
C) Interval
D) Ratio

Answer

**D)** Ratio. *Why D:* Ratio variables have a true zero point, which makes ratios meaningful. "60 inches is twice as tall as 30 inches" works because 0 inches means no height. *Why not A/B:* Nominal and ordinal variables don't support arithmetic, let alone ratios. *Why not C:* Interval variables have equal spacing but no true zero. You can't say "80°F is twice as hot as 40°F" because 0°F isn't the absence of temperature. *Reference:* Section 2.6

8. A dataset contains information about countries measured at a single point in time (2025). This is an example of:

A) Longitudinal data
B) Cross-sectional data
C) Ordinal data
D) Parametric data

Answer

**B)** Cross-sectional data. *Why B:* The data captures a snapshot — many countries at one point in time. No repeated measurements over time. *Why not A:* Longitudinal data follows the same units over multiple time points. *Why not C:* "Ordinal" describes a variable type, not a data collection design. *Why not D:* "Parametric" refers to statistical methods, not data structure. *Reference:* Section 2.7

9. In pandas, `df.dtypes` reports a column as `int64`. This means:

A) The column is definitely a numerical variable for statistical purposes
B) The column contains integers, but it could be categorical (like zip codes)
C) The column is ordinal
D) The column has no missing values

Answer

**B)** The column contains integers, but it could be categorical (like zip codes). *Why B:* pandas assigns `int64` when every value in a column is a whole number; it has no way to know what the numbers mean. Zip codes, ID numbers, and category codes are often stored as integers despite being categorical. You need to know your data to classify correctly; the software can't do it for you. *Why not A:* This is the trap the question is testing. A dtype of `int64` describes how values are stored, not the statistical type of the variable. *Why not C:* The dtype doesn't distinguish ordinal from nominal. *Why not D:* `int64` is not a guarantee of completeness. A column containing NaN is typically read as `float64`, but missing values can also be hidden behind placeholder codes (for example, -999), which still load as `int64`. *Reference:* Section 2.5
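A short pandas sketch of the trap, using a made-up three-row table (the column names and values are assumptions for illustration). Both columns come back as `int64`, yet only one of them is a quantity.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "zip_code": [60614, 10001, 94110],
    "siblings": [2, 0, 3],
})

print(df.dtypes)  # both columns are reported as int64
dtype_before = str(df["zip_code"].dtype)

# pandas will happily average zip codes, but the result is not a real place.
meaningless_mean = df["zip_code"].mean()

# Storing the labels as text makes the categorical intent explicit.
df["zip_code"] = df["zip_code"].astype(str)
print(df["zip_code"].tolist())
```

Converting to text is one option; `astype("category")` is another. Either way, the decision comes from knowing what the column represents, not from the dtype.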

10. Which of the following is the best reason to create a data dictionary before analyzing a dataset?

A) It makes the dataset look more professional
B) It prevents misclassifying variables, documents assumptions, and ensures reproducibility
C) It is required by Python before you can load the data
D) It eliminates the need for data cleaning

Answer

**B)** It prevents misclassifying variables, documents assumptions, and ensures reproducibility. *Why B:* Data dictionaries serve multiple critical functions: they prevent errors (like averaging zip codes), explain coded values and missing data conventions, and allow other researchers to replicate the analysis. *Why not A:* Looking professional is a side benefit, not the main purpose. *Why not C:* Python doesn't require a data dictionary to load data. *Why not D:* Data dictionaries inform cleaning decisions but don't replace the need for cleaning. *Reference:* Section 2.5
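One possible way to make a data dictionary do some of this work in code is sketched below. The structure, column names, and the `can_average` helper are all hypothetical; data dictionaries are often kept as a plain document or spreadsheet instead.

```python
# A hypothetical data dictionary for a small patient dataset.
data_dictionary = {
    "patient_id": {"type": "nominal", "note": "identifier; never average"},
    "zip_code": {"type": "nominal", "note": "stored as digits, still a label"},
    "sex_code": {"type": "nominal", "note": "1 = Male, 2 = Female"},
    "age_years": {"type": "continuous", "note": "recorded in whole years"},
    "num_visits": {"type": "discrete", "note": "count of clinic visits"},
}

NUMERICAL_TYPES = {"discrete", "continuous"}


def can_average(column: str) -> bool:
    """Return True only if the dictionary marks the column as numerical."""
    return data_dictionary[column]["type"] in NUMERICAL_TYPES


for name in data_dictionary:
    print(f"{name}: mean allowed = {can_average(name)}")
```

A check like this catches "average zip code" mistakes before they reach a report, and the notes document coded values for anyone who reuses the data.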

Section 2: True/False with Justification (1 point each)

11. "If a variable consists entirely of numbers, it must be a numerical variable."

Answer

**False.** *Explanation:* Zip codes, Social Security numbers, phone numbers, jersey numbers, and coded values (like 1 = Male, 2 = Female) consist entirely of digits but are categorical. The test is whether arithmetic produces meaningful results, not whether the values happen to be digits.

12. "Ordinal variables can be meaningfully ranked, but the distances between consecutive values are not necessarily equal."

Answer

**True.** *Explanation:* This is the defining feature of ordinal data. Education levels (high school < bachelor's < master's < doctorate) have a clear order, but the "distance" between high school and bachelor's is not necessarily the same as between master's and doctorate. You can compare (greater/less) but not subtract meaningfully.

13. "Cross-sectional data is better than longitudinal data for studying how things change over time."

Answer

**False.** *Explanation:* Cross-sectional data is a snapshot at one moment — it cannot show change over time for the same individuals. Longitudinal data follows the same units over multiple time points and is specifically designed to study change. However, cross-sectional data is cheaper and faster to collect, so it has its own advantages for different questions.
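The structural difference can be sketched with two tiny record lists. The country labels, years, and values are invented for illustration, and `change_over_time` is a hypothetical helper, not a standard function.

```python
# Hypothetical records, invented for illustration: (unit, year, value)
cross_sectional = [("A", 2025, 71.2), ("B", 2025, 68.9), ("C", 2025, 80.1)]
longitudinal = [
    ("A", 2015, 69.0), ("A", 2020, 70.1), ("A", 2025, 71.2),
    ("B", 2015, 66.5), ("B", 2020, 67.8), ("B", 2025, 68.9),
]


def change_over_time(records, unit):
    """Last value minus first value for one unit, or None with a single time point."""
    ordered = sorted(records, key=lambda record: record[1])
    values = [value for (name, _year, value) in ordered if name == unit]
    return values[-1] - values[0] if len(values) > 1 else None


print(change_over_time(cross_sectional, "A"))  # no change can be computed
print(change_over_time(longitudinal, "A"))
```

With one row per unit there is nothing to subtract, which is why a snapshot cannot show change for the same individuals.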

14. "A parameter is a number calculated from a sample that estimates a population quantity."

Answer

**False.** *Explanation:* The statement describes a **statistic**, not a parameter. A parameter describes a population (usually unknown). A statistic describes a sample (known, calculated from data) and is used to estimate the parameter. The description has the terms reversed.

Section 3: Short Answer (2 points each)

15. A dataset contains the variable "Temperature in Fahrenheit" with values like 32, 50, 72, and 98.6. Is this variable interval or ratio? Explain your reasoning, and explain why the distinction matters for this specific variable.

Sample Answer

Temperature in Fahrenheit is an **interval** variable, not ratio. The key reason: 0°F does not mean "no temperature"; it's an arbitrary point on the Fahrenheit scale. As a result, you cannot say "80°F is twice as hot as 40°F" (this would imply 0°F means the absence of heat, which it doesn't).

The distinction matters because:

- You CAN say "the difference between 80°F and 60°F equals the difference between 40°F and 20°F" (both 20°F gaps): interval-level reasoning is valid.
- You CANNOT say "80°F is twice as warm as 40°F": ratio-level reasoning is invalid for Fahrenheit.

Note: Temperature in Kelvin IS ratio-level because 0 Kelvin is absolute zero (truly no thermal energy).

*Rubric (full credit requires):*

- Correct identification as interval
- Explanation involving the lack of a true zero
- Mention that ratios are not meaningful
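The arithmetic behind this answer can be checked directly. The sketch below uses the standard Fahrenheit-to-Kelvin conversion; the function name is just an illustrative choice.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert degrees Fahrenheit to kelvins."""
    return (deg_f - 32) * 5 / 9 + 273.15


# Ratio computed on the Fahrenheit numbers versus on a scale with a true zero.
naive_ratio = 80 / 40
kelvin_ratio = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)

# Differences survive the change of scale: both 20°F gaps are equal in kelvins too.
gap_high = fahrenheit_to_kelvin(80) - fahrenheit_to_kelvin(60)
gap_low = fahrenheit_to_kelvin(40) - fahrenheit_to_kelvin(20)

print(f"naive ratio: {naive_ratio:.2f}, Kelvin ratio: {kelvin_ratio:.2f}")
print(f"gaps in kelvins: {gap_high:.2f} and {gap_low:.2f}")
```

On the Kelvin scale, 80°F is only about 8% warmer than 40°F, not 100% warmer, while the two 20°F gaps remain equal to each other. Equal differences are preserved; ratios are not.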

16. Dr. Maya Chen studies 3,000 flu patients across a county and finds that their average age is 68.4 years. Meanwhile, the true average age of ALL flu patients in the county (including those she didn't track) is some unknown number.

a) Which number is the statistic? Which is the parameter?
b) Why would we expect the statistic to be close to, but not exactly equal to, the parameter?

Sample Answer

a) The **statistic** is 68.4 years: it's computed from the sample of 3,000 patients Dr. Chen studied. The **parameter** is the unknown average age of ALL flu patients in the county.

b) The statistic should be close to the parameter because the sample is drawn from the population, so it captures the general pattern. But it won't be exact because the sample is only a subset: it doesn't include every flu patient, and random variation means each sample would give a slightly different average. This variation from sample to sample is called sampling variability (formally introduced in [Chapter 11](../../part-04-bridge-to-inference/chapter-11-sampling-distributions-and-clt/index.md)).

*Rubric (full credit requires):*

- Correct identification of both terms
- Mention of sampling variability or the idea that samples don't perfectly represent populations
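Sampling variability can be illustrated by drawing several samples from the same simulated population. The ages below are randomly generated under arbitrary assumptions (a population of 30,000 with ages from 1 to 95); they are not Dr. Chen's data.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of flu-patient ages; the real county data is unknown.
population_ages = [random.randint(1, 95) for _ in range(30_000)]
parameter = mean(population_ages)

# Draw five independent samples of 3,000 patients and compare their means.
sample_means = [mean(random.sample(population_ages, 3_000)) for _ in range(5)]
spread = max(sample_means) - min(sample_means)

print(f"parameter: {parameter:.2f}")
print("sample means:", [round(m, 2) for m in sample_means])
print(f"spread across samples: {spread:.2f}")
```

Each sample mean lands near the parameter, but no two samples agree exactly. That sample-to-sample spread is what part (b) describes.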

17. Alex Rivera's boss asks him to classify the variable "Number of Minutes Watched" as discrete or continuous. Alex says continuous. His colleague says discrete because the system only records whole minutes. Who is correct, and why?

Sample Answer

Both have a reasonable point, but **Alex is more correct** in the statistical sense. Time is inherently continuous: a user could watch for 47.3 minutes or 47.382 minutes. The fact that the system rounds to whole minutes is a measurement limitation, not a property of the variable itself.

In practice, this distinction often doesn't matter much. Many continuous variables are recorded in discrete units (age in whole years, time in whole minutes, weight in whole pounds). Statisticians typically treat these as continuous unless the number of possible values is very small.

The key principle: the variable's *nature* (continuous, because time flows continuously) may differ from its *recorded precision* (discrete whole minutes). The correct classification depends on what the variable represents conceptually, not just how it's stored.

*Rubric (full credit requires):*

- Recognition that the variable is conceptually continuous
- Acknowledgment that recording precision can make it appear discrete
- Reasonable conclusion about classification
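A tiny sketch of the gap between a variable's nature and its recorded precision. The watch times are invented, and rounding to the nearest minute is an assumption about how such a system might record them.

```python
# Hypothetical true watch times in minutes (continuous).
true_minutes = [47.3, 47.382, 12.9]

# What a system that stores whole minutes would record.
recorded_minutes = [round(t) for t in true_minutes]

print(recorded_minutes)
```

Two different true durations collapse to the same recorded value, which shows that the whole numbers come from the recording step rather than from time itself.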

Section 4: Applied Scenario (3 points)

18. You are given the following dataset about coffee shops in a city:

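| Shop Name | Neighborhood | Type | Avg Drink Price | Yelp Rating | Seats | WiFi | Year Opened |
|-----------|--------------|------|-----------------|-------------|-------|------|-------------|
| Bean There | Downtown | Chain | $5.25 | 3.5 | 45 | Yes | 2019 |
| Brew Lab | Midtown | Independent | $6.50 | 4.8 | 20 | Yes | 2022 |
| Daily Grind | Suburbs | Chain | $4.75 | 3.2 | 60 | No | 2015 |
| Pour Over | Arts District | Independent | $7.00 | 4.6 | 15 | Yes | 2023 |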

a) Identify the observational unit. (0.5 points)
b) Classify each variable as nominal, ordinal, discrete, or continuous. (1.5 points)
c) Is this cross-sectional or longitudinal data? (0.5 points)
d) Write one question that can be answered with descriptive statistics from this dataset, and one question that would require inferential statistics. (0.5 points)

Sample Answer

**a)** The observational unit is an individual coffee shop.

**b)**

| Variable | Classification | Reasoning |
|----------|---------------|-----------|
| Shop Name | Nominal | Labels: no order or arithmetic |
| Neighborhood | Nominal | Location categories: no inherent ranking |
| Type | Nominal | Two unordered categories (Chain, Independent) |
| Avg Drink Price | Continuous (ratio) | Measured in dollars; 0 means free; ratios meaningful |
| Yelp Rating | Ordinal\* | Ordered scale; distances may not be perfectly equal |
| Seats | Discrete (ratio) | Counted whole numbers; 0 seats = none |
| WiFi | Nominal | Two unordered categories (Yes, No) |
| Year Opened | Discrete (interval) | Year 0 isn't meaningful; you can't say "2020 is twice as recent as 2010" |

\*Note: Yelp Rating is a common gray area. Some analysts treat it as continuous (ratio) since it's a computed average of individual ratings. Treating it as ordinal is the more conservative choice.

**c)** Cross-sectional: this is a snapshot of coffee shops at one point in time.

**d)**

- Descriptive: "What is the average drink price of the four coffee shops in this dataset?" (Just summarizing the data you have.)
- Inferential: "Based on this sample, are independent coffee shops in this city more expensive than chain shops, on average?" (Generalizing from 4 shops to all shops in the city.)

*Rubric:*

| Criterion | Points |
|-----------|--------|
| Correct observational unit | 0.5 |
| At least 6 of 8 variables correctly classified with reasoning | 1.5 |
| Correct cross-sectional/longitudinal identification | 0.5 |
| Valid descriptive AND inferential questions | 0.5 |
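For readers working in pandas, the classification above could be recorded in code roughly as follows. The snake_case column names are an assumption made for this sketch; the values are the four rows from the question.

```python
import pandas as pd

shops = pd.DataFrame({
    "shop_name": ["Bean There", "Brew Lab", "Daily Grind", "Pour Over"],
    "neighborhood": ["Downtown", "Midtown", "Suburbs", "Arts District"],
    "type": ["Chain", "Independent", "Chain", "Independent"],
    "avg_drink_price": [5.25, 6.50, 4.75, 7.00],
    "yelp_rating": [3.5, 4.8, 3.2, 4.6],
    "seats": [45, 20, 60, 15],
    "wifi": ["Yes", "Yes", "No", "Yes"],
    "year_opened": [2019, 2022, 2015, 2023],
})

# Mark nominal columns explicitly so they are not mistaken for quantities.
for column in ["neighborhood", "type", "wifi"]:
    shops[column] = shops[column].astype("category")

# Descriptive summaries that match each variable's type.
mean_price = shops["avg_drink_price"].mean()          # numerical: a mean is meaningful
type_counts = shops["type"].value_counts().to_dict()  # nominal: counts, not averages

print(f"mean drink price: ${mean_price:.3f}")
print(type_counts)
```

The mean price answers the descriptive question in part (d). Note that `year_opened` still loads as an integer column, so the interval-level caveat from the table has to be remembered by the analyst.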

Section 5: Spaced Review from Chapter 1 (1 point each)

19. Without looking back, explain the difference between descriptive and inferential statistics. Then: which type of statistics does the variable classification system (nominal, ordinal, discrete, continuous) support? Explain.

Sample Answer

**Descriptive statistics** summarizes data you already have. **Inferential statistics** draws conclusions about a larger population based on sample data. The variable classification system supports BOTH types. For descriptive statistics, knowing the variable type tells you which summaries are appropriate (e.g., you can calculate a mean for numerical data but only a mode for nominal data). For inferential statistics, the variable type determines which test to use (e.g., t-test for numerical outcomes, chi-square for categorical outcomes — topics in Chapters 15 and 19). Credit given for demonstrating understanding of both descriptive/inferential AND how classification connects to each.
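The link between variable type and appropriate summary can be sketched with Python's standard `statistics` module. Both lists are invented for illustration.

```python
from statistics import mean, mode

# Hypothetical data, invented for illustration.
eye_colors = ["brown", "blue", "brown", "hazel", "green", "brown"]  # nominal
siblings = [0, 2, 1, 3, 1, 2]                                       # discrete numerical

most_common_color = mode(eye_colors)  # the mode works for labels
average_siblings = mean(siblings)     # the mean needs a quantity

print(most_common_color, average_siblings)
```

Calling `mean(eye_colors)` would raise an error, which is the code-level version of "arithmetic is meaningless for nominal data."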

20. Name the four pillars of a statistical investigation (from Chapter 1). Which pillar is most directly impacted by the skills you learned in Chapter 2?

Sample Answer

The four pillars: (1) Ask a good question, (2) Collect or find data, (3) Analyze the data, (4) Interpret and communicate results. Chapter 2's skills most directly impact **Pillar 3 (Analyze)** because choosing the correct analysis method depends on correctly classifying your variables. But they also affect **Pillar 2 (Collect)** because designing a data collection plan requires deciding what types of variables to measure and how to record them, and **Pillar 4 (Interpret)** because understanding variable types helps you communicate results correctly (e.g., not reporting "average zip code").

Scoring & Next Steps

| Score | Assessment | Recommended Action |
|-------|------------|--------------------|
| < 50% | Needs review | Re-read sections 2.1-2.3, redo Part A exercises |
| 50-69% | Partial | Focus on the nominal/ordinal/discrete/continuous distinctions; redo B.1 and C.2 |
| 70-85% | Solid | Ready to proceed; revisit any missed classifications |
| > 85% | Strong | Proceed; consider the case studies for additional challenge |