Quiz: Exploring Data — Graphs and Descriptive Statistics

Test your understanding before moving on. Target: 70% or higher to proceed confidently.

Section 1: Multiple Choice (1 point each)

1. Which graph type is most appropriate for displaying the distribution of a single numerical variable?

A) Bar chart
B) Pie chart
C) Histogram
D) Scatterplot

Answer

**C)** Histogram. *Why C:* Histograms are designed specifically for numerical variables — they divide the data into equal-width bins along a continuous number line and display the count per bin as touching bars. *Why not A:* Bar charts are for categorical variables, where the x-axis displays category names rather than a numerical scale. *Why not B:* Pie charts show proportions of categories — they don't display the distribution shape of a numerical variable. *Why not D:* Scatterplots display the relationship between *two* numerical variables, not the distribution of one. *Reference:* Section 5.4

2. What is the key visual difference between a bar chart and a histogram?

A) Bar charts use colors; histograms are always gray
B) Bar charts have gaps between bars; histogram bars touch
C) Bar charts are always horizontal; histograms are always vertical
D) Histograms can only display percentages; bar charts display counts

Answer

**B)** Bar charts have gaps between bars; histogram bars touch. *Why B:* The gaps in a bar chart signal that the categories are distinct and separate. The touching bars in a histogram signal that the data is continuous — moving from one bin to the next moves along a number line with no gaps. *Why not A:* Both chart types can use any color scheme. *Why not C:* Both can be displayed horizontally or vertically. *Why not D:* Both can display either counts (frequency) or proportions (relative frequency). *Reference:* Section 5.4

3. A distribution is described as "skewed right." This means:

A) The peak of the distribution is on the right side
B) The longer tail of the distribution extends to the right
C) Most of the data is on the right side of the histogram
D) The distribution leans to the right like the Leaning Tower of Pisa

Answer

**B)** The longer tail of the distribution extends to the right. *Why B:* The skew is named for the direction of the *tail*, not the peak. In a right-skewed distribution, the bulk of the data is on the left (lower values) and a long tail stretches to the right (higher values). Income distributions are a classic example. *Why not A:* The peak is actually on the *left* side in a right-skewed distribution. *Why not C:* Most of the data is on the left (lower values) — the right side contains the tail with fewer, more extreme values. *Why not D:* Skewness describes the asymmetry of a distribution's tails, not a lean. *Reference:* Section 5.7

4. Dr. Maya Chen's histogram of flu patient ages showed two distinct peaks — one for young children and one for older adults. This distribution is best described as:

A) Skewed right
B) Uniform
C) Unimodal and symmetric
D) Bimodal

Answer

**D)** Bimodal. *Why D:* A bimodal distribution has two distinct peaks (modes). Maya's flu data had one peak at ages 0-9 and another at ages 60-69, indicating two separate groups were most affected by the flu. *Why not A:* Skewed right describes a distribution with one peak and a long tail to the right — not two separate peaks. *Why not B:* A uniform distribution has all bars at roughly the same height — no peaks at all. *Why not C:* Unimodal means one peak. Maya's data clearly had two. *Reference:* Sections 5.1, 5.7

5. What is a relative frequency?

A) The count of observations in a category
B) The proportion of observations in a category (count divided by total)
C) The average value within a category
D) The range of values within a bin

Answer

**B)** The proportion of observations in a category (count divided by total). *Why B:* Relative frequency = frequency / total number of observations. It tells you what fraction (or percentage) of the data falls in each category or bin. Relative frequencies always sum to 1 (or 100%). *Why not A:* That's frequency (the raw count), not relative frequency. *Why not C:* The average value is a measure of center, not a frequency. *Why not D:* The range within a bin is a measure of spread, not frequency. *Reference:* Section 5.4
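The formula in this answer can be sketched in a few lines of Python. The survey responses below are invented for illustration, and `relative_frequencies` is a hypothetical helper name, not something defined in the chapter.

```python
import math
from collections import Counter

def relative_frequencies(observations):
    """Return {category: count / total} for a list of categorical observations."""
    counts = Counter(observations)   # raw frequency of each category
    total = len(observations)        # total number of observations
    return {category: n / total for category, n in counts.items()}

# Invented responses (assumption: not data from the chapter)
sports = ["Soccer", "Basketball", "Soccer", "Tennis", "Soccer",
          "Baseball", "Basketball", "Other", "Soccer", "Basketball"]

rel = relative_frequencies(sports)
for category, proportion in rel.items():
    print(f"{category:<10} {proportion:.0%}")

# Relative frequencies should sum to 1 (up to floating-point rounding)
print(math.isclose(sum(rel.values()), 1.0))
```

Four of the ten invented responses are "Soccer", so its relative frequency is 4 / 10 = 0.4, or 40%.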

6. Why must the vertical axis of a bar chart start at zero?

A) Because negative values are impossible in data
B) So that the visual ratios between bar heights accurately represent the ratios between frequencies
C) Because Excel requires it
D) So that all bars are visible on the screen

Answer

**B)** So that the visual ratios between bar heights accurately represent the ratios between frequencies. *Why B:* If the axis starts at a value other than zero, the visual comparison is distorted. A bar at 52 looks twice as tall as a bar at 51 if the axis starts at 50, even though the actual difference is tiny (less than 2%). Starting at zero ensures that a bar drawn twice as tall represents a frequency exactly twice as large. *Why not A:* Some variables can have negative values (temperature, profit/loss), but that's not why we start at zero. *Why not C:* Excel doesn't require it; in fact, Excel sometimes auto-adjusts axes away from zero, which can be misleading. *Why not D:* The bars would be visible either way; the issue is accurate visual comparison. *Reference:* Section 5.11
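The 52-versus-51 arithmetic can be checked with a short sketch. The function name `drawn_height_ratio` is a hypothetical helper introduced here; it assumes each bar is drawn from the axis start up to its value.

```python
def drawn_height_ratio(a, b, axis_start=0):
    """Ratio of the drawn heights of two bars when the axis begins at axis_start."""
    return (a - axis_start) / (b - axis_start)

# Axis truncated at 50: the bars are drawn with heights 2 and 1
truncated = drawn_height_ratio(52, 51, axis_start=50)

# Axis starting at zero: the drawn ratio equals the true ratio 52 / 51
honest = drawn_height_ratio(52, 51)

print(f"Axis starts at 50: first bar looks {truncated:.2f}x as tall")
print(f"Axis starts at 0:  first bar looks {honest:.2f}x as tall")
```

With the truncated axis the drawn ratio is 2, while the frequencies themselves differ by under 2%.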

7. Which of the following is NOT a common graphing mistake discussed in this chapter?

A) Using 3D effects on charts
B) Starting the vertical axis above zero
C) Using too many colors
D) Using a pie chart with 15 categories

Answer

**C)** Using too many colors. *Why C:* While using excessive colors can be visually distracting, the chapter focused on six specific common mistakes: truncated axes, 3D charts, misleading bin widths, wrong graph type for the variable, pie charts with too many categories, and missing labels/titles. "Too many colors" was not among the specific mistakes discussed. *Why not A:* 3D charts create perspective distortion that makes accurate comparisons impossible — this was explicitly discussed. *Why not B:* Truncated axes (starting above zero) was discussed as the most common way graphs mislead. *Why not D:* Pie charts with too many categories make it impossible to compare similar-sized slices. *Reference:* Section 5.11

8. Sam creates a histogram of shooting percentages and notices one bar sitting far away from the rest of the distribution, separated by a gap. This isolated bar likely represents:

A) A bin with the highest frequency
B) An outlier or group of outliers
C) The mean of the distribution
D) A measurement error that must be removed

Answer

**B)** An outlier or group of outliers. *Why B:* An outlier is an observation that falls far from the rest of the data. In a histogram, outliers appear as isolated bars separated from the main body of the distribution by a gap. *Why not A:* The highest frequency bar would be the tallest bar — not necessarily isolated or separated from the others. *Why not C:* The mean is a calculated value, not a specific bar in the histogram. *Why not D:* Outliers are not automatically errors. They could be genuine extreme values, measurement anomalies, data entry errors, or even the most interesting part of the data. The correct response is to investigate, not automatically delete. *Reference:* Section 5.7

9. In a stem-and-leaf plot, the row 3 | 2 5 7 8 represents which values?

A) 3.2, 3.5, 3.7, 3.8
B) 32, 35, 37, 38
C) 3, 2, 5, 7, 8
D) 3.25, 7.8

Answer

**B)** 32, 35, 37, 38. *Why B:* In a standard stem-and-leaf plot, the stem represents the leading digit(s) and each leaf represents the trailing digit. Stem 3 with leaves 2, 5, 7, 8 means four values: 32, 35, 37, and 38. *Why not A:* These would be correct only if the plot's key said so (for example, "3 | 2 means 3.2"); with no key given, the standard reading treats the stem as the tens digit and the leaf as the ones digit. *Why not C:* The stem and leaves combine to form complete values; they're not separate numbers. *Why not D:* This misinterprets how stems and leaves combine. *Reference:* Section 5.5
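As a rough sketch of how stems and leaves combine, the snippet below builds a plot for two-digit whole numbers, with the tens digit as the stem and the ones digit as the leaf. The values other than 32, 35, 37, and 38 are made up, and both function names are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem), collecting ones digits (leaves)."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 37 -> stem 3, leaf 7
        rows[stem].append(leaf)
    return dict(rows)

def format_rows(rows):
    """Render each stem and its leaves in the 'stem | leaves' layout."""
    return [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(rows.items())]

# 32, 35, 37, 38 come from the question; the rest are invented for illustration
data = [32, 35, 37, 38, 41, 46, 29, 24]

rows = stem_and_leaf(data)
plot_lines = format_rows(rows)
print("\n".join(plot_lines))
```

This simplified version lists only stems that occur in the data; a complete plot would normally also show empty stems so that gaps stay visible.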

10. When would you use a stem-and-leaf plot instead of a histogram?

A) When your dataset has more than 1,000 observations
B) When your data is categorical
C) When your dataset is small (roughly 10-50 values) and you want to preserve exact values
D) When you want to display relative frequencies

Answer

**C)** When your dataset is small (roughly 10-50 values) and you want to preserve exact values. *Why C:* Stem-and-leaf plots show both the shape of the distribution AND the exact data values — something histograms don't do. But they become unwieldy for large datasets, so they work best with 10-50 observations. *Why not A:* With 1,000+ observations, a stem-and-leaf plot would be impossibly long. Use a histogram instead. *Why not B:* Stem-and-leaf plots are for numerical data, just like histograms. *Why not D:* Stem-and-leaf plots display counts and individual values, not relative frequencies specifically. *Reference:* Section 5.5

Section 2: True or False (1 point each)

11. True or False: The bars in a histogram can be rearranged without changing the meaning of the graph.

Answer

**False.** The bars in a histogram represent bins along a continuous number line. Rearranging them would destroy the shape of the distribution, which is the entire point of the graph. (Bar chart bars *can* be rearranged for nominal data because categories have no inherent order — but histogram bars must stay in numerical order.) *Reference:* Section 5.4

12. True or False: A pie chart is appropriate for displaying the age distribution of employees at a company.

Answer

**False.** Age is a numerical variable, so it should be displayed with a histogram (or stem-and-leaf plot for small datasets), not a pie chart. Pie charts are for categorical variables where you want to show parts of a whole. You *could* convert ages into categories (e.g., "18-29," "30-39," etc.) and then use a pie chart, but a histogram would be more informative because it preserves the continuous scale and shows the distribution shape directly. *Reference:* Sections 5.3, 5.9

13. True or False: If a distribution has a mean of 100 and a median of 100, the distribution must be symmetric.

Answer

**False.** Equal mean and median *suggests* symmetry but doesn't guarantee it. An asymmetric distribution can still have equal mean and median when the deviations on the two sides of the center balance in total without mirroring each other. However, a large difference between mean and median is a strong indicator of skewness. A symmetric distribution will have mean ≈ median, but the converse isn't always true. *Reference:* Section 5.7 (shape discussion)
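A small counterexample makes this concrete. The five values below are invented for illustration; the symmetry check simply compares the distances below the median with the distances above it.

```python
from statistics import mean, median

# Invented dataset (assumption: not from the chapter)
data = [0, 4, 5, 5, 11]

center = median(data)

# Distances from the center on each side; a symmetric dataset has matching lists
left = sorted(center - x for x in data if x < center)
right = sorted(x - center for x in data if x > center)
is_symmetric = left == right

print("mean:", mean(data), "median:", center)
print("distances below center:", left)
print("distances above center:", right)
print("symmetric:", is_symmetric)
```

The mean and median agree, but the distances below the center (1 and 5) do not mirror the single distance above it (6), so the data are not symmetric.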

14. True or False: When choosing the number of bins for a histogram, using more bins always gives a better picture of the distribution.

Answer

**False.** Using too many bins creates a jagged, noisy histogram where random variation in the data obscures the true underlying pattern. Using too few bins over-smooths the data and hides important features. The goal is a "Goldilocks" number that reveals the shape without creating artificial noise. Rules of thumb suggest 5-7 bins for small datasets and 8-25 for larger ones. *Reference:* Section 5.4
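The effect of the bin count can be seen by binning one small dataset three ways. The scores are invented, and `equal_width_counts` is a hypothetical helper that splits the range from minimum to maximum into equal-width bins (the maximum value is placed in the last bin).

```python
def equal_width_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value falls in the last bin rather than past it
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Invented exam scores (assumption: not data from the chapter)
scores = [52, 55, 61, 63, 64, 66, 68, 70, 71, 72,
          73, 74, 75, 77, 78, 80, 83, 85, 90, 92]

for n_bins in (2, 5, 20):
    counts = equal_width_counts(scores, n_bins)
    print(f"{n_bins:>2} bins: {counts}")
```

Two bins flatten the data into two nearly equal blocks, five bins show a single central peak, and twenty bins leave several empty bins and mostly counts of 1 or 2, which is the jagged pattern the answer describes.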

15. True or False: A distribution described as "skewed left" has its longer tail pointing toward smaller values.

Answer

**True.** In a left-skewed (negatively skewed) distribution, the longer tail extends to the left (toward smaller values). The bulk of the data is on the right (higher values). An example is exam scores on an easy test — most students score high, with a few low scores pulling the tail to the left. *Reference:* Section 5.7

Section 3: Short Answer (2 points each)

16. Explain the concept of "distribution thinking." Why is it a threshold concept in statistics? Use a specific example from this chapter to illustrate your answer.

Answer

**Distribution thinking** is the habit of seeing data as an entire distribution — with a shape, center, spread, and unusual features — rather than as individual numbers or single summary values. It's a threshold concept because it fundamentally changes how you interpret data, and once you develop it, you can't go back to thinking in terms of single numbers alone. **Example:** Maya's flu data had a mean age of 38, which sounds like flu patients are typically middle-aged. But distribution thinking prompts you to ask: "What does the whole distribution look like?" The histogram revealed a bimodal distribution with peaks at ages 0-9 and 60-69. The "average" patient of age 38 barely exists. Without distribution thinking, you'd miss the real story — that flu targets two distinct age groups, not one. *Key scoring elements:* Definition of distribution thinking (1 pt), specific example showing why a single number is insufficient (1 pt). *Reference:* Section 5.8
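The point about a misleading average can be illustrated with a toy bimodal dataset. The ages below are invented for this sketch and are not Maya's actual data, so the mean differs from the 38 quoted above.

```python
from statistics import mean

# Invented ages forming two clusters (assumption: NOT the chapter's flu data)
ages = [2, 4, 5, 7, 8, 9, 61, 63, 64, 66, 68, 69]

average_age = mean(ages)

# How many patients are actually within 10 years of the "average" patient?
near_average = [a for a in ages if abs(a - average_age) <= 10]

print("mean age:", average_age)
print("patients within 10 years of the mean:", len(near_average))
```

In this toy example the mean lands in the empty gap between the two clusters, so no patient is close to the "average" age.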

17. You're given a dataset with one categorical variable (favorite sport: Soccer, Basketball, Baseball, Tennis, Other) and one numerical variable (hours of exercise per week). Describe the specific graphs you would create and what each graph would tell you.

Answer

**Graph 1: Bar chart of favorite sport.** This would show the frequency (count) of respondents for each sport category. It would tell you which sport is most popular, the relative popularity of each sport, and whether one sport dominates or if preferences are spread evenly. **Graph 2: Histogram of hours of exercise per week.** This would show the distribution shape of exercise hours — whether it's symmetric or skewed (likely skewed right, since most people exercise moderate amounts while a few exercise heavily), unimodal or bimodal, where the center is, and whether there are outliers (very high exercisers). **Graph 3 (bonus): Side-by-side histograms or overlaid histograms of exercise hours by sport.** This would compare the distribution of exercise hours across sports, potentially revealing that basketball fans exercise more than tennis fans, or that soccer fans have a different distribution shape. *Key scoring elements:* Correct graph types for each variable type (1 pt), meaningful description of what each graph reveals (1 pt). *Reference:* Sections 5.2, 5.4, 5.9

18. A histogram of household income in a country shows a right-skewed distribution. Explain what this means in plain English — as if you were explaining it to someone who has never taken a statistics course. Why does income tend to be skewed right rather than symmetric?

Answer

**Plain English explanation:** If you lined up all the households from lowest to highest income and made a picture of how many households earn each amount, you'd see a big pile-up on the left side (lots of households earning low to moderate incomes) with a long stretched-out tail to the right (a few households earning very high incomes). Most households cluster in a similar range, but a small number of very wealthy households stretch the picture far to the right. **Why income is skewed right:** There's a natural floor on income (you can't earn less than $0 in most measurements), but there's effectively no ceiling: a few people can earn millions or billions. This creates an asymmetry: the distance from the median to the minimum is limited, but the distance from the median to the maximum can be enormous. Any variable with a hard lower bound and no upper bound tends to be skewed right. *Key scoring elements:* Clear non-technical explanation of right skew (1 pt), reasonable explanation of why income has this shape (1 pt). *Reference:* Section 5.7

19. Compare and contrast frequency distributions and relative frequency distributions. When would you prefer to use relative frequencies instead of raw frequencies?

Answer

A **frequency distribution** displays the raw count of observations in each bin or category. A **relative frequency distribution** displays the proportion (frequency / total) in each bin or category. Both organize data the same way; the only difference is whether you report counts or proportions.

**When to use relative frequencies:**

1. **Comparing datasets of different sizes.** If Community A has 200 flu patients and Community B has 800, raw frequency histograms would make Community B's bars much taller in every bin, making it hard to compare distribution *shapes*. Relative frequencies put both on the same 0-to-1 (or 0%-to-100%) scale, allowing direct shape comparison.
2. **Interpreting the data as probabilities.** A relative frequency can be interpreted as the probability of a randomly chosen observation falling in that bin.
3. **Communication.** "35% of patients were children" is often more meaningful than "70 out of 200 patients were children."

*Key scoring elements:* Clear distinction between frequency and relative frequency (1 pt), valid reason for preferring relative frequency with example (1 pt). *Reference:* Section 5.4
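The comparison of differently sized communities can be sketched as follows. The totals of 200 and 800 and the 70 children come from the answer above; the remaining age-group counts are invented so that both communities have the same shape, and `to_relative` is a hypothetical helper name.

```python
def to_relative(counts):
    """Convert {bin: count} into {bin: count / total}."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Age-group counts; only the totals and the 70 children are from the text,
# the rest are assumptions chosen for illustration
community_a = {"0-17": 70, "18-64": 90, "65+": 40}      # 200 patients
community_b = {"0-17": 280, "18-64": 360, "65+": 160}   # 800 patients

rel_a = to_relative(community_a)
rel_b = to_relative(community_b)

for label in community_a:
    print(f"{label:>6}: A count={community_a[label]:>3} ({rel_a[label]:.0%})   "
          f"B count={community_b[label]:>3} ({rel_b[label]:.0%})")
```

Community B's raw counts are four times larger in every bin, yet the relative frequencies match exactly, which shows the two distributions have the same shape.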

20. A classmate creates a pie chart with 10 slices, a 3D perspective, and no labels on the slices. Identify three specific problems with this visualization and explain how to fix each one.

Answer

**Problem 1: Too many slices.** With 10 categories, many slices will be thin and nearly impossible to compare visually. **Fix:** Use a bar chart instead of a pie chart, or group the smallest categories into an "Other" category to reduce to 5-6 slices maximum. **Problem 2: 3D perspective.** The 3D effect distorts the perceived size of slices: those angled toward the viewer appear larger than those angled away, even when they represent the same proportion. **Fix:** Use a flat, 2D pie chart. Never use 3D for serious data visualization. **Problem 3: Missing labels.** Without labels, the reader has no way to identify which slice represents which category or what percentage each slice contains. **Fix:** Add labels showing the category name and the percentage (or count) for each slice. In Python, pass the category names to the `labels` parameter of `plt.pie()` and a format string such as `'%1.1f%%'` to its `autopct` parameter to print the percentages. *Key scoring elements:* Three distinct problems identified (1.5 pt), three practical fixes proposed (0.5 pt). *Reference:* Sections 5.3, 5.11
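The "group the smallest categories into Other" fix from Problem 1 might look like the sketch below. The ten category counts are invented, and `group_small_categories` is a hypothetical helper, not a function from matplotlib or the chapter.

```python
def group_small_categories(counts, keep=5, other_label="Other"):
    """Keep the `keep` largest categories and merge the rest into one slice."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    grouped = dict(ranked[:keep])                     # largest categories, unchanged
    remainder = sum(n for _, n in ranked[keep:])      # everything else combined
    if remainder:
        grouped[other_label] = grouped.get(other_label, 0) + remainder
    return grouped

# Invented counts for a 10-slice pie chart (assumption: not from the chapter)
counts = {"A": 30, "B": 25, "C": 15, "D": 10, "E": 6,
          "F": 5, "G": 4, "H": 2, "I": 2, "J": 1}

grouped = group_small_categories(counts, keep=5)
print(grouped)
```

The grouped dictionary has six slices instead of ten and keeps the same total. Its keys and values could then be passed to `plt.pie()` as `labels` and slice sizes, with `autopct` supplying the percentages.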