Exercises: Exploring Data — Graphs and Descriptive Statistics

These exercises progress from concept checks to hands-on graph creation and critical evaluation. Estimated completion time: 2.5 hours.

Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)

Part A: Conceptual Understanding ⭐

A.1. In your own words, explain the difference between a bar chart and a histogram. Give one specific visual feature that distinguishes them.

A.2. What does it mean for a distribution to be "skewed right"? Draw a rough sketch (or describe in words) what a skewed-right histogram looks like. Give a real-world example of a variable you'd expect to be skewed right.

A.3. A dataset has a mean of 50 and a standard deviation of 10. Can you determine whether the distribution is symmetric, skewed, unimodal, or bimodal from these two numbers alone? Why or why not?

A.4. What is the difference between frequency and relative frequency? Why are relative frequencies more useful when comparing datasets of different sizes?
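To make the two terms in A.4 concrete, here is a minimal sketch that tallies a small list of survey responses and converts the counts to proportions. The data and the variable names (`responses`, `frequencies`, `relative_frequencies`) are invented for illustration and do not come from the chapter.

```python
from collections import Counter

# Hypothetical responses, invented only for illustration.
responses = ["yes", "no", "yes", "yes", "unsure", "no", "yes", "no"]

# Frequency: the raw count for each category.
frequencies = Counter(responses)

# Relative frequency: each count divided by the total number of observations.
total = sum(frequencies.values())
relative_frequencies = {
    category: count / total for category, count in frequencies.items()
}

for category, count in frequencies.most_common():
    share = relative_frequencies[category]
    print(f"{category:>6}: frequency = {count}, relative frequency = {share:.3f}")
```

Relative frequencies always sum to 1 regardless of sample size, which is worth keeping in mind when you answer the second half of the question.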

A.5. An instructor gives an exam and the histogram of scores shows two distinct peaks — one around 55% and one around 85%. What term describes this distribution shape? What might explain the two peaks?

A.6. When you find an outlier in your data, why shouldn't you automatically delete it? Describe three different reasons an outlier might appear and the appropriate response for each.

A.7. Why do statisticians generally prefer bar charts to pie charts? Give one situation where a pie chart is acceptable and one where it clearly isn't.

Part B: Identifying and Creating Graphs ⭐⭐

For each variable below, (a) classify it as categorical or numerical, and (b) name the most appropriate graph type.

B.1. The political party affiliation (Democrat, Republican, Independent, Other) of 500 registered voters.

B.2. The weight (in pounds) of 200 newborn babies at a hospital.

B.3. The star rating (1 to 5 stars) given by 1,000 customers on a product review page.

B.4. The time (in seconds) it takes 150 runners to complete a 100-meter dash.

B.5. The genre of the top 100 most-streamed songs on a music platform (Pop, Hip-Hop, Country, R&B, Rock, Electronic, Other).

B.6. The annual household income (in dollars) of 2,000 residents in a city.

B.7. A professor records the number of absences for each of her 35 students during a semester (values range from 0 to 15).

B.8. The blood type (A, B, AB, O) of 300 patients at a clinic.

B.9. The commute time (in minutes) for 500 employees at a large company.

B.10. The highest degree earned (High School, Associate's, Bachelor's, Master's, Doctorate) by members of a professional organization.

Part C: Interpreting Graphs ⭐⭐

C.1. A histogram of apartment rental prices in a large city shows the following pattern: the tallest bars are in the $1,000-$1,500 range, with bars gradually getting shorter toward $3,000. Beyond $3,000, there are a few very short bars extending to $8,000, with one isolated bar at $12,000-$13,000.

(a) Describe the shape of this distribution (symmetric, skewed, unimodal, bimodal). (b) Would you expect the mean to be higher or lower than the median? Explain why. (c) Is the observation at $12,000-$13,000 an outlier? What might explain it?

C.2. A bar chart of majors declared by freshmen at a university shows the following counts (from tallest to shortest bar, with the catch-all Other category listed last): Business (320), Biology (280), Psychology (265), Engineering (250), Computer Science (245), English (110), History (85), Art (65), Philosophy (40), Other (140).

(a) What percentage of freshmen declared Business? (b) Would a pie chart be appropriate here? Why or why not? (c) An administrator looks at this chart and says, "Business is by far our most popular major — it's almost four times more popular than History." Is this claim supported by the data? What caution would you add?

C.3. A back-to-back stem-and-leaf plot compares test scores for two sections of the same statistics course:

    Section A | Stem | Section B
        8 5 2 |   5  | 3 7
      9 7 4 1 |   6  | 2 5 8 9
    8 7 5 3 0 |   7  | 1 4 6 7 8 9
      9 6 4 2 |   8  | 0 3 5 8
          8 5 |   9  | 2 5 7
              |  10  | 0

(a) How many students are in each section? (b) Describe the shape of each section's distribution. (c) Which section performed better overall? Support your answer with specific observations from the plot.

C.4. Alex creates two histograms of session watch time — one for Free-tier users and one for Premium-tier users. The Free-tier histogram is sharply skewed right with most sessions between 5 and 25 minutes and a few extending to 60 minutes. The Premium-tier histogram is less skewed, with sessions more evenly spread between 15 and 90 minutes, and a moderate tail extending to 150 minutes.

(a) Describe the shape of each distribution. (b) Would you expect the mean watch time to be higher for Free or Premium users? (c) Alex's boss asks: "What's the typical session length?" Why is this question harder to answer than it sounds?

Part D: Graph Critique ⭐⭐⭐

D.1. A news article about teacher salaries includes a bar chart comparing average salaries across five states. The vertical axis starts at $48,000 instead of $0. The highest bar (New York, $52,000) appears to be about eight times taller than the shortest bar (Mississippi, $48,500).

(a) What's misleading about this graph? (b) If the axis started at zero, how would the visual comparison change? (c) Is there ever a valid reason to start an axis above zero? If so, when?
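One way to explore D.1 numerically is to compute how tall each bar is drawn relative to the axis baseline. The sketch below uses only the two salaries stated in the exercise (the other three states are not given, so they are left out). The helper names `drawn_height_ratio` and `plot_both_versions` are hypothetical, and the plotting function assumes matplotlib is installed.

```python
# Salaries stated in D.1; the remaining three states are not given in the exercise.
salaries = {"New York": 52_000, "Mississippi": 48_500}


def drawn_height_ratio(values, baseline):
    """Ratio of the tallest to the shortest drawn bar when the axis starts at `baseline`."""
    heights = [value - baseline for value in values]
    return max(heights) / min(heights)


truncated = drawn_height_ratio(salaries.values(), baseline=48_000)
zero_based = drawn_height_ratio(salaries.values(), baseline=0)
print(f"Axis starting at $48,000: tallest bar is {truncated:.1f}x the shortest")
print(f"Axis starting at $0:      tallest bar is {zero_based:.2f}x the shortest")


def plot_both_versions():
    """Draw the truncated and zero-based charts side by side (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    versions = ((48_000, "Axis starts at $48,000"), (0, "Axis starts at $0"))
    for ax, (bottom, title) in zip(axes, versions):
        ax.bar(list(salaries), list(salaries.values()))
        ax.set_ylim(bottom, 53_000)
        ax.set_title(title)
        ax.set_ylabel("Average salary ($)")
    fig.tight_layout()
    plt.show()
```

Calling `plot_both_versions()` shows the same two numbers under both axis choices, which may help when you write up parts (a) and (b).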

D.2. A health blog publishes a 3D pie chart showing "Causes of Death in the U.S." with 12 categories. The slices near the front of the 3D perspective appear larger than those near the back.

(a) Identify at least two problems with this visualization. (b) Suggest a better graph type for this data. (c) If you had to use a pie chart, how would you improve this one?

D.3. A university's enrollment report shows two histograms side by side. The first shows enrollment in 2020 (1,500 students) and the second shows enrollment in 2024 (3,000 students). Both use frequency (raw counts) on the vertical axis with 10 bins each. The 2024 histogram's bars are all roughly twice as tall as the 2020 histogram's bars.

(a) Can you fairly compare the shapes of these two distributions? Why or why not? (b) What change would make the comparison fair? (c) If you switched both histograms to relative frequency, what would you expect to see?
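For D.3, it may help to see frequency and relative frequency computed side by side. The sketch below is an illustration under stated assumptions: the two samples are simulated stand-ins (drawn from the same made-up normal distribution) rather than real enrollment data, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two years: same distribution, different sizes.
cohort_2020 = rng.normal(loc=50, scale=10, size=1_500)
cohort_2024 = rng.normal(loc=50, scale=10, size=3_000)

# Shared bin edges so the two histograms are directly comparable (10 bins).
combined = np.concatenate([cohort_2020, cohort_2024])
edges = np.linspace(combined.min(), combined.max(), 11)

# Frequency: raw counts per bin.
counts_2020, _ = np.histogram(cohort_2020, bins=edges)
counts_2024, _ = np.histogram(cohort_2024, bins=edges)

# Relative frequency: counts divided by the sample size.
rel_2020 = counts_2020 / counts_2020.sum()
rel_2024 = counts_2024 / counts_2024.sum()

print("bin   count_2020  count_2024  rel_2020  rel_2024")
for i in range(len(edges) - 1):
    print(
        f"{i + 1:>3}   {counts_2020[i]:>10}  {counts_2024[i]:>10}"
        f"  {rel_2020[i]:>8.3f}  {rel_2024[i]:>8.3f}"
    )
```

Compare the two count columns with the two relative-frequency columns before answering part (c).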

D.4. A social media post claims "80% of Americans support Policy X" and shows a pie chart with a large green slice (80%) and a small red slice (20%). In the fine print, you read that the survey was conducted on the social media platform itself, with 1,200 respondents.

(a) Using concepts from Chapter 4, identify at least two problems with this data before you even evaluate the graph. (b) Even if the data were perfectly collected, what's wrong with using only two categories ("Support" vs. "Don't Support")? (c) How would you redesign both the survey and the visualization?

Part E: Python and Excel Practice ⭐⭐⭐

E.1. Using Python, create a histogram of the following dataset representing daily high temperatures (°F) for a city over 30 days:

72, 75, 68, 71, 74, 79, 82, 85, 88, 91,
73, 76, 78, 80, 83, 86, 89, 92, 71, 74,
77, 79, 81, 84, 87, 90, 93, 69, 73, 76

(a) Create the histogram with 8 bins. Include a title and axis labels. (b) Describe the shape of the distribution. (c) Change the number of bins to 4, then to 15. How does the shape appear to change? Which number of bins gives the clearest picture of the distribution?
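A possible starting point for E.1 is sketched below. It is one approach among several: it assumes NumPy and matplotlib are available, and the helper names `bin_counts` and `plot_histogram` are hypothetical. The bin counts are printed first so you can compare the three bin settings even without opening a plot window.

```python
import numpy as np

# Daily high temperatures (°F) from the exercise.
temps = np.array([
    72, 75, 68, 71, 74, 79, 82, 85, 88, 91,
    73, 76, 78, 80, 83, 86, 89, 92, 71, 74,
    77, 79, 81, 84, 87, 90, 93, 69, 73, 76,
])


def bin_counts(data, n_bins):
    """Return (counts, edges) for equal-width bins spanning the data range."""
    return np.histogram(data, bins=n_bins)


for n_bins in (4, 8, 15):
    counts, edges = bin_counts(temps, n_bins)
    print(f"{n_bins:>2} bins: {counts.tolist()}")


def plot_histogram(data, n_bins):
    """Draw a labeled histogram (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(data, bins=n_bins, edgecolor="black")
    ax.set_title(f"Daily high temperatures over 30 days ({n_bins} bins)")
    ax.set_xlabel("Daily high temperature (°F)")
    ax.set_ylabel("Number of days")
    plt.show()
```

Calling `plot_histogram(temps, 8)` draws the chart for part (a); changing the second argument to 4 or 15 covers part (c).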

E.2. Sam has the following data for 12 basketball players' free-throw percentages: 62, 71, 74, 76, 78, 79, 80, 82, 84, 85, 91, 95.

(a) Create a stem-and-leaf plot by hand. (b) Describe the shape of the distribution. (c) Using Python, create a histogram of this data and compare it to your stem-and-leaf plot. Which representation do you prefer for a dataset this small, and why?
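After drawing the stem-and-leaf plot by hand, you could check it with a short script. The sketch below is one way to do that, assuming stems are tens digits and leaves are ones digits; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Free-throw percentages from the exercise.
free_throw_pct = [62, 71, 74, 76, 78, 79, 80, 82, 84, 85, 91, 95]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (ones digits)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    # Include empty stems so any gaps in the data stay visible.
    return {stem: groups.get(stem, []) for stem in range(min(groups), max(groups) + 1)}


plot = stem_and_leaf(free_throw_pct)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because each leaf is an actual data value, the printed plot keeps every observation visible, a point to consider when answering part (c).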

E.3. Using Python (or Excel), create a bar chart from the following data about ice cream flavor preferences from a survey of 200 people:

Flavor	Count
Chocolate	68
Vanilla	52
Strawberry	31
Mint Chip	24
Cookie Dough	18
Other	7

(a) Create a bar chart sorted from most to least popular. Include proper labels and title. (b) Create a pie chart of the same data. (c) Which visualization is more effective for this data? Defend your answer. (d) Would a histogram be appropriate for this data? Why or why not?
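For the Python route in E.3, a sorted bar chart might be set up along these lines. This is a sketch under the assumption that matplotlib is installed; the names `flavor_counts` and `plot_bar_chart` are hypothetical. The pie chart for part (b) is left to you.

```python
# Survey counts from the exercise table.
flavor_counts = {
    "Chocolate": 68,
    "Vanilla": 52,
    "Strawberry": 31,
    "Mint Chip": 24,
    "Cookie Dough": 18,
    "Other": 7,
}

# Sort explicitly from most to least popular rather than relying on input order.
ordered = sorted(flavor_counts.items(), key=lambda item: item[1], reverse=True)
labels = [flavor for flavor, _ in ordered]
counts = [count for _, count in ordered]
total = sum(counts)

for flavor, count in ordered:
    print(f"{flavor:<12} {count:>3}  ({count / total:.1%})")


def plot_bar_chart():
    """Draw the sorted bar chart with labels and a title (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(labels, counts)
    ax.set_title(f"Ice cream flavor preferences (n = {total})")
    ax.set_xlabel("Flavor")
    ax.set_ylabel("Number of respondents")
    fig.tight_layout()
    plt.show()
```

Calling `plot_bar_chart()` displays the chart for part (a).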

E.4. Dr. Chen has the ages of flu patients recorded in three communities. Using Python, create overlaid histograms (with transparency) or side-by-side histograms that compare the distributions. Use the following simulated data:

import numpy as np
np.random.seed(42)
community_a = np.concatenate([np.random.normal(8, 3, 70),
                               np.random.normal(68, 8, 60)])
community_b = np.random.normal(35, 12, 130)
community_c = np.random.exponential(20, 100) + 5

(a) Create overlaid histograms for all three communities. (b) Describe the shape of each community's distribution. (c) For which community would the mean age be most misleading? Explain why.
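One possible setup for the overlaid histograms in E.4 is sketched below. It repeats the data-generation code from the exercise so that it runs on its own. The shared bin edges, the choice of 25 bins, the `alpha=0.5` transparency, and the function name `plot_overlaid` are assumptions made for illustration, not requirements of the exercise.

```python
import numpy as np

# Same simulated data as in the exercise.
np.random.seed(42)
community_a = np.concatenate([np.random.normal(8, 3, 70),
                              np.random.normal(68, 8, 60)])
community_b = np.random.normal(35, 12, 130)
community_c = np.random.exponential(20, 100) + 5

communities = {
    "Community A": community_a,
    "Community B": community_b,
    "Community C": community_c,
}

# Shared bin edges keep the three histograms aligned on the same scale.
all_values = np.concatenate(list(communities.values()))
edges = np.linspace(all_values.min(), all_values.max(), 26)  # 25 bins

for name, data in communities.items():
    print(f"{name}: n = {len(data)}, mean = {data.mean():.1f}, "
          f"median = {np.median(data):.1f}")


def plot_overlaid():
    """Overlay the three histograms with transparency (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, data in communities.items():
        ax.hist(data, bins=edges, alpha=0.5, label=name)
    ax.set_title("Simulated distributions for three communities")
    ax.set_xlabel("Simulated value")
    ax.set_ylabel("Frequency")
    ax.legend()
    plt.show()
```

Calling `plot_overlaid()` draws the chart for part (a). The printed means and medians are only a starting point for part (c); the shapes in the plot carry the rest of the argument.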

Part M: Making Connections (Metacognitive) ⭐⭐⭐⭐

M.1. In Chapter 1, we introduced the distinction between descriptive and inferential statistics. Everything in this chapter falls under descriptive statistics. Design a scenario where a graph you created in this chapter would naturally lead to an inferential question. What descriptive pattern would you see, and what inferential question would it raise?

M.2. A classmate says, "I don't need to make graphs. I can just look at the numbers and understand the data." Write a short response explaining why this view is wrong. Use at least one specific example from this chapter where a graph revealed something that summary statistics alone would have missed.

M.3. Think about the concept of "distribution thinking" — seeing data as an entire distribution rather than as individual numbers or single summary values. Describe a situation from your own life (outside of this class) where distribution thinking would change how you interpret information. For example, how would it change how you read product reviews, interpret sports statistics, or evaluate a grade in a class?

M.4. Consider the ethical dimension of data visualization. In Case Study 1, you'll see examples of graphs that mislead. But misleading doesn't always mean intentionally deceptive — sometimes graphs mislead because the creator didn't know better.

(a) Describe one way a well-meaning person might accidentally create a misleading graph. (b) Describe one way someone might intentionally create a misleading graph to support their agenda. (c) What responsibility does a data analyst have when creating visualizations that will be seen by non-experts?

M.5. Connect this chapter to Chapter 4 (study design). Alex compared histograms of watch times for the two tiers and noticed that Premium users watch longer sessions than Free users. Can she conclude that upgrading to Premium causes people to watch more? Why or why not? What confounding variable(s) might explain the association? What study design would be needed to establish causation?

M.6. Write a "distribution diary" for one day. Record at least three situations where you encounter data (news articles, social media, sports scores, grades, prices, etc.). For each one, describe what the distribution might look like — even if you can't actually see it. Is it likely symmetric or skewed? Unimodal or bimodal? Are there likely outliers? What story does the shape probably tell?