Exercises: Numerical Summaries — Center, Spread, and Shape

These exercises progress from concept checks to hands-on calculations and real-world applications. Estimated completion time: 3 hours.

Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)

Part A: Conceptual Understanding ⭐

A.1. Explain the difference between the mean and the median in your own words. Give a real-world scenario where reporting the mean would be misleading and explain why the median would be a better choice.

A.2. What does it mean to say that the median is a "resistant measure" but the mean is not? Use a concrete numerical example — create a small dataset, show its mean and median, then add one extreme value and show how each changes.

A.3. Two datasets both have a mean of 50 and a median of 50. Dataset A has a standard deviation of 5. Dataset B has a standard deviation of 25. Without doing any calculations, describe how the histograms of these two datasets would differ.

A.4. A student says, "The standard deviation is negative for my dataset — most of my values are below the mean." Explain why this statement is impossible, and identify the student's misunderstanding.

A.5. The Empirical Rule says that about 95% of data falls within 2 standard deviations of the mean. Under what conditions does this rule apply? Give an example of a distribution where it would NOT apply, and explain why.

A.6. Explain the difference between variance and standard deviation. Why do we bother with variance at all if standard deviation is in the original units?

A.7. What is a z-score, and why is it useful? Give an example of a situation where you'd need z-scores to make a fair comparison.

Part B: Calculations by Hand ⭐⭐

B.1. A coffee shop records the number of lattes sold per hour during 9 hours of operation:

12, 15, 22, 18, 25, 14, 30, 19, 22

Calculate the following by hand (show your work):
(a) Mean
(b) Median
(c) Mode
(d) Range
(e) Is this distribution likely skewed? In which direction? Explain how you can tell from comparing the mean and median.

B.2. A professor records quiz scores (out of 20) for a class of 8 students:

14, 17, 12, 19, 15, 13, 16, 18

Calculate the following by hand (show your work):
(a) Mean
(b) Variance ($s^2$)
(c) Standard deviation ($s$)
(d) Interpret the standard deviation in a sentence.
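A hand calculation of this kind can be checked with Python's standard library. The sketch below uses a small made-up list (not the quiz scores) and assumes the sample formulas, $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$ and $s = \sqrt{s^2}$, which is what the $s^2$ notation conventionally denotes.

```python
import statistics

# Hypothetical data, chosen only for illustration (not the B.2 scores).
values = [2, 4, 4, 6, 9]

mean = statistics.mean(values)

# Sample variance from the definition: squared deviations summed, divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in values]
variance_by_hand = sum(squared_deviations) / (len(values) - 1)

# The library functions use the same n - 1 denominator, so they should agree.
variance_lib = statistics.variance(values)
std_lib = statistics.stdev(values)

print(f"mean = {mean}")
print(f"s^2 from the definition   = {variance_by_hand}")
print(f"s^2 from statistics.variance = {variance_lib}")
print(f"s = {std_lib:.3f}")
```

If your hand result disagrees with the library value, a likely cause is dividing by $n$ instead of $n - 1$.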

B.3. Using the quiz scores from B.2, find:
(a) Q1, Q2 (median), and Q3
(b) The IQR
(c) The lower and upper fences for outlier detection
(d) Are there any outliers?
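The quartile and fence steps can be sketched in code as follows. This is one possible approach, not the only one: it assumes the median-of-halves convention for quartiles (the middle value is left out of both halves when $n$ is odd) and the usual $1.5 \times \text{IQR}$ rule for fences. Software defaults often interpolate instead, so library quartiles may differ slightly from hand results. The function names and the data are hypothetical.

```python
import statistics

def quartiles_by_halves(data):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    This convention is an assumption; other textbooks and libraries
    use interpolation and can give slightly different quartiles.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values below the median position
    upper = ordered[half + (n % 2):]   # skip the middle value when n is odd
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

def iqr_fences(data):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    q1, _, q3 = quartiles_by_halves(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data, not taken from any exercise.
sample = [3, 5, 7, 8, 9, 11, 13, 40]
q1, q2, q3 = quartiles_by_halves(sample)
low, high = iqr_fences(sample)
outliers = [x for x in sample if x < low or x > high]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"fences: {low} to {high}")
print(f"outliers: {outliers}")
```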

B.4. A student earns the following grades:

| Course | Credits | Grade (points) |
|---|---|---|
| Calculus | 4 | B (3.0) |
| Chemistry | 4 | C (2.0) |
| English | 3 | A (4.0) |
| Psychology | 3 | A (4.0) |
| Music | 1 | B (3.0) |

(a) Calculate the simple (unweighted) mean of the grade points.
(b) Calculate the weighted mean (GPA) using credits as weights.
(c) Which is higher? Explain why in terms of which courses carry more weight.
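The difference between a simple and a weighted mean can be seen in a short sketch. The helper `weighted_mean` and the numbers below are hypothetical and unrelated to the grade table; the formula assumed is $\bar{x}_w = \sum w_i x_i / \sum w_i$.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of (weight * value) divided by the sum of weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical scores and weights, for illustration only.
scores = [90, 70, 80]
weights = [1, 3, 2]

simple = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)

print(f"simple mean   = {simple:.2f}")
print(f"weighted mean = {weighted:.2f}")
```

Here the lowest score carries the largest weight, so the weighted mean falls below the simple mean.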

B.5. A dataset has: Q1 = 40, Median = 55, Q3 = 70, Min = 15, Max = 120.
(a) Calculate the IQR.
(b) Calculate the lower and upper fences.
(c) Are the minimum and maximum values outliers?
(d) Sketch a rough box plot based on this five-number summary. Describe what the box plot would look like.

B.6. Body temperatures of 10 healthy adults (in °F):

97.8, 98.0, 98.2, 98.2, 98.4, 98.4, 98.6, 98.6, 98.8, 99.0

(a) Find the mean and standard deviation.
(b) What percentage of values fall within 1 standard deviation of the mean? Within 2 standard deviations?
(c) How do these percentages compare to the Empirical Rule predictions? Is this dataset approximately bell-shaped?
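Counting the share of values within $k$ standard deviations of the mean is tedious by hand but short in code. The sketch below is a minimal illustration on made-up readings (not the temperature data); the helper name `percent_within` is hypothetical, and it assumes the sample standard deviation.

```python
import statistics

def percent_within(values, k):
    """Percentage of values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - k * sd, mean + k * sd
    inside = sum(1 for x in values if lower <= x <= upper)
    return 100 * inside / len(values)

# Hypothetical readings, for illustration only.
readings = [10, 12, 13, 14, 15, 15, 15, 16, 17, 18, 20]

for k in (1, 2, 3):
    print(f"within {k} SD: {percent_within(readings, k):.1f}%")
```

With a dataset this small the percentages can only take a few possible values, so modest departures from 68%, 95%, and 99.7% are expected even for roughly bell-shaped data.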

B.7. Alex Rivera finds that StreamVibe session lengths have a mean of 28 minutes and a standard deviation of 12 minutes. Calculate the z-score for each of the following sessions:
(a) A session lasting 40 minutes
(b) A session lasting 16 minutes
(c) A session lasting 28 minutes
(d) A session lasting 64 minutes
(e) Which of these sessions would you consider "unusual" using the z-score method? Explain.
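The z-score formula, $z = (x - \bar{x}) / s$, translates directly into code. The sketch below uses a hypothetical exam with mean 70 and standard deviation 10 rather than the session data, and the function name `z_score` is an illustrative choice.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev

# Hypothetical exam: mean 70, standard deviation 10.
exam_mean, exam_sd = 70, 10

for score in (55, 70, 95):
    z = z_score(score, exam_mean, exam_sd)
    print(f"score {score}: z = {z:+.2f}")
```

Many introductory courses treat values with $|z|$ above 2 as unusual and above 3 as very unusual. That threshold is an assumption here, so use whichever cutoff your chapter specifies.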

Part C: Interpretation and Application ⭐⭐

C.1. A city reports that the average home price is $450,000. A local news reporter digs into the data and finds that the *median* home price is $310,000.
(a) Which measure is higher? What does this tell you about the shape of the distribution?
(b) A family with a $300,000 budget sees the "average" price and gives up on buying a home in the city. What advice would you give them?
(c) Why might a real estate industry group prefer to report the mean, while a consumer advocacy group might prefer the median?

C.2. Two sections of an introductory biology course take the same exam. The results:

| Section | Mean | Median | Std Dev |
|---|---|---|---|
| Section A | 78 | 79 | 8 |
| Section B | 78 | 72 | 15 |

(a) Both sections have the same mean. Are the sections performing similarly? Explain.
(b) What does the difference between mean and median in Section B suggest about its distribution shape?
(c) In which section is a student's score more predictable? How do you know?
(d) A student in Section A scores 66. A student in Section B scores 66. Calculate the z-score for each. Which student's performance is more unusual relative to their section?

C.3. The five-number summary for the ages of 200 marathon runners is:

Min = 18, Q1 = 28, Median = 35, Q3 = 44, Max = 72

(a) What is the IQR? Interpret it in a sentence.
(b) Calculate the fences for outlier detection. Are any of the extreme values outliers?
(c) Would you expect the mean age to be higher or lower than the median? Explain.
(d) Approximately how many runners are between ages 28 and 44?

C.4. Professor Washington is studying traffic stop data. He finds that the average duration of traffic stops is 15 minutes, with a standard deviation of 8 minutes. The distribution is approximately bell-shaped.
(a) Using the Empirical Rule, what percentage of traffic stops last between 7 and 23 minutes?
(b) What percentage last longer than 31 minutes?
(c) A particular traffic stop lasted 45 minutes. Calculate its z-score. Would you consider this unusual?
(d) Washington notices that traffic stops of minority drivers have a different mean but similar standard deviation. Why might comparing z-scores across groups be more informative than comparing raw durations?

C.5. A tech company reports the following about its employees' salaries: mean = $125,000, median = $92,000, mode = $75,000.
(a) Describe the shape of the salary distribution.
(b) If the company wants to attract new employees, which statistic might they highlight in their job postings? Why?
(c) If a labor union is negotiating for higher wages for most employees, which statistic would best support their argument? Why?
(d) Is there anything wrong with the company reporting the mean salary? Discuss the ethical implications.

Part D: Python and Technology ⭐⭐⭐

D.1. Using Python (or your preferred tool), create the following dataset and compute all requested statistics:

```python
import pandas as pd
import numpy as np

# Monthly electricity bills for 30 apartments (in dollars)
bills = pd.Series([
    62, 71, 78, 80, 82, 85, 87, 88, 90, 92,
    94, 95, 97, 99, 100, 102, 105, 108, 110, 112,
    115, 118, 122, 128, 135, 142, 158, 175, 210, 340
])
```

(a) Calculate mean, median, and standard deviation.
(b) Find the five-number summary.
(c) Create a box plot. How many outliers does the box plot identify?
(d) Calculate z-scores for the largest and smallest values. Are they outliers by the z-score method?
(e) If the apartment with the $340 bill is removed, how much do the mean and median change? What does this tell you about the resistance of each measure?

D.2. Using the Gapminder dataset (or any dataset with a numerical variable and a grouping variable):

```python
import pandas as pd

# If using gapminder:
# pip install gapminder
# from gapminder import gapminder
# df = gapminder[gapminder['year'] == 2007]
```

(a) Create side-by-side box plots comparing a numerical variable across groups (e.g., life expectancy by continent).
(b) Calculate the mean, median, and standard deviation for each group.
(c) Which groups show the most spread? Which are most compact?
(d) Identify any outliers. Can you determine which specific observations they are?
(e) Write a 2-3 sentence interpretation of the box plots, as if you were presenting to a non-technical audience.
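Per-group summaries like those in part (b) are commonly computed with a pandas `groupby`. The sketch below is a minimal illustration on a toy DataFrame; the column names `group` and `value` and the numbers are hypothetical stand-ins for whatever dataset you choose.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 12, 14, 5, 15, 25],
})

# One row per group, one column per statistic.
# pandas' "std" is the sample standard deviation (n - 1 denominator).
summary = df.groupby("group")["value"].agg(["mean", "median", "std"])
print(summary)

# Side-by-side box plots (requires matplotlib to be installed):
# df.boxplot(column="value", by="group")
```

Both toy groups are symmetric, so mean and median coincide within each group; the standard deviation column is what separates them.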

D.3. Write a Python function that takes a list of numbers and returns a complete summary report:

```python
def summary_report(data, name="Variable"):
    """
    Print a complete numerical summary for a dataset.
    Include: mean, median, mode, std dev, variance, range,
    IQR, five-number summary, number of outliers (IQR method),
    and whether mean > median (skewed right) or mean < median (skewed left).
    """
    # Your code here
    pass
```

Test your function on at least two different datasets — one symmetric and one skewed.

D.4. Using the Empirical Rule, write Python code that:
(a) Generates 10,000 random values from a bell-shaped distribution (hint: `np.random.normal()`)
(b) Calculates the percentage of values within 1, 2, and 3 standard deviations of the mean
(c) Compares these percentages to the Empirical Rule predictions (68%, 95%, 99.7%)
(d) Repeats the process with a skewed distribution (hint: `np.random.exponential()`) and shows that the Empirical Rule does NOT work

Part E: Synthesis and Critical Thinking ⭐⭐⭐

E.1. A school district reports that "the average teacher salary is $65,000." A teachers' union reports that "the typical teacher earns $52,000."
(a) Both statements could be technically true. Explain how, using the concepts from this chapter.
(b) What additional information would you need to fully understand the salary distribution?
(c) How does this example connect to Theme 2 from this textbook, "averages can hide stories"?
(d) Draft a brief, honest summary of teacher salaries that avoids misleading readers with a single number.

E.2. Consider the following scenario: Sam Okafor is comparing two basketball players for the Riverside Raptors. Player A averages 20 points per game with a standard deviation of 3. Player B averages 20 points per game with a standard deviation of 10.
(a) Which player is more consistent?
(b) If the team needs exactly 20 points from a player to win a crucial game, which player would you choose? Why?
(c) If the team is down by 30 and needs a player who might score 35+ points, which player gives them a better chance? Why?
(d) Explain how this scenario illustrates that "spread is uncertainty" (Theme 4).

E.3. (Connection to Chapter 5) In Chapter 5, you described distributions using words: shape, center, spread, unusual features. Now you can describe them with numbers. For each verbal description below, suggest appropriate numerical summaries and explain your choices:
(a) "The distribution of exam scores is approximately bell-shaped and symmetric."
(b) "Home prices are heavily skewed to the right with several extreme luxury homes."
(c) "Patient waiting times show two distinct peaks: one around 10 minutes and another around 45 minutes."

E.4. (Research and ethics) The economist Thomas Piketty argued that wealth inequality is better measured by the ratio of mean to median wealth than by either number alone. Research and explain:
(a) What does a large ratio of mean to median indicate about a distribution?
(b) In the United States, the mean household wealth is roughly $1,060,000 while the median is roughly $190,000. Calculate the ratio. What does this tell you?
(c) Is it ethically problematic to report only the mean wealth? Only the median? Discuss.

Part F: Progressive Project Connection ⭐⭐⭐⭐

F.1. Return to the dataset you've been analyzing in your Data Detective Portfolio.

(a) Compute the mean, median, standard deviation, and IQR for at least two numerical variables.
(b) Create box plots for these variables. If you have a categorical grouping variable, create side-by-side box plots.
(c) Use the IQR method to identify outliers. For each outlier, investigate: is it likely a data error or a legitimate extreme value?
(d) For any approximately bell-shaped variable, verify the Empirical Rule by calculating the percentage of values within 1, 2, and 3 standard deviations.
(e) Write a 1-page summary of your findings. Address: What do the summary statistics reveal that the histograms from Chapter 5 didn't? What do the histograms reveal that the summary statistics don't?