Exercises: Probability Distributions and the Normal Curve

These exercises progress from concept checks through applied distribution analysis. Estimated completion time: 3 hours.

Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)

Part A: Conceptual Understanding ⭐

A.1. In your own words, explain the difference between a discrete random variable and a continuous random variable. Give two examples of each from everyday life.

A.2. A probability distribution for a discrete random variable assigns probabilities to individual values, and these probabilities must sum to 1. For a continuous distribution, individual values have probability zero. How, then, do we find probabilities for continuous variables? What plays the role of "probability" in a continuous distribution?

A.3. Explain why the expected value of a fair die roll is $E(X) = 3.5$, even though you can never actually roll a 3.5. What does "expected value" mean?

A.4. A classmate says, "The normal distribution doesn't apply to my data because my histogram doesn't look like a perfect bell curve." How would you respond? What does George Box's quote "All models are wrong, but some are useful" mean in this context?

A.5. List the four BINS conditions for the binomial distribution. For each condition, give an example of a real-world scenario where that specific condition would be violated.

A.6. Explain the difference between a probability mass function (PMF) and a probability density function (PDF). Which applies to discrete distributions? Which applies to continuous distributions?

A.7. Why does the continuity correction matter when using a normal distribution to approximate a binomial? When can you skip it?

Part B: Binomial Distribution ⭐⭐

B.1. A multiple-choice quiz has 8 questions, each with 4 answer choices. A student who didn't study guesses randomly on every question.

(a) Verify that this scenario meets the BINS conditions for the binomial distribution.

(b) What is the probability the student gets exactly 2 questions right?

(c) What is the probability the student gets 5 or more questions right?

(d) What is the expected number of correct answers? What is the standard deviation?

B.2. Sam Okafor has tracked Daria Williams's free-throw shooting percentage at 82% over the past two seasons. In tonight's game, she attempts 15 free throws.

(a) Define the random variable $X$ and identify $n$ and $p$.

(b) Calculate $P(X = 12)$ using the binomial formula. Then verify with Python.

(c) What is the probability Daria makes at least 13 of 15 free throws?

(d) What is the probability she makes fewer than 10?

(e) If Daria made only 8 of 15 free throws in a game, would that be unusual? Calculate the probability and explain.
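
For the "verify with Python" step in B.2(b), the calls below are one way to check hand calculations against scipy.stats.binom. This is a minimal sketch, not a worked solution: the values $n = 10$ and $p = 0.5$ are illustrative assumptions and do not come from any exercise.

```python
from scipy.stats import binom

# Illustrative values only (not from any exercise): 10 flips of a fair coin.
n, p = 10, 0.5

p_exactly_5 = binom.pmf(5, n, p)   # P(X = 5)
p_at_most_3 = binom.cdf(3, n, p)   # P(X <= 3)
p_at_least_7 = binom.sf(6, n, p)   # P(X >= 7) = P(X > 6) = 1 - P(X <= 6)
mean = binom.mean(n, p)            # n * p
sd = binom.std(n, p)               # sqrt(n * p * (1 - p))

print(f"P(X = 5)  = {p_exactly_5:.4f}")
print(f"P(X <= 3) = {p_at_most_3:.4f}")
print(f"P(X >= 7) = {p_at_least_7:.4f}")
print(f"mean = {mean:.2f}, sd = {sd:.4f}")
```

Note the off-by-one in the "at least" case: sf(k) gives $P(X > k)$, so $P(X \geq 7)$ uses sf(6).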

B.3. Dr. Maya Chen knows that approximately 18% of adults in her county smoke. She randomly selects 20 adults for a health screening.

(a) What is the probability that exactly 3 of the 20 are smokers?

(b) What is the probability that 5 or more are smokers?

(c) Calculate the expected number of smokers and the standard deviation.

(d) If Maya found that 8 of the 20 were smokers, would that be surprising? Use the expected value and standard deviation to explain.

B.4. Alex Rivera's data shows that 15% of StreamVibe users click on a recommended video. In a random sample of 25 users who see a recommendation:

(a) What is $P(X = 0)$ — the probability that none of them click? Interpret this result.

(b) What is the probability that at least one user clicks?

(c) What value of $X$ is most likely (the mode of the distribution)?

B.5. Write Python code using scipy.stats.binom to create a bar chart showing the complete binomial probability distribution for $n = 20$, $p = 0.35$. Mark the expected value on the chart with a vertical dashed line.

Part C: Normal Distribution Basics ⭐⭐

C.1. The heights of adult men in the U.S. are approximately normally distributed with $\mu = 70$ inches and $\sigma = 3$ inches.

(a) What percentage of men are taller than 76 inches (6'4")?

(b) What percentage of men are shorter than 64 inches (5'4")?

(c) What percentage of men have heights between 67 and 73 inches?

(d) How tall must a man be to be in the tallest 5%?
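
The four question types in C.1 (area above, area below, area between, and a cutoff for a given percentile) each map onto one scipy.stats.norm call. The sketch below shows that mapping under assumed, illustrative parameters ($\mu = 50$, $\sigma = 10$) that are not taken from any exercise.

```python
from scipy.stats import norm

# Illustrative values only (not from any exercise): X ~ Normal(mu=50, sigma=10).
mu, sigma = 50, 10

z = (60 - mu) / sigma                          # z-score of x = 60
p_below = norm.cdf(60, loc=mu, scale=sigma)    # P(X < 60)
p_above = norm.sf(60, loc=mu, scale=sigma)     # P(X > 60) = 1 - P(X < 60)
p_between = (norm.cdf(60, loc=mu, scale=sigma)
             - norm.cdf(45, loc=mu, scale=sigma))  # P(45 < X < 60)
cutoff = norm.ppf(0.75, loc=mu, scale=sigma)   # value with 75% of the area below it

print(f"z = {z:.2f}")
print(f"P(X < 60) = {p_below:.4f}")
print(f"P(X > 60) = {p_above:.4f}")
print(f"P(45 < X < 60) = {p_between:.4f}")
print(f"75th percentile = {cutoff:.2f}")
```

Standardizing first and calling norm.cdf(z) with the default loc=0, scale=1 gives the same area as passing loc and scale directly, which is a handy cross-check against the z-table.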

C.2. The Empirical Rule states that about 68% of data falls within 1 SD of the mean for a bell-shaped distribution. Using the z-table (or Python), find the exact percentage. How close is the Empirical Rule's approximation?

C.3. IQ scores are designed to follow a normal distribution with $\mu = 100$ and $\sigma = 15$.

(a) What proportion of people have IQ scores above 130?

(b) What proportion have IQ scores between 85 and 115?

(c) What IQ score marks the 99th percentile?

(d) Mensa requires an IQ at or above the 98th percentile for membership. What IQ score is required?

C.4. For each scenario, determine if a normal model would be appropriate. Explain your reasoning.

(a) The number of cars passing through a toll booth per hour

(b) The weight of cereal boxes filled by an automated machine

(c) Annual household income in the United States

(d) The time it takes a student to complete a 60-minute exam

(e) The temperature at noon in a city over the course of a year

Part D: Z-Scores and Probability ⭐⭐

D.1. What cumulative probability $P(Z \leq -1.45)$ corresponds to a z-score of $z = -1.45$? What does this probability represent visually on the standard normal curve?

D.2. Find the z-score that separates:

(a) The bottom 20% from the top 80%

(b) The top 1% from the bottom 99%

(c) The middle 90% from the two tails (find both z-scores)

D.3. On a statistics exam, the scores are approximately normally distributed with $\mu = 72$ and $\sigma = 9$.

(a) Alex scored 85. What is Alex's z-score? What percentage of students scored below Alex?

(b) Jordan scored 58. What is Jordan's z-score? What percentage scored above Jordan?

(c) The professor decides to give A's to the top 12% of students. What is the minimum score for an A?

(d) The professor also decides that students scoring below the 5th percentile will receive an F. What is the cutoff?

D.4. Two students take different standardized tests. Student A scores 680 on a test with $\mu = 500$ and $\sigma = 100$. Student B scores 28 on a test with $\mu = 21$ and $\sigma = 5$.

(a) Calculate the z-score for each student.

(b) Which student performed better relative to their peers?

(c) What percentile is each student at?

D.5. Using Python, create a visualization showing:

(a) The standard normal curve with the area to the left of $z = 1.5$ shaded

(b) The area between $z = -0.8$ and $z = 1.2$ shaded

(c) The area in the tails beyond $z = \pm 1.96$ shaded

Include the probability value for each shaded region in the title of each subplot.

Part E: Applied Normal Distribution Problems ⭐⭐

E.1. Dr. Maya Chen is analyzing systolic blood pressure readings from a community health screening. The readings follow an approximately normal distribution with $\mu = 125$ mmHg and $\sigma = 14$ mmHg.

(a) What proportion of readings fall in the "normal" range of 90-120 mmHg?

(b) What proportion of readings indicate Stage 1 hypertension (120-140 mmHg)?

(c) What blood pressure value marks the 90th percentile?

(d) A patient's blood pressure is 155 mmHg. How unusual is this? Calculate the z-score and the percentage of readings higher than this value.

E.2. Alex Rivera's team at StreamVibe finds that the time between a user opening the app and clicking on their first video follows an approximately normal distribution with $\mu = 8.3$ seconds and $\sigma = 2.1$ seconds.

(a) What proportion of users click within 5 seconds?

(b) What proportion take more than 12 seconds?

(c) The team considers users who take more than 12 seconds to be "slow engagers." Express your answer to (b) as a percentage of all users.

(d) The fastest 25% of users are classified as "impulse clickers." What is the cutoff time?

E.3. The weight of packages shipped by a warehouse follows a normal distribution with $\mu = 4.2$ pounds and $\sigma = 0.8$ pounds. The shipping company charges extra for packages over 5.5 pounds and rejects packages over 7 pounds.

(a) What percentage of packages incur an extra charge?

(b) What percentage of packages are rejected?

(c) If the warehouse ships 1,000 packages per day, how many do they expect to be rejected?

(d) The warehouse wants to reduce extra-charge packages to no more than 5%. Without changing the standard deviation, what would the mean weight need to be?

Part F: Assessing Normality ⭐⭐⭐

F.1. Describe what you would see on a QQ-plot for each of the following types of data. Sketch the pattern.

(a) Perfectly normal data

(b) Right-skewed data (e.g., income)

(c) Data with heavier tails than normal (e.g., stock returns)

(d) Data from a uniform distribution

F.2. For each of the following datasets, predict whether a normal model would be appropriate, then explain what the QQ-plot and Shapiro-Wilk test would likely show:

(a) The heights of 500 randomly selected adult women

(b) The annual salaries of employees at a large corporation

(c) The number of typos per page in a 300-page book

(d) The time between arrivals at an emergency room

F.3. Write Python code that:

(a) Generates 500 observations from a normal distribution with $\mu = 50$, $\sigma = 10$

(b) Creates a 2-panel figure: histogram with normal overlay (left) and QQ-plot (right)

(c) Runs the Shapiro-Wilk test and prints the result

(d) Repeats (a)-(c) for data generated from an exponential distribution

Compare the results. What does each diagnostic tool reveal about the exponential data's departures from normality?
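
As a starting point for F.3, the sketch below shows how the numeric side of the diagnostics can be obtained from scipy.stats, leaving the plotting and interpretation to you. The samples (a standard normal and a lognormal), the sample size of 300, and the seed are illustrative assumptions, deliberately different from the distributions the exercise asks for.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

# Illustrative samples only: one normal, one right-skewed (lognormal).
samples = {
    "normal": rng.normal(loc=0, scale=1, size=300),
    "lognormal": rng.lognormal(mean=0, sigma=1, size=300),
}

results = {}
for name, data in samples.items():
    # Shapiro-Wilk returns the W statistic and a p-value.
    w_stat, p_value = stats.shapiro(data)
    # Without a `plot` argument, probplot returns the QQ-plot coordinates and
    # the least-squares line through them; r is the correlation of the points.
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
        data, dist="norm"
    )
    results[name] = {"W": w_stat, "p": p_value, "qq_r": r}
    print(f"{name:>9}: W = {w_stat:.4f}, p = {p_value:.4g}, QQ r = {r:.4f}")
```

To draw the QQ-plot rather than just compute it, probplot also accepts a Matplotlib axes through its plot argument.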

F.4. A manufacturing plant produces ball bearings with a target diameter of 10.00 mm. Quality control measures 200 bearings and gets the following summary statistics: mean = 10.02 mm, SD = 0.03 mm, skewness = 0.15, minimum = 9.93, maximum = 10.12.

(a) Based on the summary statistics alone, does this data seem approximately normal? Why or why not?

(b) The QQ-plot shows points following the line closely except for the three largest values, which curve slightly upward. What does this suggest?

(c) The Shapiro-Wilk test gives $W = 0.991$, $p = 0.23$. What do you conclude?

(d) Given all three assessments, would you use the normal model for this data? Justify your answer.

Part G: Integration and Critical Thinking ⭐⭐⭐

G.1. Professor Washington is comparing risk scores for two demographic groups. Group A has $\mu = 38$, $\sigma = 10$, $n = 800$. Group B has $\mu = 47$, $\sigma = 12$, $n = 1{,}200$. Both distributions are approximately normal.

(a) What proportion of Group A has risk scores above 55? What proportion of Group B?

(b) The county uses 55 as the cutoff for "high risk," which triggers pretrial detention. What percentage of each group would be detained?

(c) Discuss the ethical implications of using a single cutoff score when the two groups have different distributions. Connect this to Theme 4 (the bell curve isn't destiny).

G.2. A pharmaceutical company claims that the reaction time of patients taking their new medication follows a normal distribution with $\mu = 250$ ms and $\sigma = 30$ ms. A researcher tests 50 patients and finds:

- Mean reaction time: 265 ms
- The Shapiro-Wilk test gives $p = 0.04$
- The QQ-plot shows most points on the line but with a few outliers in the upper tail

(a) Is the researcher's sample consistent with the company's claim about the mean? (Hint: think about where 265 falls relative to the claimed distribution. We'll formalize this in Chapter 13.)

(b) What does the Shapiro-Wilk result suggest? Is this conclusive evidence against normality? Consider the sample size.

(c) What would you recommend the researcher do next?

G.3. You read a news article claiming: "Students' math scores follow a bell curve, proving that most students are average and only a few can be exceptional." Write a paragraph critiquing this claim. In your response:

(a) Explain why test scores often follow a bell curve (think about how tests are constructed)

(b) Distinguish between the distribution of scores and the distribution of ability

(c) Explain why the normal distribution is a model, not a natural law

(d) Connect your critique to the idea that "all models are wrong, but some are useful"

Part H: Python Practice ⭐⭐⭐

H.1. Write a Python function called binomial_summary(n, p) that:

(a) Prints the mean, standard deviation, and mode

(b) Prints $P(X = k)$ for all $k$ from 0 to $n$

(c) Plots the PMF as a bar chart

(d) Identifies the most likely outcome and prints: "The most likely number of successes is ___ with probability ___"

Test your function with $n = 15$, $p = 0.4$.

H.2. Write a Python function called normal_probability(mu, sigma, lower=None, upper=None) that:

(a) Calculates and prints the probability for the specified range

(b) If only lower is given: calculates $P(X > \text{lower})$

(c) If only upper is given: calculates $P(X < \text{upper})$

(d) If both are given: calculates $P(\text{lower} < X < \text{upper})$

(e) Prints the equivalent z-score(s) alongside the probability

Test with $\mu = 100$, $\sigma = 15$: (i) $P(X > 130)$, (ii) $P(X < 85)$, (iii) $P(90 < X < 110)$.

H.3. Write a Python function called normality_check(data, variable_name) that creates a comprehensive normality assessment:

(a) A 3-panel figure: histogram with normal overlay, QQ-plot, box plot

(b) The Shapiro-Wilk test result (with a warning if $n > 5000$)

(c) Skewness and kurtosis values

(d) A printed verdict: "Approximately normal" or "Evidence of non-normality" based on the Shapiro-Wilk p-value

Test your function on both normal and non-normal data.

Part I: Challenge Problems ⭐⭐⭐⭐

I.1. (Research) The binomial distribution has four conditions (BINS). For each violation below, identify which condition fails and research what alternative distribution or approach would be appropriate:

(a) Drawing 5 cards from a standard deck without replacement, counting the number of hearts

(b) Rolling a die 10 times and counting how many times each number appears

(c) Counting the number of customers arriving at a store in an hour

Write a brief paragraph for each, naming the alternative distribution and explaining why it's needed.

I.2. (Simulation) Write Python code to demonstrate the normal approximation to the binomial:

(a) For $n = 50$ and $p = 0.3$, calculate $P(X \leq 18)$ using the exact binomial CDF

(b) Calculate the same probability using the normal approximation (with and without continuity correction)

(c) Create a visualization overlaying the binomial PMF bars with the normal PDF curve

(d) Repeat for $n = 10$ and $p = 0.1$. Does the approximation work as well? Why or why not?

(e) State the general rule of thumb for when the normal approximation is reasonable.
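
The comparison in I.2(a)-(b) can be set up as below. This is a hedged sketch, not the exercise solution: the values $n = 100$, $p = 0.5$, $k = 45$ are illustrative assumptions. The continuity correction treats the discrete event $X \leq k$ as the continuous event $X \leq k + 0.5$ under a normal curve with mean $np$ and standard deviation $\sqrt{np(1-p)}$.

```python
import math
from scipy.stats import binom, norm

# Illustrative values only (not the ones used in the exercise).
n, p, k = 100, 0.5, 45

mu = n * p                           # mean of the binomial
sigma = math.sqrt(n * p * (1 - p))   # standard deviation of the binomial

exact = binom.cdf(k, n, p)                 # exact P(X <= k)
approx_plain = norm.cdf(k, mu, sigma)      # normal approximation, no correction
approx_cc = norm.cdf(k + 0.5, mu, sigma)   # with continuity correction

print(f"exact binomial:       {exact:.4f}")
print(f"normal, no correction: {approx_plain:.4f}  (error {abs(exact - approx_plain):.4f})")
print(f"normal, corrected:     {approx_cc:.4f}  (error {abs(exact - approx_cc):.4f})")
```

Rerunning the same comparison with a small $n$ and a $p$ far from 0.5 is a quick way to see where the approximation starts to break down.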

I.3. (Research) George Box's quote "All models are wrong, but some are useful" is one of the most cited in statistics. Find the original 1976 paper (Journal of the American Statistical Association, Vol. 71, No. 356, pp. 791-799) or a summary of it.

(a) What was Box actually writing about? (Hint: it wasn't the normal distribution.)

(b) How does his argument apply to the normal distribution specifically?

(c) Identify two situations where assuming normality would be dangerously wrong (not just slightly inaccurate).

(d) Identify two situations where the normal assumption is "wrong but useful."

Solutions Guide

Selected answers appear in Appendix B (Answers to Selected Exercises). Problems with worked solutions: A.3, B.1, C.1, D.3, E.1, F.4, G.3.