Exercises: Inference for Means

These exercises progress from conceptual understanding through applied t-tests, condition checking, robustness assessment, and Python/Excel implementation. Estimated completion time: 3 hours.

Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)

Part A: Conceptual Understanding ⭐

A.1. Explain in your own words why we use the t-distribution instead of the normal (z) distribution when testing a claim about a population mean. What source of uncertainty does the t-distribution account for that the z-distribution does not?

A.2. A classmate says: "The t-distribution is just a worse version of the normal distribution — wider and less precise." Explain what's wrong with this characterization. Why are the heavier tails of the t-distribution a feature, not a bug?

A.3. True or false (explain each):

(a) The t-distribution with 1,000 degrees of freedom is virtually identical to the standard normal distribution.

(b) A t-test always produces a larger p-value than a z-test on the same data.

(c) If $n = 500$, it doesn't matter whether you use a z-test or a t-test.

(d) The t-distribution is symmetric around zero.

(e) Degrees of freedom for a one-sample t-test equal the sample size.

(f) The t-test requires the population to be exactly normally distributed.

A.4. Why are hypotheses always stated in terms of $\mu$ (the population mean) rather than $\bar{x}$ (the sample mean)? Give a concrete example to illustrate your answer.

A.5. A researcher reports: "The t-test showed $p = 0.03$, so the treatment increased scores by 5 points." Identify at least two problems with this statement.

A.6. Explain the relationship between a 99% confidence interval and a two-sided hypothesis test at $\alpha = 0.01$. If the 99% CI for a population mean is (82, 96), which null hypothesis values would be rejected at $\alpha = 0.01$? Which would not?

Part B: Setting Up Hypotheses ⭐

B.1. For each scenario, state the null and alternative hypotheses in symbols and indicate whether the test is one-tailed or two-tailed.

(a) A nutritionist claims that the average American eats 2,200 calories per day. A health researcher suspects the true average is higher.

(b) A city's water utility guarantees that lead levels in drinking water average no more than 15 parts per billion (ppb). An environmental group tests whether the average exceeds this standard.

(c) A battery manufacturer claims its AAA batteries last an average of 20 hours. A consumer testing lab wants to check this claim (the batteries might last longer or shorter).

(d) Sam wants to test whether the Raptors' average rebounds per game has changed from last season's 42.5.

(e) Alex wants to test whether the average loading time for StreamVibe pages is less than the industry standard of 3.0 seconds.

(f) Maya wants to test whether average hospital readmission rates differ from the national rate of 15.6%.

B.2. For each scenario in B.1, identify what $\mu$ represents, what $\mu_0$ is, and what a Type I error and a Type II error would mean in context.

Part C: Computing t-Test Statistics ⭐⭐

C.1. A sample of 40 light bulbs from a production line has a mean lifetime of $\bar{x} = 1,180$ hours with $s = 120$ hours. The manufacturer claims the mean lifetime is 1,200 hours.

(a) State the hypotheses for a two-tailed test.

(b) Calculate the test statistic.

(c) Find the degrees of freedom.

(d) Using Python or a t-table, find the p-value.

(e) At $\alpha = 0.05$, what is your conclusion?
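One way to check a hand calculation for C.1 is to work directly from the summary statistics. The following is a minimal sketch, assuming SciPy is installed; the helper name `t_test_from_summary` is illustrative and not part of any library.

```python
import math
from scipy import stats


def t_test_from_summary(xbar, s, n, mu0):
    """Return (t, df, two-tailed p) for a one-sample t-test from summary statistics."""
    se = s / math.sqrt(n)                    # estimated standard error of the mean
    t_stat = (xbar - mu0) / se               # how many SEs the sample mean is from mu0
    df = n - 1                               # one-sample t-test degrees of freedom
    p_two = 2 * stats.t.sf(abs(t_stat), df)  # area in both tails beyond |t|
    return t_stat, df, p_two


# Values from C.1: n = 40 bulbs, sample mean 1,180 h, s = 120 h, claimed mean 1,200 h
t_stat, df, p_two = t_test_from_summary(xbar=1180, s=120, n=40, mu0=1200)
print(f"t = {t_stat:.3f}, df = {df}, two-tailed p = {p_two:.3f}")
```

For a one-tailed test such as C.2, the same helper could be adapted by using `stats.t.sf(t_stat, df)` without doubling.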

C.2. A food safety inspector tests whether the average sodium content in a brand of soup exceeds the label claim of 800 mg per serving. A random sample of 15 cans yields $\bar{x} = 823$ mg and $s = 36$ mg.

(a) State the hypotheses.

(b) Calculate the test statistic.

(c) Find the p-value (one-tailed).

(d) At $\alpha = 0.05$, what is your conclusion?

(e) Construct a 95% confidence interval for the true mean sodium content.

(f) Does the CI support your hypothesis test conclusion? Explain.

C.3. A school administrator claims the average SAT math score at her school is 520. A sample of 28 students has $\bar{x} = 535$ and $s = 55$.

(a) Test the claim that the true mean exceeds 520 at $\alpha = 0.05$.

(b) Would your conclusion change at $\alpha = 0.01$? Show why.

(c) Construct 95% and 99% confidence intervals. Which is wider? Why?
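For the confidence intervals in C.2(e) and C.3(c), the interval $\bar{x} \pm t^* \cdot s/\sqrt{n}$ can be computed from summary statistics. This is a sketch under the assumption that SciPy is available; `t_interval_from_summary` is a hypothetical helper name chosen for illustration.

```python
import math
from scipy import stats


def t_interval_from_summary(xbar, s, n, confidence=0.95):
    """Return (lower, upper) for a t-based confidence interval for the mean."""
    se = s / math.sqrt(n)
    # Critical value t* that leaves (1 - confidence) / 2 in each tail
    t_star = stats.t.ppf(1 - (1 - confidence) / 2, n - 1)
    margin = t_star * se
    return xbar - margin, xbar + margin


# Values from C.3: n = 28 students, sample mean 535, s = 55
ci95 = t_interval_from_summary(535, 55, 28, confidence=0.95)
ci99 = t_interval_from_summary(535, 55, 28, confidence=0.99)
print(f"95% CI: ({ci95[0]:.1f}, {ci95[1]:.1f})")
print(f"99% CI: ({ci99[0]:.1f}, {ci99[1]:.1f})")
```

Comparing the two printed widths is a quick way to check the answer to C.3(c).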

Part D: Checking Conditions ⭐⭐

D.1. For each scenario, determine whether the conditions for a one-sample t-test are met. If not, explain which condition is violated and what you would recommend.

(a) A researcher surveys 200 students who volunteer to participate in an online study about sleep habits.

(b) A pharmaceutical company randomly selects 12 patients and measures their cholesterol levels. The histogram shows a strong right skew with one extreme outlier.

(c) A factory randomly selects 50 products from a day's production of 400 items and measures their weights.

(d) A psychologist times how long 22 randomly selected participants take to complete a maze. The data are roughly symmetric with no outliers.

(e) An ecologist measures the mercury levels in 8 randomly caught fish from a lake. The histogram is approximately bell-shaped.

D.2. For each dataset description below, assess whether the normality condition is satisfied. State whether you would proceed with the t-test, use caution, or recommend an alternative method.

(a) $n = 10$, histogram shows slight right skew, no outliers

(b) $n = 50$, histogram shows moderate right skew, no outliers

(c) $n = 8$, histogram is roughly symmetric and bell-shaped

(d) $n = 25$, histogram shows strong right skew with two extreme outliers

(e) $n = 100$, histogram shows moderate left skew

D.3. A student performs a Shapiro-Wilk test on her sample of 200 observations and gets $p = 0.02$. She concludes that the t-test cannot be used. Evaluate her reasoning.

Part E: Applied Problems ⭐⭐

E.1. (Maya's domain) A random sample of 45 emergency department patients at a rural hospital had an average wait time of 312 minutes with $s = 87$ minutes. The state benchmark is 280 minutes.

(a) Test whether the average wait time exceeds the state benchmark at $\alpha = 0.05$.

(b) Construct a 95% confidence interval for the true mean wait time.

(c) Maya notes that wait time data is typically right-skewed. Should she be concerned about the normality condition? Why or why not?

(d) Interpret your results in a sentence that a hospital administrator could understand.

E.2. (Alex's domain) StreamVibe randomly samples 30 user sessions and measures engagement time in minutes:

32, 48, 55, 41, 38, 67, 29, 44, 52, 36,
59, 43, 50, 35, 61, 40, 47, 53, 45, 38,
56, 42, 49, 33, 51, 46, 54, 37, 58, 44

(a) The industry benchmark for engagement time is 42 minutes. Test whether StreamVibe's engagement time differs from this benchmark.

(b) Compute the 90%, 95%, and 99% confidence intervals. How does the confidence level affect the width?

(c) For which of these CIs does the benchmark value of 42 fall outside the interval?
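Because E.2 supplies raw data, the test and all three intervals can be computed in one pass. The sketch below assumes NumPy and SciPy; variable names such as `engagement` and `intervals` are illustrative choices.

```python
import numpy as np
from scipy import stats

# Engagement times (minutes) for the 30 sampled sessions in E.2
engagement = np.array([
    32, 48, 55, 41, 38, 67, 29, 44, 52, 36,
    59, 43, 50, 35, 61, 40, 47, 53, 45, 38,
    56, 42, 49, 33, 51, 46, 54, 37, 58, 44,
])

# Two-sided one-sample t-test against the benchmark of 42 minutes
result = stats.ttest_1samp(engagement, popmean=42)

mean = engagement.mean()
se = stats.sem(engagement)   # standard error, computed with the n - 1 divisor
df = len(engagement) - 1

# t-based confidence intervals at three confidence levels
intervals = {
    level: stats.t.interval(level, df, loc=mean, scale=se)
    for level in (0.90, 0.95, 0.99)
}

print(f"mean = {mean:.2f}, t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} CI: ({lower:.2f}, {upper:.2f})")
```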

E.3. (Sam's domain) The Riverside Raptors averaged 22.3 assists per game last season. Under the new coach, a sample of 20 games shows $\bar{x} = 25.1$ and $s = 5.8$.

(a) Test whether the average assists per game has increased at $\alpha = 0.05$.

(b) If the p-value had been exactly 0.050, what would your conclusion be? Explain the distinction between $p \leq \alpha$ and $p < \alpha$.

(c) Sam's friend argues that a 2.8-assist increase "isn't that much." How would you respond? (Hint: think about the difference between statistical significance and practical significance, which you'll explore more in Chapter 17.)

Part F: z-Test vs. t-Test ⭐⭐

F.1. For each scenario, indicate whether a z-test or t-test is more appropriate, and explain why.

(a) Testing whether the average height of a sample of 35 adults differs from 67 inches. Population standard deviation is unknown.

(b) Testing whether the average score on a standardized exam (known $\sigma = 15$) differs from 100 for a sample of 50 students.

(c) Testing whether the average temperature in a city last month differed from the historical average of 72°F. Sample of 31 days, standard deviation estimated from the sample.

(d) A factory has records spanning 20 years establishing that $\sigma = 0.5$ mm for a particular component. You test whether a new batch's mean diameter is 10 mm.

F.2. For the scenario in F.1(b), compute both the z-test statistic and the t-test statistic (using $s = 14.2$ from the sample). Compare the p-values. How different are they? Does the choice matter here?
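A side-by-side comparison for F.2 might be set up as follows. Note that F.1(b) does not state a sample mean, so the value `xbar = 104` below is a purely hypothetical placeholder; substitute the value you are given. The rest uses $\sigma = 15$, $s = 14.2$, $n = 50$, and $\mu_0 = 100$ from the exercises, and assumes SciPy is available.

```python
import math
from scipy import stats

n, mu0 = 50, 100
sigma = 15      # known population standard deviation, from F.1(b)
s = 14.2        # sample standard deviation, from F.2
xbar = 104      # HYPOTHETICAL sample mean: not given in the exercise

# z-test: uses the known sigma and the standard normal distribution
z_stat = (xbar - mu0) / (sigma / math.sqrt(n))
p_z = 2 * stats.norm.sf(abs(z_stat))

# t-test: uses the sample s and the t-distribution with n - 1 df
t_stat = (xbar - mu0) / (s / math.sqrt(n))
p_t = 2 * stats.t.sf(abs(t_stat), n - 1)

print(f"z = {z_stat:.3f}, p = {p_z:.4f}")
print(f"t = {t_stat:.3f}, p = {p_t:.4f}")
```

The two statistics differ here both because $s \neq \sigma$ and because the reference distributions differ; with $n = 50$ the second effect is small.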

F.3. A statistics student asks: "If the t-test always works and the z-test only works when $\sigma$ is known, why don't we just always use the t-test?" Provide a thoughtful response.

Part G: Robustness and Violations ⭐⭐⭐

G.1. A researcher conducts a one-sample t-test with $n = 10$ on data from a strongly right-skewed population. Her p-value is 0.048. Should she trust this result? Explain your reasoning using the simulation results from Section 15.6.

G.2. Consider the following three datasets, all with $n = 20$, $\bar{x} = 50$, and $s = 10$:

Dataset A: Approximately normal, no outliers
Dataset B: Moderately right-skewed, no outliers
Dataset C: Approximately normal, but with one outlier at 95

(a) All three would produce the same t-statistic and p-value if tested against $H_0: \mu = 47$. But the t-test would be more trustworthy for some datasets than others. Rank them from most to least trustworthy and explain why.

(b) For Dataset C, what would happen to $\bar{x}$ and $s$ if the outlier (95) were removed? How would this affect the t-test result?

G.3. A medical researcher wants to test whether a new drug changes average blood pressure. She has two options:

Option A: Collect $n = 12$ patients and test using a t-test
Option B: Collect $n = 50$ patients and test using a t-test

Blood pressure is known to be approximately normally distributed. Compare the two options in terms of: (i) robustness, (ii) power (ability to detect a real effect), and (iii) width of the confidence interval.

G.4. Explain why the t-test is robust to bimodal distributions but not to heavy-tailed distributions, even when both violate the normality assumption. (Hint: think about what the CLT guarantees and what it doesn't protect against.)

Part H: Confidence Intervals and Interpretation ⭐⭐

H.1. A 95% confidence interval for the average commute time in a city is (28.3, 34.7) minutes, based on a random sample of 60 commuters.

(a) Interpret this interval in context.

(b) Would you reject $H_0: \mu = 30$ at $\alpha = 0.05$? Explain without computing a test statistic.

(c) Would you reject $H_0: \mu = 28$ at $\alpha = 0.05$?

(d) A newspaper reports: "There is a 95% probability that the true average commute time is between 28.3 and 34.7 minutes." Correct this statement.

H.2. Two researchers study average sleep duration. Researcher A reports a 95% CI of (6.8, 7.4) hours based on $n = 200$. Researcher B reports a 95% CI of (5.5, 8.7) hours based on $n = 15$.

(a) Both intervals contain 7.0 hours. Does this mean the two studies found the same thing? Explain.

(b) Which study provides more useful information? Why?

(c) What would Researcher B need to do to get an interval as narrow as Researcher A's?

H.3. You compute a 95% CI of (102.3, 107.8) and a 99% CI of (101.1, 109.0) for the same data. Explain why the 99% interval is wider. Use the fishing net metaphor from Chapter 12 in your explanation.

Part I: Python and Excel Practice ⭐⭐

I.1. Use Python to conduct a one-sample t-test on the following data. Test $H_0: \mu = 100$ vs. $H_a: \mu \neq 100$ at $\alpha = 0.05$.

data = [105, 98, 110, 102, 99, 115, 108, 97, 103, 106,
        112, 101, 95, 109, 104, 111, 100, 107, 96, 113]

(a) Compute the sample mean, standard deviation, and standard error.

(b) Use scipy.stats.ttest_1samp() to find the t-statistic and p-value.

(c) Construct a 95% confidence interval.

(d) Create a histogram and QQ-plot to check conditions.

(e) State your conclusion in context.
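A possible starting scaffold for I.1 is sketched below, assuming NumPy and SciPy. It covers parts (a) through (c) and computes the QQ-plot coordinates for part (d) without drawing them; the plotting itself (for example with matplotlib) is left to you. Note that `np.std` divides by $n$ unless `ddof=1` is passed.

```python
import numpy as np
from scipy import stats

data = np.array([105, 98, 110, 102, 99, 115, 108, 97, 103, 106,
                 112, 101, 95, 109, 104, 111, 100, 107, 96, 113])

# (a) Summary statistics; ddof=1 gives the sample standard deviation
n = len(data)
xbar = data.mean()
s = data.std(ddof=1)
se = s / np.sqrt(n)

# (b) Two-sided one-sample t-test of H0: mu = 100
result = stats.ttest_1samp(data, popmean=100)

# (c) 95% confidence interval for the mean
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=xbar, scale=se)

# (d) QQ-plot coordinates: theoretical normal quantiles vs. ordered data.
# r is the correlation of the QQ points; values near 1 suggest near-normality.
(theoretical_q, ordered_data), (slope, intercept, r) = stats.probplot(data, dist="norm")

print(f"n = {n}, mean = {xbar:.2f}, s = {s:.2f}, SE = {se:.3f}")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f}), QQ correlation r = {r:.3f}")
```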

I.2. Using Excel, set up a spreadsheet to perform the same test as I.1. Your spreadsheet should include:
- Cells for $\bar{x}$, $s$, $n$, $\mu_0$, $df$
- The t-statistic formula
- The two-tailed p-value using T.DIST.2T
- The margin of error and 95% CI bounds

I.3. Write a Python function that takes a dataset and a hypothesized mean, and produces a complete analysis report including:
- Summary statistics ($n$, $\bar{x}$, $s$, SE)
- t-statistic, degrees of freedom, and p-value
- 95% confidence interval
- A normality assessment (Shapiro-Wilk test result)
- A text-based conclusion (reject or fail to reject)

Test your function on at least two different datasets.

Part J: Making Connections ⭐⭐⭐

J.1. (CI-Test Duality) A 95% confidence interval for the average price of a gallon of gasoline in a state is (\$3.42, \$3.78).

(a) Without computing a test statistic, determine the result of testing $H_0: \mu = 3.50$ at $\alpha = 0.05$.

(b) Determine the result of testing $H_0: \mu = 3.80$ at $\alpha = 0.05$.

(c) Determine the result of testing $H_0: \mu = 3.40$ at $\alpha = 0.05$.

(d) Could you use this CI to determine the result of a test at $\alpha = 0.01$? Why or why not?
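The duality rule used in J.1 (a two-sided test at level $\alpha$ fails to reject exactly those $\mu_0$ inside the $(1 - \alpha)$ confidence interval) is simple enough to express as a small function. This is an illustrative sketch; `two_sided_decision` is a hypothetical name, and the rule applies only when the CI and the test use matching $\alpha$.

```python
def two_sided_decision(ci, mu0):
    """Decision of a two-sided test at the alpha that matches the CI's level."""
    lower, upper = ci
    # Values inside the interval are plausible for mu, so H0 is not rejected
    return "fail to reject H0" if lower <= mu0 <= upper else "reject H0"


ci_95 = (3.42, 3.78)  # 95% CI from J.1, so the matching test uses alpha = 0.05
decisions = {mu0: two_sided_decision(ci_95, mu0) for mu0 in (3.50, 3.80, 3.40)}
for mu0, decision in decisions.items():
    print(f"H0: mu = {mu0:.2f} -> {decision}")
```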

J.2. (Connection to Sampling Distributions, Ch.11) Explain why the standard error $s/\sqrt{n}$ decreases as $n$ increases. What does this imply about the width of confidence intervals and the power of t-tests? Connect your answer to the CLT from Chapter 11.

J.3. (Connection to Study Design, Ch.4) A company wants to test whether its employees' average job satisfaction score differs from the industry average of 7.2 (on a 10-point scale). They send an optional survey to all 500 employees and receive 85 responses. Can they use a t-test on these responses? What's the concern?

J.4. (Preview of Paired t-Test) Ten students take a statistics quiz before and after a study session. Their scores:

Student	Before	After	Difference
1	62	75	13
2	78	80	2
3	55	68	13
4	85	88	3
5	70	79	9
6	63	72	9
7	90	91	1
8	58	70	12
9	72	78	6
10	67	76	9

(a) Why would it be incorrect to treat the "Before" scores and "After" scores as two independent samples?

(b) Compute the mean and standard deviation of the differences.

(c) Conduct a one-sample t-test on the differences to test $H_0: \mu_d = 0$ vs. $H_a: \mu_d > 0$ at $\alpha = 0.05$.

(d) Interpret your result. Did the study session help?
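To check the J.4 calculations, the differences can be tested directly. The sketch below assumes NumPy and SciPy 1.6 or later (for the `alternative` argument). It also runs `scipy.stats.ttest_rel` on the raw columns to illustrate that a paired t-test is the same computation as a one-sample t-test on the differences.

```python
import numpy as np
from scipy import stats

before = np.array([62, 78, 55, 85, 70, 63, 90, 58, 72, 67])
after = np.array([75, 80, 68, 88, 79, 72, 91, 70, 78, 76])
diff = after - before            # matches the "Difference" column in the table

mean_d = diff.mean()
sd_d = diff.std(ddof=1)          # sample standard deviation of the differences

# One-sample t-test on the differences, H0: mu_d = 0 vs. Ha: mu_d > 0
one_sample = stats.ttest_1samp(diff, popmean=0, alternative="greater")

# Equivalent paired test on the raw columns
paired = stats.ttest_rel(after, before, alternative="greater")

print(f"mean difference = {mean_d:.2f}, sd = {sd_d:.3f}")
print(f"one-sample: t = {one_sample.statistic:.3f}, p = {one_sample.pvalue:.5f}")
print(f"paired:     t = {paired.statistic:.3f}, p = {paired.pvalue:.5f}")
```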

Part K: Ethical and Critical Thinking ⭐⭐⭐

K.1. A pharmaceutical company tests whether its pain medication reduces average pain scores below 5 (on a 10-point scale). The sample of 5,000 patients yields $\bar{x} = 4.92$, $s = 2.1$, and $p = 0.004$.

(a) Is the result statistically significant at $\alpha = 0.05$?

(b) The medication reduced pain by 0.08 points on a 10-point scale. Is this reduction practically meaningful? How would a patient respond to this finding?

(c) Why did such a tiny effect produce such a small p-value?

(d) What additional information would you want before recommending this medication?

K.2. An education researcher tests whether a new teaching method changes average test scores from 75. She tests 20 classrooms and finds $p = 0.043$. She publishes the result. Meanwhile, three other researchers tested similar teaching methods and found $p = 0.12$, $p = 0.28$, and $p = 0.55$. Those results were never published because they were "not significant."

(a) What is this phenomenon called? (Recall from Chapter 13.)

(b) If you considered all four studies together, would you conclude the teaching method works?

(c) How does publication bias affect the reliability of the published literature?

K.3. A factory manager uses a t-test daily to check whether the average weight of products meets the specification of 500 g. She uses $\alpha = 0.05$.

(a) If the process is working perfectly (products truly average 500 g), how many false alarms would she expect per year (assuming ~250 working days)?

(b) What would happen if she lowered $\alpha$ to 0.01? What's the tradeoff?

(c) Suggest a better approach than daily hypothesis tests. (Hint: think about confidence intervals or control charts.)

Part L: Challenge Problems ⭐⭐⭐⭐

L.1. (Mathematical) Starting from the definition of the t-statistic $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$, show that the 95% confidence interval $\bar{x} \pm t^*_{0.025, n-1} \cdot s/\sqrt{n}$ is equivalent to the set of all $\mu_0$ values for which the two-tailed t-test would not reject $H_0$ at $\alpha = 0.05$. This proves the CI-test duality for means.

L.2. (Simulation Study) Write a Python simulation to verify the robustness results in Section 15.6:

(a) Generate 10,000 samples of size $n = 10$ from an exponential distribution with $\lambda = 1$ (strongly right-skewed). For each sample, conduct a t-test with $H_0: \mu = 1$ (which is true) at $\alpha = 0.05$. What proportion of tests reject $H_0$?

(b) Repeat with $n = 30$ and $n = 100$. How does the actual rejection rate compare to the nominal 0.05?

(c) Repeat the entire exercise using a normal population. Verify the rejection rate is approximately 0.05 at all sample sizes.

(d) Summarize your findings. At what sample size does the t-test become reliable for exponential data?
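One possible skeleton for the L.2 simulation is shown below. It is a sketch, not a reference solution: it assumes NumPy and SciPy, fixes an arbitrary seed for reproducibility, and uses illustrative helper names (`rejection_rate`, `draw_exponential`, `draw_normal`). The normal population is given mean 1 and standard deviation 1 so that $H_0: \mu = 1$ is true in both cases; that choice is an assumption, since the exercise does not specify the normal parameters.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)  # arbitrary seed, for reproducibility only


def draw_exponential(size):
    # Exponential with rate lambda = 1, i.e. scale = 1 / lambda = 1 and mean 1
    return rng.exponential(scale=1.0, size=size)


def draw_normal(size):
    # ASSUMED comparison population: normal with mean 1 and sd 1
    return rng.normal(loc=1.0, scale=1.0, size=size)


def rejection_rate(draw, true_mean, n, reps=10_000, alpha=0.05):
    """Proportion of two-sided t-tests that reject a true H0: mu = true_mean."""
    samples = draw((reps, n))                          # one sample per row
    pvalues = stats.ttest_1samp(samples, popmean=true_mean, axis=1).pvalue
    return float(np.mean(pvalues < alpha))


rates = {}
for n in (10, 30, 100):
    rates[n] = {
        "exponential": rejection_rate(draw_exponential, 1.0, n),
        "normal": rejection_rate(draw_normal, 1.0, n),
    }
    print(n, rates[n])
```

Because $H_0$ is true in every run, each printed proportion is an estimate of the actual Type I error rate, to be compared against the nominal 0.05.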

L.3. (Effect of Outliers) Consider a sample of $n = 20$ values drawn from a $N(100, 10^2)$ distribution. You are testing $H_0: \mu = 100$.

(a) Simulate 1,000 such samples and record the proportion of Type I errors at $\alpha = 0.05$.

(b) Now contaminate each sample: replace the largest observation with a value of 200 (an extreme outlier). Repeat the simulation. How does the outlier affect the Type I error rate?

(c) How does the outlier affect the power of the test when the true mean is 105? (You'll need to generate data from $N(105, 10^2)$ and add the outlier.)
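Parts (a) and (b) of L.3 could be set up along these lines. This is a sketch assuming NumPy and SciPy, with an arbitrary seed and an illustrative helper name `rejection_rate`. For part (c), the same helper can be reused on samples drawn from $N(105, 10^2)$, in which case the rejection proportion estimates power rather than the Type I error rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed, for reproducibility only


def rejection_rate(samples, mu0=100.0, alpha=0.05):
    """Proportion of rows whose two-sided t-test rejects H0: mu = mu0."""
    pvalues = stats.ttest_1samp(samples, popmean=mu0, axis=1).pvalue
    return float(np.mean(pvalues < alpha))


# (a) 1,000 clean samples of size 20 from N(100, 10^2); H0 is true
clean = rng.normal(loc=100, scale=10, size=(1_000, 20))

# (b) Replace the largest observation in each sample with the outlier 200
contaminated = clean.copy()
rows = np.arange(contaminated.shape[0])
contaminated[rows, contaminated.argmax(axis=1)] = 200.0

clean_rate = rejection_rate(clean)
contaminated_rate = rejection_rate(contaminated)
print(f"clean: {clean_rate:.3f}, contaminated: {contaminated_rate:.3f}")
```

When interpreting the output, keep in mind that a single extreme value shifts $\bar{x}$ but also inflates $s$, and the two effects pull the t-statistic in opposite directions.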

L.4. (Sample Size Determination) Maya wants to estimate the average ED wait time in her county to within $\pm 10$ minutes with 95% confidence. Based on her preliminary data, $s \approx 45$ minutes.

(a) Use the formula $n = \left(\frac{t^* \cdot s}{E}\right)^2$ to find the required sample size. Note: since $t^*$ depends on $n$ (through $df$), you'll need to iterate or use an approximation.

(b) Start with $z^* = 1.96$ as an initial approximation. Compute $n$, then recompute using $t^*$ with $df = n - 1$. Does the result change much?

(c) What happens to the required $n$ if Maya wants the margin of error to be $\pm 5$ minutes instead? Explain the relationship between precision and required sample size.
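The iteration described in L.4(a) and (b) can be automated. The sketch below assumes SciPy; `required_n` is a hypothetical helper that starts from $z^*$, then repeatedly recomputes $n$ with $t^*$ at $df = n - 1$ until the rounded-up value stops changing. It assumes the starting value of $n$ is at least 2 so that the degrees of freedom are positive.

```python
import math
from scipy import stats


def required_n(s, margin, confidence=0.95, max_iter=100):
    """Smallest stable n from iterating n = ceil((t* * s / margin)^2)."""
    tail = 1 - (1 - confidence) / 2
    # Initial approximation using the normal critical value z*
    n = math.ceil((stats.norm.ppf(tail) * s / margin) ** 2)
    for _ in range(max_iter):
        t_star = stats.t.ppf(tail, n - 1)          # t* depends on n through df
        n_next = math.ceil((t_star * s / margin) ** 2)
        if n_next == n:                            # converged
            break
        n = n_next
    return n


n_10 = required_n(s=45, margin=10)   # margin of error of 10 minutes
n_5 = required_n(s=45, margin=5)     # margin of error of 5 minutes
print(f"E = 10: n = {n_10}")
print(f"E = 5:  n = {n_5}")
```

Comparing the two results gives a numerical check on the precision and sample size relationship asked about in part (c).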
