Exercises: Hypothesis Testing: Making Decisions with Data

These exercises progress from conceptual understanding through applied hypothesis testing, p-value interpretation, and ethical reasoning about statistical significance. Estimated completion time: 3.5 hours.

Difficulty Guide:

- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)


Part A: Conceptual Understanding ⭐

A.1. In your own words, explain the logic of hypothesis testing using the courtroom analogy. What corresponds to the defendant? The prosecution's evidence? The verdict?

A.2. A classmate says: "We got a p-value of 0.03, so there's only a 3% chance the null hypothesis is true." Explain what's wrong with this statement and provide the correct interpretation of $p = 0.03$.

A.3. Another classmate responds: "Fine, so the p-value of 0.03 means there's a 3% chance the results are just due to random chance." Is this correct? Why or why not?

A.4. Explain the difference between "failing to reject $H_0$" and "accepting $H_0$." Why does the distinction matter?

A.5. True or false (explain each):

(a) A p-value of 0.001 means the effect is very large.

(b) If $p > 0.05$, the null hypothesis is true.

(c) If $p < 0.05$ in two separate studies, the two studies found the same effect.

(d) The significance level $\alpha$ should be chosen before looking at the data.

(e) A smaller p-value provides stronger evidence against $H_0$.

(f) A p-value can be negative.

A.6. Explain why hypotheses are always stated in terms of population parameters ($\mu$, $p$) rather than sample statistics ($\bar{x}$, $\hat{p}$).

A.7. A researcher sets $\alpha = 0.05$ and obtains $p = 0.052$. She writes: "The result was marginally significant and suggests a real effect." Evaluate this statement. What are the issues?

A.8. Explain the relationship between a 95% confidence interval and a two-sided hypothesis test at $\alpha = 0.05$. If the 95% CI for a mean is (42, 58), which null hypothesis values would be rejected? Which would not?


Part B: Setting Up Hypotheses ⭐

B.1. For each of the following scenarios, state the null and alternative hypotheses in words and symbols. Indicate whether the test is one-tailed or two-tailed.

(a) A manufacturer claims that its light bulbs last an average of 1,200 hours. A consumer group suspects the bulbs don't last as long as claimed.

(b) A school principal believes that a new teaching method increases average test scores above the current average of 72.

(c) A pharmaceutical company wants to show that its new drug lowers cholesterol more than the standard treatment, which lowers it by an average of 25 mg/dL.

(d) Sam wants to test whether Daria's free throw percentage has changed from her career rate of 78%.

(e) Alex wants to test whether more than 15% of StreamVibe users cancel within the first month.

(f) Professor Washington wants to test whether the algorithm's risk scores differ between defendants who reoffend and those who don't.

B.2. A researcher examines her data and finds that the average test score in her sample is 68. She then writes: "$H_0: \mu = 72$ and $H_a: \mu < 72$." What's wrong with her process? What should she have done differently?

B.3. For Sam's test about Daria's three-point shooting ($H_0: p = 0.31$, $H_a: p > 0.31$), explain why it would be inappropriate to use a two-tailed test ($H_a: p \neq 0.31$) if Sam's only concern is whether Daria has improved.


Part C: Computing Test Statistics ⭐⭐

C.1. A quality control manager inspects a random sample of 50 circuit boards and finds that the mean resistance is $\bar{x} = 102.3$ ohms. The boards are supposed to have a resistance of $\mu_0 = 100$ ohms, and from extensive testing, $\sigma = 8$ ohms.

(a) State the hypotheses for a two-tailed test.

(b) Calculate the test statistic.

(c) Is the sample mean significantly different from the target? (Just assess based on the test statistic — does it seem large or small?)

C.2. In a random sample of 200 voters, 114 support a ballot measure. The supporters claim that a majority (more than 50%) of voters support the measure.

(a) State the hypotheses.

(b) Calculate the test statistic.

(c) Is the sample proportion significantly different from 0.50 in the hypothesized direction?

C.3. Sam observes Daria take 100 three-point attempts this season and she makes 37 of them ($\hat{p} = 0.37$). Her career average is $p_0 = 0.31$.

(a) Calculate the test statistic for a one-tailed test ($H_a: p > 0.31$).

(b) Compare this test statistic to the one from Chapter 11 (where $n = 65$). What happened and why?

(c) Without computing the exact p-value, will the p-value be smaller or larger than the one from the 65-attempt analysis? Explain.

C.4. Dr. Maya Chen tests whether the average BMI in a community exceeds the national average of 26.5. She surveys 80 randomly selected adults and finds $\bar{x} = 27.8$ with $\sigma = 6.2$ (from national data).

(a) Calculate the test statistic.

(b) Interpret the test statistic in context: how many standard errors above the national average is the community's mean?
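The test statistics throughout Part C can be checked against hand calculations with a short helper. This is a sketch of the standard formulas only; the function names `z_mean` and `z_prop` are ours, not from the chapter:

```python
import numpy as np

def z_mean(xbar, mu0, sigma, n):
    """z statistic for a one-sample mean test with known sigma."""
    return (xbar - mu0) / (sigma / np.sqrt(n))

def z_prop(p_hat, p0, n):
    """z statistic for a one-sample proportion test."""
    return (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)

# Neutral examples (not exercise answers):
print(z_mean(105, 100, 15, 36))   # → 2.0
print(z_prop(0.55, 0.5, 100))     # → 1.0 (approximately)
```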


Part D: P-Values and Decisions ⭐⭐

D.1. For each test statistic and alternative hypothesis, compute the p-value and make a decision at $\alpha = 0.05$.

(a) $z = 2.34$, $H_a: \mu > \mu_0$ (one-tailed right)

(b) $z = -1.87$, $H_a: \mu < \mu_0$ (one-tailed left)

(c) $z = 1.52$, $H_a: \mu \neq \mu_0$ (two-tailed)

(d) $z = -2.85$, $H_a: \mu \neq \mu_0$ (two-tailed)

(e) $z = 1.645$, $H_a: \mu > \mu_0$ (one-tailed right)
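To check your table lookups, a z statistic can be converted to a p-value with the standard normal CDF. A minimal sketch, assuming `scipy` is available (the helper `p_value` is ours, and the example z is deliberately not one from the exercise):

```python
from scipy.stats import norm

def p_value(z, tail):
    """p-value for a z statistic; tail is 'right', 'left', or 'two'."""
    if tail == "right":
        return 1 - norm.cdf(z)
    if tail == "left":
        return norm.cdf(z)
    return 2 * (1 - norm.cdf(abs(z)))

print(round(p_value(1.0, "right"), 4))  # → 0.1587
```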

D.2. Complete Exercise C.1 by computing the p-value, making a decision at $\alpha = 0.05$, and stating your conclusion in context. Is the mean resistance significantly different from the target of 100 ohms?

D.3. Complete Exercise C.2 by computing the p-value, making a decision at $\alpha = 0.05$, and stating your conclusion in context. Can the supporters conclude that a majority of voters support the measure?

D.4. Complete Exercise C.3 by computing the p-value. At $\alpha = 0.05$, can Sam conclude that Daria has improved? How does this compare to the conclusion from the 65-attempt analysis?

D.5. A researcher reports $p = 0.048$. At $\alpha = 0.05$, the result is "statistically significant." At $\alpha = 0.01$, it is not. Discuss the implications. Should the researcher celebrate or be cautious?


Part E: Type I and Type II Errors ⭐⭐

E.1. For each scenario in B.1, describe in context:

(a) What a Type I error would mean

(b) What a Type II error would mean

(c) Which error would be more costly

E.2. A pregnancy test has a false positive rate (Type I error rate) of 2% and a false negative rate (Type II error rate) of 8%.

(a) What does a 2% false positive rate mean in context?

(b) What does an 8% false negative rate mean in context?

(c) If you could improve one of these rates (at the expense of the other), which would you prioritize and why?

(d) How does this connect to the base rate / positive predictive value concepts from Chapter 9?

E.3. Professor Washington is testing whether an algorithm's false positive rate exceeds 10% for a particular demographic group. He sets $\alpha = 0.01$.

(a) What is the consequence of a Type I error in this context?

(b) What is the consequence of a Type II error?

(c) Washington chose a very conservative $\alpha$. What does this tell you about which error he considers more costly? Do you agree with his choice?

(d) If Washington fails to reject $H_0$, can he conclude the algorithm is fair? Explain.

E.4. Fill in the blanks with "increases," "decreases," or "stays the same":

(a) If $\alpha$ decreases from 0.05 to 0.01, the probability of a Type I error ___.

(b) If $\alpha$ decreases from 0.05 to 0.01, the probability of a Type II error ___ (for a given sample size).

(c) If $n$ increases, the probability of a Type II error ___ (for a given $\alpha$).

(d) If the true effect is very large, the probability of a Type II error ___.


Part F: One-Tailed vs. Two-Tailed Tests ⭐⭐

F.1. For each of the following, state whether a one-tailed or two-tailed test is more appropriate. Justify your answer.

(a) Testing whether a new medication changes blood pressure

(b) Testing whether a new tutoring program improves exam scores

(c) Testing whether a coin is biased

(d) Testing whether a new manufacturing process reduces defect rates

(e) Testing whether average commute times have changed since pre-pandemic

F.2. A researcher plans a two-tailed test and obtains $p = 0.08$. She notices the data go in the direction she expected, so she switches to a one-tailed test, obtaining $p = 0.04$, and reports a "significant" result.

(a) What is wrong with this approach?

(b) What should she have done instead?

(c) How does this relate to p-hacking?

F.3. Suppose you're testing $H_0: \mu = 50$ and you compute $z = 1.80$.

(a) Find the p-value for $H_a: \mu > 50$.

(b) Find the p-value for $H_a: \mu \neq 50$.

(c) At $\alpha = 0.05$, which test(s) lead to rejection of $H_0$?

(d) Is it possible for a one-tailed test to reject $H_0$ while the two-tailed test does not? Under what conditions?


Part G: Interpretation and Critical Thinking ⭐⭐⭐

G.1. A study of 50,000 people finds that drinking coffee is associated with a 0.3% increase in average lifespan. The result is highly statistically significant ($p < 0.001$).

(a) Is this finding practically significant? Explain.

(b) Why was the study able to achieve such a small p-value for such a tiny effect?

(c) What additional information would you want before concluding that coffee extends lifespan?

(d) How does this example illustrate the distinction between statistical and practical significance?

G.2. Two studies test the same drug for lowering cholesterol:

| Study | $n$ | Effect (mg/dL reduction) | p-value |
|---|---|---|---|
| Study A | 30 | 15.2 | 0.08 |
| Study B | 3,000 | 2.1 | 0.002 |

(a) Which study found a statistically significant result (at $\alpha = 0.05$)?

(b) Which study found a more practically meaningful effect?

(c) Explain the apparent contradiction: how can a larger effect be "not significant" while a tiny effect is "highly significant"?

(d) What does this tell you about the limitations of p-values?

G.3. A news headline reads: "Scientists PROVE that chocolate prevents heart disease (p < 0.05)."

(a) What's wrong with the word "prove"?

(b) What's wrong with the implication that $p < 0.05$ establishes the claim?

(c) Rewrite the headline to be more statistically accurate.

(d) What questions would you want answered before taking this claim seriously?

G.4. Sam computes a p-value of 0.111 for Daria's three-point improvement and tells the coach: "There's an 88.9% chance Daria has improved." Correct Sam's interpretation and provide the proper one.


Part H: Ethics and P-Hacking ⭐⭐⭐

H.1. A market researcher tests 20 different product features for their effect on customer satisfaction. One feature shows $p = 0.03$. She reports this feature as having a "statistically significant effect."

(a) What is the probability of finding at least one significant result (at $\alpha = 0.05$) among 20 tests when none of the features truly affects satisfaction?

(b) Does the single significant result provide strong evidence? Why or why not?

(c) What should the researcher do instead?

(d) If she applies a Bonferroni correction (using $\alpha' = 0.05/20 = 0.0025$ for each individual test), would the result still be significant?
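Your answer to part (a) can be sanity-checked numerically: under independence, the chance of at least one false positive among $m$ tests is $1 - (1 - \alpha)^m$. A quick sketch (variable names are ours):

```python
alpha, m = 0.05, 20
# probability of >= 1 false positive among m tests when every H0 is true
fwer = 1 - (1 - alpha) ** m
print(round(fwer, 3))
```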

H.2. Describe three specific ways a researcher might p-hack — intentionally or unintentionally. For each, explain why it inflates the false positive rate.

H.3. The American Statistical Association's 2016 statement lists six principles about p-values. In your own words, explain why the ASA felt the need to make such a statement. What was happening in scientific practice that prompted it?

H.4. Debate Exercise: "Should we abandon p-values entirely?"

Prepare arguments for BOTH sides:

For abandoning p-values:

- The ASA's 2016 statement on p-value misuse
- The role of p-values in the replication crisis
- Bayesian alternatives
- Effect sizes and confidence intervals as better tools

Against abandoning p-values:

- The value of a standardized decision framework
- P-values have clear meaning when used correctly
- The problem is misuse, not the tool itself
- No replacement would be immune to misuse

After considering both sides, state your own position with justification.


Part I: Python and Computation ⭐⭐⭐

I.1. Using Python, conduct a complete hypothesis test for the following scenario:

A university claims that the average SAT score of its admitted students is at least 1200. A random sample of 45 students has a mean SAT score of 1178 with a standard deviation of 85.

(a) State hypotheses and conduct the test using scipy.stats.ttest_1samp().

(b) Find the p-value and make a decision at $\alpha = 0.05$.

(c) Construct a 95% confidence interval and verify it's consistent with the hypothesis test.

(d) Create a visualization showing the test statistic on the sampling distribution.
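A possible starting point for I.1, assuming only the summary statistics above are available (note that `scipy.stats.ttest_1samp` expects raw data, so this sketch computes the t statistic and confidence interval directly from the summaries instead; all variable names are ours):

```python
import numpy as np
from scipy import stats

n, xbar, s, mu0 = 45, 1178, 85, 1200
se = s / np.sqrt(n)

# one-tailed test of H0: mu = 1200 vs. Ha: mu < 1200
t_stat = (xbar - mu0) / se
p_left = stats.t.cdf(t_stat, df=n - 1)

# 95% confidence interval for the mean
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (xbar - t_crit * se, xbar + t_crit * se)
print(round(t_stat, 3), round(p_left, 4), tuple(round(x, 1) for x in ci))
```

The visualization in (d) is left to you; `stats.t.pdf` over a grid of t values is one way to draw the sampling distribution.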

I.2. Write a Python simulation that demonstrates the effect of sample size on hypothesis testing. Specifically:

(a) Set the true population mean to $\mu = 52$ and $\sigma = 20$.

(b) Test $H_0: \mu = 50$ vs. $H_a: \mu > 50$ using sample sizes of $n = 10, 30, 50, 100, 500, 1000$.

(c) For each sample size, simulate 10,000 samples and compute the proportion of times $H_0$ is rejected (at $\alpha = 0.05$). This proportion is the power of the test (preview of Chapter 17).

(d) Plot power vs. sample size and discuss the relationship.
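A sketch of the simulation loop for parts (a)-(c), under the assumption of normal populations with known $\sigma$; it draws sample means directly from their sampling distribution, which is equivalent to averaging full samples but much faster (the plotting in (d) is left to you; `rng` and `power` are our names):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma, mu0 = 52, 20, 50
z_crit = 1.645  # one-tailed critical value at alpha = 0.05
n_sims = 10_000

power = {}
for n in [10, 30, 50, 100, 500, 1000]:
    # sampling distribution of x-bar: Normal(mu_true, sigma / sqrt(n))
    means = rng.normal(mu_true, sigma / np.sqrt(n), size=n_sims)
    z = (means - mu0) / (sigma / np.sqrt(n))
    power[n] = np.mean(z > z_crit)  # proportion of simulations rejecting H0

print(power)
```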

I.3. Replicate the p-hacking simulation from Section 13.13, but vary the number of tests per study (1, 5, 10, 20, 50, 100). Plot the false positive rate against the number of tests. At what point does the false positive rate exceed 50%?
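Section 13.13's exact code isn't reproduced here, but the setup can be sketched as follows: every "study" runs $k$ independent two-tailed z tests on pure-noise data (so $H_0$ is true everywhere), and we record how often at least one test comes out significant (names and the choice of $\sigma = 1$ are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_studies, n_obs, z_crit = 2000, 30, 1.96

fp_rate = {}
for k in [1, 5, 10, 20, 50, 100]:
    # pure-noise data: Normal(0, 1), so every null hypothesis is true
    data = rng.normal(0.0, 1.0, size=(n_studies, k, n_obs))
    z = data.mean(axis=2) * np.sqrt(n_obs)  # z statistic with sigma = 1
    # study-level false positive: any of its k tests rejects
    fp_rate[k] = np.mean((np.abs(z) > z_crit).any(axis=1))

print(fp_rate)
```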


Part J: Synthesis and Application ⭐⭐⭐⭐

J.1. Return to the replication crisis case study from Chapter 1. Now that you understand p-values and hypothesis testing:

(a) Explain in detail how p-hacking contributed to the crisis.

(b) Explain how publication bias interacts with p-values to produce a misleading scientific literature.

(c) Explain why small sample sizes amplify both problems.

(d) What reforms would you recommend to address these issues? (Consider: pre-registration, effect size reporting, replication requirements, Bayesian methods.)

J.2. Consider the following real-world scenario: A tech company runs 200 A/B tests per year. Using $\alpha = 0.05$ for each test:

(a) How many false positives would you expect per year, assuming none of the tested changes have a real effect?

(b) If 20% of the tested changes actually do improve outcomes, and the tests have 80% power, how many of the 200 tests would you expect to correctly identify real improvements? How many would miss real improvements?

(c) Of all the tests that produce $p < 0.05$, what fraction are actually false positives? (Hint: think about the total number of rejections vs. the number of true positives among them.)

(d) How does this relate to the positive predictive value concept from Chapter 9?

J.3. Maya, Alex, Sam, and Professor Washington each conduct a hypothesis test this week. For each person, describe a plausible test they might run, identify the appropriate hypotheses, explain the consequences of Type I and Type II errors, and recommend an appropriate $\alpha$ level.