Exercises: Inference for Proportions

These exercises progress from conceptual understanding through applied proportion inference, polling interpretation, and condition checking. Estimated completion time: 3 hours.

Difficulty Guide:

  • ⭐ Foundational (5-10 min each)
  • ⭐⭐ Intermediate (10-20 min each)
  • ⭐⭐⭐ Challenging (20-40 min each)
  • ⭐⭐⭐⭐ Advanced/Research (40+ min each)


Part A: Conceptual Understanding ⭐

A.1. In the z-test for a proportion, the standard error formula uses $p_0$ (the hypothesized value) rather than $\hat{p}$ (the sample proportion). In a confidence interval, the standard error uses $\hat{p}$. In your own words, explain why the formulas differ.
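For reference, the two standard errors contrasted in A.1 can be written side by side. The symbol $n$ (sample size) and the subscripts "test" and "CI" are labels assumed here for illustration:

$$
\text{SE}_{\text{test}} = \sqrt{\frac{p_0(1 - p_0)}{n}}, \qquad \text{SE}_{\text{CI}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}
$$

With hypothetical values $n = 100$, $p_0 = 0.50$, and $\hat{p} = 0.60$, these give $\text{SE}_{\text{test}} = 0.0500$ and $\text{SE}_{\text{CI}} \approx 0.0490$. The two are close but not identical, which is why it matters which one a given procedure calls for.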

A.2. A student checks the success-failure condition for a hypothesis test and writes: "$n\hat{p} = 50 \times 0.42 = 21 \geq 10$ ✓." What mistake did the student make? What should they have used instead of $\hat{p}$?

A.3. True or false (explain each):

(a) If the 95% confidence interval for a proportion is (0.45, 0.55), then a two-sided hypothesis test of $H_0: p = 0.50$ at $\alpha = 0.05$ would fail to reject $H_0$.

(b) The margin of error in a poll measures only random sampling error, not bias.

(c) A sample of 4,000 people always gives a more accurate poll result than a sample of 1,000.

(d) If the success-failure condition fails, no inference about proportions is possible.

(e) A p-value of 0.001 in a proportion test means the sample proportion is very far from $p_0$.

A.4. A news report says: "A new poll shows 48% of voters support the candidate, with a margin of error of ±4 points." A commentator says, "So the candidate is losing — they're below 50%." Evaluate this claim using what you've learned about confidence intervals and margins of error.

A.5. Explain the difference between the Wald interval and the Wilson interval. When does the Wald interval perform poorly, and why does the Wilson interval fix these problems?

A.6. The plus-four method adds 2 "successes" and 2 "failures" to the data before computing the confidence interval. Why does adding fake observations improve the interval? Under what circumstances is this improvement most noticeable?
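As a starting point for A.5 and A.6, here is a minimal sketch of the Wald and plus-four calculations. The function names and the data (0 successes in 20 trials) are hypothetical, chosen only to show the mechanics on an extreme sample.

```python
import math

Z_95 = 1.96  # critical value for 95% confidence


def wald_interval(successes, n, z=Z_95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def plus_four_interval(successes, n, z=Z_95):
    """Plus-four: add 2 successes and 2 failures, then apply the Wald formula."""
    return wald_interval(successes + 2, n + 4, z)


# Hypothetical data: 0 successes in 20 trials
print(wald_interval(0, 20))       # (0.0, 0.0)
print(plus_four_interval(0, 20))  # nonzero width; lower limit may fall below 0
```

Try other small-count inputs and compare the two intervals before writing your explanation.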


Part B: Setting Up Tests ⭐

B.1. For each scenario, state the null and alternative hypotheses in symbols, identify whether the test is one-tailed or two-tailed, and identify what constitutes a "success."

(a) A school district claims that 85% of its students graduate on time. A concerned parent suspects the rate is lower.

(b) A tech company wants to test whether more than 10% of app users use the premium features.

(c) A factory's quality control standard requires that fewer than 2% of products are defective. An inspector wants to check whether the defect rate has risen above this 2% threshold.

(d) Sam wants to test whether Daria's free throw percentage has changed (in either direction) from her career rate of 78%.

(e) A vaccine manufacturer claims at least 95% effectiveness. A regulatory agency wants to verify this claim.

(f) Maya wants to test whether the proportion of county residents who smoke differs from the national rate of 12.5%.

B.2. For each scenario in B.1, identify the population, the parameter of interest, and what a Type I error and Type II error would mean in practical terms. Which error type seems more serious in each case?


Part C: Checking Conditions ⭐⭐

C.1. For each scenario, determine whether the conditions for the one-sample z-test for proportions are met. If a condition fails, state which one and suggest an alternative approach.

(a) A random sample of 500 voters in a city of 200,000 finds that 265 support a ballot measure. Test whether support exceeds 50%.

(b) A convenience sample of 1,200 Instagram users finds that 960 post at least once per week. Test whether the proportion exceeds 75%.

(c) A random sample of 40 microchips from a production run of 300 finds 1 defective chip. Test whether the defect rate exceeds 1%.

(d) A random sample of 80 patients in a clinical trial finds that 72 experience symptom relief. Test whether the relief rate exceeds 85%.

(e) A survey of 15 students in a statistics class finds that 12 prefer morning exams. Test whether a majority prefers morning exams.

C.2. The success-failure condition requires $np_0 \geq 10$ and $n(1-p_0) \geq 10$. For each of the following null hypothesis values, determine the minimum sample size needed to meet this condition.

(a) $p_0 = 0.50$ (b) $p_0 = 0.10$ (c) $p_0 = 0.01$ (d) $p_0 = 0.95$

What pattern do you notice? What does this tell you about testing hypotheses involving very small or very large proportions?
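The condition checks in Part C can be organized as a small helper. This is only a sketch: the function name, the dictionary keys, and the scenario are hypothetical, and the 10% rule (sample no larger than 10% of the population) is assumed to be the independence condition used in the chapter.

```python
def check_conditions(n, p0, population_size=None, random_sample=True):
    """Return pass/fail flags for the one-sample z-test conditions."""
    tol = 1e-9  # guards against floating-point error when a count is exactly 10
    expected_successes = n * p0
    expected_failures = n * (1 - p0)
    results = {
        "random": random_sample,
        "success_failure": (expected_successes >= 10 - tol
                            and expected_failures >= 10 - tol),
    }
    if population_size is not None:
        # Assumed 10% condition: the sample is at most 10% of the population
        results["ten_percent"] = n <= 0.10 * population_size
    return results


# Hypothetical scenario: 60 items drawn at random from a batch of 5,000, H0: p = 0.20
print(check_conditions(n=60, p0=0.20, population_size=5000))
```

Note that the success-failure check uses $p_0$, not $\hat{p}$, because the conditions here are for a hypothesis test. Whether a sample is random cannot be computed from the numbers, so that flag has to be supplied by reading the scenario.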


Part D: Computing and Interpreting ⭐⭐

D.1. A university claims that 70% of its graduates find employment within six months. A student newspaper surveys a random sample of 200 recent graduates and finds that 126 have found employment.

(a) State the hypotheses for testing whether the employment rate is lower than claimed.

(b) Check all three conditions.

(c) Calculate the test statistic.

(d) Find the p-value.

(e) State your conclusion at $\alpha = 0.05$ in context.

(f) Construct a 95% confidence interval for the true employment rate. Does it agree with your test result?
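The calculation steps (c) and (d) follow the same pattern in every Part D problem. The sketch below shows that pattern on hypothetical data (230 successes in 400 trials, $H_0: p = 0.60$); the function name and the `alternative` labels are illustrative choices, not a fixed convention.

```python
import math
from statistics import NormalDist


def one_proportion_ztest(successes, n, p0, alternative="two-sided"):
    """One-sample z-test for a proportion; the standard error uses p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    cdf = NormalDist().cdf(z)  # area to the left of z under the standard normal
    if alternative == "less":
        p_value = cdf
    elif alternative == "greater":
        p_value = 1 - cdf
    elif alternative == "two-sided":
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return z, p_value


# Hypothetical data: 230 successes in 400 trials, H0: p = 0.60, Ha: p < 0.60
z, p_value = one_proportion_ztest(230, 400, 0.60, alternative="less")
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The code does not check conditions or state a conclusion in context; those steps still need to be written out by hand.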

D.2. A pharmaceutical company tests a new allergy medication. In a random sample of 350 allergy sufferers, 245 report significant symptom relief. The company wants to claim that more than 65% of patients will experience relief.

(a) Conduct a full hypothesis test at $\alpha = 0.05$.

(b) Construct a 95% CI.

(c) The company's marketing team wants to advertise "7 out of 10 patients experience relief." Based on your analysis, is this claim supported by the data?

D.3. Sam's colleague, Tanya, tracks a different Riverside Raptors player — point guard Marcus Johnson. Marcus has made 42 out of 120 three-point attempts this season (35.0%). His career three-point percentage is 33%.

(a) Test whether Marcus's shooting has improved at $\alpha = 0.05$.

(b) Compare this to Sam's analysis of Daria. Marcus has nearly twice as many attempts. How does the larger sample size affect the test?

(c) Construct 95% CIs for both Daria and Marcus. Which interval is narrower? Why?

D.4. In a random sample of 1,200 U.S. adults, 684 say they support expanding renewable energy subsidies.

(a) Test whether a majority (more than 50%) supports the subsidies. Use $\alpha = 0.01$.

(b) Construct a 99% confidence interval.

(c) A politician says, "57% of Americans support this policy; that's a clear mandate." Evaluate this claim statistically.


Part E: Polling and Elections ⭐⭐

E.1. A poll of 1,500 likely voters shows Candidate A at 48% and Candidate B at 46%, with 6% undecided. The margin of error is ±2.5 percentage points.

(a) Construct 95% CIs for each candidate's true support level (ignoring undecided voters for now).

(b) Do the confidence intervals overlap? What does this tell you about the race?

(c) Can you conclude from this poll that Candidate A is ahead? Why or why not?

(d) A news headline says "Candidate A Leads by 2 Points." Write a more statistically accurate headline.

E.2. Two polls are conducted one week apart:

| Poll | Sample Size | Candidate A Support | Margin of Error |
|---|---|---|---|
| Poll 1 | 1,000 | 47% | ±3.1% |
| Poll 2 | 1,000 | 51% | ±3.1% |

A TV commentator says: "Candidate A has surged 4 points in a single week! The momentum has clearly shifted!"

(a) Is a 4-point change within the combined margin of error for two polls? (Hint: think about the uncertainty in both polls.)

(b) What would you need to see to be confident that a real shift occurred?

(c) This is a common issue in election coverage. Why do news organizations tend to over-interpret poll movements?
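One way to act on the hint in E.2(a) is to combine the uncertainty from both polls. The sketch below assumes the two polls are independent random samples, so that their variances add. The function names and poll numbers (44% and 47%, 800 respondents each) are hypothetical.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """95% margin of error for a single poll."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


def margin_of_error_of_change(p1, n1, p2, n2, z=1.96):
    """95% margin of error for the change p2 - p1 between two independent polls."""
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return z * se_diff


# Hypothetical polls: 44% of 800 respondents, then 47% of 800 respondents
moe_each = margin_of_error(0.44, 800)
moe_change = margin_of_error_of_change(0.44, 800, 0.47, 800)
print(f"single poll: +/-{moe_each:.3f}, change between polls: +/-{moe_change:.3f}")
```

The margin of error for a change is larger than the margin for either poll alone, but smaller than the two margins added together.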

E.3. A polling organization wants to estimate the proportion of voters who support a ballot initiative with a margin of error of no more than ±2 percentage points at 95% confidence.

(a) Using the conservative estimate $\hat{p} = 0.5$, what sample size is needed?

(b) If a preliminary poll suggests support is around 30%, what sample size is needed? Why is it smaller?

(c) The organization has a budget for surveying 1,000 people. What margin of error can they achieve?
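Sample size planning for a proportion comes from solving the margin-of-error formula for $n$ and rounding up. A minimal sketch, using a hypothetical target of ±3 points rather than the ±2 points in E.3 (the function and parameter names are illustrative):

```python
import math
from statistics import NormalDist


def required_sample_size(margin, p_guess=0.5, confidence=0.95):
    """Smallest n with z * sqrt(p(1 - p) / n) <= margin, for a planning value p_guess."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return math.ceil(z**2 * p_guess * (1 - p_guess) / margin**2)


# Hypothetical target: +/-3 percentage points at 95% confidence
print(required_sample_size(0.03))               # 1068 (conservative p = 0.5)
print(required_sample_size(0.03, p_guess=0.2))  # 683
```

Always round up: rounding down would leave the margin of error slightly above the target.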


Part F: Python Practice ⭐⭐

F.1. Write Python code to conduct a two-tailed z-test for the following scenario: A factory produces bolts, and the acceptable defect rate is 3%. In a random sample of 500 bolts, 22 are defective. Test whether the defect rate differs from 3% at $\alpha = 0.05$. Include both the manual calculation and statsmodels approach.

F.2. Using statsmodels, compute Wald, Wilson, and exact (Clopper-Pearson) confidence intervals for the following data. Compare the results and explain any differences.

(a) $X = 8$, $n = 10$ (b) $X = 80$, $n = 100$ (c) $X = 800$, $n = 1000$

What happens to the difference between the methods as $n$ increases?
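F.2 asks for statsmodels, whose `proportion_confint` function accepts a `method` argument (for example `"normal"`, `"wilson"`, or `"beta"`). To see what the Wilson option computes, it can help to code the formula by hand. The sketch below uses only the standard library and hypothetical data (3 successes in 12 trials); its output should agree closely with the statsmodels result for the same inputs.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    # The center is pulled from p_hat toward 0.5
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical data: 3 successes in 12 trials
low, high = wilson_interval(3, 12)
print(f"Wilson 95% CI: ({low:.4f}, {high:.4f})")
```

Unlike the Wald interval, this interval is not centered on $\hat{p}$, which is one of the differences worth commenting on in your comparison.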

F.3. Write a Python simulation that demonstrates the coverage probability of the Wald interval versus the Wilson interval. Specifically:

  • Set the true proportion to $p = 0.05$ (a scenario where the Wald interval struggles)
  • Generate 10,000 random samples of size $n = 50$ (equivalently, 10,000 draws of the success count from $\text{Binomial}(50, 0.05)$)
  • For each sample, compute both the 95% Wald CI and the 95% Wilson CI
  • Count how many of each type of interval contain the true $p = 0.05$
  • Report the actual coverage for each method

The Wald interval should have coverage notably below 95%. The Wilson interval should be much closer.
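If you are unsure how to structure a coverage simulation, the following template may help. It is deliberately not a solution to F.3: it covers only the Wald interval, uses a hypothetical "easy" case ($p = 0.5$, $n = 100$) and fewer repetitions, and relies on the standard library instead of NumPy. All function names are illustrative.

```python
import math
import random


def wald_interval(successes, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def estimate_coverage(interval_fn, true_p, n, reps=5000, seed=0):
    """Fraction of simulated intervals that contain true_p."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(reps):
        # One Binomial(n, true_p) draw, built from n Bernoulli trials
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = interval_fn(successes, n)
        if low <= true_p <= high:
            hits += 1
    return hits / reps


# Hypothetical case in which the Wald interval is expected to behave reasonably
print(estimate_coverage(wald_interval, true_p=0.5, n=100))
```

Because `estimate_coverage` takes the interval function as an argument, the same loop can be reused for any other interval method you write.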


Part G: Real-World Applications ⭐⭐⭐

G.1. Public Health (Maya's Domain): The CDC reports that 19.0% of U.S. adults are current cigarette smokers. Maya surveys a random sample of 600 adults in her county and finds that 132 are current smokers.

(a) Test whether her county's smoking rate differs from the national rate. Use the full five-step procedure.

(b) Construct a 95% Wilson confidence interval for the county's smoking rate.

(c) The county health board asks Maya: "Should we allocate extra resources for smoking cessation programs?" Based on your analysis, what should she recommend?

G.2. Criminal Justice (Washington's Domain): Professor Washington is examining a predictive policing algorithm. The algorithm classified 1,800 individuals as "high risk." Of these, 540 actually committed a new offense within two years (a recidivism rate of 30%).

(a) The algorithm was designed with an expected recidivism rate of 35% among those flagged as "high risk." Test whether the actual rate is lower than expected.

(b) If the algorithm's recidivism rate is lower than expected, what might this mean? (Hint: think about false positives; the algorithm may be flagging too many people.)

(c) Connect this to the positive predictive value concept from Chapter 9. If the algorithm flags people as "high risk" but only 30% actually reoffend, what does that mean for the 70% who were flagged but didn't reoffend?

G.3. A/B Testing (Alex's Domain): Alex runs an A/B test on StreamVibe's homepage. She randomly assigns 5,000 users to the new design and 5,000 to the old design. With the new design, 450 users click the "Sign Up" button (9.0%). With the old design, 375 users click (7.5%).

For now, focus on the new design only (the full two-sample comparison comes in Chapter 16):

(a) The old design's conversion rate of 7.5% is well-established from years of data. Test whether the new design's rate is higher than 7.5%.

(b) Construct a 95% CI for the new design's true conversion rate.

(c) Alex's boss asks: "Is the new design better?" What should Alex say?


Part H: Critical Thinking and Ethics ⭐⭐⭐

H.1. A medical researcher tests a new drug and reports: "The proportion of patients who recovered was statistically significantly higher than the placebo rate ($p = 0.04$)." However, the study used a convenience sample of patients who volunteered for the trial. Discuss at least two concerns with this conclusion.

H.2. A social media company surveys its users and finds that 92% are "satisfied" with the platform. The company reports this with a margin of error of ±1.2%.

(a) What types of bias might affect this result? (Think about who responds to surveys from a company they use.)

(b) Even if the margin of error is technically correct, why might the true satisfaction rate be different?

(c) How would you design a better study to measure true user satisfaction?

H.3. Consider two studies about vaccine hesitancy:

  • Study A: Survey of 200 randomly selected adults in a rural county. Finds 38% are vaccine-hesitant.
  • Study B: Online poll of 10,000 Twitter users who clicked on a vaccine-related article. Finds 55% are vaccine-hesitant.

(a) Compute the margin of error for each study.

(b) Which study gives a more reliable estimate of vaccine hesitancy in the general population? Why?

(c) A news outlet reports: "55% of Americans are vaccine-hesitant, according to a new poll with a margin of error of just ±1%." What's wrong with this reporting?

H.4. The Reproducibility Question. A researcher finds that 24 out of 100 participants in a psychology study exhibit a particular behavior ($\hat{p} = 0.24$). She tests $H_0: p = 0.20$ vs. $H_a: p > 0.20$ with a one-sample z-test and gets a p-value of 0.16. She writes in her paper: "The proportion was directionally consistent with our hypothesis but did not reach significance (p-value = 0.16)." She then tests several subgroups:

  • Women only: 14 out of 45 ($\hat{p} = 0.311$, p-value = 0.03)
  • Men only: 10 out of 55 ($\hat{p} = 0.182$, p-value = 0.63)

She reports: "The effect was significant among women (p-value = 0.03)."

(a) What's wrong with this approach? Connect to p-hacking from Chapter 13.

(b) If the original test was not significant, why might a subgroup test be? Does this make the finding more or less reliable?

(c) What should the researcher have done differently?


Part I: Synthesis and Extension ⭐⭐⭐⭐

I.1. Sample Size Planning. Sam's boss wants to know definitively whether Daria's shooting has improved. Sam determines that Daria needs to take enough additional shots so that if her true shooting percentage is 38% (her current sample proportion), he would reject $H_0: p = 0.31$ at $\alpha = 0.05$.

(a) What sample size would give a margin of error of ±5 percentage points at 95% confidence? (Use $\hat{p} = 0.38$.)

(b) At this sample size, compute the z-statistic that would result from $\hat{p} = 0.38$. Would you reject $H_0$?

(c) This is a preview of power analysis (Chapter 17). In your own words, explain the relationship between sample size, margin of error, and the ability to detect a real effect.
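Before answering part (c), it may help to watch the margin of error and the z-statistic move together as $n$ grows while the observed proportion stays fixed. The sketch below uses hypothetical values ($\hat{p} = 0.55$ against $H_0: p = 0.50$), not Daria's data, and the function names are illustrative.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """95% margin of error; the standard error uses p_hat."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


def z_statistic(p_hat, p0, n):
    """Test statistic for H0: p = p0; the standard error uses p0."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


# Hypothetical: the same observed proportion (55%) at increasing sample sizes
for n in (50, 200, 800):
    moe = margin_of_error(0.55, n)
    z = z_statistic(0.55, 0.50, n)
    print(f"n = {n:4d}  margin of error = {moe:.3f}  z = {z:.2f}")
```

Each fourfold increase in $n$ halves the margin of error and doubles the z-statistic, since both depend on $\sqrt{n}$.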

I.2. Comparing CI Methods at Scale. Write a Python simulation that:

(a) For each of several true proportions ($p = 0.01, 0.05, 0.10, 0.30, 0.50$), generates 10,000 samples of size $n = 40$.

(b) For each sample, computes Wald, Wilson, and plus-four 95% CIs.

(c) Records whether each CI contains the true $p$.

(d) Plots the actual coverage probability vs. the nominal 95% for each method and each $p$ value.

(e) Summarizes: for which values of $p$ does the Wald interval perform worst? Does the Wilson or plus-four method fix the problem?

I.3. The Election Night Challenge. On election night, the following results come in for a governor's race with two candidates:

| Time | Votes Counted | Candidate A | Candidate B |
|---|---|---|---|
| 8:00 PM | 50,000 | 52.1% | 47.9% |
| 9:00 PM | 200,000 | 50.8% | 49.2% |
| 10:00 PM | 500,000 | 49.5% | 50.5% |
| 11:00 PM | 800,000 | 49.8% | 50.2% |

(a) At each time point, can you conclude that either candidate is winning? (Treat the counted votes as a random sample from the total electorate of 2 million.)

(b) Notice that the race tightened over time. What might cause this pattern? (Think about which precincts report first.)

(c) At 11:00 PM, a commentator says: "With 800,000 votes counted, it's clear this is going to be a photo finish." Is "photo finish" supported by the margin of error?

(d) This scenario illustrates why statisticians warn against treating partial vote counts as random samples. Why aren't they truly random?