Exercises: The Bootstrap and Simulation-Based Inference

These exercises progress from conceptual understanding through bootstrap confidence intervals, permutation tests, comparison of methods, and Python implementation. Estimated completion time: 3.5 hours.

Difficulty Guide:

- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)

Part A: Conceptual Understanding ⭐

A.1. Explain in your own words why the bootstrap samples from the data with replacement rather than without replacement. What would happen if you sampled without replacement?

A.2. A friend says: "The bootstrap creates fake data to make up for not having enough real data." Explain why this statement is misleading. What does the bootstrap actually do?

A.3. True or false (explain each):

(a) The bootstrap distribution is centered at the population parameter.

(b) Increasing the number of bootstrap samples ($B$) from 1,000 to 100,000 will make the confidence interval narrower.

(c) A permutation test can be used to construct a confidence interval.

(d) The bootstrap requires that the original data come from a normal distribution.

(e) In a permutation test, the group labels are shuffled with replacement.

(f) If the bootstrap and formula-based CIs for a mean agree closely, this confirms that both methods are valid.

A.4. In one sentence each, state the key difference between:

(a) A bootstrap sample and the original sample

(b) The bootstrap distribution and the sampling distribution

(c) The bootstrap and the permutation test

(d) The percentile method and the basic (reverse percentile) method

A.5. Why is it important that a bootstrap sample is the same size ($n$) as the original sample, rather than larger or smaller? What would go wrong if you drew bootstrap samples of size $2n$?

A.6. Explain why the bootstrap is sometimes called a "plug-in" approach. What is being "plugged in" for what?

Part B: Bootstrap Mechanics ⭐

B.1. Consider the sample: {3, 5, 7, 9, 11}.

(a) Consider the bootstrap samples of size 5 in which the value 7 appears exactly three times. Rather than listing them all, determine how many there are and give two examples.

(b) What is the probability that a single bootstrap sample contains the value 3 at least once?

(c) What is the probability that a single bootstrap sample contains each of the five original values exactly once, that is, reproduces the original sample {3, 5, 7, 9, 11} when order is ignored?

B.2. A researcher has a sample of $n = 20$ observations and generates 10,000 bootstrap samples.

(a) How many bootstrap statistics does the researcher have?

(b) What is the approximate probability that any single bootstrap sample contains the first observation at least once?

(c) On average, how many of the 20 original observations appear in a single bootstrap sample?
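The "at least once" probabilities in B.1(b) and B.2(b) rest on a complement argument: one draw misses a given observation with probability $1 - 1/n$, so all $n$ independent draws miss it with probability $(1 - 1/n)^n$. The sketch below can be used to check a hand calculation. The helper name prob_included is chosen here for illustration and does not come from the chapter.

```python
import math


def prob_included(n: int) -> float:
    """Probability that one specific observation appears at least once
    in a bootstrap sample of size n drawn from n observations."""
    # Complement rule: 1 - P(all n draws miss the observation).
    return 1 - (1 - 1 / n) ** n


for n in (5, 20, 1000):
    print(n, round(prob_included(n), 4))

# For large n the probability approaches 1 - 1/e.
print("limit:", round(1 - math.exp(-1), 4))
```

Comparing the printed values for small and large $n$ shows how quickly the probability settles near its limit.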

B.3. Sketch by hand (or describe in words) what the bootstrap distribution of the median would look like for each of the following samples:

(a) {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} — a uniform sample

(b) {1, 1, 1, 1, 1, 1, 1, 1, 1, 100} — a sample with one extreme outlier

(c) {2, 2, 2, 2, 2} — a sample with no variability

Part C: Bootstrap Confidence Intervals ⭐⭐

C.1. A marine biologist measures the lengths (in cm) of 15 fish caught in a lake:

18.2, 22.5, 19.8, 25.1, 20.3, 31.7, 17.6, 23.4, 21.0, 24.8, 19.5, 26.3, 28.9, 20.1, 22.7

(a) Compute the sample median.

(b) Explain why a $t$-interval would not be appropriate for the median.

(c) Using the bootstrap_ci function from this chapter (or writing your own), compute a 95% bootstrap CI for the median fish length.

(d) Interpret the CI in context.
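If you choose to write your own function for part (c), the following is one possible starting point. It is a minimal sketch of the percentile method and makes several assumptions: the name percentile_bootstrap_ci, its parameters, and the fixed seed are illustrative choices, and the chapter's bootstrap_ci function may differ in its signature and return value.

```python
import numpy as np


def percentile_bootstrap_ci(data, stat_func=np.median, n_bootstrap=10_000,
                            ci_level=0.95, seed=0):
    """Percentile bootstrap CI (hypothetical helper, not the chapter's bootstrap_ci)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.size
    # Each row of idx selects one bootstrap sample: n draws with replacement.
    idx = rng.integers(0, n, size=(n_bootstrap, n))
    boot_stats = np.apply_along_axis(stat_func, 1, data[idx])
    # Cut off alpha/2 of the bootstrap distribution in each tail.
    alpha = 1 - ci_level
    lower, upper = np.percentile(boot_stats,
                                 [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper, boot_stats


fish = [18.2, 22.5, 19.8, 25.1, 20.3, 31.7, 17.6, 23.4, 21.0, 24.8,
        19.5, 26.3, 28.9, 20.1, 22.7]
lower, upper, boot_stats = percentile_bootstrap_ci(fish)
print(f"95% percentile bootstrap CI for the median: ({lower:.1f}, {upper:.1f})")
```

Passing a different stat_func (for example np.mean or np.std) or a different ci_level reuses the same resampling logic for the later exercises in this part. The exact endpoints depend on the seed and on $B$.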

C.2. A consumer researcher collects ratings (on a 1-10 scale) from 25 customers about a new product:

8, 7, 9, 6, 8, 10, 7, 5, 8, 9, 7, 8, 6, 9, 8, 7, 10, 8, 9, 7, 6, 8, 9, 7, 8

(a) Compute the sample mean and the sample standard deviation.

(b) Compute a 95% $t$-interval for the population mean rating.

(c) Compute a 95% bootstrap CI for the population mean rating.

(d) Compare the two CIs. Are they similar? Would you expect them to be? Explain.

(e) Now compute a 95% bootstrap CI for the population standard deviation. Could you have done this with a formula-based approach?

C.3. A sociologist is studying income inequality and collects annual incomes (in thousands of dollars) from 30 randomly selected households:

32, 45, 28, 55, 38, 42, 150, 36, 48, 29, 41, 52, 35, 44, 39, 62, 33, 46, 37, 51, 340, 40, 47, 34, 43, 50, 31, 56, 38, 49

(a) Compute the mean and median. Why are they so different?

(b) The sociologist wants a CI for the median income. Explain why the bootstrap is the right tool here.

(c) Compute the 95% bootstrap CI for the median.

(d) Compute the 95% bootstrap CI for the IQR (interquartile range). Interpret it.

(e) Would you trust a $t$-interval for the mean with this data? Why or why not?

C.4. A data scientist wants a 90% bootstrap CI instead of a 95% CI. What percentiles of the bootstrap distribution should they use? How would this affect the width of the interval?

Part D: Permutation Tests ⭐⭐

D.1. A teacher wants to test whether a new teaching method improves test scores. She randomly assigns 10 students to the new method and 10 to the traditional method:

New method: 85, 78, 92, 88, 76, 95, 83, 90, 87, 81
Traditional: 72, 80, 68, 75, 82, 70, 77, 74, 79, 71

(a) State the null and alternative hypotheses.

(b) Compute the observed difference in means.

(c) Describe in words how you would conduct a permutation test.

(d) Conduct the permutation test with 10,000 shuffles. Report the p-value.

(e) Compare your result to the p-value from Welch's $t$-test. Do the conclusions agree?
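For part (d), the shuffling procedure can be sketched as follows. This is a bare-bones two-sided version under stated assumptions: the function name permutation_p_value, the fixed seed, and the use of the plain proportion $k/B$ as the p-value are illustrative choices, not the chapter's reference implementation. Exercise F.1 asks for a fuller function with one-sided alternatives.

```python
import numpy as np


def permutation_p_value(group1, group2, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    observed = g1.mean() - g2.mean()
    pooled = np.concatenate([g1, g2])
    n1 = g1.size
    null_diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffle the pooled values WITHOUT replacement, then split them
        # into two groups of the original sizes.
        shuffled = rng.permutation(pooled)
        null_diffs[i] = shuffled[:n1].mean() - shuffled[n1:].mean()
    # Fraction of shuffled differences at least as extreme as the observed one.
    p_value = np.mean(np.abs(null_diffs) >= abs(observed))
    return observed, p_value


new_method = [85, 78, 92, 88, 76, 95, 83, 90, 87, 81]
traditional = [72, 80, 68, 75, 82, 70, 77, 74, 79, 71]
observed, p_value = permutation_p_value(new_method, traditional)
print(f"observed difference = {observed:.1f}, two-sided p-value = {p_value:.4f}")
```

Because the shuffles are random, the reported p-value varies slightly with the seed and with the number of shuffles.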

D.2. An ecologist measures the growth (in cm) of plants under two conditions — sunlight and shade:

Sunlight: 12.1, 15.3, 11.8, 14.2, 13.5, 16.0, 12.9
Shade: 8.4, 10.2, 9.1, 7.8, 11.0, 9.5, 8.9

(a) These are small samples ($n_1 = n_2 = 7$). Why might a permutation test be preferable to a $t$-test here?

(b) Conduct a permutation test. Report the two-sided p-value.

(c) What does the permutation null distribution look like? Is it approximately normal?

(d) Given the small sample sizes, how confident are you in the result?

D.3. A social scientist wants to test whether median household income differs between two neighborhoods. Why might a permutation test on the difference in medians (rather than the difference in means) be more appropriate? Conduct the test using the following data:

Neighborhood A (n=12): 42, 55, 38, 65, 48, 52, 45, 210, 50, 43, 57, 47
Neighborhood B (n=10): 35, 42, 38, 40, 105, 37, 44, 39, 41, 36

D.4. In a permutation test with $n_1 = 8$ and $n_2 = 8$, how many distinct permutations are possible? (Hint: think about $\binom{16}{8}$.) Why do we typically use 10,000 random shuffles rather than all possible permutations?
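The count in the hint can be evaluated directly with the standard library. The short sketch below (the helper name count_label_splits is an illustrative choice) also evaluates the same count for the group sizes used in D.1, D.2, and H.1, which may help with the second half of the question.

```python
import math


def count_label_splits(n1: int, n2: int) -> int:
    """Number of distinct ways to assign n1 of the n1 + n2 pooled values to group 1."""
    return math.comb(n1 + n2, n1)


for n1, n2 in [(7, 7), (8, 8), (10, 10), (20, 22)]:
    print(f"n1 = {n1:2d}, n2 = {n2:2d}: {count_label_splits(n1, n2):,} splits")
```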

Part E: Comparing Methods ⭐⭐

E.1. For each scenario, indicate whether you would use: (i) a formula-based method, (ii) the bootstrap, (iii) a permutation test, or (iv) any of the above. Justify your choice.

(a) A 95% CI for the mean exam score with $n = 200$ and approximately normal data.

(b) A 95% CI for the median salary with $n = 50$ and right-skewed data.

(c) Testing whether two groups differ in mean response time with $n_1 = 12$, $n_2 = 14$, and non-normal data.

(d) A 95% CI for the correlation between height and weight with $n = 30$.

(e) Testing whether the proportion of defective items differs between two factories with $n_1 = 500$ and $n_2 = 500$.

(f) A 95% CI for the ratio of two group means with $n_1 = 25$ and $n_2 = 30$.

E.2. A researcher computes both a $t$-interval and a bootstrap CI for the mean of a large ($n = 500$), approximately normal dataset. The $t$-interval is (45.2, 48.8) and the bootstrap CI is (45.1, 48.9).

(a) Should the researcher be concerned that the intervals aren't identical?

(b) What does the close agreement tell you?

(c) If the intervals were very different, what might that indicate?

E.3. Consider the following dataset of 8 observations: {2, 3, 5, 8, 12, 15, 45, 120}.

(a) Compute the $t$-interval for the mean.

(b) Compute the bootstrap CI for the mean.

(c) Compute the bootstrap CI for the median.

(d) Which of these three intervals best describes the "center" of this data? Why?

(e) What does this example illustrate about the limitations of formula-based methods?

Part F: Python Implementation ⭐⭐⭐

F.1. Write a Python function permutation_test(group1, group2, n_perm=10000, alternative='two-sided') that:

- Computes the observed difference in means
- Generates the null distribution by shuffling
- Returns a dictionary with 'observed_diff', 'p_value', and 'null_distribution'
- Supports three alternatives: 'two-sided', 'greater', 'less'

Test your function on the teaching method data from D.1.

F.2. Write a Python function bootstrap_ci_two_groups(group1, group2, stat_func, n_bootstrap=10000, ci_level=0.95) that computes a bootstrap CI for the difference in a statistic between two groups. The stat_func should accept an array and return a number (e.g., np.mean, np.median).

Use your function to compute:

(a) A 95% CI for the difference in means between the plant groups in D.2.

(b) A 95% CI for the difference in medians between the neighborhoods in D.3.

F.3. Simulation study: Compare the coverage of the $t$-interval and the bootstrap CI.

(a) Generate 1,000 samples of size $n = 15$ from an exponential distribution with $\lambda = 1$ (i.e., np.random.exponential(1, 15)).

(b) For each sample, compute a 95% $t$-interval for the mean and a 95% bootstrap CI for the mean (use $B = 2{,}000$ for speed).

(c) Check how many of the 1,000 $t$-intervals contain the true mean ($\mu = 1$). Check how many bootstrap CIs contain the true mean.

(d) Which method achieves closer to 95% coverage? Why?
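One way to organize the simulation in parts (a) through (c) is sketched below. It is a skeleton under several assumptions: the function name coverage_study and the seed are illustrative, the bootstrap interval uses the percentile method, and the $t$ critical value is hard-coded as the approximate 97.5th percentile of the $t$ distribution with 14 degrees of freedom (about 2.145), so the code is only valid for $n = 15$.

```python
import numpy as np

N = 15               # sample size from part (a)
T_CRIT_DF14 = 2.145  # approx. 97.5th percentile of t with N - 1 = 14 df


def coverage_study(n_sims=1000, n_bootstrap=2000, true_mean=1.0, seed=0):
    """Estimate coverage of the 95% t-interval and the 95% percentile bootstrap CI."""
    rng = np.random.default_rng(seed)
    t_hits = 0
    boot_hits = 0
    for _ in range(n_sims):
        sample = rng.exponential(1.0, N)

        # Formula-based t-interval for the mean.
        xbar = sample.mean()
        se = sample.std(ddof=1) / np.sqrt(N)
        t_lo, t_hi = xbar - T_CRIT_DF14 * se, xbar + T_CRIT_DF14 * se
        t_hits += int(t_lo <= true_mean <= t_hi)

        # Percentile bootstrap CI for the mean (rows are bootstrap samples).
        idx = rng.integers(0, N, size=(n_bootstrap, N))
        boot_means = sample[idx].mean(axis=1)
        b_lo, b_hi = np.percentile(boot_means, [2.5, 97.5])
        boot_hits += int(b_lo <= true_mean <= b_hi)
    return t_hits / n_sims, boot_hits / n_sims


t_cov, boot_cov = coverage_study()
print(f"t-interval coverage: {t_cov:.3f}, bootstrap coverage: {boot_cov:.3f}")
```

The two coverage estimates are themselves subject to simulation error, so small differences between them should be interpreted with care when answering part (d).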

Part G: Conceptual Challenges ⭐⭐⭐

G.1. A dataset has $n = 100$ observations. You generate 10,000 bootstrap samples. Your bootstrap CI for the mean is (24.3, 28.7). Your colleague tells you: "10,000 isn't enough. If you used 100,000 bootstrap samples, the CI would be narrower." Explain why your colleague is wrong. What would make the CI narrower?

G.2. A medical researcher has a sample of 8 patients. She computes a bootstrap CI for the median survival time and gets (3.2, 18.7) months — an extremely wide interval.

(a) Is this a failure of the bootstrap method, or is the bootstrap telling the researcher something important?

(b) What would you recommend the researcher do?

(c) If the researcher increased $B$ from 10,000 to 1,000,000, would the CI get narrower?

G.3. Consider a permutation test where the observed difference in means is 5.2, and 3 out of 10,000 permuted differences exceed 5.2. A student reports the p-value as $3/10{,}000 = 0.0003$.

(a) Is this p-value exact or approximate?

(b) How would the p-value change if we used 100,000 permutations instead? Would it be meaningfully different?

(c) Some statisticians recommend reporting the p-value as $(k + 1)/(B + 1)$ rather than $k/B$. Why? (Hint: think about what happens when $k = 0$.)
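To explore part (c) numerically, the two formulas can be compared side by side. This is a small illustrative sketch; the function names are chosen here and are not from the chapter.

```python
def p_value_plain(k: int, B: int) -> float:
    """Proportion of the B shuffles that were at least as extreme: k / B."""
    return k / B


def p_value_plus_one(k: int, B: int) -> float:
    """Adjusted version that counts the observed arrangement as one of the shuffles."""
    return (k + 1) / (B + 1)


B = 10_000
for k in (3, 0):
    print(f"k = {k}: k/B = {p_value_plain(k, B):.6f}, "
          f"(k+1)/(B+1) = {p_value_plus_one(k, B):.6f}")
```

Look in particular at the row for $k = 0$ and consider what each reported value would claim about the evidence against $H_0$.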

G.4. Explain the connection between the permutation test and Fisher's exact test (which you may or may not have encountered). Both involve considering all possible ways data could be arranged under $H_0$. How are they related?

Part H: Applications and Interpretation ⭐⭐⭐

H.1. (Maya's follow-up) Maya wants to compare median wait times between two shifts — the morning shift and the evening shift. She has the following data:

Morning (n=20): 12, 18, 15, 45, 22, 8, 35, 14, 25, 11, 19, 28, 16, 52, 20, 13, 31, 17, 24, 10
Evening (n=22): 25, 38, 42, 18, 55, 30, 22, 48, 35, 28, 62, 20, 45, 33, 27, 50, 15, 40, 32, 58, 26, 44

(a) Compute the observed difference in medians (evening - morning).

(b) Conduct a permutation test on the difference in medians (not means). Report the p-value.

(c) Compute a bootstrap CI for the difference in medians.

(d) Write a paragraph for Maya's hospital report interpreting the results. Address whether median wait times differ significantly between shifts.

H.2. (Alex's follow-up) Alex's StreamVibe team wants to estimate the ratio of session lengths between premium and free users.

Premium (n=25): mean = 52.3 min, data available
Free (n=30): mean = 34.8 min, data available

(a) Why would a CI for the ratio (premium/free) be more useful than a CI for the difference in this context?

(b) Compute a bootstrap CI for the ratio of mean session lengths.

(c) If the CI is (1.2, 1.8), what does this tell Alex in business terms?

H.3. (Sam's follow-up) Sam wants to compare Daria's shooting consistency across seasons, not just her average. He defines consistency as the standard deviation of game-by-game shooting percentages.

Last season SD: 12.3 percentage points (across 25 games)
This season SD: 9.8 percentage points (across 18 games)

(a) Explain why there's no simple formula-based CI for the ratio of two standard deviations (without assuming normality).

(b) How would you use the bootstrap to address Sam's question?

(c) What would a bootstrap CI of (0.55, 1.05) for the ratio $SD_{this}/SD_{last}$ tell Sam about Daria's consistency?

Part I: Synthesis ⭐⭐⭐⭐

I.1. (Research design) Design a study where the bootstrap would be clearly superior to formula-based methods. Specify:

- The research question
- The population and sampling method
- The statistic of interest
- Why formula-based methods would fail or be unreliable
- How you would implement the bootstrap analysis

I.2. (Historical context) The bootstrap was introduced in 1979 but didn't become widely used until the 1990s. Why? What technological development was necessary for the bootstrap to become practical? How has the increasing availability of computing power changed the way statisticians think about inference?

I.3. (Connections) The bootstrap, the CLT (Chapter 11), and the law of large numbers (Chapter 8) all involve repeated sampling. Write a one-paragraph comparison explaining:

- What each method "repeats" and why
- What convergence result each method relies on
- How each contributes to the framework of statistical inference

I.4. (Critical thinking) A data scientist claims: "Now that we have the bootstrap, there's no reason to teach formula-based inference. It's a relic of the pre-computer era." Write a two-paragraph response either agreeing or disagreeing with this claim, drawing on specific examples from this chapter and previous chapters.