Case Study 2: Alex's Permutation Test — A/B Testing Without Assumptions
The Setup
Alex Rivera's life at StreamVibe revolves around A/B tests. In Chapter 16, we analyzed a large-scale test (247 vs. 253 users) where the data were right-skewed but the sample sizes were large enough for the CLT to make the sampling distribution of the mean approximately normal. The two-sample $t$-test worked beautifully.
But StreamVibe doesn't always have the luxury of large samples. The company has just rolled out a new feature — "Watch Together," a social viewing mode where friends can watch videos simultaneously. The product team wants to test whether Watch Together increases session length. But the feature was released to a limited beta group:
- Control group: 18 users with the standard experience
- Treatment group: 16 users with Watch Together enabled
The sample sizes are small. The data, as is typical for streaming session lengths, are right-skewed with a few very long sessions. Alex isn't sure the $t$-test is reliable here: the usual rule of thumb calls for $n \geq 30$ before the CLT can absorb moderate skewness, and these data might be more than moderately skewed.
Alex decides to run both a $t$-test and a permutation test, comparing the results.
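Before looking at the data, it's worth quantifying Alex's worry. The simulation below is our own sketch, using hypothetical exponential session lengths (not StreamVibe data): it draws many pairs of small, right-skewed samples from the *same* distribution and counts how often a Welch $t$-test at $\alpha = 0.05$ falsely rejects.

```python
import numpy as np
from scipy import stats

# Hypothetical check: with NO true effect, how often does the t-test reject?
rng = np.random.default_rng(0)
n_sims = 5000
rejections = 0
for _ in range(n_sims):
    a = rng.exponential(scale=30, size=18)   # stand-in for the control group
    b = rng.exponential(scale=30, size=16)   # stand-in for the treatment group
    _, p = stats.ttest_ind(b, a, equal_var=False)
    rejections += (p < 0.05)
print(f"Empirical type I error: {rejections / n_sims:.3f} (nominal: 0.050)")
```

If the empirical rate lands near 0.05, the $t$-test is holding up despite the skewness; a substantial deviation would be a concrete reason to prefer the permutation test.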
The Data
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
np.random.seed(2025)
# Control group: standard StreamVibe experience (18 users)
# Right-skewed: most users watch 20-40 min, some binge for 80+ min
control = np.array([
22.3, 35.1, 18.7, 42.8, 28.4, 15.2, 31.6, 55.0, 20.1,
25.9, 33.2, 19.4, 27.5, 38.3, 24.0, 88.5, 16.8, 29.7
])
# Treatment group: Watch Together enabled (16 users)
# Higher sessions overall, still right-skewed
treatment = np.array([
30.4, 45.2, 25.8, 48.1, 38.9, 22.5, 42.7, 65.3,
35.0, 32.1, 50.6, 28.3, 40.2, 110.4, 29.1, 55.8
])
print("=" * 60)
print("StreamVibe Watch Together A/B Test — Summary Statistics")
print("=" * 60)
print(f"\n{'Statistic':<20} {'Control (n=18)':<20} {'Treatment (n=16)'}")
print("-" * 60)
print(f"{'Mean':<20} {np.mean(control):<20.1f} {np.mean(treatment):.1f}")
print(f"{'Median':<20} {np.median(control):<20.1f} {np.median(treatment):.1f}")
print(f"{'SD':<20} {np.std(control, ddof=1):<20.1f} "
f"{np.std(treatment, ddof=1):.1f}")
print(f"{'Min':<20} {np.min(control):<20.1f} {np.min(treatment):.1f}")
print(f"{'Max':<20} {np.max(control):<20.1f} {np.max(treatment):.1f}")
print(f"{'Skewness':<20} {stats.skew(control):<20.2f} "
f"{stats.skew(treatment):.2f}")
observed_diff = np.mean(treatment) - np.mean(control)
print(f"\nObserved difference in means: {observed_diff:.1f} minutes")
print(f" (Treatment is {observed_diff:.1f} min longer than Control)")
Checking t-Test Conditions
Before choosing a method, let's check whether the $t$-test conditions are met:
print("=" * 60)
print("Condition Check for Two-Sample t-Test")
print("=" * 60)
# 1. Independence
print("\n1. Independence:")
print(" Users were randomly assigned to groups. ✓")
print(" Each user's session is independent of others. ✓")
# 2. Normality / sample size
print("\n2. Normality / Sample Size:")
print(f" Control: n = {len(control)}, skewness = {stats.skew(control):.2f}")
print(f" Treatment: n = {len(treatment)}, skewness = {stats.skew(treatment):.2f}")
print(" Both n < 30 — CLT may not fully compensate for skewness.")
# Shapiro-Wilk tests
_, p_control = stats.shapiro(control)
_, p_treatment = stats.shapiro(treatment)
print(f" Shapiro-Wilk p-value (Control): {p_control:.4f}")
print(f" Shapiro-Wilk p-value (Treatment): {p_treatment:.4f}")
if p_control < 0.05 or p_treatment < 0.05:
    print(" ⚠ At least one group shows significant non-normality.")
else:
    print(" Neither group shows significant non-normality at α=0.05,")
    print(" but with small n, Shapiro-Wilk has low power.")
# Visual check
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(control, bins=8, edgecolor='black', alpha=0.7, color='lightblue')
axes[0].set_title(f'Control (n={len(control)})')
axes[0].set_xlabel('Session Length (min)')
axes[1].hist(treatment, bins=8, edgecolor='black', alpha=0.7, color='lightyellow')
axes[1].set_title(f'Treatment (n={len(treatment)})')
axes[1].set_xlabel('Session Length (min)')
plt.suptitle('Distribution of Session Lengths by Group', fontsize=13)
plt.tight_layout()
plt.show()
print("\n3. Equal variances:")
print(f" SD ratio: {np.std(treatment, ddof=1)/np.std(control, ddof=1):.2f}")
print(" Use Welch's t-test (does not assume equal variances). ✓")
print("\nVerdict: t-test is *borderline* — small n with skewed data.")
print("The permutation test provides a useful robustness check.")
The Two Approaches: Head to Head
Approach 1: Welch's t-Test
print("=" * 60)
print("APPROACH 1: Welch's t-Test")
print("=" * 60)
t_stat, t_pvalue_two = stats.ttest_ind(treatment, control,
                                       equal_var=False,
                                       alternative='two-sided')
_, t_pvalue_one = stats.ttest_ind(treatment, control,
                                  equal_var=False,
                                  alternative='greater')
# CI for the difference
from scipy.stats import t as t_dist
se_diff = np.sqrt(np.var(treatment, ddof=1)/len(treatment) +
                  np.var(control, ddof=1)/len(control))
# Welch-Satterthwaite degrees of freedom
num = (np.var(treatment, ddof=1)/len(treatment) +
       np.var(control, ddof=1)/len(control))**2
denom = ((np.var(treatment, ddof=1)/len(treatment))**2 / (len(treatment)-1) +
         (np.var(control, ddof=1)/len(control))**2 / (len(control)-1))
df_welch = num / denom
t_star = t_dist.ppf(0.975, df=df_welch)
ci_t_lower = observed_diff - t_star * se_diff
ci_t_upper = observed_diff + t_star * se_diff
print(f"\n Test statistic: t = {t_stat:.3f}")
print(f" Degrees of freedom: {df_welch:.1f}")
print(f" p-value (two-sided): {t_pvalue_two:.4f}")
print(f" p-value (one-sided): {t_pvalue_one:.4f}")
print(f" 95% CI for diff: ({ci_t_lower:.1f}, {ci_t_upper:.1f}) min")
print(f" SE of difference: {se_diff:.1f} min")
Approach 2: Permutation Test
print("\n" + "=" * 60)
print("APPROACH 2: Permutation Test")
print("=" * 60)
np.random.seed(42)
n_perm = 10000
combined = np.concatenate([control, treatment])
n_control = len(control)
perm_diffs = np.zeros(n_perm)
for i in range(n_perm):
    shuffled = np.random.permutation(combined)
    perm_control = shuffled[:n_control]
    perm_treatment = shuffled[n_control:]
    perm_diffs[i] = np.mean(perm_treatment) - np.mean(perm_control)
# P-values
p_one_sided = np.mean(perm_diffs >= observed_diff)
p_two_sided = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff))
print(f"\n Observed difference: {observed_diff:.1f} min")
print(f" p-value (one-sided): {p_one_sided:.4f}")
print(f" p-value (two-sided): {p_two_sided:.4f}")
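SciPy (1.8 and later) also ships a built-in `stats.permutation_test` that automates the shuffling loop. This sketch assumes a recent SciPy; because it generates its own shuffles, the p-value will differ slightly from the manual loop but should be close.

```python
import numpy as np
from scipy import stats

control = np.array([22.3, 35.1, 18.7, 42.8, 28.4, 15.2, 31.6, 55.0, 20.1,
                    25.9, 33.2, 19.4, 27.5, 38.3, 24.0, 88.5, 16.8, 29.7])
treatment = np.array([30.4, 45.2, 25.8, 48.1, 38.9, 22.5, 42.7, 65.3,
                      35.0, 32.1, 50.6, 28.3, 40.2, 110.4, 29.1, 55.8])

def mean_diff(x, y):
    return np.mean(x) - np.mean(y)

# alternative='greater' matches the one-sided p-value computed above
res = stats.permutation_test((treatment, control), mean_diff,
                             permutation_type='independent',
                             alternative='greater',
                             n_resamples=10000,
                             random_state=42)
print(f"SciPy permutation p-value (one-sided): {res.pvalue:.4f}")
```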
Approach 3: Bootstrap CI for the Difference
print("\n" + "=" * 60)
print("APPROACH 3: Bootstrap CI for the Difference")
print("=" * 60)
np.random.seed(42)
n_boot = 10000
boot_diffs = np.zeros(n_boot)
for i in range(n_boot):
    boot_control = np.random.choice(control, size=len(control), replace=True)
    boot_treatment = np.random.choice(treatment, size=len(treatment),
                                      replace=True)
    boot_diffs[i] = np.mean(boot_treatment) - np.mean(boot_control)
ci_boot_lower = np.percentile(boot_diffs, 2.5)
ci_boot_upper = np.percentile(boot_diffs, 97.5)
boot_se = np.std(boot_diffs)
print(f"\n Bootstrap SE: {boot_se:.1f} min")
print(f" 95% Bootstrap CI: ({ci_boot_lower:.1f}, {ci_boot_upper:.1f}) min")
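SciPy 1.7+ also offers `stats.bootstrap`, which can reproduce the percentile interval without the explicit loop. This is a sketch assuming a recent SciPy; `method='percentile'` matches the manual approach above, and the endpoints will differ slightly because the resamples differ.

```python
import numpy as np
from scipy import stats

control = np.array([22.3, 35.1, 18.7, 42.8, 28.4, 15.2, 31.6, 55.0, 20.1,
                    25.9, 33.2, 19.4, 27.5, 38.3, 24.0, 88.5, 16.8, 29.7])
treatment = np.array([30.4, 45.2, 25.8, 48.1, 38.9, 22.5, 42.7, 65.3,
                      35.0, 32.1, 50.6, 28.3, 40.2, 110.4, 29.1, 55.8])

def mean_diff(x, y):
    return np.mean(x) - np.mean(y)

# Each group is resampled independently (paired=False is the default)
res = stats.bootstrap((treatment, control), mean_diff,
                      n_resamples=10000, method='percentile',
                      random_state=42)
ci = res.confidence_interval
print(f"SciPy 95% percentile CI: ({ci.low:.1f}, {ci.high:.1f}) min")
```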
Comparison of Results
print("\n" + "=" * 60)
print("COMPARISON OF ALL THREE APPROACHES")
print("=" * 60)
print(f"\n{'Method':<30} {'p-value (1-sided)':<20} {'95% CI'}")
print("-" * 70)
print(f"{'Welch t-test':<30} {t_pvalue_one:<20.4f} "
f"({ci_t_lower:.1f}, {ci_t_upper:.1f})")
print(f"{'Permutation test':<30} {p_one_sided:<20.4f} {'N/A (use bootstrap)'}")
print(f"{'Bootstrap CI':<30} {'N/A (use perm test)':<20} "
f"({ci_boot_lower:.1f}, {ci_boot_upper:.1f})")
Visualization: The Null Distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: Permutation null distribution
axes[0].hist(perm_diffs, bins=50, edgecolor='black', alpha=0.7,
color='lightgreen')
axes[0].axvline(observed_diff, color='red', linewidth=2.5, linestyle='--',
label=f'Observed = {observed_diff:.1f} min')
axes[0].axvline(-observed_diff, color='red', linewidth=2.5, linestyle=':',
alpha=0.5, label=f'-Observed = {-observed_diff:.1f} min')
count_extreme = np.sum(np.abs(perm_diffs) >= np.abs(observed_diff))
axes[0].set_xlabel('Difference in Means (Treatment - Control)')
axes[0].set_ylabel('Frequency')
axes[0].set_title(f'Permutation Null Distribution\n'
f'(two-sided p = {p_two_sided:.4f})')
axes[0].legend()
# Right: Bootstrap distribution of the difference
axes[1].hist(boot_diffs, bins=50, edgecolor='black', alpha=0.7,
color='steelblue')
axes[1].axvline(ci_boot_lower, color='red', linewidth=2, linestyle='--',
label=f'CI lower = {ci_boot_lower:.1f}')
axes[1].axvline(ci_boot_upper, color='red', linewidth=2, linestyle='--',
label=f'CI upper = {ci_boot_upper:.1f}')
axes[1].axvline(0, color='gray', linewidth=1.5, linestyle=':',
label='No difference')
axes[1].set_xlabel('Difference in Means (Treatment - Control)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Bootstrap Distribution of the Difference')
axes[1].legend()
plt.tight_layout()
plt.show()
Interpretation
The three methods tell a consistent story:
- Welch's $t$-test gives a p-value and CI based on the $t$-distribution, which assumes approximate normality of the sampling distribution.
- The permutation test gives a p-value based on direct simulation of the null hypothesis — no distributional assumptions needed. It answers: "If Watch Together had no effect, how often would we see a difference this large by chance?"
- The bootstrap CI shows the range of plausible values for the true difference — again, without distributional assumptions.
The close agreement between methods is itself informative. It tells Alex that the $t$-test result isn't being distorted by the skewness and small sample sizes. If the methods had disagreed substantially, that would be a warning sign that the $t$-test assumptions were being violated.
The Permutation Test on Medians
Alex also runs a permutation test comparing medians rather than means — since the data are skewed, the median difference might be more representative:
# Permutation test on difference in medians
observed_median_diff = np.median(treatment) - np.median(control)
print(f"Observed difference in medians: {observed_median_diff:.1f} minutes")
perm_median_diffs = np.zeros(n_perm)
for i in range(n_perm):
    shuffled = np.random.permutation(combined)
    perm_median_diffs[i] = (np.median(shuffled[n_control:]) -
                            np.median(shuffled[:n_control]))
p_median_two = np.mean(np.abs(perm_median_diffs) >=
                       np.abs(observed_median_diff))
print(f"Permutation p-value (medians, two-sided): {p_median_two:.4f}")
This is something the standard $t$-test simply cannot do: there is no standard formula-based test for the difference in medians. The permutation test handles it effortlessly — just replace np.mean with np.median in the shuffling loop.
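Alex's shuffling loop can be wrapped into a small reusable function so that any summary statistic plugs in. The helper name below is our own illustration, not part of Alex's code:

```python
import numpy as np

def permutation_pvalue(x, y, stat_fn, n_perm=10000, seed=42):
    """Two-sided permutation p-value for stat_fn(y) - stat_fn(x)."""
    rng = np.random.default_rng(seed)
    combined = np.concatenate([x, y])
    observed = stat_fn(y) - stat_fn(x)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(combined)
        # First len(x) shuffled values play the role of x, the rest of y
        diffs[i] = stat_fn(shuffled[len(x):]) - stat_fn(shuffled[:len(x)])
    return np.mean(np.abs(diffs) >= np.abs(observed))

control = np.array([22.3, 35.1, 18.7, 42.8, 28.4, 15.2, 31.6, 55.0, 20.1,
                    25.9, 33.2, 19.4, 27.5, 38.3, 24.0, 88.5, 16.8, 29.7])
treatment = np.array([30.4, 45.2, 25.8, 48.1, 38.9, 22.5, 42.7, 65.3,
                      35.0, 32.1, 50.6, 28.3, 40.2, 110.4, 29.1, 55.8])

for name, fn in [("mean", np.mean), ("median", np.median)]:
    p = permutation_pvalue(control, treatment, fn)
    print(f"Two-sided permutation p ({name}): {p:.4f}")
```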
Alex's Report to the Product Team
Watch Together A/B Test — Statistical Analysis
We tested whether the Watch Together feature increases average session length, using 18 control users and 16 treatment users over a 14-day period.
Results: Treatment users watched an average of approximately 12 minutes longer per session than control users. Welch's $t$-test and a permutation test (10,000 shuffles) produced consistent p-values, indicating that a difference this large is unlikely to be due to chance alone.
The 95% bootstrap confidence interval for the difference in mean session length was approximately (1, 23) minutes, suggesting the true effect could be as small as 1 minute or as large as 23 minutes.
Robustness check: Because the sample sizes are small and the data are right-skewed, we supplemented the standard $t$-test with simulation-based methods (permutation test and bootstrap). All three approaches yielded consistent conclusions, increasing our confidence in the result.
Recommendation: The Watch Together feature appears to increase session length. However, the wide confidence interval suggests substantial uncertainty about the magnitude of the effect. We recommend expanding the test to 200+ users per group before a full rollout decision.
Connection to Chapter 16
In Chapter 16, Alex's A/B test had 247 + 253 = 500 users. The CLT was clearly in play, and the $t$-test was entirely appropriate. The permutation test and bootstrap aren't improvements in that scenario — they're alternatives that give the same answer.
In this case study, with only 34 total users, the simulation-based methods provide a valuable safety net. The fact that all methods agree is reassuring. If they had disagreed, Alex would have been wise to trust the permutation test and bootstrap over the $t$-test, because they make fewer assumptions.
Key Takeaway: The permutation test and bootstrap don't replace the $t$-test — they complement it. Use them to verify formula-based results when conditions are questionable, and use them instead when formula-based methods aren't available (like testing differences in medians).
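The "same answer at large $n$" claim is easy to verify directly. The simulation below is our own sketch with hypothetical exponential session lengths (250 users per group, a small assumed effect), comparing the two p-values side by side:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.exponential(scale=30, size=250)   # hypothetical control sessions
b = rng.exponential(scale=33, size=250)   # hypothetical treatment sessions

# Welch t-test p-value (two-sided)
_, p_t = stats.ttest_ind(b, a, equal_var=False)

# Permutation p-value (two-sided), same data
obs = np.mean(b) - np.mean(a)
combined = np.concatenate([a, b])
diffs = np.empty(5000)
for i in range(5000):
    s = rng.permutation(combined)
    diffs[i] = np.mean(s[250:]) - np.mean(s[:250])
p_perm = np.mean(np.abs(diffs) >= np.abs(obs))

print(f"t-test p = {p_t:.3f}, permutation p = {p_perm:.3f}")
```

With samples this large, the two p-values should land very close together, which is exactly the Chapter 16 situation.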
Questions for Discussion
- If the $t$-test and the permutation test had produced very different p-values, which would you trust more? Why?
- The product team asks: "Can we use the permutation test for all our A/B tests from now on, even the big ones?" What would you advise?
- The bootstrap CI for the difference is (1, 23) minutes — quite wide. What drives this width? How could Alex narrow it?
- Alex tested both the difference in means and the difference in medians. When would the median-based test be more appropriate? When would the mean-based test be better?
- StreamVibe runs hundreds of A/B tests per year. If they use a permutation test for each at $\alpha = 0.05$, how does the multiple comparisons problem (previewed in Chapter 13) apply? Does the permutation test offer any protection against this?