Case Study 2: Alex's Permutation Test — A/B Testing Without Assumptions

The Setup

Alex Rivera's life at StreamVibe revolves around A/B tests. In Chapter 16, we analyzed a large-scale test (247 vs. 253 users) where the data were right-skewed but the sample sizes were large enough for the CLT to make the sampling distribution of the mean approximately normal. The two-sample $t$-test worked beautifully.

But StreamVibe doesn't always have the luxury of large samples. The company has just rolled out a new feature — "Watch Together," a social viewing mode where friends can watch videos simultaneously. The product team wants to test whether Watch Together increases session length. But the feature was released to a limited beta group:

  • Control group: 18 users with the standard experience
  • Treatment group: 16 users with Watch Together enabled

The sample sizes are small. The data, as is typical for streaming session lengths, are right-skewed with a few very long sessions. Alex isn't sure the $t$-test is reliable here: the usual $n \geq 30$ rule of thumb covers only moderate skewness, and these data might be more than moderately skewed.
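Alex's worry can be made concrete with a quick simulation. The population below is a hypothetical lognormal, chosen only because it is right-skewed like session lengths (it is not fit to StreamVibe data); the point is that averaging $n = 16$ values shrinks skewness but does not remove it.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(x):
    """Sample skewness (population formula), to avoid a SciPy dependency."""
    z = (np.asarray(x, dtype=float) - np.mean(x)) / np.std(x)
    return float(np.mean(z ** 3))

# Hypothetical right-skewed "population" of session lengths (minutes)
population = rng.lognormal(mean=3.3, sigma=0.5, size=100_000)

# Sampling distribution of the mean for samples of n = 16
sample_means = rng.choice(population, size=(10_000, 16)).mean(axis=1)

print(f"Skewness of raw sessions:        {skewness(population):.2f}")
print(f"Skewness of sample means (n=16): {skewness(sample_means):.2f}")
```

The sampling distribution at $n = 16$ is noticeably less skewed than the raw data but still visibly right-skewed, which is exactly why Alex hesitates to lean on the $t$-distribution alone.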

Alex decides to run both a $t$-test and a permutation test, comparing the results.

The Data

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(2025)

# Control group: standard StreamVibe experience (18 users)
# Right-skewed: most users watch 20-40 min, some binge for 80+ min
control = np.array([
    22.3, 35.1, 18.7, 42.8, 28.4, 15.2, 31.6, 55.0, 20.1,
    25.9, 33.2, 19.4, 27.5, 38.3, 24.0, 88.5, 16.8, 29.7
])

# Treatment group: Watch Together enabled (16 users)
# Higher sessions overall, still right-skewed
treatment = np.array([
    30.4, 45.2, 25.8, 48.1, 38.9, 22.5, 42.7, 65.3,
    35.0, 32.1, 50.6, 28.3, 40.2, 110.4, 29.1, 55.8
])

print("=" * 60)
print("StreamVibe Watch Together A/B Test — Summary Statistics")
print("=" * 60)
print(f"\n{'Statistic':<20} {'Control (n=18)':<20} {'Treatment (n=16)'}")
print("-" * 60)
print(f"{'Mean':<20} {np.mean(control):<20.1f} {np.mean(treatment):.1f}")
print(f"{'Median':<20} {np.median(control):<20.1f} {np.median(treatment):.1f}")
print(f"{'SD':<20} {np.std(control, ddof=1):<20.1f} "
      f"{np.std(treatment, ddof=1):.1f}")
print(f"{'Min':<20} {np.min(control):<20.1f} {np.min(treatment):.1f}")
print(f"{'Max':<20} {np.max(control):<20.1f} {np.max(treatment):.1f}")
print(f"{'Skewness':<20} {stats.skew(control):<20.2f} "
      f"{stats.skew(treatment):.2f}")

observed_diff = np.mean(treatment) - np.mean(control)
print(f"\nObserved difference in means: {observed_diff:.1f} minutes")
print(f"  (Treatment is {observed_diff:.1f} min longer than Control)")

Checking t-Test Conditions

Before choosing a method, let's check whether the $t$-test conditions are met:

print("=" * 60)
print("Condition Check for Two-Sample t-Test")
print("=" * 60)

# 1. Independence
print("\n1. Independence:")
print("   Users were randomly assigned to groups. ✓")
print("   Each user's session is independent of others. ✓")

# 2. Normality / sample size
print("\n2. Normality / Sample Size:")
print(f"   Control:   n = {len(control)}, skewness = {stats.skew(control):.2f}")
print(f"   Treatment: n = {len(treatment)}, skewness = {stats.skew(treatment):.2f}")
print("   Both n < 30 — CLT may not fully compensate for skewness.")

# Shapiro-Wilk tests
_, p_control = stats.shapiro(control)
_, p_treatment = stats.shapiro(treatment)
print(f"   Shapiro-Wilk p-value (Control):   {p_control:.4f}")
print(f"   Shapiro-Wilk p-value (Treatment): {p_treatment:.4f}")
if p_control < 0.05 or p_treatment < 0.05:
    print("   ⚠ At least one group shows significant non-normality.")
else:
    print("   Neither group shows significant non-normality at α=0.05,")
    print("   but with small n, Shapiro-Wilk has low power.")

# Visual check
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(control, bins=8, edgecolor='black', alpha=0.7, color='lightblue')
axes[0].set_title(f'Control (n={len(control)})')
axes[0].set_xlabel('Session Length (min)')
axes[1].hist(treatment, bins=8, edgecolor='black', alpha=0.7, color='lightyellow')
axes[1].set_title(f'Treatment (n={len(treatment)})')
axes[1].set_xlabel('Session Length (min)')
plt.suptitle('Distribution of Session Lengths by Group', fontsize=13)
plt.tight_layout()
plt.show()

print("\n3. Equal variances:")
print(f"   SD ratio: {np.std(treatment, ddof=1)/np.std(control, ddof=1):.2f}")
print("   Use Welch's t-test (does not assume equal variances). ✓")

print("\nVerdict: t-test is *borderline* — small n with skewed data.")
print("The permutation test provides a useful robustness check.")

The Two Approaches: Head to Head

Approach 1: Welch's t-Test

print("=" * 60)
print("APPROACH 1: Welch's t-Test")
print("=" * 60)

t_stat, t_pvalue_two = stats.ttest_ind(treatment, control,
                                        equal_var=False,
                                        alternative='two-sided')
_, t_pvalue_one = stats.ttest_ind(treatment, control,
                                   equal_var=False,
                                   alternative='greater')

# CI for the difference
from scipy.stats import t as t_dist

se_diff = np.sqrt(np.var(treatment, ddof=1)/len(treatment) +
                  np.var(control, ddof=1)/len(control))

# Welch-Satterthwaite degrees of freedom
num = (np.var(treatment, ddof=1)/len(treatment) +
       np.var(control, ddof=1)/len(control))**2
denom = ((np.var(treatment, ddof=1)/len(treatment))**2 / (len(treatment)-1) +
         (np.var(control, ddof=1)/len(control))**2 / (len(control)-1))
df_welch = num / denom

t_star = t_dist.ppf(0.975, df=df_welch)
ci_t_lower = observed_diff - t_star * se_diff
ci_t_upper = observed_diff + t_star * se_diff

print(f"\n  Test statistic:   t = {t_stat:.3f}")
print(f"  Degrees of freedom: {df_welch:.1f}")
print(f"  p-value (two-sided): {t_pvalue_two:.4f}")
print(f"  p-value (one-sided): {t_pvalue_one:.4f}")
print(f"  95% CI for diff:   ({ci_t_lower:.1f}, {ci_t_upper:.1f}) min")
print(f"  SE of difference:  {se_diff:.1f} min")

Approach 2: Permutation Test

print("\n" + "=" * 60)
print("APPROACH 2: Permutation Test")
print("=" * 60)

np.random.seed(42)
n_perm = 10000
combined = np.concatenate([control, treatment])
n_control = len(control)

perm_diffs = np.zeros(n_perm)
for i in range(n_perm):
    shuffled = np.random.permutation(combined)
    perm_control = shuffled[:n_control]
    perm_treatment = shuffled[n_control:]
    perm_diffs[i] = np.mean(perm_treatment) - np.mean(perm_control)

# P-values
p_one_sided = np.mean(perm_diffs >= observed_diff)
p_two_sided = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff))

print(f"\n  Observed difference: {observed_diff:.1f} min")
print(f"  p-value (one-sided): {p_one_sided:.4f}")
print(f"  p-value (two-sided): {p_two_sided:.4f}")
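As a cross-check, recent SciPy versions (1.8+) ship `scipy.stats.permutation_test`, which implements the same shuffling idea. One caveat: SciPy counts the observed statistic as part of the null distribution, so its p-value can differ slightly from the hand-rolled proportion above. A minimal sketch on the same data:

```python
import numpy as np
from scipy import stats

# Same beta-group data as above
control = np.array([
    22.3, 35.1, 18.7, 42.8, 28.4, 15.2, 31.6, 55.0, 20.1,
    25.9, 33.2, 19.4, 27.5, 38.3, 24.0, 88.5, 16.8, 29.7
])
treatment = np.array([
    30.4, 45.2, 25.8, 48.1, 38.9, 22.5, 42.7, 65.3,
    35.0, 32.1, 50.6, 28.3, 40.2, 110.4, 29.1, 55.8
])

def mean_diff(x, y):
    return np.mean(x) - np.mean(y)

res = stats.permutation_test((treatment, control), mean_diff,
                             permutation_type='independent',
                             n_resamples=10_000,
                             alternative='greater')
print(f"statistic = {res.statistic:.1f} min, p-value = {res.pvalue:.4f}")
```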

Approach 3: Bootstrap CI for the Difference

print("\n" + "=" * 60)
print("APPROACH 3: Bootstrap CI for the Difference")
print("=" * 60)

np.random.seed(42)
n_boot = 10000
boot_diffs = np.zeros(n_boot)

for i in range(n_boot):
    boot_control = np.random.choice(control, size=len(control), replace=True)
    boot_treatment = np.random.choice(treatment, size=len(treatment),
                                       replace=True)
    boot_diffs[i] = np.mean(boot_treatment) - np.mean(boot_control)

ci_boot_lower = np.percentile(boot_diffs, 2.5)
ci_boot_upper = np.percentile(boot_diffs, 97.5)
boot_se = np.std(boot_diffs)

print(f"\n  Bootstrap SE:        {boot_se:.1f} min")
print(f"  95% Bootstrap CI:    ({ci_boot_lower:.1f}, {ci_boot_upper:.1f}) min")
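SciPy (1.7+) also provides `scipy.stats.bootstrap`, which can reproduce this percentile interval; with two unpaired samples it resamples each group independently, just like the loop above. A sketch on the same data:

```python
import numpy as np
from scipy import stats

# Same beta-group data as above
control = np.array([
    22.3, 35.1, 18.7, 42.8, 28.4, 15.2, 31.6, 55.0, 20.1,
    25.9, 33.2, 19.4, 27.5, 38.3, 24.0, 88.5, 16.8, 29.7
])
treatment = np.array([
    30.4, 45.2, 25.8, 48.1, 38.9, 22.5, 42.7, 65.3,
    35.0, 32.1, 50.6, 28.3, 40.2, 110.4, 29.1, 55.8
])

def mean_diff(x, y):
    return np.mean(x) - np.mean(y)

res = stats.bootstrap((treatment, control), mean_diff,
                      n_resamples=10_000,
                      method='percentile',
                      vectorized=False)
ci = res.confidence_interval
print(f"95% Bootstrap CI: ({ci.low:.1f}, {ci.high:.1f}) min")
```

Exact endpoints vary with the resampling seed, but they should land close to the hand-rolled interval above.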

Comparison of Results

print("\n" + "=" * 60)
print("COMPARISON OF ALL THREE APPROACHES")
print("=" * 60)

print(f"\n{'Method':<30} {'p-value (1-sided)':<20} {'95% CI'}")
print("-" * 70)
print(f"{'Welch t-test':<30} {t_pvalue_one:<20.4f} "
      f"({ci_t_lower:.1f}, {ci_t_upper:.1f})")
print(f"{'Permutation test':<30} {p_one_sided:<20.4f} {'N/A (use bootstrap)'}")
print(f"{'Bootstrap CI':<30} {'N/A (use perm test)':<20} "
      f"({ci_boot_lower:.1f}, {ci_boot_upper:.1f})")

Visualization: The Null Distribution

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Permutation null distribution
axes[0].hist(perm_diffs, bins=50, edgecolor='black', alpha=0.7,
             color='lightgreen')
axes[0].axvline(observed_diff, color='red', linewidth=2.5, linestyle='--',
                label=f'Observed = {observed_diff:.1f} min')
axes[0].axvline(-observed_diff, color='red', linewidth=2.5, linestyle=':',
                alpha=0.5, label=f'-Observed = {-observed_diff:.1f} min')
count_extreme = np.sum(np.abs(perm_diffs) >= np.abs(observed_diff))
axes[0].set_xlabel('Difference in Means (Treatment - Control)')
axes[0].set_ylabel('Frequency')
axes[0].set_title(f'Permutation Null Distribution\n'
                   f'({count_extreme:,} of {n_perm:,} shuffles as extreme; '
                   f'two-sided p = {p_two_sided:.4f})')
axes[0].legend()

# Right: Bootstrap distribution of the difference
axes[1].hist(boot_diffs, bins=50, edgecolor='black', alpha=0.7,
             color='steelblue')
axes[1].axvline(ci_boot_lower, color='red', linewidth=2, linestyle='--',
                label=f'CI lower = {ci_boot_lower:.1f}')
axes[1].axvline(ci_boot_upper, color='red', linewidth=2, linestyle='--',
                label=f'CI upper = {ci_boot_upper:.1f}')
axes[1].axvline(0, color='gray', linewidth=1.5, linestyle=':',
                label='No difference')
axes[1].set_xlabel('Difference in Means (Treatment - Control)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Bootstrap Distribution of the Difference')
axes[1].legend()

plt.tight_layout()
plt.show()

Interpretation

The three methods tell a consistent story:

  1. Welch's $t$-test gives a p-value and CI based on the $t$-distribution, which assumes approximate normality of the sampling distribution.

  2. The permutation test gives a p-value based on direct simulation of the null hypothesis — no distributional assumptions needed. It answers: "If Watch Together had no effect, how often would we see a difference this large by chance?"

  3. The bootstrap CI shows the range of plausible values for the true difference — again, without distributional assumptions.

The close agreement between methods is itself informative. It tells Alex that the $t$-test result isn't being distorted by the skewness and small sample sizes. If the methods had disagreed substantially, that would be a warning sign that the $t$-test assumptions were being violated.

The Permutation Test on Medians

Alex also runs a permutation test comparing medians rather than means — since the data are skewed, the median difference might be more representative:

# Permutation test on difference in medians
observed_median_diff = np.median(treatment) - np.median(control)
print(f"Observed difference in medians: {observed_median_diff:.1f} minutes")

perm_median_diffs = np.zeros(n_perm)
for i in range(n_perm):
    shuffled = np.random.permutation(combined)
    perm_median_diffs[i] = (np.median(shuffled[n_control:]) -
                             np.median(shuffled[:n_control]))

p_median_two = np.mean(np.abs(perm_median_diffs) >=
                        np.abs(observed_median_diff))
print(f"Permutation p-value (medians, two-sided): {p_median_two:.4f}")

This is something the standard $t$-test simply cannot do: there is no comparably simple formula-based test for the difference in medians. The permutation test handles it effortlessly — just swap np.mean for np.median in the shuffling loop.
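That flexibility can be packaged once. The helper below is a sketch (the name perm_test and its arguments are mine, not part of the chapter's code) that accepts any summary statistic: means, medians, or even a trimmed mean.

```python
import numpy as np

# Same beta-group data as above
control = np.array([
    22.3, 35.1, 18.7, 42.8, 28.4, 15.2, 31.6, 55.0, 20.1,
    25.9, 33.2, 19.4, 27.5, 38.3, 24.0, 88.5, 16.8, 29.7
])
treatment = np.array([
    30.4, 45.2, 25.8, 48.1, 38.9, 22.5, 42.7, 65.3,
    35.0, 32.1, 50.6, 28.3, 40.2, 110.4, 29.1, 55.8
])

def perm_test(group1, group2, statistic, n_perm=10_000, seed=42):
    """Two-sided permutation test of statistic(group2) - statistic(group1)."""
    rng = np.random.default_rng(seed)
    combined = np.concatenate([group1, group2])
    n1 = len(group1)
    observed = statistic(group2) - statistic(group1)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(combined)
        # Relabel: first n1 shuffled values play group1, the rest group2
        diffs[i] = statistic(shuffled[n1:]) - statistic(shuffled[:n1])
    return observed, np.mean(np.abs(diffs) >= np.abs(observed))

for stat in (np.mean, np.median):
    obs, p = perm_test(control, treatment, stat)
    print(f"{stat.__name__:>6}: diff = {obs:.1f} min, two-sided p = {p:.4f}")
```

Only the statistic argument changes between the two runs; the shuffling logic, and therefore the logic of the test, stays identical.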

Alex's Report to the Product Team

Watch Together A/B Test — Statistical Analysis

We tested whether the Watch Together feature increases average session length, using 18 control users and 16 treatment users over a 14-day period.

Results: Treatment users watched an average of approximately 12 minutes longer per session than control users. Both the Welch's $t$-test and a permutation test (10,000 shuffles) produced p-values consistent with each other, indicating that this difference is unlikely to be due to chance alone.

The 95% bootstrap confidence interval for the difference in mean session length was approximately (1, 23) minutes, suggesting the true effect could be as small as 1 minute or as large as 23 minutes.

Robustness check: Because the sample sizes are small and the data are right-skewed, we supplemented the standard $t$-test with simulation-based methods (permutation test and bootstrap). All three approaches yielded consistent conclusions, increasing our confidence in the result.

Recommendation: The Watch Together feature appears to increase session length. However, the wide confidence interval suggests substantial uncertainty about the magnitude of the effect. We recommend expanding the test to 200+ users per group before a full rollout decision.

Connection to Chapter 16

In Chapter 16, Alex's A/B test had 247 + 253 = 500 users. The CLT was clearly in play, and the $t$-test was entirely appropriate. The permutation test and bootstrap aren't improvements in that scenario — they're alternatives that give the same answer.

In this case study, with only 34 total users, the simulation-based methods provide a valuable safety net. The fact that all methods agree is reassuring. If they had disagreed, Alex would have been wise to trust the permutation test and bootstrap over the $t$-test, because they make fewer assumptions.

Key Takeaway: The permutation test and bootstrap don't replace the $t$-test — they complement it. Use them to verify formula-based results when conditions are questionable, and use them instead when formula-based methods aren't available (like testing differences in medians).

Questions for Discussion

  1. If Alex's A/B test had shown very different p-values from the $t$-test versus the permutation test, which would you trust more? Why?

  2. The product team asks: "Can we use the permutation test for all our A/B tests from now on, even the big ones?" What would you advise?

  3. The bootstrap CI for the difference is (1, 23) minutes — quite wide. What drives this width? How could Alex narrow it?

  4. Alex tested both the difference in means and the difference in medians. When would the median-based test be more appropriate? When would the mean-based test be better?

  5. StreamVibe runs hundreds of A/B tests per year. If they use a permutation test for each at $\alpha = 0.05$, how does the multiple comparisons problem (previewed in Chapter 13) apply? Does the permutation test offer any protection against this?