Case Study 1: Alex's A/B Test — Was It Practically Significant?
The Setup
Alex Rivera is standing in front of StreamVibe's executive team. The A/B test results are in: the new recommendation algorithm increased average watch time by 4.5 minutes per session, and the result was statistically significant ($p = 0.012$). The data science team is excited. The engineering VP has already drafted a deployment timeline.
But Alex has just finished Chapter 17, and something doesn't feel right. Before recommending a full deployment to 12 million users, Alex wants to answer a question the p-value can't: is 4.5 minutes actually a big deal?
The Analysis
Alex pulls up the full A/B test data and runs a comprehensive analysis.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower
import matplotlib.pyplot as plt
np.random.seed(42)
# ============================================================
# StreamVibe A/B Test — Complete Effect Size Analysis
# ============================================================
# Simulate the same data as Chapter 16
old_algo = np.random.gamma(shape=5.2, scale=8.13, size=247)
new_algo = np.random.gamma(shape=4.9, scale=9.55, size=253)
# Descriptive statistics
mean_old = np.mean(old_algo)
mean_new = np.mean(new_algo)
sd_old = np.std(old_algo, ddof=1)
sd_new = np.std(new_algo, ddof=1)
n_old = len(old_algo)
n_new = len(new_algo)
print("=" * 60)
print("STREAMVIBE A/B TEST — COMPREHENSIVE ANALYSIS")
print("=" * 60)
# 1. Point estimate and CI
diff = mean_new - mean_old
se = np.sqrt(sd_old**2/n_old + sd_new**2/n_new)
# 95% CI using the normal (z) approximation; with ~250 observations per
# group, the z critical value 1.96 is nearly identical to the exact t value
ci_low = diff - 1.96 * se
ci_high = diff + 1.96 * se
print(f"\n1. POINT ESTIMATE AND CONFIDENCE INTERVAL")
print(f" Mean difference: {diff:.2f} minutes")
print(f" 95% CI: ({ci_low:.2f}, {ci_high:.2f}) minutes")
print(f" Interpretation: The true difference is plausibly")
print(f" as small as {ci_low:.1f} min or as large as {ci_high:.1f} min")
# 2. Effect sizes
s_pooled = np.sqrt(((n_old - 1) * sd_old**2 + (n_new - 1) * sd_new**2) /
(n_old + n_new - 2))
d = diff / s_pooled
t_stat, p_value = stats.ttest_ind(new_algo, old_algo, equal_var=False)
# Pooled-df approximation for r²; the Welch test itself uses
# Satterthwaite df, which differs only slightly at these sample sizes
df_approx = n_old + n_new - 2
r_sq = t_stat**2 / (t_stat**2 + df_approx)
print(f"\n2. EFFECT SIZES")
print(f" Cohen's d: {d:.3f}")
print(f" Interpretation: {'Small' if abs(d) < 0.5 else 'Medium' if abs(d) < 0.8 else 'Large'} effect")
print(f" r²: {r_sq:.4f} ({r_sq * 100:.1f}% of variance explained)")
print(f" Pooled SD: {s_pooled:.2f} minutes")
# 3. Significance test
print(f"\n3. SIGNIFICANCE TEST")
print(f" t-statistic: {t_stat:.3f}")
print(f" p-value: {p_value:.4f}")
print(f" Significant at α=0.05? {'Yes' if p_value < 0.05 else 'No'}")
# 4. Power analysis
power_analysis = TTestIndPower()
achieved_power = power_analysis.solve_power(
effect_size=abs(d), nobs1=n_old, alpha=0.05,
ratio=n_new/n_old, alternative='two-sided'
)
print(f"\n4. POWER ANALYSIS")
print(f" Achieved power: {achieved_power:.3f} ({achieved_power*100:.1f}%)")
print(f" Adequate (≥80%)? {'Yes' if achieved_power >= 0.80 else 'No'}")
# 5. Minimum detectable effect with this sample
min_d = power_analysis.solve_power(
nobs1=n_old, alpha=0.05, power=0.80,
ratio=n_new/n_old, alternative='two-sided'
)
print(f" Minimum detectable effect (80% power): d = {min_d:.3f}")
print(f" In minutes: {min_d * s_pooled:.2f} minutes")
The Business Context
Alex prepares a table that translates the statistical findings into business language:
# ============================================================
# BUSINESS IMPACT ANALYSIS
# ============================================================
print(f"\n{'='*60}")
print("BUSINESS IMPACT TRANSLATION")
print(f"{'='*60}")
active_users = 12_000_000
sessions_per_week = 3
weeks_per_year = 52
# Using the point estimate
annual_minutes_pe = diff * active_users * sessions_per_week * weeks_per_year
# Using the CI bounds
annual_minutes_low = ci_low * active_users * sessions_per_week * weeks_per_year
annual_minutes_high = ci_high * active_users * sessions_per_week * weeks_per_year
print(f"\nScenario: 12M users × 3 sessions/week × 52 weeks")
print(f"\nAdditional viewing minutes per year:")
print(f" Point estimate: {annual_minutes_pe/1e9:.2f} billion minutes")
print(f" Lower bound: {annual_minutes_low/1e9:.2f} billion minutes")
print(f" Upper bound: {annual_minutes_high/1e9:.2f} billion minutes")
# Revenue impact (hypothetical: $0.02 per viewing minute in ad revenue)
rev_per_minute = 0.02
print(f"\nEstimated annual revenue impact (at ${rev_per_minute} per minute):")
print(f" Point estimate: ${annual_minutes_pe * rev_per_minute / 1e6:.1f}M")
print(f" Lower bound: ${annual_minutes_low * rev_per_minute / 1e6:.1f}M")
print(f" Upper bound: ${annual_minutes_high * rev_per_minute / 1e6:.1f}M")
# Deployment cost (hypothetical)
deployment_cost = 2_000_000 # $2M
print(f"\nDeployment cost: ${deployment_cost/1e6:.1f}M")
print(f"\nROI (point estimate): {(annual_minutes_pe * rev_per_minute - deployment_cost) / deployment_cost * 100:.0f}%")
print(f"ROI (lower bound): {(annual_minutes_low * rev_per_minute - deployment_cost) / deployment_cost * 100:.0f}%")
The Visualization
Alex creates a visualization that communicates both the statistical and practical significance:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Panel 1: Distribution overlap
x = np.linspace(0, 120, 300)
y_old = stats.gaussian_kde(old_algo)(x)
y_new = stats.gaussian_kde(new_algo)(x)
axes[0].fill_between(x, y_old, alpha=0.3, color='steelblue', label='Old Algorithm')
axes[0].fill_between(x, y_new, alpha=0.3, color='coral', label='New Algorithm')
axes[0].plot(x, y_old, color='steelblue', linewidth=2)
axes[0].plot(x, y_new, color='coral', linewidth=2)
axes[0].axvline(mean_old, color='steelblue', linestyle='--', linewidth=1.5)
axes[0].axvline(mean_new, color='coral', linestyle='--', linewidth=1.5)
axes[0].set_xlabel('Watch Time (minutes)', fontsize=12)
axes[0].set_ylabel('Density', fontsize=12)
axes[0].set_title(f'Distribution Overlap (d = {abs(d):.2f})', fontsize=13)
axes[0].legend(fontsize=11)
axes[0].annotate(f'{diff:.1f} min\ndifference',
xy=((mean_old + mean_new)/2, max(max(y_old), max(y_new)) * 0.8),
fontsize=11, ha='center',
bbox=dict(boxstyle='round,pad=0.3', facecolor='lightyellow'))
# Panel 2: Effect size with CI
axes[1].barh(['Difference\n(minutes)'], [diff], xerr=[[diff - ci_low], [ci_high - diff]],
color='coral', alpha=0.7, capsize=8, height=0.3, edgecolor='darkred')
axes[1].axvline(0, color='black', linestyle='-', linewidth=1)
axes[1].set_xlabel('Difference in Watch Time (minutes)', fontsize=12)
axes[1].set_title(f'Effect Size with 95% CI\np = {p_value:.3f}, d = {abs(d):.2f}', fontsize=13)
axes[1].set_xlim(-2, 12)
# Add annotations
axes[1].annotate(f'CI: ({ci_low:.1f}, {ci_high:.1f})',
xy=(diff, 0.2), fontsize=11, ha='center',
bbox=dict(boxstyle='round,pad=0.3', facecolor='lightyellow'))
plt.tight_layout()
plt.savefig('alex_effect_size_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
Alex's Presentation
Armed with these numbers, Alex presents the following to the executive team:
"The new algorithm produces a statistically significant increase in watch time ($p = 0.012$). However, the effect is small by statistical standards — Cohen's d = 0.23, explaining about 1.3% of the variance in watch time. Individual users won't notice a difference.
"But here's the key: at our scale of 12 million users, even a small per-user effect translates to billions of additional viewing minutes and tens of millions in potential revenue. The effect is small at the individual level but massive at the aggregate level.
"My recommendation: deploy the new algorithm, but in phases. Start with 10% of users and monitor for two weeks. Look for unintended side effects — reduced content diversity, increased viewer fatigue, or drops in user satisfaction scores. If the effect holds and no problems emerge, roll out to all users over the next month.
"One caveat: our study had 76% power, slightly below the 80% convention. This means there's a 24% chance we would have missed the effect if it exists. The confidence interval for the difference ranges from 1.0 to 8.0 minutes — that's a wide range. The true benefit could be as small as 1 minute per session (modest but still valuable at scale) or as large as 8 minutes (transformative). We should design the phased rollout to narrow this estimate."
The Key Lesson
Alex's analysis demonstrates a crucial nuance: practical significance depends on context.
A Cohen's d of 0.23 is "small" by generic benchmarks. In a clinical trial, it might be too small to justify the side effects of a new drug. In an educational intervention, it might be too small to justify the cost of retraining teachers.
But at StreamVibe's scale, it's worth millions of dollars. The same effect size can be trivial in one context and transformative in another. This is why Cohen himself warned against applying his benchmarks mechanically.
The complete statistical report — effect size, confidence interval, power analysis, and business impact — gives decision-makers the information they need. The p-value alone ($p = 0.012$) would have told them almost nothing useful.
Discussion Questions
- How would your recommendation change if the confidence interval were (0.1, 8.9) minutes instead of (1.0, 8.0)? What if it were (3.5, 5.5)?
- Alex's analysis focused on watch time. What other metrics should StreamVibe track before and during the phased rollout? (Think about content diversity, user satisfaction, and potential harms.)
- If the phased rollout reveals that the effect is real for Premium users ($d = 0.4$) but nearly zero for Free users ($d = 0.05$), how should Alex report this? Is the overall $d = 0.23$ misleading?
- A competitor reports a $d = 0.6$ effect for their algorithm change, but they tested it on only 50 users. Would you trust their result more or less than Alex's? Why?
- Alex used Cohen's d, which is designed for normally distributed data. StreamVibe's watch time data is right-skewed (gamma distribution). Does this affect the interpretation of the effect size? What alternatives might be more appropriate?
Connection to Chapter 1
In Chapter 1, we introduced Alex Rivera as a marketing analyst at StreamVibe who wanted to know if a new recommendation algorithm increased watch time. At the time, we described the A/B test in general terms and promised that statistical tools would provide the answer.
In Chapter 16, we ran the hypothesis test and got the p-value. But now we see that the p-value was only part of the answer. The complete answer requires the effect size, the confidence interval, the power analysis, and the business context. Alex's question was never really "Is the difference statistically significant?" It was "Should we deploy this algorithm?" — and that question requires the full toolkit of this chapter.