Case Study 1: Alex's A/B Test — The Full Analysis

The Setup

This is the moment we've been building toward since Chapter 1.

Alex Rivera, marketing analyst at StreamVibe, has been running an A/B test for the past six weeks. StreamVibe's data science team developed a new recommendation algorithm — one that uses collaborative filtering combined with a content-based model — and they want to know whether it increases average watch time compared to the old algorithm.

The stakes are high. StreamVibe serves 12 million active users. A change in average watch time of even a few minutes translates to millions of dollars in advertising revenue and subscription retention. But changing the algorithm also carries risk: if the new algorithm is worse, or if it introduces unexpected side effects (like reducing content diversity), the rollback cost is substantial.

Alex needs the analysis to be airtight.

The Study Design

The engineering team randomly assigned 500 users to two groups:

  • Control group (old algorithm): 247 users continued to see recommendations from the existing algorithm
  • Treatment group (new algorithm): 253 users received recommendations from the new algorithm

Random assignment was stratified by user tier (Free, Basic, Premium) to ensure each tier was equally represented in both groups. Users were not told which algorithm they were seeing (single-blind design). The primary metric: average watch time per session, measured over 30 days.

Connection to Chapter 4: This is a randomized controlled experiment with single-blinding and stratified randomization — arguably the gold standard of study design. Because users were randomly assigned, any confounding variables (device type, time zone, viewing preferences, account age) should be approximately balanced between the two groups. This means that if we find a significant difference, we can attribute it causally to the algorithm change.
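The stratified assignment described above is simple to implement: shuffle users within each tier, then split each tier roughly in half. The sketch below uses made-up tier counts (250 Free, 150 Basic, 100 Premium) purely for illustration — StreamVibe's actual tier proportions aren't given.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical user table: ids with tiers (tier counts are made up)
tiers = np.repeat(["Free", "Basic", "Premium"], [250, 150, 100])
user_ids = np.arange(len(tiers))

# Stratified randomization: shuffle within each tier, split in half
assignments = {}
for tier in ["Free", "Basic", "Premium"]:
    ids = rng.permutation(user_ids[tiers == tier])
    half = len(ids) // 2
    for uid in ids[:half]:
        assignments[uid] = "control"
    for uid in ids[half:]:
        assignments[uid] = "treatment"

# Each tier is now (near-)equally represented in both groups
for tier in ["Free", "Basic", "Premium"]:
    ids = user_ids[tiers == tier]
    n_control = sum(assignments[uid] == "control" for uid in ids)
    print(f"{tier}: control={n_control}, treatment={len(ids) - n_control}")
```

Because the split happens inside each tier, tier membership cannot be confounded with group assignment, which is the whole point of stratifying.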

The Data

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# StreamVibe A/B test results
np.random.seed(42)

# Simulate realistic streaming session data (right-skewed)
# Old algorithm: gamma(shape=5.2, scale=8.13) → mean ≈ 42.3, SD ≈ 18.5
old_algo = np.random.gamma(shape=5.2, scale=8.13, size=247)
# New algorithm: gamma(shape=4.9, scale=9.55) → mean ≈ 46.8, SD ≈ 21.1
new_algo = np.random.gamma(shape=4.9, scale=9.55, size=253)

print("=" * 50)
print("StreamVibe A/B Test — Summary Statistics")
print("=" * 50)
print(f"\nOld Algorithm (Control):")
print(f"  n = {len(old_algo)}")
print(f"  Mean = {np.mean(old_algo):.2f} minutes")
print(f"  Median = {np.median(old_algo):.2f} minutes")
print(f"  SD = {np.std(old_algo, ddof=1):.2f} minutes")
print(f"  Min = {np.min(old_algo):.1f}, Max = {np.max(old_algo):.1f}")

print(f"\nNew Algorithm (Treatment):")
print(f"  n = {len(new_algo)}")
print(f"  Mean = {np.mean(new_algo):.2f} minutes")
print(f"  Median = {np.median(new_algo):.2f} minutes")
print(f"  SD = {np.std(new_algo, ddof=1):.2f} minutes")
print(f"  Min = {np.min(new_algo):.1f}, Max = {np.max(new_algo):.1f}")

print(f"\nObserved Difference (New - Old):")
print(f"  {np.mean(new_algo) - np.mean(old_algo):.2f} minutes")

Step 1: Explore and Visualize

Before running any test, Alex examines the distributions.

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# Overlaid histograms
axes[0].hist(old_algo, bins=25, alpha=0.6, color='steelblue',
             label=f'Old (n={len(old_algo)})', density=True)
axes[0].hist(new_algo, bins=25, alpha=0.6, color='coral',
             label=f'New (n={len(new_algo)})', density=True)
axes[0].axvline(np.mean(old_algo), color='steelblue', linestyle='--',
                linewidth=2)
axes[0].axvline(np.mean(new_algo), color='coral', linestyle='--',
                linewidth=2)
axes[0].set_xlabel('Watch Time (minutes)')
axes[0].set_ylabel('Density')
axes[0].set_title('Distribution of Watch Times')
axes[0].legend()

# Side-by-side box plots
axes[1].boxplot([old_algo, new_algo],
                labels=['Old Algorithm', 'New Algorithm'])
axes[1].set_ylabel('Watch Time (minutes)')
axes[1].set_title('Box Plot Comparison')

# QQ-plots for normality check
stats.probplot(old_algo, dist="norm", plot=axes[2])
axes[2].set_title('QQ-Plot: Old Algorithm')

plt.tight_layout()
plt.show()

# Skewness check
print(f"Old algorithm skewness: {stats.skew(old_algo):.3f}")
print(f"New algorithm skewness: {stats.skew(new_algo):.3f}")

Key observations:

  • Both distributions are right-skewed (positive skewness), as expected for streaming session times — most sessions are moderate, with some very long binge-watching sessions pulling the right tail.
  • The new algorithm's distribution appears shifted slightly to the right.
  • There are some high outliers in both groups, but nothing extreme.

Does the skewness matter? No. With $n_1 = 247$ and $n_2 = 253$, the Central Limit Theorem ensures that the sampling distribution of the difference in means is approximately normal, regardless of the skewness of the individual distributions. This is exactly the robustness we discussed in Chapter 15 — and it's why tech companies can run t-tests on heavily skewed session data without worry.
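A quick simulation makes this concrete. Drawing many samples of $n = 250$ from a right-skewed gamma population (parameters mirroring the simulation above) shows that the sample means are far less skewed than the raw observations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Right-skewed population, similar to the watch-time simulation
shape, scale, n = 5.2, 8.13, 250

# 10,000 samples of size n; one sample mean per row
draws = rng.gamma(shape, scale, size=(10_000, n))
sample_means = draws.mean(axis=1)

print(f"Skewness of raw observations:     {stats.skew(draws.ravel()):.3f}")
print(f"Skewness of sample means (n={n}): {stats.skew(sample_means):.3f}")
```

The theoretical skewness of a gamma distribution is $2/\sqrt{\text{shape}} \approx 0.88$ here, but the skewness of the sample means shrinks by a factor of roughly $\sqrt{n}$ — which is why the t-test behaves well despite the skewed data.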

Step 2: Check All Conditions

print("=== Condition Checks ===")
print()

# 1. Independence between groups
print("1. Independence (between groups):")
print("   Users were RANDOMLY ASSIGNED to algorithms. ✓")
print("   (Stratified by tier — Free/Basic/Premium)")
print()

# 2. Independence within groups
total_users = 12_000_000
print("2. Independence (within groups):")
print(f"   10% condition: {len(old_algo)} < {total_users * 0.10:,.0f} ✓")
print(f"   10% condition: {len(new_algo)} < {total_users * 0.10:,.0f} ✓")
print(f"   Each user counted once (one session average per user). ✓")
print()

# 3. Normality
print("3. Normality (sampling distribution):")
print(f"   n_old = {len(old_algo)} ≥ 30 ✓ (CLT applies)")
print(f"   n_new = {len(new_algo)} ≥ 30 ✓ (CLT applies)")
print(f"   Both groups have skewness < 2 ({stats.skew(old_algo):.2f}, "
      f"{stats.skew(new_algo):.2f}). ✓")
print()
print("All conditions met. Welch's two-sample t-test is appropriate.")

Step 3: Conduct the Test

# Welch's two-sample t-test
result = stats.ttest_ind(new_algo, old_algo, equal_var=False)

print("=== Welch's Two-Sample t-Test ===")
print()
print(f"Hypotheses:")
print(f"  H₀: μ_new − μ_old = 0 (no difference)")
print(f"  Hₐ: μ_new − μ_old ≠ 0 (algorithms differ)")
print()
print(f"Test statistic: t = {result.statistic:.4f}")
print(f"P-value (two-tailed): p = {result.pvalue:.4f}")
print()

# Also try one-tailed (does new algorithm INCREASE watch time?)
result_one = stats.ttest_ind(new_algo, old_algo, equal_var=False,
                              alternative='greater')
print(f"One-tailed test (Hₐ: μ_new > μ_old):")
print(f"  p-value = {result_one.pvalue:.4f}")

Step 4: Confidence Interval

# 95% CI for the difference in means
diff = np.mean(new_algo) - np.mean(old_algo)
se = np.sqrt(np.var(old_algo, ddof=1)/len(old_algo)
             + np.var(new_algo, ddof=1)/len(new_algo))

# Welch-Satterthwaite df
num = (np.var(old_algo, ddof=1)/len(old_algo)
       + np.var(new_algo, ddof=1)/len(new_algo))**2
denom = ((np.var(old_algo, ddof=1)/len(old_algo))**2 / (len(old_algo)-1)
         + (np.var(new_algo, ddof=1)/len(new_algo))**2 / (len(new_algo)-1))
df_welch = num / denom

t_star = stats.t.ppf(0.975, df_welch)
margin = t_star * se
ci_lower = diff - margin
ci_upper = diff + margin

print(f"\n=== Confidence Interval ===")
print(f"Observed difference: {diff:.2f} minutes")
print(f"Standard error: {se:.4f}")
print(f"Welch df: {df_welch:.1f}")
print(f"t* (95%): {t_star:.4f}")
print(f"Margin of error: {margin:.2f} minutes")
print(f"95% CI for μ_new − μ_old: ({ci_lower:.2f}, {ci_upper:.2f})")
print(f"CI contains 0? {'Yes' if ci_lower <= 0 <= ci_upper else 'No'}")

Step 5: Interpret the Results

Statistical conclusion: At $\alpha = 0.05$, we reject $H_0$. There is statistically significant evidence that average watch time differs between the two algorithms. The new algorithm produced higher average watch time.

Effect size: The observed difference is approximately 4.5 minutes per session. The 95% confidence interval suggests the true difference is plausibly between about 1 and 8 minutes.
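A standardized effect size is a useful complement to the raw difference in minutes. Cohen's d isn't part of the analysis above, but it's a quick addition — a sketch reusing the same seeded simulation:

```python
import numpy as np

# Reproduce the simulated A/B test data (same seed and draw order as above)
np.random.seed(42)
old_algo = np.random.gamma(shape=5.2, scale=8.13, size=247)
new_algo = np.random.gamma(shape=4.9, scale=9.55, size=253)

# Cohen's d: difference in means divided by a pooled standard deviation
n1, n2 = len(old_algo), len(new_algo)
s1, s2 = np.std(old_algo, ddof=1), np.std(new_algo, ddof=1)
pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (np.mean(new_algo) - np.mean(old_algo)) / pooled_sd
print(f"Cohen's d: {d:.3f}")
```

A difference of a few minutes against session-level standard deviations around 20 minutes corresponds to a small-to-moderate standardized effect — small per session, but large in aggregate across 12 million users.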

The Business Analysis

Here's where statistics becomes a superpower for business decisions.

# Business impact analysis
daily_sessions_per_user = 3
total_users = 12_000_000
ad_revenue_per_minute = 0.0033  # $/minute/user (illustrative)

# Using CI bounds for range of impacts
scenarios = {
    'Conservative (CI lower)': ci_lower,
    'Point estimate': diff,
    'Optimistic (CI upper)': ci_upper
}

print("=== Business Impact Analysis ===")
print(f"Active users: {total_users:,}")
print(f"Average sessions/user/day: {daily_sessions_per_user}")
print(f"Ad revenue per user-minute: ${ad_revenue_per_minute}")
print()

for label, minutes in scenarios.items():
    daily_minutes = minutes * daily_sessions_per_user * total_users
    monthly_revenue = daily_minutes * ad_revenue_per_minute * 30
    print(f"{label}: +{minutes:.1f} min/session")
    print(f"  → {daily_minutes/1e6:.0f}M additional minutes/day")
    print(f"  → ${monthly_revenue/1e6:.1f}M additional revenue/month")
    print()

Alex's recommendation:

"The data are clear: the new recommendation algorithm increases average watch time by approximately 4.5 minutes per session, with a 95% confidence interval of roughly 1 to 8 minutes. This translates to between $0.4M and $2.9M in additional monthly ad revenue, with our best estimate at $1.6M/month.

I recommend a phased rollout:

  1. Deploy to 20% of users for two weeks to monitor for unforeseen issues
  2. If metrics remain positive, expand to 50% for another two weeks
  3. Full deployment after confirming no negative impacts on content diversity metrics

The statistical evidence supports the change, but responsible deployment requires monitoring for effects we didn't measure in the initial test — such as whether the new algorithm reduces the diversity of content users discover, which could have long-term engagement consequences."

The Deeper Lesson: What Good A/B Testing Looks Like

Alex's analysis illustrates the complete A/B testing pipeline that Chapter 4 previewed:

| Step             | What Alex Did                                         | Why It Matters                           |
|------------------|-------------------------------------------------------|------------------------------------------|
| Design           | Random assignment, stratified by tier, single-blind   | Ensures causal interpretation            |
| Sample size      | 500 users (≈250/group), 30-day measurement            | Adequate power for meaningful effects    |
| Conditions       | Checked independence, CLT for normality               | Validates the test's assumptions         |
| Test             | Welch's two-sample t-test                             | Doesn't assume equal variances           |
| CI               | 95% CI for the difference                             | Quantifies uncertainty about effect size |
| Business context | Revenue projections using CI bounds                   | Translates statistics into decisions     |
| Caution          | Phased rollout, monitoring plan                       | Acknowledges limitations of the test     |

This is what it looks like when statistics serves as a genuine decision-making tool. Not a p-value chasing exercise. Not a rubber stamp for a predetermined conclusion. A rigorous framework for answering "should we make this change?" with appropriate uncertainty.

What the A/B Test Can't Tell You

Even with perfect execution, Alex's test has limitations:

  1. Short-term vs. long-term effects. The test measured 30 days. The algorithm might increase watch time initially through novelty effects, then the advantage might fade. Or it might grow over time as the algorithm learns user preferences better.

  2. Average vs. subgroup effects. The overall increase of 4.5 minutes might mask heterogeneity — the algorithm might help Premium users a lot and hurt Free users slightly. Subgroup analysis would reveal this.

  3. Watch time isn't the only metric that matters. Users might watch more but enjoy less. Content creators might get fewer views on niche content. The algorithm might create filter bubbles that reduce content discovery.

  4. External validity. The test was run during a specific period. Results might differ during holiday seasons, major content launches, or economic downturns.

These limitations don't invalidate the test — they contextualize it. The two-sample t-test answered the specific question it was asked: "Is there a difference in average watch time?" The answer is yes. But responsible data science asks follow-up questions too.
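The subgroup concern (point 2) is directly checkable with the data StreamVibe already has: run Welch's test within each tier. The sketch below attaches hypothetical, randomly generated tier labels to the simulated users purely for illustration — a real analysis would use the actual tier assignments from the stratified randomization.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Reproduce the simulated A/B test data (same seed and draw order as above)
np.random.seed(42)
old_algo = np.random.gamma(shape=5.2, scale=8.13, size=247)
new_algo = np.random.gamma(shape=4.9, scale=9.55, size=253)

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "watch_time": np.concatenate([old_algo, new_algo]),
    "group": ["old"] * len(old_algo) + ["new"] * len(new_algo),
    # Hypothetical tier labels for illustration only
    "tier": rng.choice(["Free", "Basic", "Premium"], size=len(old_algo) + len(new_algo)),
})

# Welch's t-test within each tier (subgroup / heterogeneity check)
for tier, sub in df.groupby("tier"):
    old = sub.loc[sub["group"] == "old", "watch_time"]
    new = sub.loc[sub["group"] == "new", "watch_time"]
    res = stats.ttest_ind(new, old, equal_var=False)
    print(f"{tier}: diff = {new.mean() - old.mean():+.2f} min, "
          f"n = ({len(old)}, {len(new)}), p = {res.pvalue:.3f}")
```

Two cautions apply: each subgroup test has far less power than the overall test, and running several tests inflates the chance of a spurious "significant" subgroup, so subgroup findings should be treated as hypotheses for follow-up rather than conclusions.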

Discussion Questions

  1. Alex used a two-tailed test. Would a one-tailed test ($H_a: \mu_{\text{new}} > \mu_{\text{old}}$) have been more appropriate? What are the tradeoffs?

  2. Suppose the test had yielded $p = 0.08$. Should Alex still recommend the algorithm change? Why or why not?

  3. If StreamVibe wanted to detect a minimum meaningful difference of 2 minutes with 90% power, approximately how many users would they need per group? (Preview for Chapter 17.)

  4. How would you design a follow-up study to test whether the new algorithm's benefits persist over 6 months?

  5. Alex's revenue projections used the CI bounds. Why is this better than using only the point estimate of 4.5 minutes?