Case Study 2: Alex's Watch Time and the A/B Testing Pipeline — How Tech Companies Use t-Tests at Scale
The Setup
Alex Rivera's day at StreamVibe starts with a dashboard.
The dashboard shows real-time metrics for the streaming platform: average watch time per session, engagement rate, click-through rate on recommendations, and dozens of other numbers. But today, one metric catches Alex's attention: average watch time per session appears to have dipped below the industry benchmark of 45 minutes.
Is this a real drop, or just random fluctuation?
This question — asked dozens of times per day across thousands of tech companies — is answered by the same one-sample t-test you learned in this chapter. But at tech companies, the t-test doesn't happen in isolation. It's embedded in a larger infrastructure called the A/B testing pipeline.
The A/B Testing Pipeline
At StreamVibe, the data science team has built an automated system that runs t-tests continuously. Here's how it works:
Stage 1: Question Formulation
A product manager asks: "Is our average watch time different from the 45-minute industry benchmark?"
In formal terms: $$H_0: \mu = 45 \text{ minutes}$$ $$H_a: \mu \neq 45 \text{ minutes}$$
Stage 2: Data Collection
The system pulls a random sample of recent sessions. Today's sample: $n = 500$ sessions from the past 24 hours.
import numpy as np
from scipy import stats
# Simulating StreamVibe's session data
np.random.seed(2026)
# Real session times tend to be right-skewed (some very long sessions)
session_times = np.random.gamma(shape=4, scale=11, size=500)
print(f"n = {len(session_times)}")
print(f"Mean = {np.mean(session_times):.2f} minutes")
print(f"Median = {np.median(session_times):.2f} minutes")
print(f"Std Dev = {np.std(session_times, ddof=1):.2f} minutes")
print(f"Min = {np.min(session_times):.1f}, Max = {np.max(session_times):.1f}")
Output:
n = 500
Mean = 43.72 minutes
Median = 41.85 minutes
Std Dev = 21.54 minutes
Min = 4.2, Max = 134.7
Stage 3: Condition Checking (Automated)
The pipeline automatically checks conditions:
# Automated condition checks
def check_conditions(data):
    n = len(data)
    checks = {}
    # 1. Sample size
    checks['n'] = n
    checks['n_sufficient'] = n >= 30
    # 2. Normality assessment
    # With n = 500, the CLT handles this, but we check anyway
    skewness = stats.skew(data)
    checks['skewness'] = skewness
    # 3. Extreme-outlier check (3×IQR rule)
    Q1, Q3 = np.percentile(data, [25, 75])
    IQR = Q3 - Q1
    outlier_count = np.sum((data < Q1 - 3*IQR) | (data > Q3 + 3*IQR))
    checks['extreme_outliers'] = outlier_count
    # Assessment
    if n >= 30 and outlier_count == 0:
        checks['verdict'] = 'PASS: Large n, no extreme outliers'
    elif n >= 30 and outlier_count <= n * 0.01:
        checks['verdict'] = 'PASS (with note): Large n, few extreme outliers'
    else:
        checks['verdict'] = 'WARNING: Review manually'
    return checks

conditions = check_conditions(session_times)
for key, val in conditions.items():
    print(f"  {key}: {val}")
Output:
n: 500
n_sufficient: True
skewness: 0.87
extreme_outliers: 0
verdict: PASS: Large n, no extreme outliers
The data are right-skewed (skewness = 0.87) — watch time data almost always is. A few users binge-watch for hours while most watch for shorter periods. But with $n = 500$, the CLT makes the t-test robust to this skewness.
Stage 4: The Test
# One-sample t-test
result = stats.ttest_1samp(session_times, popmean=45)

print("\n=== One-Sample t-Test Results ===")
print("H₀: μ = 45 minutes")
print("Hₐ: μ ≠ 45 minutes")
print()
print(f"t-statistic: {result.statistic:.4f}")
print(f"p-value (two-tailed): {result.pvalue:.4f}")
print(f"df: {len(session_times) - 1}")
print()

# Confidence interval
n = len(session_times)
x_bar = np.mean(session_times)
s = np.std(session_times, ddof=1)
se = s / np.sqrt(n)
t_star = stats.t.ppf(0.975, df=n-1)
margin = t_star * se
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f}) minutes")
print()

# Decision
alpha = 0.05
if result.pvalue <= alpha:
    print(f"Decision: REJECT H₀ (p = {result.pvalue:.4f} ≤ {alpha})")
    print("Watch time appears to differ from 45-minute benchmark.")
else:
    print(f"Decision: FAIL TO REJECT H₀ (p = {result.pvalue:.4f} > {alpha})")
    print("No significant difference from 45-minute benchmark.")
Output:
=== One-Sample t-Test Results ===
H₀: μ = 45 minutes
Hₐ: μ ≠ 45 minutes
t-statistic: -1.3289
p-value (two-tailed): 0.1845
df: 499
95% CI: (41.83, 45.61) minutes
Decision: FAIL TO REJECT H₀ (p = 0.1845 > 0.05)
No significant difference from 45-minute benchmark.
Stage 5: Interpretation and Action
The result: $t_{499} = -1.33$, $p = 0.184$. No significant evidence that watch time differs from 45 minutes. The 95% CI of (41.83, 45.61) contains 45.
Alex's automated report to the product team reads:
Daily Watch Time Check (2026-03-15)
Average session time: 43.72 min (95% CI: 41.83 to 45.61)
Status: Within benchmark range. No action needed.
Note: Mean is 1.28 min below benchmark but difference is not statistically significant (p = 0.184).
The Scale of the Problem
Here's what makes this interesting: StreamVibe doesn't run one t-test per day. They run hundreds.
Every product team has metrics they're monitoring. The recommendation algorithm team checks click-through rates. The content team checks completion rates. The UX team checks navigation efficiency. The advertising team checks ad engagement. Each team runs multiple tests daily.
StreamVibe Daily Statistical Tests
═══════════════════════════════════
Watch time vs. benchmark ──── t-test
Engagement rate vs. target ──── t-test
Load time vs. threshold ──── t-test
Bounce rate vs. industry avg ──── t-test
Revenue per user vs. target ──── t-test
Session length vs. last month ──── two-sample t-test (Ch.16)
New vs. old algorithm ──── A/B test (Ch.16)
User segment comparisons ──── ANOVA (Ch.20)
... and dozens more
The Multiple Testing Problem
If StreamVibe runs 100 independent t-tests per day, each at $\alpha = 0.05$, how many false alarms do they expect?
$$\text{Expected false alarms} = 100 \times 0.05 = 5 \text{ per day}$$
On average, five tests per day will produce a "significant" result even when nothing has changed. Over a week, that is about 35 false alarms; over a month, roughly 150.
This is the multiple testing problem from Chapter 13, applied at industrial scale. StreamVibe's solution: they use adjusted significance thresholds (Bonferroni correction or false discovery rate control) and require any "significant" result to be replicated before taking action.
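The arithmetic behind these corrections is easy to check directly. The sketch below computes the expected number of false alarms and the family-wise error rate (the chance of at least one false alarm per day), with and without a Bonferroni-adjusted threshold; the numbers are illustrative, not StreamVibe's actual settings:

```python
# Multiple testing at scale: m independent tests, all with a true H0
m = 100      # tests per day
alpha = 0.05

expected_false_alarms = m * alpha          # 5.0 per day
# Probability of at least one false alarm in a day (family-wise error rate)
fwer = 1 - (1 - alpha) ** m                # ~0.994: nearly certain

# Bonferroni correction: test each hypothesis at alpha/m instead
alpha_bonf = alpha / m                     # 0.0005
fwer_bonf = 1 - (1 - alpha_bonf) ** m      # ~0.049: back near 0.05

print(f"Expected false alarms per day at alpha={alpha}: {expected_false_alarms:.1f}")
print(f"Family-wise error rate, unadjusted: {fwer:.3f}")
print(f"Family-wise error rate, Bonferroni (alpha/m = {alpha_bonf}): {fwer_bonf:.3f}")
```

The price of the correction is power: each individual test now needs much stronger evidence to reach significance, which is one reason replication before action is part of the pipeline.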
Alex has learned the hard way: a single significant t-test result is the beginning of an investigation, not the end.
Beyond the One-Sample Test: What's Coming Next
Alex's one-sample t-test answers: "Is our watch time at the benchmark?" But the more important questions at StreamVibe are comparative:
- "Did the new recommendation algorithm increase watch time compared to the old one?" (Two-sample t-test, Chapter 16)
- "Is the effect large enough to justify the engineering cost of deploying the new algorithm?" (Effect size, Chapter 17)
- "If there is a real 2-minute increase, how many users do we need in the test to detect it reliably?" (Power analysis, Chapter 17)
These comparative questions require the two-sample t-test and paired t-test, which build directly on the one-sample t-test you've learned in this chapter. The logic is identical — you're still computing a test statistic, checking conditions, and finding a p-value. The formulas just get a bit more elaborate.
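As a preview of that identical logic, SciPy already exposes the two-sample version. The toy comparison below simulates an old and a new algorithm group (the gamma parameters are made up for illustration; the new group's mean is about 2 minutes higher):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical A/B groups of session times, in minutes
old = rng.gamma(shape=4, scale=11.0, size=1000)   # control: mean ~44
new = rng.gamma(shape=4, scale=11.5, size=1000)   # treatment: mean ~46

# Welch's two-sample t-test (does not assume equal variances)
res = stats.ttest_ind(new, old, equal_var=False)
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```

Same recipe as before: a test statistic, conditions, a p-value; only the standard error formula changes.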
The Robustness Question in Practice
Alex's data were right-skewed (skewness = 0.87). In a statistics class, a student might worry: "The data aren't normal! Can we trust the t-test?"
At StreamVibe, with $n = 500$, nobody worries. The CLT guarantees that $\bar{x}$ has an approximately normal sampling distribution. The simulation studies from Section 15.6 confirm that the t-test's actual Type I error rate is virtually 0.05 for $n = 500$, regardless of the population shape.
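You can replicate that simulation evidence yourself. The sketch below repeatedly samples from the same right-skewed gamma population used for the session data, tests a true null hypothesis, and records how often it is (wrongly) rejected; the rejection rate should land near the nominal 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
shape, scale, n = 4, 11, 500     # gamma(4, 11): right-skewed, mean 44
true_mean = shape * scale        # H0 is TRUE here: mu = 44

n_sims = 2000
rejections = 0
for _ in range(n_sims):
    sample = rng.gamma(shape, scale, size=n)
    if stats.ttest_1samp(sample, popmean=true_mean).pvalue <= 0.05:
        rejections += 1

print(f"Empirical Type I error rate: {rejections / n_sims:.3f}")
```

Rerunning with a small n (say, 15) in place of 500 is a good way to see where the robustness starts to erode.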
But Alex's colleague on the fraud detection team tells a different story. He's testing whether the average transaction amount for a group of 12 suspicious accounts exceeds a threshold. His data:
$15, $22, $18, $12, $25, $8,430, $20, $14, $28, $16, $19, $23
That $8,430 value is an extreme outlier. For this small, outlier-contaminated dataset, the t-test would be unreliable. The fraud analyst instead uses the Wilcoxon signed-rank test (Chapter 21), which depends only on the ranks of the values, so no single extreme value can dominate the result.
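The contrast is easy to demonstrate on the fraud team's twelve values. In the sketch below, the $25 threshold is a hypothetical choice for illustration; notice how the single $8,430 value inflates the mean and standard deviation for the t-test, while contributing only one rank to the Wilcoxon test:

```python
import numpy as np
from scipy import stats

amounts = np.array([15, 22, 18, 12, 25, 8430, 20, 14, 28, 16, 19, 23])
threshold = 25  # hypothetical threshold, for illustration

# t-test on the mean: the outlier inflates both the mean and the SD
t_res = stats.ttest_1samp(amounts, popmean=threshold, alternative='greater')
print(f"mean = {amounts.mean():.0f}, t-test p = {t_res.pvalue:.3f}")

# Wilcoxon signed-rank: only the RANKS of the differences matter
w_res = stats.wilcoxon(amounts - threshold, alternative='greater')
print(f"Wilcoxon p = {w_res.pvalue:.3f}")
```

The sample mean (over $700) is dragged far above every typical value, yet the t-statistic is unreliable because the outlier also explodes the standard error; the rank-based test is unmoved by the outlier's magnitude.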
The lesson: robustness depends on context. The same procedure that works beautifully for Alex's 500-session analysis can fail spectacularly for a 12-observation dataset with an outlier. Always check your conditions.
How AI Systems Use t-Tests
The t-test isn't just a tool that humans use to evaluate AI — AI systems use t-tests internally.
Automated model monitoring. When StreamVibe deploys a machine learning model, the system continuously monitors the model's prediction accuracy. If the average prediction error drifts above a threshold (detected by a t-test), the system triggers an alert: "Model degradation detected."
Feature importance testing. When building a recommendation model, the system tests whether adding a new feature (like time of day) significantly improves prediction accuracy. This is essentially a paired t-test: compare prediction error with the feature vs. without it.
Anomaly detection. StreamVibe's security system uses a variant of the t-test to detect unusual patterns. If the average login frequency for a user suddenly differs from their historical baseline, the system flags the account for review.
In each case, the underlying logic is the same: compute a test statistic that measures how far an observed mean is from an expected value, relative to the estimated variability.
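A minimal version of the model-monitoring check can be sketched as follows; the function name, baseline, window size, and alert threshold are all hypothetical, not StreamVibe's actual system:

```python
import numpy as np
from scipy import stats

def check_model_drift(errors, baseline_error, alpha=0.01):
    """One-sided one-sample t-test: has mean prediction error
    drifted ABOVE the historical baseline? (Hypothetical sketch.)"""
    res = stats.ttest_1samp(errors, popmean=baseline_error,
                            alternative='greater')
    return res.pvalue <= alpha, res.pvalue

# Example: a recent window of prediction errors from a degraded model
rng = np.random.default_rng(7)
recent_errors = rng.normal(loc=0.08, scale=0.05, size=200)  # baseline was 0.05

drifted, p = check_model_drift(recent_errors, baseline_error=0.05)
print(f"drift detected: {drifted}, p = {p:.4f}")
```

A strict alpha (0.01 here) is typical for alerting systems, for exactly the multiple-testing reason discussed earlier: this check runs every day, on every model.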
Discussion Questions
- StreamVibe runs 100 t-tests per day. At $\alpha = 0.05$, they expect 5 false alarms. What would happen if they used $\alpha = 0.01$ instead? What's the tradeoff?
- Alex's data were right-skewed but the t-test was valid because $n = 500$. At what sample size would Alex need to start worrying about the skewness? What would you recommend if Alex had only 10 sessions to analyze?
- The automated pipeline checks conditions programmatically. What are the limitations of automated condition checking? Can a computer fully replace human judgment about whether a t-test is appropriate?
- A product manager sees that the 95% CI is (41.83, 45.61) and says: "The lower bound is 41.83 — that's more than 3 minutes below our benchmark! We need to take action!" How would you respond?
- The fraud analyst's data contained a single outlier of $8,430 among values mostly in the $12-$28 range. What would happen to the mean, standard deviation, and t-statistic if this outlier were included? Compute approximately.
- StreamVibe's model monitoring system uses t-tests to detect "model drift." If the model's average error has genuinely increased by 0.5%, but the monitoring system uses $n = 50$ observations and the error standard deviation is 5%, would the t-test likely detect this drift? What could be done to improve detection?