Case Study 2: Alex's Watch Time and the A/B Testing Pipeline — How Tech Companies Use t-Tests at Scale
The Setup
Alex Rivera's day at StreamVibe starts with a dashboard.
The dashboard shows real-time metrics for the streaming platform: average watch time per session, engagement rate, click-through rate on recommendations, and dozens of other numbers. But today, one metric catches Alex's attention: average watch time per session appears to have dipped below the industry benchmark of 45 minutes.
Is this a real drop, or just random fluctuation?
This question — asked dozens of times per day across thousands of tech companies — is answered by the same one-sample t-test you learned in this chapter. But at tech companies, the t-test doesn't happen in isolation. It's embedded in a larger infrastructure called the A/B testing pipeline.
The A/B Testing Pipeline
At StreamVibe, the data science team has built an automated system that runs t-tests continuously. Here's how it works:
Stage 1: Question Formulation
A product manager asks: "Is our average watch time different from the 45-minute industry benchmark?"
In formal terms: $$H_0: \mu = 45 \text{ minutes}$$ $$H_a: \mu \neq 45 \text{ minutes}$$
Stage 2: Data Collection
The system pulls a random sample of recent sessions. Today's sample: $n = 500$ sessions from the past 24 hours.
import numpy as np
from scipy import stats
# Simulating StreamVibe's session data
np.random.seed(2026)
# Real session times tend to be right-skewed (some very long sessions)
session_times = np.random.gamma(shape=4, scale=11, size=500)
print(f"n = {len(session_times)}")
print(f"Mean = {np.mean(session_times):.2f} minutes")
print(f"Median = {np.median(session_times):.2f} minutes")
print(f"Std Dev = {np.std(session_times, ddof=1):.2f} minutes")
print(f"Min = {np.min(session_times):.1f}, Max = {np.max(session_times):.1f}")
Output:
n = 500
Mean = 43.72 minutes
Median = 41.85 minutes
Std Dev = 21.54 minutes
Min = 4.2, Max = 134.7
Stage 3: Condition Checking (Automated)
The pipeline automatically checks conditions:
# Automated condition checks
def check_conditions(data):
    n = len(data)
    checks = {}
    # 1. Sample size
    checks['n'] = n
    checks['n_sufficient'] = n >= 30
    # 2. Normality assessment
    # With n = 500, the CLT handles this, but we check anyway
    skewness = stats.skew(data)
    checks['skewness'] = skewness
    # 3. Extreme-outlier check (3×IQR rule)
    Q1, Q3 = np.percentile(data, [25, 75])
    IQR = Q3 - Q1
    outlier_count = np.sum((data < Q1 - 3*IQR) | (data > Q3 + 3*IQR))
    checks['extreme_outliers'] = outlier_count
    # Assessment
    if n >= 30 and outlier_count == 0:
        checks['verdict'] = 'PASS: Large n, no extreme outliers'
    elif n >= 30 and outlier_count <= n * 0.01:
        checks['verdict'] = 'PASS (with note): Large n, few extreme outliers'
    else:
        checks['verdict'] = 'WARNING: Review manually'
    return checks

conditions = check_conditions(session_times)
for key, val in conditions.items():
    print(f"  {key}: {val}")
Output:
n: 500
n_sufficient: True
skewness: 0.87
extreme_outliers: 0
verdict: PASS: Large n, no extreme outliers
The data are right-skewed (skewness = 0.87) — watch time data almost always is. A few users binge-watch for hours while most watch for shorter periods. But with $n = 500$, the CLT makes the t-test robust to this skewness.
Stage 4: The Test
# One-sample t-test
result = stats.ttest_1samp(session_times, popmean=45)

print("\n=== One-Sample t-Test Results ===")
print("H₀: μ = 45 minutes")
print("Hₐ: μ ≠ 45 minutes")
print()
print(f"t-statistic: {result.statistic:.4f}")
print(f"p-value (two-tailed): {result.pvalue:.4f}")
print(f"df: {len(session_times) - 1}")
print()

# Confidence interval
n = len(session_times)
x_bar = np.mean(session_times)
s = np.std(session_times, ddof=1)
se = s / np.sqrt(n)
t_star = stats.t.ppf(0.975, df=n-1)
margin = t_star * se
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f}) minutes")
print()

# Decision
alpha = 0.05
if result.pvalue <= alpha:
    print(f"Decision: REJECT H₀ (p = {result.pvalue:.4f} ≤ {alpha})")
    print("Watch time appears to differ from 45-minute benchmark.")
else:
    print(f"Decision: FAIL TO REJECT H₀ (p = {result.pvalue:.4f} > {alpha})")
    print("No significant difference from 45-minute benchmark.")
Output:
=== One-Sample t-Test Results ===
H₀: μ = 45 minutes
Hₐ: μ ≠ 45 minutes
t-statistic: -1.3289
p-value (two-tailed): 0.1845
df: 499
95% CI: (41.83, 45.61) minutes
Decision: FAIL TO REJECT H₀ (p = 0.1845 > 0.05)
No significant difference from 45-minute benchmark.
Stage 5: Interpretation and Action
The result: $t_{499} = -1.33$, $p = 0.184$. No significant evidence that watch time differs from 45 minutes. The 95% CI of (41.83, 45.61) contains 45.
Alex's automated report to the product team reads:
Daily Watch Time Check (2026-03-15)
Average session time: 43.72 min (95% CI: 41.83 to 45.61)
Status: Within benchmark range. No action needed.
Note: Mean is 1.28 min below benchmark but difference is not statistically significant (p = 0.184).
The Scale of the Problem
Here's what makes this interesting: StreamVibe doesn't run one t-test per day. They run hundreds.
Every product team has metrics they're monitoring. The recommendation algorithm team checks click-through rates. The content team checks completion rates. The UX team checks navigation efficiency. The advertising team checks ad engagement. Each team runs multiple tests daily.
StreamVibe Daily Statistical Tests
═══════════════════════════════════
Watch time vs. benchmark ──── t-test
Engagement rate vs. target ──── t-test
Load time vs. threshold ──── t-test
Bounce rate vs. industry avg ──── t-test
Revenue per user vs. target ──── t-test
Session length vs. last month ──── two-sample t-test (Ch.16)
New vs. old algorithm ──── A/B test (Ch.16)
User segment comparisons ──── ANOVA (Ch.20)
... and dozens more
The Multiple Testing Problem
If StreamVibe runs 100 independent t-tests per day, each at $\alpha = 0.05$, how many false alarms do they expect?
$$\text{Expected false alarms} = 100 \times 0.05 = 5 \text{ per day}$$
On average, five tests per day will produce a "significant" result even when nothing has changed. Over a week, that is about 35 false alarms; over a month, roughly 150.
This is the multiple testing problem from Chapter 13, applied at industrial scale. StreamVibe's solution: they use adjusted significance thresholds (Bonferroni correction or false discovery rate control) and require any "significant" result to be replicated before taking action.
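The arithmetic behind these corrections is easy to check directly. The sketch below computes the expected number of false alarms and the family-wise error rate (the chance of at least one false alarm per day), with and without a Bonferroni-adjusted threshold; the numbers are illustrative, not StreamVibe's actual settings:

```python
# Multiple testing at scale: m independent tests, all with a true H0
m = 100      # tests per day
alpha = 0.05

expected_false_alarms = m * alpha          # 5.0 per day
# Probability of at least one false alarm in a day (family-wise error rate)
fwer = 1 - (1 - alpha) ** m                # ~0.994: nearly certain

# Bonferroni correction: test each hypothesis at alpha/m instead
alpha_bonf = alpha / m                     # 0.0005
fwer_bonf = 1 - (1 - alpha_bonf) ** m      # ~0.049: back near 0.05

print(f"Expected false alarms per day at alpha={alpha}: {expected_false_alarms:.1f}")
print(f"Family-wise error rate, unadjusted: {fwer:.3f}")
print(f"Family-wise error rate, Bonferroni (alpha/m = {alpha_bonf}): {fwer_bonf:.3f}")
```

The price of the correction is power: each individual test now needs much stronger evidence to reach significance, which is one reason replication before action is part of the pipeline.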
Alex has learned the hard way: a single significant t-test result is the beginning of an investigation, not the end.
Beyond the One-Sample Test: What's Coming Next
Alex's one-sample t-test answers: "Is our watch time at the benchmark?" But the more important questions at StreamVibe are comparative:
- "Did the new recommendation algorithm increase watch time compared to the old one?" (Two-sample t-test, Chapter 16)
- "Is the effect large enough to justify the engineering cost of deploying the new algorithm?" (Effect size, Chapter 17)
- "If there is a real 2-minute increase, how many users do we need in the test to detect it reliably?" (Power analysis, Chapter 17)
These comparative questions require the two-sample t-test and paired t-test, which build directly on the one-sample t-test you've learned in this chapter. The logic is identical — you're still computing a test statistic, checking conditions, and finding a p-value. The formulas just get a bit more elaborate.
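As a preview of that identical logic, SciPy already exposes the two-sample version. The toy comparison below simulates an old and a new algorithm group (the gamma parameters are made up for illustration; the new group's mean is about 2 minutes higher):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical A/B groups of session times, in minutes
old = rng.gamma(shape=4, scale=11.0, size=1000)   # control: mean ~44
new = rng.gamma(shape=4, scale=11.5, size=1000)   # treatment: mean ~46

# Welch's two-sample t-test (does not assume equal variances)
res = stats.ttest_ind(new, old, equal_var=False)
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```

Same recipe as before: a test statistic, conditions, a p-value; only the standard error formula changes.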
The Robustness Question in Practice
Alex's data were right-skewed (skewness = 0.87). In a statistics class, a student might worry: "The data aren't normal! Can we trust the t-test?"
At StreamVibe, with $n = 500$, nobody worries. The CLT guarantees that $\bar{x}$ has an approximately normal sampling distribution. The simulation studies from Section 15.6 confirm that the t-test's actual Type I error rate is virtually 0.05 for $n = 500$, regardless of the population shape.
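You can replicate that simulation evidence yourself. The sketch below repeatedly samples from the same right-skewed gamma population used for the session data, tests a true null hypothesis, and records how often it is (wrongly) rejected; the rejection rate should land near the nominal 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
shape, scale, n = 4, 11, 500     # gamma(4, 11): right-skewed, mean 44
true_mean = shape * scale        # H0 is TRUE here: mu = 44

n_sims = 2000
rejections = 0
for _ in range(n_sims):
    sample = rng.gamma(shape, scale, size=n)
    if stats.ttest_1samp(sample, popmean=true_mean).pvalue <= 0.05:
        rejections += 1

print(f"Empirical Type I error rate: {rejections / n_sims:.3f}")
```

Rerunning with a small n (say, 15) in place of 500 is a good way to see where the robustness starts to erode.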
But Alex's colleague on the fraud detection team tells a different story. He's testing whether the average transaction amount for a group of 12 suspicious accounts exceeds a threshold. His data:
$15, $22, $18, $12, $25, $8,430, $20, $14, $28, $16, $19, $23
That $8,430 value is an extreme outlier. For this small, outlier-contaminated dataset, the t-test would be unreliable. The fraud analyst instead uses the Wilcoxon signed-rank test (Chapter 21), which depends only on the ranks of the values, so no single extreme value can dominate the result.
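The contrast is easy to demonstrate on the fraud team's twelve values. In the sketch below, the $25 threshold is a hypothetical choice for illustration; notice how the single $8,430 value inflates the mean and standard deviation for the t-test, while contributing only one rank to the Wilcoxon test:

```python
import numpy as np
from scipy import stats

amounts = np.array([15, 22, 18, 12, 25, 8430, 20, 14, 28, 16, 19, 23])
threshold = 25  # hypothetical threshold, for illustration

# t-test on the mean: the outlier inflates both the mean and the SD
t_res = stats.ttest_1samp(amounts, popmean=threshold, alternative='greater')
print(f"mean = {amounts.mean():.0f}, t-test p = {t_res.pvalue:.3f}")

# Wilcoxon signed-rank: only the RANKS of the differences matter
w_res = stats.wilcoxon(amounts - threshold, alternative='greater')
print(f"Wilcoxon p = {w_res.pvalue:.3f}")
```

The sample mean (over $700) is dragged far above every typical value, yet the t-statistic is unreliable because the outlier also explodes the standard error; the rank-based test is unmoved by the outlier's magnitude.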
The lesson: robustness depends on context. The same procedure that works beautifully for Alex's 500-session analysis can fail spectacularly for a 12-observation dataset with an outlier. Always check your conditions.
How AI Systems Use t-Tests
The t-test isn't just a tool that humans use to evaluate AI — AI systems use t-tests internally.
Automated model monitoring. When StreamVibe deploys a machine learning model, the system continuously monitors the model's prediction accuracy. If the average prediction error drifts above a threshold (detected by a t-test), the system triggers an alert: "Model degradation detected."
Feature importance testing. When building a recommendation model, the system tests whether adding a new feature (like time of day) significantly improves prediction accuracy. This is essentially a paired t-test: compare prediction error with the feature vs. without it.
Anomaly detection. StreamVibe's security system uses a variant of the t-test to detect unusual patterns. If the average login frequency for a user suddenly differs from their historical baseline, the system flags the account for review.
In each case, the underlying logic is the same: compute a test statistic that measures how far an observed mean is from an expected value, relative to the estimated variability.
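A minimal version of the model-monitoring check can be sketched as follows; the function name, baseline, window size, and alert threshold are all hypothetical, not StreamVibe's actual system:

```python
import numpy as np
from scipy import stats

def check_model_drift(errors, baseline_error, alpha=0.01):
    """One-sided one-sample t-test: has mean prediction error
    drifted ABOVE the historical baseline? (Hypothetical sketch.)"""
    res = stats.ttest_1samp(errors, popmean=baseline_error,
                            alternative='greater')
    return res.pvalue <= alpha, res.pvalue

# Example: a recent window of prediction errors from a degraded model
rng = np.random.default_rng(7)
recent_errors = rng.normal(loc=0.08, scale=0.05, size=200)  # baseline was 0.05

drifted, p = check_model_drift(recent_errors, baseline_error=0.05)
print(f"drift detected: {drifted}, p = {p:.4f}")
```

A strict alpha (0.01 here) is typical for alerting systems, for exactly the multiple-testing reason discussed earlier: this check runs every day, on every model.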
Discussion Questions
- StreamVibe runs 100 t-tests per day. At $\alpha = 0.05$, they expect 5 false alarms. What would happen if they used $\alpha = 0.01$ instead? What's the tradeoff?
- Alex's data were right-skewed but the t-test was valid because $n = 500$. At what sample size would Alex need to start worrying about the skewness? What would you recommend if Alex had only 10 sessions to analyze?
- The automated pipeline checks conditions programmatically. What are the limitations of automated condition checking? Can a computer fully replace human judgment about whether a t-test is appropriate?
- A product manager sees that the 95% CI is (41.83, 45.61) and says: "The lower bound is 41.83 — that's more than 3 minutes below our benchmark! We need to take action!" How would you respond?
- The fraud analyst's data contained a single outlier of $8,430 among values mostly in the $12-$28 range. What would happen to the mean, standard deviation, and t-statistic if this outlier were included? Compute approximately.
- StreamVibe's model monitoring system uses t-tests to detect "model drift." If the model's average error has genuinely increased by 0.5%, but the monitoring system uses $n = 50$ observations and the error standard deviation is 5%, would the t-test likely detect this drift? What could be done to improve detection?