Case Study: A/B Testing in Tech — Designing Experiments at Scale
The Setup
Alex Rivera has been at StreamVibe for four months now, and she's gotten comfortable with the company's data. She can pull reports, analyze user segments, and present findings to her team. But this week, her manager dropped something new on her desk.
"The product team wants to redesign the homepage," her manager says. "They think a new layout will increase engagement. I want you to design the A/B test."
Alex feels a mix of excitement and nervousness. She knows what an A/B test is — she learned the basic concept in her stats class. But designing one that will actually run on StreamVibe's 8 million active users? That's a different level entirely.
Let's follow Alex through the process and see how the principles from this chapter play out in a real tech environment.
Step 1: Define the Question Precisely
Alex's first instinct is to test "whether the new homepage is better." But her statistics training kicks in: better at what, exactly?
She sits down with the product team and nails down the specifics:
- Research question: Does the new homepage layout increase average watch time per user session?
- Primary metric: Average watch time per session (in minutes)
- Secondary metrics: Sessions per week, click-through rate on recommendations, bounce rate (percentage of users who leave within 30 seconds)
- Minimum detectable effect: The product team considers a 3% increase in watch time "worth it" (anything less isn't worth the engineering effort to deploy the new design)
This step matters more than most people realize. If Alex tested "engagement" without defining it precisely, she could end up in a situation where watch time goes up but bounce rate also goes up — and the team would argue endlessly about whether the new design "worked."
Statistical thinking in action: Defining your research question precisely before collecting data is Pillar 1 of the four pillars of statistical investigation (Chapter 1). Alex is practicing good statistics before writing a single line of code.
Step 2: Determine the Sample
Alex could test the new homepage on all 8 million users, but that would be reckless. What if the new design has a bug? What if it's dramatically worse? You don't want to risk the experience of your entire user base.
Standard practice is to test on a fraction of users:
- Control group (Group A): 5% of users see the current homepage (400,000 users)
- Treatment group (Group B): 5% of users see the new homepage (400,000 users)
- The remaining 90%: Continue seeing the current homepage, unaffected by the test
Wait — 400,000 users per group? That seems like overkill. Couldn't she use a smaller sample?
She could, but there's a reason for the large numbers. StreamVibe users vary enormously in their watching habits — some watch 4 hours a day, others check in once a week. With this much variation, you need a large sample to reliably detect a 3% difference in watch time. We'll learn the formal tools for calculating sample size in Chapter 17 (power analysis), but the intuition is: more variation in the data = larger sample needed to detect a real effect.
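The intuition above can be made concrete with the standard sample-size formula that Chapter 17 develops formally. This is only a sketch: the case study doesn't report per-user standard deviations, so the values below are assumptions chosen purely for illustration.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sample z-test on a
    difference in means: n = 2 * ((z_alpha + z_beta) * sigma / delta)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for a 5% two-sided test
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# A 3% lift on a ~42-minute baseline is about 1.26 minutes.
# The standard deviations here are ASSUMED values for illustration only.
delta = 42.0 * 0.03
for sigma in (35, 70, 140):
    print(f"sigma = {sigma:3d} min  ->  n per group = {n_per_group(sigma, delta):,}")
```

Notice that doubling the standard deviation quadruples the required sample — the formula scales with sigma squared, which is why a metric as variable as watch time demands such large groups.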
Randomization in Practice
How does StreamVibe randomly assign users to groups? It's beautifully simple:
- When a user logs in, the system generates a hash of their user ID
- The hash produces a number between 0 and 99
- Users with hash values 0-4 are in Group A (control)
- Users with hash values 5-9 are in Group B (treatment)
- Users with hash values 10-99 see the normal experience
This method ensures:
- Each user is randomly assigned — the hash function is effectively random with respect to user characteristics
- Each user stays in the same group — the hash of a given user ID always produces the same number, so users don't bounce between layouts between sessions
- The assignment is invisible to users — they don't know they're in an experiment
This is the digital equivalent of a double-blind study. Users don't know which version they're seeing (they have nothing to compare it to), and the engineers who built the feature don't interact with users during the test. The data speaks for itself.
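In code, the bucketing scheme looks something like the sketch below. The function name, experiment label, and choice of SHA-256 are illustrative assumptions, not StreamVibe's actual implementation; salting the hash with an experiment name is a common practice that keeps bucket assignments independent across concurrent tests.

```python
import hashlib

def assign_group(user_id, experiment="homepage_redesign_v1"):
    """Deterministically map a user to a bucket 0-99.
    The experiment name is a hypothetical salt: it keeps this test's
    buckets independent of any other experiment's buckets."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < 5:
        return "control"    # buckets 0-4: old homepage (Group A)
    elif bucket < 10:
        return "treatment"  # buckets 5-9: new homepage (Group B)
    return "default"        # buckets 10-99: normal experience

# Same user, same answer -- assignment is stable across sessions.
print(assign_group("user_12345"))
print(assign_group("user_12345"))
```

Because the hash is deterministic, no assignment table needs to be stored: any server can recompute a user's group on the fly.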
Step 3: Run the Experiment
Alex launches the test on a Monday morning. For two weeks, 400,000 users see the old homepage and 400,000 see the new one. The system automatically logs every session for both groups.
Why Two Weeks?
The test duration matters. Too short, and you might catch an unusual period (a holiday weekend, a viral show release). Too long, and you waste time on a test that should have been concluded earlier. Alex chose two weeks because:
- It spans two full weekends (viewing patterns differ on weekdays vs. weekends)
- It's long enough for the "novelty effect" to wear off — some users might engage more with a new layout simply because it's new, not because it's better
- It aligns with StreamVibe's standard testing window
Step 4: Check the Data
After two weeks, Alex pulls the results:
| Metric | Group A (Old Homepage) | Group B (New Homepage) | Difference |
|---|---|---|---|
| Users | 398,412 | 399,847 | — |
| Avg watch time per session | 42.3 min | 43.8 min | +3.5% |
| Sessions per week | 4.2 | 4.3 | +2.4% |
| Recommendation click rate | 18.7% | 21.2% | +2.5 pp |
| Bounce rate | 12.1% | 11.8% | -0.3 pp |
The new homepage shows improvement across every metric. Watch time increased by 3.5% — exceeding the 3% threshold the team set. But Alex knows better than to celebrate yet. She asks the critical question: could this difference be due to chance?
With 400,000 users in each group, random variation alone could produce small differences even if the layouts were identical. The question is whether a 3.5% difference with this sample size is large enough to be convincing — or whether it's within the range of normal random fluctuation.
This is exactly the question that hypothesis testing (Chapter 13) and confidence intervals (Chapter 12) will answer formally. For now, Alex uses StreamVibe's internal statistics tool, which reports:
- p-value for watch time difference: less than 0.001 (we'll learn what this means precisely in Chapter 13, but the short version: if the two layouts were truly identical, there's less than a 0.1% chance you'd see a difference this large by random chance alone)
- 95% confidence interval for the difference: 2.8% to 4.2% (we'll learn to construct these in Chapter 12)
The result is statistically significant. The new homepage genuinely increases watch time.
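The tool's output can be roughly reproduced by hand with a two-sample z-test. The sketch below uses the group sizes and means from Alex's table, but the per-user standard deviation is an assumed value — the case study doesn't report it — chosen so the interval lands near the reported one.

```python
from math import sqrt
from statistics import NormalDist

# Group sizes and means from Alex's results table.  The per-user
# standard deviation is NOT reported in the case study; 67.5 minutes
# is an assumed value used only to make the sketch concrete.
n_a, mean_a = 398_412, 42.3
n_b, mean_b = 399_847, 43.8
sd = 67.5

diff = mean_b - mean_a
se = sqrt(sd**2 / n_a + sd**2 / n_b)         # standard error of the difference
z = diff / se                                # two-sample z statistic
p_two_sided = 2 * (1 - NormalDist().cdf(z))  # underflows to 0.0 at a z this large

half = 1.96 * se                             # 95% confidence interval half-width
lo, hi = diff - half, diff + half
print(f"z = {z:.1f} (p is far below 0.001)")
print(f"95% CI: {lo:.2f} to {hi:.2f} min ({lo/mean_a:+.1%} to {hi/mean_a:+.1%})")
```

With that assumed spread, the interval comes out close to the reported 2.8% to 4.2%; a different assumed standard deviation would widen or narrow it.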
Step 5: Interpret and Decide
Alex presents her findings. The new homepage increases watch time by approximately 3.5%. This is above the 3% threshold, the result is statistically significant, and all secondary metrics moved in the right direction.
But a thoughtful colleague raises a question: "We see the overall average went up. But did it go up for everyone? Or did it go up a lot for some users and down for others?"
This is an excellent question. Alex digs deeper:
| User Segment | Group A Watch Time | Group B Watch Time | Change |
|---|---|---|---|
| New users (< 30 days) | 28.1 min | 33.4 min | +18.9% |
| Regular users (30-365 days) | 44.2 min | 44.8 min | +1.4% |
| Power users (> 365 days) | 67.8 min | 65.2 min | -3.8% |
Now the picture is more nuanced. The new homepage is great for new users (the recommendation layout helps them discover content), neutral for regular users, and actually worse for power users (who had their own browsing patterns disrupted by the layout change).
The overall average went up because the positive effect on new users outweighed the negative effect on power users — but "overall average went up" hides an important story.
Connection to Chapter 2: This is a great example of why understanding your data matters. The "average watch time" is a numerical variable, and the overall average tells one story. But when you break it down by a categorical variable (user segment), a more complex — and more useful — story emerges. The variable types and the observational units shape what you can see.
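The arithmetic behind "the average hides a story" is worth seeing once. The segment shares below are not given in the case study; they are one plausible mix chosen so that the weighted segment averages reproduce the overall table.

```python
# Segment means (minutes per session) from Alex's subgroup table.
segments = {
    "new":     {"A": 28.1, "B": 33.4},
    "regular": {"A": 44.2, "B": 44.8},
    "power":   {"A": 67.8, "B": 65.2},
}
# Segment shares are NOT given in the case study; these weights are one
# plausible mix that reproduces the overall averages (42.3 and 43.8 min).
weights = {"new": 0.255, "regular": 0.651, "power": 0.094}

# Overall average = share-weighted sum of segment averages.
overall = {g: sum(weights[s] * segments[s][g] for s in segments)
           for g in ("A", "B")}
change = overall["B"] / overall["A"] - 1
print(f"Overall: {overall['A']:.1f} -> {overall['B']:.1f} min ({change:+.1%})")
```

A big gain in one modest-sized segment and a loss in a small one net out to the headline +3.5% — which is exactly why the subgroup table matters.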
A/B Testing Pitfalls: What Can Go Wrong
Alex's test went smoothly, but A/B tests can fail in many ways. Here are the most common pitfalls:
Pitfall 1: Peeking at Results Too Early
If Alex had checked the results after three days and seen a 5% improvement, she might have been tempted to stop the test and declare victory. This is called "peeking" or "optional stopping," and it inflates the chance of a false positive. Early results are noisy — small samples have high variability. A three-day winner might reverse itself over two weeks. The statistical tests we'll learn in Chapter 13 assume you decide your sample size (or test duration) before looking at the data. Peeking violates that assumption.
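You can watch peeking inflate the false positive rate with a small simulation. The sketch below runs A/A tests — both groups draw from the same distribution, so any "significant" result is false — and compares a single look at day 14 with a daily peek. All parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(42)
Z = NormalDist().inv_cdf(0.975)  # 1.96: critical value for a 5% two-sided test

def simulate(n_trials=1000, per_day=100, days=14):
    """A/A tests: both groups draw from the SAME distribution, so every
    'significant' result is a false positive.  Compare one fixed look at
    the end against stopping at the first significant daily peek."""
    fixed = peeked = 0
    for _ in range(n_trials):
        sum_a = sum_b = 0.0
        n = 0
        hit_early = False
        for _ in range(days):
            for _ in range(per_day):
                sum_a += random.gauss(0, 1)
                sum_b += random.gauss(0, 1)
            n += per_day
            z = (sum_b / n - sum_a / n) / sqrt(2 / n)  # sigma known to be 1
            if abs(z) > Z:
                hit_early = True    # a daily peek would declare a winner here
        if abs(z) > Z:              # decision from the final look only
            fixed += 1
        if hit_early:               # decision if any daily peek looked significant
            peeked += 1
    return fixed / n_trials, peeked / n_trials

fixed_rate, peek_rate = simulate()
print(f"False positives, single look at day 14: {fixed_rate:.1%}")
print(f"False positives with daily peeking:     {peek_rate:.1%}")
```

With fourteen looks instead of one, the nominal 5% error rate inflates several-fold — even though no real effect exists anywhere in the simulation.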
Pitfall 2: Testing Too Many Things at Once
What if the new homepage changed the layout, the color scheme, the font, and the recommendation algorithm all at once? If watch time goes up, which change caused it? This is the experimental equivalent of confounding — when multiple variables change simultaneously, you can't isolate which one had the effect.
Best practice: change one thing at a time. If you must change multiple things, use more sophisticated experimental designs (multivariate testing) that can separate the effects — but these require larger samples and more complex analysis.
Pitfall 3: The Novelty Effect
Users might engage more with a new design simply because it's new and different — not because it's better. This effect fades as users get accustomed to the new layout. Running the test for long enough (at least one to two weeks) helps mitigate this, but it's always worth checking whether the effect is stable over time or decaying.
Pitfall 4: Interference Between Groups
If users in different groups talk to each other ("Hey, my StreamVibe homepage looks different from yours!"), it can contaminate the experiment. In social products, this is an especially big problem — a change that affects one user's behavior might indirectly affect their friends' behavior. This is called "network interference" or "spillover effects," and it's an active area of research in tech experimentation.
Pitfall 5: Ignoring Practical Significance
A result can be statistically significant without being practically significant. If a test shows that a new button color increases clicks by 0.01% with p < 0.001, the result is real — but is it worth the engineering effort to deploy? Companies need to set a minimum threshold for practical importance before running the test (as Alex did with the 3% threshold).
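To see how sample size alone can manufacture statistical significance, the sketch below holds a tiny lift fixed (a 20.0% to 20.1% click rate) and grows n; the numbers are illustrative, not from the case study.

```python
from math import sqrt
from statistics import NormalDist

def z_for_rates(p_a, p_b, n):
    """Two-sample z statistic for a difference in click rates,
    with n users per group (pooled-variance approximation)."""
    p = (p_a + p_b) / 2
    se = sqrt(2 * p * (1 - p) / n)  # standard error of the rate difference
    return (p_b - p_a) / se

# The same tiny lift at three sample sizes: noise, borderline, "highly
# significant" -- only n changed.
for n in (10_000, 1_000_000, 100_000_000):
    z = z_for_rates(0.200, 0.201, n)
    p_val = 2 * (1 - NormalDist().cdf(abs(z)))
    print(f"n = {n:>11,}: z = {z:5.2f}, p = {p_val:.3g}")
```

The effect never changed — only n did. That is why the practical threshold has to be set before the test, as Alex did with the 3% bar.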
The Scale of Modern A/B Testing
To appreciate the scope of this practice, consider some numbers:
- Google runs over 10,000 A/B tests per year. Around 2009, the company famously tested 41 shades of blue for advertising links; the winning shade was estimated to generate $200 million in additional annual revenue.
- Netflix tests different thumbnail images for shows. The image you see for a movie on your homepage might be different from what your friend sees — because Netflix is testing which image makes each of you more likely to click.
- Amazon tests page layouts, recommendation algorithms, pricing displays, and checkout flows continuously. Jeff Bezos once said that Amazon's success comes from running "a very large number of experiments."
- Booking.com reportedly runs over 1,000 concurrent A/B tests at any given time.
These companies have essentially industrialized the scientific method. Every product decision is treated as a hypothesis to be tested, not an opinion to be debated.
The Ethics of Testing at Scale
But this raises ethical questions that Alex — and anyone working in tech — should think carefully about:
Manipulation vs. Optimization
When StreamVibe optimizes its homepage to maximize watch time, is it helping users find content they enjoy? Or is it manipulating them into watching more than they want to? The line between "serving the user" and "exploiting the user" is not always clear, especially when the company's revenue depends on time spent on the platform.
Dark Patterns
A/B testing can be used to optimize for user welfare (making it easier to find relevant content) or against it (making it harder to cancel a subscription). "Dark patterns" — design choices that trick users into actions they don't intend — are often discovered and refined through A/B testing. A button placement that makes users accidentally purchase a subscription is a "winning" A/B test variant, but it's not ethical.
The Facebook Emotional Contagion Study
In 2014, researchers at Facebook (now Meta) modified the news feeds of 689,003 users, showing some more positive content and others more negative content. They found that users exposed to more positive content wrote more positive posts themselves, and vice versa. The study was published in the Proceedings of the National Academy of Sciences and provoked massive backlash — not because of the finding, but because users were not informed they were being emotionally manipulated.
Facebook argued that A/B testing was covered by its terms of service. Critics argued that there's a meaningful difference between testing button colors and testing whether you can alter people's emotional states. The incident led to reforms at Facebook and renewed discussion about the ethics of large-scale experimentation.
Informed Consent in the Digital Age
Traditional research ethics require informed consent — participants must know they're in a study and agree to participate. A/B tests at tech companies typically operate without explicit consent. The question is whether the terms of service provide adequate consent, or whether more transparency is needed.
Discussion questions:
1. Where would you draw the line between acceptable A/B testing (testing button colors, layout changes) and unacceptable testing (manipulating emotional content, making it harder to cancel)?
2. Should tech companies be required to disclose which A/B tests they're running? What would the practical consequences be?
3. In Alex's test, power users were negatively affected by the new homepage. If StreamVibe deploys the new layout because the overall average improved, are they treating power users fairly?
Analysis Questions
1. Experimental Design: In Alex's A/B test, what is the explanatory variable? What is the response variable? What are the treatment and control groups? What role does randomization play?
Suggested Answer
- **Explanatory variable:** Homepage layout (old vs. new) — categorical, nominal
- **Response variable:** Average watch time per session — numerical, continuous
- **Treatment group:** Group B (new homepage) — 400,000 users
- **Control group:** Group A (old homepage) — 400,000 users
- **Randomization:** Users are randomly assigned to groups via a hash function on their user ID, ensuring the groups are comparable in all characteristics (age, viewing habits, subscription tier, device type, etc.) except the homepage layout. This allows any observed difference in watch time to be attributed to the layout change rather than to pre-existing differences between the groups.

2. Confounding: Imagine StreamVibe had not randomized the assignment. Instead, they showed the new homepage to all users who signed up after January 1, and kept the old homepage for users who signed up before January 1. Why would this be a terrible experimental design?
Suggested Answer
This would introduce massive confounding. Users who signed up after January 1 (new users) are fundamentally different from users who signed up before (existing users). New users are still exploring the platform, haven't formed viewing habits, and may have different demographics. Existing users have established routines and expectations. Any difference in watch time could be due to these pre-existing differences rather than the homepage layout. The "treatment" and "control" groups would differ in at least two ways — the homepage they see *and* their tenure on the platform. You couldn't tell which factor caused any observed difference.

3. Generalization: Alex's test ran on StreamVibe users — a specific population. Could she generalize the results to, say, a competitor's platform? Why or why not?
Suggested Answer
Not directly. StreamVibe's user base has specific characteristics: they chose StreamVibe over competitors, they have particular content preferences, and they're accustomed to StreamVibe's existing interface. A homepage layout that works for StreamVibe users might not work for users of a platform with different content, a different user base, or different browsing habits. The experiment has strong *internal validity* (it demonstrates a causal effect within StreamVibe) but limited *external validity* (the results may not transfer to other contexts). This is a common trade-off in experimental research.

4. Subgroup Analysis: Alex found that the new homepage helped new users but hurt power users. What should StreamVibe do with this information? Should they deploy the new layout, keep the old one, or do something else entirely?
Suggested Answer
Several approaches are possible:
- **Deploy the new layout only for new users** and keep the old layout for power users. This personalized approach serves each group's needs, though it adds complexity.
- **Redesign the new layout** based on what worked for new users while addressing the pain points for power users. Then run another A/B test on the improved version.
- **Deploy the new layout for everyone** but monitor power user retention closely and iterate. The 3.5% overall improvement is substantial and might be worth the trade-off.

The best answer depends on business priorities — if power users generate more revenue per person, losing them might outweigh gains from new users. This illustrates that statistical significance alone doesn't make the decision; context and business judgment matter too.

5. Ethical Reflection: Alex's A/B test was designed to increase watch time. Is "more watch time" always a good thing for users? When might optimizing for watch time conflict with user well-being?
Suggested Answer
More watch time is good for StreamVibe's revenue but not necessarily for users. Users who watch more might be:
- Discovering content they genuinely enjoy (good)
- Getting sucked into "just one more episode" loops that eat into sleep, exercise, or social time (not good)
- Feeling compelled to watch content they don't truly enjoy because the interface is designed to minimize the "cost" of clicking away (manipulative)

This is the broader tension in attention-economy businesses: the company's primary metric (engagement) doesn't always align with user welfare. Some companies have begun experimenting with "time well spent" metrics — measuring not just how much time users spend, but how satisfied they feel about how they spent that time. This represents a more ethical approach to experimentation.

Key Takeaways from This Case Study
- A/B testing is the scientific method at industrial scale. Companies that test rather than guess make better product decisions — but the tests need to be well-designed.
- Precise questions lead to useful answers. Alex's insistence on defining "engagement" precisely (average watch time per session) prevented endless post-hoc debate about what the results meant.
- Randomization is the engine of causal inference. The hash-based random assignment ensured that any difference in outcomes could be attributed to the homepage change, not to pre-existing user differences.
- Averages can hide important stories. The overall 3.5% improvement masked a +19% gain for new users and a -4% loss for power users. Always look beneath the average.
- Statistical significance is necessary but not sufficient. The result must also be practically significant (worth the cost of implementation) and ethically sound (serving users' interests, not just the company's metrics).
- The power to experiment creates the responsibility to experiment ethically. A/B testing can optimize for user welfare or against it. The same methodology that helps you find the most useful homepage can also help you find the most addictive one. The choice is a human one, not a statistical one.