Case Study: Classifying Data at Scale — When Every Click Becomes Data

The Setup

Every day, the major social media and streaming platforms generate staggering amounts of data. More than 500 hours of video are uploaded to YouTube every minute. Instagram handles over 100 million photos per day. Netflix tracks what its 260+ million subscribers watch, when they pause, when they rewind, and when they abandon a show 12 minutes in.

This is the world Alex Rivera works in at StreamVibe. His job — figuring out whether the new recommendation algorithm actually increases watch time — sounds simple until you realize the sheer complexity of the data he has to wrangle. Every user interaction generates multiple data points, and each one needs to be classified correctly before any analysis can begin.

This case study examines how data classification challenges play out at scale in the tech industry, and why getting it wrong can cost millions of dollars — or worse, erode user trust.

The Data Classification Minefield

Challenge 1: What Counts as "Watching"?

Alex's central metric is "watch time" — a seemingly straightforward continuous numerical variable measured in minutes. But defining it turns out to be surprisingly complicated.

Consider these scenarios:

| Scenario | Watch Time? | Why It's Complicated |
| --- | --- | --- |
| User watches 45 minutes of a movie | 45 min | Clear case |
| User plays a show but falls asleep after 10 minutes; show plays for 3 hours | ? | Is it 10 min (active watching) or 180 min (what the system recorded)? |
| User watches at 2x speed for 30 real minutes (covering 60 minutes of content) | ? | 30 minutes of viewing or 60 minutes of content? |
| User has a show running on a second screen while primarily using their phone | ? | How do you distinguish active viewing from background noise? |
| User watches 5 minutes, pauses, returns 4 hours later, watches 20 more minutes | ? | One session (25 min) or two sessions (5 min + 20 min)? |

Each decision creates a different continuous variable with a different distribution. If Alex's team changed from "total playback time" to "active engagement time" (excluding moments when the user's device detected no interaction for 5+ minutes), average watch time per user might drop by 30-40%. The recommendation algorithm that "increased watch time by 8%" might actually have decreased active engagement.

The data type is the same — continuous numerical, ratio level — but the operational definition changes what the numbers mean. This is why data dictionaries are essential: two teams working with "watch time" could be measuring fundamentally different things.
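The contrast between the two operational definitions can be made concrete with a small sketch. The event log below is a made-up structure (real platforms log far richer telemetry), and `last_interaction` is a hypothetical marker for the user's final detected activity:

```python
from datetime import datetime

# Hypothetical playback events for the "falls asleep" scenario:
# play at 8:00 PM, last detected interaction at 8:10 PM, autoplay
# stops at 11:00 PM.
events = [
    (datetime(2024, 1, 1, 20, 0), "play"),
    (datetime(2024, 1, 1, 20, 10), "last_interaction"),
    (datetime(2024, 1, 1, 23, 0), "stop"),
]

def total_playback_minutes(events):
    """Definition 1: everything between play and stop counts."""
    start = next(t for t, e in events if e == "play")
    end = next(t for t, e in events if e == "stop")
    return (end - start).total_seconds() / 60

def active_engagement_minutes(events):
    """Definition 2: counting stops at the last detected interaction
    (a stand-in for the '5+ minutes with no interaction' rule)."""
    start = next(t for t, e in events if e == "play")
    last_touch = max(t for t, e in events if e in ("play", "last_interaction"))
    return (last_touch - start).total_seconds() / 60

print(total_playback_minutes(events))    # 180.0
print(active_engagement_minutes(events)) # 10.0
```

The same event log yields 180 minutes under one definition and 10 under the other, which is exactly why a data dictionary must pin down which definition a metric uses.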

Challenge 2: The Genre Classification Problem

StreamVibe, like every streaming platform, categorizes content by genre. Genre is a nominal categorical variable. But in practice, genre classification is far messier than it appears in a textbook.

Consider a show that is part comedy, part drama, with science fiction elements. How do you classify it? Options include:

  1. Single label: Assign the "primary" genre only → Comedy
  2. Multiple labels: Allow multiple genres → Comedy, Drama, Sci-Fi
  3. Hierarchical: Use a main genre and subgenres → Drama > Comedy-Drama > Sci-Fi-Adjacent
  4. Weighted: Assign percentages → 50% Comedy, 30% Drama, 20% Sci-Fi
  5. User-perceived: Let users tag it → whatever users think it is

Each approach creates a different variable type:

  • Option 1: Simple nominal variable (one category per show)
  • Option 2: Set-valued variable (multiple categories per show — harder to analyze)
  • Option 3: Hierarchical nominal variable (requires special handling)
  • Option 4: Multiple continuous variables (genre percentages — one per genre)
  • Option 5: Crowd-sourced nominal variable (categories defined by users, not experts)
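As a sketch, the five schemes map onto five different in-memory representations; the show and all labels here are hypothetical:

```python
# One hypothetical show under each of the five genre schemes.
single_label = "Comedy"                                      # Option 1: nominal
multi_label = {"Comedy", "Drama", "Sci-Fi"}                  # Option 2: set-valued
hierarchical = ("Drama", "Comedy-Drama", "Sci-Fi-Adjacent")  # Option 3: root-to-leaf path
weighted = {"Comedy": 0.5, "Drama": 0.3, "Sci-Fi": 0.2}      # Option 4: percentages
user_tags = ["dramedy", "quirky", "space", "dramedy"]        # Option 5: raw user tags

# Even the simple question "is this show a comedy?" gets a different
# answer under each scheme:
print(single_label == "Comedy")          # True
print("Comedy" in multi_label)           # True
print(hierarchical[0] == "Comedy")       # False: the primary genre is Drama
print(weighted.get("Comedy", 0) >= 0.5)  # True, under a 50% threshold
```

The point of the sketch is that downstream code (filters, recommenders, reports) must be written differently for each representation, so the choice of scheme is an analysis decision, not a storage detail.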

Alex's team discovered that their recommendation algorithm performed very differently depending on which genre classification system they used. With single labels, the algorithm tended to create "filter bubbles" — users who watched one comedy got recommended only comedies. With multi-label classification, the algorithm discovered subtler patterns: users who watched comedy-dramas also enjoyed certain documentaries with humorous narration.

Lesson: The same real-world concept (genre) can be operationalized as different variable types, and the choice shapes the analysis dramatically.

Challenge 3: User Engagement — Ordinal, Numerical, or Both?

Alex's boss wants to classify users into engagement tiers. The marketing team proposes:

| Tier | Definition | Assigned Value |
| --- | --- | --- |
| Churned | No activity in 30+ days | 0 |
| Dormant | 1-2 sessions per month | 1 |
| Casual | 3-8 sessions per month | 2 |
| Regular | 9-20 sessions per month | 3 |
| Power User | 21+ sessions per month | 4 |

The marketing team then calculates the "average engagement score" across all users: 2.3. They put this in a board presentation. They track it monthly: 2.3, 2.4, 2.3, 2.5. "User engagement is trending up!"

But is this analysis valid? The engagement tier is ordinal — there's a clear order, but the distances aren't equal. The gap between "Churned" (0 sessions) and "Dormant" (1-2 sessions) might represent a user who just forgot their password. The gap between "Regular" (20 sessions) and "Power User" (21 sessions) is trivially small. Treating these tiers as equally spaced numerical values and calculating an average is a simplification that could mask important patterns.

For example, the "average" could increase from 2.3 to 2.5 in two very different ways:

  • Scenario A: 10% of Dormant users became Casual (widespread small improvement)
  • Scenario B: 5% of Casual users became Power Users, while 3% of Regular users Churned (concentrated gains with concerning losses)

Both might produce the same average engagement score, but they represent very different business situations. Scenario B — where you're gaining a few super-fans but losing your middle tier — might actually be a warning sign disguised as good news.
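A quick sketch makes the effect concrete. The tier counts below are invented for 100 hypothetical users, chosen so that two structurally different shifts land on exactly the same average:

```python
# Hypothetical tier counts (tier value -> number of users), using the
# 0-4 values from the table above. All numbers are illustrative.
baseline   = {0: 10, 1: 20, 2: 30, 3: 30, 4: 10}
scenario_a = {0: 10, 1: 10, 2: 40, 3: 30, 4: 10}  # 10 Dormant -> Casual
scenario_b = {0: 12, 1: 20, 2: 22, 3: 28, 4: 18}  # 8 Casual -> Power, 2 Regular -> Churned

def mean_score(counts):
    """The marketing team's 'average engagement score'."""
    total_users = sum(counts.values())
    return sum(tier * n for tier, n in counts.items()) / total_users

print(mean_score(baseline))    # 2.1
print(mean_score(scenario_a))  # 2.2
print(mean_score(scenario_b))  # 2.2: identical average, very different distribution
```

Scenarios A and B report the same one-number summary even though A is broad improvement and B trades middle-tier losses for a few super-fans; only the full distribution distinguishes them.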

Lesson: Averaging ordinal data can obscure the story that the underlying distribution tells. Knowing the variable type tells you whether the summary statistic is trustworthy.

Challenge 4: Timestamps — The Variable That's Everything at Once

Every user action in StreamVibe's system is stamped with a datetime — when the user clicked play, when they paused, when they searched, when they logged in. The timestamp variable is deceptively versatile:

  • As a continuous variable: Used to calculate duration (time between play and pause) — this is how "watch time" is computed
  • As an ordinal variable: "Morning viewer," "Afternoon viewer," "Late-night viewer" — used for personalized scheduling
  • As a nominal variable: "Weekday" vs. "Weekend" — used to segment behavior
  • As a discrete variable: "Number of sessions per day" — counted by grouping timestamps
  • As the foundation for longitudinal data: The same user's activity tracked over weeks and months

A single raw variable (timestamp) can be transformed into variables of every type depending on the research question. This illustrates a practical skill you'll use throughout your career: raw data often needs to be reshaped and reclassified before analysis. The classification isn't always inherent in the data — sometimes you create it through how you process the data.
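The transformations above can be sketched with Python's standard `datetime` module. The timestamps and the daypart boundaries are illustrative assumptions, not StreamVibe's actual rules:

```python
from datetime import datetime
from collections import Counter

# A few hypothetical play-event timestamps for one user.
timestamps = [
    datetime(2024, 3, 1, 7, 30),   # Friday morning
    datetime(2024, 3, 1, 23, 15),  # Friday late night
    datetime(2024, 3, 2, 14, 0),   # Saturday afternoon
]

# Nominal: weekday vs. weekend (Monday=0 ... Sunday=6).
day_type = ["Weekend" if t.weekday() >= 5 else "Weekday" for t in timestamps]

# Ordinal: daypart buckets (the hour boundaries are arbitrary choices).
def daypart(t):
    if 5 <= t.hour < 12:
        return "Morning"
    if 12 <= t.hour < 18:
        return "Afternoon"
    return "Late-night"

dayparts = [daypart(t) for t in timestamps]

# Discrete: sessions per calendar day, counted by grouping timestamps.
sessions_per_day = Counter(t.date() for t in timestamps)

# Continuous: elapsed minutes between the first two events.
duration_min = (timestamps[1] - timestamps[0]).total_seconds() / 60

print(day_type)      # ['Weekday', 'Weekday', 'Weekend']
print(dayparts)      # ['Morning', 'Late-night', 'Afternoon']
print(duration_min)  # 945.0
```

Four variable types, one raw column: every derived variable here is a processing decision layered on top of the same timestamps.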

The Bigger Picture: Ethics and Classification

Alex's work raises ethical questions that connect to the chapter's theme of "human stories behind the data."

When StreamVibe classifies users into engagement tiers, those classifications affect what users experience. Power Users might get early access to new features. Dormant users might get aggressive email campaigns trying to re-engage them. A user classified as "at risk of churning" might see their subscription price quietly lowered — a form of price discrimination driven by data classification.

The categories aren't neutral. They shape who gets attention, who gets resources, and who gets ignored. A user who stopped watching because they were going through a difficult time might be reclassified from "Regular" to "Dormant" to "Churned" — and eventually lose their viewing history and recommendations when the system cleans out inactive accounts.

Professor Washington sees parallels to his criminal justice research. When an algorithm classifies a defendant's neighborhood as "high crime" (nominal) or assigns a risk score of 8 out of 10 (ordinal or numerical, depending on construction), those classifications have consequences. The data types feel objective and technical, but the act of classification is never truly neutral.

Connection to This Chapter

| Chapter 2 Concept | StreamVibe Illustration |
| --- | --- |
| Operational definition matters | "Watch time" can be defined multiple ways, each producing a different continuous variable |
| Nominal classification is complex | Genre categories require decisions about single vs. multi-label, hierarchy, and user perception |
| Ordinal ≠ numerical | Averaging engagement tier scores can hide important distributional changes |
| Variables can be transformed | A single timestamp becomes continuous, ordinal, nominal, or discrete depending on how it's processed |
| Classification has consequences | User tier labels drive business decisions that affect people's experiences |

Discussion Questions

  1. Netflix famously uses thousands of "micro-genres" (like "Critically Acclaimed Emotional Underdog Movies" or "Dark Scandinavian TV Shows"). How does the granularity of a nominal variable — how many categories it has — affect analysis? What are the trade-offs between 5 broad genres and 2,000 micro-genres?

  2. StreamVibe's engagement tier system assigns numerical values (0-4) to ordinal categories. Under what circumstances might this be a reasonable simplification? Under what circumstances could it lead to bad decisions?

  3. If a streaming platform tracks every second of user behavior, does this raise privacy concerns? How much data about your viewing habits should a company be allowed to collect? Does the type of data (categorical vs. numerical, anonymous vs. identifiable) affect your answer?

  4. Alex needs to decide whether to measure "user satisfaction" as a 1-10 numerical scale, a set of ordered categories (Very Unsatisfied to Very Satisfied), or a simple thumbs-up/thumbs-down. What are the pros and cons of each approach from a data classification perspective?

  5. Compare the data classification challenges in this case study (social media/streaming) to those in Case Study 1 (healthcare). What similarities do you see? What's different about the stakes involved?

Try It Yourself

Choose a social media platform, streaming service, or app you use regularly. Think about all the data it might collect about you.

  1. List at least 10 variables the platform might record about your usage.
  2. Classify each variable as nominal, ordinal, discrete, or continuous.
  3. For each variable, note whether the classification is clear-cut or ambiguous.
  4. Identify at least one variable that could be classified differently depending on how it's defined or processed.
  5. Write a paragraph: How does knowing the types of data collected about you change how you think about the platform?

Sources: YouTube internal statistics (2023, reported at YouTube Official Blog). Instagram press data (Meta Platforms, 2023). Netflix technology blog posts on recommendation systems (2022-2024). This case study uses StreamVibe as a fictional composite example; specific metrics are illustrative, not drawn from any single real platform.