Key Takeaways: Lies, Damn Lies, and Statistics: Ethical Data Practice

One-Sentence Summary

Ethical data practice requires recognizing that statistics can mislead without fabrication (through cherry-picking, Simpson's paradox, and the ecological fallacy), that research integrity depends on distinguishing confirmatory from exploratory analysis (preventing p-hacking and HARKing), that data privacy and informed consent are harder to achieve than most people think, and that every data-driven decision embeds value judgments about whose welfare matters — judgments that demand transparency, accountability, and input from affected communities.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
|---|---|---|
| Simpson's paradox | A trend in aggregated data that reverses when broken into subgroups | Data can tell opposite stories at different levels — always check both |
| Ecological fallacy | Drawing conclusions about individuals from group-level data | Group statistics don't describe individual people |
| Cherry-picking | Selecting data, ranges, or subgroups that support your conclusion | True statistics can tell false stories through selective presentation |
| P-hacking | Trying multiple analyses until a significant result appears | Inflates false positive rates far beyond the nominal alpha level |
| HARKing | Presenting post-hoc discoveries as pre-specified hypotheses | Misrepresents the discovery process and overstates evidence |
| Informed consent | Participants' knowing agreement to participate in research | Respects individual autonomy and protects against exploitation |
| Re-identification risk | The ability to identify individuals from "anonymized" data | 87% of Americans can be identified by birth date, zip code, and gender |
| Fairness impossibility | Equal calibration, equal FPR, and equal FNR cannot all be achieved simultaneously | Every algorithm embeds a value judgment about which fairness matters |

Simpson's Paradox

| Element | Description |
|---|---|
| What it is | A trend that reverses when data is disaggregated into subgroups |
| Classic example | UC Berkeley admissions: women had lower overall admission rates but higher rates in most departments |
| Why it happens | A confounding variable is unevenly distributed across comparison groups |
| Ethical implication | Both the aggregate and disaggregated stories are "true" — choosing which to present is an ethical decision |
| The fix | Always check both aggregate and subgroup data; report both; be transparent about the level of analysis |
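The reversal is easy to reproduce with a small synthetic dataset (a sketch assuming pandas; the numbers are invented to mimic the Berkeley pattern, not the actual admissions figures):

```python
import pandas as pd

# Invented admissions data mimicking the Berkeley pattern: group B
# out-admits group A inside every department, but B applies mostly to
# the competitive department, so B's aggregate rate comes out lower.
rows = (
    [("A", "easy", 1)] * 80 + [("A", "easy", 0)] * 20   # A: 80% in easy
    + [("B", "easy", 1)] * 9 + [("B", "easy", 0)] * 1   # B: 90% in easy
    + [("A", "hard", 1)] * 1 + [("A", "hard", 0)] * 9   # A: 10% in hard
    + [("B", "hard", 1)] * 20 + [("B", "hard", 0)] * 80 # B: 20% in hard
)
df = pd.DataFrame(rows, columns=["group", "dept", "admitted"])

aggregate = df.groupby("group")["admitted"].mean()
by_dept = df.groupby(["dept", "group"])["admitted"].mean()
print(aggregate)  # A: 0.736, B: 0.264 -> A looks better overall
print(by_dept)    # but B is higher in BOTH departments
```

The confounder here is which department each group mostly applies to; reporting only `aggregate` or only `by_dept` tells opposite stories from the same data.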

Questionable Research Practices

| Practice | What It Is | Why It's Wrong | The Fix |
|---|---|---|---|
| P-hacking | Trying multiple analyses until p < 0.05 | Inflates false positive rate (64% with 20 tests) | Pre-register analysis plan |
| HARKing | Presenting post-hoc findings as hypotheses | Misrepresents evidence strength | Label exploratory analyses as such |
| Cherry-picking | Selecting supportive data, ignoring contradictory data | Creates misleading impression from true facts | Report all analyses; justify any restrictions |
| Optional stopping | Checking significance repeatedly and stopping when p < 0.05 | Inflates Type I error beyond nominal alpha | Pre-specify sample size |
| Selective reporting | Reporting only significant results | File drawer problem; biases the literature | Report all results, including null findings |
| Flexible outlier removal | Removing outliers only when they hurt results | Distorts data to match hypothesis | Pre-specify outlier criteria |
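The 64% figure can be checked directly: under the null hypothesis, a continuous test's p-value is uniform on [0, 1], so the chance that at least one of 20 independent tests comes up "significant" is 1 − 0.95^20 ≈ 0.64. A quick simulation (a sketch assuming NumPy) confirms the arithmetic:

```python
import numpy as np

alpha, n_tests, n_experiments = 0.05, 20, 100_000
rng = np.random.default_rng(0)

# Under the null, p-values from a continuous test are Uniform(0, 1),
# so we can simulate them directly.
p_values = rng.uniform(size=(n_experiments, n_tests))

# A study is a "false positive" if ANY of its 20 tests hits p < alpha.
family_error = (p_values < alpha).any(axis=1).mean()

analytic = 1 - (1 - alpha) ** n_tests
print(f"simulated: {family_error:.3f}, analytic: {analytic:.3f}")  # both ~0.64
```

This is exactly why pre-registration matters: a researcher who tries 20 analyses and reports the one that "worked" faces a roughly 64% family-wise error rate, not the nominal 5%.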

Ethical Frameworks for Data Practice

| Framework | Core Question | Applied to James's Algorithm |
|---|---|---|
| Utilitarian | Which choice produces the greatest total good? | Use if total errors decrease, even if one group bears more cost |
| Rights-based | Does this respect every individual's fundamental rights? | Reject if it violates the right to individual assessment |
| Care ethics | What response best serves the most vulnerable? | Modify to protect communities historically harmed by the justice system |

Research Ethics Timeline

| Year | Event | Significance |
|---|---|---|
| 1932 | Tuskegee Syphilis Study begins | 399 Black men denied treatment for 40 years |
| 1972 | Tuskegee exposed by AP journalist Jean Heller | Led to public outrage and policy reform |
| 1974 | National Research Act | Created the National Commission for human subjects protection |
| 1979 | Belmont Report | Established three principles: respect for persons, beneficence, justice |
| 1981 | 45 CFR 46 (basis of the 1991 Common Rule) | Required IRB review for federally funded research |
| 2014 | Facebook emotional contagion study | Manipulated 689K users' emotions without consent |
| 2015 | Open Science Collaboration | Only 36% of psychology findings replicated |
| 2018 | GDPR takes effect | EU data privacy regulation with major penalties |
| 2020 | CCPA takes effect | California data privacy regulation |

Data Privacy

| Concept | Key Point |
|---|---|
| Re-identification | Removing names is not enough; date of birth + zip code + gender can identify 87% of Americans |
| Netflix attack | Narayanan and Shmatikov re-identified "anonymous" Netflix Prize users by linking their movie ratings to public IMDb reviews |
| GDPR | Opt-in consent, right to deletion, penalties up to 4% of global revenue |
| CCPA | Opt-out model, right to know and delete, up to $7,500 per intentional violation |
| The lesson | Any dataset with enough variables can potentially be linked to external information |
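The quasi-identifier risk can be measured on any table by counting how many records are unique on the supposedly harmless fields (a k-anonymity check; a sketch assuming pandas, with invented records):

```python
import pandas as pd

# Invented "anonymized" records: names removed, quasi-identifiers kept.
df = pd.DataFrame(
    [("1980-01-01", "02139", "F"),
     ("1980-01-01", "02139", "F"),  # shares its combination with the row above
     ("1975-05-05", "94110", "M"),
     ("1990-07-07", "60601", "F"),
     ("1985-03-03", "10001", "M")],
    columns=["birth_date", "zip", "gender"],
)

# Size of each quasi-identifier group; a record with a group size of 1
# is unique, i.e. k = 1 and trivially re-identifiable by linkage.
counts = df.value_counts(["birth_date", "zip", "gender"])
unique_records = int((counts == 1).sum())
frac = unique_records / len(df)
print(f"{unique_records} of {len(df)} records are unique on these fields")
```

Here 3 of 5 records are unique on just three fields; Sweeney's 87% figure is the same calculation run against US census data.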

The "Lying with True Statistics" Checklist

Ask yourself before every analysis:

| Check | Question |
|---|---|
| Cherry-picking | Would my conclusion change if I used all the available data? |
| Denominator games | Am I reporting both absolute and relative numbers? |
| Aggregation effects | Could this trend reverse at a different level of analysis? |
| Survivorship bias | Am I only looking at the "winners"? |
| Correlation → causation | Am I implying a causal relationship that my study design can't support? |
| Missing context | Would someone who disagreed with me say I presented the data fairly? |
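The denominator-games check is plain arithmetic: a risk moving from 1 in 10,000 to 2 in 10,000 is honestly described as both a "100% increase" (relative) and one extra case per 10,000 people (absolute). A headline that reports only one of the two is cherry-picking a denominator. Hypothetical numbers:

```python
# Hypothetical risk change: 1 in 10,000 -> 2 in 10,000.
baseline, new = 1 / 10_000, 2 / 10_000

relative_change = (new - baseline) / baseline  # 1.0, i.e. "risk doubled!"
absolute_change = new - baseline               # 0.0001: one extra case per 10,000

print(f"relative: {relative_change:+.0%}, absolute: {absolute_change:+.4%}")
```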

Perspective-Taking Framework

For any data-driven decision, consider:

| Stakeholder | Question |
|---|---|
| The analyst | What are my incentives? Am I under pressure? |
| The decision-maker | Who is using this? What decisions will they make? |
| The subjects | Whose data is this? Did they consent? |
| The affected community | Who will be impacted? Were they consulted? |
| The absent voices | Who is NOT in the data? |
| Future users | How might this data be used in ways I didn't intend? |

Personal Code of Statistical Ethics (Template)

| Domain | Principle |
|---|---|
| Collecting data | Obtain informed consent; be transparent about purpose |
| Analyzing data | Pre-register confirmatory analyses; report all results |
| Reporting results | Include effect sizes and CIs; acknowledge limitations |
| Making decisions | Consider who might be harmed; seek affected perspectives |
| Will not | Cherry-pick; present correlations as causal; suppress null results |

Key Python Code

Simpson's Paradox Detector

```python
import pandas as pd


def check_simpsons_paradox(df, outcome, group, stratify_by):
    """
    Compare the aggregate trend between two groups with the trend
    inside each stratum.

    Returns True if the aggregate difference reverses direction in a
    majority of strata — a Simpson's paradox warning sign. Assumes
    `group` has exactly two levels.
    """
    # Aggregate difference between the two group levels
    agg = df.groupby(group)[outcome].mean()
    groups = sorted(agg.index)
    if len(groups) != 2:
        raise ValueError("`group` must have exactly two levels")
    agg_diff = agg[groups[1]] - agg[groups[0]]

    # Count strata whose within-stratum difference has the opposite sign
    strata = df[stratify_by].unique()
    reversal_count = 0
    for stratum in strata:
        subset = df[df[stratify_by] == stratum]
        strat_means = subset.groupby(group)[outcome].mean()
        strat_diff = strat_means[groups[1]] - strat_means[groups[0]]
        if agg_diff * strat_diff < 0:  # opposite signs => reversal
            reversal_count += 1

    return reversal_count > len(strata) / 2
```

Common Mistakes

| Mistake | Correction |
|---|---|
| "The aggregate trend tells the whole story" | Always check for Simpson's paradox by stratifying |
| "The data is anonymized so privacy is protected" | Re-identification is possible with surprisingly few fields |
| "I found it in the data, so it must be real" | Exploratory findings need confirmatory replication |
| "p < 0.05 after my third analysis" | Multiple testing inflates false positive rates |
| "The algorithm is objective, so it's fair" | Algorithms inherit the biases in their training data |
| "Correlation = causation in this case" | Study design determines whether causal claims are justified |
| "More accurate overall = better for everyone" | Aggregate accuracy can mask group-level unfairness |
| "I removed names, so consent doesn't matter" | Using data for purposes beyond the original consent is ethically problematic |

Connections

| Connection | Details |
|---|---|
| Ch.4 (Study design) | Informed consent and IRB introduced; deepened here with Tuskegee and modern cases |
| Ch.13 (Hypothesis testing) | P-hacking introduced; deepened here as ethical violation, not just methodological error |
| Ch.17 (Power and effect sizes) | Publication bias and replication crisis; deepened here as systemic ethical failure |
| Ch.22 (Correlation) | Correlation vs. causation; reframed here as ethical imperative, not just statistical principle |
| Ch.23 (Multiple regression) | Simpson's paradox introduced with kidney stones; given full ethical treatment here |
| Ch.25 (Communication) | Misleading graphs; reframed here as ethical violations, not just technical errors |
| Ch.28 (Journey continues) | Personal code of ethics carries forward into all future data work |