Key Takeaways: Lies, Damn Lies, and Statistics: Ethical Data Practice

One-Sentence Summary

Ethical data practice requires recognizing that statistics can mislead without fabrication (through cherry-picking, Simpson's paradox, and the ecological fallacy), that research integrity depends on distinguishing confirmatory from exploratory analysis (preventing p-hacking and HARKing), that data privacy and informed consent are harder to achieve than most people think, and that every data-driven decision embeds value judgments about whose welfare matters — judgments that demand transparency, accountability, and input from affected communities.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
|---|---|---|
| Simpson's paradox | A trend in aggregated data that reverses when broken into subgroups | Data can tell opposite stories at different levels — always check both |
| Ecological fallacy | Drawing conclusions about individuals from group-level data | Group statistics don't describe individual people |
| Cherry-picking | Selecting data, ranges, or subgroups that support your conclusion | True statistics can tell false stories through selective presentation |
| P-hacking | Trying multiple analyses until a significant result appears | Inflates false positive rates far beyond the nominal alpha level |
| HARKing | Presenting post-hoc discoveries as pre-specified hypotheses | Misrepresents the discovery process and overstates evidence |
| Informed consent | Participants' knowing agreement to participate in research | Respects individual autonomy and protects against exploitation |
| Re-identification risk | The ability to identify individuals from "anonymized" data | 87% of Americans can be identified by birth date, zip code, and gender |
| Fairness impossibility | Equal calibration, equal FPR, and equal FNR cannot all be achieved simultaneously | Every algorithm embeds a value judgment about which fairness matters |

Simpson's Paradox

| Element | Description |
|---|---|
| What it is | A trend that reverses when data is disaggregated into subgroups |
| Classic example | UC Berkeley admissions: women had lower overall admission rates but higher rates in most departments |
| Why it happens | A confounding variable is unevenly distributed across comparison groups |
| Ethical implication | Both the aggregate and disaggregated stories are "true" — choosing which to present is an ethical decision |
| The fix | Always check both aggregate and subgroup data; report both; be transparent about the level of analysis |
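The reversal is easy to reproduce with a small synthetic dataset (a sketch assuming pandas; the numbers are invented to mimic the Berkeley pattern, not the actual admissions figures):

```python
import pandas as pd

# Invented admissions data mimicking the Berkeley pattern: group B
# out-admits group A inside every department, but B applies mostly to
# the competitive department, so B's aggregate rate comes out lower.
rows = (
    [("A", "easy", 1)] * 80 + [("A", "easy", 0)] * 20   # A: 80% in easy
    + [("B", "easy", 1)] * 9 + [("B", "easy", 0)] * 1   # B: 90% in easy
    + [("A", "hard", 1)] * 1 + [("A", "hard", 0)] * 9   # A: 10% in hard
    + [("B", "hard", 1)] * 20 + [("B", "hard", 0)] * 80 # B: 20% in hard
)
df = pd.DataFrame(rows, columns=["group", "dept", "admitted"])

aggregate = df.groupby("group")["admitted"].mean()
by_dept = df.groupby(["dept", "group"])["admitted"].mean()
print(aggregate)  # A: 0.736, B: 0.264 -> A looks better overall
print(by_dept)    # but B is higher in BOTH departments
```

The confounder here is which department each group mostly applies to; reporting only `aggregate` or only `by_dept` tells opposite stories from the same data.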

Questionable Research Practices

| Practice | What It Is | Why It's Wrong | The Fix |
|---|---|---|---|
| P-hacking | Trying multiple analyses until p < 0.05 | Inflates false positive rate (64% with 20 tests) | Pre-register analysis plan |
| HARKing | Presenting post-hoc findings as hypotheses | Misrepresents evidence strength | Label exploratory analyses as such |
| Cherry-picking | Selecting supportive data, ignoring contradictory data | Creates misleading impression from true facts | Report all analyses; justify any restrictions |
| Optional stopping | Checking significance repeatedly and stopping when p < 0.05 | Inflates Type I error beyond nominal alpha | Pre-specify sample size |
| Selective reporting | Reporting only significant results | File drawer problem; biases the literature | Report all results, including null findings |
| Flexible outlier removal | Removing outliers only when they hurt results | Distorts data to match hypothesis | Pre-specify outlier criteria |
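The 64% figure can be checked directly: under the null hypothesis, a continuous test's p-value is uniform on [0, 1], so the chance that at least one of 20 independent tests comes up "significant" is 1 − 0.95^20 ≈ 0.64. A quick simulation (a sketch assuming NumPy) confirms the arithmetic:

```python
import numpy as np

alpha, n_tests, n_experiments = 0.05, 20, 100_000
rng = np.random.default_rng(0)

# Under the null, p-values from a continuous test are Uniform(0, 1),
# so we can simulate them directly.
p_values = rng.uniform(size=(n_experiments, n_tests))

# A study is a "false positive" if ANY of its 20 tests hits p < alpha.
family_error = (p_values < alpha).any(axis=1).mean()

analytic = 1 - (1 - alpha) ** n_tests
print(f"simulated: {family_error:.3f}, analytic: {analytic:.3f}")  # both ~0.64
```

This is exactly why pre-registration matters: a researcher who tries 20 analyses and reports the one that "worked" faces a roughly 64% family-wise error rate, not the nominal 5%.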

Ethical Frameworks for Data Practice

| Framework | Core Question | Applied to James's Algorithm |
|---|---|---|
| Utilitarian | Which choice produces the greatest total good? | Use if total errors decrease, even if one group bears more cost |
| Rights-based | Does this respect every individual's fundamental rights? | Reject if it violates the right to individual assessment |
| Care ethics | What response best serves the most vulnerable? | Modify to protect communities historically harmed by the justice system |

Research Ethics Timeline

| Year | Event | Significance |
|---|---|---|
| 1932 | Tuskegee Syphilis Study begins | 399 Black men denied treatment for 40 years |
| 1972 | Tuskegee exposed by AP journalist Jean Heller | Led to public outrage and policy reform |
| 1974 | National Research Act | Created the National Commission for human subjects protection |
| 1979 | Belmont Report | Established three principles: respect for persons, beneficence, justice |
| 1981 | 45 CFR 46 (basis of the 1991 Common Rule) | Required IRB review for federally funded research |
| 2014 | Facebook emotional contagion study | Manipulated 689K users' emotions without consent |
| 2015 | Open Science Collaboration | Only 36% of psychology findings replicated |
| 2018 | GDPR takes effect | EU data privacy regulation with major penalties |
| 2020 | CCPA takes effect | California data privacy regulation |

Data Privacy

| Concept | Key Point |
|---|---|
| Re-identification | Removing names is not enough; date of birth + zip code + gender can identify 87% of Americans |
| Netflix attack | Narayanan and Shmatikov re-identified "anonymous" Netflix Prize users by linking their movie ratings to public IMDb reviews |
| GDPR | Opt-in consent, right to deletion, penalties up to 4% of global revenue |
| CCPA | Opt-out model, right to know and delete, up to $7,500 per intentional violation |
| The lesson | Any dataset with enough variables can potentially be linked to external information |
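The quasi-identifier risk can be measured on any table by counting how many records are unique on the supposedly harmless fields (a k-anonymity check; a sketch assuming pandas, with invented records):

```python
import pandas as pd

# Invented "anonymized" records: names removed, quasi-identifiers kept.
df = pd.DataFrame(
    [("1980-01-01", "02139", "F"),
     ("1980-01-01", "02139", "F"),  # shares its combination with the row above
     ("1975-05-05", "94110", "M"),
     ("1990-07-07", "60601", "F"),
     ("1985-03-03", "10001", "M")],
    columns=["birth_date", "zip", "gender"],
)

# Size of each quasi-identifier group; a record with a group size of 1
# is unique, i.e. k = 1 and trivially re-identifiable by linkage.
counts = df.value_counts(["birth_date", "zip", "gender"])
unique_records = int((counts == 1).sum())
frac = unique_records / len(df)
print(f"{unique_records} of {len(df)} records are unique on these fields")
```

Here 3 of 5 records are unique on just three fields; Sweeney's 87% figure is the same calculation run against US census data.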

The "Lying with True Statistics" Checklist

Ask yourself before every analysis:

| Check | Question |
|---|---|
| Cherry-picking | Would my conclusion change if I used all the available data? |
| Denominator games | Am I reporting both absolute and relative numbers? |
| Aggregation effects | Could this trend reverse at a different level of analysis? |
| Survivorship bias | Am I only looking at the "winners"? |
| Correlation → causation | Am I implying a causal relationship that my study design can't support? |
| Missing context | Would someone who disagreed with me say I presented the data fairly? |
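The denominator-games check is plain arithmetic: a risk moving from 1 in 10,000 to 2 in 10,000 is honestly described as both a "100% increase" (relative) and one extra case per 10,000 people (absolute). A headline that reports only one of the two is cherry-picking a denominator. Hypothetical numbers:

```python
# Hypothetical risk change: 1 in 10,000 -> 2 in 10,000.
baseline, new = 1 / 10_000, 2 / 10_000

relative_change = (new - baseline) / baseline  # 1.0, i.e. "risk doubled!"
absolute_change = new - baseline               # 0.0001: one extra case per 10,000

print(f"relative: {relative_change:+.0%}, absolute: {absolute_change:+.4%}")
```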

Perspective-Taking Framework

For any data-driven decision, consider:

| Stakeholder | Question |
|---|---|
| The analyst | What are my incentives? Am I under pressure? |
| The decision-maker | Who is using this? What decisions will they make? |
| The subjects | Whose data is this? Did they consent? |
| The affected community | Who will be impacted? Were they consulted? |
| The absent voices | Who is NOT in the data? |
| Future users | How might this data be used in ways I didn't intend? |

Personal Code of Statistical Ethics (Template)

| Domain | Principle |
|---|---|
| Collecting data | Obtain informed consent; be transparent about purpose |
| Analyzing data | Pre-register confirmatory analyses; report all results |
| Reporting results | Include effect sizes and CIs; acknowledge limitations |
| Making decisions | Consider who might be harmed; seek affected perspectives |
| Will not | Cherry-pick; present correlations as causal; suppress null results |

Key Python Code

Simpson's Paradox Detector

```python
import pandas as pd


def check_simpsons_paradox(df, outcome, group, stratify_by):
    """
    Compare the aggregate trend between two groups with the trend
    inside each stratum.

    Returns True if the aggregate difference reverses direction in a
    majority of strata — a Simpson's paradox warning sign. Assumes
    `group` has exactly two levels.
    """
    # Aggregate difference between the two group levels
    agg = df.groupby(group)[outcome].mean()
    groups = sorted(agg.index)
    if len(groups) != 2:
        raise ValueError("`group` must have exactly two levels")
    agg_diff = agg[groups[1]] - agg[groups[0]]

    # Count strata whose within-stratum difference has the opposite sign
    strata = df[stratify_by].unique()
    reversal_count = 0
    for stratum in strata:
        subset = df[df[stratify_by] == stratum]
        strat_means = subset.groupby(group)[outcome].mean()
        strat_diff = strat_means[groups[1]] - strat_means[groups[0]]
        if agg_diff * strat_diff < 0:  # opposite signs => reversal
            reversal_count += 1

    return reversal_count > len(strata) / 2
```

Common Mistakes

| Mistake | Correction |
|---|---|
| "The aggregate trend tells the whole story" | Always check for Simpson's paradox by stratifying |
| "The data is anonymized so privacy is protected" | Re-identification is possible with surprisingly few fields |
| "I found it in the data, so it must be real" | Exploratory findings need confirmatory replication |
| "p < 0.05 after my third analysis" | Multiple testing inflates false positive rates |
| "The algorithm is objective, so it's fair" | Algorithms inherit the biases in their training data |
| "Correlation = causation in this case" | Study design determines whether causal claims are justified |
| "More accurate overall = better for everyone" | Aggregate accuracy can mask group-level unfairness |
| "I removed names, so consent doesn't matter" | Using data for purposes beyond the original consent is ethically problematic |

Connections

| Connection | Details |
|---|---|
| Ch.4 (Study design) | Informed consent and IRB introduced; deepened here with Tuskegee and modern cases |
| Ch.13 (Hypothesis testing) | P-hacking introduced; deepened here as ethical violation, not just methodological error |
| Ch.17 (Power and effect sizes) | Publication bias and replication crisis; deepened here as systemic ethical failure |
| Ch.22 (Correlation) | Correlation vs. causation; reframed here as ethical imperative, not just statistical principle |
| Ch.23 (Multiple regression) | Simpson's paradox introduced with kidney stones; given full ethical treatment here |
| Ch.25 (Communication) | Misleading graphs; reframed here as ethical violations, not just technical errors |
| Ch.28 (Journey continues) | Personal code of ethics carries forward into all future data work |