The AI Ethics Checklist: 20 Questions to Ask Before Deploying

In January 2009, US Airways Flight 1549 struck a flock of geese shortly after takeoff from LaGuardia Airport. Captain Chesley "Sully" Sullenberger successfully landed the plane on the Hudson River, saving all 155 people on board. What made that outcome possible was not just Sully's skill — it was decades of aviation safety culture built on checklists. Pilots use pre-flight checklists, takeoff checklists, emergency checklists, and landing checklists. These checklists exist because aviation learned, through catastrophic failures, that even brilliant experts make preventable mistakes under pressure.

Medicine learned the same lesson. Surgeon Atul Gawande demonstrated in The Checklist Manifesto that a simple surgical safety checklist reduced deaths and complications by over 30% in hospitals worldwide. Not because surgeons did not know the steps — but because systematically verifying each one caught errors that informal processes missed.

AI deployment is at the stage where aviation was before checklists became mandatory: we have powerful technology, increasing complexity, real consequences for failure, and an inconsistent patchwork of safety practices. An AI ethics checklist will not solve every problem, but it forces teams to ask critical questions before a system causes harm rather than after.

Here are 20 questions every team should answer before deploying an AI system, organized into four categories. For each question, we explain why it matters, what goes wrong without it, and a practical tip for addressing it.

Category 1: Data and Training (Questions 1-5)

The ethical foundation of any AI system is built during data collection and model training. Decisions made here propagate through every downstream use.

1. Where did the training data come from, and do we have the right to use it?

Why it matters: Data provenance — knowing the origin, collection method, and legal status of your training data — is the bedrock of responsible AI. Models trained on improperly obtained data create legal liability, ethical violations, and reputational risk.

What goes wrong without it: In 2023, multiple class-action lawsuits were filed against generative AI companies for training on copyrighted works without permission. Artists, writers, and coders found their work reproduced by AI systems that had ingested it without consent or compensation. Companies that could not document their data provenance faced the most severe legal and reputational consequences.

Practical tip: Create a data card or data sheet for every dataset used in training. Document the source, collection date, licensing terms, any known limitations, and the chain of custody. If you cannot trace a dataset to a legitimate source with clear usage rights, do not use it.

Why it matters: Consent is not just a legal requirement under regulations like GDPR — it is a fundamental ethical principle. People who shared their medical records for research did not consent to those records training a commercial AI. Photos posted to social media were not necessarily offered for facial recognition training.

What goes wrong without it: Clearview AI scraped billions of photos from social media to build a facial recognition database without the knowledge or consent of anyone in those photos. The result was regulatory fines, bans in multiple countries, and a fundamentally adversarial relationship with the public.

Practical tip: Map every data source to its consent basis. For each source, answer: "If the people in this data knew exactly how we are using it, would they be comfortable with that?" If the answer is no, reconsider.

3. Does the training data represent the population the system will serve?

Why it matters: AI systems learn patterns from training data. If that data overrepresents some groups and underrepresents others, the model's performance will be uneven — and often worst for already-marginalized populations.

What goes wrong without it: Early pulse oximeters were calibrated primarily on lighter-skinned patients, leading to inaccurate readings for darker-skinned patients. Similarly, dermatology AI systems trained predominantly on images of skin conditions on lighter skin performed poorly at diagnosing conditions on darker skin, potentially delaying treatment for the patients who could benefit most.

Practical tip: Conduct a demographic audit of your training data. Compare the distribution of key demographic variables (age, gender, race, geography, language, socioeconomic status) in your data to the distribution in your target population. Document gaps and their potential impact.

4. Have we tested for bias in the training data and model outputs?

Why it matters: Even representative data can contain historical biases. If your training data reflects a world where certain groups were systematically disadvantaged, your model will learn and perpetuate those disadvantages.

What goes wrong without it: Amazon built an AI recruiting tool trained on a decade of hiring data. Because the tech industry had historically hired predominantly men, the system learned to penalize resumes containing words like "women's" (as in "women's chess club") and downgraded graduates of all-women's colleges. Amazon scrapped the project, but only after significant development investment and reputational damage.

Practical tip: Use bias detection tools (such as IBM AI Fairness 360, Google What-If Tool, or Microsoft Fairlearn) to test model outputs across protected categories. Test for disparate impact: does the model's error rate, approval rate, or recommendation quality differ significantly across demographic groups?

5. How is personal and sensitive data protected throughout the pipeline?

Why it matters: AI systems often require large datasets that may contain personally identifiable information (PII), health records, financial data, or other sensitive information. Protection is needed not just in storage but throughout the entire pipeline — collection, processing, training, inference, and model storage (since models can sometimes "memorize" training examples).

What goes wrong without it: Researchers have demonstrated that large language models can be prompted to reveal training data, including personal information, API keys, and private documents that appeared in the training corpus. Models trained on sensitive data without proper safeguards become a persistent privacy vulnerability.

Practical tip: Implement privacy-preserving techniques appropriate to your risk level: anonymization, differential privacy, federated learning, or data minimization. Test whether the trained model can be prompted or attacked to reveal training data. Have your privacy practices reviewed by someone with data protection expertise.

Category 2: Model Development (Questions 6-10)

How you build, test, and validate your model determines whether it behaves as intended across all the situations it will encounter.

6. What fairness metrics are we using, and why those specific ones?

Why it matters: "Fairness" is not a single concept — it is a family of metrics that are often mathematically incompatible. Demographic parity (equal approval rates across groups), equalized odds (equal error rates), and calibration (equal predictive accuracy) cannot all be satisfied simultaneously in most realistic scenarios. You must choose which definition of fairness is most appropriate for your context.

What goes wrong without it: The COMPAS recidivism prediction system used in US courts illustrates this tension. ProPublica found it had higher false positive rates for Black defendants (wrongly predicting they would reoffend). Northpointe, the system's creator, argued the system was calibrated equally across races (similar prediction accuracy). Both were technically correct — they were using different fairness metrics. The failure was not selecting one metric over another; it was failing to explicitly choose, justify, and communicate which definition of fairness was being applied.

Practical tip: Document your fairness metric choices and justifications in writing. Involve domain experts and affected community members in choosing which definition of fairness is most appropriate. Acknowledge tradeoffs openly rather than claiming the system is simply "fair."

7. Can we explain how the model makes its decisions?

Why it matters: Interpretability is essential for trust, debugging, regulatory compliance, and identifying bias. If you cannot explain why your model made a specific decision, you cannot meaningfully audit it, and individuals affected by its decisions cannot understand or challenge them.

What goes wrong without it: In healthcare, opaque AI diagnostic systems have made recommendations that clinicians could not evaluate or override because no explanation was provided. This undermines clinical judgment, potentially harms patients, and creates liability when the AI is wrong.

Practical tip: Use interpretability techniques appropriate to your model: feature importance for tree-based models, SHAP values for complex models, attention visualization for transformers. For high-stakes decisions (medical, legal, financial), prioritize inherently interpretable models or ensure post-hoc explanations are available for every individual decision.

8. Have we identified and tested edge cases?

Why it matters: AI systems encounter situations in deployment that were rare or absent in training data. Edge cases — unusual inputs, adversarial conditions, distribution shifts — are where most real-world failures occur.

What goes wrong without it: Self-driving car systems have struggled with edge cases that humans handle intuitively: unusual road markings, construction zones, emergency vehicles approaching from unexpected directions, and pedestrians in unusual situations (a person in a wheelchair, a child chasing a ball). These edge cases are rare in training data but critical in deployment.

Practical tip: Systematically brainstorm edge cases with cross-functional teams, including people with diverse life experiences who may identify scenarios the development team would not think of. Create a test suite of edge cases and monitor model performance on them specifically. Update this suite as new edge cases are discovered in production.

9. Does the model perform equitably across all subgroups it will serve?

Why it matters: A model with 95% overall accuracy may have 99% accuracy for majority populations and 70% accuracy for minority populations. Aggregate performance metrics can mask severe disparities.

What goes wrong without it: Gender Shades research by Joy Buolamwini and Timnit Gebru found that commercial facial recognition systems had error rates below 1% for lighter-skinned men but error rates up to 35% for darker-skinned women. The overall accuracy numbers looked good — the disaggregated numbers revealed a crisis.

Practical tip: Always disaggregate performance metrics by relevant subgroups. Do not just report overall accuracy, precision, and recall — report them for every demographic group, geographic region, language, and use case segment your system serves. Set minimum performance thresholds for each subgroup, not just overall.

10. Has the model been tested against adversarial attacks?

Why it matters: Adversarial testing — deliberately trying to make the model fail, produce harmful outputs, or behave in unintended ways — reveals vulnerabilities that standard testing misses. If your team does not try to break the system, someone else will.

What goes wrong without it: Shortly after Microsoft released the Tay chatbot on Twitter in 2016, users deliberately fed it racist and offensive content, exploiting the lack of adversarial safeguards to turn the bot into a platform for hate speech within hours. More recently, prompt injection attacks on LLM-powered applications have allowed users to bypass safety filters, extract system prompts, and manipulate AI assistants into performing unintended actions.

Practical tip: Conduct red team exercises before deployment. Hire external security researchers or use automated adversarial testing tools. Test for prompt injection, data poisoning, model inversion, and membership inference attacks. Make adversarial testing a recurring practice, not a one-time event.

Category 3: Deployment (Questions 11-15)

Deployment is where your model meets the real world. The controls, monitoring, and processes you put in place determine whether problems are caught early or spiral into harm.

11. Is there meaningful human oversight for high-stakes decisions?

Why it matters: "Human in the loop" is often cited as a safeguard, but it only works if the human has the authority, information, time, and training to actually override the AI. Rubber-stamping AI recommendations is not oversight.

What goes wrong without it: Studies of AI-assisted decision-making show that humans tend to over-rely on AI recommendations, a phenomenon called "automation bias." In one study, radiologists using AI assistance actually missed more cancers than radiologists working without AI when the AI made errors — the AI's confidence overrode the doctors' own judgment.

Practical tip: Design the human oversight process so that reviewers see relevant information before seeing the AI's recommendation (to reduce anchoring bias). Set thresholds for mandatory human review (e.g., all decisions above a certain confidence or consequence level). Track how often humans override the AI and investigate cases where they never do.

12. Is there an appeals or contestation process for people affected by the system?

Why it matters: No AI system is perfect. People affected by AI decisions — job applicants, loan seekers, content creators, accused individuals — need a way to challenge decisions they believe are wrong.

What goes wrong without it: Automated content moderation systems on social media platforms have removed legitimate speech, suspended accounts without explanation, and disproportionately affected certain communities. When there is no meaningful appeals process, affected individuals have no recourse, and the system's errors go uncorrected.

Practical tip: Create a clear, accessible process for individuals to request human review of AI decisions. Set response time commitments. Track appeal outcomes and use them to identify systematic errors in the model.

13. Do we have a monitoring plan for post-deployment performance?

Why it matters: Model performance degrades over time due to data drift, concept drift, and changing user behavior. A model that performs well at launch may perform poorly six months later.

What goes wrong without it: A credit scoring model trained on pre-pandemic data would have performed poorly during COVID-19 as economic behavior shifted dramatically. Without monitoring, the degradation would go undetected until default rates spiked.

Practical tip: Define key performance indicators (KPIs) and set alert thresholds before deployment. Monitor performance disaggregated by subgroup. Establish a regular review cadence (weekly, monthly) and define triggers for model retraining or retirement.

14. Do we have an incident response plan?

Why it matters: When an AI system causes harm — a discriminatory decision, a privacy breach, a dangerous recommendation — the response must be swift, structured, and transparent. Ad hoc responses to AI incidents increase harm and erode trust.

What goes wrong without it: When a ride-sharing company's self-driving test vehicle struck and killed a pedestrian in 2018, the lack of a clear incident response protocol contributed to delays in notification, investigation, and systemic correction.

Practical tip: Create an AI incident response plan modeled on cybersecurity incident response frameworks. Define severity levels, notification chains, investigation procedures, and communication templates. Run tabletop exercises simulating AI incidents before they happen.

15. Are users and affected parties clearly informed they are interacting with AI?

Why it matters: Transparency about AI involvement is both an ethical obligation and, increasingly, a legal one. People cannot give informed consent or properly calibrate their trust if they do not know AI is making or influencing decisions about them.

What goes wrong without it: Google Duplex, an AI system that could make phone calls to book appointments, initially operated without identifying itself as AI. The backlash was swift — people felt deceived and manipulated. Google subsequently added disclosure. Regulations like the EU AI Act now require disclosure for many AI interactions.

Practical tip: Disclose AI involvement clearly and proactively. Do not bury it in terms of service. For conversational AI, state upfront that the user is talking to an AI. For decision-support systems, explain what role AI played in the decision. Use plain language, not technical jargon.

Category 4: Governance (Questions 16-20)

Governance ensures that ethical AI is not a one-time effort but an ongoing organizational commitment with clear accountability.

16. Who is accountable when the system causes harm?

Why it matters: Diffuse responsibility is one of the biggest risks in AI deployment. When responsibility is spread across data scientists, product managers, executives, and third-party vendors, it often means nobody takes ownership when things go wrong.

What goes wrong without it: In many AI-related scandals, organizations have deflected responsibility — blaming the algorithm, the training data, or third-party providers. This accountability void means problems are not fixed, affected individuals are not compensated, and the same mistakes recur.

Practical tip: Assign a named individual (not a committee) as the accountable owner for each AI system. This person should have the authority to pause or shut down the system if necessary. Document accountability in writing and make it part of the organizational chart.

17. Is the system thoroughly documented?

Why it matters: Documentation is the foundation of accountability, auditability, and organizational learning. Without documentation, the rationale behind design decisions is lost when team members leave, auditors cannot evaluate the system, and future teams cannot learn from past mistakes.

What goes wrong without it: Organizations that deploy AI systems without proper documentation (model cards, data sheets, decision logs) find themselves unable to answer regulators' questions, unable to reproduce results, and unable to diagnose problems when they emerge months or years later.

Practical tip: At minimum, create a model card (documenting model purpose, training data, performance metrics, limitations, and ethical considerations) and a data sheet (documenting data sources, collection methods, and known issues) for every deployed model. Use templates from Google's Model Cards or Microsoft's Datasheets for Datasets as starting points.

18. Is there a schedule for regular audits?

Why it matters: One-time ethics reviews are insufficient. The world changes, the data changes, the user population changes, regulations change, and community standards change. Regular audits catch drift before it causes harm.

What goes wrong without it: A hiring algorithm that was fair at deployment may become discriminatory over time as the demographics of the applicant pool shift or as societal norms around hiring practices evolve. Without regular audits, these changes go undetected.

Practical tip: Establish a formal audit schedule — quarterly for high-risk systems, annually for lower-risk ones. Include both technical audits (performance, fairness metrics, data quality) and ethical audits (stakeholder impact, consent validity, regulatory compliance). Consider engaging external auditors for independence and credibility.

19. Have affected stakeholders been consulted?

Why it matters: The people affected by an AI system often understand risks and impacts that developers do not. Community input is not just a nicety — it surfaces blind spots and builds legitimacy.

What goes wrong without it: Predictive policing systems deployed without community input have reinforced existing patterns of over-policing in minority neighborhoods. The feedback loop — more police in an area leads to more arrests, which leads to more data showing crime in that area, which leads to more police — was obvious to community members but invisible to developers.

Practical tip: Identify all stakeholder groups affected by your system (not just users but also people subject to the system's decisions, communities impacted indirectly, and frontline workers who will interact with the system). Engage them early in development, not just after launch. Create structured feedback channels and demonstrate how input influenced the design.

20. Are there clear criteria for when the system should be retired?

Why it matters: Every AI system should have defined conditions under which it will be decommissioned. Without sunset criteria, outdated, poorly performing, or harmful systems persist indefinitely because the inertia of "it is already deployed" is powerful.

What goes wrong without it: Legacy AI systems that no one fully understands, no one maintains, and no one has the authority to shut down are a growing problem in large organizations. These systems continue making decisions long after their training data, assumptions, and performance have become obsolete.

Practical tip: Define sunset criteria at the time of deployment: performance thresholds below which the system must be retrained or retired, regulatory changes that would require redesign, maximum time between full audits, and conditions under which the system must be immediately shut down (e.g., discovery of systematic discrimination). Document these criteria and assign someone to monitor them.

Using This Checklist in Practice

An AI ethics checklist is only valuable if it is actually used. Here are practical recommendations for integrating these 20 questions into your workflow:

For a comprehensive framework covering AI ethics principles, case studies, governance structures, and implementation strategies in depth, see our complete treatment in AI Ethics. The checklist in this post is a starting point. The book is the full toolkit.