A Collection of Research Resources on Explainable Machine Learning

I created a GitHub repository that collects awesome research papers on Explainable Machine Learning (also referred to as Explainable AI/XAI or Interpretable Machine Learning). In a rapidly emerging field like this one, it can be frustrating to be buried under an enormous number of papers at the beginning of a literature review. I hope this paper list helps new ML researchers and practitioners learn about the field with less pain and stress.

Unlike most repositories you might find on GitHub, which maintain comprehensive lists of resources on Explainable ML, I try to keep this list short to make it less intimidating for beginners. It is admittedly a subjective selection, based on my own preferences and research taste.

The paper list below is likely to become outdated, as I might not keep it updated here. Please check the GitHub repository for the up-to-date list.
Papers marked in bold are highly recommended reading.

1. General Idea

Survey

  • The Mythos of Model Interpretability. Lipton, 2016 pdf

  • Open the Black Box: Data-Driven Explanation of Black Box Decision Systems. Pedreschi et al., 2018 pdf

  • Techniques for Interpretable Machine Learning. Du et al. 2018 pdf

  • Explaining Explanations in AI. Mittelstadt et al., 2019 pdf

  • Explanation in artificial intelligence: Insights from the social sciences. Miller, 2019 pdf

  • Explaining Explanations: An Overview of Interpretability of Machine Learning. Gilpin et al. 2019 pdf

  • Interpretable machine learning: definitions, methods, and applications. Murdoch et al. 2019 pdf

  • Explaining Deep Neural Networks. Camburu, 2020 pdf

2. Global Explanation

Interpretable Models

  • Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Rudin, 2019 pdf

Generalized Additive Model

  • Accurate intelligible models with pairwise interactions. Lou et al., 2013 pdf
  • Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. Caruana et al., 2015 pdf | InterpretableML

Rule-based Method

  • Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Letham et al., 2015 pdf
  • Interpretable Decision Sets: A Joint Framework for Description and Prediction. Lakkaraju et al., 2016 pdf

Scoring System

  • Optimized Scoring Systems: Toward Trust in Machine Learning for Healthcare and Criminal Justice. Rudin, 2018 pdf

Model Distillation

Use interpretable models to approximate a black-box model's behavior; similar to imitation learning in RL.
  • Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation. Tan et al., 2018 pdf
  • Faithful and Customizable Explanations of Black Box Models. Lakkaraju et al., 2019 pdf
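As a minimal sketch of the distillation idea (not the method of any particular paper above), one can train an interpretable "student" model on the predictions of an opaque "teacher" and check how often the two agree, a quantity often called fidelity. The model choices and hyperparameters below are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Opaque "teacher": any accurate black-box model.
teacher = GradientBoostingClassifier(random_state=0).fit(X, y)
y_teacher = teacher.predict(X)  # the student learns the teacher's labels, not ground truth

# Interpretable "student": a shallow tree distilled from the teacher.
student = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_teacher)

# Fidelity: how often the student agrees with the black box it explains.
fidelity = accuracy_score(y_teacher, student.predict(X))
print(f"fidelity to the black box: {fidelity:.2f}")
```

The shallow tree can then be inspected directly, and the fidelity score tells you how far to trust it as a proxy for the black box.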

Representation-based Explanation

  • Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). Kim et al., 2018 pdf
  • This Looks Like That: Deep Learning for Interpretable Image Recognition. Chen et al., 2019 pdf
    • This Looks Like That, Because … Explaining Prototypes for Interpretable Image Recognition. Nauta et al., 2020 pdf
    • Learning to Explain With Complemental Examples. Kanehira & Harada, 2019 pdf

Self-Explaining Neural Network

  • Towards Robust Interpretability with Self-Explaining Neural Networks. Alvarez-Melis et al., 2018 pdf
  • Deep Weighted Averaging Classifiers. Card et al., 2019 pdf

3. Local Explanation

Note: aggregating multiple local explanations can be viewed as constructing a global explanation.

Feature-based Explanation

  • Permutation importance: a corrected feature importance measure. Altmann et al., 2010 link | sklearn
  • “Why Should I Trust You?” Explaining the Predictions of Any Classifier. Ribeiro et al., 2016 pdf | LIME
  • A Unified Approach to Interpreting Model Predictions. Lundberg & Lee, 2017 pdf | SHAP
  • Anchors: High-Precision Model-Agnostic Explanations. Ribeiro et al., 2018 pdf
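Of the methods above, permutation importance is simple enough to sketch: shuffle one feature at a time and measure how much the model's score drops. A minimal illustration with scikit-learn's `permutation_importance` (the dataset and model here are arbitrary assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature n_repeats times; importance = mean drop in score.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
```

Features whose shuffling barely changes the score get importance near zero, which makes the output easy to read even for a black-box model.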

Example-based Explanation

  • Examples are not enough, learn to criticize! Criticism for Interpretability. Kim et al., 2016 pdf
  • Understanding Black-box Predictions via Influence Functions. Koh & Liang, 2017 pdf

Counterfactual Explanation

Also referred to as algorithmic recourse or contrastive explanation.
  • Counterfactual Explanations for Machine Learning: A Review. Verma et al., 2020 pdf
  • A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. Karimi et al., 2020 pdf

Minimize distance counterfactuals

  • Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR. Wachter et al., 2017 pdf
  • Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations. Mothilal et al., 2019 pdf
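The minimum-distance idea of Wachter et al. can be sketched as minimizing λ·(f(x′) − y_target)² + ‖x′ − x‖² over candidate counterfactuals x′. Below is a toy illustration with a hand-coded logistic model; the weights, λ, and learning rate are made-up assumptions for illustration, not values from the paper:

```python
import numpy as np

w = np.array([1.5, -2.0])  # toy logistic-regression weights (an assumption)
b = 0.1

def f(x):
    """Model's predicted probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

x = np.array([-1.0, 1.0])  # original instance; f(x) is close to 0
y_target, lam, lr = 1.0, 10.0, 0.01

x_cf = x.copy()
for _ in range(2000):
    p = f(x_cf)
    # gradient of  lam*(p - y_target)^2 + ||x_cf - x||^2  w.r.t. x_cf
    grad = lam * 2 * (p - y_target) * p * (1 - p) * w + 2 * (x_cf - x)
    x_cf = x_cf - lr * grad

print("counterfactual:", x_cf, "prediction:", f(x_cf))
```

The distance term keeps x′ close to the original instance while the prediction term pushes it across the decision boundary, which is what makes the resulting change actionable as an explanation.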

Minimize cost (algorithmic recourse)

  • Actionable Recourse in Linear Classification. Ustun et al., 2019 pdf
  • Algorithmic Recourse: from Counterfactual Explanations to Interventions. Karimi et al., 2021 pdf

Causal constraints

  • Preserving Causal Constraints in Counterfactual Explanations for Machine Learning Classifiers. Mahajan et al., 2020 pdf

4. Explainability in Human-in-the-loop ML

HCI’s perspective on Explainable ML
  • Interacting with Predictions: Visual Inspection of Black-box Machine Learning Models. Krause et al., 2016 pdf
  • Human-centered Machine Learning: a Machine-in-the-loop Approach. Tan, 2018 blog
  • Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. Abdul et al., 2018 pdf
  • Explaining models: an empirical study of how explanations impact fairness judgment. Dodge et al., 2019 pdf
  • Human-Centered Tools for Coping with Imperfect Algorithms During Medical Decision-Making. Cai et al., 2019 pdf
  • Designing Theory-Driven User-Centric Explainable AI. Wang et al., 2019 pdf
  • Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. Bansal et al., 2021 pdf

5. Evaluate Explainable ML

Evaluation of explainable ML can be loosely categorized into two classes:

  • faithfulness: evaluating how well the explanation reflects the true inner behavior of the black-box model.
  • interpretability: evaluating how understandable the explanation is to humans.

  • The Price of Interpretability. Bertsimas et al., 2019 pdf
  • Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. Ribeiro et al., 2020 pdf @ ACL 2020 Best Paper

Evaluating Faithfulness

Evaluate whether or not the explanation faithfully reflects how the model actually works (it turns out that post-hoc explanations are often not 100% faithful).
  • Sanity Checks for Saliency Maps. Adebayo et al., 2018 pdf
  • Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? Jacovi & Goldberg, 2020 ACL

Robust Explanation

  • Interpretation of Neural Networks Is Fragile. Ghorbani et al., 2019 pdf
  • Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. Slack et al., 2020 pdf
  • Robust and Stable Black Box Explanations. Lakkaraju et al., 2020 pdf

Evaluating Interpretability

Evaluate interpretability (whether the explanations make sense to humans).
  • Towards A Rigorous Science of Interpretable Machine Learning. Doshi-Velez & Kim, 2017 pdf
  • ‘It’s Reducing a Human Being to a Percentage’: Perceptions of Justice in Algorithmic Decisions. Binns et al., 2018 pdf
  • Human Evaluation of Models Built for Interpretability. Lage et al., 2019 pdf
  • Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning. Kaur et al., 2019 pdf
  • Manipulating and Measuring Model Interpretability. Poursabzi-Sangdeh et al., 2021 pdf

6. Useful Resources

Courses & Talks

  • Tutorial on Explainable ML Website
  • Interpretability and Explainability in Machine Learning, Fall 2019 @ Harvard University by Hima Lakkaraju Course
  • Human-centered Machine Learning @ University of Colorado Boulder by Chenhao Tan course
  • Model Explainability Forum by TWIML AI Podcast YouTube | link

Collections of Resources

Toolbox
