A Collection of Research Resources on Explainable Machine Learning
I created a GitHub repository that collects noteworthy research papers on Explainable Machine Learning (also referred to as Explainable AI/XAI or Interpretable Machine Learning). As a rapidly emerging field, it can be frustrating to be buried under an enormous number of papers when first reviewing the literature. I hope this paper list helps new ML researchers and practitioners learn about the field with less pain and stress.
Unlike most repositories you might find on GitHub, which maintain comprehensive lists of Explainable ML resources, I try to keep this list short to make it less intimidating for beginners. It is admittedly a subjective selection, based on my own preferences and research tastes.
1. General Idea
Survey
The Mythos of Model Interpretability. Lipton, 2016 pdf
Open the Black Box Data-Driven Explanation of Black Box Decision Systems. Pedreschi et al., 2018 pdf
Techniques for Interpretable Machine Learning. Du et al. 2018 pdf
Explaining Explanations in AI. Mittelstadt et al., 2019 pdf
Explanation in artificial intelligence: Insights from the social sciences. Miller, 2019 pdf
Explaining Explanations: An Overview of Interpretability of Machine Learning. Gilpin et al. 2019 pdf
Interpretable machine learning: definitions, methods, and applications. Murdoch et al. 2019 pdf
Explaining Deep Neural Networks. Camburu, 2020 pdf
2. Global Explanation
Interpretable Models
- Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Rudin, 2019 pdf
Generalized Additive Model
- Accurate intelligible models with pairwise interactions. Lou et al., 2013 pdf
- Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. Caruana et al., 2015 pdf | InterpretableML
Rule-based Method
- Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Letham et al., 2015 pdf
- Interpretable Decision Sets: A Joint Framework for Description and Prediction. Lakkaraju et al., 2016 pdf
Scoring System
- Optimized Scoring Systems: Toward Trust in Machine Learning for Healthcare and Criminal Justice. Rudin, 2018 pdf
Model Distillation
- Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation. Tan et al., 2018 pdf
- Faithful and Customizable Explanations of Black Box Models. Lakkaraju et al., 2019 pdf
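The distillation papers above share one core move: fit a transparent student model to the black-box teacher's predictions, then report how faithfully the student mimics the teacher. A minimal sketch of that idea in scikit-learn (the teacher, student, and synthetic data here are illustrative placeholders, not the setups from the papers):

```python
# Model-distillation sketch: train a transparent decision tree
# to mimic the predictions of a "black-box" random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Black-box teacher model.
teacher = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The student is trained on the teacher's labels, not the true labels,
# so it approximates the teacher's decision surface rather than the data.
student = DecisionTreeClassifier(max_depth=4, random_state=0)
student.fit(X, teacher.predict(X))

# Fidelity: how often the transparent student agrees with the teacher.
fidelity = accuracy_score(teacher.predict(X), student.predict(X))
print(f"fidelity to teacher: {fidelity:.2f}")
```

Fidelity (agreement with the teacher), rather than accuracy on ground truth, is the quantity these auditing papers care about: a high-fidelity student can be inspected in place of the black box.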
Representation-based Explanation
- Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). Kim et al., 2018 pdf
- This Looks Like That: Deep Learning for Interpretable Image Recognition. Chen et al., 2019 pdf
Self-Explaining Neural Network
- Towards Robust Interpretability with Self-Explaining Neural Networks. Alvarez-Melis et al., 2018 pdf
- Deep Weighted Averaging Classifiers. Card et al., 2019 pdf
3. Local Explanation
Feature-based Explanation
- Permutation importance: a corrected feature importance measure. Altmann et al., 2010 link | sklearn
- “Why Should I Trust You?” Explaining the Predictions of Any Classifier. Ribeiro et al., 2016 pdf | LIME
- A Unified Approach to Interpreting Model Predictions. Lundberg & Lee, 2017 pdf | SHAP
- Anchors: High-Precision Model-Agnostic Explanations. Ribeiro et al., 2018 pdf
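Of the methods above, permutation importance is the simplest to try: shuffle one feature at a time and measure how much the model's score degrades. A short sketch using scikit-learn's implementation (the dataset and model are placeholders):

```python
# Permutation importance: shuffle one feature at a time and measure the
# score drop on held-out data — larger drops mean more important features.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, n_informative=2,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# n_repeats controls how many shuffles are averaged per feature.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Computing importance on a held-out test set (rather than training data) is what distinguishes genuinely predictive features from ones the model merely overfit.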
Example-based Explanation
- Examples are not enough, learn to criticize! Criticism for Interpretability. Kim et al., 2016 pdf
- Understanding Black-box Predictions via Influence Functions. Koh & Liang, 2017 pdf
Counterfactual Explanation
- Counterfactual Explanations for Machine Learning: A Review. Verma et al., 2020 pdf
- A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. Karimi et al., 2020 pdf
Minimize distance counterfactuals
- Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR. Wachter et al., 2017 pdf
- Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations. Mothilal et al., 2019 pdf
Minimize cost (algorithmic recourse)
- Actionable Recourse in Linear Classification. Ustun et al., 2019 pdf
- Algorithmic Recourse: from Counterfactual Explanations to Interventions. Karimi et al., 2021 pdf
Causal constraints
- Preserving Causal Constraints in Counterfactual Explanations for Machine Learning Classifiers. Mahajan et al., 2020 pdf
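The minimize-distance formulation (Wachter et al., above) searches for the closest input x' whose prediction reaches a target, typically by minimizing a loss of the form λ·(f(x') − target)² + d(x', x). A bare-bones NumPy sketch for a toy logistic model (the weights, λ, and step sizes are all illustrative values, not the paper's exact setup):

```python
import numpy as np

# Toy "black-box": a logistic model with fixed, illustrative weights.
w = np.array([2.0, -1.0])
b = -0.5

def predict_proba(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def counterfactual(x0, target=0.9, lam=10.0, lr=0.02, steps=2000):
    """Gradient descent on  lam * (f(x') - target)^2 + ||x' - x0||^2."""
    x = x0.copy()
    for _ in range(steps):
        p = predict_proba(x)
        # Gradient of the squared prediction gap (via the logistic derivative).
        grad_pred = 2 * lam * (p - target) * p * (1 - p) * w
        # Gradient of the squared distance to the original point.
        grad_dist = 2 * (x - x0)
        x -= lr * (grad_pred + grad_dist)
    return x

x0 = np.array([-1.0, 0.5])        # original point, predicted class 0
x_cf = counterfactual(x0)         # nearby point pushed toward class 1
print(predict_proba(x0), predict_proba(x_cf))
```

The distance term keeps the counterfactual close to the original input, while λ trades off proximity against reaching the target prediction; the later recourse papers (Ustun et al., Karimi et al.) replace plain distance with an actionability-aware cost.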
4. Explainability in Human-in-the-loop ML
- Interacting with Predictions: Visual Inspection of Black-box Machine Learning Models. Krause et al., 2016 pdf
- Human-centered Machine Learning: a Machine-in-the-loop Approach. Tan, 2018 blog
- Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. Abdul et al., 2018 pdf
- Explaining models: an empirical study of how explanations impact fairness judgment. Dodge et al., 2019 pdf
- Human-Centered Tools for Coping with Imperfect Algorithms During Medical Decision-Making. Cai et al., 2019 pdf
- Designing Theory-Driven User-Centric Explainable AI. Wang et al., 2019 pdf
- Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. Bansal et al., 2021 pdf
5. Evaluate Explainable ML
Evaluation of explainable ML can be loosely categorized into two classes:
- faithfulness: how well the explanation reflects the true inner behavior of the black-box model.
- interpretability: how understandable the explanation is to humans.
- The Price of Interpretability. Bertsimas et al., 2019 pdf
- Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. Ribeiro et al., 2020 pdf @ ACL 2020 Best Paper
Evaluating Faithfulness
- Sanity Checks for Saliency Maps. Adebayo et al., 2018 pdf
- Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? Jacovi & Goldberg, 2020 ACL
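A common empirical probe of faithfulness, in the spirit of the sanity checks above though not the exact protocol of either paper, is a deletion test: remove the features an explanation ranks as most important and check that the model's score actually drops fastest. A toy sketch with a linear model, where the attribution w_i·x_i is exactly faithful by construction:

```python
import numpy as np

# Toy linear model; the weights are illustrative.
w = np.array([3.0, 0.1, -2.0, 0.05])

def score(x):
    return float(x @ w)

x = np.array([1.0, 1.0, -1.0, 1.0])

# Attribution from a simple explanation that is exactly faithful
# for linear models: each feature's contribution is w_i * x_i.
attribution = w * x

# Deletion test: zero out features from most- to least-important and
# record the model score; a faithful ranking drops the score fastest.
order = np.argsort(-np.abs(attribution))
scores = []
x_del = x.copy()
for i in order:
    x_del[i] = 0.0          # "delete" the feature (zero baseline)
    scores.append(score(x_del))
print(scores)
```

For a black box the same curve is computed against the explanation under test; an unfaithful explanation yields a deletion curve no steeper than deleting features in random order.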
Robust Explanation
- Interpretation of Neural Networks Is Fragile. Ghorbani et al., 2019 pdf
- Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. Slack et al., 2020 pdf
- Robust and Stable Black Box Explanations. Lakkaraju et al., 2020 pdf
Evaluating Interpretability
- Towards A Rigorous Science of Interpretable Machine Learning. Doshi-Velez & Kim, 2017 pdf
- ‘It’s Reducing a Human Being to a Percentage’: Perceptions of Justice in Algorithmic Decisions. Binns et al., 2018 pdf
- Human Evaluation of Models Built for Interpretability. Lage et al., 2019 pdf
- Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning. Kaur et al., 2019 pdf
- Manipulating and Measuring Model Interpretability. Poursabzi-Sangdeh et al., 2021 pdf
6. Useful Resources
Courses & Talks
- Tutorial on Explainable ML Website
- Interpretability and Explainability in Machine Learning, Fall 2019 @ Harvard University by Hima Lakkaraju Course
- Human-centered Machine Learning @ University of Colorado Boulder by Chenhao Tan Course
- Model Explainability Forum by TWIML AI Podcast YouTube | link
Collections of Resources
- XAI-Papers GitHub
Toolbox
- InterpretML GitHub