A Collection of Research Resources on Explainable Machine Learning
I created a GitHub repository that collects noteworthy research papers on Explainable Machine Learning (also referred to as Explainable AI/XAI or Interpretable Machine Learning). As a rapidly emerging field, it can be frustrating to be buried under an enormous number of papers when first reviewing the literature. I hope this paper list helps new ML researchers and practitioners learn about the field with less pain and stress.
Unlike most repositories you might find on GitHub, which maintain comprehensive lists of Explainable ML resources, I try to keep this list short to make it less intimidating for beginners. It is admittedly a subjective selection, based on my own preferences and research tastes.
1. General Idea
Survey
The Mythos of Model Interpretability. Lipton, 2016 pdf
Open the Black Box Data-Driven Explanation of Black Box Decision Systems. Pedreschi et al., 2018 pdf
Techniques for Interpretable Machine Learning. Du et al. 2018 pdf
Explaining Explanations in AI. Mittelstadt et al., 2019 pdf
Explanation in artificial intelligence: Insights from the social sciences. Miller, 2019 pdf
Explaining Explanations: An Overview of Interpretability of Machine Learning. Gilpin et al. 2019 pdf
Interpretable machine learning: definitions, methods, and applications. Murdoch et al. 2019 pdf
Explaining Deep Neural Networks. Camburu, 2020 pdf
2. Global Explanation
Interpretable Models
- Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Rudin, 2019 pdf
Generalized Additive Model
- Accurate intelligible models with pairwise interactions. Lou et al., 2013 pdf
- Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. Caruana et al., 2015 pdf | InterpretableML
Rule-based Method
- Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Letham et al., 2015 pdf
- Interpretable Decision Sets: A Joint Framework for Description and Prediction. Lakkaraju et al., 2016 pdf
Scoring System
- Optimized Scoring Systems: Toward Trust in Machine Learning for Healthcare and Criminal Justice. Rudin, 2018 pdf
Model Distillation
- Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation. Tan et al., 2018 pdf
- Faithful and Customizable Explanations of Black Box Models. Lakkaraju et al., 2019 pdf
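The distillation papers above share one core move: fit a transparent student model to the black-box teacher's predictions, then report how faithfully the student mimics the teacher. A minimal sketch of that idea in scikit-learn (the teacher, student, and synthetic data here are illustrative placeholders, not the setups from the papers):

```python
# Model-distillation sketch: train a transparent decision tree
# to mimic the predictions of a "black-box" random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Black-box teacher model.
teacher = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The student is trained on the teacher's labels, not the true labels,
# so it approximates the teacher's decision surface rather than the data.
student = DecisionTreeClassifier(max_depth=4, random_state=0)
student.fit(X, teacher.predict(X))

# Fidelity: how often the transparent student agrees with the teacher.
fidelity = accuracy_score(teacher.predict(X), student.predict(X))
print(f"fidelity to teacher: {fidelity:.2f}")
```

Fidelity (agreement with the teacher), rather than accuracy on ground truth, is the quantity these auditing papers care about: a high-fidelity student can be inspected in place of the black box.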
Representation-based Explanation
- Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). Kim et al., 2018 pdf
- This Looks Like That: Deep Learning for Interpretable Image Recognition. Chen et al., 2019 pdf
Self-Explaining Neural Network
- Towards Robust Interpretability with Self-Explaining Neural Networks. Alvarez-Melis et al., 2018 pdf
- Deep Weighted Averaging Classifiers. Card et al., 2019 pdf
3. Local Explanation
Feature-based Explanation
- Permutation importance: a corrected feature importance measure. Altmann et al., 2010 link | sklearn
- “Why Should I Trust You?” Explaining the Predictions of Any Classifier. Ribeiro et al., 2016 pdf | LIME
- A Unified Approach to Interpreting Model Predictions. Lundberg & Lee, 2017 pdf | SHAP
- Anchors: High-Precision Model-Agnostic Explanations. Ribeiro et al., 2018 pdf
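Of the methods above, permutation importance is the simplest to try: shuffle one feature at a time and measure how much the model's score degrades. A short sketch using scikit-learn's implementation (the dataset and model are placeholders):

```python
# Permutation importance: shuffle one feature at a time and measure the
# score drop on held-out data — larger drops mean more important features.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, n_informative=2,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# n_repeats controls how many shuffles are averaged per feature.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Computing importance on a held-out test set (rather than training data) is what distinguishes genuinely predictive features from ones the model merely overfit.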
Example-based Explanation
- Examples are not enough, learn to criticize! Criticism for Interpretability. Kim et al., 2016 pdf
- Understanding Black-box Predictions via Influence Functions. Koh & Liang, 2017 pdf
Counterfactual Explanation
- Counterfactual Explanations for Machine Learning: A Review. Verma et al., 2020 pdf
- A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. Karimi et al., 2020 pdf
Minimize distance counterfactuals
- Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR. Wachter et al., 2017 pdf
- Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations. Mothilal et al., 2019 pdf
Minimize cost (algorithmic recourse)
- Actionable Recourse in Linear Classification. Ustun et al., 2019 pdf
- Algorithmic Recourse: from Counterfactual Explanations to Interventions. Karimi et al., 2021 pdf
Causal constraints
- Preserving Causal Constraints in Counterfactual Explanations for Machine Learning Classifiers. Mahajan et al., 2020 pdf
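The minimize-distance formulation (Wachter et al., above) searches for the closest input x' whose prediction reaches a target, typically by minimizing a loss of the form λ·(f(x') − target)² + d(x', x). A bare-bones NumPy sketch for a toy logistic model (the weights, λ, and step sizes are all illustrative values, not the paper's exact setup):

```python
import numpy as np

# Toy "black-box": a logistic model with fixed, illustrative weights.
w = np.array([2.0, -1.0])
b = -0.5

def predict_proba(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def counterfactual(x0, target=0.9, lam=10.0, lr=0.02, steps=2000):
    """Gradient descent on  lam * (f(x') - target)^2 + ||x' - x0||^2."""
    x = x0.copy()
    for _ in range(steps):
        p = predict_proba(x)
        # Gradient of the squared prediction gap (via the logistic derivative).
        grad_pred = 2 * lam * (p - target) * p * (1 - p) * w
        # Gradient of the squared distance to the original point.
        grad_dist = 2 * (x - x0)
        x -= lr * (grad_pred + grad_dist)
    return x

x0 = np.array([-1.0, 0.5])        # original point, predicted class 0
x_cf = counterfactual(x0)         # nearby point pushed toward class 1
print(predict_proba(x0), predict_proba(x_cf))
```

The distance term keeps the counterfactual close to the original input, while λ trades off proximity against reaching the target prediction; the later recourse papers (Ustun et al., Karimi et al.) replace plain distance with an actionability-aware cost.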
4. Explainability in Human-in-the-loop ML
- Interacting with Predictions: Visual Inspection of Black-box Machine Learning Models. Krause et al., 2016 pdf
- Human-centered Machine Learning: a Machine-in-the-loop Approach. Tan, 2018 blog
- Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. Abdul et al., 2018 pdf
- Explaining models: an empirical study of how explanations impact fairness judgment. Dodge et al., 2019 pdf
- Human-Centered Tools for Coping with Imperfect Algorithms During Medical Decision-Making. Cai et al., 2019 pdf
- Designing Theory-Driven User-Centric Explainable AI. Wang et al., 2019 pdf
- Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. Bansal et al., 2021 pdf
5. Evaluate Explainable ML
Evaluation of explainable ML can be loosely categorized into two classes:
- faithfulness: how well the explanation reflects the true inner behavior of the black-box model.
- interpretability: how understandable the explanation is to humans.
- The Price of Interpretability. Bertsimas et al., 2019 pdf
- Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. Ribeiro et al., 2020 pdf @ ACL 2020 Best Paper
Evaluating Faithfulness
- Sanity Checks for Saliency Maps. Adebayo et al., 2018 pdf
- Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? Jacovi & Goldberg, 2020 ACL
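A common empirical probe of faithfulness, in the spirit of the sanity checks above though not the exact protocol of either paper, is a deletion test: remove the features an explanation ranks as most important and check that the model's score actually drops fastest. A toy sketch with a linear model, where the attribution w_i·x_i is exactly faithful by construction:

```python
import numpy as np

# Toy linear model; the weights are illustrative.
w = np.array([3.0, 0.1, -2.0, 0.05])

def score(x):
    return float(x @ w)

x = np.array([1.0, 1.0, -1.0, 1.0])

# Attribution from a simple explanation that is exactly faithful
# for linear models: each feature's contribution is w_i * x_i.
attribution = w * x

# Deletion test: zero out features from most- to least-important and
# record the model score; a faithful ranking drops the score fastest.
order = np.argsort(-np.abs(attribution))
scores = []
x_del = x.copy()
for i in order:
    x_del[i] = 0.0          # "delete" the feature (zero baseline)
    scores.append(score(x_del))
print(scores)
```

For a black box the same curve is computed against the explanation under test; an unfaithful explanation yields a deletion curve no steeper than deleting features in random order.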
Robust Explanation
- Interpretation of Neural Networks Is Fragile. Ghorbani et al., 2019 pdf
- Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. Slack et al., 2020 pdf
- Robust and Stable Black Box Explanations. Lakkaraju et al., 2020 pdf
Evaluating Interpretability
- Towards A Rigorous Science of Interpretable Machine Learning. Doshi-Velez & Kim, 2017 pdf
- ‘It’s Reducing a Human Being to a Percentage’: Perceptions of Justice in Algorithmic Decisions. Binns et al., 2018 pdf
- Human Evaluation of Models Built for Interpretability. Lage et al., 2019 pdf
- Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning. Kaur et al., 2019 pdf
- Manipulating and Measuring Model Interpretability. Poursabzi-Sangdeh et al., 2021 pdf
6. Useful Resources
Courses & Talks
- Tutorial on Explainable ML Website
- Interpretability and Explainability in Machine Learning, Fall 2019 @ Harvard University by Hima Lakkaraju Course
- Human-centered Machine Learning @ University of Colorado Boulder by Chenhao Tan Course
- Model Explainability Forum by TWIML AI Podcast YouTube | link
Collections of Resources
- XAI-Papers GitHub
Toolbox
- InterpretML GitHub