Authors: Marina Danilevsky, Kun Qian, Ranit Aharonov, Yannis Katsis, Ban Kawas, Prithviraj Sen
Paper reference: https://arxiv.org/pdf/2010.00711.pdf

Categorization of Explanations

Local vs Global

  • A local explanation provides justification for the model’s prediction on a specific input.
  • A global explanation justifies predictions by revealing how the model’s predictive process works overall (i.e., by characterizing the model’s behavior), independently of any particular input.

Self-Explaining vs Post-Hoc

  • A self-explaining approach generates the explanation at the same time as the prediction, using information emitted by the model as a by-product of making that prediction (such as attention scores).
  • A post-hoc approach requires an additional operation to be performed after the prediction is made, such as computing gradients, perturbing the input, or training a surrogate model (e.g., LIME).

Explainability Techniques

Feature importance (in frequent use)

Derives an explanation by investigating the importance scores of the features the model used to produce its final prediction; attention scores and gradient-based methods are common sources of such scores.
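
A minimal sketch of one such feature-importance method, plain input-gradient saliency, on a toy PyTorch model; the embedding layer, linear classifier, and token ids below are placeholders, not the setup of any particular paper.

```python
import torch
import torch.nn as nn

# Toy classifier: embedding + linear layer (stands in for any differentiable NLP model).
vocab_size, embed_dim, num_classes = 100, 16, 2
embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Linear(embed_dim, num_classes)

token_ids = torch.tensor([[5, 23, 42, 7]])   # one "sentence" of token ids
embeds = embedding(token_ids)                # (1, seq_len, embed_dim)
embeds.retain_grad()                         # keep gradients w.r.t. the input embeddings

logits = classifier(embeds.mean(dim=1))      # simple mean-pooled prediction
predicted = logits.argmax(dim=-1).item()

# Gradient of the predicted logit w.r.t. the input embeddings.
logits[0, predicted].backward()

# Feature-importance score per token: L2 norm of its embedding gradient.
saliency = embeds.grad.norm(dim=-1).squeeze(0)
print(saliency)  # higher score = token mattered more for this prediction
```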

Surrogate model (in frequent use)

Model predictions are explained by learning a second, usually more interpretable, model as a proxy (e.g., LIME). The surrogate and the original model may arrive at their predictions through completely different mechanisms, which raises concerns about faithfulness.
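
A minimal LIME-style sketch under simplifying assumptions: the black_box_sentiment function is a made-up stand-in for an opaque classifier, and the surrogate is a weighted ridge regression over word-presence features, following the basic LIME recipe rather than the library itself.

```python
import numpy as np
from sklearn.linear_model import Ridge

def black_box_sentiment(text):
    """Hypothetical stand-in for an opaque classifier: returns P(positive)."""
    return 0.9 if "great" in text else 0.2

def lime_style_explanation(text, num_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    words = text.split()
    # Randomly mask words to create perturbed neighbours of the input.
    masks = rng.integers(0, 2, size=(num_samples, len(words)))
    preds, weights = [], []
    for mask in masks:
        perturbed = " ".join(w for w, keep in zip(words, mask) if keep)
        preds.append(black_box_sentiment(perturbed))
        # Weight each sample by its similarity to the original (fraction of words kept).
        weights.append(mask.mean())
    # Fit an interpretable linear surrogate on the binary word-presence features.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, preds, sample_weight=weights)
    # Words with the largest absolute coefficients are the explanation.
    return sorted(zip(words, surrogate.coef_), key=lambda x: -abs(x[1]))

print(lime_style_explanation("the movie was great fun"))
```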

Example-driven

Explains the prediction for an input instance by identifying and presenting other instances, usually drawn from the available labeled data, that are semantically similar to it (commonly used in QA).
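
A minimal sketch of example-driven explanation via nearest neighbors; the sentence embeddings and labeled examples are invented for illustration, and in practice they would come from an encoder and a real training set.

```python
import numpy as np

# Toy sentence embeddings (in practice produced by an encoder); values are made up.
labeled_examples = {
    "what year was the eiffel tower built":  np.array([0.90, 0.10, 0.00]),
    "who wrote hamlet":                      np.array([0.10, 0.80, 0.20]),
    "when was the statue of liberty built":  np.array([0.85, 0.15, 0.05]),
}

def explain_by_example(query_vec, k=2):
    """Return the k labeled instances most similar to the input as the explanation."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(text, cosine(query_vec, vec)) for text, vec in labeled_examples.items()]
    return sorted(scored, key=lambda x: -x[1])[:k]

query = np.array([0.88, 0.12, 0.02])   # embedding of the new question
print(explain_by_example(query))        # similar training questions justify the prediction
```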

Provenance-based

The explanation illustrates some or all of the reasoning steps (the derivation process) that lead to the final prediction.
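
A toy sketch of how provenance could be recorded: the fact triples and the relation chain are invented, and the point is only that each reasoning step is kept and returned alongside the answer.

```python
# Tiny knowledge base of (subject, relation) -> object facts; contents are illustrative.
FACTS = {
    ("Paris", "capital_of"): "France",
    ("France", "located_in"): "Europe",
}

def answer_with_provenance(entity, relations):
    """Follow a chain of relations and record each hop as a reasoning step."""
    steps, current = [], entity
    for rel in relations:
        nxt = FACTS[(current, rel)]
        steps.append(f"{current} --{rel}--> {nxt}")
        current = nxt
    return current, steps

answer, provenance = answer_with_provenance("Paris", ["capital_of", "located_in"])
print(answer)       # Europe
print(provenance)   # the series of reasoning steps shown as the explanation
```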

Declarative induction

Explanations are induced as human-readable declarative representations, such as rules and trees.
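
A small sketch using a decision tree as the induced declarative representation; the bag-of-words features and labels are made up, and scikit-learn's export_text prints the learned rules.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy bag-of-words features for a sentiment task; data is invented for illustration.
feature_names = ["contains_great", "contains_boring", "contains_not"]
X = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 0], [0, 1, 1], [1, 0, 0]]
y = [1, 0, 0, 0, 0, 1]   # 1 = positive, 0 = negative

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The learned tree itself is the explanation: human-readable if/then rules.
print(export_text(tree, feature_names=feature_names))
```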

A few visualization techniques:

  • Saliency maps (see the sketch after this list).
  • Presenting the learned declarative representations directly, e.g., logic rules and trees.
  • Training a language model on human natural language explanations and coupling it with a deep generative model (explanations can also be template-based).
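
A minimal saliency-map sketch with matplotlib; the tokens and per-token scores are hypothetical (they could come from attention weights or gradients as above).

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical per-token importance scores (e.g., from attention or gradients).
tokens = ["the", "movie", "was", "surprisingly", "great"]
scores = np.array([0.05, 0.10, 0.05, 0.30, 0.50])

# Render a one-row heatmap: darker cells mark tokens the model relied on most.
fig, ax = plt.subplots(figsize=(6, 1.2))
ax.imshow(scores[np.newaxis, :], cmap="Reds", aspect="auto")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks([])
plt.tight_layout()
plt.show()
```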

Evaluation

  • Informal examination of explanations. Discuss how examples of generated explanations align with human intuition.
  • Comparison to ground truth. Compare generated explanations against ground-truth data, employ automatic metrics such as perplexity or BLEU, and use multiple annotators, reporting inter-annotator agreement or mean human performance to account for disagreement on the precise value of the ground truth (see the sketch after this list).
  • Human evaluation. Have multiple annotators, report inter-annotator agreement, and correctly deal with subjectivity and variance in the responses.
  • Evaluate using counterfactuals.
  • Other criteria: fidelity (how well explanations reflect the actual workings of the underlying model) and comprehensibility (how easy explanations are for humans to understand).
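
A small sketch of comparison to ground truth under simple assumptions: token-overlap F1 between a generated explanation and a human rationale, plus Cohen's kappa (via scikit-learn) as one possible inter-annotator agreement measure; the token lists and annotator labels are invented.

```python
from sklearn.metrics import cohen_kappa_score

def token_f1(predicted, gold):
    """Token-overlap F1 between a generated explanation and a ground-truth rationale."""
    pred, gold = set(predicted), set(gold)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

generated = ["surprisingly", "great", "acting"]
ground_truth = ["great", "acting"]
print(token_f1(generated, ground_truth))

# Inter-annotator agreement on which tokens belong to the rationale (1 = in rationale).
annotator_a = [0, 1, 1, 0, 1]
annotator_b = [0, 1, 1, 1, 1]
print(cohen_kappa_score(annotator_a, annotator_b))
```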

Future Directions

  • Define evaluation metrics for model explainability.
  • Study the trade-off between performance and explainability.
  • Improve and evaluate the faithfulness of explanations to the underlying model.