Counterfactuals of Counterfactuals: a back-translation-inspired approach to analyse counterfactual editors
| Main authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Keywords: | |
| Online access: | Order full text |
Abstract: In the wake of responsible AI, interpretability methods, which attempt to provide an explanation for the predictions of neural models, have seen rapid progress. In this work, we are concerned with explanations that are applicable to natural language processing (NLP) models and tasks, and we focus specifically on the analysis of counterfactual, contrastive explanations. We note that while several explainers have been proposed to produce counterfactual explanations, their behaviour can vary significantly, and the lack of a universal ground truth for the counterfactual edits imposes an insuperable barrier on their evaluation. We propose a new back-translation-inspired evaluation methodology that utilises earlier outputs of the explainer as ground-truth proxies to investigate the consistency of explainers. We show that by iteratively feeding the counterfactual to the explainer we can obtain valuable insights into the behaviour of both the predictor and the explainer models, and infer patterns that would otherwise be obscured. Using this methodology, we conduct a thorough analysis and propose a novel metric to evaluate the consistency of counterfactual generation approaches with different characteristics across available performance indicators.
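The iterative evaluation loop described in the abstract can be pictured as a short sketch. The following is a minimal illustration in Python, assuming hypothetical `predictor` and `explainer` callables; all names here are illustrative, and the pairing heuristic is only a stand-in for the paper's actual consistency metric:

```python
def iterative_counterfactuals(text, predictor, explainer, n_rounds=4):
    """Apply the explainer repeatedly, feeding each counterfactual back in.

    predictor:  callable mapping a text to a predicted label
    explainer:  callable mapping (text, predictor) to a counterfactual
                edit of the text intended to flip the predictor's label
    Returns the full chain of texts, starting with the original.
    """
    chain = [text]
    for _ in range(n_rounds):
        chain.append(explainer(chain[-1], predictor))
    return chain


def consistency_pairs(chain):
    # Texts two steps apart target the same label, so earlier outputs
    # can act as ground-truth proxies for later ones: a consistent
    # explainer should produce similar texts at chain[i] and chain[i + 2].
    return [(chain[i], chain[i + 2]) for i in range(len(chain) - 2)]
```

The intuition mirrors back-translation in machine translation: applying the edit operation twice should approximately return to the starting point, so divergence along the chain signals inconsistency in the explainer or instability in the predictor.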
DOI: 10.48550/arxiv.2305.17055