PRIME: Prioritizing Interpretability in Failure Mode Extraction
Saved in:
Main Authors: | , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Abstract: | In this work, we study the challenge of providing human-understandable
descriptions for failure modes in trained image classification models. Existing
works address this problem by first identifying clusters (or directions) of
incorrectly classified samples in a latent space and then aiming to provide
human-understandable text descriptions for them. We observe that in some cases,
the descriptive text does not match the identified failure modes well, partly
because shared interpretable attributes of failure modes may not be captured by
clustering in the feature space. To improve on these
shortcomings, we propose a novel approach that prioritizes interpretability in
this problem: we start by obtaining human-understandable concepts (tags) of
images in the dataset and then analyze the model's behavior based on the
presence or absence of combinations of these tags. Our method also ensures that
the tags describing a failure mode form a minimal set, avoiding redundant and
noisy descriptions. Through several experiments on different datasets, we show
that our method successfully identifies failure modes and generates
high-quality text descriptions associated with them. These results highlight
the importance of prioritizing interpretability in understanding model
failures. |
DOI: | 10.48550/arxiv.2310.00164 |
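The abstract outlines a two-step pipeline: first obtain human-understandable concept tags for every image, then analyze the classifier's errors over combinations of those tags, keeping each failure-mode description minimal. The sketch below illustrates that tag-combination search in Python; it is not the authors' implementation, and the function names, thresholds, and brute-force enumeration are illustrative assumptions.

```python
from itertools import combinations

def failure_rate(tag_set, image_tags, correct):
    """Error rate of the model over images carrying every tag in tag_set."""
    matched = [ok for tags, ok in zip(image_tags, correct) if tag_set <= tags]
    return 1 - sum(matched) / len(matched) if matched else 0.0

def find_failure_modes(image_tags, correct, all_tags,
                       max_size=3, min_rate=0.5, min_support=20):
    """Return minimal tag combinations whose error rate is at least min_rate.

    image_tags : list[set[str]] -- concept tags detected on each image
    correct    : list[bool]     -- whether the model classified it correctly
    """
    found = []
    for k in range(1, max_size + 1):  # test small combinations first
        for combo in combinations(sorted(all_tags), k):
            s = frozenset(combo)
            # Minimality: a superset of a known failure mode adds nothing.
            if any(f <= s for f in found):
                continue
            support = sum(s <= tags for tags in image_tags)
            if (support >= min_support
                    and failure_rate(s, image_tags, correct) >= min_rate):
                found.append(s)
    return found

# Toy usage: images tagged both "dog" and "snow" are always misclassified.
tags = [{"dog", "snow"}, {"dog"}, {"dog", "snow"}, {"cat", "snow"}]
ok = [False, True, False, True]
print(find_failure_modes(tags, ok, {"cat", "dog", "snow"},
                         min_support=2, min_rate=0.9))
# -> [frozenset({'dog', 'snow'})]
```

Because combinations of size k are tested before any of their supersets, each reported combination is minimal: no subset of it already explains the failures, which mirrors the minimal-set property the abstract emphasizes.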