Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning
Saved in:
Published in: Entropy (Basel, Switzerland), 2023-11, Vol. 25 (12), p. 1574
Main authors: , ,
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Research on Explainable Artificial Intelligence has recently started exploring the idea of producing explanations that, rather than being expressed in terms of low-level features, are encoded in terms of interpretable concepts. How to reliably acquire such concepts is, however, still fundamentally unclear. An agreed-upon notion of concept interpretability is missing, with the result that concepts used by both post hoc explainers and concept-based neural networks are acquired through a variety of mutually incompatible strategies. Critically, most of these neglect the human side of the problem: a representation is understandable only insofar as it can be understood by the human at the receiving end. The key challenge in human-interpretable representation learning (hrl) is how to model and operationalize this human element. In this work, we propose a mathematical framework for acquiring human-interpretable representations suitable for both post hoc explainers and concept-based neural networks. Our formalization of hrl builds on recent advances in causal representation learning and explicitly models a human stakeholder as an external observer. This allows us to derive a principled notion of alignment between the machine's representation and the vocabulary of concepts understood by the human. In doing so, we link alignment and interpretability through a simple and intuitive naming game, and clarify the relationship between alignment and a well-known property of representations, namely disentanglement. We also show that alignment is linked to the issue of undesirable correlations among concepts, also known as concept leakage, and to content-style separation, all through a general information-theoretic reformulation of these properties. Our conceptualization aims to bridge the gap between the human and algorithmic sides of interpretability and establish a stepping stone for new research on human-interpretable representations.
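The abstract mentions a general information-theoretic reformulation of alignment, disentanglement, and concept leakage without reproducing the formalism. As a minimal sketch of what such a reformulation can look like (the notation below, including the machine factors \hat{Z}_i, human concepts Z_j, and the matching \pi, is assumed for illustration and is not taken from the paper):

```latex
% Minimal sketch, not the paper's actual definitions. Assumed notation:
%   \hat{Z} = (\hat{Z}_1, \dots, \hat{Z}_k)   machine representation,
%   Z = (Z_1, \dots, Z_n)                     human concept vocabulary,
%   I(\cdot\,;\cdot)                          mutual information.

% One information-theoretic reading of alignment: each learned factor is
% informative about exactly one human concept, up to a matching \pi from
% machine factors to human concepts.
I(\hat{Z}_i ; Z_j) > 0 \quad \text{if and only if} \quad j = \pi(i)

% Concept leakage then shows up as a violation of this condition:
% information about unmatched concepts leaking into factor i.
\mathrm{Leak}(i) \;=\; \sum_{j \neq \pi(i)} I(\hat{Z}_i ; Z_j) \;>\; 0
```

Under this illustrative reading, content-style separation would be the special case in which designated style factors carry no information about any content concept, i.e., I(\hat{Z}_s ; Z_j) = 0 for all content concepts Z_j.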
ISSN: 1099-4300
DOI: 10.3390/e25121574