Training Neural Networks for Modularity aids Interpretability
| Field | Value |
|---|---|
| Main authors | |
| Format | Article |
| Language | English |
| Subjects | |
| Online access | Order full text |
Abstract: An approach to improve network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We find pretrained models to be highly unclusterable and thus train models to be more modular using an "enmeshment loss" function that encourages the formation of non-interacting clusters. Using automated interpretability measures, we show that our method finds clusters that learn different, disjoint, and smaller circuits for CIFAR-10 labels. Our approach provides a promising direction for making neural networks easier to interpret.
DOI: 10.48550/arxiv.2409.15747
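
The record gives only the abstract, not the formulation of the "enmeshment loss". As a rough illustration of the general idea, here is a minimal PyTorch sketch of one way a cross-cluster penalty could be written, assuming a fixed assignment of neurons to disjoint clusters and an L2 penalty on weights that cross cluster boundaries; the function name, the cluster assignments, and the hyperparameter `lam` are all hypothetical, and the paper's actual loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_cluster_penalty(weight, cluster_in, cluster_out):
    # Mask is True wherever a weight connects neurons assigned to
    # different clusters; penalizing these weights pushes the layer
    # toward non-interacting (modular) blocks. Illustrative sketch,
    # not the paper's exact loss.
    cross = cluster_out.unsqueeze(1) != cluster_in.unsqueeze(0)
    return (weight.pow(2) * cross).sum()

# Toy layer: 8 inputs and 6 outputs, each split into two clusters.
layer = nn.Linear(8, 6)
cluster_in = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
cluster_out = torch.tensor([0, 0, 0, 1, 1, 1])

x = torch.randn(32, 8)
target = torch.randn(32, 6)
lam = 1e-3  # penalty strength (hypothetical hyperparameter)

pred = layer(x)
loss = F.mse_loss(pred, target) + lam * cross_cluster_penalty(
    layer.weight, cluster_in, cluster_out
)
loss.backward()  # gradients now also discourage cross-cluster weights
```

Driving the cross-cluster weights toward zero would leave each cluster connected mostly to itself, which is one plausible way to encourage the "non-interacting clusters" the abstract describes.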