Attributing Learned Concepts in Neural Networks to Training Data
Abstract: By now there is substantial evidence that deep learning models learn certain human-interpretable features as part of their internal representations of data. Since having the right (or wrong) concepts is critical to trustworthy machine learning systems, it is natural to ask which inputs from the model's original training set were most important for learning a concept at a given layer. To answer this, we combine data attribution methods with methods for probing the concepts learned by a model. Training ensembles of networks and probes for two concept datasets across a range of network layers, we use the recently developed TRAK method for large-scale data attribution. We find some evidence of convergence: removing the 10,000 top-attributing images for a concept and retraining the model changes neither the location of the concept in the network nor the probing sparsity of the concept. This suggests that, rather than being highly dependent on a few specific examples, the features that inform the development of a concept are spread more diffusely across its exemplars, implying robustness in concept formation.
DOI: 10.48550/arxiv.2310.03149
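
As an illustration of the concept-probing step described in the abstract, below is a minimal sketch of fitting a sparse (L1-regularized) linear probe on frozen intermediate-layer activations to detect a concept. It assumes activations and binary concept labels are already available as tensors; the function name, penalty weight, and hyperparameters are illustrative, not taken from the paper, and the TRAK attribution and retraining steps are not shown.

```python
# Hypothetical sketch of a sparse linear concept probe on frozen layer
# activations; names and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn


def fit_concept_probe(activations, concept_labels, l1_weight=1e-3,
                      epochs=100, lr=1e-2):
    """Fit an L1-regularized linear probe predicting concept presence.

    activations:    (num_examples, feature_dim) float tensor of layer activations
    concept_labels: (num_examples,) float tensor of 0/1 concept labels
    """
    feature_dim = activations.shape[1]
    probe = nn.Linear(feature_dim, 1)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        logits = probe(activations).squeeze(-1)
        # The L1 penalty encourages a sparse probe, so probing sparsity can be
        # compared before and after retraining without the top-attributed images.
        loss = bce(logits, concept_labels) + l1_weight * probe.weight.abs().sum()
        loss.backward()
        optimizer.step()

    # Fraction of probe weights that are effectively zero (a sparsity measure).
    sparsity = (probe.weight.abs() < 1e-4).float().mean().item()
    return probe, sparsity
```

A probe like this would be fit at each layer of interest; comparing where the probe succeeds, and how sparse it is, before and after removing the top-attributed images is one way to operationalize the convergence check described above.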