The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is miss...

Bibliographic details
Published in:	arXiv.org 2024-05
Main authors:	Bushnaq, Lucius; Heimersheim, Stefan; Goldowsky-Dill, Nicholas; Braun, Dan; Mendel, Jake; Hänni, Kaarel; Griffin, Avery; Stöhler, Jörn; Wache, Magdalena; Hobbhahn, Marius
Format:	Article
Language:	English
Subjects:	Jacobi matrix method; Jacobian matrix; Large language models; Neural networks; Principal components analysis
Online access:	Full text