Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.
DOI: 10.48550/arxiv.2403.19647
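The summary's mention of SHIFT (ablating human-flagged, task-irrelevant features to improve a classifier's generalization) can be made concrete with a minimal sketch. This is not the authors' implementation: the sparse autoencoder class, dimensions, and feature indices below are hypothetical placeholders used only to illustrate the idea of decomposing an activation into sparse features, zeroing the flagged ones, and reconstructing.

```python
# Minimal sketch (assumptions, not the paper's code): decompose a hidden
# activation with a sparse autoencoder (SAE), zero out features a human has
# judged task-irrelevant, and return the edited activation.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """SAE mapping a d_model activation to d_dict sparse, non-negative features."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.dec(f)


def ablate_features(
    activation: torch.Tensor,
    sae: SparseAutoencoder,
    irrelevant: list[int],
) -> torch.Tensor:
    """Remove human-flagged feature directions from a hidden activation.

    The SAE reconstruction error is added back so that only the selected
    feature directions change, not the part of the activation the SAE misses.
    """
    features = sae.encode(activation)
    error = activation - sae.decode(features)   # residual the SAE does not capture
    features[..., irrelevant] = 0.0             # ablate the flagged features
    return sae.decode(features) + error


if __name__ == "__main__":
    d_model, d_dict = 512, 4096                 # hypothetical sizes
    sae = SparseAutoencoder(d_model, d_dict)
    act = torch.randn(2, d_model)               # stand-in hidden states
    edited = ablate_features(act, sae, irrelevant=[3, 17, 256])
    print(edited.shape)                         # torch.Size([2, 512])
```

In a full pipeline, the edited activation would replace the original one at the corresponding layer of the language model before the classifier head is applied; the indices in `irrelevant` would come from a human reviewing the discovered circuit's features rather than being hard-coded.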