Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning
Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations.
Saved in:
Main authors: | Olmo, Jeffrey; Wilson, Jared; Forsey, Max; Hepner, Bryce; Howe, Thomas Vin; Wingate, David |
---|---|
Format: | Article |
Language: | eng |
Keywords: | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Olmo, Jeffrey; Wilson, Jared; Forsey, Max; Hepner, Bryce; Howe, Thomas Vin; Wingate, David |
description | Sparse Autoencoders (SAEs) are a promising approach for extracting neural
network representations by learning a sparse and overcomplete decomposition of
the network's internal activations. However, SAEs are traditionally trained
considering only activation values and not the effect those activations have on
downstream computations. This limits the information available to learn
features, and biases the autoencoder towards neglecting features which are
represented with small activation values but strongly influence model outputs.
To address this, we introduce Gradient SAEs (g-SAEs), which modify the
$k$-sparse autoencoder architecture by augmenting the TopK activation function
to rely on the gradients of the input activation when selecting the $k$
elements. For a given sparsity level, g-SAEs produce reconstructions that are
more faithful to original network performance when propagated through the
network. Additionally, we find evidence that g-SAEs learn latents that are on
average more effective at steering models in arbitrary contexts. By considering
the downstream effects of activations, our approach leverages the dual nature
of neural network features as both $\textit{representations}$, retrospectively,
and $\textit{actions}$, prospectively. While previous methods have approached
the problem of feature discovery primarily focused on the former aspect, g-SAEs
represent a step towards accounting for the latter as well. |
doi_str_mv | 10.48550/arxiv.2411.10397 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2411.10397 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2411_10397 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning |
title | Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T01%3A29%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Features%20that%20Make%20a%20Difference:%20Leveraging%20Gradients%20for%20Improved%20Dictionary%20Learning&rft.au=Olmo,%20Jeffrey&rft.date=2024-11-15&rft_id=info:doi/10.48550/arxiv.2411.10397&rft_dat=%3Carxiv_GOX%3E2411_10397%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
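The abstract above describes the core mechanism of g-SAEs: the TopK activation of a k-sparse autoencoder is augmented so that the k active latents are selected using gradients of the input activation rather than activation values alone. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation: the class name GradientTopKSAE, the scoring rule (latent activation weighted by the magnitude of each decoder direction's alignment with the downstream gradient), and the assumption that a gradient grad_x of some downstream loss with respect to the input activation is available are all choices made here for illustration.

```python
# Hypothetical sketch of a gradient-aware TopK SAE ("g-SAE") as summarized in
# the abstract. The exact selection score used by the authors is not specified
# here; this version scores each latent by its activation times the alignment
# of its decoder direction with the gradient of a downstream loss w.r.t. the
# input activation (an attribution-style proxy for downstream effect).
import torch
import torch.nn as nn


class GradientTopKSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor, grad_x: torch.Tensor) -> torch.Tensor:
        """x: input activations [batch, d_model];
        grad_x: gradient of a downstream loss w.r.t. x, same shape as x."""
        z = torch.relu(self.enc(x))            # latent activations, before TopK masking
        # Approximate downstream effect of each latent: its decoder direction
        # dotted with the gradient of the downstream loss.
        effect = grad_x @ self.dec.weight      # [batch, d_latent]
        score = z * effect.abs()               # gradient-aware selection score
        # Keep the k latents with the largest scores (the TopK support), but
        # reconstruct using their ordinary activation values.
        topk = torch.topk(score, self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        return self.dec(z * mask)              # reconstruction of x
```

Under this reading, the gradients only influence which k latents survive the TopK step; the surviving latents are still decoded with their ordinary activation values, which is one way to make selection, rather than the activations themselves, sensitive to downstream effect.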