Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features which are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the $k$ elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated through the network. Additionally, we find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts. By considering the downstream effects of activations, our approach leverages the dual nature of neural network features as both $\textit{representations}$, retrospectively, and $\textit{actions}$, prospectively. While previous methods have approached the problem of feature discovery primarily focused on the former aspect, g-SAEs represent a step towards accounting for the latter as well.
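
The abstract describes the mechanism only at a high level: a TopK activation whose choice of the $k$ active latents also depends on gradients of the input activation. The sketch below illustrates one plausible reading in PyTorch; the class name `GradTopKSAE`, the gradient-weighted score (pre-activation magnitude times the decoder direction dotted with the activation gradient), and all shapes are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a gradient-informed TopK selection for a k-sparse
# autoencoder (g-SAE style). The exact scoring rule is not given in this
# record; here latents are ranked by |pre-activation| * |decoder direction
# . activation gradient|, an attribution-style proxy for downstream effect.
import torch
import torch.nn as nn


class GradTopKSAE(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor, x_grad: torch.Tensor) -> torch.Tensor:
        # x:      activations from the host model, shape (batch, d_model)
        # x_grad: gradient of the host model's loss w.r.t. x, same shape,
        #         obtained beforehand with a backward pass through the model
        pre = self.encoder(x)                        # (batch, d_dict)
        # Estimated downstream influence of each latent: its decoder
        # direction dotted with the activation gradient.
        influence = x_grad @ self.decoder.weight     # (batch, d_dict)
        score = pre.abs() * influence.abs()
        # Keep the k latents with the highest gradient-weighted score; the
        # kept latents still carry their ordinary pre-activation values.
        idx = score.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(pre).scatter_(-1, idx, 1.0)
        z = pre * mask
        return self.decoder(z)                       # reconstruction of x
```

In this sketch the gradients only influence which latents stay active; the surviving latents still contribute their ordinary encoder pre-activations to the reconstruction, so decoding is unchanged from a standard TopK SAE.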

Bibliographic Details

Main Authors: Olmo, Jeffrey; Wilson, Jared; Forsey, Max; Hepner, Bryce; Howe, Thomas Vin; Wingate, David
Format: Article
Language: English
Published: 2024-11-15
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning
DOI: 10.48550/arxiv.2411.10397
Source: arXiv.org
Online Access: https://arxiv.org/abs/2411.10397