Gated recurrent neural networks discover attention

Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
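To make the connection concrete, below is a minimal NumPy sketch (an illustrative reconstruction, not code from the paper) of the standard recurrent form of causal linear self-attention, which is the kind of construction the abstract refers to: the running sum S_t = S_{t-1} + v_t k_t^T is a purely linear recurrence over a matrix-valued state, while the outer product v_t k_t^T and the readout S_t q_t supply the multiplicative interactions, together reproducing y_t = sum_{i<=t} (q_t . k_i) v_i. Function and variable names are illustrative.

import numpy as np

def linear_attention_as_rnn(q, k, v):
    # q, k: (T, d_k); v: (T, d_v). Returns y with y[t] = sum_{i<=t} (q[t] . k[i]) * v[i].
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))            # matrix-valued recurrent state
    y = np.empty((T, d_v))
    for t in range(T):
        S = S + np.outer(v[t], k[t])    # linear state update driven by a multiplicative (outer-product) input
        y[t] = S @ q[t]                 # multiplicative readout: state gated by the current query
    return y

# Sanity check against the quadratic form of (unnormalised, causal) linear attention.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 4))    # three (T=6, d=4) arrays
direct = np.tril(q @ k.T) @ v           # y[t] = sum_{i<=t} (q[t] . k[i]) * v[i]
assert np.allclose(linear_attention_as_rnn(q, k, v), direct)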

Bibliographic details
Main authors: Zucchet, Nicolas; Kobayashi, Seijin; Akram, Yassir; von Oswald, Johannes; Larcher, Maxime; Steger, Angelika; Sacramento, João
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Zucchet, Nicolas
Kobayashi, Seijin
Akram, Yassir
von Oswald, Johannes
Larcher, Maxime
Steger, Angelika
Sacramento, João
description Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
doi_str_mv 10.48550/arxiv.2309.01775
format Article
creationdate 2023-09-04
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
oa free_for_read
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2309.01775
ispartof
issn
language eng
recordid cdi_arxiv_primary_2309_01775
source arXiv.org
subjects Computer Science - Learning
Computer Science - Neural and Evolutionary Computing
title Gated recurrent neural networks discover attention
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T01%3A31%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Gated%20recurrent%20neural%20networks%20discover%20attention&rft.au=Zucchet,%20Nicolas&rft.date=2023-09-04&rft_id=info:doi/10.48550/arxiv.2309.01775&rft_dat=%3Carxiv_GOX%3E2309_01775%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true