Gated recurrent neural networks discover attention

Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
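To make the connection concrete, below is a minimal NumPy sketch (an illustrative reconstruction, not code from the paper) of the standard recurrent form of causal linear self-attention, which is the kind of construction the abstract refers to: the running sum S_t = S_{t-1} + v_t k_t^T is a purely linear recurrence over a matrix-valued state, while the outer product v_t k_t^T and the readout S_t q_t supply the multiplicative interactions, together reproducing y_t = sum_{i<=t} (q_t . k_i) v_i. Function and variable names are illustrative.

import numpy as np

def linear_attention_as_rnn(q, k, v):
    # q, k: (T, d_k); v: (T, d_v). Returns y with y[t] = sum_{i<=t} (q[t] . k[i]) * v[i].
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))            # matrix-valued recurrent state
    y = np.empty((T, d_v))
    for t in range(T):
        S = S + np.outer(v[t], k[t])    # linear state update driven by a multiplicative (outer-product) input
        y[t] = S @ q[t]                 # multiplicative readout: state gated by the current query
    return y

# Sanity check against the quadratic form of (unnormalised, causal) linear attention.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 4))    # three (T=6, d=4) arrays
direct = np.tril(q @ k.T) @ v           # y[t] = sum_{i<=t} (q[t] . k[i]) * v[i]
assert np.allclose(linear_attention_as_rnn(q, k, v), direct)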

Bibliographic details
Main authors: Zucchet, Nicolas; Kobayashi, Seijin; Akram, Yassir; von Oswald, Johannes; Larcher, Maxime; Steger, Angelika; Sacramento, João
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Zucchet, Nicolas
Kobayashi, Seijin
Akram, Yassir
von Oswald, Johannes
Larcher, Maxime
Steger, Angelika
Sacramento, João
description Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
doi_str_mv 10.48550/arxiv.2309.01775
format Article
creationdate 2023-09-04
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
oa free_for_read
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2309.01775
ispartof
issn
language eng
recordid cdi_arxiv_primary_2309_01775
source arXiv.org
subjects Computer Science - Learning
Computer Science - Neural and Evolutionary Computing
title Gated recurrent neural networks discover attention
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T01%3A31%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Gated%20recurrent%20neural%20networks%20discover%20attention&rft.au=Zucchet,%20Nicolas&rft.date=2023-09-04&rft_id=info:doi/10.48550/arxiv.2309.01775&rft_dat=%3Carxiv_GOX%3E2309_01775%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true