Learning Linear Attention in Polynomial Time

Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key-value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
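As a quick illustration of the feature-space reduction described in the abstract, the following minimal NumPy sketch checks numerically that the output of a single softmax-free linear-attention head coincides with a linear predictor over a cubic feature expansion of the input. The dimensions, variable names, and the specific single-head parameterization are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not from the paper): single-head linear attention
# viewed as a linear predictor over an expanded (cubic) feature space.
import numpy as np

rng = np.random.default_rng(0)
n, d, dv = 5, 4, 3                      # sequence length, input dim, value dim (arbitrary)

X  = rng.standard_normal((n, d))        # one input sequence, rows are tokens
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, dv))

# Softmax-free ("linear") attention: (X Wq)(X Wk)^T (X Wv)
out_attn = (X @ Wq) @ (X @ Wk).T @ (X @ Wv)

# Cubic feature expansion: phi_i[a, b, e] = X[i, a] * sum_j X[j, b] X[j, e]
S   = X.T @ X                                         # (d, d) second-moment matrix
phi = np.einsum('ia,be->iabe', X, S).reshape(n, d**3)

# Matching linear predictor: theta[a, b, e, c] = (Wq Wk^T)[a, b] * Wv[e, c]
A     = Wq @ Wk.T
theta = np.einsum('ab,ec->abec', A, Wv).reshape(d**3, dv)

# The attention output and the linear predictor agree (up to float error).
assert np.allclose(out_attn, phi @ theta)
```

In this reading, fitting the coefficient matrix by least squares on (phi, output) pairs is an ordinary linear-regression problem, which is one way to interpret the abstract's claim that learning a linear transformer reduces to learning a linear predictor in an expanded feature space.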

Bibliographic Details
Published in: arXiv.org 2024-10
Main authors: Yau, Morris; Akyürek, Ekin; Mao, Jiayuan; Tenenbaum, Joshua B; Jegelka, Stefanie; Andreas, Jacob
Format: Article
Language: eng
Subjects:
Online access: Full text
container_title arXiv.org
creator Yau, Morris
Akyürek, Ekin
Mao, Jiayuan
Tenenbaum, Joshua B
Jegelka, Stefanie
Andreas, Jacob
description Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key-value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-10
issn 2331-8422
language eng
recordid cdi_proquest_journals_3116751622
source Free E-Journals
subjects Associative memory
Attention
Automata theory
Computation
Flight simulators
Learning
Polynomials
Transformers
Turing machines
title Learning Linear Attention in Polynomial Time
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T02%3A00%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Learning%20Linear%20Attention%20in%20Polynomial%20Time&rft.jtitle=arXiv.org&rft.au=Yau,%20Morris&rft.date=2024-10-18&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3116751622%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3116751622&rft_id=info:pmid/&rfr_iscdi=true