Learning Linear Attention in Polynomial Time

Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key-value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
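As a quick illustration of the feature-space reduction described in the abstract, the following minimal NumPy sketch checks numerically that the output of a single softmax-free linear-attention head coincides with a linear predictor over a cubic feature expansion of the input. The dimensions, variable names, and the specific single-head parameterization are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not from the paper): single-head linear attention
# viewed as a linear predictor over an expanded (cubic) feature space.
import numpy as np

rng = np.random.default_rng(0)
n, d, dv = 5, 4, 3                      # sequence length, input dim, value dim (arbitrary)

X  = rng.standard_normal((n, d))        # one input sequence, rows are tokens
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, dv))

# Softmax-free ("linear") attention: (X Wq)(X Wk)^T (X Wv)
out_attn = (X @ Wq) @ (X @ Wk).T @ (X @ Wv)

# Cubic feature expansion: phi_i[a, b, e] = X[i, a] * sum_j X[j, b] X[j, e]
S   = X.T @ X                                         # (d, d) second-moment matrix
phi = np.einsum('ia,be->iabe', X, S).reshape(n, d**3)

# Matching linear predictor: theta[a, b, e, c] = (Wq Wk^T)[a, b] * Wv[e, c]
A     = Wq @ Wk.T
theta = np.einsum('ab,ec->abec', A, Wv).reshape(d**3, dv)

# The attention output and the linear predictor agree (up to float error).
assert np.allclose(out_attn, phi @ theta)
```

In this reading, fitting the coefficient matrix by least squares on (phi, output) pairs is an ordinary linear-regression problem, which is one way to interpret the abstract's claim that learning a linear transformer reduces to learning a linear predictor in an expanded feature space.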

Bibliographic Details
Published in: arXiv.org 2024-10
Main authors: Yau, Morris; Akyürek, Ekin; Mao, Jiayuan; Tenenbaum, Joshua B; Jegelka, Stefanie; Andreas, Jacob
Format: Article
Language: eng
Subjects:
Online access: Full text
container_title arXiv.org
creator Yau, Morris
Akyürek, Ekin
Mao, Jiayuan
Tenenbaum, Joshua B
Jegelka, Stefanie
Andreas, Jacob
description Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key-value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-10
issn 2331-8422
language eng
recordid cdi_proquest_journals_3116751622
source Free E-Journals
subjects Associative memory
Attention
Automata theory
Computation
Flight simulators
Learning
Polynomials
Transformers
Turing machines
title Learning Linear Attention in Polynomial Time
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T02%3A00%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Learning%20Linear%20Attention%20in%20Polynomial%20Time&rft.jtitle=arXiv.org&rft.au=Yau,%20Morris&rft.date=2024-10-18&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3116751622%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3116751622&rft_id=info:pmid/&rfr_iscdi=true