The Parallelism Tradeoff: Limitations of Log-Precision Transformers


Bibliographic details
Published in: Transactions of the Association for Computational Linguistics, 2023-06, Vol. 11, p. 531-545
Main authors: Merrill, William; Sabharwal, Ashish
Format: Article
Language: English
Online access: Full text
container_title Transactions of the Association for Computational Linguistics
container_volume 11
container_issue
container_start_page 531
container_end_page 545
creator Merrill, William; Sabharwal, Ashish
description Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight into the power of transformers using known results in complexity theory. For example, if L ≠ P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture’s high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.
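A minimal sketch of the complexity-theoretic chain behind this claim, written in standard notation as an informal reconstruction (the paper states the precise uniformity and precision conditions):

\[
\text{log-precision transformers} \;\subseteq\; \text{logspace-uniform } \mathsf{TC}^0 \;\subseteq\; \mathsf{L},
\qquad
\mathsf{L} \neq \mathsf{P} \;\Longrightarrow\; \text{no } \mathsf{P}\text{-complete problem lies in } \mathsf{L}.
\]

Since solving linear equalities and testing membership in an arbitrary context-free grammar with empty productions are P-complete problems, the assumption L ≠ P places both outside what such transformers can compute exactly.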
doi_str_mv 10.1162/tacl_a_00562
format Article
identifier ISSN: 2307-387X
ispartof Transactions of the Association for Computational Linguistics, 2023-06, Vol.11, p.531-545
issn 2307-387X
eissn 2307-387X
language eng
recordid cdi_mit_journals_10_1162_tacl_a_00562
source ProQuest Central (Alumni Edition); DOAJ Directory of Open Access Journals; ProQuest Central Korea; EZB-FREE-00999 freely available EZB journals; ProQuest Central UK/Ireland; ProQuest Central
subjects Circuits
Complexity theory
Context free grammar
Linguistics
Logarithms
Mathematics
Natural language processing
Neural networks
Parallel processing
Power
Tradeoffs
Transformers
title The Parallelism Tradeoff: Limitations of Log-Precision Transformers
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T18%3A18%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_mit_j&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Parallelism%20Tradeoff:%20Limitations%20of%20Log-Precision%20Transformers&rft.jtitle=Transactions%20of%20the%20Association%20for%20Computational%20Linguistics&rft.au=Merrill,%20William&rft.date=2023-06-12&rft.volume=11&rft.spage=531&rft.epage=545&rft.pages=531-545&rft.issn=2307-387X&rft.eissn=2307-387X&rft_id=info:doi/10.1162/tacl_a_00562&rft_dat=%3Cproquest_mit_j%3E2893946879%3C/proquest_mit_j%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2893946879&rft_id=info:pmid/&rft_doaj_id=oai_doaj_org_article_29f71320a4e74b8c80c7db0491d6dd2c&rfr_iscdi=true