The Parallelism Tradeoff: Limitations of Log-Precision Transformers


Bibliographic details
Published in: Transactions of the Association for Computational Linguistics, 2023-06, Vol. 11, p. 531-545
Main authors: Merrill, William; Sabharwal, Ashish
Format: Article
Language: English
Online access: Full text
container_title Transactions of the Association for Computational Linguistics
container_volume 11
container_issue
container_start_page 531
container_end_page 545
creator Merrill, William; Sabharwal, Ashish
description Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight into the power of transformers using known results in complexity theory. For example, if L ≠ P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture’s high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.
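A minimal sketch of the complexity-theoretic chain behind this claim, written in standard notation as an informal reconstruction (the paper states the precise uniformity and precision conditions):

\[
\text{log-precision transformers} \;\subseteq\; \text{logspace-uniform } \mathsf{TC}^0 \;\subseteq\; \mathsf{L},
\qquad
\mathsf{L} \neq \mathsf{P} \;\Longrightarrow\; \text{no } \mathsf{P}\text{-complete problem lies in } \mathsf{L}.
\]

Since solving linear equalities and testing membership in an arbitrary context-free grammar with empty productions are P-complete problems, the assumption L ≠ P places both outside what such transformers can compute exactly.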
doi_str_mv 10.1162/tacl_a_00562
format Article
identifier ISSN: 2307-387X
ispartof Transactions of the Association for Computational Linguistics, 2023-06, Vol.11, p.531-545
issn 2307-387X
eissn 2307-387X
language eng
recordid cdi_mit_journals_10_1162_tacl_a_00562
source ProQuest Central (Alumni Edition); DOAJ Directory of Open Access Journals; ProQuest Central Korea; EZB-FREE-00999 freely available EZB journals; ProQuest Central UK/Ireland; ProQuest Central
subjects Circuits
Complexity theory
Context free grammar
Linguistics
Logarithms
Mathematics
Natural language processing
Neural networks
Parallel processing
Power
Tradeoffs
Transformers
title The Parallelism Tradeoff: Limitations of Log-Precision Transformers
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T18%3A18%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_mit_j&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Parallelism%20Tradeoff:%20Limitations%20of%20Log-Precision%20Transformers&rft.jtitle=Transactions%20of%20the%20Association%20for%20Computational%20Linguistics&rft.au=Merrill,%20William&rft.date=2023-06-12&rft.volume=11&rft.spage=531&rft.epage=545&rft.pages=531-545&rft.issn=2307-387X&rft.eissn=2307-387X&rft_id=info:doi/10.1162/tacl_a_00562&rft_dat=%3Cproquest_mit_j%3E2893946879%3C/proquest_mit_j%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2893946879&rft_id=info:pmid/&rft_doaj_id=oai_doaj_org_article_29f71320a4e74b8c80c7db0491d6dd2c&rfr_iscdi=true