The Parallelism Tradeoff: Limitations of Log-Precision Transformers
Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight on the power of transformers using known results in complexity theory. For example, if L ≠ P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture’s high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.
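The complexity-theoretic claim in the abstract rests on a short chain of containments. The sketch below is an informal reconstruction, not notation taken from this record: it uses the standard classes L (logspace), P (polynomial time), and TC⁰ (constant-depth threshold circuits), and it assumes the two example problems are P-complete under logspace reductions, which is what the abstract's implication presupposes.

$$
\text{log-precision transformers} \;\subseteq\; \text{logspace-uniform } \mathsf{TC}^0 \;\subseteq\; \mathsf{L} \;\subseteq\; \mathsf{P}
$$

If a P-complete problem (such as solving linear equalities or checking membership in a context-free grammar with empty productions) were decidable by such a transformer, it would lie in L and hence force L = P; contrapositively, L ≠ P rules this out.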
Saved in:
Published in: | Transactions of the Association for Computational Linguistics 2023-06, Vol.11, p.531-545 |
---|---|
Main Authors: | Merrill, William; Sabharwal, Ashish |
Format: | Article |
Language: | eng |
Keywords: | |
Online Access: | Full Text |
container_end_page | 545 |
---|---|
container_issue | |
container_start_page | 531 |
container_title | Transactions of the Association for Computational Linguistics |
container_volume | 11 |
creator | Merrill, William; Sabharwal, Ashish |
description | Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight on the power of transformers using known results in complexity theory. For example, if L ≠ P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture’s high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm. |
doi_str_mv | 10.1162/tacl_a_00562 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 2307-387X |
ispartof | Transactions of the Association for Computational Linguistics, 2023-06, Vol.11, p.531-545 |
issn | 2307-387X |
language | eng |
recordid | cdi_mit_journals_10_1162_tacl_a_00562 |
source | ProQuest Central (Alumni Edition); DOAJ Directory of Open Access Journals; ProQuest Central Korea; EZB-FREE-00999 freely available EZB journals; ProQuest Central UK/Ireland; ProQuest Central |
subjects | Circuits; Complexity theory; Context free grammar; Linguistics; Logarithms; Mathematics; Natural language processing; Neural networks; Parallel processing; Power; Tradeoffs; Transformers |
title | The Parallelism Tradeoff: Limitations of Log-Precision Transformers |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T18%3A18%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_mit_j&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Parallelism%20Tradeoff:%20Limitations%20of%20Log-Precision%20Transformers&rft.jtitle=Transactions%20of%20the%20Association%20for%20Computational%20Linguistics&rft.au=Merrill,%20William&rft.date=2023-06-12&rft.volume=11&rft.spage=531&rft.epage=545&rft.pages=531-545&rft.issn=2307-387X&rft.eissn=2307-387X&rft_id=info:doi/10.1162/tacl_a_00562&rft_dat=%3Cproquest_mit_j%3E2893946879%3C/proquest_mit_j%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2893946879&rft_id=info:pmid/&rft_doaj_id=oai_doaj_org_article_29f71320a4e74b8c80c7db0491d6dd2c&rfr_iscdi=true |