Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers
Saved in:
Main authors: | Li, Xiaoyu; Liang, Yingyu; Shi, Zhenmei; Song, Zhao; Wan, Mingda |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computational Complexity; Computer Science - Learning |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Li, Xiaoyu; Liang, Yingyu; Shi, Zhenmei; Song, Zhao; Wan, Mingda |
description | Tensor Attention extends traditional attention mechanisms by capturing
high-order correlations across multiple modalities, addressing the limitations
of classical matrix-based attention. Meanwhile, Rotary Position Embedding
($\mathsf{RoPE}$) has shown superior performance in encoding positional
information in long-context scenarios, significantly enhancing transformer
models' expressiveness. Despite these empirical successes, the theoretical
limitations of these technologies remain underexplored. In this study, we
analyze the circuit complexity of Tensor Attention and $\mathsf{RoPE}$-based
Tensor Attention, showing that with polynomial precision, constant-depth
layers, and linear or sublinear hidden dimension, they cannot solve fixed
membership problems or $(A_{F,r})^*$ closure problems, under the assumption
that $\mathsf{TC}^0 \neq \mathsf{NC}^1$. These findings highlight a gap between
the empirical performance and theoretical constraints of Tensor Attention and
$\mathsf{RoPE}$-based Tensor Attention Transformers, offering insights that
could guide the development of more theoretically grounded approaches to
Transformer model design and scaling. |
doi_str_mv | 10.48550/arxiv.2412.18040 |
format | Article |
creationdate | 2024-12-23 |
rights | http://creativecommons.org/licenses/by/4.0 |
oa | free_for_read |
backlink | https://arxiv.org/abs/2412.18040 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2412.18040 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2412_18040 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computational Complexity; Computer Science - Learning |
title | Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T18%3A02%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Theoretical%20Constraints%20on%20the%20Expressive%20Power%20of%20$%5Cmathsf%7BRoPE%7D$-based%20Tensor%20Attention%20Transformers&rft.au=Li,%20Xiaoyu&rft.date=2024-12-23&rft_id=info:doi/10.48550/arxiv.2412.18040&rft_dat=%3Carxiv_GOX%3E2412_18040%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
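The abstract above centers on Rotary Position Embedding ($\mathsf{RoPE}$) and Tensor Attention. For readers unfamiliar with RoPE, the following is a minimal NumPy sketch of the standard rotary construction (coordinate pairs of a query or key vector rotated by position-dependent angles). It is an illustrative sketch only, not code from the indexed paper, and the helper name `rope_rotate` is hypothetical.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to a query or key vector x of even dimension d.

    Coordinate pairs (x[2i], x[2i+1]) are rotated by the angle pos * base**(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE expects an even head dimension"
    # One rotation frequency per 2-D coordinate pair.
    freqs = base ** (-2.0 * np.arange(d // 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The rotated query-key dot product depends only on the relative offset between
# positions, the property that makes RoPE attractive for long-context attention.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
score_a = rope_rotate(q, 5) @ rope_rotate(k, 2)   # offset 3
score_b = rope_rotate(q, 9) @ rope_rotate(k, 6)   # offset 3
assert np.isclose(score_a, score_b)
```

In an actual attention head this rotation is applied row-wise to the query and key matrices before the softmax score is computed; the Tensor Attention studied in the paper, roughly speaking, generalizes that bilinear query-key score to a higher-order score over multiple key inputs, and the abstract's circuit-complexity bounds concern that generalization with and without RoPE.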