Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers
Saved in:
Main authors: | Li, Xiaoyu; Liang, Yingyu; Shi, Zhenmei; Song, Zhao; Wan, Mingda |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computational Complexity; Computer Science - Learning |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Li, Xiaoyu; Liang, Yingyu; Shi, Zhenmei; Song, Zhao; Wan, Mingda |
description | Tensor Attention extends traditional attention mechanisms by capturing
high-order correlations across multiple modalities, addressing the limitations
of classical matrix-based attention. Meanwhile, Rotary Position Embedding
($\mathsf{RoPE}$) has shown superior performance in encoding positional
information in long-context scenarios, significantly enhancing transformer
models' expressiveness. Despite these empirical successes, the theoretical
limitations of these technologies remain underexplored. In this study, we
analyze the circuit complexity of Tensor Attention and $\mathsf{RoPE}$-based
Tensor Attention, showing that with polynomial precision, constant-depth
layers, and linear or sublinear hidden dimension, they cannot solve fixed
membership problems or $(A_{F,r})^*$ closure problems, under the assumption
that $\mathsf{TC}^0 \neq \mathsf{NC}^1$. These findings highlight a gap between
the empirical performance and theoretical constraints of Tensor Attention and
$\mathsf{RoPE}$-based Tensor Attention Transformers, offering insights that
could guide the development of more theoretically grounded approaches to
Transformer model design and scaling. |
doi_str_mv | 10.48550/arxiv.2412.18040 |
format | Article |
creationdate | 2024-12-23 |
rights | http://creativecommons.org/licenses/by/4.0 |
oa | free_for_read |
backlink | https://arxiv.org/abs/2412.18040 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2412.18040 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2412_18040 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computational Complexity; Computer Science - Learning |
title | Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T18%3A02%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Theoretical%20Constraints%20on%20the%20Expressive%20Power%20of%20$%5Cmathsf%7BRoPE%7D$-based%20Tensor%20Attention%20Transformers&rft.au=Li,%20Xiaoyu&rft.date=2024-12-23&rft_id=info:doi/10.48550/arxiv.2412.18040&rft_dat=%3Carxiv_GOX%3E2412_18040%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
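The abstract above centers on Rotary Position Embedding ($\mathsf{RoPE}$) and Tensor Attention. For readers unfamiliar with RoPE, the following is a minimal NumPy sketch of the standard rotary construction (coordinate pairs of a query or key vector rotated by position-dependent angles). It is an illustrative sketch only, not code from the indexed paper, and the helper name `rope_rotate` is hypothetical.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to a query or key vector x of even dimension d.

    Coordinate pairs (x[2i], x[2i+1]) are rotated by the angle pos * base**(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE expects an even head dimension"
    # One rotation frequency per 2-D coordinate pair.
    freqs = base ** (-2.0 * np.arange(d // 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The rotated query-key dot product depends only on the relative offset between
# positions, the property that makes RoPE attractive for long-context attention.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
score_a = rope_rotate(q, 5) @ rope_rotate(k, 2)   # offset 3
score_b = rope_rotate(q, 9) @ rope_rotate(k, 6)   # offset 3
assert np.isclose(score_a, score_b)
```

In an actual attention head this rotation is applied row-wise to the query and key matrices before the softmax score is computed; the Tensor Attention studied in the paper, roughly speaking, generalizes that bilinear query-key score to a higher-order score over multiple key inputs, and the abstract's circuit-complexity bounds concern that generalization with and without RoPE.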