Removing Rows and Columns of Tokens in Vision Transformer Enables Faster Dense Prediction Without Retraining

In recent years, vision transformers based on self-attention mechanisms have demonstrated remarkable abilities in various tasks such as natural language processing, computer vision (CV), and multimodal applications. However, due to the high computational costs and the structural nature of images, the application of transformers to CV tasks faces challenges, particularly when handling ultra-high-resolution images. Recently, several token reduction methods have been proposed to improve the computational efficiency of transformers by reducing the number of tokens without the need for retraining. These methods primarily involve fusion based on matching or clustering. The former exhibits faster speed but suffers more accuracy loss compared to the latter. In this work, we propose a simple matching-based fusion method called Token Adapter, which achieves comparable accuracy to the clustering-based fusion method with faster speed and demonstrates higher potential in terms of robustness. Our method was applied to Segmenter, MaskDINO and SWAG, exhibiting promising performance on four tasks, including semantic segmentation, instance segmentation, panoptic segmentation, and image classification. Specifically, our method can be applied to Segmenter on ADE20k, providing 41% frames per second (FPS) acceleration while maintaining full performance without retraining or fine-tuning off-the-shelf weights. Our code will be released at https://github.com/MilknoCandy/Token-Adapter.
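The abstract's central mechanism, fusing the most similar tokens by matching rather than clustering so that a pretrained ViT runs on fewer tokens with no retraining, can be sketched in a few lines. The snippet below is a generic bipartite soft-matching merge in the spirit of ToMe (Bolya et al.), not the chapter's Token Adapter, which removes whole rows and columns of tokens; the function name, the alternating token split, and the cosine-similarity criterion are illustrative assumptions rather than the authors' implementation.

# Hypothetical sketch of matching-based token fusion (ToMe-style);
# NOT the chapter's Token Adapter. Assumes PyTorch and 0 <= r <= N // 2.
import torch

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs; (B, N, C) -> (B, N - r, C)."""
    B, N, C = x.shape
    a, b = x[:, ::2, :], x[:, 1::2, :]            # alternating bipartite split
    a_n = a / a.norm(dim=-1, keepdim=True)        # unit-normalize for cosine sim
    b_n = b / b.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.transpose(-1, -2)          # (B, N_a, N_b) similarities
    best_val, best_idx = scores.max(dim=-1)       # best partner in b per a-token
    order = best_val.argsort(dim=-1, descending=True)
    merged_src, kept_src = order[:, :r], order[:, r:]
    dst = best_idx.gather(-1, merged_src)         # matched positions in b
    src = a.gather(1, merged_src.unsqueeze(-1).expand(-1, -1, C))
    # Average each merged a-token into its matched b-token.
    b = b.scatter_reduce(1, dst.unsqueeze(-1).expand(-1, -1, C), src,
                         reduce="mean", include_self=True)
    a_kept = a.gather(1, kept_src.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([a_kept, b], dim=1)

Applied once per transformer block, such a merge shrinks the token count progressively, which is how training-free reduction methods recover throughput; the chapter reports a 41% FPS gain for Segmenter on ADE20k with its own structured variant of this idea.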

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Su, Diwei; Fei, Cheng; Luo, Jianxu
Format: Book Chapter
Language: eng
Subjects: Dense prediction; Efficient transformer; Token reduction; Vision transformer
Online Access: Full text
container_end_page 341
container_start_page 325
container_volume 15130
creator Su, Diwei
Fei, Cheng
Luo, Jianxu
description In recent years, vision transformers based on self-attention mechanisms have demonstrated remarkable abilities in various tasks such as natural language processing, computer vision (CV), and multimodal applications. However, due to the high computational costs and the structural nature of images, the application of transformers to CV tasks faces challenges, particularly when handling ultra-high-resolution images. Recently, several token reduction methods have been proposed to improve the computational efficiency of transformers by reducing the number of tokens without the need for retraining. These methods primarily involve fusion based on matching or clustering. The former exhibits faster speed but suffers more accuracy loss compared to the latter. In this work, we propose a simple matching-based fusion method called Token Adapter, which achieves comparable accuracy to the clustering-based fusion method with faster speed and demonstrates higher potential in terms of robustness. Our method was applied to Segmenter, MaskDINO and SWAG, exhibiting promising performance on four tasks, including semantic segmentation, instance segmentation, panoptic segmentation, and image classification. Specifically, our method can be applied to Segmenter on ADE20k, providing 41% frames per second (FPS) acceleration while maintaining full performance without retraining or fine-tuning off-the-shelf weights. Our code will be released at https://github.com/MilknoCandy/Token-Adapter.
doi_str_mv 10.1007/978-3-031-73220-1_19
format Book Chapter
contributor Russakovsky, Olga
Ricci, Elisa
Sattler, Torsten
Leonardis, Ales
Roth, Stefan
Varol, Gül
isbn 3031732197
9783031732195
eisbn 9783031732201
3031732200
oclc 1467879245
lccallnum TA1501-1820
publisher Switzerland: Springer
relation Lecture Notes in Computer Science
rights The Author(s), under exclusive license to Springer Nature Switzerland AG 2025
orcidid 0000-0002-0890-5256
0000-0002-2503-4202
0009-0008-4922-9259
tpages 17
fulltext fulltext
identifier ISSN: 0302-9743
ispartof Computer Vision - ECCV 2024, 2024, Vol.15130, p.325-341
issn 0302-9743
1611-3349
language eng
recordid cdi_proquest_ebookcentralchapters_31752202_298_410
source Springer Books
subjects Dense prediction
Efficient transformer
Token reduction
Vision transformer
title Removing Rows and Columns of Tokens in Vision Transformer Enables Faster Dense Prediction Without Retraining
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-05T21%3A09%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_sprin&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=bookitem&rft.atitle=Removing%20Rows%20and%20Columns%20of%20Tokens%20in%20Vision%20Transformer%20Enables%20Faster%20Dense%20Prediction%20Without%20Retraining&rft.btitle=Computer%20Vision%20-%20ECCV%202024&rft.au=Su,%20Diwei&rft.date=2024&rft.volume=15130&rft.spage=325&rft.epage=341&rft.pages=325-341&rft.issn=0302-9743&rft.eissn=1611-3349&rft.isbn=3031732197&rft.isbn_list=9783031732195&rft_id=info:doi/10.1007/978-3-031-73220-1_19&rft_dat=%3Cproquest_sprin%3EEBC31752202_298_410%3C/proquest_sprin%3E%3Curl%3E%3C/url%3E&rft.eisbn=9783031732201&rft.eisbn_list=3031732200&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=EBC31752202_298_410&rft_id=info:pmid/&rfr_iscdi=true