UniNeXt: Exploring A Unified Architecture for Vision Recognition

Vision Transformers have shown great potential in computer vision tasks. Most recent works have focused on elaborating the spatial token mixer for performance gains. However, we observe that a well-designed general architecture can significantly improve the performance of the entire backbone, regardless of which spatial token mixer is equipped. In this paper, we propose UniNeXt, an improved general architecture for the vision backbone. To verify its effectiveness, we instantiate the spatial token mixer with various typical and modern designs, including both convolution and attention modules. Compared with the architecture in which they are first proposed, our UniNeXt architecture can steadily boost the performance of all the spatial token mixers, and narrows the performance gap among them. Surprisingly, our UniNeXt equipped with naive local window attention even outperforms the previous state-of-the-art. Interestingly, the ranking of these spatial token mixers also changes under our UniNeXt, suggesting that an excellent spatial token mixer may be stifled due to a suboptimal general architecture, which further shows the importance of the study on the general architecture of vision backbone.
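The abstract's central idea is a general architecture in which the spatial token mixer is a swappable module. The sketch below illustrates that design pattern only; it is not the paper's actual code. The block structure (pre-norm, mixer, residual) and the local-averaging mixer are illustrative assumptions, and all names (`block`, `avg_pool_mixer`) are hypothetical.

```python
# Illustrative sketch (assumed structure, not UniNeXt's published code):
# a backbone block that treats the spatial token mixer as a pluggable callable,
# so different mixers (convolution, attention, pooling) can be compared
# under one fixed general architecture.

def layer_norm(token):
    # simple normalization over a single token's channel dimension
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / (var + 1e-6) ** 0.5 for v in token]

def avg_pool_mixer(tokens):
    # a trivial spatial token mixer: average each token with its
    # neighbors in a 3-token local window (stand-in for window attention)
    mixed = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - 1): i + 2]
        mixed.append([sum(ch) / len(window) for ch in zip(*window)])
    return mixed

def block(tokens, token_mixer):
    # general architecture: pre-norm -> spatial token mixer -> residual add;
    # only `token_mixer` changes between compared variants
    normed = [layer_norm(t) for t in tokens]
    mixed = token_mixer(normed)
    return [[a + b for a, b in zip(t, m)] for t, m in zip(tokens, mixed)]
```

Under this framing, the paper's experiment amounts to holding `block` fixed and swapping `token_mixer`, e.g. `block(tokens, avg_pool_mixer)` versus an attention-based mixer.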

Published in: arXiv.org, 2023-08
Authors: Lin, Fangjian; Yuan, Jianlong; Wu, Sitong; Wang, Fan; Wang, Zhibin
Format: Article
Language: English
DOI: 10.48550/arXiv.2304.13700
EISSN: 2331-8422
Subjects: Computer Science - Computer Vision and Pattern Recognition; Computer vision; Mixers; Performance enhancement