MixFormer: End-to-End Tracking With Iterative Mixed Attention



Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024-06, Vol. 46 (6), p. 4129-4146
Main Authors: Cui, Yutao; Jiang, Cheng; Wu, Gangshan; Wang, Limin
Format: Article
Language: English (eng)
Description: Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the processes of feature extraction and target information integration, in this paper we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and we propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative features and perform extensive communication between the target and search area. Based on MAM, we build our MixFormer trackers simply by stacking multiple MAMs and placing a localization head on top. Specifically, we instantiate two types of MixFormer trackers: a hierarchical tracker, MixCvT, and a non-hierarchical simple tracker, MixViT. For these two trackers, we investigate a series of pre-training methods and uncover the different behaviors of supervised and self-supervised pre-training in our MixFormer trackers. We also extend masked-autoencoder pre-training to our MixFormer trackers and design the new, competitive TrackMAE pre-training technique. Finally, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and propose an effective score prediction module to select high-quality templates. Our MixFormer trackers set new state-of-the-art performance on seven tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, OTB100, TOTB and UAV123. In particular, our MixViT-L achieves AUC scores of 73.3% on LaSOT, 86.1% on TrackingNet and 82.8% on TOTB.
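To make the abstract's central idea concrete, the following is a minimal, hypothetical PyTorch sketch of a mixed-attention layer: template and search tokens are concatenated so that a single attention operation performs feature extraction and target-search communication at once, with an optional asymmetric mode in which template queries skip the search tokens. The class name MixedAttentionSketch, the asymmetric flag, and the shared projection weights are our own illustrative choices; the paper's actual MAM (with convolutional projections, stacked stages, and a localization head on top) differs in detail.

# Sketch of the "mixed attention" idea from the abstract, not the paper's code.
import torch
import torch.nn as nn

class MixedAttentionSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, asymmetric: bool = True):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # "Asymmetric" here means template queries attend only to template
        # keys/values, which cuts cost when several templates are kept online.
        self.asymmetric = asymmetric

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        # template: (B, N_t, C), search: (B, N_s, C)
        mixed = torch.cat([template, search], dim=1)
        if self.asymmetric:
            # Template tokens see only themselves; search tokens see everything.
            t_out, _ = self.attn(template, template, template)
            s_out, _ = self.attn(search, mixed, mixed)
        else:
            # Full mixed attention: one operation over the concatenated tokens.
            out, _ = self.attn(mixed, mixed, mixed)
            t_out, s_out = out.split([template.size(1), search.size(1)], dim=1)
        return t_out, s_out

# Usage: a 64-token template and a 256-token search region, embedding dim 256.
if __name__ == "__main__":
    mam = MixedAttentionSketch(dim=256)
    t = torch.randn(2, 64, 256)
    s = torch.randn(2, 256, 256)
    t_out, s_out = mam(t, s)
    print(t_out.shape, s_out.shape)  # (2, 64, 256) and (2, 256, 256)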
DOI: 10.1109/TPAMI.2024.3349519
ISSN: 0162-8828
EISSN: 1939-3539, 2160-9292
Source: IEEE Electronic Library (IEL)
Subjects: Compact tracking framework
Computational efficiency
Correlation
Feature extraction
Head
Location awareness
Magnetic heads
mixed attention
Modules
Multiple target tracking
Optical tracking
score prediction
self-supervised
Target tracking
Tracking
Transformers
vision transformer
visual tracking