MambaVT: Spatio-Temporal Contextual Modeling for Robust RGB-T Tracking

Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt image-pair appearance matching and are constrained by the intrinsically quadratic complexity of the attention mechanism, resulting in limited exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long-sequence modeling capabilities and linear computational complexity, this work proposes a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise a long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory prompts to predict subsequent target states based on local temporal location clues. Extensive experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. We aim for this work to serve as a simple yet strong baseline, stimulating future research in this field. The code and pre-trained models will be made available.

Detailed Description

Bibliographic Details
Main authors: Lai, Simiao; Liu, Chang; Zhu, Jiawen; Kang, Ben; Liu, Yang; Wang, Dong; Lu, Huchuan
Format: Article
Published: 2024-08-14
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: Order full text
description Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt image-pair appearance matching and are constrained by the intrinsically quadratic complexity of the attention mechanism, resulting in limited exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long-sequence modeling capabilities and linear computational complexity, this work proposes a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise a long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory prompts to predict subsequent target states based on local temporal location clues. Extensive experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. We aim for this work to serve as a simple yet strong baseline, stimulating future research in this field. The code and pre-trained models will be made available.
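The abstract's central complexity contrast — attention builds an L × L score matrix (quadratic in sequence length), while a state space model advances a fixed-size hidden state in a single pass (linear) — can be illustrated with a toy sketch. This is a generic linear SSM recurrence and plain softmax attention, not the paper's actual MambaVT architecture; all array shapes and names here are illustrative assumptions:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    One pass over the sequence => O(L) in sequence length L."""
    L, _ = x.shape
    h = np.zeros(A.shape[0])
    ys = np.empty((L, C.shape[0]))
    for t in range(L):
        h = A @ h + B @ x[t]   # fixed-size state update, no L x L matrix
        ys[t] = C @ h
    return ys

def attention(q, k, v):
    """Full self-attention: materializes an L x L score matrix => O(L^2)."""
    scores = q @ k.T / np.sqrt(q.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

# Toy dimensions: sequence length L, token dim d, SSM state dim n.
L, d, n = 8, 4, 6
rng = np.random.default_rng(0)
x = rng.normal(size=(L, d))
A = 0.9 * np.eye(n)            # stable state-transition matrix
B = rng.normal(size=(n, d))
C = rng.normal(size=(d, n))

y_ssm = ssm_scan(x, A, B, C)   # shape (L, d), one O(L) pass
y_att = attention(x, x, x)     # shape (L, d), via an O(L^2) score matrix
```

Both paths map an (L, d) sequence to an (L, d) output, but only attention's cost grows quadratically with L, which is the bottleneck the abstract attributes to Transformer-based trackers when extending matching across many frames.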
identifier DOI: 10.48550/arxiv.2408.07889
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition