MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking
creator | Lai, Simiao; Liu, Chang; Zhu, Jiawen; Kang, Ben; Liu, Yang; Wang, Dong; Lu, Huchuan |
description | Existing RGB-T tracking algorithms have made remarkable progress by
leveraging the global interaction capability and extensive pre-trained models
of the Transformer architecture. Nonetheless, these methods mainly adopt
image-pair appearance matching and face the challenge of the attention
mechanism's intrinsically quadratic complexity, resulting in constrained
exploitation of temporal information. Inspired by the recently emerged State
Space Model Mamba, renowned for its impressive long-sequence modeling
capabilities and linear computational complexity, this work proposes a pure
Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual
modeling for robust visible-thermal tracking. Specifically, we devise a
long-range cross-frame integration component to globally adapt to target
appearance variations, and introduce short-term historical trajectory prompts
to predict subsequent target states based on local temporal location cues.
Extensive experiments show the significant potential of vision Mamba for RGB-T
tracking, with MambaVT achieving state-of-the-art performance on four
mainstream benchmarks while requiring lower computational costs. We aim for
this work to serve as a simple yet strong baseline, stimulating future research
in this field. The code and pre-trained models will be made available. |
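The abstract contrasts the quadratic cost of attention with the linear cost of a state-space recurrence. As a toy illustration only (not the authors' MambaVT model — the function name, coefficients, and scalar state here are all made-up simplifications), a one-dimensional state-space scan processes a length-L sequence in a single O(L) pass, whereas full self-attention compares every pair of positions at O(L²) cost:

```python
# Toy 1-D state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
# One sequential pass over the inputs -> O(L) time and O(1) state,
# unlike self-attention, which forms an L x L interaction matrix.

def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Run a scalar state-space recurrence over the sequence xs."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # state update: decayed memory plus new input
        ys.append(c * h)    # linear readout of the hidden state
    return ys

# An impulse input produces a geometrically decaying response,
# showing how the state carries (and gradually forgets) history.
ys = ssm_scan([1.0, 0.0, 0.0])  # roughly [0.5, 0.45, 0.405]
```

Mamba makes the coefficients input-dependent ("selective") and vectorized, but the linear-in-sequence-length scan structure sketched above is what enables the long-range cross-frame modeling the abstract describes.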
doi_str_mv | 10.48550/arxiv.2408.07889 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2408.07889 |
language | eng |
recordid | cdi_arxiv_primary_2408_07889 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking |
url | https://arxiv.org/abs/2408.07889 |