Enhancing visual tracking with a unified temporal Transformer framework

Visual object tracking is an essential research topic in computer vision with numerous practical applications, including visual surveillance systems, autonomous vehicles and intelligent transportation systems. It involves tackling various challenges such as motion blur, occlusion and distractors, which require trackers to leverage temporal information, including temporal appearance information, temporal trajectory information and temporal context information. However, existing trackers usually focus on exploiting one particular type of temporal information while neglecting the complementarity of the different types. Additionally, cross-frame correlations that enable the transfer of diverse temporal information during tracking remain under-explored. In this work, we propose a Unified Temporal Transformer Framework (UTTF) for robust visual tracking. Our framework effectively establishes multi-scale cross-frame relationships among historical frames and exploits the complementary information among three typical temporal information sources. Specifically, a Pyramid Spatial-Temporal Transformer Encoder (PSTTE) is designed to mutually reinforce historical features by establishing sound multi-scale associations (i.e., token-level, semantic-level and frame-level). Furthermore, an Adaptive Fusion Transformer Decoder (AFTD) is proposed to adaptively aggregate informative temporal cues from historical frames to enhance the features of the current frame. Moreover, the proposed UTTF network can be easily extended to various tracking frameworks. Our experiments on seven prevalent visual object tracking benchmarks demonstrate that the proposed trackers outperform existing ones, establishing new state-of-the-art results.
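The record itself contains no code, but the encoder-decoder idea described in the abstract can be illustrated in miniature. The PyTorch sketch below is a hedged illustration, not the authors' implementation: a self-attention encoder stands in for the PSTTE, letting tokens pooled from several historical frames reinforce one another across frames, and a cross-attention decoder stands in for the AFTD, letting current-frame tokens aggregate cues from that history. All class names, tensor shapes and hyperparameters are assumptions made for illustration; the actual UTTF modules are multi-scale (token-, semantic- and frame-level) and embedded in full tracking pipelines.

    # Illustrative sketch only: module names, shapes and hyperparameters are
    # assumptions, not the paper's actual PSTTE/AFTD design.
    import torch
    import torch.nn as nn

    class TemporalEncoder(nn.Module):
        """Self-attention over tokens pooled from T historical frames."""
        def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

        def forward(self, hist: torch.Tensor) -> torch.Tensor:
            # hist: (B, T*N, C) -- tokens from all historical frames are
            # concatenated, so attention can form cross-frame associations.
            return self.encoder(hist)

    class TemporalDecoder(nn.Module):
        """Cross-attention: current-frame tokens query the historical memory."""
        def __init__(self, dim: int = 256, heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, cur: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
            # cur: (B, N, C) current-frame tokens; memory: (B, T*N, C).
            fused, _ = self.attn(query=cur, key=memory, value=memory)
            # Residual add: historical cues enhance, rather than replace,
            # the current frame's features.
            return self.norm(cur + fused)

    if __name__ == "__main__":
        B, T, N, C = 2, 4, 64, 256        # batch, history length, tokens, channels
        hist = torch.randn(B, T * N, C)   # features of T historical frames
        cur = torch.randn(B, N, C)        # features of the current frame
        memory = TemporalEncoder()(hist)  # mutually reinforce the history
        out = TemporalDecoder()(cur, memory)
        print(out.shape)                  # torch.Size([2, 64, 256])

The residual connection in the decoder mirrors the abstract's notion of enhancing current-frame features with informative temporal cues from historical frames rather than overwriting them.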

Bibliographic Details
Published in: IEEE Transactions on Intelligent Vehicles, 2024, pp. 1-15
Main Authors: Zhang, Tianlu; Jin, Ziniu; Debattista, Kurt; Zhang, Qiang; Han, Jungong
Format: Article
Language: English
Subjects: Object tracking; temporal information; Transformer
DOI: 10.1109/TIV.2024.3398405
ISSN: 2379-8858
EISSN: 2379-8904
Source: IEEE Electronic Library (IEL)
Online Access: Order full text