Enhancing visual tracking with a unified temporal Transformer framework

Visual object tracking is an essential research topic in computer vision with numerous practical applications, including visual surveillance systems, autonomous vehicles and intelligent transportation systems. It involves tackling various challenges such as motion blur, occlusion and distractors, which require trackers to leverage temporal information, including temporal appearance information, temporal trajectory information and temporal context information. However, existing trackers usually focus on exploiting one particular type of temporal information while neglecting the complementarity of the different types. Additionally, cross-frame correlations that enable the transfer of diverse temporal information during tracking remain under-explored. In this work, we propose a Unified Temporal Transformer Framework (UTTF) for robust visual tracking. Our framework effectively establishes multi-scale cross-frame relationships among historical frames and exploits the complementary information among three typical temporal information sources. Specifically, a Pyramid Spatial-Temporal Transformer Encoder (PSTTE) is designed to mutually reinforce historical features by establishing sound multi-scale associations (i.e., token-level, semantic-level and frame-level). Furthermore, an Adaptive Fusion Transformer Decoder (AFTD) is proposed to adaptively aggregate informative temporal cues from historical frames to enhance the features of the current frame. Moreover, the proposed UTTF network can be easily extended to various tracking frameworks. Our experiments on seven prevalent visual object tracking benchmarks demonstrate that the proposed trackers outperform existing ones, establishing new state-of-the-art results.
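The record itself contains no code, but the encoder-decoder idea described in the abstract can be illustrated in miniature. The PyTorch sketch below is a hedged illustration, not the authors' implementation: a self-attention encoder stands in for the PSTTE, letting tokens pooled from several historical frames reinforce one another across frames, and a cross-attention decoder stands in for the AFTD, letting current-frame tokens aggregate cues from that history. All class names, tensor shapes and hyperparameters are assumptions made for illustration; the actual UTTF modules are multi-scale (token-, semantic- and frame-level) and embedded in full tracking pipelines.

    # Illustrative sketch only: module names, shapes and hyperparameters are
    # assumptions, not the paper's actual PSTTE/AFTD design.
    import torch
    import torch.nn as nn

    class TemporalEncoder(nn.Module):
        """Self-attention over tokens pooled from T historical frames."""
        def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

        def forward(self, hist: torch.Tensor) -> torch.Tensor:
            # hist: (B, T*N, C) -- tokens from all historical frames are
            # concatenated, so attention can form cross-frame associations.
            return self.encoder(hist)

    class TemporalDecoder(nn.Module):
        """Cross-attention: current-frame tokens query the historical memory."""
        def __init__(self, dim: int = 256, heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, cur: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
            # cur: (B, N, C) current-frame tokens; memory: (B, T*N, C).
            fused, _ = self.attn(query=cur, key=memory, value=memory)
            # Residual add: historical cues enhance, rather than replace,
            # the current frame's features.
            return self.norm(cur + fused)

    if __name__ == "__main__":
        B, T, N, C = 2, 4, 64, 256        # batch, history length, tokens, channels
        hist = torch.randn(B, T * N, C)   # features of T historical frames
        cur = torch.randn(B, N, C)        # features of the current frame
        memory = TemporalEncoder()(hist)  # mutually reinforce the history
        out = TemporalDecoder()(cur, memory)
        print(out.shape)                  # torch.Size([2, 64, 256])

The residual connection in the decoder mirrors the abstract's notion of enhancing current-frame features with informative temporal cues from historical frames rather than overwriting them.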

Bibliographic Details
Published in: IEEE Transactions on Intelligent Vehicles, 2024, pp. 1-15
Main Authors: Zhang, Tianlu; Jin, Ziniu; Debattista, Kurt; Zhang, Qiang; Han, Jungong
Format: Article
Language: English
Subjects: Object tracking; temporal information; Transformer
DOI: 10.1109/TIV.2024.3398405
ISSN: 2379-8858
EISSN: 2379-8904
Source: IEEE Electronic Library (IEL)
Online Access: Order full text