Higher Performance Visual Tracking with Dual-Modal Localization

Visual Object Tracking (VOT) has synchronous needs for both robustness and accuracy. While most existing works fail to operate simultaneously on both, we investigate in this work the problem of conflicting performance between accuracy and robustness. We first conduct a systematic comparison among ex...

Detailed description

Bibliographic details
Main authors:	Zhou, Jinghao; Li, Bo; Qiao, Lei; Wang, Peng; Gan, Weihao; Wu, Wei; Yan, Junjie; Ouyang, Wanli
Format:	Article
Language:	eng
Subjects:	Computer Science - Computer Vision and Pattern Recognition
Online access:	Order full text

creator	Zhou, Jinghao; Li, Bo; Qiao, Lei; Wang, Peng; Gan, Weihao; Wu, Wei; Yan, Junjie; Ouyang, Wanli
description	Visual Object Tracking (VOT) has synchronous needs for both robustness and accuracy. While most existing works fail to operate simultaneously on both, we investigate in this work the problem of conflicting performance between accuracy and robustness. We first conduct a systematic comparison among existing methods and analyze their restrictions in terms of accuracy and robustness. Specifically, four formulations are considered: offline classification (OFC), offline regression (OFR), online classification (ONC), and online regression (ONR), categorized by the existence of online update and the type of supervision signal. To account for the problem, we resort to the idea of ensembling and propose a dual-modal framework for target localization, consisting of robust localization, which suppresses distractors via ONR, and accurate localization, which attends precisely to the target center via OFC. To yield a final representation (i.e., a bounding box), we propose a simple but effective score voting strategy that involves adjacent predictions, so that the final representation does not commit to a single location. Operating beyond the real-time demand, our proposed method is further validated on 8 datasets (VOT2018, VOT2019, OTB2015, NFS, UAV123, LaSOT, TrackingNet, and GOT-10k), achieving state-of-the-art performance.
doi_str_mv	10.48550/arxiv.2103.10089
format	Article
creationdate	2021-03-18
rights	http://creativecommons.org/licenses/by/4.0
oa	free_for_read
sourcetype	Open Access Repository
collection	arXiv Computer Science; arXiv.org
linktorsrc	https://arxiv.org/abs/2103.10089
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2103.10089
language	eng
recordid	cdi_arxiv_primary_2103_10089
source	arXiv.org
subjects	Computer Science - Computer Vision and Pattern Recognition
title	Higher Performance Visual Tracking with Dual-Modal Localization
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T05%3A30%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Higher%20Performance%20Visual%20Tracking%20with%20Dual-Modal%20Localization&rft.au=Zhou,%20Jinghao&rft.date=2021-03-18&rft_id=info:doi/10.48550/arxiv.2103.10089&rft_dat=%3Carxiv_GOX%3E2103_10089%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true