CatTrack: Single-Stage Category-Level 6D Object Pose Tracking via Convolution and Vision Transformer

Much existing research has focused on instance-level pose tracking, which requires a 3D model of the object in advance, making it challenging to apply in practice. To address this limitation, some researchers have proposed category-level object pose tracking methods. Achieving...

Detailed description

Saved in:
Bibliographic details
Published in: IEEE transactions on multimedia 2024-01, Vol.26, p.1-16
Main authors: Yu, Sheng, Zhai, Di-Hua, Xia, Yuanqing, Li, Dong, Zhao, Shiqi
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page 16
container_issue
container_start_page 1
container_title IEEE transactions on multimedia
container_volume 26
creator Yu, Sheng
Zhai, Di-Hua
Xia, Yuanqing
Li, Dong
Zhao, Shiqi
description Much existing research has focused on instance-level pose tracking, which requires a 3D model of the object in advance, making it challenging to apply in practice. To address this limitation, some researchers have proposed category-level object pose tracking methods. Achieving accurate and fast monocular category-level pose tracking is an essential research goal. In this paper, we propose CatTrack, a new single-stage keypoint-based monocular category-level multi-object pose tracking network. A significant issue in object pose tracking is how to use information from the previous frame to guide pose estimation in the next frame. However, because the object poses and camera information differ from frame to frame, irrelevant information must be removed and useful features emphasized. To this end, we propose a transformer-based temporal information capture module that leverages the position information of keypoints from the previous frame. Furthermore, we propose a new keypoint matching module that enables the grouping and matching of object keypoints in complex scenes. We have applied CatTrack to the Objectron dataset and achieved superior results compared with existing methods. We have also evaluated the generalization of CatTrack and successfully applied it to track the 6D pose of unseen real-world objects. A video is available at https://youtu.be/Yminjdtsgwk .
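The keypoint grouping-and-matching problem the abstract describes can be illustrated with a deliberately simplified sketch. This is not the paper's learned matching module (which is transformer-based); it is a minimal greedy nearest-neighbour association between detected 2D keypoints in consecutive frames, and all names here are hypothetical:

```python
# Hypothetical sketch: greedy nearest-neighbour association of 2D keypoints
# between consecutive frames. CatTrack's actual keypoint matching module is
# learned; this only illustrates the frame-to-frame association problem.

from math import hypot

def match_keypoints(prev, curr, max_dist=50.0):
    """Associate each previous-frame keypoint with at most one
    current-frame keypoint, nearest pair first, within max_dist pixels.
    Returns a dict {prev_index: curr_index}."""
    # All candidate pairs, sorted by Euclidean distance.
    pairs = []
    for i, (px, py) in enumerate(prev):
        for j, (cx, cy) in enumerate(curr):
            pairs.append((hypot(cx - px, cy - py), i, j))
    pairs.sort()

    matches, used_prev, used_curr = {}, set(), set()
    for d, i, j in pairs:
        if d > max_dist:
            break  # remaining pairs are even farther apart
        if i in used_prev or j in used_curr:
            continue  # each keypoint may be matched at most once
        matches[i] = j
        used_prev.add(i)
        used_curr.add(j)
    return matches

prev = [(10.0, 10.0), (100.0, 100.0)]
curr = [(102.0, 101.0), (12.0, 9.0)]
print(match_keypoints(prev, curr))  # → {0: 1, 1: 0}
```

A learned module replaces the hand-set distance threshold and greedy rule with scores predicted from image features, which is what lets it cope with the complex scenes mentioned in the abstract; for an optimal (rather than greedy) assignment one would typically use the Hungarian algorithm.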
doi_str_mv 10.1109/TMM.2023.3284598
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 1520-9210
ispartof IEEE transactions on multimedia, 2024-01, Vol.26, p.1-16
issn 1520-9210
1941-0077
language eng
recordid cdi_proquest_journals_2918029130
source IEEE Electronic Library (IEL)
subjects Feature extraction
Matching
Modules
Object tracking
Pose estimation
pose tracking
Target tracking
Task analysis
Three dimensional models
Three-dimensional displays
Tracking networks
transformer
Transformers
title CatTrack: Single-Stage Category-Level 6D Object Pose Tracking via Convolution and Vision Transformer