Graph-DETR4D: Spatio-Temporal Graph Modeling for Multi-View 3D Object Detection

Multi-View 3D object detection (MV3D) has made tremendous progress by leveraging perspective features from surrounding cameras. Despite demonstrating promising prospects in various applications, accurately detecting objects in 3D space from camera views remains extremely difficult due to the ill-posed nature of monocular depth estimation. Recently, Graph-DETR3D presented a novel graph-based 3D-2D query paradigm that aggregates multi-view image features for 3D object detection and achieves competitive performance. Although it enriches the query representations with 2D image features through a learnable 3D graph, it still suffers from limited depth and velocity estimation abilities due to its single-frame input setting. To solve this problem, we introduce a unified spatio-temporal graph modeling framework to fully leverage multi-view imagery cues under a multi-frame input setting. Thanks to the flexibility and sparsity of the dynamic graph architecture, we lift the original 3D graph into 4D space with an effective attention mechanism that automatically perceives imagery information at both the spatial and temporal levels. Moreover, considering that the main latency bottleneck lies in the image backbone, we propose a novel dense-sparse distillation framework for multi-view 3D object detection that reduces the computational budget without sacrificing detection accuracy, making it more suitable for real-world deployment. To this end, we propose Graph-DETR4D, a faster and stronger multi-view 3D object detection framework built on top of Graph-DETR3D. Extensive experiments on the nuScenes and Waymo benchmarks demonstrate the effectiveness and efficiency of Graph-DETR4D. Notably, our best model achieves 62.0% NDS on the nuScenes test leaderboard. Code is available at https://github.com/zehuichen123/Graph-DETR4D .
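
The core mechanism the abstract describes, lifting the per-query 3D sampling graph into 4D by projecting its nodes into every camera view at every input frame and fusing the sampled features with attention, can be sketched in a few lines. The following is a minimal illustration under assumed tensor shapes, with hypothetical helper modules (offset_head, weight_head) and ego-motion alignment folded into the projection matrices for brevity; it is not the authors' implementation, which is available in the linked repository.

```python
# Minimal sketch of spatio-temporal (4D) graph feature aggregation.
# Shapes and helper heads are assumptions for illustration only.
import torch
import torch.nn.functional as F

def aggregate_4d_graph_features(
    query_embed,   # (Q, C)   object query embeddings
    ref_points,    # (Q, 3)   3D reference point per query
    feats,         # (T, V, C, H, W) image features: T frames, V cameras
    proj_mats,     # (T, V, 4, 4) ego -> image projection per frame and view
    offset_head,   # nn.Linear(C, K*3): predicts K learnable 3D offsets
    weight_head,   # nn.Linear(C, T*V*K): predicts attention logits
):
    Q, C = query_embed.shape
    T, V, _, H, W = feats.shape
    K = offset_head.out_features // 3

    # 1) Build the dynamic graph: K sampling nodes around each reference point.
    offsets = offset_head(query_embed).view(Q, K, 3)
    nodes = ref_points[:, None, :] + offsets                         # (Q, K, 3)

    # 2) Project graph nodes into every frame and view; the temporal axis is
    #    what lifts the 3D graph into 4D (frustum masking omitted for brevity).
    homog = torch.cat([nodes, torch.ones_like(nodes[..., :1])], -1)  # (Q, K, 4)
    cam = torch.einsum('tvij,qkj->tvqki', proj_mats, homog)          # (T,V,Q,K,4)
    uv = cam[..., :2] / cam[..., 2:3].clamp(min=1e-5)                # pixel coords

    # 3) Bilinearly sample image features at the projected locations.
    grid = uv.clone()
    grid[..., 0] = uv[..., 0] / (W - 1) * 2 - 1   # normalize to [-1, 1]
    grid[..., 1] = uv[..., 1] / (H - 1) * 2 - 1
    grid = grid.view(T * V, Q, K, 2)
    sampled = F.grid_sample(feats.view(T * V, C, H, W), grid,
                            align_corners=True)                  # (T*V, C, Q, K)
    sampled = sampled.permute(2, 0, 3, 1).reshape(Q, T * V * K, C)

    # 4) Attention over all (frame, view, node) samples, per query.
    attn = weight_head(query_embed).softmax(dim=-1)              # (Q, T*V*K)
    return (attn[..., None] * sampled).sum(dim=1)                # (Q, C)
```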

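The dense-sparse distillation component is described only at a high level here; the record specifies no losses or architecture. A generic feature-distillation pattern consistent with that description, where a frozen, heavier teacher backbone supervises a lightweight student through a 1x1 adapter, might look like the sketch below; the class and argument names are hypothetical.

```python
# Generic feature-distillation sketch for the dense-sparse distillation idea:
# a heavy "dense" teacher supervises a cheaper student backbone so the student
# keeps accuracy at a lower computational budget. Illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledBackbone(nn.Module):
    def __init__(self, teacher, student, c_teacher, c_student):
        super().__init__()
        self.teacher = teacher.eval()          # frozen, expensive backbone
        self.student = student                 # lightweight backbone to train
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        # 1x1 adapter maps student channels onto the teacher's feature space
        self.adapter = nn.Conv2d(c_student, c_teacher, kernel_size=1)

    def forward(self, images):
        with torch.no_grad():
            t_feat = self.teacher(images)      # (B, Ct, H, W)
        s_feat = self.student(images)          # (B, Cs, H, W)
        distill_loss = F.mse_loss(self.adapter(s_feat), t_feat)
        return s_feat, distill_loss            # added to the detection loss
```

At inference time only the student runs, which matches the abstract's claim that the latency bottleneck sits in the image backbone.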

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2024, Vol. 33, pp. 4488-4500
Authors: Chen, Zehui; Chen, Zheng; Li, Zhenyu; Zhang, Shiquan; Fang, Liangji; Jiang, Qinhong; Wu, Feng; Zhao, Feng
Format: Article
Language: English
Subjects: Accuracy; Cameras; dynamic graph; Feature extraction; Multi-view 3D object detection; Object detection; Solid modeling; spatio-temporal modeling; Three-dimensional displays; transformer; Transformers
Online Access: Order full text (see URL below)
DOI: 10.1109/TIP.2024.3430473
ISSN: 1057-7149
EISSN: 1941-0042
PMID: 39093681
Source: IEEE Electronic Library (IEL)
subjects Accuracy
Cameras
dynamic graph
Feature extraction
Multi-view 3D object detection
Object detection
Solid modeling
spatio-temporal modeling
Three-dimensional displays
transformer
Transformers
title Graph-DETR4D: Spatio-Temporal Graph Modeling for Multi-View 3D Object Detection
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T17%3A11%3A24IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Graph-DETR4D:%20Spatio-Temporal%20Graph%20Modeling%20for%20Multi-View%203D%20Object%20Detection&rft.jtitle=IEEE%20transactions%20on%20image%20processing&rft.au=Chen,%20Zehui&rft.date=2024&rft.volume=33&rft.spage=4488&rft.epage=4500&rft.pages=4488-4500&rft.issn=1057-7149&rft.eissn=1941-0042&rft.coden=IIPRE4&rft_id=info:doi/10.1109/TIP.2024.3430473&rft_dat=%3Cproquest_RIE%3E3087562563%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3087562563&rft_id=info:pmid/39093681&rft_ieee_id=10622019&rfr_iscdi=true