Graph-DETR4D: Spatio-Temporal Graph Modeling for Multi-View 3D Object Detection

Multi-View 3D object detection (MV3D) has made tremendous progress by leveraging perspective features from surrounding cameras. Despite demonstrating promising prospects in various applications, accurately detecting objects in 3D space from camera views remains extremely difficult due to the ill-posed nature of monocular depth estimation. Recently, Graph-DETR3D presented a novel graph-based 3D-2D query paradigm that aggregates multi-view image features for 3D object detection and achieves competitive performance. Although it enriches the query representations with 2D image features through a learnable 3D graph, it still suffers from limited depth and velocity estimation abilities due to its single-frame input setting. To solve this problem, we introduce a unified spatio-temporal graph modeling framework to fully leverage multi-view imagery cues under a multi-frame input setting. Thanks to the flexibility and sparsity of the dynamic graph architecture, we lift the original 3D graph into 4D space with an effective attention mechanism that automatically perceives imagery information at both the spatial and temporal levels. Moreover, considering that the main latency bottleneck lies in the image backbone, we propose a novel dense-sparse distillation framework for multi-view 3D object detection that reduces the computational budget without sacrificing detection accuracy, making it more suitable for real-world deployment. To this end, we propose Graph-DETR4D, a faster and stronger multi-view 3D object detection framework built on top of Graph-DETR3D. Extensive experiments on the nuScenes and Waymo benchmarks demonstrate the effectiveness and efficiency of Graph-DETR4D. Notably, our best model achieves 62.0% NDS on the nuScenes test leaderboard. Code is available at https://github.com/zehuichen123/Graph-DETR4D .
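
The core mechanism the abstract describes, lifting the per-query 3D sampling graph into 4D by projecting its nodes into every camera view at every input frame and fusing the sampled features with attention, can be sketched in a few lines. The following is a minimal illustration under assumed tensor shapes, with hypothetical helper modules (offset_head, weight_head) and ego-motion alignment folded into the projection matrices for brevity; it is not the authors' implementation, which is available in the linked repository.

```python
# Minimal sketch of spatio-temporal (4D) graph feature aggregation.
# Shapes and helper heads are assumptions for illustration only.
import torch
import torch.nn.functional as F

def aggregate_4d_graph_features(
    query_embed,   # (Q, C)   object query embeddings
    ref_points,    # (Q, 3)   3D reference point per query
    feats,         # (T, V, C, H, W) image features: T frames, V cameras
    proj_mats,     # (T, V, 4, 4) ego -> image projection per frame and view
    offset_head,   # nn.Linear(C, K*3): predicts K learnable 3D offsets
    weight_head,   # nn.Linear(C, T*V*K): predicts attention logits
):
    Q, C = query_embed.shape
    T, V, _, H, W = feats.shape
    K = offset_head.out_features // 3

    # 1) Build the dynamic graph: K sampling nodes around each reference point.
    offsets = offset_head(query_embed).view(Q, K, 3)
    nodes = ref_points[:, None, :] + offsets                         # (Q, K, 3)

    # 2) Project graph nodes into every frame and view; the temporal axis is
    #    what lifts the 3D graph into 4D (frustum masking omitted for brevity).
    homog = torch.cat([nodes, torch.ones_like(nodes[..., :1])], -1)  # (Q, K, 4)
    cam = torch.einsum('tvij,qkj->tvqki', proj_mats, homog)          # (T,V,Q,K,4)
    uv = cam[..., :2] / cam[..., 2:3].clamp(min=1e-5)                # pixel coords

    # 3) Bilinearly sample image features at the projected locations.
    grid = uv.clone()
    grid[..., 0] = uv[..., 0] / (W - 1) * 2 - 1   # normalize to [-1, 1]
    grid[..., 1] = uv[..., 1] / (H - 1) * 2 - 1
    grid = grid.view(T * V, Q, K, 2)
    sampled = F.grid_sample(feats.view(T * V, C, H, W), grid,
                            align_corners=True)                  # (T*V, C, Q, K)
    sampled = sampled.permute(2, 0, 3, 1).reshape(Q, T * V * K, C)

    # 4) Attention over all (frame, view, node) samples, per query.
    attn = weight_head(query_embed).softmax(dim=-1)              # (Q, T*V*K)
    return (attn[..., None] * sampled).sum(dim=1)                # (Q, C)
```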

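The dense-sparse distillation component is described only at a high level here; the record specifies no losses or architecture. A generic feature-distillation pattern consistent with that description, where a frozen, heavier teacher backbone supervises a lightweight student through a 1x1 adapter, might look like the sketch below; the class and argument names are hypothetical.

```python
# Generic feature-distillation sketch for the dense-sparse distillation idea:
# a heavy "dense" teacher supervises a cheaper student backbone so the student
# keeps accuracy at a lower computational budget. Illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledBackbone(nn.Module):
    def __init__(self, teacher, student, c_teacher, c_student):
        super().__init__()
        self.teacher = teacher.eval()          # frozen, expensive backbone
        self.student = student                 # lightweight backbone to train
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        # 1x1 adapter maps student channels onto the teacher's feature space
        self.adapter = nn.Conv2d(c_student, c_teacher, kernel_size=1)

    def forward(self, images):
        with torch.no_grad():
            t_feat = self.teacher(images)      # (B, Ct, H, W)
        s_feat = self.student(images)          # (B, Cs, H, W)
        distill_loss = F.mse_loss(self.adapter(s_feat), t_feat)
        return s_feat, distill_loss            # added to the detection loss
```

At inference time only the student runs, which matches the abstract's claim that the latency bottleneck sits in the image backbone.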

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2024, Vol. 33, pp. 4488-4500
Authors: Chen, Zehui; Chen, Zheng; Li, Zhenyu; Zhang, Shiquan; Fang, Liangji; Jiang, Qinhong; Wu, Feng; Zhao, Feng
Format: Article
Language: English
Subjects: Accuracy; Cameras; dynamic graph; Feature extraction; Multi-view 3D object detection; Object detection; Solid modeling; spatio-temporal modeling; Three-dimensional displays; transformer; Transformers
Online Access: Order full text (see URL below)
DOI: 10.1109/TIP.2024.3430473
ISSN: 1057-7149
EISSN: 1941-0042
PMID: 39093681
Source: IEEE Electronic Library (IEL)
subjects Accuracy
Cameras
dynamic graph
Feature extraction
Multi-view 3D object detection
Object detection
Solid modeling
spatio-temporal modeling
Three-dimensional displays
transformer
Transformers
title Graph-DETR4D: Spatio-Temporal Graph Modeling for Multi-View 3D Object Detection
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T17%3A11%3A24IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Graph-DETR4D:%20Spatio-Temporal%20Graph%20Modeling%20for%20Multi-View%203D%20Object%20Detection&rft.jtitle=IEEE%20transactions%20on%20image%20processing&rft.au=Chen,%20Zehui&rft.date=2024&rft.volume=33&rft.spage=4488&rft.epage=4500&rft.pages=4488-4500&rft.issn=1057-7149&rft.eissn=1941-0042&rft.coden=IIPRE4&rft_id=info:doi/10.1109/TIP.2024.3430473&rft_dat=%3Cproquest_RIE%3E3087562563%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3087562563&rft_id=info:pmid/39093681&rft_ieee_id=10622019&rfr_iscdi=true