RPCS v2.0: Object-detection-based recurrent point cloud selection method for 3D dense captioning

Bibliographic details
Published in: Neurocomputing (Amsterdam) 2024-04, Vol. 577, p. 127350, Article 127350
Main authors: Hayashi, Shinko; Zhang, Zhiqiang; Zhou, Jinjia
Format: Article
Language: English
Online access: Full text
Description
Summary: 3D dense captioning is the process of generating natural language descriptions for objects in a 3D scene, represented as RGB-D scans or point clouds. Three problems currently limit the potential performance of existing methods. First, existing methods randomly select points from the point cloud, leading to the exclusion of important points or the inclusion of low-value points for the detected objects. This, in turn, degrades the quality of the generated descriptions. Although our previously proposed method, Recurrent Point Cloud Selection (RPCS), mitigates this issue, it relies on an inaccurate termination criterion that causes unexpected interruptions of the loop. Second, existing methods utilize an older object detector, which limits their performance. Finally, the descriptions generated by existing methods describe only individual detected objects in the scene, which may be inconvenient for viewers from a visual perspective. To address these problems, we propose RPCS v2.0, which features several improvements over our original design. To maintain high-quality descriptions while avoiding the unexpected interruptions inherent to the original RPCS method, we propose a modified termination criterion that continuously compares the element counts of the bad list across iterations. To address the limitations of the older object detector, we implemented a newer object detector to further improve performance. Finally, to keep the visual output user-friendly, we leveraged a language model to summarize the generated descriptions, producing a comprehensive representation of the entire scene. Our proposed approach, referred to as ScanT3D, significantly mitigated the issue of suboptimal description generation and outperformed the state-of-the-art methods by a large margin (65.84% CIDEr@0.5IoU).
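The abstract only sketches the modified termination criterion, so the following is a minimal, self-contained Python illustration of the general idea under stated assumptions: objects whose generated descriptions score below a quality threshold form the "bad list", and the loop stops once that list's element count no longer shrinks between consecutive iterations. All names here (recurrent_selection, caption_fn, score_fn) are hypothetical stand-ins, not the paper's actual code.

    import random

    def recurrent_selection(objects, caption_fn, score_fn,
                            quality_threshold=0.5, max_iters=20):
        """Re-caption objects until the bad-list size stops decreasing."""
        captions = {}
        bad_list = list(objects)
        prev_bad_count = len(bad_list) + 1  # sentinel: larger than any real count
        for _ in range(max_iters):
            still_bad = []
            for obj in bad_list:
                cap = caption_fn(obj)        # re-select points, re-generate caption
                if score_fn(cap) >= quality_threshold:
                    captions[obj] = cap      # good description: keep it
                else:
                    still_bad.append(obj)    # poor description: retry next round
            # Modified termination criterion (per the abstract): compare the
            # bad list's element counts across iterations; stop when the list
            # is empty or has stopped shrinking.
            if not still_bad or len(still_bad) >= prev_bad_count:
                break
            prev_bad_count, bad_list = len(still_bad), still_bad
        return captions

    # Toy demo with a random "captioner" standing in for the real pipeline.
    random.seed(0)
    result = recurrent_selection(
        objects=["chair", "table", "lamp"],
        caption_fn=lambda obj: f"a {obj} ({random.random():.2f})",
        score_fn=lambda cap: float(cap.split("(")[1].rstrip(")")),
    )
    print(result)

Comparing counts rather than checking for an empty bad list means the loop also terminates when further resampling yields no improvement, which is presumably what avoids the unexpected interruptions attributed to the original RPCS criterion.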
ISSN: 0925-2312, 1872-8286
DOI: 10.1016/j.neucom.2024.127350