Multi-modal fusion architecture search for camera-based semantic scene completion


Detailed description

Bibliographic details
Published in: Expert Systems with Applications, 2024-06, Vol. 243, p. 122885, Article 122885
Main authors: Wang, Xuzhi; Feng, Wei; Wan, Liang
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Abstract: Camera-based semantic scene completion (SSC) aims to infer the 3D volumetric occupancy and semantic categories of a scene simultaneously from a single RGB image. The main challenge of camera-based SSC is the lack of geometric information compared with RGB-D SSC. Although depth estimated from the RGB image helps SSC to some extent, the quality of depth prediction falls far short of what SSC demands. To address this problem, we propose a NAS-based multi-modal fusion method that incorporates semantic and geometric information from intermediate representations (predicted depth and predicted 2D segmentation) to form a more robust 2D feature representation. A key idea of this design is that explicit 2D semantic information can alleviate the misleading effects of 3D distortions introduced by estimated depth. Specifically, we propose the Confidence-Block, which automatically learns an optimal architecture for routing and obtaining the depth prediction confidence. We propose a two-level fusion search space by decomposing it into a fusion-stage search space and a fusion-operation search space. Moreover, we propose a confidence-aware 2D–3D projection module to alleviate 3D projection error. Extensive experiments show that our method outperforms the state of the art by a large margin using a single RGB image on the NYU and NYUCAD datasets.
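The confidence-aware 2D–3D projection mentioned in the abstract can be illustrated with a minimal sketch: 2D features are lifted into a voxel grid via estimated per-pixel depth, with each pixel's contribution weighted by its depth confidence so that unreliable depth contributes less. This is a hand-written, hypothetical illustration assuming a pinhole camera model; the function name `confidence_aware_projection`, the intrinsics tuple `K`, and the weighted-averaging scheme are assumptions for exposition, not the paper's learned module.

```python
import numpy as np

def confidence_aware_projection(feat2d, depth, conf, K, grid_shape, voxel_size):
    """Lift 2D features into a 3D voxel grid, weighting each pixel's
    contribution by its depth-prediction confidence (illustrative sketch).

    feat2d: (C, H, W) 2D feature map
    depth:  (H, W) estimated depth per pixel
    conf:   (H, W) confidence in the depth estimate, in [0, 1]
    K:      pinhole intrinsics (fx, fy, cx, cy)  -- assumed model
    """
    C, H, W = feat2d.shape
    vol = np.zeros((C, *grid_shape), dtype=np.float32)
    weight = np.zeros(grid_shape, dtype=np.float32)
    fx, fy, cx, cy = K
    for v in range(H):
        for u in range(W):
            z = depth[v, u]
            # Back-project pixel (u, v) with depth z to camera coordinates.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            # Quantize to a voxel index (x/y recentred on the grid).
            i = int(x / voxel_size) + grid_shape[0] // 2
            j = int(y / voxel_size) + grid_shape[1] // 2
            k = int(z / voxel_size)
            if 0 <= i < grid_shape[0] and 0 <= j < grid_shape[1] and 0 <= k < grid_shape[2]:
                # Confidence-weighted accumulation: low-confidence depth
                # contributes proportionally less to the voxel feature.
                vol[:, i, j, k] += conf[v, u] * feat2d[:, v, u]
                weight[i, j, k] += conf[v, u]
    # Normalize occupied voxels by their accumulated confidence mass.
    vol /= np.maximum(weight, 1e-6)
    return vol
```

In the paper itself the routing and confidence estimation are learned via architecture search rather than fixed as above; the sketch only conveys why down-weighting unreliable depth reduces the 3D distortions that plain projection would introduce.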
ISSN:0957-4174
1873-6793
DOI:10.1016/j.eswa.2023.122885