Multi-modal fusion architecture search for camera-based semantic scene completion
Published in: Expert Systems with Applications, 2024-06, Vol. 243, p. 122885, Article 122885
Main authors:
Format: Article
Language: English
Online access: Full text
Abstract: Camera-based semantic scene completion (SSC) aims to infer the 3D volumetric occupancy and semantic categories of a scene simultaneously from a single RGB image. The main challenge of camera-based SSC is the lack of geometric information compared with RGB-D SSC. Although depth estimated from the RGB image helps SSC to some extent, its quality falls far short of what SSC demands. To solve this problem, we propose a NAS-based multi-modal fusion method that incorporates the semantic and geometric information from other intermediate representations (predicted depth and predicted 2D segmentation) to form a more robust 2D feature representation. A key idea behind this design is that explicit 2D semantic information can counteract the misleading 3D distortions introduced by the estimated depth. Specifically, we propose the Confidence-Block, which automatically learns an optimal architecture for routing and obtaining the depth-prediction confidence. We further propose a two-level fusion search space that decomposes the fusion search space into a fusion-stage search space and a fusion-operation search space. Moreover, we propose a confidence-aware 2D–3D projection module to alleviate the 3D projection error. Extensive experiments show that our method outperforms state-of-the-art methods by a large margin using a single RGB image on the NYU and NYUCAD datasets.
ISSN: 0957-4174; 1873-6793
DOI: 10.1016/j.eswa.2023.122885
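The fusion-operation search mentioned in the abstract can be pictured as a DARTS-style weighted mixture over candidate fusion operations between RGB features and an auxiliary modality (predicted depth or 2D segmentation features). The sketch below is an illustrative assumption, not the authors' implementation; the module name, the three candidate operations, and the channel handling are all hypothetical.

```python
# A minimal sketch (assumed, not the paper's code) of a fusion-operation search
# cell: candidate fusion ops are mixed with softmax-normalized architecture
# weights, DARTS-style, so the preferred op can be read off after the search.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionOpSearch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.concat_proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        # One learnable architecture weight per candidate fusion operation.
        self.alpha = nn.Parameter(torch.zeros(3))

    def forward(self, rgb_feat, aux_feat):
        # Candidate fusion operations between RGB features and an auxiliary
        # modality (e.g. predicted depth or 2D segmentation features).
        cat = torch.cat([rgb_feat, aux_feat], dim=1)
        candidates = [
            rgb_feat + aux_feat,            # element-wise addition
            self.concat_proj(cat),          # concatenation + 1x1 projection
            rgb_feat * self.gate(cat),      # gated (confidence-like) fusion
        ]
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * c for wi, c in zip(w, candidates))
```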
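Likewise, the confidence-aware 2D–3D projection idea can be illustrated by weighting back-projected pixel features with a per-pixel depth-confidence map before scattering them into a voxel grid, so pixels with unreliable depth contribute less to the 3D volume. This is only a sketch under assumed interfaces (tensor shapes, camera intrinsics, voxel-grid parameters); it is not the paper's module.

```python
# A minimal sketch (assumed interfaces) of confidence-aware 2D-to-3D projection:
# lift 2D features along the predicted depth and weight them by depth confidence.
import torch

def confidence_aware_projection(feat2d, depth, conf, K, vox_origin, vox_size, grid_dims):
    """
    feat2d:     (C, H, W) image features
    depth:      (H, W) predicted depth in meters
    conf:       (H, W) predicted depth confidence in [0, 1]
    K:          (3, 3) camera intrinsics
    vox_origin: (3,) tensor, world coordinates of the voxel-grid origin
    vox_size:   voxel edge length in meters
    grid_dims:  (X, Y, Z) voxel-grid resolution
    Returns a (C, X, Y, Z) voxel feature volume.
    """
    C, H, W = feat2d.shape
    X, Y, Z = grid_dims
    device = feat2d.device

    # Back-project every pixel to a 3D point using the predicted depth.
    v, u = torch.meshgrid(torch.arange(H, device=device),
                          torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(u, dtype=torch.float32)
    pix = torch.stack([u.float(), v.float(), ones], dim=0).reshape(3, -1)   # (3, H*W)
    cam = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)                  # (3, H*W)

    # Convert camera-space points to voxel indices and drop out-of-bounds points.
    idx = ((cam - vox_origin.reshape(3, 1)) / vox_size).long()
    bounds = torch.tensor(grid_dims, device=device).reshape(3, 1)
    valid = ((idx >= 0) & (idx < bounds)).all(dim=0)
    idx, w = idx[:, valid], conf.reshape(-1)[valid]
    f = feat2d.reshape(C, -1)[:, valid]

    # Scatter confidence-weighted features into the voxel volume and normalize,
    # so low-confidence depth predictions have little influence on each voxel.
    flat = idx[0] * (Y * Z) + idx[1] * Z + idx[2]
    vol = torch.zeros(C, X * Y * Z, device=device)
    wsum = torch.zeros(X * Y * Z, device=device)
    vol.index_add_(1, flat, f * w.unsqueeze(0))
    wsum.index_add_(0, flat, w)
    return (vol / wsum.clamp(min=1e-6)).reshape(C, X, Y, Z)
```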