Learning accurate monocular 3D voxel representation via bilateral voxel transformer

Vision-based methods for 3D scene perception have been widely explored for autonomous vehicles. However, inferring complete 3D semantic scenes from monocular 2D images is still challenging owing to the 2D-to-3D transformation. Specifically, existing methods that use Inverse Perspective Mapping (IPM)...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Image and vision computing 2024-10, Vol.150, p.105237, Article 105237
Hauptverfasser:	Cheng, Tianheng, Jiang, Haoyi, Chen, Shaoyu, Liao, Bencheng, Zhang, Qian, Liu, Wenyu, Wang, Xinggang
Format:	Artikel
Sprache:	eng
Schlagworte:	3D semantic scene completion Autonomous driving Occupancy networks Scene understanding Voxel transformers
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Vision-based methods for 3D scene perception have been widely explored for autonomous vehicles. However, inferring complete 3D semantic scenes from monocular 2D images is still challenging owing to the 2D-to-3D transformation. Specifically, existing methods that use Inverse Perspective Mapping (IPM) to project image features to dense 3D voxels severely suffer from the ambiguous projection problem. In this research, we present Bilateral Voxel Transformer (BVT), a novel and effective Transformer-based approach for monocular 3D semantic scene completion. BVT exploits a bilateral architecture composed of two branches for preserving the high-resolution 3D voxel representation while aggregating contexts through the proposed Tri-Axial Transformer simultaneously. To alleviate the ill-posed 2D-to-3D transformation, we adopt position-aware voxel queries and dynamically update the voxels with image features through weighted geometry-aware sampling. BVT achieves 11.8 mIoU on the challenging Semantic KITTI dataset, considerably outperforming previous works for semantic scene completion with monocular images. The code and models of BVT will be available on GitHub. •This paper explores the issues of ambiguous projection in monocular semantic scene completion.•This paper presents the Bilateral Voxel Transformer to complete scenes with semantic contexts.•The proposed Bilateral Voxel Transformer obtains superior performance on 3D semantic completion for driving scenes.
ISSN:	0262-8856
DOI:	10.1016/j.imavis.2024.105237