3D multi-modal pre-training method and system based on relation perception


Detailed Description

Bibliographic Details
Main Authors: ZHANG MIN, XIE XUNWEI, LI WENXIAO, LUO HAN, TANG JUNHAO, LEI YINJIE
Format: Patent
Language: Chinese; English
Description
Abstract: The invention discloses a 3D multi-modal pre-training method and system based on relation perception, in the field of 3D multi-modality. The method comprises the following steps: multi-view images are converted into a virtual feature point cloud through a 2D pre-trained captioning model and a CLIP text encoder; point cloud features are extracted from the 3D point cloud scene through a 3D point cloud backbone network and aligned with the virtual feature point cloud; a plurality of candidate target features is obtained from the point cloud features, and regression is performed to obtain target bounding boxes; a 2D mask generator extracts masks of objects in the multi-view images together with mask features, and the multi-view mask features are aligned with the candidate target features; language description features of the target object are extracted and fused with the candidate target features, and scores of the candidate objects are computed …
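The final step described above, fusing a language description feature with candidate target features and scoring the candidates, can be sketched in miniature. This is an illustrative toy, not the patented method: the abstract does not specify the fusion or scoring function, so cosine similarity is assumed here as a stand-in, and all names (`score_candidates`, `ground_target`) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def score_candidates(lang_feat, candidate_feats):
    """Score each candidate target feature against the language
    description feature (assumed scoring: cosine similarity)."""
    return [cosine(lang_feat, c) for c in candidate_feats]

def ground_target(lang_feat, candidate_feats):
    """Return the index of the best-scoring candidate, i.e. the
    candidate bounding box selected as the grounding result."""
    scores = score_candidates(lang_feat, candidate_feats)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: one language feature, three candidate target features.
lang = [1.0, 0.0, 0.5]
cands = [
    [0.0, 1.0, 0.0],  # poor match
    [0.9, 0.1, 0.4],  # close match
    [0.5, 0.5, 0.5],  # partial match
]
best = ground_target(lang, cands)  # index of the best-matching candidate
```

In the actual system the features would come from learned encoders (e.g. the CLIP text encoder and the 3D backbone) and the fusion would be a trained module rather than a fixed similarity, but the selection logic, score every candidate against the description and keep the best, is the same shape.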