3D multi-modal pre-training method and system based on relation perception
Format: Patent
Language: Chinese; English
Online access: Order full text
Abstract: The invention discloses a 3D multi-modal pre-training method and system based on relation perception, in the field of 3D multi-modality. The method comprises the following steps: a multi-view image is converted into a virtual feature point cloud through a 2D pre-trained image captioning model and a CLIP text encoder; point cloud features are extracted from the 3D point cloud scene through a 3D point cloud backbone network and aligned with the virtual feature point cloud; a plurality of candidate target features are obtained from the point cloud features and regressed to obtain target bounding boxes; a 2D mask generator extracts masks of the objects in the multi-view images together with their mask features, and the multi-view mask features are aligned with the candidate target features; language description features of the target object are extracted and fused with the candidate target features to calculate scores of the candidate objects.
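The alignment between the 3D backbone's point features and the virtual feature point cloud is the central pre-training signal described in the abstract. Below is a minimal sketch of one way such an alignment could look, assuming an InfoNCE-style contrastive objective and matching feature dimensions; the function name, temperature value, and loss form are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch of the 3D-to-2D feature alignment step: point features
# from a 3D backbone are pulled toward "virtual" per-point features lifted
# from multi-view CLIP embeddings. Names and dimensions are assumptions.
import torch
import torch.nn.functional as F

def align_point_features(point_feats: torch.Tensor,
                         virtual_feats: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style alignment between N 3D point features and the
    N virtual CLIP features at the same 3D locations."""
    p = F.normalize(point_feats, dim=-1)    # (N, D) from the 3D backbone
    v = F.normalize(virtual_feats, dim=-1)  # (N, D) lifted from 2D views
    logits = p @ v.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # Symmetric contrastive loss: each point matches its own virtual feature.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: 512 points with 768-d features (a common CLIP embedding width).
loss = align_point_features(torch.randn(512, 768), torch.randn(512, 768))
```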
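The final grounding step fuses language description features with the candidate target features and scores each candidate. A minimal sketch under stated assumptions (the module name, dimensions, cross-attention fusion, and linear scoring head are all assumptions rather than the patent's disclosed design):

```python
# Hypothetical sketch of the scoring step: candidate target features attend
# to encoded language tokens, then a linear head emits one score per
# candidate. Shapes and the fusion mechanism are illustrative assumptions.
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-attention: candidates (queries) attend to language tokens.
        self.fuse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # per-candidate matching score

    def forward(self, cand_feats: torch.Tensor,
                lang_feats: torch.Tensor) -> torch.Tensor:
        # cand_feats: (B, K, D) candidate target features
        # lang_feats: (B, T, D) language description token features
        fused, _ = self.fuse(cand_feats, lang_feats, lang_feats)
        return self.score(fused).squeeze(-1)  # (B, K) candidate scores

# Toy usage: 2 scenes, 32 candidate boxes, 20 language tokens.
scores = CandidateScorer()(torch.randn(2, 32, 256), torch.randn(2, 20, 256))
```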