3D multi-modal pre-training method and system based on relation perception


Detailed Description

Bibliographic Details
Main Authors: ZHANG MIN, XIE XUNWEI, LI WENXIAO, LUO HAN, TANG JUNHAO, LEI YINJIE
Format: Patent
Language: Chinese; English
Description
Abstract: The invention discloses a 3D multi-modal pre-training method and system based on relation perception, in the field of 3D multi-modality. The method comprises the following steps: multi-view images are converted into a virtual feature point cloud through a 2D pre-trained captioning model and a CLIP text encoder; point cloud features are extracted from the 3D point cloud scene through a 3D point cloud backbone network and aligned with the virtual feature point cloud; a plurality of candidate target features is obtained from the point cloud features, and regression is performed to obtain target bounding boxes; a 2D mask generator extracts masks of objects in the multi-view images together with mask features, and the multi-view mask features are aligned with the candidate target features; language description features of the target object are extracted and fused with the candidate target features, and scores of the candidate objects are computed …
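The final step described above, fusing a language description feature with candidate target features and scoring the candidates, can be sketched in miniature. This is an illustrative toy, not the patented method: the abstract does not specify the fusion or scoring function, so cosine similarity is assumed here as a stand-in, and all names (`score_candidates`, `ground_target`) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def score_candidates(lang_feat, candidate_feats):
    """Score each candidate target feature against the language
    description feature (assumed scoring: cosine similarity)."""
    return [cosine(lang_feat, c) for c in candidate_feats]

def ground_target(lang_feat, candidate_feats):
    """Return the index of the best-scoring candidate, i.e. the
    candidate bounding box selected as the grounding result."""
    scores = score_candidates(lang_feat, candidate_feats)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: one language feature, three candidate target features.
lang = [1.0, 0.0, 0.5]
cands = [
    [0.0, 1.0, 0.0],  # poor match
    [0.9, 0.1, 0.4],  # close match
    [0.5, 0.5, 0.5],  # partial match
]
best = ground_target(lang, cands)  # index of the best-matching candidate
```

In the actual system the features would come from learned encoders (e.g. the CLIP text encoder and the 3D backbone) and the fusion would be a trained module rather than a fixed similarity, but the selection logic, score every candidate against the description and keep the best, is the same shape.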