Vision and language-based multi-modal mixed fusion fine-grained recognition method


Bibliographic Details
Main Authors:	CHEN YI, ZHU BIN, XIE BO, WANG RUNHUA, ZOU RONGPING, XIA ANNING, YANG HUA
Format:	Patent
Language:	Chinese; English
Subjects:	CALCULATING; COMPUTING; COUNTING; ELECTRIC DIGITAL DATA PROCESSING; PHYSICS

Description
Summary:	The invention provides a vision and language-based multi-modal mixed fusion fine-grained recognition method and belongs to the technical field of deep learning. The method comprises the following steps: a feature extraction module extracts visual features from the visual modality and language features from the language modality; the visual features are fed to a visual-modality classifier to obtain a visual-modality classification result, and the language features are fed to a language-modality classifier to obtain a language-modality classification result; a feature fusion module generates joint features from the visual features and the language features, the joint features are fed to a multi-head self-attention layer and then passed through a fully connected layer to obtain a feature fusion result, and the classification confidence of the feature fusion result is calculated; and a result fusion module is utilized to determine weights for the classification confide
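
The summary above is cut off before it states how the result fusion module sets its weights, so the following is only an illustrative sketch of the final step, not the patented procedure. It assumes three branches (visual classifier, language classifier, feature fusion branch) that each emit class logits, takes each branch's classification confidence to be its highest softmax probability, and normalizes those confidences into fusion weights. The function name `fuse_results`, the weighting rule, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_results(branch_logits, weights=None):
    """Weighted late fusion of per-branch class probabilities.

    ASSUMPTION: when no weights are given, each branch is weighted by its own
    top-class probability, normalized so the weights sum to 1. The source
    text does not specify the actual weighting rule.
    """
    probs = [softmax(logits) for logits in branch_logits]
    if weights is None:
        confidences = [max(p) for p in probs]
        total = sum(confidences)
        weights = [c / total for c in confidences]
    n_classes = len(probs[0])
    fused = [
        sum(w * p[k] for w, p in zip(weights, probs))
        for k in range(n_classes)
    ]
    return fused, weights


if __name__ == "__main__":
    # Hypothetical logits for a 3-class problem, one list per branch.
    visual = [2.0, 0.5, 0.1]
    language = [1.0, 1.2, 0.3]
    feature_fusion = [2.5, 0.2, 0.1]
    fused, weights = fuse_results([visual, language, feature_fusion])
    print("weights:", [round(w, 3) for w in weights])
    print("predicted class:", fused.index(max(fused)))
```

Because the weights are normalized and each branch output is a probability distribution, the fused scores also sum to 1. Passing an explicit `weights` list replaces the assumed confidence-based rule with fixed or learned weights.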