Vision and language-based multi-modal mixed fusion fine-grained recognition method


Bibliographic details
Main authors: CHEN YI, ZHU BIN, XIE BO, WANG RUNHUA, ZOU RONGPING, XIA ANNING, YANG HUA
Format: Patent
Language: Chinese; English
Description
Summary: The invention provides a vision and language-based multi-modal mixed fusion fine-grained recognition method, belonging to the technical field of deep learning. The method comprises the following steps: a feature extraction module extracts visual features from the visual modality and language features from the language modality; the visual features are fed to a visual-modality classifier to obtain a visual-modality classification result, and the language features are fed to a language-modality classifier to obtain a language-modality classification result; a feature fusion module generates joint features from the visual features and the language features, the joint features are fed to a multi-head self-attention layer and then passed through a fully connected layer to obtain a feature fusion result, and the classification confidence of the feature fusion result is calculated; and a result fusion module determines weights for the classification confide
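The data flow in the summary can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the patented implementation: the classifiers are taken to be single linear layers, the joint features are formed by stacking the two feature vectors as a two-token sequence, and, because the summary is cut off before the result fusion rule is given, the weights are assumed to be each branch's maximum class probability normalised to sum to one. All names (`init_params`, `recognize`, the keys of `params`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over tokens of shape (n, d)."""
    n, d = tokens.shape
    d_head = d // num_heads

    def split(x):
        # (n, d) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    heads = softmax(scores, axis=-1) @ v                 # (heads, n, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d)      # concatenate heads
    return merged @ w_o


def init_params(d, num_classes, seed=0):
    """Random stand-ins for trained weights (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    shapes = {
        "w_vis": (d, num_classes), "w_lang": (d, num_classes),
        "w_q": (d, d), "w_k": (d, d), "w_v": (d, d), "w_o": (d, d),
        "w_fc": (2 * d, num_classes),
    }
    return {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}


def recognize(visual_feat, language_feat, params, num_heads=2):
    # Per-modality classifiers (assumed linear + softmax).
    p_visual = softmax(visual_feat @ params["w_vis"])
    p_language = softmax(language_feat @ params["w_lang"])

    # Feature fusion: joint features -> multi-head self-attention -> FC layer.
    joint = np.stack([visual_feat, language_feat])       # (2, d)
    attended = multi_head_self_attention(
        joint, params["w_q"], params["w_k"], params["w_v"], params["w_o"],
        num_heads,
    )
    p_fusion = softmax(attended.reshape(-1) @ params["w_fc"])

    # Result fusion (ASSUMPTION): weight each branch by its confidence,
    # taken here as the maximum class probability.
    probs = np.stack([p_visual, p_language, p_fusion])   # (3, num_classes)
    confidence = probs.max(axis=1)
    weights = confidence / confidence.sum()
    return weights @ probs, weights
```

A call such as `recognize(v, t, init_params(d=8, num_classes=5))` with two length-8 feature vectors returns a fused probability vector over the five classes together with the three branch weights; `np.argmax` of the fused vector gives the predicted class.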