Vision and language-based multi-modal mixed fusion fine-grained recognition method
Main authors: | , , , , , , |
---|---|
Format: | Patent |
Language: | chi ; eng |
Subjects: | |
Abstract: | The invention provides a vision- and language-based multi-modal mixed-fusion fine-grained recognition method and belongs to the technical field of deep learning. The method comprises the following steps: a feature extraction module extracts visual features from the visual modality and language features from the language modality; the visual features are fed to a visual-modality classifier to obtain a visual-modality classification result, and the language features are fed to a language-modality classifier to obtain a language-modality classification result; a feature fusion module generates joint features from the visual features and the language features, the joint features are fed to a multi-head self-attention layer, a feature fusion result is obtained after the output passes through a fully connected layer, and the classification confidence of the feature fusion result is calculated; and a result fusion module is utilized to determine weights for the classification confide |
---|
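
The pipeline outlined in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the patented implementation: the record does not say how the joint features are formed or how the result-fusion weights are determined (the abstract is cut off at that point), so stacking the two feature vectors as two tokens and passing fixed, caller-supplied `weights` are assumptions. The names `init_params`, `recognize`, and the random, untrained parameters are likewise hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over tokens of shape (n, d_model)."""
    n, d_model = tokens.shape
    d_head = d_model // num_heads

    def split(x):
        # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    attn = softmax(scores, axis=-1)
    merged = (attn @ v).transpose(1, 0, 2).reshape(n, d_model)
    return merged @ w_o


def init_params(d_model, num_classes, seed=0):
    """Random (untrained) parameters, for illustration only."""
    rng = np.random.default_rng(seed)

    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))

    return {
        "visual_cls": mat(d_model, num_classes),
        "language_cls": mat(d_model, num_classes),
        "w_q": mat(d_model, d_model),
        "w_k": mat(d_model, d_model),
        "w_v": mat(d_model, d_model),
        "w_o": mat(d_model, d_model),
        "fc": mat(2 * d_model, num_classes),
    }


def recognize(visual_feat, language_feat, params, num_heads=2,
              weights=(1.0, 1.0, 1.0)):
    """Return fused class probabilities and the three per-branch confidences."""
    # Per-modality classifiers (assumed here to be linear + softmax).
    p_visual = softmax(visual_feat @ params["visual_cls"])
    p_language = softmax(language_feat @ params["language_cls"])

    # Feature fusion: joint features -> multi-head self-attention -> FC layer.
    # Assumption: the joint features are the two vectors stacked as two tokens.
    joint = np.stack([visual_feat, language_feat])
    attended = multi_head_self_attention(
        joint, params["w_q"], params["w_k"], params["w_v"], params["w_o"],
        num_heads)
    p_fused = softmax(attended.reshape(-1) @ params["fc"])

    # Result fusion: weighted sum of the three confidences.
    # Assumption: fixed weights; the source does not state how they are set.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    p_final = w[0] * p_visual + w[1] * p_language + w[2] * p_fused
    return p_final, (p_visual, p_language, p_fused)


rng = np.random.default_rng(1)
params = init_params(d_model=8, num_classes=5)
probs, _ = recognize(rng.normal(size=8), rng.normal(size=8), params)
print("predicted class:", int(probs.argmax()))
```

Because the weights are normalized and each branch outputs a probability distribution, the fused result is itself a distribution over classes. In practice the feature vectors would come from trained visual and language encoders rather than from random draws.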