Multi-modal fusion user intention recognition method in complex human-computer interaction scene

Bibliographic Details
Inventors: MA TINGHUAI, JIA LI, HUANG XUEJIAN
Format: Patent
Language: Chinese; English
Description
Abstract: The invention discloses a multi-modal fusion user intention recognition method for complex human-computer interaction scenes. The method comprises the following steps:

1. Obtain speech and video, and convert the speech into text through a speech recognition module.
2. Extract text, speech, and visual features with the pre-trained models BERT, Wav2vec 2.0, and Faster R-CNN respectively, and preprocess the features with a Transformer.
3. Construct a modality-specific encoder and a modality-shared encoder, and perform multi-modal collaborative representation learning on the text, speech, and video features.
4. Because each modality may exhibit a different level of noise at different moments in a complex scene, adaptively fuse the multi-modal collaborative representations using an attention mechanism and a gated neural network.
5. Input the fused features into a fully connected neural network to obtain the user intention recognition result.

Illustrative code sketches of these steps follow.
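The abstract does not specify the speech recognition module in step 1. As a hedged assumption only, the sketch below stands in a public Wav2vec 2.0 CTC checkpoint (facebook/wav2vec2-base-960h) from HuggingFace transformers; the patent's actual module may differ.

```python
# Step 1 sketch: speech-to-text. The specific ASR model is an assumption,
# not taken from the patent.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

@torch.no_grad()
def speech_to_text(waveform):
    """waveform: (samples,) float tensor sampled at 16 kHz."""
    inputs = processor(waveform.numpy(), sampling_rate=16000,
                       return_tensors="pt")
    logits = asr(inputs.input_values).logits      # (1, frames, vocab)
    ids = torch.argmax(logits, dim=-1)            # greedy CTC decoding
    return processor.batch_decode(ids)[0]
```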
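Step 2 names the three pre-trained models directly. The following sketch loads public checkpoints for them, assuming PyTorch, HuggingFace transformers, and torchvision; the checkpoint names, tensor shapes, and the use of the detector's box outputs (rather than internal region features) are illustrative assumptions.

```python
# Step 2 sketch: per-modality feature extraction with BERT, Wav2vec 2.0,
# and Faster R-CNN, as named in the abstract.
import torch
from transformers import BertTokenizer, BertModel
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Text branch: BERT token embeddings.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

# Audio branch: Wav2vec 2.0 frame-level features.
wav_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

# Visual branch: Faster R-CNN detections per video frame.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def extract_features(text, waveform, frames):
    """text: str; waveform: (samples,) 16 kHz float tensor;
    frames: list of (3, H, W) image tensors scaled to [0, 1]."""
    tok = tokenizer(text, return_tensors="pt")
    text_feat = bert(**tok).last_hidden_state              # (1, T_text, 768)

    audio_in = wav_fe(waveform.numpy(), sampling_rate=16000,
                      return_tensors="pt")
    audio_feat = wav2vec(audio_in.input_values).last_hidden_state  # (1, T_audio, 768)

    detections = detector(frames)  # list of {"boxes", "labels", "scores"}
    return text_feat, audio_feat, detections
```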
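For step 3, the abstract states only that a modality-specific encoder and a modality-shared encoder are built for collaborative representation learning. A minimal sketch, with layer sizes, dimensions, and encoder depth as assumptions:

```python
# Step 3 sketch: one private encoder per modality plus one shared encoder
# applied to all modalities after projection into a common space.
import torch
import torch.nn as nn

class CollaborativeEncoders(nn.Module):
    def __init__(self, dims=None, hidden=256):
        super().__init__()
        # Assumed pooled feature dimensions per modality.
        dims = dims or {"text": 768, "audio": 768, "video": 1024}
        # Private encoders capture modality-specific cues.
        self.specific = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
            for m, d in dims.items()})
        # Per-modality projections feed a single shared encoder that
        # captures modality-invariant structure.
        self.project = nn.ModuleDict({
            m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.shared = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, feats):
        # feats: dict of (batch, dim) pooled features per modality.
        specific = {m: self.specific[m](x) for m, x in feats.items()}
        shared = {m: self.shared(self.project[m](x))
                  for m, x in feats.items()}
        return specific, shared
```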
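Steps 4 and 5 combine an attention mechanism with a gated network so noisier modalities can be down-weighted moment by moment, then classify with a fully connected network. The exact attention and gating form below is an assumption based only on the abstract's wording:

```python
# Steps 4-5 sketch: attention weights one scalar per modality, a sigmoid
# gate filters each representation element-wise, and the weighted sum is
# classified by a fully connected network.
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    def __init__(self, hidden=256, num_intents=10):
        super().__init__()
        self.attn = nn.Linear(hidden, 1)       # per-modality reliability score
        self.gate = nn.Linear(hidden, hidden)  # element-wise noise filter
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_intents))

    def forward(self, reps):
        # reps: (batch, num_modalities, hidden) collaborative representations.
        weights = torch.softmax(self.attn(reps), dim=1)  # (B, M, 1)
        gated = torch.sigmoid(self.gate(reps)) * reps    # (B, M, H)
        fused = (weights * gated).sum(dim=1)             # (B, H)
        return self.classifier(fused)                    # intention logits

# Usage with the encoders above (shapes illustrative):
# reps = torch.stack([shared["text"], shared["audio"], shared["video"]], dim=1)
# logits = GatedAttentionFusion()(reps)
```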