Multi-modal fusion user intention recognition method in complex human-computer interaction scene

Bibliographic Details
Inventors: MA TINGHUAI, JIA LI, HUANG XUEJIAN
Format: Patent
Language: Chinese; English
Description
Abstract: The invention discloses a multi-modal fusion user intention recognition method for complex human-computer interaction scenes. The method comprises the following steps:

1. Obtain speech and video, and convert the speech into text through a speech recognition module.
2. Extract text, speech, and visual features with the pre-trained models BERT, Wav2vec 2.0, and Faster R-CNN respectively, and preprocess the features with a Transformer.
3. Construct a modality-specific encoder and a modality-shared encoder, and perform multi-modal collaborative representation learning on the text, speech, and video features.
4. Because each modality may exhibit a different level of noise at different moments in a complex scene, adaptively fuse the multi-modal collaborative representations using an attention mechanism and a gated neural network.
5. Input the fused features into a fully connected neural network to obtain the user intention recognition result.

Illustrative code sketches of these steps follow.
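The abstract does not specify the speech recognition module in step 1. As a hedged assumption only, the sketch below stands in a public Wav2vec 2.0 CTC checkpoint (facebook/wav2vec2-base-960h) from HuggingFace transformers; the patent's actual module may differ.

```python
# Step 1 sketch: speech-to-text. The specific ASR model is an assumption,
# not taken from the patent.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

@torch.no_grad()
def speech_to_text(waveform):
    """waveform: (samples,) float tensor sampled at 16 kHz."""
    inputs = processor(waveform.numpy(), sampling_rate=16000,
                       return_tensors="pt")
    logits = asr(inputs.input_values).logits      # (1, frames, vocab)
    ids = torch.argmax(logits, dim=-1)            # greedy CTC decoding
    return processor.batch_decode(ids)[0]
```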
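Step 2 names the three pre-trained models directly. The following sketch loads public checkpoints for them, assuming PyTorch, HuggingFace transformers, and torchvision; the checkpoint names, tensor shapes, and the use of the detector's box outputs (rather than internal region features) are illustrative assumptions.

```python
# Step 2 sketch: per-modality feature extraction with BERT, Wav2vec 2.0,
# and Faster R-CNN, as named in the abstract.
import torch
from transformers import BertTokenizer, BertModel
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Text branch: BERT token embeddings.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

# Audio branch: Wav2vec 2.0 frame-level features.
wav_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

# Visual branch: Faster R-CNN detections per video frame.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def extract_features(text, waveform, frames):
    """text: str; waveform: (samples,) 16 kHz float tensor;
    frames: list of (3, H, W) image tensors scaled to [0, 1]."""
    tok = tokenizer(text, return_tensors="pt")
    text_feat = bert(**tok).last_hidden_state              # (1, T_text, 768)

    audio_in = wav_fe(waveform.numpy(), sampling_rate=16000,
                      return_tensors="pt")
    audio_feat = wav2vec(audio_in.input_values).last_hidden_state  # (1, T_audio, 768)

    detections = detector(frames)  # list of {"boxes", "labels", "scores"}
    return text_feat, audio_feat, detections
```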
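For step 3, the abstract states only that a modality-specific encoder and a modality-shared encoder are built for collaborative representation learning. A minimal sketch, with layer sizes, dimensions, and encoder depth as assumptions:

```python
# Step 3 sketch: one private encoder per modality plus one shared encoder
# applied to all modalities after projection into a common space.
import torch
import torch.nn as nn

class CollaborativeEncoders(nn.Module):
    def __init__(self, dims=None, hidden=256):
        super().__init__()
        # Assumed pooled feature dimensions per modality.
        dims = dims or {"text": 768, "audio": 768, "video": 1024}
        # Private encoders capture modality-specific cues.
        self.specific = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
            for m, d in dims.items()})
        # Per-modality projections feed a single shared encoder that
        # captures modality-invariant structure.
        self.project = nn.ModuleDict({
            m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.shared = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, feats):
        # feats: dict of (batch, dim) pooled features per modality.
        specific = {m: self.specific[m](x) for m, x in feats.items()}
        shared = {m: self.shared(self.project[m](x))
                  for m, x in feats.items()}
        return specific, shared
```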
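Steps 4 and 5 combine an attention mechanism with a gated network so noisier modalities can be down-weighted moment by moment, then classify with a fully connected network. The exact attention and gating form below is an assumption based only on the abstract's wording:

```python
# Steps 4-5 sketch: attention weights one scalar per modality, a sigmoid
# gate filters each representation element-wise, and the weighted sum is
# classified by a fully connected network.
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    def __init__(self, hidden=256, num_intents=10):
        super().__init__()
        self.attn = nn.Linear(hidden, 1)       # per-modality reliability score
        self.gate = nn.Linear(hidden, hidden)  # element-wise noise filter
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_intents))

    def forward(self, reps):
        # reps: (batch, num_modalities, hidden) collaborative representations.
        weights = torch.softmax(self.attn(reps), dim=1)  # (B, M, 1)
        gated = torch.sigmoid(self.gate(reps)) * reps    # (B, M, H)
        fused = (weights * gated).sum(dim=1)             # (B, H)
        return self.classifier(fused)                    # intention logits

# Usage with the encoders above (shapes illustrative):
# reps = torch.stack([shared["text"], shared["audio"], shared["video"]], dim=1)
# logits = GatedAttentionFusion()(reps)
```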