Multi-modal fusion user intention recognition method in complex human-computer interaction scene
Format: Patent
Language: chi; eng
Online access: Order full text
Abstract: The invention discloses a multi-modal fusion user intention recognition method for complex human-computer interaction scenes. The method comprises the following steps: acquiring speech and video, and converting the speech into text through a speech recognition module; extracting text features, speech features, and visual features with the pre-trained models BERT, Wav2vec 2.0, and Faster R-CNN, respectively, and preprocessing the features with a Transformer; constructing a modality-specific encoder and a modality-shared encoder, and performing multi-modal collaborative representation learning on the text, speech, and video features; adaptively fusing the multi-modal collaborative representations with an attention mechanism and a gated neural network, since each modality may exhibit a different level of noise at different moments in a complex scene; and inputting the fused features into a fully connected neural network to obtain the user intention recognition result.
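
The abstract does not disclose implementation details, but the pipeline it outlines (modality-specific and modality-shared encoders, attention-plus-gating fusion, fully connected classifier) can be illustrated with a minimal PyTorch sketch. This is not the patent's actual implementation: all layer sizes, module names, and the exact fusion order are illustrative assumptions, and the code assumes pooled feature vectors already extracted upstream (e.g., 768-d from BERT, 768-d from Wav2vec 2.0, 2048-d from Faster R-CNN).

```python
# Minimal sketch of the fusion stage described in the abstract.
# Hypothetical architecture: all dimensions and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalIntentModel(nn.Module):
    def __init__(self, dims=(768, 768, 2048), hidden=256, num_intents=10):
        super().__init__()
        # Project each modality's pre-extracted features to a common width.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        # Modality-specific encoders: capture what is unique to each modality.
        self.specific = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for _ in dims]
        )
        # Modality-shared encoder: one network applied to every modality,
        # pushing all representations toward a shared space.
        self.shared = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        # Attention scorer: one scalar relevance score per modality.
        self.attn = nn.Linear(2 * hidden, 1)
        # Gate: element-wise sigmoid gate to down-weight noisy modalities.
        self.gate = nn.Linear(2 * hidden, 2 * hidden)
        # Fully connected classifier over the fused representation.
        self.classifier = nn.Linear(2 * hidden, num_intents)

    def forward(self, text_feat, audio_feat, visual_feat):
        reps = []
        for x, proj, spec in zip((text_feat, audio_feat, visual_feat),
                                 self.proj, self.specific):
            h = proj(x)
            # Collaborative representation: specific and shared views side by side.
            reps.append(torch.cat([spec(h), self.shared(h)], dim=-1))
        stacked = torch.stack(reps, dim=1)               # (B, 3, 2*hidden)
        # Attention over modalities: soft weights summing to 1 across the 3.
        weights = F.softmax(self.attn(stacked), dim=1)   # (B, 3, 1)
        # Gating: suppress unreliable feature dimensions per modality.
        gated = torch.sigmoid(self.gate(stacked)) * stacked
        fused = (weights * gated).sum(dim=1)             # (B, 2*hidden)
        return self.classifier(fused)


# Usage with dummy pooled features standing in for the upstream extractors.
model = MultimodalIntentModel()
logits = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 10])
```

The per-modality gate addresses the noise problem the abstract raises: attention alone only redistributes weight across modalities, while the sigmoid gate can additionally suppress individual feature dimensions of a momentarily unreliable modality before fusion.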