Multimodal Fusion of Speech and Gesture Recognition based on Deep Learning


Detailed Description

Bibliographic details
Published in: Journal of physics. Conference series 2020-01, Vol.1453 (1), p.12092
Main authors: Qiu, Xiaoyu, Feng, Zhiquan, Yang, Xiaohui, Tian, Jinglan
Format: Article
Language: English
Online access: Full text
Abstract: This paper proposes a multimodal fusion architecture based on deep learning. The architecture takes two input forms: speech commands and hand gestures. First, the commands input by users are recognized, using a CNN for speech command recognition and an LSTM for hand gesture recognition. Second, the recognized outputs are matched against keywords and compared by similarity to obtain candidate results. Finally, the results from the two modalities are fused to output the final instruction. Experiments show that the proposed multimodal fusion model outperforms the single-modality models.
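The abstract does not specify the fusion rule, so the following is only a minimal sketch of one plausible decision-level fusion step, assuming each recognizer (CNN for speech, LSTM for gesture) emits a command label with a confidence score; the function name and agreement/fallback logic are hypothetical, not taken from the paper.

```python
def fuse_commands(speech, gesture, command_set):
    """Decision-level fusion sketch (hypothetical): each modality
    yields a (label, confidence) pair; fuse them into one command."""
    s_label, s_conf = speech
    g_label, g_conf = gesture
    # If both modalities agree, emit the shared command with boosted confidence.
    if s_label == g_label:
        return s_label, min(1.0, s_conf + g_conf)
    # Otherwise fall back to the more confident modality,
    # restricted to labels in the known command vocabulary.
    candidates = [(s_conf, s_label), (g_conf, g_label)]
    conf, label = max(c for c in candidates if c[1] in command_set)
    return label, conf

# Example: agreeing modalities reinforce each other.
cmd, conf = fuse_commands(("grab", 0.8), ("grab", 0.7),
                          {"grab", "move", "release"})
```

In this sketch, agreement between modalities strengthens the decision, while disagreement defers to the more confident recognizer; the paper's actual method additionally uses keyword search and similarity comparison before fusing.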
ISSN: 1742-6588, 1742-6596
DOI:10.1088/1742-6596/1453/1/012092