Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis

Bibliographic Details
Published in: Journal on Multimodal User Interfaces, March 2010, Vol. 3 (1-2), pp. 33-48
Authors: Kessous, Loic; Castellano, Ginevra; Caridakis, George
Format: Article
Language: English
Online access: Full text
Abstract: In this paper, a study on multimodal automatic emotion recognition during speech-based interaction is presented. A database was constructed consisting of people pronouncing a sentence in a scenario where they interacted with an agent using speech. Ten people pronounced a sentence corresponding to a command while making eight different emotional expressions. Gender was equally represented, and the speakers had several different native languages, including French, German, Greek and Italian. Facial expression, gesture and acoustic analysis of speech were used to extract features relevant to emotion. For the automatic classification of unimodal, bimodal and multimodal data, a system based on a Bayesian classifier was used. After performing an automatic classification of each modality, the different modalities were combined using a multimodal approach. Fusion of the modalities at the feature level (before running the classifier) was compared with fusion at the results level (combining the results from the classifier of each modality). Fusing the multimodal data resulted in a large increase in recognition rates compared to the unimodal systems: the multimodal approach increased the recognition rate by more than 10% over the most successful unimodal system. Bimodal emotion recognition based on all combinations of the modalities (i.e., 'face-gesture', 'face-speech' and 'gesture-speech') was also investigated. The results show that the best pairing is 'gesture-speech'. Using all three modalities resulted in a 3.3% classification improvement over the best bimodal results.
ISSN:1783-7677
1783-8738
DOI:10.1007/s12193-009-0025-5
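
The abstract does not specify the classifier implementation beyond "a Bayesian classifier", so the following is only a minimal sketch of the two fusion strategies it compares, using scikit-learn's Gaussian Naive Bayes and synthetic stand-ins for the face, gesture and speech feature sets; the feature dimensions and the product rule for combining per-modality posteriors are illustrative assumptions, not the study's configuration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n_samples, n_classes = 400, 8          # eight emotion classes, as in the study
y = rng.integers(0, n_classes, n_samples)

# Synthetic per-modality features (dimensions are arbitrary placeholders).
face = rng.normal(size=(n_samples, 20)) + 0.3 * y[:, None]
gesture = rng.normal(size=(n_samples, 15)) + 0.3 * y[:, None]
speech = rng.normal(size=(n_samples, 25)) + 0.3 * y[:, None]

idx_train, idx_test = train_test_split(
    np.arange(n_samples), test_size=0.25, random_state=0, stratify=y)

# Feature-level fusion: concatenate all modality features, train one classifier.
X_all = np.hstack([face, gesture, speech])
feature_level = GaussianNB().fit(X_all[idx_train], y[idx_train])
print("feature-level fusion accuracy:",
      feature_level.score(X_all[idx_test], y[idx_test]))

# Results-level (decision-level) fusion: one classifier per modality, with the
# per-modality posteriors combined here by a simple product rule.
posteriors = [
    GaussianNB().fit(X[idx_train], y[idx_train]).predict_proba(X[idx_test])
    for X in (face, gesture, speech)
]
fused = np.prod(posteriors, axis=0)
pred = fused.argmax(axis=1)
print("results-level fusion accuracy:", (pred == y[idx_test]).mean())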