Prosody based audiovisual coanalysis for coverbal gesture recognition
Published in: IEEE Transactions on Multimedia, April 2005, Vol. 7, No. 2, pp. 234-242
Main Authors: , ,
Format: Article
Language: English
Abstract: Despite recent advances in vision-based gesture recognition, its applications remain largely limited to artificially defined and well-articulated gesture signs used for human-computer interaction. A key reason for this is the low recognition rate for "natural" gesticulation. Previous attempts at using speech cues to reduce the error-proneness of visual classification have been mostly limited to keyword-gesture coanalysis; such a scheme inherits the complexity and delays associated with natural language processing. This paper offers a novel "signal-level" perspective, in which prosodic manifestations in speech and hand kinematics are considered as a basis for coanalyzing the loosely coupled modalities. We present a computational framework for improving continuous gesture recognition based on two phenomena that capture voluntary (coarticulation) and involuntary (physiological) contributions to prosodic synchronization. Physiological constraints, manifested as signal interruptions during multimodal production, are exploited in an audiovisual feature-integration framework using hidden Markov models. Coarticulation is analyzed using a Bayesian network of naive classifiers to explore the alignment of intonationally prominent speech segments with hand kinematics. The efficacy of the proposed approach was demonstrated on a multimodal corpus created from Weather Channel broadcasts. Both schemes were found to contribute uniquely by reducing different error types, which improves the performance of continuous gesture recognition.
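The feature-integration scheme is easy to picture at the signal level: prosodic and kinematic observations are fused frame by frame and modeled with one continuous-density HMM per gesture class. The sketch below is a minimal illustration of that idea, not the paper's implementation; it assumes the `hmmlearn` package, and the feature names, dimensions, and class structure are hypothetical.

```python
# Minimal sketch of signal-level audiovisual fusion with per-class HMMs.
# Assumes hmmlearn is installed; features/dimensions are hypothetical.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def fuse(prosody, kinematics):
    """Frame-synchronous concatenation of prosodic features (e.g., F0,
    energy) and hand-kinematic features (e.g., velocity, acceleration).
    Both arrays have shape (n_frames, n_dims)."""
    return np.hstack([prosody, kinematics])

class GestureHMMs:
    """One continuous-density HMM per gesture class; a fused observation
    sequence is labeled by the class whose HMM scores it highest."""

    def __init__(self, n_states=3):
        self.n_states = n_states
        self.models = {}

    def fit(self, sequences_by_class):
        # sequences_by_class: {label: [seq, ...]}, each seq (n_frames, n_dims)
        for label, seqs in sequences_by_class.items():
            X, lengths = np.vstack(seqs), [len(s) for s in seqs]
            m = GaussianHMM(n_components=self.n_states,
                            covariance_type="diag", n_iter=50)
            m.fit(X, lengths)
            self.models[label] = m
        return self

    def predict(self, seq):
        # Log-likelihood of the fused sequence under each class model.
        return max(self.models, key=lambda c: self.models[c].score(seq))
```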
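The coarticulation analysis can likewise be pictured as naive Bayes classification over alignment features, e.g., the time offset between a pitch-accent peak and the nearest hand-velocity peak. A self-contained numpy sketch of a Gaussian naive Bayes classifier follows; the alignment features and toy labels are hypothetical, and the paper's actual Bayesian network of naive classifiers is richer than this single classifier.

```python
# Minimal Gaussian naive Bayes over hypothetical prosody-kinematics
# alignment features; illustrative only, not the paper's network.
import numpy as np

class GaussianNaiveBayes:
    """Features are assumed conditionally independent given the class
    (e.g., "prominent segment aligned with a stroke" vs. "not aligned")."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.priors[c] = len(Xc) / len(X)
            self.means[c] = Xc.mean(axis=0)
            self.vars[c] = Xc.var(axis=0) + 1e-6  # variance floor
        return self

    def predict(self, X):
        scores = []
        for c in self.classes:
            # Per-class Gaussian log-likelihood, summed over feature dims.
            ll = -0.5 * np.sum(np.log(2 * np.pi * self.vars[c])
                               + (X - self.means[c]) ** 2 / self.vars[c],
                               axis=1)
            scores.append(ll + np.log(self.priors[c]))
        return self.classes[np.argmax(scores, axis=0)]

# Toy usage with made-up features: [accent-to-peak offset, peak F0,
# peak hand speed] per speech segment, with synthetic labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] < 0.2).astype(int)
print(GaussianNaiveBayes().fit(X, y).predict(X[:5]))
```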
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2004.840590