Spoken Language Understanding of Human-Machine Conversations for Language Learning Applications


Bibliographic details
Published in:	Journal of Signal Processing Systems, 2020-08, Vol. 92 (8), pp. 805-817
Main authors:	Qian, Yao; Ubale, Rutuja; Lange, Patrick; Evanini, Keelan; Ramanarayanan, Vikram; Soong, Frank K.
Format:	Article
Language:	English
Subjects:	Accentuation; Acoustic noise; Acoustic phonetics; Automatic speech recognition; Circuits and Systems; Computer Imaging; Conversation; Electrical Engineering; Engineering; English as a second language learning; Hypotheses; Image Processing and Computer Vision; Labeling; Learning; Levels; Pattern Recognition; Pattern Recognition and Graphics; Pronunciation; Semantic analysis; Semantic features; Semantics; Signal, Image and Speech Processing; Speech recognition; Spoken language; Vision; Voice recognition
Online access:	Full text

Description
Abstract:	Spoken language understanding (SLU) in human-machine conversational systems is the process of interpreting the semantic meaning conveyed by a user’s spoken utterance. Traditional SLU approaches transform the word string transcribed by an automatic speech recognition (ASR) system into a semantic label that determines the machine’s subsequent response. However, the robustness of SLU results can suffer in a human-machine, conversation-based language learning system because of ambient noise, heavily accented pronunciation, and ungrammatical utterances, among other factors. To address these issues, this paper proposes an end-to-end (E2E) modeling approach for SLU and evaluates the semantic labeling performance of a bidirectional long short-term memory recurrent neural network (BLSTM-RNN) with input at three different levels: acoustic (filterbank features), phonetic (subphone posteriorgrams), and lexical (ASR hypotheses). Experimental results for spoken responses collected in a dialog application designed for English learners to practice job interviewing skills show that multi-level BLSTM-RNNs can use complementary information from the three levels to improve semantic labeling performance. An analysis of results on out-of-vocabulary (OOV) utterances, which can be common in a conversation-based dialog system, also indicates that subphone posteriorgrams outperform ASR hypotheses and that incorporating the lower-level features into semantic labeling can improve the final SLU performance.
ISSN:	1939-8018 1939-8115
DOI:	10.1007/s11265-019-01484-3