Spoken Language Understanding of Human-Machine Conversations for Language Learning Applications

Bibliographic Details

Published in: Journal of Signal Processing Systems, 2020-08, Vol. 92 (8), p. 805-817
Authors: Qian, Yao; Ubale, Rutuja; Lange, Patrick; Evanini, Keelan; Ramanarayanan, Vikram; Soong, Frank K.
Format: Article
Language: English
Publisher: Springer US, New York
ISSN: 1939-8018
EISSN: 1939-8115
DOI: 10.1007/s11265-019-01484-3
Source: SpringerNature Journals
Online access: Full text

Abstract

Spoken language understanding (SLU) in human-machine conversational systems is the process of interpreting the semantic meaning conveyed by a user’s spoken utterance. Traditional SLU approaches transform the word string transcribed by an automatic speech recognition (ASR) system into a semantic label that determines the machine’s subsequent response. However, the robustness of SLU results can suffer in the context of a human-machine conversation-based language learning system due to the presence of ambient noise, heavily accented pronunciation, ungrammatical utterances, etc. To address these issues, this paper proposes an end-to-end (E2E) modeling approach for SLU and evaluates the semantic labeling performance of a bidirectional LSTM-RNN with input at three different levels: acoustic (filterbank features), phonetic (subphone posteriorgrams), and lexical (ASR hypotheses). Experimental results for spoken responses collected in a dialog application designed for English learners to practice job interviewing skills show that multi-level BLSTM-RNNs can utilize complementary information from the three different levels to improve semantic labeling performance. An analysis of results on out-of-vocabulary (OOV) utterances, which can be common in a conversation-based dialog system, also indicates that using subphone posteriorgrams outperforms ASR hypotheses and that incorporating the lower-level features for semantic labeling can be advantageous for improving the final SLU performance.
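
The paper's implementation is not part of this record; the sketch below is a rough, hypothetical PyTorch illustration of the kind of multi-level BLSTM-RNN semantic labeler the abstract describes. Each input level (acoustic filterbanks, subphone posteriorgrams, embeddings of the ASR hypothesis) is encoded by its own bidirectional LSTM, the final hidden states are concatenated, and a linear layer scores the semantic labels. All names, dimensions, and the late-fusion strategy are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a multi-level BLSTM-RNN semantic labeler, loosely
# following the architecture described in the abstract. Dimensions, names,
# and the fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class MultiLevelBLSTMClassifier(nn.Module):
    def __init__(self, feat_dims=(40, 120, 300), hidden=128, n_labels=10):
        super().__init__()
        # One BLSTM encoder per input level: acoustic (e.g. 40-d log-mel
        # filterbanks), phonetic (subphone posteriorgrams), and lexical
        # (word embeddings of the ASR hypothesis).
        self.encoders = nn.ModuleList(
            nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
            for d in feat_dims
        )
        # Late fusion: concatenate the per-level utterance vectors, then
        # score the semantic labels with a single linear layer.
        self.classifier = nn.Linear(2 * hidden * len(feat_dims), n_labels)

    def forward(self, level_inputs):
        # level_inputs: one (batch, time_i, feat_dims[i]) tensor per level;
        # sequence lengths may differ across levels (frames vs. words).
        pooled = []
        for enc, x in zip(self.encoders, level_inputs):
            _, (h_n, _) = enc(x)  # h_n: (num_directions=2, batch, hidden)
            pooled.append(torch.cat([h_n[0], h_n[1]], dim=-1))
        return self.classifier(torch.cat(pooled, dim=-1))  # (batch, n_labels)


if __name__ == "__main__":
    model = MultiLevelBLSTMClassifier()
    fbank = torch.randn(4, 500, 40)   # acoustic level: filterbank frames
    ppg = torch.randn(4, 500, 120)    # phonetic level: subphone posteriorgrams
    lex = torch.randn(4, 30, 300)     # lexical level: ASR-hypothesis embeddings
    print(model([fbank, ppg, lex]).shape)  # torch.Size([4, 10])
```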

Subjects:
Accentuation
Acoustic noise
Acoustic phonetics
Automatic speech recognition
Circuits and Systems
Computer Imaging
Conversation
Electrical Engineering
Engineering
English as a second language learning
Hypotheses
Image Processing and Computer Vision
Labeling
Learning
Levels
Pattern Recognition
Pattern Recognition and Graphics
Pronunciation
Semantic analysis
Semantic features
Semantics
Signal, Image and Speech Processing
Speech recognition
Spoken language
Vision
Voice recognition