Spoken Language Understanding of Human-Machine Conversations for Language Learning Applications
Published in: | Journal of signal processing systems 2020-08, Vol.92 (8), p.805-817 |
---|---|
Main authors: | Qian, Yao; Ubale, Rutuja; Lange, Patrick; Evanini, Keelan; Ramanarayanan, Vikram; Soong, Frank K. |
Format: | Article |
Language: | eng |
Subjects: | Accentuation; Acoustic noise; Acoustic phonetics; Automatic speech recognition; Circuits and Systems; Computer Imaging; Conversation; Electrical Engineering; Engineering; English as a second language learning; Hypotheses; Image Processing and Computer Vision; Labeling; Learning; Levels; Pattern Recognition; Pattern Recognition and Graphics; Pronunciation; Semantic analysis; Semantic features; Semantics; Signal, Image and Speech Processing; Speech recognition; Spoken language; Vision; Voice recognition |
Online access: | Full text |
container_end_page | 817 |
---|---|
container_issue | 8 |
container_start_page | 805 |
container_title | Journal of signal processing systems |
container_volume | 92 |
creator | Qian, Yao; Ubale, Rutuja; Lange, Patrick; Evanini, Keelan; Ramanarayanan, Vikram; Soong, Frank K. |
description | Spoken language understanding (SLU) in human-machine conversational systems is the process of interpreting the semantic meaning conveyed by a user’s spoken utterance. Traditional SLU approaches transform the word string transcribed by an automatic speech recognition (ASR) system into a semantic label that determines the machine’s subsequent response. However, the robustness of SLU results can suffer in the context of a human-machine conversation-based language learning system due to the presence of ambient noise, heavily accented pronunciation, ungrammatical utterances, etc. To address these issues, this paper proposes an end-to-end (E2E) modeling approach for SLU and evaluates the semantic labeling performance of a bidirectional LSTM-RNN with input at three different levels: acoustic (filterbank features), phonetic (subphone posteriorgrams), and lexical (ASR hypotheses). Experimental results for spoken responses collected in a dialog application designed for English learners to practice job interviewing skills show that multi-level BLSTM-RNNs can utilize complementary information from the three different levels to improve the semantic labeling performance. An analysis of results on out-of-vocabulary (OOV) utterances, which can be common in a conversation-based dialog system, also indicates that using subphone posteriorgrams outperforms ASR hypotheses and incorporating the lower-level features for semantic labeling can be advantageous to improving the final SLU performance. |
doi_str_mv | 10.1007/s11265-019-01484-3 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 1939-8018 |
ispartof | Journal of signal processing systems, 2020-08, Vol.92 (8), p.805-817 |
issn | 1939-8018; 1939-8115 (eISSN) |
language | eng |
recordid | cdi_proquest_journals_2423564268 |
source | SpringerNature Journals |
subjects | Accentuation; Acoustic noise; Acoustic phonetics; Automatic speech recognition; Circuits and Systems; Computer Imaging; Conversation; Electrical Engineering; Engineering; English as a second language learning; Hypotheses; Image Processing and Computer Vision; Labeling; Learning; Levels; Pattern Recognition; Pattern Recognition and Graphics; Pronunciation; Semantic analysis; Semantic features; Semantics; Signal, Image and Speech Processing; Speech recognition; Spoken language; Vision; Voice recognition |
title | Spoken Language Understanding of Human-Machine Conversations for Language Learning Applications |
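The abstract above describes an utterance-level semantic labeler built on a bidirectional LSTM. As a concrete illustration, here is a minimal PyTorch sketch of such a model for a single input level (e.g. 40-dimensional filterbank features). The dimensions, the mean-pooling over time, and the class name `BLSTMSemanticLabeler` are illustrative assumptions, not the implementation described in the paper.

```python
# A minimal, hypothetical sketch of an utterance-level semantic labeler built
# on a bidirectional LSTM, in the spirit of the approach described in the
# abstract above. All dimensions, the pooling strategy, and the single-input
# setup are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class BLSTMSemanticLabeler(nn.Module):
    """Maps a sequence of frame-level features (e.g. filterbank features or
    subphone posteriorgrams) to one semantic label per utterance."""

    def __init__(self, input_dim: int, hidden_dim: int, num_labels: int):
        super().__init__()
        # Bidirectional LSTM reads the feature frames in both directions.
        self.blstm = nn.LSTM(input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        # Linear layer maps the pooled BLSTM states to semantic-label logits.
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, input_dim)
        states, _ = self.blstm(frames)   # (batch, time, 2 * hidden_dim)
        pooled = states.mean(dim=1)      # average over time (one simple choice)
        return self.classifier(pooled)   # (batch, num_labels)


# Toy usage: 40-d filterbank features, 20 frames, 5 semantic labels.
model = BLSTMSemanticLabeler(input_dim=40, hidden_dim=128, num_labels=5)
logits = model(torch.randn(2, 20, 40))
print(logits.shape)  # torch.Size([2, 5])
```

The multi-level approach evaluated in the paper would additionally consume subphone posteriorgrams and ASR hypotheses and combine the complementary streams; the single-stream sketch above only shows the basic BLSTM-to-label mapping.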