EdgeRNN: A Compact Speech Recognition Network With Spatio-Temporal Features for Edge Computing


Detailed Description

Bibliographic Details
Published in: IEEE Access, 2020, Vol. 8, pp. 81468-81478
Authors: Yang, Shunzhi; Gong, Zheng; Ye, Kai; Wei, Yungen; Huang, Zhenhua; Huang, Zheng
Format: Article
Language: English
Online Access: Full text
Abstract: Driven by the vision of the Internet of Things, research efforts have focused on designing efficient speech recognition networks for edge computing. Existing approaches (such as tpool2) do not make full use of the spatial and temporal information in the acoustic features of speech. In this paper, we propose EdgeRNN, a compact speech recognition network with spatio-temporal features for edge computing. EdgeRNN uses a 1-dimensional convolutional neural network (1-D CNN) to process the overall spatial information within each frequency band of the acoustic features, and a recurrent neural network (RNN) to process the temporal information of each frequency band. In addition, we propose a simplified attention mechanism that enhances the portion of the network's representation that contributes most to the final identification. The overall performance of EdgeRNN has been verified on speech emotion and keyword recognition. On the IEMOCAP speech emotion dataset, the unweighted average recall (UAR) reaches 63.98%; on Google's Speech Commands Dataset V1, the weighted average recall (WAR) for keyword recognition reaches 96.82%. Compared with the experimental results of related efficient networks on a Raspberry Pi 3B+, EdgeRNN improves accuracy on both speech emotion and keyword recognition.
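The pipeline described in the abstract (a 1-D CNN over the frequency features of each frame, an RNN over time, then a simplified attention pooling of the hidden states) can be sketched in NumPy. This is an illustrative toy under stated assumptions, not the paper's implementation: all layer sizes, weights, and names (`conv1d`, `simple_rnn`, `simplified_attention`) are hypothetical.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution along the feature axis of each time frame.
    x: (T, F) frames of F spectral features; w: (K,) kernel; b: scalar bias."""
    K = len(w)
    T, F = x.shape
    out = np.empty((T, F - K + 1))
    for t in range(T):
        for i in range(F - K + 1):
            out[t, i] = x[t, i:i + K] @ w + b
    return np.maximum(out, 0.0)  # ReLU

def simple_rnn(x, Wx, Wh, bh):
    """Elman-style RNN over the time axis; returns all hidden states (T, H)."""
    T, H = x.shape[0], Wh.shape[0]
    h = np.zeros(H)
    hs = np.empty((T, H))
    for t in range(T):
        h = np.tanh(x[t] @ Wx + h @ Wh + bh)
        hs[t] = h
    return hs

def simplified_attention(hs, v):
    """Score each hidden state with a single vector v, softmax over time,
    and return the attention-weighted context vector."""
    scores = hs @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ hs

rng = np.random.default_rng(0)
T, F, K, H = 10, 40, 5, 8            # frames, features, kernel size, hidden size
x = rng.standard_normal((T, F))       # stand-in for MFCC-like acoustic features
feat = conv1d(x, rng.standard_normal(K), 0.1)
hs = simple_rnn(feat,
                rng.standard_normal((F - K + 1, H)) * 0.1,
                rng.standard_normal((H, H)) * 0.1,
                np.zeros(H))
context = simplified_attention(hs, rng.standard_normal(H))
print(context.shape)  # → (8,)
```

The `context` vector would then feed a small classifier head (emotion classes or keyword labels); random weights are used here purely to show the tensor shapes flowing through the three stages.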
DOI: 10.1109/ACCESS.2020.2990974
ISSN: 2169-3536
Subjects:
Acoustics
Artificial neural networks
Computational modeling
Datasets
Edge computing
Emotion recognition
Emotions
Feature extraction
Frequency domain analysis
Internet of Things
Keywords
Mel frequency cepstral coefficient
Neural networks
Recall
Recurrent neural networks
RNN
Spatial data
Speech
speech emotion recognition
speech keywords recognition
Speech recognition
Voice recognition