EdgeRNN: A Compact Speech Recognition Network With Spatio-Temporal Features for Edge Computing


Detailed Description

Bibliographic Details
Published in: IEEE Access, 2020, Vol. 8, pp. 81468-81478
Authors: Yang, Shunzhi; Gong, Zheng; Ye, Kai; Wei, Yungen; Huang, Zhenhua; Huang, Zheng
Format: Article
Language: English
Online Access: Full text
Abstract: Driven by the vision of the Internet of Things, research efforts have focused on designing efficient speech recognition networks for edge computing. Existing approaches (such as tpool2) do not make full use of the spatial and temporal information in the acoustic features of speech. In this paper, we propose EdgeRNN, a compact speech recognition network with spatio-temporal features for edge computing. EdgeRNN uses a 1-dimensional convolutional neural network (1-D CNN) to process the overall spatial information within each frequency band of the acoustic features, and a recurrent neural network (RNN) to process the temporal information of each frequency band. In addition, we propose a simplified attention mechanism that enhances the portion of the network's representation that contributes most to the final identification. The overall performance of EdgeRNN has been verified on speech emotion and keyword recognition. On the IEMOCAP speech emotion dataset, the unweighted average recall (UAR) reaches 63.98%; on Google's Speech Commands Dataset V1, the weighted average recall (WAR) for keyword recognition reaches 96.82%. Compared with the experimental results of related efficient networks on a Raspberry Pi 3B+, EdgeRNN improves accuracy on both speech emotion and keyword recognition.
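The pipeline described in the abstract (a 1-D CNN over the frequency features of each frame, an RNN over time, then a simplified attention pooling of the hidden states) can be sketched in NumPy. This is an illustrative toy under stated assumptions, not the paper's implementation: all layer sizes, weights, and names (`conv1d`, `simple_rnn`, `simplified_attention`) are hypothetical.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution along the feature axis of each time frame.
    x: (T, F) frames of F spectral features; w: (K,) kernel; b: scalar bias."""
    K = len(w)
    T, F = x.shape
    out = np.empty((T, F - K + 1))
    for t in range(T):
        for i in range(F - K + 1):
            out[t, i] = x[t, i:i + K] @ w + b
    return np.maximum(out, 0.0)  # ReLU

def simple_rnn(x, Wx, Wh, bh):
    """Elman-style RNN over the time axis; returns all hidden states (T, H)."""
    T, H = x.shape[0], Wh.shape[0]
    h = np.zeros(H)
    hs = np.empty((T, H))
    for t in range(T):
        h = np.tanh(x[t] @ Wx + h @ Wh + bh)
        hs[t] = h
    return hs

def simplified_attention(hs, v):
    """Score each hidden state with a single vector v, softmax over time,
    and return the attention-weighted context vector."""
    scores = hs @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ hs

rng = np.random.default_rng(0)
T, F, K, H = 10, 40, 5, 8            # frames, features, kernel size, hidden size
x = rng.standard_normal((T, F))       # stand-in for MFCC-like acoustic features
feat = conv1d(x, rng.standard_normal(K), 0.1)
hs = simple_rnn(feat,
                rng.standard_normal((F - K + 1, H)) * 0.1,
                rng.standard_normal((H, H)) * 0.1,
                np.zeros(H))
context = simplified_attention(hs, rng.standard_normal(H))
print(context.shape)  # → (8,)
```

The `context` vector would then feed a small classifier head (emotion classes or keyword labels); random weights are used here purely to show the tensor shapes flowing through the three stages.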
DOI: 10.1109/ACCESS.2020.2990974
ISSN: 2169-3536
Subjects:
Acoustics
Artificial neural networks
Computational modeling
Datasets
Edge computing
Emotion recognition
Emotions
Feature extraction
Frequency domain analysis
Internet of Things
Keywords
Mel frequency cepstral coefficient
Neural networks
Recall
Recurrent neural networks
RNN
Spatial data
Speech
speech emotion recognition
speech keywords recognition
Speech recognition
Voice recognition