EdgeRNN: A Compact Speech Recognition Network With Spatio-Temporal Features for Edge Computing
Saved in:
Published in: | IEEE Access, 2020, Vol. 8, p. 81468-81478 |
---|---|
Main authors: | Yang, Shunzhi; Gong, Zheng; Ye, Kai; Wei, Yungen; Huang, Zhenhua; Huang, Zheng |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 81478 |
---|---|
container_issue | |
container_start_page | 81468 |
container_title | IEEE access |
container_volume | 8 |
creator | Yang, Shunzhi; Gong, Zheng; Ye, Kai; Wei, Yungen; Huang, Zhenhua; Huang, Zheng |
description | Driven by the vision of the Internet of Things, some research efforts have already focused on designing efficient speech recognition networks for edge computing. However, existing approaches (such as tpool2) do not make full use of the spatial and temporal information in the acoustic features of speech. In this paper, we propose a compact speech recognition network with spatio-temporal features for edge computing, named EdgeRNN. EdgeRNN uses a 1-Dimensional Convolutional Neural Network (1-D CNN) to process the overall spatial information of each frequency domain of the acoustic features, and a Recurrent Neural Network (RNN) to process the temporal information of each frequency domain. In addition, we propose a simplified attention mechanism that enhances the portion of the network contributing most to the final identification. The overall performance of EdgeRNN has been verified on speech emotion and keyword recognition. Speech emotion recognition uses the IEMOCAP dataset and reaches an unweighted average recall (UAR) of 63.98%. Speech keyword recognition uses Google's Speech Commands Dataset V1 and reaches a weighted average recall (WAR) of 96.82%. Compared with the experimental results of related efficient networks on a Raspberry Pi 3B+, EdgeRNN improves accuracy on both speech emotion and keyword recognition. |
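The abstract reports its two headline numbers in different metrics: UAR (unweighted average recall) for the class-imbalanced IEMOCAP emotion task, and WAR (weighted average recall, equivalent to overall accuracy) for keyword recognition. The sketch below, with hypothetical toy labels not taken from the paper, shows how the two metrics diverge on imbalanced data:

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / true instances of that class."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / totals[c] for c in totals}

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls, each class counting equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

def war(y_true, y_pred):
    """Weighted average recall: recalls weighted by class frequency, i.e. overall accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced example: class "a" has 4 samples, class "b" has 1.
# A classifier that always predicts "a" looks good under WAR but poor under UAR.
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
print(uar(y_true, y_pred))  # 0.5  (recall of "a" = 1.0, recall of "b" = 0.0)
print(war(y_true, y_pred))  # 0.8  (4 of 5 samples correct)
```

This is why UAR is the conventional choice for IEMOCAP, where emotion classes are unevenly represented.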
doi_str_mv | 10.1109/ACCESS.2020.2990974 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 2169-3536 |
ispartof | IEEE access, 2020, Vol.8, p.81468-81478 |
issn | 2169-3536 (ISSN); 2169-3536 (eISSN) |
language | eng |
recordid | cdi_ieee_primary_9081948 |
source | IEEE Open Access Journals; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals |
subjects | Acoustics; Artificial neural networks; Computational modeling; Datasets; Edge computing; Emotion recognition; Emotions; Feature extraction; Frequency domain analysis; Internet of Things; Keywords; Mel frequency cepstral coefficient; Neural networks; Recall; Recurrent neural networks; RNN; Spatial data; Speech; speech emotion recognition; speech keywords recognition; Speech recognition; Voice recognition |
title | EdgeRNN: A Compact Speech Recognition Network With Spatio-Temporal Features for Edge Computing |