Speech Enhancement Based on Teacher-Student Deep Learning Using Improved Speech Presence Probability for Noise-Robust Speech Recognition


Full Description

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019-12, Vol. 27 (12), p. 2080-2091
Main Authors: Tu, Yan-Hui, Du, Jun, Lee, Chin-Hui
Format: Article
Language: English
Subjects:
Online Access: Order full text
container_end_page 2091
container_issue 12
container_start_page 2080
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 27
creator Tu, Yan-Hui
Du, Jun
Lee, Chin-Hui
description In this paper, we propose a novel teacher-student learning framework for the preprocessing of a speech recognizer, leveraging the online noise-tracking capabilities of improved minima controlled recursive averaging (IMCRA) and deep learning of nonlinear interactions between speech and noise. First, a teacher model with deep architectures is built to learn the target of ideal ratio masks (IRMs) using simulated training pairs of clean and noisy speech data. Next, a student model is trained to learn an improved speech presence probability by incorporating the estimated IRMs from the teacher model into the IMCRA approach. The student model can be compactly designed in a causal, zero-latency processing mode under the guidance of a complex and noncausal teacher model. Moreover, the clean-speech requirement, which is difficult to meet in real-world adverse environments, can be relaxed for training the student model, meaning that noisy speech data can be used directly to adapt the regression-based enhancement model and further improve speech recognition accuracy for noisy speech collected in such conditions. Experiments on the CHiME-4 challenge task show that our best student model, with bidirectional gated recurrent units (BGRUs), achieves a relative word error rate (WER) reduction of 18.85% on the real test set compared to the unprocessed system without acoustic model retraining, whereas the traditional teacher model degrades the performance of the unprocessed system in this case. In addition, the student model with a deep neural network (DNN) in causal, zero-latency mode yields a relative WER reduction of 7.94% over the unprocessed system with 670 times fewer computing cycles than the BGRU-equipped student model. Finally, conventional speech enhancement and the IRM-based deep learning method degraded ASR performance when the recognition system became more powerful, while our proposed approach still improved ASR performance even with the more powerful recognition system.
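The ideal ratio mask (IRM) target that the teacher model learns can be sketched per time-frequency bin as a minimal illustration. This is not the authors' implementation; the function name, the toy spectrogram values, and the compression exponent `beta` are assumptions (a square-root compression is a common choice in the IRM literature).

```python
import numpy as np

def ideal_ratio_mask(clean_power, noise_power, beta=0.5, eps=1e-12):
    """IRM per time-frequency bin: (S^2 / (S^2 + N^2)) ** beta.

    clean_power, noise_power: arrays of per-bin power spectra
    from the simulated clean/noisy training pair; eps guards
    against all-zero bins.
    """
    return (clean_power / (clean_power + noise_power + eps)) ** beta

# Toy 2x2 "spectrograms" (time x frequency), purely illustrative.
clean = np.array([[4.0, 1.0], [9.0, 0.0]])
noise = np.array([[1.0, 3.0], [1.0, 2.0]])
mask = ideal_ratio_mask(clean, noise)
```

The mask lies in [0, 1] per bin: near 1 where speech dominates, near 0 where noise dominates, which is what lets the student model reuse the teacher's IRM estimates as a soft speech presence probability inside IMCRA.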
doi_str_mv 10.1109/TASLP.2019.2940662
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2019-12, Vol.27 (12), p.2080-2091
issn 2329-9290
2329-9304
language eng
recordid cdi_ieee_primary_8834827
source IEEE/IET Electronic Library
subjects Acoustic noise
Adaptation models
Artificial neural networks
Computational modeling
Computer simulation
Deep learning
deep learning based speech enhancement
Error reduction
improved minima controlled recursive averaging
improved speech presence probability
Machine learning
Masks
Model accuracy
Noise
Noise measurement
noise-robust speech recognition
Performance degradation
Regression models
Retraining
Speech
Speech enhancement
Speech processing
Speech recognition
Statistical analysis
Target masking
Teacher-student learning
Teachers
Training
Voice recognition
title Speech Enhancement Based on Teacher-Student Deep Learning Using Improved Speech Presence Probability for Noise-Robust Speech Recognition
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-14T15%3A04%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Speech%20Enhancement%20Based%20on%20Teacher-Student%20Deep%20Learning%20Using%20Improved%20Speech%20Presence%20Probability%20for%20Noise-Robust%20Speech%20Recognition&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Tu,%20Yan-Hui&rft.date=2019-12-01&rft.volume=27&rft.issue=12&rft.spage=2080&rft.epage=2091&rft.pages=2080-2091&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASD8&rft_id=info:doi/10.1109/TASLP.2019.2940662&rft_dat=%3Cproquest_RIE%3E2298712591%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2298712591&rft_id=info:pmid/&rft_ieee_id=8834827&rfr_iscdi=true