Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition

Traditionally, noise reduction methods for additive noise have been quite different from those for reverberation. In this study, we investigated the effect of additive noise and reverberation on speech on the basis of the concept of temporal modulation transfer. We first analyzed the noise effect on...

Bibliographic details
Published in:	Speech communication 2010, Vol.52 (1), p.1-11
Main authors:	Lu, X., Matsuda, S., Unoki, M., Nakamura, S.
Format:	Article
Language:	eng
Subjects:	Additives; Algorithms; Applied sciences; Cleaning; Detection, estimation, filtering, equalization, prediction; Edge-preserved smoothing; Exact sciences and technology; Information, signal and communications theory; Mean and variance normalization; Miscellaneous; Modulation; Modulation object; Modulation, demodulation; Noise; Robust speech recognition; Signal and communications theory; Signal processing; Signal, noise; Speech; Speech processing; Speech recognition; Telecommunications and information theory; Temporal logic; Temporal modulation
Online access:	Full text

container_end_page	11
container_issue	1
container_start_page	1
container_title	Speech communication
container_volume	52
creator	Lu, X.; Matsuda, S.; Unoki, M.; Nakamura, S.
description	Traditionally, noise reduction methods for additive noise have been quite different from those for reverberation. In this study, we investigated the effect of additive noise and reverberation on speech on the basis of the concept of temporal modulation transfer. We first analyzed the noise effect on the temporal modulation of speech. Then on the basis of this analysis, we proposed a two-stage processing algorithm that adaptively normalizes the temporal modulation of speech to extract robust speech features for automatic speech recognition. In the first stage of the proposed algorithm, the temporal modulation contrast of the cepstral time series for both clean and noisy speech is normalized. In the second stage, the contrast normalized temporal modulation spectrum is smoothed in order to reduce the artifacts due to noise while preserving the information in the speech modulation events (edges). We tested our algorithm in speech recognition experiments for additive noise condition, reverberant condition, and noisy condition (both additive noise and reverberation) using the AURORA-2J data corpus. Our results showed that as part of a uniform processing framework, the algorithm helped achieve the following: (1) for the additive noise condition, a 55.85% relative word error reduction (RWER) rate when clean conditional training was performed, and a 41.64% RWER rate when multi-conditional training was performed, (2) for the reverberant condition, a 51.28% RWER rate, and (3) for the noisy condition (both additive noise and reverberation), a 95.03% RWER rate. In addition, we evaluated the performance of each stage of the proposed algorithm in AURORA-2J and AURORA4 experiments, and compared the performance of our algorithm with the performances of two similar processing algorithms in the second stage. The evaluation results further confirmed the effectiveness of our proposed algorithm.
doi_str_mv	10.1016/j.specom.2009.08.006
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_864386505</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0167639309001290</els_id><sourcerecordid>864386505</sourcerecordid><originalsourceid>FETCH-LOGICAL-c497t-4bd787a5b8091d3a37db2ec0363b54d58f5bbd778282db378bb07f3f8379d4763</originalsourceid><addsrcrecordid>eNqFkctu1TAURS1EJS6FP2CQCTBKcPzOBAlVLSBVYtKOLb9y66vEDrZTCT6A766jXBiWkSVrrbOPzgbgXQ-7Hvbs06nLizNx7hCEQwdFByF7AQ694KjlvUAvwaFivGV4wK_A65xPEEIiBDqAP3duXmJSU2NiKEnl0oSYZjX536r4GBoVbOPs0bVLctmlR2ebPMdYHnw4NnFsyl9_jnaddieXtJqyVmEj6m7OPDRjTE2Keq0J559Udz4GvylvwMWopuzent9LcH9zfXf1rb398fX71Zfb1pCBl5ZoywVXVAs49BYrzK1GzkDMsKbEUjFSXREukEBWYy60hnzEo8B8sIQzfAk-7nOXFH-uLhc5-2zcNKng4pqlYAQLRiGt5IfnScohIeL_IzlBnFPKeCXJTpoUc05ulEvys0q_ZA_lVqQ8yb1IuRUpoZC1yKq9PweobNQ0JhWMz_9chDjFbNg2_rxzrh7w0bsks_EuGGd9vXSRNvrng54AHoO5Yg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>742775567</pqid></control><display><type>article</type><title>Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition</title><source>Access via ScienceDirect (Elsevier)</source><creator>Lu, X. ; Matsuda, S. ; Unoki, M. ; Nakamura, S.</creator><creatorcontrib>Lu, X. ; Matsuda, S. ; Unoki, M. ; Nakamura, S.</creatorcontrib><description>Traditionally, noise reduction methods for additive noise have been quite different from those for reverberation. In this study, we investigated the effect of additive noise and reverberation on speech on the basis of the concept of temporal modulation transfer. We first analyzed the noise effect on the temporal modulation of speech. Then on the basis of this analysis, we proposed a two-stage processing algorithm that adaptively normalizes the temporal modulation of speech to extract robust speech features for automatic speech recognition. 
In the first stage of the proposed algorithm, the temporal modulation contrast of the cepstral time series for both clean and noisy speech is normalized. In the second stage, the contrast normalized temporal modulation spectrum is smoothed in order to reduce the artifacts due to noise while preserving the information in the speech modulation events (edges). We tested our algorithm in speech recognition experiments for additive noise condition, reverberant condition, and noisy condition (both additive noise and reverberation) using the AURORA-2J data corpus. Our results showed that as part of a uniform processing framework, the algorithm helped achieve the following: (1) for the additive noise condition, a 55.85% relative word error reduction (RWER) rate when clean conditional training was performed, and a 41.64% RWER rate when multi-conditional training was performed, (2) for the reverberant condition, a 51.28% RWER rate, and (3) for the noisy condition (both additive noise and reverberation), a 95.03% RWER rate. In addition, we evaluated the performance of each stage of the proposed algorithm in AURORA-2J and AURORA4 experiments, and compared the performance of our algorithm with the performances of two similar processing algorithms in the second stage. 
The evaluation results further confirmed the effectiveness of our proposed algorithm.</description><identifier>ISSN: 0167-6393</identifier><identifier>EISSN: 1872-7182</identifier><identifier>DOI: 10.1016/j.specom.2009.08.006</identifier><identifier>CODEN: SCOMDH</identifier><language>eng</language><publisher>Amsterdam: Elsevier B.V</publisher><subject>Additives ; Algorithms ; Applied sciences ; Cleaning ; Detection, estimation, filtering, equalization, prediction ; Edge-preserved smoothing ; Exact sciences and technology ; Information, signal and communications theory ; Mean and variance normalization ; Miscellaneous ; Modulation ; Modulation object ; Modulation, demodulation ; Noise ; Robust speech recognition ; Signal and communications theory ; Signal processing ; Signal, noise ; Speech ; Speech processing ; Speech recognition ; Telecommunications and information theory ; Temporal logic ; Temporal modulation</subject><ispartof>Speech communication, 2010, Vol.52 (1), p.1-11</ispartof><rights>2009 Elsevier B.V.</rights><rights>2015 INIST-CNRS</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c497t-4bd787a5b8091d3a37db2ec0363b54d58f5bbd778282db378bb07f3f8379d4763</citedby><cites>FETCH-LOGICAL-c497t-4bd787a5b8091d3a37db2ec0363b54d58f5bbd778282db378bb07f3f8379d4763</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://dx.doi.org/10.1016/j.specom.2009.08.006$$EHTML$$P50$$Gelsevier$$H</linktohtml><link.rule.ids>314,780,784,3550,4024,27923,27924,27925,45995</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=22753695$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>Lu, X.</creatorcontrib><creatorcontrib>Matsuda, S.</creatorcontrib><creatorcontrib>Unoki, 
M.</creatorcontrib><creatorcontrib>Nakamura, S.</creatorcontrib><title>Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition</title><title>Speech communication</title><description>Traditionally, noise reduction methods for additive noise have been quite different from those for reverberation. In this study, we investigated the effect of additive noise and reverberation on speech on the basis of the concept of temporal modulation transfer. We first analyzed the noise effect on the temporal modulation of speech. Then on the basis of this analysis, we proposed a two-stage processing algorithm that adaptively normalizes the temporal modulation of speech to extract robust speech features for automatic speech recognition. In the first stage of the proposed algorithm, the temporal modulation contrast of the cepstral time series for both clean and noisy speech is normalized. In the second stage, the contrast normalized temporal modulation spectrum is smoothed in order to reduce the artifacts due to noise while preserving the information in the speech modulation events (edges). We tested our algorithm in speech recognition experiments for additive noise condition, reverberant condition, and noisy condition (both additive noise and reverberation) using the AURORA-2J data corpus. Our results showed that as part of a uniform processing framework, the algorithm helped achieve the following: (1) for the additive noise condition, a 55.85% relative word error reduction (RWER) rate when clean conditional training was performed, and a 41.64% RWER rate when multi-conditional training was performed, (2) for the reverberant condition, a 51.28% RWER rate, and (3) for the noisy condition (both additive noise and reverberation), a 95.03% RWER rate. 
In addition, we evaluated the performance of each stage of the proposed algorithm in AURORA-2J and AURORA4 experiments, and compared the performance of our algorithm with the performances of two similar processing algorithms in the second stage. The evaluation results further confirmed the effectiveness of our proposed algorithm.</description><subject>Additives</subject><subject>Algorithms</subject><subject>Applied sciences</subject><subject>Cleaning</subject><subject>Detection, estimation, filtering, equalization, prediction</subject><subject>Edge-preserved smoothing</subject><subject>Exact sciences and technology</subject><subject>Information, signal and communications theory</subject><subject>Mean and variance normalization</subject><subject>Miscellaneous</subject><subject>Modulation</subject><subject>Modulation object</subject><subject>Modulation, demodulation</subject><subject>Noise</subject><subject>Robust speech recognition</subject><subject>Signal and communications theory</subject><subject>Signal processing</subject><subject>Signal, noise</subject><subject>Speech</subject><subject>Speech processing</subject><subject>Speech recognition</subject><subject>Telecommunications and information theory</subject><subject>Temporal logic</subject><subject>Temporal 
modulation</subject><issn>0167-6393</issn><issn>1872-7182</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2010</creationdate><recordtype>article</recordtype><recordid>eNqFkctu1TAURS1EJS6FP2CQCTBKcPzOBAlVLSBVYtKOLb9y66vEDrZTCT6A766jXBiWkSVrrbOPzgbgXQ-7Hvbs06nLizNx7hCEQwdFByF7AQ694KjlvUAvwaFivGV4wK_A65xPEEIiBDqAP3duXmJSU2NiKEnl0oSYZjX536r4GBoVbOPs0bVLctmlR2ebPMdYHnw4NnFsyl9_jnaddieXtJqyVmEj6m7OPDRjTE2Keq0J559Udz4GvylvwMWopuzent9LcH9zfXf1rb398fX71Zfb1pCBl5ZoywVXVAs49BYrzK1GzkDMsKbEUjFSXREukEBWYy60hnzEo8B8sIQzfAk-7nOXFH-uLhc5-2zcNKng4pqlYAQLRiGt5IfnScohIeL_IzlBnFPKeCXJTpoUc05ulEvys0q_ZA_lVqQ8yb1IuRUpoZC1yKq9PweobNQ0JhWMz_9chDjFbNg2_rxzrh7w0bsks_EuGGd9vXSRNvrng54AHoO5Yg</recordid><startdate>2010</startdate><enddate>2010</enddate><creator>Lu, X.</creator><creator>Matsuda, S.</creator><creator>Unoki, M.</creator><creator>Nakamura, S.</creator><general>Elsevier B.V</general><general>Elsevier</general><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>8BM</scope><scope>7T9</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>2010</creationdate><title>Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition</title><author>Lu, X. ; Matsuda, S. ; Unoki, M. 
; Nakamura, S.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c497t-4bd787a5b8091d3a37db2ec0363b54d58f5bbd778282db378bb07f3f8379d4763</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2010</creationdate><topic>Additives</topic><topic>Algorithms</topic><topic>Applied sciences</topic><topic>Cleaning</topic><topic>Detection, estimation, filtering, equalization, prediction</topic><topic>Edge-preserved smoothing</topic><topic>Exact sciences and technology</topic><topic>Information, signal and communications theory</topic><topic>Mean and variance normalization</topic><topic>Miscellaneous</topic><topic>Modulation</topic><topic>Modulation object</topic><topic>Modulation, demodulation</topic><topic>Noise</topic><topic>Robust speech recognition</topic><topic>Signal and communications theory</topic><topic>Signal processing</topic><topic>Signal, noise</topic><topic>Speech</topic><topic>Speech processing</topic><topic>Speech recognition</topic><topic>Telecommunications and information theory</topic><topic>Temporal logic</topic><topic>Temporal modulation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Lu, X.</creatorcontrib><creatorcontrib>Matsuda, S.</creatorcontrib><creatorcontrib>Unoki, M.</creatorcontrib><creatorcontrib>Nakamura, S.</creatorcontrib><collection>Pascal-Francis</collection><collection>CrossRef</collection><collection>ComDisDome</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and 
Information Systems Abstracts Professional</collection><jtitle>Speech communication</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Lu, X.</au><au>Matsuda, S.</au><au>Unoki, M.</au><au>Nakamura, S.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition</atitle><jtitle>Speech communication</jtitle><date>2010</date><risdate>2010</risdate><volume>52</volume><issue>1</issue><spage>1</spage><epage>11</epage><pages>1-11</pages><issn>0167-6393</issn><eissn>1872-7182</eissn><coden>SCOMDH</coden><abstract>Traditionally, noise reduction methods for additive noise have been quite different from those for reverberation. In this study, we investigated the effect of additive noise and reverberation on speech on the basis of the concept of temporal modulation transfer. We first analyzed the noise effect on the temporal modulation of speech. Then on the basis of this analysis, we proposed a two-stage processing algorithm that adaptively normalizes the temporal modulation of speech to extract robust speech features for automatic speech recognition. In the first stage of the proposed algorithm, the temporal modulation contrast of the cepstral time series for both clean and noisy speech is normalized. In the second stage, the contrast normalized temporal modulation spectrum is smoothed in order to reduce the artifacts due to noise while preserving the information in the speech modulation events (edges). We tested our algorithm in speech recognition experiments for additive noise condition, reverberant condition, and noisy condition (both additive noise and reverberation) using the AURORA-2J data corpus. 
Our results showed that as part of a uniform processing framework, the algorithm helped achieve the following: (1) for the additive noise condition, a 55.85% relative word error reduction (RWER) rate when clean conditional training was performed, and a 41.64% RWER rate when multi-conditional training was performed, (2) for the reverberant condition, a 51.28% RWER rate, and (3) for the noisy condition (both additive noise and reverberation), a 95.03% RWER rate. In addition, we evaluated the performance of each stage of the proposed algorithm in AURORA-2J and AURORA4 experiments, and compared the performance of our algorithm with the performances of two similar processing algorithms in the second stage. The evaluation results further confirmed the effectiveness of our proposed algorithm.</abstract><cop>Amsterdam</cop><pub>Elsevier B.V</pub><doi>10.1016/j.specom.2009.08.006</doi><tpages>11</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0167-6393
ispartof	Speech communication, 2010, Vol.52 (1), p.1-11
issn	0167-6393 1872-7182
language	eng
recordid	cdi_proquest_miscellaneous_864386505
source	Access via ScienceDirect (Elsevier)
subjects	Additives; Algorithms; Applied sciences; Cleaning; Detection, estimation, filtering, equalization, prediction; Edge-preserved smoothing; Exact sciences and technology; Information, signal and communications theory; Mean and variance normalization; Miscellaneous; Modulation; Modulation object; Modulation, demodulation; Noise; Robust speech recognition; Signal and communications theory; Signal processing; Signal, noise; Speech; Speech processing; Speech recognition; Telecommunications and information theory; Temporal logic; Temporal modulation
title	Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T06%3A46%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Temporal%20contrast%20normalization%20and%20edge-preserved%20smoothing%20of%20temporal%20modulation%20structures%20of%20speech%20for%20robust%20speech%20recognition&rft.jtitle=Speech%20communication&rft.au=Lu,%20X.&rft.date=2010&rft.volume=52&rft.issue=1&rft.spage=1&rft.epage=11&rft.pages=1-11&rft.issn=0167-6393&rft.eissn=1872-7182&rft.coden=SCOMDH&rft_id=info:doi/10.1016/j.specom.2009.08.006&rft_dat=%3Cproquest_cross%3E864386505%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=742775567&rft_id=info:pmid/&rft_els_id=S0167639309001290&rfr_iscdi=true
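
The two-stage processing summarized in the description field above (contrast normalization of the cepstral time series, then smoothing that preserves modulation edges) can be illustrated with a rough sketch. This is not the authors' implementation, and the record does not give the exact formulas. The sketch assumes that stage 1 can be approximated by per-coefficient mean and variance normalization (a term listed among the record's subjects), and it swaps in a running median filter, a generic edge-preserving smoother, for stage 2. It also smooths the normalized time series directly rather than a modulation spectrum. The function names, the `width` parameter, and the `(num_frames, num_coefficients)` array layout are all hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def normalize_contrast(cepstra: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Stage 1 (assumed form): mean and variance normalization over time.

    cepstra has shape (num_frames, num_coefficients); each coefficient
    track is shifted to zero mean and scaled to unit variance.
    """
    mean = cepstra.mean(axis=0, keepdims=True)
    std = cepstra.std(axis=0, keepdims=True)
    return (cepstra - mean) / (std + eps)


def median_smooth(features: np.ndarray, width: int = 5) -> np.ndarray:
    """Stage 2 (stand-in, not the paper's smoother): running median over time.

    A median filter removes short spikes while leaving step-like edges in
    place, which is the property the abstract asks of the second stage.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    # Repeat the first and last frames so the output keeps the input length.
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    # windows has shape (num_frames, num_coefficients, width).
    windows = sliding_window_view(padded, width, axis=0)
    return np.median(windows, axis=-1)


def two_stage_features(cepstra: np.ndarray, width: int = 5) -> np.ndarray:
    """Apply the assumed stage 1 and the stand-in stage 2 in sequence."""
    return median_smooth(normalize_contrast(cepstra), width)
```

The same function would be applied to both training and test features, mirroring the abstract's statement that clean and noisy speech are normalized alike. The window width is a free parameter in this sketch and would need tuning against recognition accuracy.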