Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform

In this paper, we investigate multi-stream acoustic modelling using the raw real and imaginary parts of the Fourier transform of speech signals. Using the raw magnitude spectrum, or features derived from it, as a proxy for the real and imaginary parts leads to irreversible information loss and suboptimal information fusion.
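As a rough illustration of the raw representation discussed above, the sketch below computes a framed short-time Fourier transform and keeps its real and imaginary parts as two parallel feature streams. The frame length, hop size and FFT size are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch (not the authors' code): extract the raw real and imaginary
# parts of the short-time Fourier transform of a speech signal, instead of
# collapsing them into the lossy magnitude spectrum.
import numpy as np

def stft_real_imag(signal, frame_len=400, hop=160, n_fft=512):
    """Return (real, imag) streams, each of shape (n_frames, n_fft // 2 + 1)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=n_fft, axis=1)  # complex spectrum per frame
    return spec.real, spec.imag                  # two parallel input streams

# np.abs(spec) would discard the phase relation between the two parts,
# which is the irreversible information loss the abstract refers to.
```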

Full description

Saved in:
Bibliographic details
Published in: IEEE/ACM transactions on audio, speech, and language processing, 2023, Vol.31, p.876-890
Main authors: Loweimi, Erfan, Yue, Zhengjun, Bell, Peter, Renals, Steve, Cvetkovic, Zoran
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page 890
container_issue
container_start_page 876
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 31
creator Loweimi, Erfan
Yue, Zhengjun
Bell, Peter
Renals, Steve
Cvetkovic, Zoran
description In this paper, we investigate multi-stream acoustic modelling using the raw real and imaginary parts of the Fourier transform of speech signals. Using the raw magnitude spectrum, or features derived from it, as a proxy for the real and imaginary parts leads to irreversible information loss and suboptimal information fusion. We discuss and quantify the importance of such information in terms of speech quality and intelligibility. In the proposed framework, the real and imaginary parts are treated as two streams of information, pre-processed via separate convolutional networks, and then combined at an optimal level of abstraction, followed by further post-processing via recurrent and fully-connected layers. We analyse the optimal level of information fusion in various architectures; the training dynamics in terms of cross-entropy loss, frame classification accuracy and WER; and the shape and properties of the filters learned in the first convolutional layer of single- and multi-stream models. We investigated the effectiveness of the proposed systems in various tasks: TIMIT/NTIMIT (phone recognition), Aurora-4 (noise robustness), WSJ (read speech), AMI (meeting) and TORGO (dysarthric speech). Across all tasks we achieved competitive performance: in Aurora-4, down to 4.6% WER on average; in WSJ, down to 4.6% and 6.2% WER on Eval-92 and Eval-93; on the Dev/Eval sets of AMI-IHM, down to 23.3%/23.8% WER; and on AMI-SDM, down to 43.7%/47.6% WER. In TORGO, we achieved down to 31.7% and 10.2% WER for dysarthric and typical speech, respectively.
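The description above outlines the two-stream architecture: separate convolutional front-ends for the real and imaginary streams, fusion at an intermediate level of abstraction, then recurrent and fully-connected post-processing. The following PyTorch sketch is a hedged illustration of that idea; the layer sizes, the concatenation fusion point and the target count are assumptions, not the published configuration.

```python
# Hedged sketch of the multi-stream idea: one convolutional front-end per
# stream, fusion by concatenation, then LSTM and fully-connected layers
# producing per-frame posteriors.
import torch
import torch.nn as nn

class TwoStreamAcousticModel(nn.Module):
    def __init__(self, n_bins=257, hidden=320, n_targets=2000):
        super().__init__()
        def frontend():
            # 1-D convolutions over the frequency axis of one frame/stream
            return nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(64),
            )
        self.real_net = frontend()
        self.imag_net = frontend()
        self.rnn = nn.LSTM(input_size=2 * 32 * 64, hidden_size=hidden,
                           num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, n_targets)  # e.g. tied-state targets

    def forward(self, real, imag):
        # real, imag: (batch, time, n_bins) raw real/imaginary spectrograms
        b, t, f = real.shape
        r = self.real_net(real.reshape(b * t, 1, f)).reshape(b, t, -1)
        i = self.imag_net(imag.reshape(b * t, 1, f)).reshape(b, t, -1)
        fused = torch.cat([r, i], dim=-1)  # information fusion by concatenation
        out, _ = self.rnn(fused)
        return self.classifier(out)        # per-frame posteriors
```

Moving the torch.cat earlier or later in the stack corresponds to fusing the two streams at a lower or higher level of abstraction, which is the design variable the paper analyses.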
doi_str_mv 10.1109/TASLP.2023.3237167
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2023, Vol.31, p.876-890
issn 2329-9290
2329-9304
language eng
recordid cdi_proquest_journals_2770779409
source IEEE Electronic Library (IEL)
subjects Acoustics
automatic speech recognition
Data integration
fourier transform
Fourier transforms
Information filters
Intelligibility
Modelling
multi-stream acoustic modelling
Raw signal representation
Shape
Speech recognition
Streaming media
System effectiveness
Training
title Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-14T07%3A37%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Multi-Stream%20Acoustic%20Modelling%20Using%20Raw%20Real%20and%20Imaginary%20Parts%20of%20the%20Fourier%20Transform&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Loweimi,%20Erfan&rft.date=2023&rft.volume=31&rft.spage=876&rft.epage=890&rft.pages=876-890&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2023.3237167&rft_dat=%3Cproquest_RIE%3E2770779409%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2770779409&rft_id=info:pmid/&rft_ieee_id=10026604&rfr_iscdi=true