AmbiSep: Joint Ambisonic-to-Ambisonic Speech Separation and Noise Reduction
Blind separation of the sounds in an Ambisonic sound scene is a challenging problem, especially when the spatial impression of these sounds needs to be preserved. In this work, we consider Ambisonic-to-Ambisonic separation of reverberant speech mixtures, optionally containing noise. A supervised learning approach is adopted, utilizing a transformer-based deep neural network denoted AmbiSep. AmbiSep takes multichannel Ambisonic signals as input and estimates separate multichannel Ambisonic signals for each speaker while preserving their spatial images, including reverberation. The GPU memory requirement of AmbiSep during training increases with the number of Ambisonic channels. To overcome this issue, we propose different aggregation methods. The model is trained and evaluated for first-order and second-order Ambisonics using simulated speech mixtures. Experimental results show that the model performs well on clean and noisy reverberant speech mixtures, and also generalizes to mixtures generated with measured Ambisonic impulse responses.
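For context (not stated in this record itself): an Ambisonic signal of spherical-harmonic order L carries (L+1)^2 channels, so first-order material has 4 channels and second-order has 9. The following is a minimal, hypothetical sketch of the input/output shapes an Ambisonic-to-Ambisonic separator works with; `separate_stub` and `n_speakers` are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: shapes in Ambisonic-to-Ambisonic separation.
# Not the AmbiSep implementation; it only illustrates the (L+1)^2
# channel count and the "one Ambisonic signal per speaker" output.
import numpy as np

def ambisonic_channels(order: int) -> int:
    """Number of Ambisonic channels for a given spherical-harmonic order."""
    return (order + 1) ** 2

def separate_stub(mixture: np.ndarray, n_speakers: int) -> np.ndarray:
    """Placeholder separator.

    mixture: (channels, samples) Ambisonic mixture.
    Returns: (n_speakers, channels, samples). A real model (such as the
    transformer described in the abstract) would estimate a distinct
    spatial image per speaker; here we just split the mixture evenly.
    """
    return np.tile(mixture / n_speakers, (n_speakers, 1, 1))

fs, seconds = 16_000, 3
for order in (1, 2):                        # first- and second-order Ambisonics
    ch = ambisonic_channels(order)          # 4 for FOA, 9 for SOA
    mix = np.random.randn(ch, fs * seconds).astype(np.float32)
    est = separate_stub(mix, n_speakers=2)
    print(f"order {order}: {ch} channels -> estimates with shape {est.shape}")
```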
Saved in:
Published in: | IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023-01, Vol. 31, p. 1-13 |
---|---|
Main authors: | Herzog, Adrian; Chetupalli, Srikanth Raj; Habets, Emanuel A. P. |
Format: | Article |
Language: | English |
Keywords: | Ambisonics; speech separation; noise reduction; supervised learning; transformers |
Online access: | Order full text |
container_end_page | 13 |
---|---|
container_issue | |
container_start_page | 1 |
container_title | IEEE/ACM transactions on audio, speech, and language processing |
container_volume | 31 |
creator | Herzog, Adrian; Chetupalli, Srikanth Raj; Habets, Emanuel A. P. |
description | Blind separation of the sounds in an Ambisonic sound scene is a challenging problem, especially when the spatial impression of these sounds needs to be preserved. In this work, we consider Ambisonic-to-Ambisonic separation of reverberant speech mixtures, optionally containing noise. A supervised learning approach is adopted, utilizing a transformer-based deep neural network denoted AmbiSep. AmbiSep takes multichannel Ambisonic signals as input and estimates separate multichannel Ambisonic signals for each speaker while preserving their spatial images, including reverberation. The GPU memory requirement of AmbiSep during training increases with the number of Ambisonic channels. To overcome this issue, we propose different aggregation methods. The model is trained and evaluated for first-order and second-order Ambisonics using simulated speech mixtures. Experimental results show that the model performs well on clean and noisy reverberant speech mixtures, and also generalizes to mixtures generated with measured Ambisonic impulse responses. |
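One way to read the aggregation idea mentioned in the abstract (a sketch under assumptions; the record does not detail the paper's actual aggregation methods): if each Ambisonic channel is encoded into its own feature map, activation memory grows with the channel count, so pooling the per-channel features into one fixed-size representation decouples the separator's cost from the Ambisonic order. The names `pool_channels` and `feat`, and the use of mean pooling, are hypothetical choices for illustration only.

```python
# Illustrative only: pooling per-channel encoder features so the
# downstream separator no longer scales with the Ambisonic channel count.
import numpy as np

def pool_channels(feat: np.ndarray) -> np.ndarray:
    """Aggregate per-channel features (channels, frames, dim) -> (frames, dim).

    Mean pooling is one simple choice; the paper proposes several
    aggregation methods, which this record does not specify.
    """
    return feat.mean(axis=0)

channels, frames, dim = 9, 500, 64       # e.g., second-order Ambisonics
feat = np.random.randn(channels, frames, dim).astype(np.float32)
pooled = pool_channels(feat)
print(pooled.shape)                      # (500, 64), independent of channel count
```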
doi_str_mv | 10.1109/TASLP.2023.3297954 |
format | Article |
identifier | ISSN: 2329-9290 |
ispartof | IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-13 |
issn | 2329-9290; 2329-9304 |
language | eng |
source | IEEE Electronic Library (IEL) |
subjects | Ambisonics; Artificial neural networks; Decoding; Encoding; Machine learning; Memory management; Mixtures; Noise reduction; Reverberation; Separation; Speech; Speech processing; speech separation; Supervised learning; Training; Transformers |
title | AmbiSep: Joint Ambisonic-to-Ambisonic Speech Separation and Noise Reduction |