Multichannel Variational Autoencoder-Based Speech Separation in Designated Speaker Order


Bibliographic Details

Published in: Symmetry (Basel), 2022-12, Vol. 14 (12), p. 2514
Main authors: Liao, Lele; Cheng, Guoliang; Ruan, Haoxin; Chen, Kai; Lu, Jing
Format: Article
Language: English
Online access: Full text
Abstract: The multichannel variational autoencoder (MVAE) integrates the rule-based update of a separation matrix and the deep generative model and proves to be a competitive speech separation method. However, the output (global) permutation ambiguity still exists and turns out to be a fundamental problem in applications. In this paper, we address this problem by employing two dedicated encoders. One encodes the speaker identity for the guidance of the output sorting, and the other encodes the linguistic information for the reconstruction of the source signals. The instance normalization (IN) and the adaptive instance normalization (adaIN) are applied to the networks to disentangle the speaker representations from the content representations. The separated sources are arranged in designated order by a symmetric permutation alignment scheme. In the experiments, we test the proposed method in different gender combinations and various reverberant conditions and generalize it to unseen speakers. The results validate its reliable sorting accuracy and good separation performance. The proposed method outperforms the other baseline methods and maintains stable performance, achieving over 20 dB SIR improvement even in high reverberant environments.
DOI: 10.3390/sym14122514
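The abstract's disentanglement step relies on instance normalization (IN) and adaptive instance normalization (adaIN). The paper's networks are not reproduced here, but the adaIN operation itself is standard: normalize the content features per channel, then re-scale and re-shift them with the per-channel statistics of the style (here, speaker) features. A minimal NumPy sketch, assuming feature maps of shape (channels, frames); the variable names and shapes are illustrative, not taken from the paper:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization over (channels, frames) features.

    The content features are normalized per channel (the IN step), then
    scaled and shifted with the style features' per-channel mean and std,
    so the output carries the content structure but the style statistics.
    """
    c_mu = content.mean(axis=1, keepdims=True)
    c_sigma = content.std(axis=1, keepdims=True)
    s_mu = style.mean(axis=1, keepdims=True)
    s_sigma = style.std(axis=1, keepdims=True)
    normalized = (content - c_mu) / (c_sigma + eps)  # instance normalization (IN)
    return s_sigma * normalized + s_mu               # adaIN: inject style statistics

rng = np.random.default_rng(0)
content = rng.normal(2.0, 3.0, size=(4, 100))   # stand-in for content-encoder output
style = rng.normal(-1.0, 0.5, size=(4, 100))    # stand-in for speaker-encoder output
out = adain(content, style)
# per-channel mean/std of the output now match the style features
print(np.allclose(out.mean(axis=1), style.mean(axis=1), atol=1e-3))  # prints True
```

This statistic-swapping property is what makes adaIN useful for separating "who is speaking" from "what is said": the IN step strips speaker-dependent channel statistics from the content path, and adaIN re-injects a chosen speaker's statistics.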
ISSN: 2073-8994
Source: MDPI - Multidisciplinary Digital Publishing Institute; Elektronische Zeitschriftenbibliothek (EZB) - freely accessible e-journals
Subjects: Algorithms; Coders; Deep learning; Fourier transforms; Permutations; Representations; Separation; Speech