Multichannel Variational Autoencoder-Based Speech Separation in Designated Speaker Order


Bibliographic Details

Published in: Symmetry (Basel), 2022-12, Vol. 14 (12), p. 2514
Main authors: Liao, Lele; Cheng, Guoliang; Ruan, Haoxin; Chen, Kai; Lu, Jing
Format: Article
Language: English
Online access: Full text
Abstract: The multichannel variational autoencoder (MVAE) integrates the rule-based update of a separation matrix and the deep generative model and proves to be a competitive speech separation method. However, the output (global) permutation ambiguity still exists and turns out to be a fundamental problem in applications. In this paper, we address this problem by employing two dedicated encoders. One encodes the speaker identity for the guidance of the output sorting, and the other encodes the linguistic information for the reconstruction of the source signals. The instance normalization (IN) and the adaptive instance normalization (adaIN) are applied to the networks to disentangle the speaker representations from the content representations. The separated sources are arranged in designated order by a symmetric permutation alignment scheme. In the experiments, we test the proposed method in different gender combinations and various reverberant conditions and generalize it to unseen speakers. The results validate its reliable sorting accuracy and good separation performance. The proposed method outperforms the other baseline methods and maintains stable performance, achieving over 20 dB SIR improvement even in high reverberant environments.
DOI: 10.3390/sym14122514
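The abstract's disentanglement step relies on instance normalization (IN) and adaptive instance normalization (adaIN). The paper's networks are not reproduced here, but the adaIN operation itself is standard: normalize the content features per channel, then re-scale and re-shift them with the per-channel statistics of the style (here, speaker) features. A minimal NumPy sketch, assuming feature maps of shape (channels, frames); the variable names and shapes are illustrative, not taken from the paper:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization over (channels, frames) features.

    The content features are normalized per channel (the IN step), then
    scaled and shifted with the style features' per-channel mean and std,
    so the output carries the content structure but the style statistics.
    """
    c_mu = content.mean(axis=1, keepdims=True)
    c_sigma = content.std(axis=1, keepdims=True)
    s_mu = style.mean(axis=1, keepdims=True)
    s_sigma = style.std(axis=1, keepdims=True)
    normalized = (content - c_mu) / (c_sigma + eps)  # instance normalization (IN)
    return s_sigma * normalized + s_mu               # adaIN: inject style statistics

rng = np.random.default_rng(0)
content = rng.normal(2.0, 3.0, size=(4, 100))   # stand-in for content-encoder output
style = rng.normal(-1.0, 0.5, size=(4, 100))    # stand-in for speaker-encoder output
out = adain(content, style)
# per-channel mean/std of the output now match the style features
print(np.allclose(out.mean(axis=1), style.mean(axis=1), atol=1e-3))  # prints True
```

This statistic-swapping property is what makes adaIN useful for separating "who is speaking" from "what is said": the IN step strips speaker-dependent channel statistics from the content path, and adaIN re-injects a chosen speaker's statistics.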
ISSN: 2073-8994
Source: MDPI - Multidisciplinary Digital Publishing Institute; Elektronische Zeitschriftenbibliothek (EZB) - freely accessible e-journals
Subjects: Algorithms; Coders; Deep learning; Fourier transforms; Permutations; Representations; Separation; Speech