Multichannel Variational Autoencoder-Based Speech Separation in Designated Speaker Order
The multichannel variational autoencoder (MVAE) integrates the rule-based update of a separation matrix and the deep generative model and proves to be a competitive speech separation method. However, the output (global) permutation ambiguity still exists and turns out to be a fundamental problem in applications.
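One way to picture resolving the output permutation ambiguity: score each separated output against each designated speaker slot and keep the permutation with the highest total score. This is a generic illustrative sketch, not the paper's actual alignment scheme; the function name `align_outputs` and the similarity matrix are hypothetical, and a brute-force search stands in for whatever matching the method really uses.

```python
import itertools

import numpy as np


def align_outputs(similarity):
    """Pick the output-to-slot assignment with the highest total score.

    similarity[i, j] is a score between separated source i and designated
    speaker slot j (e.g. cosine similarity of speaker embeddings).
    Returns a tuple p where p[j] is the source index assigned to slot j.
    Brute force over all n! permutations -- fine for the 2-4 sources
    typical of multichannel separation.
    """
    n = similarity.shape[0]
    return max(
        itertools.permutations(range(n)),
        key=lambda p: sum(similarity[p[j], j] for j in range(n)),
    )
```

For larger source counts, the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) solves the same assignment in polynomial time.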
Saved in:
Published in: | Symmetry (Basel) 2022-12, Vol.14 (12), p.2514 |
---|---|
Main authors: | Liao, Lele; Cheng, Guoliang; Ruan, Haoxin; Chen, Kai; Lu, Jing |
Format: | Article |
Language: | English |
Subjects: | Algorithms; Coders; Deep learning; Fourier transforms; Permutations; Representations; Separation; Speech |
Online access: | Full text |
container_issue | 12 |
container_start_page | 2514 |
container_title | Symmetry (Basel) |
container_volume | 14 |
creator | Liao, Lele; Cheng, Guoliang; Ruan, Haoxin; Chen, Kai; Lu, Jing |
description | The multichannel variational autoencoder (MVAE) integrates the rule-based update of a separation matrix and the deep generative model and proves to be a competitive speech separation method. However, the output (global) permutation ambiguity still exists and turns out to be a fundamental problem in applications. In this paper, we address this problem by employing two dedicated encoders. One encodes the speaker identity for the guidance of the output sorting, and the other encodes the linguistic information for the reconstruction of the source signals. The instance normalization (IN) and the adaptive instance normalization (adaIN) are applied to the networks to disentangle the speaker representations from the content representations. The separated sources are arranged in designated order by a symmetric permutation alignment scheme. In the experiments, we test the proposed method in different gender combinations and various reverberant conditions and generalize it to unseen speakers. The results validate its reliable sorting accuracy and good separation performance. The proposed method outperforms the other baseline methods and maintains stable performance, achieving over 20 dB SIR improvement even in high reverberant environments. |
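The instance normalization and adaptive instance normalization mentioned in the abstract can be sketched in a few lines: IN strips the per-channel statistics (which carry speaker identity) from a feature map, and adaIN re-imposes the statistics of a reference speaker's features. This is a minimal NumPy illustration of the general IN/adaIN operations, not the paper's network; the function names and the (channels, frames) layout are assumptions.

```python
import numpy as np


def instance_norm(x, eps=1e-5):
    """Instance normalization over a (channels, frames) feature map:
    remove each channel's mean and variance, keeping only the
    normalized (speaker-independent) content structure."""
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)


def ada_in(content, speaker, eps=1e-5):
    """Adaptive instance normalization: normalize the content features,
    then rescale and shift them with the channel-wise statistics of the
    speaker features, so the output carries the speaker's statistics."""
    mean = speaker.mean(axis=1, keepdims=True)
    std = speaker.std(axis=1, keepdims=True)
    return std * instance_norm(content, eps) + mean
```

After `ada_in`, each output channel's mean and standard deviation match those of the corresponding speaker-feature channel, which is what lets the decoder combine disentangled content with a chosen speaker identity.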
doi_str_mv | 10.3390/sym14122514 |
format | Article |
publisher | MDPI AG, Basel |
publication date | 2022-12-01 |
rights | © 2022 by the authors. Licensee MDPI, Basel, Switzerland. Open access under the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/) |
orcid | https://orcid.org/0000-0001-9683-3768 |
fulltext | fulltext |
identifier | ISSN: 2073-8994 |
ispartof | Symmetry (Basel), 2022-12, Vol.14 (12), p.2514 |
issn | 2073-8994 |
language | eng |
recordid | cdi_proquest_journals_2756784856 |
source | MDPI - Multidisciplinary Digital Publishing Institute; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals |
subjects | Algorithms; Coders; Deep learning; Fourier transforms; Permutations; Representations; Separation; Speech |
title | Multichannel Variational Autoencoder-Based Speech Separation in Designated Speaker Order |