Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders

Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors.
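The temporal dependency described above, where successive latent vectors depend on one another, can be illustrated with the simplest member of this model family: a linear-Gaussian state-space model. The sketch below is generic numpy code, not code from the paper; the dimensions, transition matrix, and noise levels are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper).
T, latent_dim, obs_dim = 50, 4, 16

# Latent dynamics z_t = A z_{t-1} + w_t and observation x_t = C z_t + v_t.
# A linear-Gaussian state-space model is the simplest instance of a temporal
# latent prior: successive z_t are correlated, which a plain VAE prior
# (i.i.d. z_t) cannot capture.
A = 0.9 * np.eye(latent_dim)                 # stable latent transition
C = rng.standard_normal((obs_dim, latent_dim))

z = np.zeros((T, latent_dim))
x = np.zeros((T, obs_dim))
for t in range(T):
    prev = z[t - 1] if t > 0 else np.zeros(latent_dim)
    z[t] = A @ prev + 0.1 * rng.standard_normal(latent_dim)
    x[t] = C @ z[t] + 0.01 * rng.standard_normal(obs_dim)

# Correlation between consecutive latent samples along one latent dimension.
corr = np.corrcoef(z[:-1, 0], z[1:, 0])[0, 1]
print(round(corr, 2))
```

With a transition coefficient of 0.9, consecutive latents are strongly correlated; setting `A` to zero recovers the i.i.d. latent prior of a vanilla VAE.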

Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE/ACM transactions on audio, speech, and language processing, 2022-01, Vol.30, p.2993-3007
Main Authors: Bie, Xiaoyu, Leglaive, Simon, Alameda-Pineda, Xavier, Girin, Laurent
Format: Article
Language: English
Subjects:
Online Access: Order full text
container_end_page 3007
container_issue
container_start_page 2993
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 30
creator Bie, Xiaoyu
Leglaive, Simon
Alameda-Pineda, Xavier
Girin, Laurent
description Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the benefit of using DVAEs over the VAE for modeling speech spectrograms. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both unsupervised representation learning and dynamics modeling of speech signals. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.
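The noise model mentioned in the description is based on nonnegative matrix factorization (NMF). The following sketch shows generic NMF with the classic Lee-Seung multiplicative updates under a Euclidean cost; it is not the authors' algorithm, which derives NMF-style updates for the noise variance inside a variational EM loop, and the matrix sizes and rank here are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "noise power spectrogram": F frequency bins x N frames, nonnegative,
# constructed to be (approximately) low rank so the factorization can succeed.
F, N, K = 32, 40, 3                      # K = NMF rank (illustrative choice)
V = rng.random((F, K)) @ rng.random((K, N)) + 1e-3

# Lee-Seung multiplicative updates minimizing ||V - W H||^2.
# Given strictly positive initialization, the updates keep W and H
# nonnegative at every iteration.
W = rng.random((F, K)) + 1e-3
H = rng.random((K, N)) + 1e-3
eps = 1e-12                              # guards against division by zero
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# Relative reconstruction error of the factorization.
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(round(err, 4))
```

In speech enhancement settings such as this one, NMF is attractive for the noise because many noise types have power spectrograms that are well approximated by a small number of fixed spectral patterns (`W`) with time-varying activations (`H`).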
doi_str_mv 10.1109/TASLP.2022.3207349
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2022-01, Vol.30, p.2993-3007
issn 2329-9290
2329-9304
language eng
recordid cdi_proquest_journals_2718794361
source IEEE Electronic Library (IEL)
subjects Algorithms
Artificial Intelligence
Computer Science
dynamical variational autoencoders
Inference algorithms
Machine Learning
Modelling
Noise measurement
nonnegative matrix factorization
Recording
Spectrograms
Speech
Speech enhancement
Speech processing
Time series analysis
Time-domain analysis
Training
variational inference
title Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T02%3A30%3A44IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unsupervised%20Speech%20Enhancement%20Using%20Dynamical%20Variational%20Autoencoders&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Bie,%20Xiaoyu&rft.date=2022-01-01&rft.volume=30&rft.spage=2993&rft.epage=3007&rft.pages=2993-3007&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2022.3207349&rft_dat=%3Cproquest_RIE%3E2718794361%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2718794361&rft_id=info:pmid/&rft_ieee_id=9894060&rfr_iscdi=true