Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders

Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors.
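The temporal dependency described above, where successive latent vectors depend on one another, can be illustrated with the simplest member of this model family: a linear-Gaussian state-space model. The sketch below is generic numpy code, not code from the paper; the dimensions, transition matrix, and noise levels are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper).
T, latent_dim, obs_dim = 50, 4, 16

# Latent dynamics z_t = A z_{t-1} + w_t and observation x_t = C z_t + v_t.
# A linear-Gaussian state-space model is the simplest instance of a temporal
# latent prior: successive z_t are correlated, which a plain VAE prior
# (i.i.d. z_t) cannot capture.
A = 0.9 * np.eye(latent_dim)                 # stable latent transition
C = rng.standard_normal((obs_dim, latent_dim))

z = np.zeros((T, latent_dim))
x = np.zeros((T, obs_dim))
for t in range(T):
    prev = z[t - 1] if t > 0 else np.zeros(latent_dim)
    z[t] = A @ prev + 0.1 * rng.standard_normal(latent_dim)
    x[t] = C @ z[t] + 0.01 * rng.standard_normal(obs_dim)

# Correlation between consecutive latent samples along one latent dimension.
corr = np.corrcoef(z[:-1, 0], z[1:, 0])[0, 1]
print(round(corr, 2))
```

With a transition coefficient of 0.9, consecutive latents are strongly correlated; setting `A` to zero recovers the i.i.d. latent prior of a vanilla VAE.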

Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE/ACM transactions on audio, speech, and language processing, 2022-01, Vol.30, p.2993-3007
Main Authors: Bie, Xiaoyu, Leglaive, Simon, Alameda-Pineda, Xavier, Girin, Laurent
Format: Article
Language: English
Subjects:
Online Access: Order full text
container_end_page 3007
container_issue
container_start_page 2993
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 30
creator Bie, Xiaoyu
Leglaive, Simon
Alameda-Pineda, Xavier
Girin, Laurent
description Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the benefit of using DVAEs over the VAE for modeling speech spectrograms. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both unsupervised representation learning and dynamics modeling of speech signals. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.
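The noise model mentioned in the description is based on nonnegative matrix factorization (NMF). The following sketch shows generic NMF with the classic Lee-Seung multiplicative updates under a Euclidean cost; it is not the authors' algorithm, which derives NMF-style updates for the noise variance inside a variational EM loop, and the matrix sizes and rank here are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "noise power spectrogram": F frequency bins x N frames, nonnegative,
# constructed to be (approximately) low rank so the factorization can succeed.
F, N, K = 32, 40, 3                      # K = NMF rank (illustrative choice)
V = rng.random((F, K)) @ rng.random((K, N)) + 1e-3

# Lee-Seung multiplicative updates minimizing ||V - W H||^2.
# Given strictly positive initialization, the updates keep W and H
# nonnegative at every iteration.
W = rng.random((F, K)) + 1e-3
H = rng.random((K, N)) + 1e-3
eps = 1e-12                              # guards against division by zero
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# Relative reconstruction error of the factorization.
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(round(err, 4))
```

In speech enhancement settings such as this one, NMF is attractive for the noise because many noise types have power spectrograms that are well approximated by a small number of fixed spectral patterns (`W`) with time-varying activations (`H`).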
doi_str_mv 10.1109/TASLP.2022.3207349
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2022-01, Vol.30, p.2993-3007
issn 2329-9290
2329-9304
language eng
recordid cdi_proquest_journals_2718794361
source IEEE Electronic Library (IEL)
subjects Algorithms
Artificial Intelligence
Computer Science
dynamical variational autoencoders
Inference algorithms
Machine Learning
Modelling
Noise measurement
nonnegative matrix factorization
Recording
Spectrograms
Speech
Speech enhancement
Speech processing
Time series analysis
Time-domain analysis
Training
variational inference
title Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T02%3A30%3A44IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unsupervised%20Speech%20Enhancement%20Using%20Dynamical%20Variational%20Autoencoders&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Bie,%20Xiaoyu&rft.date=2022-01-01&rft.volume=30&rft.spage=2993&rft.epage=3007&rft.pages=2993-3007&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2022.3207349&rft_dat=%3Cproquest_RIE%3E2718794361%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2718794361&rft_id=info:pmid/&rft_ieee_id=9894060&rfr_iscdi=true