Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders
Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors.
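The abstract combines a pre-trained speech prior with a nonnegative matrix factorization (NMF) noise model inside a variational EM loop. As a rough illustration of two standard ingredients of that pipeline — not the paper's actual VEM algorithm — the sketch below fits an Itakura-Saito NMF to a power spectrogram with classic multiplicative updates and applies a Wiener-like gain given a speech variance; the function names are illustrative, and in the paper's setting the speech variance would come from the DVAE decoder rather than being supplied by hand.

```python
import numpy as np

def is_nmf(V, K=4, n_iter=100, seed=0):
    """Fit V (F x N power spectrogram) ~= W @ H with multiplicative
    updates for the Itakura-Saito divergence (classic IS-NMF)."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    eps = 1e-10
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        Vh = W @ H + eps
        # Update basis W, then activations H (standard IS-NMF rules).
        W *= ((V / Vh**2) @ H.T) / ((1.0 / Vh) @ H.T + eps)
        Vh = W @ H + eps
        H *= (W.T @ (V / Vh**2)) / (W.T @ (1.0 / Vh) + eps)
    return W, H

def wiener_gain(speech_var, noise_var):
    """Per-bin Wiener gain: posterior mean factor for the clean STFT
    under zero-mean Gaussian speech and noise variance models."""
    return speech_var / (speech_var + noise_var)
```

Given a noisy complex STFT `X`, an enhanced STFT would then be `wiener_gain(speech_var, W @ H) * X`, with `W @ H` serving as the estimated noise variance.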
Saved in:
Published in: | IEEE/ACM transactions on audio, speech, and language processing, 2022-01, Vol.30, p.2993-3007 |
---|---|
Main authors: | Bie, Xiaoyu ; Leglaive, Simon ; Alameda-Pineda, Xavier ; Girin, Laurent |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Order full text |
container_end_page | 3007 |
---|---|
container_issue | |
container_start_page | 2993 |
container_title | IEEE/ACM transactions on audio, speech, and language processing |
container_volume | 30 |
creator | Bie, Xiaoyu ; Leglaive, Simon ; Alameda-Pineda, Xavier ; Girin, Laurent |
description | Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the benefit of using DVAEs over the VAE for speech spectrogram modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised, noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, thereby exploiting both unsupervised speech representation learning and speech dynamics modeling. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training. |
doi_str_mv | 10.1109/TASLP.2022.3207349 |
format | Article |
fullrecord | <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_2718794361</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9894060</ieee_id><sourcerecordid>2718794361</sourcerecordid><originalsourceid>FETCH-LOGICAL-c373t-cfe48d0c764451d422f11208d17e1b90001a5deeee8b0d2c40a4847e732821623</originalsourceid><addsrcrecordid>eNo9UE1PwkAU3BhNJMgf0EsTTx7Atx90u8cGUUyaaAJ43SzbrSyBbd1tSfj3thZ5lzd5mZnMG4TuMUwwBvG8SpfZ54QAIRNKgFMmrtCAUCLGggK7_sdEwC0ahbADAAxcCM4GaLF2oamMP9pg8mhZGaO30dxtldPmYFwdrYN139HLyamD1WoffSlvVW1L1-K0qUvjdJkbH-7QTaH2wYzOe4jWr_PVbDHOPt7eZ2k21pTTeqwLw5IcNI8Zm-KcEVJgTCDJMTd4I7poapqbdpIN5EQzUCxh3HBKEoJjQofoqffdqr2svD0of5KlsnKRZrK7QfvsNKZwxC33sedWvvxpTKjlrmx8mzxIwnHCBaNxxyI9S_syBG-Kiy0G2RUs_wqWXcHyXHAreuhFto16EYhEMIiB_gKSM3VQ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2718794361</pqid></control><display><type>article</type><title>Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders</title><source>IEEE Electronic Library (IEL)</source><creator>Bie, Xiaoyu ; Leglaive, Simon ; Alameda-Pineda, Xavier ; Girin, Laurent</creator><creatorcontrib>Bie, Xiaoyu ; Leglaive, Simon ; Alameda-Pineda, Xavier ; Girin, Laurent</creatorcontrib><description>Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to model time series of high-dimensional data. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the interest of using DVAEs over the VAE for speech spectrograms modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only requires clean speech signals. 
In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both speech signals unsupervised representation learning and dynamics modeling. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.</description><identifier>ISSN: 2329-9290</identifier><identifier>EISSN: 2329-9304</identifier><identifier>DOI: 10.1109/TASLP.2022.3207349</identifier><identifier>CODEN: ITASFA</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Algorithms ; Artificial Intelligence ; Computer Science ; dynamical variational autoencoders ; Inference algorithms ; Machine Learning ; Modelling ; Noise measurement ; nonnegative matrix factorization ; Recording ; Spectrograms ; Speech ; Speech enhancement ; Speech processing ; Time series analysis ; Time-domain analysis ; Training ; variational inference</subject><ispartof>IEEE/ACM transactions on audio, speech, and language processing, 2022-01, Vol.30, p.2993-3007</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2022</rights><rights>Distributed under a Creative Commons Attribution 4.0 International License</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c373t-cfe48d0c764451d422f11208d17e1b90001a5deeee8b0d2c40a4847e732821623</citedby><cites>FETCH-LOGICAL-c373t-cfe48d0c764451d422f11208d17e1b90001a5deeee8b0d2c40a4847e732821623</cites><orcidid>0000-0002-5354-1084 ; 0000-0002-8219-1298 ; 0000-0003-1480-0538 ; 0000-0002-9214-8760</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9894060$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>230,314,780,784,796,885,27924,27925,54758</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/9894060$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://inria.hal.science/hal-03295630$$DView record in HAL$$Hfree_for_read</backlink></links><search><creatorcontrib>Bie, Xiaoyu</creatorcontrib><creatorcontrib>Leglaive, Simon</creatorcontrib><creatorcontrib>Alameda-Pineda, Xavier</creatorcontrib><creatorcontrib>Girin, Laurent</creatorcontrib><title>Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders</title><title>IEEE/ACM transactions on audio, speech, and language processing</title><addtitle>TASLP</addtitle><description>Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to model time series of high-dimensional data. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the interest of using DVAEs over the VAE for speech spectrograms modeling. 
Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only requires clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both speech signals unsupervised representation learning and dynamics modeling. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.</description><subject>Algorithms</subject><subject>Artificial Intelligence</subject><subject>Computer Science</subject><subject>dynamical variational autoencoders</subject><subject>Inference algorithms</subject><subject>Machine Learning</subject><subject>Modelling</subject><subject>Noise measurement</subject><subject>nonnegative matrix factorization</subject><subject>Recording</subject><subject>Spectrograms</subject><subject>Speech</subject><subject>Speech enhancement</subject><subject>Speech processing</subject><subject>Time series analysis</subject><subject>Time-domain analysis</subject><subject>Training</subject><subject>variational 
inference</subject><issn>2329-9290</issn><issn>2329-9304</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9UE1PwkAU3BhNJMgf0EsTTx7Atx90u8cGUUyaaAJ43SzbrSyBbd1tSfj3thZ5lzd5mZnMG4TuMUwwBvG8SpfZ54QAIRNKgFMmrtCAUCLGggK7_sdEwC0ahbADAAxcCM4GaLF2oamMP9pg8mhZGaO30dxtldPmYFwdrYN139HLyamD1WoffSlvVW1L1-K0qUvjdJkbH-7QTaH2wYzOe4jWr_PVbDHOPt7eZ2k21pTTeqwLw5IcNI8Zm-KcEVJgTCDJMTd4I7poapqbdpIN5EQzUCxh3HBKEoJjQofoqffdqr2svD0of5KlsnKRZrK7QfvsNKZwxC33sedWvvxpTKjlrmx8mzxIwnHCBaNxxyI9S_syBG-Kiy0G2RUs_wqWXcHyXHAreuhFto16EYhEMIiB_gKSM3VQ</recordid><startdate>20220101</startdate><enddate>20220101</enddate><creator>Bie, Xiaoyu</creator><creator>Leglaive, Simon</creator><creator>Alameda-Pineda, Xavier</creator><creator>Girin, Laurent</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><general>Institute of Electrical and Electronics Engineers</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>1XC</scope><scope>VOOES</scope><orcidid>https://orcid.org/0000-0002-5354-1084</orcidid><orcidid>https://orcid.org/0000-0002-8219-1298</orcidid><orcidid>https://orcid.org/0000-0003-1480-0538</orcidid><orcidid>https://orcid.org/0000-0002-9214-8760</orcidid></search><sort><creationdate>20220101</creationdate><title>Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders</title><author>Bie, Xiaoyu ; Leglaive, Simon ; Alameda-Pineda, Xavier ; Girin, 
Laurent</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c373t-cfe48d0c764451d422f11208d17e1b90001a5deeee8b0d2c40a4847e732821623</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Algorithms</topic><topic>Artificial Intelligence</topic><topic>Computer Science</topic><topic>dynamical variational autoencoders</topic><topic>Inference algorithms</topic><topic>Machine Learning</topic><topic>Modelling</topic><topic>Noise measurement</topic><topic>nonnegative matrix factorization</topic><topic>Recording</topic><topic>Spectrograms</topic><topic>Speech</topic><topic>Speech enhancement</topic><topic>Speech processing</topic><topic>Time series analysis</topic><topic>Time-domain analysis</topic><topic>Training</topic><topic>variational inference</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Bie, Xiaoyu</creatorcontrib><creatorcontrib>Leglaive, Simon</creatorcontrib><creatorcontrib>Alameda-Pineda, Xavier</creatorcontrib><creatorcontrib>Girin, Laurent</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Hyper Article en Ligne (HAL)</collection><collection>Hyper Article en Ligne (HAL) (Open Access)</collection><jtitle>IEEE/ACM transactions on audio, speech, and language 
processing</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Bie, Xiaoyu</au><au>Leglaive, Simon</au><au>Alameda-Pineda, Xavier</au><au>Girin, Laurent</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders</atitle><jtitle>IEEE/ACM transactions on audio, speech, and language processing</jtitle><stitle>TASLP</stitle><date>2022-01-01</date><risdate>2022</risdate><volume>30</volume><spage>2993</spage><epage>3007</epage><pages>2993-3007</pages><issn>2329-9290</issn><eissn>2329-9304</eissn><coden>ITASFA</coden><abstract>Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to model time series of high-dimensional data. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the interest of using DVAEs over the VAE for speech spectrograms modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only requires clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both speech signals unsupervised representation learning and dynamics modeling. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. 
The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/TASLP.2022.3207349</doi><tpages>15</tpages><orcidid>https://orcid.org/0000-0002-5354-1084</orcidid><orcidid>https://orcid.org/0000-0002-8219-1298</orcidid><orcidid>https://orcid.org/0000-0003-1480-0538</orcidid><orcidid>https://orcid.org/0000-0002-9214-8760</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2329-9290 |
ispartof | IEEE/ACM transactions on audio, speech, and language processing, 2022-01, Vol.30, p.2993-3007 |
issn | 2329-9290 ; 2329-9304 |
language | eng |
recordid | cdi_proquest_journals_2718794361 |
source | IEEE Electronic Library (IEL) |
subjects | Algorithms ; Artificial Intelligence ; Computer Science ; dynamical variational autoencoders ; Inference algorithms ; Machine Learning ; Modelling ; Noise measurement ; nonnegative matrix factorization ; Recording ; Spectrograms ; Speech ; Speech enhancement ; Speech processing ; Time series analysis ; Time-domain analysis ; Training ; variational inference |
title | Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T02%3A30%3A44IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unsupervised%20Speech%20Enhancement%20Using%20Dynamical%20Variational%20Autoencoders&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Bie,%20Xiaoyu&rft.date=2022-01-01&rft.volume=30&rft.spage=2993&rft.epage=3007&rft.pages=2993-3007&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2022.3207349&rft_dat=%3Cproquest_RIE%3E2718794361%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2718794361&rft_id=info:pmid/&rft_ieee_id=9894060&rfr_iscdi=true |