GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model

IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul (Korea), South Korea. Diffusion models are receiving growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that con...

Detailed description

Bibliographic details
Main authors:	Liu, Haocheng ; Baoueb, Teysir ; Fontaine, Mathieu ; Roux, Jonathan Le ; Richard, Gael
Format:	Article
Language:	eng
Subjects:	Computer Science - Learning ; Computer Science - Sound
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Liu, Haocheng ; Baoueb, Teysir ; Fontaine, Mathieu ; Roux, Jonathan Le ; Richard, Gael
description	IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul (Korea), South Korea. Diffusion models are receiving growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that conditionally uses the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose in this paper a new scheme called GLA-Grad, which consists of introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.
doi_str_mv	10.48550/arxiv.2402.15516
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2402_15516</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2402_15516</sourcerecordid><originalsourceid>FETCH-LOGICAL-a676-c7446449944f11b729357d57da5abacfa41d501680d8912491149688f069bc1a3</originalsourceid><addsrcrecordid>eNotj8FqwzAQRHXpoST9gJ6qH5CrtVeylJtJUrXg0kugR7O2JBDEdlHSkP59k7QwMHN4DDzGHkEWaJSSz5TP6VSUKMsClAJ9zzaubYTL5Fe84S6nGNMk2jTy7fkYJh88_6RTiHMeuQtTyHRM88Q3F-77cF3vsw_7JbuLtD-Eh_9esN3Ldrd-Fe2He1s3rSBdazHUiBrRWsQI0NelrVTtLyFFPQ2RELySoI30xkKJFgCtNiZKbfsBqFqwp7_bm0b3ldNI-ae76nQ3neoX-N5CuQ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model</title><source>arXiv.org</source><creator>Liu, Haocheng ; Baoueb, Teysir ; Fontaine, Mathieu ; Roux, Jonathan Le ; Richard, Gael</creator><creatorcontrib>Liu, Haocheng ; Baoueb, Teysir ; Fontaine, Mathieu ; Roux, Jonathan Le ; Richard, Gael</creatorcontrib><description>IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul (Korea), South Korea Diffusion models are receiving a growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that conditionally uses the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. 
With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose in this paper a new scheme called GLA-Grad, which consists in introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.</description><identifier>DOI: 10.48550/arxiv.2402.15516</identifier><language>eng</language><subject>Computer Science - Learning ; Computer Science - Sound</subject><creationdate>2024-02</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2402.15516$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2402.15516$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Liu, Haocheng</creatorcontrib><creatorcontrib>Baoueb, Teysir</creatorcontrib><creatorcontrib>Fontaine, Mathieu</creatorcontrib><creatorcontrib>Roux, Jonathan Le</creatorcontrib><creatorcontrib>Richard, Gael</creatorcontrib><title>GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model</title><description>IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul (Korea), South Korea Diffusion models are receiving a growing interest for a variety of signal generation tasks such as speech or 
music synthesis. WaveGrad, for example, is a successful diffusion model that conditionally uses the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose in this paper a new scheme called GLA-Grad, which consists in introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.</description><subject>Computer Science - Learning</subject><subject>Computer Science - Sound</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8FqwzAQRHXpoST9gJ6qH5CrtVeylJtJUrXg0kugR7O2JBDEdlHSkP59k7QwMHN4DDzGHkEWaJSSz5TP6VSUKMsClAJ9zzaubYTL5Fe84S6nGNMk2jTy7fkYJh88_6RTiHMeuQtTyHRM88Q3F-77cF3vsw_7JbuLtD-Eh_9esN3Ldrd-Fe2He1s3rSBdazHUiBrRWsQI0NelrVTtLyFFPQ2RELySoI30xkKJFgCtNiZKbfsBqFqwp7_bm0b3ldNI-ae76nQ3neoX-N5CuQ</recordid><startdate>20240209</startdate><enddate>20240209</enddate><creator>Liu, Haocheng</creator><creator>Baoueb, Teysir</creator><creator>Fontaine, Mathieu</creator><creator>Roux, Jonathan Le</creator><creator>Richard, Gael</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240209</creationdate><title>GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model</title><author>Liu, Haocheng ; Baoueb, 
Teysir ; Fontaine, Mathieu ; Roux, Jonathan Le ; Richard, Gael</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a676-c7446449944f11b729357d57da5abacfa41d501680d8912491149688f069bc1a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Learning</topic><topic>Computer Science - Sound</topic><toplevel>online_resources</toplevel><creatorcontrib>Liu, Haocheng</creatorcontrib><creatorcontrib>Baoueb, Teysir</creatorcontrib><creatorcontrib>Fontaine, Mathieu</creatorcontrib><creatorcontrib>Roux, Jonathan Le</creatorcontrib><creatorcontrib>Richard, Gael</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Liu, Haocheng</au><au>Baoueb, Teysir</au><au>Fontaine, Mathieu</au><au>Roux, Jonathan Le</au><au>Richard, Gael</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model</atitle><date>2024-02-09</date><risdate>2024</risdate><abstract>IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul (Korea), South Korea Diffusion models are receiving a growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that conditionally uses the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. 
With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose in this paper a new scheme called GLA-Grad, which consists in introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.</abstract><doi>10.48550/arxiv.2402.15516</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2402.15516
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2402_15516
source	arXiv.org
subjects	Computer Science - Learning ; Computer Science - Sound
title	GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-22T23%3A38%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=GLA-Grad:%20A%20Griffin-Lim%20Extended%20Waveform%20Generation%20Diffusion%20Model&rft.au=Liu,%20Haocheng&rft.date=2024-02-09&rft_id=info:doi/10.48550/arxiv.2402.15516&rft_dat=%3Carxiv_GOX%3E2402_15516%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true