GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model
IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul (Korea), South Korea. Diffusion models are receiving a growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that con...
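Main authors: | Liu, Haocheng; Baoueb, Teysir; Fontaine, Mathieu; Roux, Jonathan Le; Richard, Gael |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Learning; Computer Science - Sound |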
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Liu, Haocheng; Baoueb, Teysir; Fontaine, Mathieu; Roux, Jonathan Le; Richard, Gael |
description | IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul (Korea), South Korea. Diffusion models are attracting growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that uses the mel spectrogram as conditioning to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose a new scheme called GLA-Grad, which introduces a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker. |
doi_str_mv | 10.48550/arxiv.2402.15516 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2402.15516 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2402_15516 |
source | arXiv.org |
subjects | Computer Science - Learning; Computer Science - Sound |
title | GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-22T23%3A38%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=GLA-Grad:%20A%20Griffin-Lim%20Extended%20Waveform%20Generation%20Diffusion%20Model&rft.au=Liu,%20Haocheng&rft.date=2024-02-09&rft_id=info:doi/10.48550/arxiv.2402.15516&rft_dat=%3Carxiv_GOX%3E2402_15516%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
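
The description above mentions running a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the diffusion process. As a rough illustration of what such a phase recovery step does, here is a minimal NumPy sketch of classic Griffin-Lim alone. It is not the authors' implementation: it assumes a linear-magnitude STFT instead of a mel spectrogram, contains no diffusion model, and the function names (`stft`, `istft`, `griffin_lim`, `spectral_error`) and parameters (`n_fft`, `hop`, `n_iter`, `seed`) are hypothetical choices made for this example.

```python
import numpy as np


def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform with a periodic Hann window.

    Assumes len(x) == n_fft + hop * k for some integer k >= 0 (no padding).
    Returns an array of shape (n_frames, n_fft // 2 + 1).
    """
    win = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=256, hop=64):
    """Least-squares inverse STFT: windowed overlap-add divided by sum of win**2."""
    win = np.hanning(n_fft + 1)[:-1]
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        sl = slice(i * hop, i * hop + n_fft)
        x[sl] += frames[i] * win
        norm[sl] += win ** 2
    # Floor the normaliser to avoid dividing by zero where the window vanishes.
    return x / np.maximum(norm, 1e-8)


def griffin_lim(magnitude, n_iter=50, n_fft=256, hop=64, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from the target magnitude with uniformly random phase.
    spec = magnitude * np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Project onto consistent spectrograms (those of an actual signal) ...
        projected = stft(istft(spec, n_fft, hop), n_fft, hop)
        # ... then restore the target magnitude, keeping only the new phase.
        spec = magnitude * np.exp(1j * np.angle(projected))
    return istft(spec, n_fft, hop)


def spectral_error(x, magnitude, n_fft=256, hop=64):
    """Relative distance between the STFT magnitude of x and the target."""
    diff = np.abs(stft(x, n_fft, hop)) - magnitude
    return np.linalg.norm(diff) / np.linalg.norm(magnitude)


if __name__ == "__main__":
    n = np.arange(4096)
    signal = np.sin(2 * np.pi * 440 * n / 8000) + 0.5 * np.sin(2 * np.pi * 880 * n / 8000)
    target = np.abs(stft(signal))
    for iters in (0, 10, 50):
        estimate = griffin_lim(target, n_iter=iters)
        print(iters, spectral_error(estimate, target))
```

Each iteration alternates two projections: one onto spectrograms that correspond to a real waveform, and one back onto the target magnitude. In a scheme like the one the description outlines, a step of this kind would presumably act on the intermediate waveform of the diffusion process; how exactly it is combined with the model is not specified in this record.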