Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data
While many recent any-to-any voice conversion models succeed in transferring some target speech's style information to the converted speech, they still lack the ability to faithfully reproduce the speaking style of the target speaker. In this work, we propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content without requiring text transcriptions or speaker labeling. Our proposed approach introduces an attention mechanism that uses a self-supervised learning (SSL) model to collect a target speaker's speaking styles, each corresponding to different phonetic content. The styles are represented as a set of embeddings called a stylebook. In the next step, the source speech's phonetic content attends over the stylebook to determine the final target style for each piece of source content. Finally, content information extracted from the source speech and content-dependent target style embeddings are fed into a diffusion-based decoder to generate the converted speech mel-spectrogram. Experimental results show that our proposed method combined with a diffusion-based generative model can achieve better speaker similarity in any-to-any voice conversion tasks when compared to baseline models, while the increase in computational complexity with longer utterances is suppressed.
Saved in:
Main authors: Lim, Hyungseob; Byun, Kyungguen; Moon, Sunkuk; Visser, Erik
Format: Article
Language: eng
Subjects: Computer Science - Artificial Intelligence; Computer Science - Sound
Online Access: Order full text
container_end_page | |
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Lim, Hyungseob; Byun, Kyungguen; Moon, Sunkuk; Visser, Erik |
description | While many recent any-to-any voice conversion models succeed in transferring
some target speech's style information to the converted speech, they still lack
the ability to faithfully reproduce the speaking style of the target speaker.
In this work, we propose a novel method to extract rich style information from
target utterances and to efficiently transfer it to source speech content
without requiring text transcriptions or speaker labeling. Our proposed
approach introduces an attention mechanism that uses a self-supervised learning
(SSL) model to collect a target speaker's speaking styles, each corresponding
to different phonetic content. The styles are represented as a set of
embeddings called a stylebook. In the next step, the source speech's phonetic
content attends over the stylebook to determine the final target style for
each piece of source content. Finally, content information extracted
from the source speech and content-dependent target style embeddings are fed
into a diffusion-based decoder to generate the converted speech
mel-spectrogram. Experimental results show that our proposed method combined with
a diffusion-based generative model can achieve better speaker similarity in
any-to-any voice conversion tasks when compared to baseline models, while the
increase in computational complexity with longer utterances is suppressed. |
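
As a rough illustration of the mechanism summarized in the abstract, the sketch below shows one way the two attention steps could be wired together in PyTorch. This is not the authors' implementation: the class and method names, the embedding dimensions, the number of stylebook entries, and the use of generic SSL features are assumptions made for illustration, and the diffusion-based mel-spectrogram decoder is only indicated by a comment.

```python
# Illustrative sketch only (not the authors' code): a minimal PyTorch module
# wiring up the two attention steps described in the abstract. Dimensions,
# names, and the number of stylebook entries are assumptions.
import torch
import torch.nn as nn

class StylebookSketch(nn.Module):
    def __init__(self, ssl_dim=768, style_dim=256, n_entries=64, n_heads=4):
        super().__init__()
        # Learnable queries, one per stylebook entry; after attending over the
        # target speaker's SSL features they hold content-dependent styles.
        self.queries = nn.Parameter(torch.randn(n_entries, style_dim))
        self.proj_target = nn.Linear(ssl_dim, style_dim)
        self.proj_source = nn.Linear(ssl_dim, style_dim)
        # Step 1: collect the target speaker's styles into the stylebook.
        self.collect = nn.MultiheadAttention(style_dim, n_heads, batch_first=True)
        # Step 2: let each source content frame pick its matching style.
        self.match = nn.MultiheadAttention(style_dim, n_heads, batch_first=True)

    def build_stylebook(self, target_ssl):
        # target_ssl: (B, T_tgt, ssl_dim) features of the target utterances
        kv = self.proj_target(target_ssl)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        stylebook, _ = self.collect(q, kv, kv)       # (B, n_entries, style_dim)
        return stylebook

    def forward(self, source_ssl, target_ssl):
        # source_ssl: (B, T_src, ssl_dim) content features of the source speech
        stylebook = self.build_stylebook(target_ssl)
        content = self.proj_source(source_ssl)        # (B, T_src, style_dim)
        style, _ = self.match(content, stylebook, stylebook)
        # In the paper, content plus the per-frame target style conditions a
        # diffusion-based decoder that generates the converted mel-spectrogram.
        return torch.cat([content, style], dim=-1)    # (B, T_src, 2 * style_dim)

# Shape check only, with random tensors standing in for real SSL features.
model = StylebookSketch()
source = torch.randn(1, 200, 768)     # 200 source frames
target = torch.randn(1, 600, 768)     # target speaker's reference speech
conditioning = model(source, target)  # torch.Size([1, 200, 512])
```

Because each source frame attends over a fixed number of stylebook entries rather than over every target frame, the attention cost scales with the source length times the stylebook size, which is consistent with the abstract's claim that the growth in computational complexity for longer utterances is suppressed.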
doi_str_mv | 10.48550/arxiv.2309.02730 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2309.02730 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2309_02730 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Sound |
title | Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-03T02%3A09%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Stylebook:%20Content-Dependent%20Speaking%20Style%20Modeling%20for%20Any-to-Any%20Voice%20Conversion%20using%20Only%20Speech%20Data&rft.au=Lim,%20Hyungseob&rft.date=2023-09-06&rft_id=info:doi/10.48550/arxiv.2309.02730&rft_dat=%3Carxiv_GOX%3E2309_02730%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |