Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data

While many recent any-to-any voice conversion models succeed in transferring some of the target speech's style information to the converted speech, they still lack the ability to faithfully reproduce the speaking style of the target speaker. In this work, we propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content without requiring text transcriptions or speaker labeling. Our proposed approach introduces an attention mechanism that uses a self-supervised learning (SSL) model to collect the speaking styles of a target speaker, each corresponding to different phonetic content. The styles are represented by a set of embeddings called a stylebook. In the next step, the source speech's phonetic content attends to the stylebook to determine the final target style for each piece of source content. Finally, content information extracted from the source speech and the content-dependent target style embeddings are fed into a diffusion-based decoder to generate the converted speech mel-spectrogram. Experimental results show that our proposed method, combined with a diffusion-based generative model, achieves better speaker similarity in any-to-any voice conversion tasks than baseline models, while the increase in computational complexity with longer utterances is suppressed.
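
As a rough illustration of the stylebook mechanism described in the abstract, the sketch below builds a fixed-size set of style embeddings by attending over target-speaker SSL features and then lets each source content frame query that set. This is not the authors' implementation: the module name, tensor shapes, single-head attention, and the use of the stylebook entries as both keys and values are simplifying assumptions, and the SSL front end and diffusion decoder are omitted.

# Minimal sketch (illustrative assumptions only), in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StylebookAttention(nn.Module):
    def __init__(self, dim: int = 256, num_entries: int = 64):
        super().__init__()
        # Learnable queries, one per stylebook entry; each is intended to
        # latch onto a different region of phonetic content in target speech.
        self.entry_queries = nn.Parameter(torch.randn(num_entries, dim))
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def build_stylebook(self, target_feats: torch.Tensor) -> torch.Tensor:
        # target_feats: (T_tgt, dim) SSL features of the target utterances.
        # Returns a (num_entries, dim) stylebook, independent of T_tgt.
        k = self.to_key(target_feats)
        v = self.to_value(target_feats)
        attn = F.softmax(self.entry_queries @ k.t() * self.scale, dim=-1)
        return attn @ v

    def query_style(self, source_content: torch.Tensor,
                    stylebook: torch.Tensor) -> torch.Tensor:
        # source_content: (T_src, dim) content features of the source speech.
        # Returns one content-dependent style vector per source frame.
        attn = F.softmax(source_content @ stylebook.t() * self.scale, dim=-1)
        return attn @ stylebook

# Usage sketch: the per-frame styles would be combined with the source
# content and passed to a diffusion-based decoder (not shown) that predicts
# the converted mel-spectrogram.
module = StylebookAttention()
target_feats = torch.randn(1200, 256)   # hypothetical target SSL features
source_content = torch.randn(400, 256)  # hypothetical source content features
stylebook = module.build_stylebook(target_feats)
styles = module.query_style(source_content, stylebook)   # (400, 256)

Because the stylebook has a fixed number of entries, the source-side attention cost does not grow with the length of the target utterances, which is in line with the abstract's remark that the increase in computational complexity with longer utterances is suppressed.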

Bibliographic Details

Main authors: Lim, Hyungseob; Byun, Kyungguen; Moon, Sunkuk; Visser, Erik
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Sound
DOI: 10.48550/arxiv.2309.02730
Published: 2023-09-06
Source: arXiv.org
Full text: https://arxiv.org/abs/2309.02730