Speaker Adaptive Text-to-Speech with Timbre-Normalized Vector-Quantized Feature

Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with a limited amount of data is a challenging task. Most existing methods only consider adapting to the timbre of the target speakers but fail to capture their speaking styles from little data. In this work, we propose a novel TTS system, TN-VQTTS, which leverages timbre-normalized vector-quantized (TN-VQ) acoustic features for speaker adaptation with little data. With the TN-VQ feature, speaking style and timbre can be effectively decomposed and controlled separately by the acoustic model and the vocoder of VQTTS. Such decomposition enables us to closely mimic both characteristics of the target speaker in adaptation with little data. Specifically, we first reduce the dimensionality of the self-supervised VQ acoustic features via PCA and normalize their timbre with a normalizing flow model. The features are then quantized with k-means and used as the TN-VQ features for a multi-speaker VQTTS system. Furthermore, we optimize timbre-independent style embeddings of the training speakers jointly with the acoustic model and store them in a lookup table. The embedding table later serves as a selectable codebook or a set of basis vectors for representing the styles of unseen speakers. Our experiments on the LibriTTS dataset first show that the proposed model architecture for VQ features achieves better performance in multi-speaker text-to-speech synthesis than several existing methods. We also find that reconstruction performance and naturalness are almost unchanged after applying timbre normalization and k-means quantization. Finally, we show that TN-VQTTS achieves better speaker similarity in adaptation than both a speaker-embedding-based adaptation method and the fine-tuning-based baseline AdaSpeech.
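The abstract outlines a concrete feature pipeline: PCA for dimensionality reduction, a normalizing flow for timbre normalization, and k-means for quantization. Below is a minimal sketch of that pipeline, assuming precomputed self-supervised acoustic features; the `flow` callable, feature dimensions, and cluster count are illustrative assumptions, not values from the paper.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def build_tn_vq_features(feats, flow, n_components=128, n_clusters=512):
    """Sketch of the TN-VQ feature extraction described in the abstract.

    feats: [n_frames, feat_dim] self-supervised features pooled over speakers.
    flow:  hypothetical callable mapping features into a timbre-normalized space.
    """
    # Step 1: reduce dimensionality with PCA.
    reduced = PCA(n_components=n_components).fit_transform(feats)
    # Step 2: remove timbre with the normalizing flow (assumed interface).
    normalized = flow(reduced)
    # Step 3: quantize with k-means; cluster indices serve as TN-VQ tokens.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(normalized)
    return kmeans.predict(normalized), kmeans.cluster_centers_
```

In this decomposition, the acoustic model predicts the timbre-free token sequence while the vocoder reintroduces the target speaker's timbre, which is what allows style and timbre to be adapted separately.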

Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-12
Main Authors: Du, Chenpeng, Guo, Yiwei, Chen, Xie, Yu, Kai
Format: Article
Language: English
Subjects:
Online Access: Order full text
container_end_page 12
container_issue
container_start_page 1
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 31
creator Du, Chenpeng
Guo, Yiwei
Chen, Xie
Yu, Kai
description Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with a limited amount of data is a challenging task. Most existing methods only consider adapting to the timbre of the target speakers but fail to capture their speaking styles from little data. In this work, we propose a novel TTS system, TN-VQTTS, which leverages timbre-normalized vector-quantized (TN-VQ) acoustic features for speaker adaptation with little data. With the TN-VQ feature, speaking style and timbre can be effectively decomposed and controlled separately by the acoustic model and the vocoder of VQTTS. Such decomposition enables us to closely mimic both characteristics of the target speaker in adaptation with little data. Specifically, we first reduce the dimensionality of the self-supervised VQ acoustic features via PCA and normalize their timbre with a normalizing flow model. The features are then quantized with k-means and used as the TN-VQ features for a multi-speaker VQTTS system. Furthermore, we optimize timbre-independent style embeddings of the training speakers jointly with the acoustic model and store them in a lookup table. The embedding table later serves as a selectable codebook or a set of basis vectors for representing the styles of unseen speakers. Our experiments on the LibriTTS dataset first show that the proposed model architecture for VQ features achieves better performance in multi-speaker text-to-speech synthesis than several existing methods. We also find that reconstruction performance and naturalness are almost unchanged after applying timbre normalization and k-means quantization. Finally, we show that TN-VQTTS achieves better speaker similarity in adaptation than both a speaker-embedding-based adaptation method and the fine-tuning-based baseline AdaSpeech.
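The description's final mechanism, a style-embedding lookup table reused as a codebook or basis for unseen speakers, can be illustrated with a short PyTorch sketch. The softmax-weighted combination below is one plausible realization under stated assumptions, not necessarily the authors' exact formulation.

```python
import torch

class StyleCodebook(torch.nn.Module):
    def __init__(self, n_speakers: int, dim: int):
        super().__init__()
        # One style embedding per training speaker, learned jointly
        # with the acoustic model (as the description states).
        self.table = torch.nn.Embedding(n_speakers, dim)

    def lookup(self, speaker_id: torch.Tensor) -> torch.Tensor:
        # Seen training speaker: direct table lookup.
        return self.table(speaker_id)

    def combine(self, logits: torch.Tensor) -> torch.Tensor:
        # Unseen speaker: a convex combination of training-speaker styles;
        # the per-speaker logits would be estimated from adaptation data
        # (this weighting scheme is an assumption for illustration).
        weights = torch.softmax(logits, dim=-1)   # [n_speakers]
        return weights @ self.table.weight        # [dim]
```

On this reading, adapting to a new speaker's style only requires estimating one scalar weight per training speaker, rather than fine-tuning the whole acoustic model.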
doi_str_mv 10.1109/TASLP.2023.3308374
format Article
fullrecord (raw ProQuest/IEEE XML record omitted; recoverable fields below)
publisher Piscataway: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023
coden ITASFA
ieee_id 10229489
link https://ieeexplore.ieee.org/document/10229489
orcidid 0000-0001-7423-617X
0000-0002-7102-9826
0000-0001-5329-0847
0009-0003-8114-2085
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-12
issn 2329-9290
2329-9304
language eng
recordid cdi_ieee_primary_10229489
source IEEE/IET Electronic Library (IEL)
subjects Acoustics
Adaptation
Adaptation models
Decomposition
Embedding
Feature extraction
Lookup tables
Similarity
speaker adaptation
Speaking
Speech processing
Speech recognition
speech synthesis
Timbre
timbre normalization
Training
vector quantization
Vocoders
title Speaker Adaptive Text-to-Speech with Timbre-Normalized Vector-Quantized Feature
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T16%3A10%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Speaker%20Adaptive%20Text-to-Speech%20with%20Timbre-Normalized%20Vector-Quantized%20Feature&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Du,%20Chenpeng&rft.date=2023-01-01&rft.volume=31&rft.spage=1&rft.epage=12&rft.pages=1-12&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2023.3308374&rft_dat=%3Cproquest_RIE%3E2881501344%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2881501344&rft_id=info:pmid/&rft_ieee_id=10229489&rfr_iscdi=true