Speaker Adaptive Text-to-Speech with Timbre-Normalized Vector-Quantized Feature

Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with a limited amount of data is a challenging task. Most existing methods only consider adapting to the timbre of the target speakers but fail to capture their speaking styles from little data. In this work, we propose a novel TTS system, TN-VQTTS, which leverages timbre-normalized vector-quantized (TN-VQ) acoustic features for speaker adaptation with little data. With the TN-VQ feature, speaking style and timbre can be effectively decomposed and controlled separately by the acoustic model and the vocoder of VQTTS. Such decomposition enables us to closely mimic both characteristics of the target speaker in adaptation with little data. Specifically, we first reduce the dimensionality of the self-supervised VQ acoustic features via PCA and normalize their timbre with a normalizing flow model. The features are then quantized with k-means and used as the TN-VQ features for a multi-speaker VQTTS system. Furthermore, we optimize timbre-independent style embeddings of the training speakers jointly with the acoustic model and store them in a lookup table. The embedding table later serves as a selectable codebook or a set of basis vectors for representing the styles of unseen speakers. Our experiments on the LibriTTS dataset first show that the proposed model architecture for VQ features achieves better performance in multi-speaker text-to-speech synthesis than several existing methods. We also find that reconstruction performance and naturalness are almost unchanged after applying timbre normalization and k-means quantization. Finally, we show that TN-VQTTS achieves better speaker similarity in adaptation than both a speaker-embedding-based adaptation method and the fine-tuning-based baseline AdaSpeech.
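The abstract outlines a concrete feature pipeline: PCA for dimensionality reduction, a normalizing flow for timbre normalization, and k-means for quantization. Below is a minimal sketch of that pipeline, assuming precomputed self-supervised acoustic features; the `flow` callable, feature dimensions, and cluster count are illustrative assumptions, not values from the paper.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def build_tn_vq_features(feats, flow, n_components=128, n_clusters=512):
    """Sketch of the TN-VQ feature extraction described in the abstract.

    feats: [n_frames, feat_dim] self-supervised features pooled over speakers.
    flow:  hypothetical callable mapping features into a timbre-normalized space.
    """
    # Step 1: reduce dimensionality with PCA.
    reduced = PCA(n_components=n_components).fit_transform(feats)
    # Step 2: remove timbre with the normalizing flow (assumed interface).
    normalized = flow(reduced)
    # Step 3: quantize with k-means; cluster indices serve as TN-VQ tokens.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(normalized)
    return kmeans.predict(normalized), kmeans.cluster_centers_
```

In this decomposition, the acoustic model predicts the timbre-free token sequence while the vocoder reintroduces the target speaker's timbre, which is what allows style and timbre to be adapted separately.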

Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-12
Main Authors: Du, Chenpeng, Guo, Yiwei, Chen, Xie, Yu, Kai
Format: Article
Language: English
Subjects:
Online Access: Order full text
container_end_page 12
container_issue
container_start_page 1
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 31
creator Du, Chenpeng
Guo, Yiwei
Chen, Xie
Yu, Kai
description Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with a limited amount of data is a challenging task. Most existing methods only consider adapting to the timbre of the target speakers but fail to capture their speaking styles from little data. In this work, we propose a novel TTS system, TN-VQTTS, which leverages timbre-normalized vector-quantized (TN-VQ) acoustic features for speaker adaptation with little data. With the TN-VQ feature, speaking style and timbre can be effectively decomposed and controlled separately by the acoustic model and the vocoder of VQTTS. Such decomposition enables us to closely mimic both characteristics of the target speaker in adaptation with little data. Specifically, we first reduce the dimensionality of the self-supervised VQ acoustic features via PCA and normalize their timbre with a normalizing flow model. The features are then quantized with k-means and used as the TN-VQ features for a multi-speaker VQTTS system. Furthermore, we optimize timbre-independent style embeddings of the training speakers jointly with the acoustic model and store them in a lookup table. The embedding table later serves as a selectable codebook or a set of basis vectors for representing the styles of unseen speakers. Our experiments on the LibriTTS dataset first show that the proposed model architecture for VQ features achieves better performance in multi-speaker text-to-speech synthesis than several existing methods. We also find that reconstruction performance and naturalness are almost unchanged after applying timbre normalization and k-means quantization. Finally, we show that TN-VQTTS achieves better speaker similarity in adaptation than both a speaker-embedding-based adaptation method and the fine-tuning-based baseline AdaSpeech.
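The description's final mechanism, a style-embedding lookup table reused as a codebook or basis for unseen speakers, can be illustrated with a short PyTorch sketch. The softmax-weighted combination below is one plausible realization under stated assumptions, not necessarily the authors' exact formulation.

```python
import torch

class StyleCodebook(torch.nn.Module):
    def __init__(self, n_speakers: int, dim: int):
        super().__init__()
        # One style embedding per training speaker, learned jointly
        # with the acoustic model (as the description states).
        self.table = torch.nn.Embedding(n_speakers, dim)

    def lookup(self, speaker_id: torch.Tensor) -> torch.Tensor:
        # Seen training speaker: direct table lookup.
        return self.table(speaker_id)

    def combine(self, logits: torch.Tensor) -> torch.Tensor:
        # Unseen speaker: a convex combination of training-speaker styles;
        # the per-speaker logits would be estimated from adaptation data
        # (this weighting scheme is an assumption for illustration).
        weights = torch.softmax(logits, dim=-1)   # [n_speakers]
        return weights @ self.table.weight        # [dim]
```

On this reading, adapting to a new speaker's style only requires estimating one scalar weight per training speaker, rather than fine-tuning the whole acoustic model.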
doi_str_mv 10.1109/TASLP.2023.3308374
format Article
fullrecord (raw ProQuest/IEEE XML record omitted; recoverable fields below)
publisher Piscataway: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023
coden ITASFA
ieee_id 10229489
link https://ieeexplore.ieee.org/document/10229489
orcidid 0000-0001-7423-617X
0000-0002-7102-9826
0000-0001-5329-0847
0009-0003-8114-2085
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-12
issn 2329-9290
2329-9304
language eng
recordid cdi_ieee_primary_10229489
source IEEE/IET Electronic Library (IEL)
subjects Acoustics
Adaptation
Adaptation models
Decomposition
Embedding
Feature extraction
Lookup tables
Similarity
speaker adaptation
Speaking
Speech processing
Speech recognition
speech synthesis
Timbre
timbre normalization
Training
vector quantization
Vocoders
title Speaker Adaptive Text-to-Speech with Timbre-Normalized Vector-Quantized Feature
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T16%3A10%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Speaker%20Adaptive%20Text-to-Speech%20with%20Timbre-Normalized%20Vector-Quantized%20Feature&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Du,%20Chenpeng&rft.date=2023-01-01&rft.volume=31&rft.spage=1&rft.epage=12&rft.pages=1-12&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2023.3308374&rft_dat=%3Cproquest_RIE%3E2881501344%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2881501344&rft_id=info:pmid/&rft_ieee_id=10229489&rfr_iscdi=true