Incremental FastPitch: Chunk-based High Quality Text to Speech

Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process compared with conventional auto-regressive models. Although parallel models have benefits in many aspects, they become naturally unfit for i...

Detailed description

Bibliographic details
Main authors:	Du, Muyang; Liu, Chuan; Lai, Junjie
Format:	Article
Language:	eng
Subjects:	Computer Science - Artificial Intelligence; Computer Science - Sound
Online access:	Order full text

creator	Du, Muyang; Liu, Chuan; Lai, Junjie
description	Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process than conventional auto-regressive models. Although parallel models have benefits in many aspects, their fully parallel architecture, such as the transformer, makes them naturally unfit for incremental synthesis. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and running inference with fixed-size past model states. Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch, with significantly lower latency that allows even lower response time for real-time speech applications.
doi_str_mv	10.48550/arxiv.2401.01755
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2401_01755</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2401_01755</sourcerecordid><originalsourceid>FETCH-LOGICAL-a675-48f2858dfb18ff51b8b19abb9a45dd159604dc47c86c3d250faf48c47948e0323</originalsourceid><addsrcrecordid>eNotz81Kw0AUhuHZdCGtF-DKuYHEmWROcuJCkNDaQkHF7MOZPzOYpiWZSnv3auvqg3fxwcPYnRSpQgDxQOMpfKeZEjIVsgS4YU-bwYxu54ZIPV_RFN9CNN0jr7vj8JVompzl6_DZ8fcj9SGeeeNOkcc9_zg4Z7oFm3nqJ3f7v3PWrJZNvU62ry-b-nmbUFFCotBnCGi9lug9SI1aVqR1RQqslVAVQlmjSoOFyW0GwpNX-BsqhU7kWT5n99fbC6A9jGFH47n9g7QXSP4DCuJCTw</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Incremental FastPitch: Chunk-based High Quality Text to Speech</title><source>arXiv.org</source><creator>Du, Muyang ; Liu, Chuan ; Lai, Junjie</creator><creatorcontrib>Du, Muyang ; Liu, Chuan ; Lai, Junjie</creatorcontrib><description>Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process compared with conventional auto-regressive models. Although parallel models have benefits in many aspects, they become naturally unfit for incremental synthesis due to their fully parallel architecture such as transformer. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and inference with fixed size past model states. 
Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch, with a significant lower latency that allows even lower response time for real-time speech applications.</description><identifier>DOI: 10.48550/arxiv.2401.01755</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Sound</subject><creationdate>2024-01</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,781,886</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2401.01755$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2401.01755$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Du, Muyang</creatorcontrib><creatorcontrib>Liu, Chuan</creatorcontrib><creatorcontrib>Lai, Junjie</creatorcontrib><title>Incremental FastPitch: Chunk-based High Quality Text to Speech</title><description>Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process compared with conventional auto-regressive models. Although parallel models have benefits in many aspects, they become naturally unfit for incremental synthesis due to their fully parallel architecture such as transformer. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and inference with fixed size past model states. 
Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch, with a significant lower latency that allows even lower response time for real-time speech applications.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Sound</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz81Kw0AUhuHZdCGtF-DKuYHEmWROcuJCkNDaQkHF7MOZPzOYpiWZSnv3auvqg3fxwcPYnRSpQgDxQOMpfKeZEjIVsgS4YU-bwYxu54ZIPV_RFN9CNN0jr7vj8JVompzl6_DZ8fcj9SGeeeNOkcc9_zg4Z7oFm3nqJ3f7v3PWrJZNvU62ry-b-nmbUFFCotBnCGi9lug9SI1aVqR1RQqslVAVQlmjSoOFyW0GwpNX-BsqhU7kWT5n99fbC6A9jGFH47n9g7QXSP4DCuJCTw</recordid><startdate>20240103</startdate><enddate>20240103</enddate><creator>Du, Muyang</creator><creator>Liu, Chuan</creator><creator>Lai, Junjie</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240103</creationdate><title>Incremental FastPitch: Chunk-based High Quality Text to Speech</title><author>Du, Muyang ; Liu, Chuan ; Lai, Junjie</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a675-48f2858dfb18ff51b8b19abb9a45dd159604dc47c86c3d250faf48c47948e0323</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Sound</topic><toplevel>online_resources</toplevel><creatorcontrib>Du, Muyang</creatorcontrib><creatorcontrib>Liu, Chuan</creatorcontrib><creatorcontrib>Lai, Junjie</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Du, Muyang</au><au>Liu, Chuan</au><au>Lai, 
Junjie</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Incremental FastPitch: Chunk-based High Quality Text to Speech</atitle><date>2024-01-03</date><risdate>2024</risdate><abstract>Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process compared with conventional auto-regressive models. Although parallel models have benefits in many aspects, they become naturally unfit for incremental synthesis due to their fully parallel architecture such as transformer. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and inference with fixed size past model states. Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch, with a significant lower latency that allows even lower response time for real-time speech applications.</abstract><doi>10.48550/arxiv.2401.01755</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2401.01755
language	eng
recordid	cdi_arxiv_primary_2401_01755
source	arXiv.org
subjects	Computer Science - Artificial Intelligence; Computer Science - Sound
title	Incremental FastPitch: Chunk-based High Quality Text to Speech
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-16T04%3A01%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Incremental%20FastPitch:%20Chunk-based%20High%20Quality%20Text%20to%20Speech&rft.au=Du,%20Muyang&rft.date=2024-01-03&rft_id=info:doi/10.48550/arxiv.2401.01755&rft_dat=%3Carxiv_GOX%3E2401_01755%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
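
The description field mentions training with receptive-field constrained chunk attention masks but gives no details. The following is a minimal sketch of one plausible reading, not the authors' implementation: each frame may attend to every frame in its own chunk and to a limited number of preceding chunks. The function name `chunk_attention_mask` and the parameters `num_frames`, `chunk_size`, and `past_chunks` are hypothetical and chosen only for illustration.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks):
    """Build a boolean mask where mask[i][j] is True if query frame i
    may attend to key frame j.

    Assumption (not taken from the record): a frame sees its own chunk
    plus at most `past_chunks` earlier chunks, and never a future chunk.
    """
    if num_frames < 0 or chunk_size <= 0 or past_chunks < 0:
        raise ValueError("num_frames >= 0, chunk_size > 0, past_chunks >= 0 required")

    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size
        row = []
        for j in range(num_frames):
            key_chunk = j // chunk_size
            # Allowed if the key chunk is the current chunk or one of the
            # `past_chunks` chunks immediately before it.
            row.append(query_chunk - past_chunks <= key_chunk <= query_chunk)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Six frames, chunks of two frames, one past chunk visible.
    for row in chunk_attention_mask(num_frames=6, chunk_size=2, past_chunks=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Because the mask bounds how far back any frame can look, inference under this assumption would only need to keep a fixed amount of past state per layer, which is consistent with the "fixed size past model states" the description refers to.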