Incremental FastPitch: Chunk-based High Quality Text to Speech
Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process compared with conventional auto-regressive models. Although parallel models have benefits in many aspects, they become naturally unfit for i...
Main authors: | Du, Muyang ; Liu, Chuan ; Lai, Junjie |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Artificial Intelligence ; Computer Science - Sound |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Du, Muyang ; Liu, Chuan ; Lai, Junjie |
description | Parallel text-to-speech models have been widely applied for real-time speech
synthesis, and they offer more controllability and a much faster synthesis
process compared with conventional auto-regressive models. Although parallel
models have benefits in many aspects, they become naturally unfit for
incremental synthesis because of their fully parallel architectures, such as
the transformer. In this work, we propose Incremental FastPitch, a novel FastPitch
variant capable of incrementally producing high-quality Mel chunks by improving
the architecture with chunk-based FFT blocks, training with receptive-field
constrained chunk attention masks, and running inference with fixed-size past model
states. Experimental results show that our proposal can produce speech quality
comparable to the parallel FastPitch, with significantly lower latency that
allows even lower response time for real-time speech applications. |
doi_str_mv | 10.48550/arxiv.2401.01755 |
format | Article |
fullrecord | <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2401_01755</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2401_01755</sourcerecordid><originalsourceid>FETCH-LOGICAL-a675-48f2858dfb18ff51b8b19abb9a45dd159604dc47c86c3d250faf48c47948e0323</originalsourceid><addsrcrecordid>eNotz81Kw0AUhuHZdCGtF-DKuYHEmWROcuJCkNDaQkHF7MOZPzOYpiWZSnv3auvqg3fxwcPYnRSpQgDxQOMpfKeZEjIVsgS4YU-bwYxu54ZIPV_RFN9CNN0jr7vj8JVompzl6_DZ8fcj9SGeeeNOkcc9_zg4Z7oFm3nqJ3f7v3PWrJZNvU62ry-b-nmbUFFCotBnCGi9lug9SI1aVqR1RQqslVAVQlmjSoOFyW0GwpNX-BsqhU7kWT5n99fbC6A9jGFH47n9g7QXSP4DCuJCTw</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Incremental FastPitch: Chunk-based High Quality Text to Speech</title><source>arXiv.org</source><creator>Du, Muyang ; Liu, Chuan ; Lai, Junjie</creator><creatorcontrib>Du, Muyang ; Liu, Chuan ; Lai, Junjie</creatorcontrib><description>Parallel text-to-speech models have been widely applied for real-time speech
synthesis, and they offer more controllability and a much faster synthesis
process compared with conventional auto-regressive models. Although parallel
models have benefits in many aspects, they become naturally unfit for
incremental synthesis because of their fully parallel architectures, such as
the transformer. In this work, we propose Incremental FastPitch, a novel FastPitch
variant capable of incrementally producing high-quality Mel chunks by improving
the architecture with chunk-based FFT blocks, training with receptive-field
constrained chunk attention masks, and running inference with fixed-size past model
states. Experimental results show that our proposal can produce speech quality
comparable to the parallel FastPitch, with significantly lower latency that
allows even lower response time for real-time speech applications.</description><identifier>DOI: 10.48550/arxiv.2401.01755</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Sound</subject><creationdate>2024-01</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,781,886</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2401.01755$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2401.01755$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Du, Muyang</creatorcontrib><creatorcontrib>Liu, Chuan</creatorcontrib><creatorcontrib>Lai, Junjie</creatorcontrib><title>Incremental FastPitch: Chunk-based High Quality Text to Speech</title><description>Parallel text-to-speech models have been widely applied for real-time speech
synthesis, and they offer more controllability and a much faster synthesis
process compared with conventional auto-regressive models. Although parallel
models have benefits in many aspects, they become naturally unfit for
incremental synthesis because of their fully parallel architectures, such as
the transformer. In this work, we propose Incremental FastPitch, a novel FastPitch
variant capable of incrementally producing high-quality Mel chunks by improving
the architecture with chunk-based FFT blocks, training with receptive-field
constrained chunk attention masks, and running inference with fixed-size past model
states. Experimental results show that our proposal can produce speech quality
comparable to the parallel FastPitch, with significantly lower latency that
allows even lower response time for real-time speech applications.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Sound</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz81Kw0AUhuHZdCGtF-DKuYHEmWROcuJCkNDaQkHF7MOZPzOYpiWZSnv3auvqg3fxwcPYnRSpQgDxQOMpfKeZEjIVsgS4YU-bwYxu54ZIPV_RFN9CNN0jr7vj8JVompzl6_DZ8fcj9SGeeeNOkcc9_zg4Z7oFm3nqJ3f7v3PWrJZNvU62ry-b-nmbUFFCotBnCGi9lug9SI1aVqR1RQqslVAVQlmjSoOFyW0GwpNX-BsqhU7kWT5n99fbC6A9jGFH47n9g7QXSP4DCuJCTw</recordid><startdate>20240103</startdate><enddate>20240103</enddate><creator>Du, Muyang</creator><creator>Liu, Chuan</creator><creator>Lai, Junjie</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240103</creationdate><title>Incremental FastPitch: Chunk-based High Quality Text to Speech</title><author>Du, Muyang ; Liu, Chuan ; Lai, Junjie</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a675-48f2858dfb18ff51b8b19abb9a45dd159604dc47c86c3d250faf48c47948e0323</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Sound</topic><toplevel>online_resources</toplevel><creatorcontrib>Du, Muyang</creatorcontrib><creatorcontrib>Liu, Chuan</creatorcontrib><creatorcontrib>Lai, Junjie</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Du, Muyang</au><au>Liu, Chuan</au><au>Lai, Junjie</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Incremental FastPitch: Chunk-based High Quality Text to 
Speech</atitle><date>2024-01-03</date><risdate>2024</risdate><abstract>Parallel text-to-speech models have been widely applied for real-time speech
synthesis, and they offer more controllability and a much faster synthesis
process compared with conventional auto-regressive models. Although parallel
models have benefits in many aspects, they become naturally unfit for
incremental synthesis because of their fully parallel architectures, such as
the transformer. In this work, we propose Incremental FastPitch, a novel FastPitch
variant capable of incrementally producing high-quality Mel chunks by improving
the architecture with chunk-based FFT blocks, training with receptive-field
constrained chunk attention masks, and running inference with fixed-size past model
states. Experimental results show that our proposal can produce speech quality
comparable to the parallel FastPitch, with significantly lower latency that
allows even lower response time for real-time speech applications.</abstract><doi>10.48550/arxiv.2401.01755</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2401.01755 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2401_01755 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence ; Computer Science - Sound |
title | Incremental FastPitch: Chunk-based High Quality Text to Speech |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-16T04%3A01%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Incremental%20FastPitch:%20Chunk-based%20High%20Quality%20Text%20to%20Speech&rft.au=Du,%20Muyang&rft.date=2024-01-03&rft_id=info:doi/10.48550/arxiv.2401.01755&rft_dat=%3Carxiv_GOX%3E2401_01755%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
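
The description mentions training with receptive-field constrained chunk attention masks, but this record does not say how the masks are built. The sketch below is one plausible reading, not the authors' implementation: each frame may attend to its own chunk and to a limited number of preceding chunks, never to future ones. The function name `chunk_attention_mask` and the parameters `chunk_size` and `past_chunks` are hypothetical and chosen only for illustration.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks):
    """Build a boolean attention mask for chunk-wise processing.

    mask[q][k] is True when query frame q may attend to key frame k.
    Assumption (not taken from the record): a frame sees its own chunk
    plus at most `past_chunks` earlier chunks, and no future chunks.
    """
    if num_frames < 0 or chunk_size <= 0 or past_chunks < 0:
        raise ValueError("num_frames >= 0, chunk_size > 0, past_chunks >= 0 required")

    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size  # index of the chunk containing the query frame
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size  # index of the chunk containing the key frame
            # Allowed only inside the window [q_chunk - past_chunks, q_chunk].
            row.append(q_chunk - past_chunks <= k_chunk <= q_chunk)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Six frames, chunks of two frames, one past chunk visible.
    for row in chunk_attention_mask(6, 2, 1):
        print("".join("x" if allowed else "." for allowed in row))
```

Because the window of visible past chunks is bounded, the amount of past state needed at inference time stays constant as synthesis proceeds, which is consistent with the "fixed size past model states" the description refers to.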