SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model


Bibliographic details
Published in: arXiv.org 2021-06
Main authors: Casanova, Edresson ; Shulby, Christopher ; Gölge, Eren ; Müller, Nicolas Michael ; Santos de Oliveira, Frederico ; Arnaldo Candido Junior ; Anderson da Silva Soares ; Aluisio, Sandra Maria ; Moacir Antonelli Ponti
Format: Article
Language: eng
Subjects: Coders ; Similarity ; Spectrograms ; Speech recognition ; Training
Online access: Full text
container_title arXiv.org
creator Casanova, Edresson
Shulby, Christopher
Gölge, Eren
Müller, Nicolas Michael
Santos de Oliveira, Frederico
Arnaldo Candido Junior
Anderson da Silva Soares
Aluisio, Sandra Maria
Moacir Antonelli Ponti
description In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2512175698</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2512175698</sourcerecordid><originalsourceid>FETCH-proquest_journals_25121756983</originalsourceid><addsrcrecordid>eNqNikELgjAYQEcQJOV_GHQe6Den1lWsLp62UxcR-0RtONsm9fMr6Ad0ejzeW5EAOI9ZngBsSOjcGEURpBkIwQNSyoKdtXkqJY-0mWjZdUM74OTpFa1hsjeeVov2A5MzNne0VOHLM2W-jm1PK3NDvSPrrtEOwx-3ZH8qVXFhszWPBZ2vR7PY6ZNqEDHEmUgPOf_vegOcpjlM</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2512175698</pqid></control><display><type>article</type><title>SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model</title><source>Freely Accessible Journals</source><creator>Casanova, Edresson ; Shulby, Christopher ; Gölge, Eren ; Müller, Nicolas Michael ; Santos de Oliveira, Frederico ; Arnaldo Candido Junior ; Anderson da Silva Soares ; Aluisio, Sandra Maria ; Moacir Antonelli Ponti</creator><creatorcontrib>Casanova, Edresson ; Shulby, Christopher ; Gölge, Eren ; Müller, Nicolas Michael ; Santos de Oliveira, Frederico ; Arnaldo Candido Junior ; Anderson da Silva Soares ; Aluisio, Sandra Maria ; Moacir Antonelli Ponti</creatorcontrib><description>In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. 
Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Coders ; Similarity ; Spectrograms ; Speech recognition ; Training</subject><ispartof>arXiv.org, 2021-06</ispartof><rights>2021. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Casanova, Edresson</creatorcontrib><creatorcontrib>Shulby, Christopher</creatorcontrib><creatorcontrib>Gölge, Eren</creatorcontrib><creatorcontrib>Müller, Nicolas Michael</creatorcontrib><creatorcontrib>Santos de Oliveira, Frederico</creatorcontrib><creatorcontrib>Arnaldo Candido Junior</creatorcontrib><creatorcontrib>Anderson da Silva Soares</creatorcontrib><creatorcontrib>Aluisio, Sandra Maria</creatorcontrib><creatorcontrib>Moacir Antonelli Ponti</creatorcontrib><title>SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model</title><title>arXiv.org</title><description>In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. 
Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.</description><subject>Coders</subject><subject>Similarity</subject><subject>Spectrograms</subject><subject>Speech recognition</subject><subject>Training</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNikELgjAYQEcQJOV_GHQe6Den1lWsLp62UxcR-0RtONsm9fMr6Ad0ejzeW5EAOI9ZngBsSOjcGEURpBkIwQNSyoKdtXkqJY-0mWjZdUM74OTpFa1hsjeeVov2A5MzNne0VOHLM2W-jm1PK3NDvSPrrtEOwx-3ZH8qVXFhszWPBZ2vR7PY6ZNqEDHEmUgPOf_vegOcpjlM</recordid><startdate>20210615</startdate><enddate>20210615</enddate><creator>Casanova, Edresson</creator><creator>Shulby, Christopher</creator><creator>Gölge, Eren</creator><creator>Müller, Nicolas Michael</creator><creator>Santos de Oliveira, Frederico</creator><creator>Arnaldo Candido Junior</creator><creator>Anderson da Silva Soares</creator><creator>Aluisio, Sandra Maria</creator><creator>Moacir Antonelli Ponti</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20210615</creationdate><title>SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker 
Text-To-Speech Model</title><author>Casanova, Edresson ; Shulby, Christopher ; Gölge, Eren ; Müller, Nicolas Michael ; Santos de Oliveira, Frederico ; Arnaldo Candido Junior ; Anderson da Silva Soares ; Aluisio, Sandra Maria ; Moacir Antonelli Ponti</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_25121756983</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Coders</topic><topic>Similarity</topic><topic>Spectrograms</topic><topic>Speech recognition</topic><topic>Training</topic><toplevel>online_resources</toplevel><creatorcontrib>Casanova, Edresson</creatorcontrib><creatorcontrib>Shulby, Christopher</creatorcontrib><creatorcontrib>Gölge, Eren</creatorcontrib><creatorcontrib>Müller, Nicolas Michael</creatorcontrib><creatorcontrib>Santos de Oliveira, Frederico</creatorcontrib><creatorcontrib>Arnaldo Candido Junior</creatorcontrib><creatorcontrib>Anderson da Silva Soares</creatorcontrib><creatorcontrib>Aluisio, Sandra Maria</creatorcontrib><creatorcontrib>Moacir Antonelli Ponti</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One 
Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Casanova, Edresson</au><au>Shulby, Christopher</au><au>Gölge, Eren</au><au>Müller, Nicolas Michael</au><au>Santos de Oliveira, Frederico</au><au>Arnaldo Candido Junior</au><au>Anderson da Silva Soares</au><au>Aluisio, Sandra Maria</au><au>Moacir Antonelli Ponti</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model</atitle><jtitle>arXiv.org</jtitle><date>2021-06-15</date><risdate>2021</risdate><eissn>2331-8422</eissn><abstract>In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2021-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_2512175698
source Freely Accessible Journals
subjects Coders
Similarity
Spectrograms
Speech recognition
Training
title SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T03%3A06%3A16IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=SC-GlowTTS:%20an%20Efficient%20Zero-Shot%20Multi-Speaker%20Text-To-Speech%20Model&rft.jtitle=arXiv.org&rft.au=Casanova,%20Edresson&rft.date=2021-06-15&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2512175698%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2512175698&rft_id=info:pmid/&rfr_iscdi=true