SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we expl...
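Saved in:
Published in: | arXiv.org 2021-06 |
---|---|
Main authors: | Casanova, Edresson ; Shulby, Christopher ; Gölge, Eren ; Müller, Nicolas Michael ; Santos de Oliveira, Frederico ; Arnaldo Candido Junior ; Anderson da Silva Soares ; Aluisio, Sandra Maria ; Moacir Antonelli Ponti |
Format: | Article |
Language: | eng |
Subjects: | Coders ; Similarity ; Spectrograms ; Speech recognition ; Training |
Online access: | Full text |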
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Casanova, Edresson ; Shulby, Christopher ; Gölge, Eren ; Müller, Nicolas Michael ; Santos de Oliveira, Frederico ; Arnaldo Candido Junior ; Anderson da Silva Soares ; Aluisio, Sandra Maria ; Moacir Antonelli Ponti |
description | In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2512175698</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2512175698</sourcerecordid><originalsourceid>FETCH-proquest_journals_25121756983</originalsourceid><addsrcrecordid>eNqNikELgjAYQEcQJOV_GHQe6Den1lWsLp62UxcR-0RtONsm9fMr6Ad0ejzeW5EAOI9ZngBsSOjcGEURpBkIwQNSyoKdtXkqJY-0mWjZdUM74OTpFa1hsjeeVov2A5MzNne0VOHLM2W-jm1PK3NDvSPrrtEOwx-3ZH8qVXFhszWPBZ2vR7PY6ZNqEDHEmUgPOf_vegOcpjlM</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2512175698</pqid></control><display><type>article</type><title>SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model</title><source>Freely Accessible Journals</source><creator>Casanova, Edresson ; Shulby, Christopher ; Gölge, Eren ; Müller, Nicolas Michael ; Santos de Oliveira, Frederico ; Arnaldo Candido Junior ; Anderson da Silva Soares ; Aluisio, Sandra Maria ; Moacir Antonelli Ponti</creator><creatorcontrib>Casanova, Edresson ; Shulby, Christopher ; Gölge, Eren ; Müller, Nicolas Michael ; Santos de Oliveira, Frederico ; Arnaldo Candido Junior ; Anderson da Silva Soares ; Aluisio, Sandra Maria ; Moacir Antonelli Ponti</creatorcontrib><description>In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. 
Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Coders ; Similarity ; Spectrograms ; Speech recognition ; Training</subject><ispartof>arXiv.org, 2021-06</ispartof><rights>2021. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Casanova, Edresson</creatorcontrib><creatorcontrib>Shulby, Christopher</creatorcontrib><creatorcontrib>Gölge, Eren</creatorcontrib><creatorcontrib>Müller, Nicolas Michael</creatorcontrib><creatorcontrib>Santos de Oliveira, Frederico</creatorcontrib><creatorcontrib>Arnaldo Candido Junior</creatorcontrib><creatorcontrib>Anderson da Silva Soares</creatorcontrib><creatorcontrib>Aluisio, Sandra Maria</creatorcontrib><creatorcontrib>Moacir Antonelli Ponti</creatorcontrib><title>SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model</title><title>arXiv.org</title><description>In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. 
Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.</description><subject>Coders</subject><subject>Similarity</subject><subject>Spectrograms</subject><subject>Speech recognition</subject><subject>Training</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNikELgjAYQEcQJOV_GHQe6Den1lWsLp62UxcR-0RtONsm9fMr6Ad0ejzeW5EAOI9ZngBsSOjcGEURpBkIwQNSyoKdtXkqJY-0mWjZdUM74OTpFa1hsjeeVov2A5MzNne0VOHLM2W-jm1PK3NDvSPrrtEOwx-3ZH8qVXFhszWPBZ2vR7PY6ZNqEDHEmUgPOf_vegOcpjlM</recordid><startdate>20210615</startdate><enddate>20210615</enddate><creator>Casanova, Edresson</creator><creator>Shulby, Christopher</creator><creator>Gölge, Eren</creator><creator>Müller, Nicolas Michael</creator><creator>Santos de Oliveira, Frederico</creator><creator>Arnaldo Candido Junior</creator><creator>Anderson da Silva Soares</creator><creator>Aluisio, Sandra Maria</creator><creator>Moacir Antonelli Ponti</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20210615</creationdate><title>SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker 
Text-To-Speech Model</title><author>Casanova, Edresson ; Shulby, Christopher ; Gölge, Eren ; Müller, Nicolas Michael ; Santos de Oliveira, Frederico ; Arnaldo Candido Junior ; Anderson da Silva Soares ; Aluisio, Sandra Maria ; Moacir Antonelli Ponti</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_25121756983</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Coders</topic><topic>Similarity</topic><topic>Spectrograms</topic><topic>Speech recognition</topic><topic>Training</topic><toplevel>online_resources</toplevel><creatorcontrib>Casanova, Edresson</creatorcontrib><creatorcontrib>Shulby, Christopher</creatorcontrib><creatorcontrib>Gölge, Eren</creatorcontrib><creatorcontrib>Müller, Nicolas Michael</creatorcontrib><creatorcontrib>Santos de Oliveira, Frederico</creatorcontrib><creatorcontrib>Arnaldo Candido Junior</creatorcontrib><creatorcontrib>Anderson da Silva Soares</creatorcontrib><creatorcontrib>Aluisio, Sandra Maria</creatorcontrib><creatorcontrib>Moacir Antonelli Ponti</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One 
Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Casanova, Edresson</au><au>Shulby, Christopher</au><au>Gölge, Eren</au><au>Müller, Nicolas Michael</au><au>Santos de Oliveira, Frederico</au><au>Arnaldo Candido Junior</au><au>Anderson da Silva Soares</au><au>Aluisio, Sandra Maria</au><au>Moacir Antonelli Ponti</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model</atitle><jtitle>arXiv.org</jtitle><date>2021-06-15</date><risdate>2021</risdate><eissn>2331-8422</eissn><abstract>In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2021-06 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2512175698 |
source | Freely Accessible Journals |
subjects | Coders ; Similarity ; Spectrograms ; Speech recognition ; Training |
title | SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T03%3A06%3A16IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=SC-GlowTTS:%20an%20Efficient%20Zero-Shot%20Multi-Speaker%20Text-To-Speech%20Model&rft.jtitle=arXiv.org&rft.au=Casanova,%20Edresson&rft.date=2021-06-15&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2512175698%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2512175698&rft_id=info:pmid/&rfr_iscdi=true |