Controlling Expressivity In End-to-End Speech Synthesis Systems

A system for generating an output audio signal includes a context encoder, a text-prediction network, and a text-to-speech (TTS) model. The context encoder is configured to receive one or more context features associated with current input text and process the one or more context features to generat...

Bibliographic details
Main authors: Skerry-Ryan, Russell John Wyatt, Kao, David Teh-Hwa, Bagby, Thomas Edward, Shannon, Sean Matthew, Battenberg, Eric Dean, Mariooryad, Soroosh, Stanton, Daisy
Format: Patent
Language: eng
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Skerry-Ryan, Russell John Wyatt
Kao, David Teh-Hwa
Bagby, Thomas Edward
Shannon, Sean Matthew
Battenberg, Eric Dean
Mariooryad, Soroosh
Stanton, Daisy
description A system for generating an output audio signal includes a context encoder, a text-prediction network, and a text-to-speech (TTS) model. The context encoder is configured to receive one or more context features associated with current input text and to process them to generate a context embedding associated with the current input text. The text-prediction network is configured to process the current input text and the context embedding to predict, as output, a style embedding for the current input text. The style embedding specifies a specific prosody and/or style for synthesizing the current input text into expressive speech. The TTS model is configured to process the current input text and the style embedding to generate an output audio signal of expressive speech of the current input text. The output audio signal has the specific prosody and/or style specified by the style embedding.
format Patent
fulltext fulltext_linktorsrc
identifier
ispartof
issn
language eng
recordid cdi_epo_espacenet_US2021035551A1
source esp@cenet
subjects ACOUSTICS
MUSICAL INSTRUMENTS
PHYSICS
SPEECH ANALYSIS OR SYNTHESIS
SPEECH OR AUDIO CODING OR DECODING
SPEECH OR VOICE PROCESSING
SPEECH RECOGNITION
title Controlling Expressivity In End-to-End Speech Synthesis Systems
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T01%3A48%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-epo_EVB&rft_val_fmt=info:ofi/fmt:kev:mtx:patent&rft.genre=patent&rft.au=Skerry-Ryan,%20Russell%20John%20Wyatt&rft.date=2021-02-04&rft_id=info:doi/&rft_dat=%3Cepo_EVB%3EUS2021035551A1%3C/epo_EVB%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
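
The description field of this record outlines a three-stage data flow: context features go to a context embedding, the text plus that embedding go to a style embedding, and the text plus the style embedding go to audio. The following Python sketch only illustrates that flow. It is not the patented implementation: every function body is a toy stand-in (hash-derived vectors and a sine wave instead of trained networks), and the names `EMBED_DIM`, `_toy_vector`, `samples_per_char`, and `synthesize` are assumptions introduced here for illustration.

```python
import hashlib
import math

EMBED_DIM = 4  # hypothetical embedding size, chosen only for this sketch


def _toy_vector(seed, dim=EMBED_DIM):
    """Deterministic stand-in for a learned encoder: hash a string to floats in [0, 1]."""
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def context_encoder(context_features):
    """Map one or more context features to a single context embedding (toy: mean of hashes)."""
    if not context_features:
        raise ValueError("at least one context feature is required")
    vectors = [_toy_vector(feature) for feature in context_features]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def text_prediction_network(text, context_embedding):
    """Predict a style embedding from the input text and the context embedding (toy mix)."""
    text_vector = _toy_vector(text)
    return [math.tanh(t + c) for t, c in zip(text_vector, context_embedding)]


def tts_model(text, style_embedding, samples_per_char=8):
    """Produce an 'audio signal' conditioned on text and style (toy: style-scaled sine wave)."""
    pitch = 1.0 + style_embedding[0]
    n_samples = len(text) * samples_per_char
    return [math.sin(pitch * i) for i in range(n_samples)]


def synthesize(text, context_features):
    """Chain the three components in the order given in the description."""
    context_embedding = context_encoder(context_features)
    style_embedding = text_prediction_network(text, context_embedding)
    return tts_model(text, style_embedding)


if __name__ == "__main__":
    audio = synthesize("Hello there", ["previous text: Hi", "speaker: narrator"])
    print(len(audio))
```

The point of the sketch is the interface between stages: the TTS stage never sees the context features directly, only the style embedding predicted from them, which matches the separation of components stated in the description.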