CONTROLLING EXPRESSIVITY IN END-TO-END SPEECH SYNTHESIS SYSTEMS
A system (900) includes a context encoder (610), a text-prediction network (520), and a text-to-speech (TTS) model (650). The context encoder is configured to receive one or more context features (602) associated with current input text (502) and process the one or more context features to generate...
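Main authors: | STANTON, Daisy ; BATTENBERG, Eric, Dean ; SHANNON, Sean, Matthew ; KAO, David, Teh-hwa ; SKERRY-RYAN, Russell, John Wyatt ; MARIOORYAD, Soroosh ; BAGBY, Thomas, Edward |
---|---|
Format: | Patent |
Language: | eng ; fre ; ger |
Subjects: | |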
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | STANTON, Daisy ; BATTENBERG, Eric, Dean ; SHANNON, Sean, Matthew ; KAO, David, Teh-hwa ; SKERRY-RYAN, Russell, John Wyatt ; MARIOORYAD, Soroosh ; BAGBY, Thomas, Edward |
description | A system (900) includes a context encoder (610), a text-prediction network (520), and a text-to-speech (TTS) model (650). The context encoder is configured to receive one or more context features (602) associated with current input text (502) and process the one or more context features to generate a context embedding (612) associated with the current input text. The text-prediction network is configured to process the current input text and the context embedding to predict, as output, a style embedding (650) for the current input text. The style embedding specifies a specific prosody and/or style for synthesizing the current input text into expressive speech (680). The TTS model is configured to process the current input text and the style embedding to generate an output audio signal (670) of expressive speech of the current input text. The output audio signal has the specific prosody and/or style specified by the style embedding. |
format | Patent |
fullrecord | <record><control><sourceid>epo_EVB</sourceid><recordid>TN_cdi_epo_espacenet_EP4007997A1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>EP4007997A1</sourcerecordid><originalsourceid>FETCH-epo_espacenet_EP4007997A13</originalsourceid><addsrcrecordid>eNrjZLB39vcLCfL38fH0c1dwjQgIcg0O9gzzDIlU8PRTcPVz0Q3x1wVSCsEBrq7OHgrBkX4hHq7BnsFAVnCIq28wDwNrWmJOcSovlOZmUHBzDXH20E0tyI9PLS5ITE7NSy2Jdw0wMTAwt7Q0dzQ0JkIJAHRSKpw</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>patent</recordtype></control><display><type>patent</type><title>CONTROLLING EXPRESSIVITY IN END-TO-END SPEECH SYNTHESIS SYSTEMS</title><source>esp@cenet</source><creator>STANTON, Daisy ; BATTENBERG, Eric, Dean ; SHANNON, Sean, Matthew ; KAO, David, Teh-hwa ; SKERRY-RYAN, Russell, John Wyatt ; MARIOORYAD, Soroosh ; BAGBY, Thomas, Edward</creator><creatorcontrib>STANTON, Daisy ; BATTENBERG, Eric, Dean ; SHANNON, Sean, Matthew ; KAO, David, Teh-hwa ; SKERRY-RYAN, Russell, John Wyatt ; MARIOORYAD, Soroosh ; BAGBY, Thomas, Edward</creatorcontrib><description>A system (900) includes a context encoder (610), a text-prediction network (520), and a text-to-speech (TTS) model (650). The context encoder is configured to receive one or more context features (602) associated with current input text (502) and process the one or more context features to generate a context embedding (612) associated with the current input text. The text-prediction network is configured to process the current input text and the context embedding to predict, as output, a style embedding (650) for the current input text. The style embedding specifies a specific prosody and/or style for synthesizing the current input text into expressive speech (680). The TTS model is configured to process the current input text and the style embedding to generate an output audio signal (670) of expressive speech of the current input text. 
The output audio signal has the specific prosody and/or style specified by the style embedding.</description><language>eng ; fre ; ger</language><subject>ACOUSTICS ; MUSICAL INSTRUMENTS ; PHYSICS ; SPEECH ANALYSIS OR SYNTHESIS ; SPEECH OR AUDIO CODING OR DECODING ; SPEECH OR VOICE PROCESSING ; SPEECH RECOGNITION</subject><creationdate>2022</creationdate><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20220608&DB=EPODOC&CC=EP&NR=4007997A1$$EHTML$$P50$$Gepo$$Hfree_for_read</linktohtml><link.rule.ids>230,308,780,885,25563,76318</link.rule.ids><linktorsrc>$$Uhttps://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20220608&DB=EPODOC&CC=EP&NR=4007997A1$$EView_record_in_European_Patent_Office$$FView_record_in_$$GEuropean_Patent_Office$$Hfree_for_read</linktorsrc></links><search><creatorcontrib>STANTON, Daisy</creatorcontrib><creatorcontrib>BATTENBERG, Eric, Dean</creatorcontrib><creatorcontrib>SHANNON, Sean, Matthew</creatorcontrib><creatorcontrib>KAO, David, Teh-hwa</creatorcontrib><creatorcontrib>SKERRY-RYAN, Russell, John Wyatt</creatorcontrib><creatorcontrib>MARIOORYAD, Soroosh</creatorcontrib><creatorcontrib>BAGBY, Thomas, Edward</creatorcontrib><title>CONTROLLING EXPRESSIVITY IN END-TO-END SPEECH SYNTHESIS SYSTEMS</title><description>A system (900) includes a context encoder (610), a text-prediction network (520), and a text-to-speech (TTS) model (650). The context encoder is configured to receive one or more context features (602) associated with current input text (502) and process the one or more context features to generate a context embedding (612) associated with the current input text. 
The text-prediction network is configured to process the current input text and the context embedding to predict, as output, a style embedding (650) for the current input text. The style embedding specifies a specific prosody and/or style for synthesizing the current input text into expressive speech (680). The TTS model is configured to process the current input text and the style embedding to generate an output audio signal (670) of expressive speech of the current input text. The output audio signal has the specific prosody and/or style specified by the style embedding.</description><subject>ACOUSTICS</subject><subject>MUSICAL INSTRUMENTS</subject><subject>PHYSICS</subject><subject>SPEECH ANALYSIS OR SYNTHESIS</subject><subject>SPEECH OR AUDIO CODING OR DECODING</subject><subject>SPEECH OR VOICE PROCESSING</subject><subject>SPEECH RECOGNITION</subject><fulltext>true</fulltext><rsrctype>patent</rsrctype><creationdate>2022</creationdate><recordtype>patent</recordtype><sourceid>EVB</sourceid><recordid>eNrjZLB39vcLCfL38fH0c1dwjQgIcg0O9gzzDIlU8PRTcPVz0Q3x1wVSCsEBrq7OHgrBkX4hHq7BnsFAVnCIq28wDwNrWmJOcSovlOZmUHBzDXH20E0tyI9PLS5ITE7NSy2Jdw0wMTAwt7Q0dzQ0JkIJAHRSKpw</recordid><startdate>20220608</startdate><enddate>20220608</enddate><creator>STANTON, Daisy</creator><creator>BATTENBERG, Eric, Dean</creator><creator>SHANNON, Sean, Matthew</creator><creator>KAO, David, Teh-hwa</creator><creator>SKERRY-RYAN, Russell, John Wyatt</creator><creator>MARIOORYAD, Soroosh</creator><creator>BAGBY, Thomas, Edward</creator><scope>EVB</scope></search><sort><creationdate>20220608</creationdate><title>CONTROLLING EXPRESSIVITY IN END-TO-END SPEECH SYNTHESIS SYSTEMS</title><author>STANTON, Daisy ; BATTENBERG, Eric, Dean ; SHANNON, Sean, Matthew ; KAO, David, Teh-hwa ; SKERRY-RYAN, Russell, John Wyatt ; MARIOORYAD, Soroosh ; BAGBY, Thomas, 
Edward</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-epo_espacenet_EP4007997A13</frbrgroupid><rsrctype>patents</rsrctype><prefilter>patents</prefilter><language>eng ; fre ; ger</language><creationdate>2022</creationdate><topic>ACOUSTICS</topic><topic>MUSICAL INSTRUMENTS</topic><topic>PHYSICS</topic><topic>SPEECH ANALYSIS OR SYNTHESIS</topic><topic>SPEECH OR AUDIO CODING OR DECODING</topic><topic>SPEECH OR VOICE PROCESSING</topic><topic>SPEECH RECOGNITION</topic><toplevel>online_resources</toplevel><creatorcontrib>STANTON, Daisy</creatorcontrib><creatorcontrib>BATTENBERG, Eric, Dean</creatorcontrib><creatorcontrib>SHANNON, Sean, Matthew</creatorcontrib><creatorcontrib>KAO, David, Teh-hwa</creatorcontrib><creatorcontrib>SKERRY-RYAN, Russell, John Wyatt</creatorcontrib><creatorcontrib>MARIOORYAD, Soroosh</creatorcontrib><creatorcontrib>BAGBY, Thomas, Edward</creatorcontrib><collection>esp@cenet</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>STANTON, Daisy</au><au>BATTENBERG, Eric, Dean</au><au>SHANNON, Sean, Matthew</au><au>KAO, David, Teh-hwa</au><au>SKERRY-RYAN, Russell, John Wyatt</au><au>MARIOORYAD, Soroosh</au><au>BAGBY, Thomas, Edward</au><format>patent</format><genre>patent</genre><ristype>GEN</ristype><title>CONTROLLING EXPRESSIVITY IN END-TO-END SPEECH SYNTHESIS SYSTEMS</title><date>2022-06-08</date><risdate>2022</risdate><abstract>A system (900) includes a context encoder (610), a text-prediction network (520), and a text-to-speech (TTS) model (650). The context encoder is configured to receive one or more context features (602) associated with current input text (502) and process the one or more context features to generate a context embedding (612) associated with the current input text. 
The text-prediction network is configured to process the current input text and the context embedding to predict, as output, a style embedding (650) for the current input text. The style embedding specifies a specific prosody and/or style for synthesizing the current input text into expressive speech (680). The TTS model is configured to process the current input text and the style embedding to generate an output audio signal (670) of expressive speech of the current input text. The output audio signal has the specific prosody and/or style specified by the style embedding.</abstract><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | |
ispartof | |
issn | |
language | eng ; fre ; ger |
recordid | cdi_epo_espacenet_EP4007997A1 |
source | esp@cenet |
subjects | ACOUSTICS ; MUSICAL INSTRUMENTS ; PHYSICS ; SPEECH ANALYSIS OR SYNTHESIS ; SPEECH OR AUDIO CODING OR DECODING ; SPEECH OR VOICE PROCESSING ; SPEECH RECOGNITION |
title | CONTROLLING EXPRESSIVITY IN END-TO-END SPEECH SYNTHESIS SYSTEMS |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T01%3A50%3A06IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-epo_EVB&rft_val_fmt=info:ofi/fmt:kev:mtx:patent&rft.genre=patent&rft.au=STANTON,%20Daisy&rft.date=2022-06-08&rft_id=info:doi/&rft_dat=%3Cepo_EVB%3EEP4007997A1%3C/epo_EVB%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
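
The data flow in the description field (context features to context embedding, text plus context embedding to style embedding, text plus style embedding to audio) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the patented implementation: the class names, `EMBED_DIM`, the random untrained weights, the character-bucket text featurizer `encode_text`, and the sine-wave stand-in for the TTS model are all hypothetical. Only the three-stage wiring follows the record.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]
EMBED_DIM = 8  # hypothetical embedding size, not taken from the record


def _linear(x: Sequence[float], weights: Sequence[Sequence[float]]) -> Vector:
    """Multiply vector x by a weight matrix with one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def _random_matrix(rows: int, cols: int, rng: random.Random) -> List[Vector]:
    """Untrained placeholder weights; a real system would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encode_text(text: str, dim: int = EMBED_DIM) -> Vector:
    """Toy featurizer (assumption): count characters into `dim` buckets, then L2-normalize."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ContextEncoder:
    """Maps context features to a context embedding."""

    def __init__(self, n_features: int, rng: random.Random) -> None:
        self.weights = _random_matrix(EMBED_DIM, n_features, rng)

    def __call__(self, context_features: Sequence[float]) -> Vector:
        return [math.tanh(v) for v in _linear(context_features, self.weights)]


class TextPredictionNetwork:
    """Predicts a style embedding from the input text and the context embedding."""

    def __init__(self, rng: random.Random) -> None:
        self.weights = _random_matrix(EMBED_DIM, 2 * EMBED_DIM, rng)

    def __call__(self, text: str, context_embedding: Sequence[float]) -> Vector:
        joint = encode_text(text) + list(context_embedding)
        return [math.tanh(v) for v in _linear(joint, self.weights)]


class ToyTTSModel:
    """Stand-in for the TTS model: a sine wave whose pitch and gain follow the style embedding."""

    def __call__(self, text: str, style_embedding: Sequence[float],
                 sample_rate: int = 8000) -> Vector:
        pitch_hz = 120.0 + 60.0 * style_embedding[0]  # 60..180 Hz
        gain = 0.5 + 0.4 * style_embedding[1]         # 0.1..0.9
        n_samples = 80 * max(len(text), 1)            # 10 ms per character at 8 kHz
        return [gain * math.sin(2.0 * math.pi * pitch_hz * t / sample_rate)
                for t in range(n_samples)]


def synthesize(text: str, context_features: Sequence[float],
               context_encoder: ContextEncoder,
               predictor: TextPredictionNetwork,
               tts: ToyTTSModel) -> Vector:
    """Chain the three components in the order given in the description."""
    context_embedding = context_encoder(context_features)
    style_embedding = predictor(text, context_embedding)
    return tts(text, style_embedding)


if __name__ == "__main__":
    rng = random.Random(0)
    audio = synthesize("Hello", [1.0, 0.0, 0.5],
                       ContextEncoder(3, rng), TextPredictionNetwork(rng), ToyTTSModel())
    print(len(audio), "samples")
```

The point of the sketch is the dependency order: the style embedding is predicted from the text and its context rather than supplied by hand, so changing only the context features changes the prosody of the same input text.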