CONTROLLING EXPRESSIVITY IN END-TO-END SPEECH SYNTHESIS SYSTEMS

A system (900) includes a context encoder (610), a text-prediction network (520), and a text-to-speech (TTS) model (650). The context encoder is configured to receive one or more context features (602) associated with current input text (502) and process the one or more context features to generate...

Detailed description

Bibliographic details
Main authors:	STANTON, Daisy ; BAGBY, Thomas Edward ; SHANNON, Sean Matthew ; SKERRY-RYAN, Russell, John Wyatt ; MARIOORYAD, Soroosh ; BATTENBERG, Eric Dean ; KAO, David Teh-hwa
Format:	Patent
Language:	eng ; fre ; ger
Subjects:	ACOUSTICS ; MUSICAL INSTRUMENTS ; PHYSICS ; SPEECH ANALYSIS OR SYNTHESIS ; SPEECH OR AUDIO CODING OR DECODING ; SPEECH OR VOICE PROCESSING ; SPEECH RECOGNITION

creator	STANTON, Daisy ; BAGBY, Thomas Edward ; SHANNON, Sean Matthew ; SKERRY-RYAN, Russell, John Wyatt ; MARIOORYAD, Soroosh ; BATTENBERG, Eric Dean ; KAO, David Teh-hwa
description	A system (900) includes a context encoder (610), a text-prediction network (520), and a text-to-speech (TTS) model (650). The context encoder is configured to receive one or more context features (602) associated with current input text (502) and process the one or more context features to generate a context embedding (612) associated with the current input text. The text-prediction network is configured to process the current input text and the context embedding to predict, as output, a style embedding (650) for the current input text. The style embedding specifies a specific prosody and/or style for synthesizing the current input text into expressive speech (680). The TTS model is configured to process the current input text and the style embedding to generate an output audio signal (670) of expressive speech of the current input text. The output audio signal has the specific prosody and/or style specified by the style embedding.
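The data flow in the description can be sketched as a toy pipeline. This is only an illustration of how the three components connect, not the patented implementation: every function body, the `EMBED_DIM` size, the `AudioSignal` type, and the `synthesize` helper are hypothetical stand-ins, and the arithmetic inside them is arbitrary. Only the component roles and their inputs and outputs come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Sequence

EMBED_DIM = 4  # arbitrary toy size; the record does not give dimensions


def _hash_vector(items: Sequence[str], dim: int = EMBED_DIM) -> List[float]:
    """Deterministic toy embedding: fold character codes into `dim` slots."""
    vec = [0.0] * dim
    for item in items:
        for i, ch in enumerate(item):
            vec[i % dim] += ord(ch) / 1000.0
    return vec


def context_encoder(context_features: Sequence[str]) -> List[float]:
    """Stand-in for the context encoder: context features -> context embedding."""
    return _hash_vector(context_features)


def text_prediction_network(text: str, context_embedding: List[float]) -> List[float]:
    """Stand-in for the text-prediction network: text + context embedding -> style embedding."""
    text_vec = _hash_vector([text])
    return [t + c for t, c in zip(text_vec, context_embedding)]


@dataclass
class AudioSignal:
    samples: List[float]  # placeholder for a waveform
    style: List[float]    # the style embedding the signal was conditioned on


def tts_model(text: str, style_embedding: List[float]) -> AudioSignal:
    """Stand-in for the TTS model: text + style embedding -> output audio signal."""
    gain = 1.0 + sum(style_embedding) / len(style_embedding)
    samples = [gain * (ord(ch) % 7) for ch in text]
    return AudioSignal(samples=samples, style=style_embedding)


def synthesize(text: str, context_features: Sequence[str]) -> AudioSignal:
    """Chain the three stages in the order the abstract describes."""
    context_embedding = context_encoder(context_features)
    style_embedding = text_prediction_network(text, context_embedding)
    return tts_model(text, style_embedding)


if __name__ == "__main__":
    out = synthesize("Hello there.", ["previous sentence: Goodbye", "title: Greetings"])
    print(len(out.samples), out.style)
```

The point of the sketch is the wiring: the style embedding is predicted from the text and its context rather than supplied by a reference audio clip, so changing the context features alone changes the style the TTS stage is conditioned on.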
format	Patent
date	2024-06-12
recordtype	patent
oa	free_for_read
sourcerecordid	EP4345815A3
linktohtml	https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20240612&DB=EPODOC&CC=EP&NR=4345815A3
language	eng ; fre ; ger
recordid	cdi_epo_espacenet_EP4345815A3
source	esp@cenet
subjects	ACOUSTICS ; MUSICAL INSTRUMENTS ; PHYSICS ; SPEECH ANALYSIS OR SYNTHESIS ; SPEECH OR AUDIO CODING OR DECODING ; SPEECH OR VOICE PROCESSING ; SPEECH RECOGNITION
title	CONTROLLING EXPRESSIVITY IN END-TO-END SPEECH SYNTHESIS SYSTEMS
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T02%3A05%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-epo_EVB&rft_val_fmt=info:ofi/fmt:kev:mtx:patent&rft.genre=patent&rft.au=STANTON,%20Daisy&rft.date=2024-06-12&rft_id=info:doi/&rft_dat=%3Cepo_EVB%3EEP4345815A3%3C/epo_EVB%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true