Unsupervised clustering of emotion and voice styles for expressive TTS

Current text-to-speech synthesis (TTS) systems are often perceived as lacking expressiveness, limiting the ability to fully convey information. This paper describes initial investigations into improving expressiveness for statistical speech synthesis systems. Rather than using hand-crafted definitions of expressive classes, an unsupervised clustering approach is described which is scalable to large quantities of training data.

Bibliographic Details
Main authors: Eyben, F., Buchholz, S., Braunschweiler, N., Latorre, Javier, Wan, Vincent, Gales, Mark J. F., Knill, Kate
Format: Conference Proceeding
Language: English
Subjects:
Online access: Order full text
Pages: 4009-4012
description Current text-to-speech synthesis (TTS) systems are often perceived as lacking expressiveness, limiting the ability to fully convey information. This paper describes initial investigations into improving expressiveness for statistical speech synthesis systems. Rather than using hand-crafted definitions of expressive classes, an unsupervised clustering approach is described which is scalable to large quantities of training data. To incorporate this "expression cluster" information into an HMM-TTS system two approaches are described: cluster questions in the decision tree construction; and average expression speech synthesis (AESS) using cluster-based linear transform adaptation. The performance of the approaches was evaluated on audiobook data in which the reader exhibits a wide range of expressiveness. A subjective listening test showed that synthesising with AESS results in speech that better reflects the expressiveness of human speech than a baseline expression-independent system.
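The record does not specify which acoustic features or clustering algorithm the authors used to derive the "expression clusters". As a purely illustrative sketch, assuming hypothetical utterance-level prosodic statistics (mean log-F0 and mean energy are stand-ins, not the paper's feature set) such clusters could be obtained with plain k-means:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means over tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each utterance to its nearest centroid (squared Euclidean).
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Recompute each centroid as the mean of its assigned members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical utterance-level features: (mean log-F0, mean energy),
# forming two synthetic "expression" groups (e.g. calm vs. excited reading).
utterances = [(5.0, 0.10), (5.1, 0.12), (4.9, 0.09),   # low pitch, low energy
              (5.8, 0.50), (5.9, 0.55), (5.7, 0.48)]   # high pitch, high energy
_, labels = kmeans(utterances, k=2)
```

The resulting cluster labels would then tag each utterance, which is the kind of side information the paper feeds into decision-tree questions or transform-based adaptation.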
DOI: 10.1109/ICASSP.2012.6288797
ISSN: 1520-6149
eISSN: 2379-190X
ISBN: 9781467300452; eISBN: 9781467300445
Published in: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, p. 4009-4012
Source: IEEE Electronic Library (IEL) Conference Proceedings
Subjects:
Average Voice Model
Context
Decision trees
Expressive synthesis
Hidden Markov models
HMM-TTS
IEEE Aerospace and Electronic Systems Society
Speech
Speech synthesis
text-to-speech
Training
unsupervised clustering