Unsupervised clustering of emotion and voice styles for expressive TTS
Current text-to-speech synthesis (TTS) systems are often perceived as lacking expressiveness, limiting the ability to fully convey information. This paper describes initial investigations into improving expressiveness for statistical speech synthesis systems. Rather than using hand-crafted definitions of expressive classes, an unsupervised clustering approach is described which is scalable to large quantities of training data.
Saved in:
Main authors: | Eyben, F.; Buchholz, S.; Braunschweiler, N.; Latorre, Javier; Wan, Vincent; Gales, Mark J. F.; Knill, Kate |
---|---|
Format: | Conference proceeding |
Language: | eng |
Subjects: | Speech synthesis; text-to-speech; HMM-TTS; Expressive synthesis; Average Voice Model; unsupervised clustering |
Online access: | Order full text |
container_end_page | 4012 |
---|---|
container_issue | |
container_start_page | 4009 |
container_title | |
container_volume | |
creator | Eyben, F.; Buchholz, S.; Braunschweiler, N.; Latorre, Javier; Wan, Vincent; Gales, Mark J. F.; Knill, Kate |
description | Current text-to-speech synthesis (TTS) systems are often perceived as lacking expressiveness, limiting the ability to fully convey information. This paper describes initial investigations into improving expressiveness for statistical speech synthesis systems. Rather than using hand-crafted definitions of expressive classes, an unsupervised clustering approach is described which is scalable to large quantities of training data. To incorporate this "expression cluster" information into an HMM-TTS system two approaches are described: cluster questions in the decision tree construction; and average expression speech synthesis (AESS) using cluster-based linear transform adaptation. The performance of the approaches was evaluated on audiobook data in which the reader exhibits a wide range of expressiveness. A subjective listening test showed that synthesising with AESS results in speech that better reflects the expressiveness of human speech than a baseline expression-independent system. |
doi_str_mv | 10.1109/ICASSP.2012.6288797 |
format | Conference Proceeding |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1520-6149; EISSN: 2379-190X; ISBN: 1467300454; ISBN: 9781467300452; EISBN: 9781467300469; EISBN: 1467300446; EISBN: 9781467300445; EISBN: 1467300462; DOI: 10.1109/ICASSP.2012.6288797 |
ispartof | 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, p.4009-4012 |
issn | 1520-6149; 2379-190X |
language | eng |
recordid | cdi_ieee_primary_6288797 |
source | IEEE Electronic Library (IEL) Conference Proceedings |
subjects | Average Voice Model; Context; Decision trees; Expressive synthesis; Hidden Markov models; HMM-TTS; IEEE Aerospace and Electronic Systems Society; Speech; Speech synthesis; text-to-speech; Training; unsupervised clustering |
title | Unsupervised clustering of emotion and voice styles for expressive TTS |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T22%3A51%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Unsupervised%20clustering%20of%20emotion%20and%20voice%20styles%20for%20expressive%20TTS&rft.btitle=2012%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20(ICASSP)&rft.au=Eyben,%20F.&rft.date=2012-03&rft.spage=4009&rft.epage=4012&rft.pages=4009-4012&rft.issn=1520-6149&rft.eissn=2379-190X&rft.isbn=1467300454&rft.isbn_list=9781467300452&rft_id=info:doi/10.1109/ICASSP.2012.6288797&rft_dat=%3Cieee_6IE%3E6288797%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=9781467300469&rft.eisbn_list=1467300446&rft.eisbn_list=9781467300445&rft.eisbn_list=1467300462&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6288797&rfr_iscdi=true |
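As a rough illustration of the unsupervised clustering step described in the abstract — grouping utterances into "expression clusters" from acoustic features alone, with no hand-crafted expressive classes — here is a minimal sketch. The feature choice (per-utterance pitch/energy summaries) and the clustering algorithm (k-means with a deterministic farthest-point initialisation) are illustrative assumptions, not the authors' implementation.

```python
# Sketch: unsupervised "expression clustering" of utterances.
# Each utterance is summarised by an acoustic feature vector; k-means
# then groups utterances without any expressive-class labels.
# Features and algorithm are assumptions for illustration only.
import numpy as np

def kmeans(features, k, iters=50):
    """Cluster rows of `features` into k groups; returns (labels, centroids)."""
    # Deterministic farthest-point initialisation: start from the first
    # utterance, then repeatedly add the point farthest from all centroids.
    centroids = [features[0]]
    for _ in range(1, k):
        dist_to_nearest = np.min(
            [np.linalg.norm(features - c, axis=1) for c in centroids], axis=0)
        centroids.append(features[dist_to_nearest.argmax()])
    centroids = np.array(centroids)

    for _ in range(iters):
        # Assign each utterance to its nearest centroid.
        dists = np.linalg.norm(features[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return labels, centroids

if __name__ == "__main__":
    # Synthetic stand-in for per-utterance features (e.g. mean F0, energy).
    rng = np.random.default_rng(1)
    calm = rng.normal([120.0, 0.2], [5.0, 0.05], size=(20, 2))     # low pitch/energy
    excited = rng.normal([220.0, 0.8], [5.0, 0.05], size=(20, 2))  # high pitch/energy
    feats = np.vstack([calm, excited])
    labels, _ = kmeans(feats, k=2)
    print(labels)  # first 20 utterances fall in one cluster, last 20 in the other
```

The resulting cluster labels would then feed the two integration routes the abstract names: as extra questions in HMM decision-tree construction, or as the adaptation classes for cluster-based linear transforms in AESS.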