Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation

Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of t...

Bibliographic details
Published in:	IEEE journal of selected topics in signal processing 2014-04, Vol.8 (2), p.209-220
Main authors:	Csapo, Tamas Gabor ; Nemeth, Geza
Format:	Article
Language:	eng
Subjects:	Biological system modeling ; Boundaries ; Creaky voice ; Excitation ; glottalization ; Hidden Markov models ; High-temperature superconductors ; HMM ; Internet ; irregular phonation ; Methods ; parametric ; Perception ; Phonation ; residual ; Similarity ; Speech ; speech processing ; Speech recognition ; Speech synthesis ; Synthesis ; Training ; vocal fry ; Voice ; voice quality ; Voice simulation
Online access:	Order full text

container_end_page	220
container_issue	2
container_start_page	209
container_title	IEEE journal of selected topics in signal processing
container_volume	8
creator	Csapo, Tamas Gabor ; Nemeth, Geza
description	Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of them deal with the occurrence of this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring in phrase boundaries or sentence endings, two alternative extensions are proposed. The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors and spectral distortion. The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels and unit selection is applied during synthesis. In perception tests of short speech segments, both methods have been found to improve the baseline excitation in preference and similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.
doi_str_mv	10.1109/JSTSP.2013.2292037
format	Article
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_1507140449</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6674045</ieee_id><sourcerecordid>3245799571</sourcerecordid><originalsourceid>FETCH-LOGICAL-c328t-717ea95dd14a351f5c572089fe83741d5305a9b647bc5ae2dd5d14867063e68c3</originalsourceid><addsrcrecordid>eNpdkE1PwkAURRujiYj-Ad1M4sZNcT477VIJKgYjsajLZpg-YKB0cKZN5N87CHHh6t3FuScvN4ouCe4RgrPb53ySj3sUE9ajNKOYyaOoQzJOYsxTfrzLjMZcCHYanXm_xFjIhPBOtHqxJVSmnqOhczBvK-XQhzUakKlR3qjG-MZoVaGxcmoNjTMa5RsAvUD5tm4W4I1Hn6ZZoLcQyzaQ_WCcWrtC98pDiQbf2uw8tj6PTmaq8nBxuN3o_WEw6T_Fo9fHYf9uFGtG0yaWRILKRFkSrpggM6GFpDjNZpAyyUkpGBYqmyZcTrVQQMtSBDRNJE4YJKlm3ehm7904-9WCb4q18RqqStVgW18QQXEWcE4Cev0PXdrW1eG7QGFJOOY8CxTdU9pZ7x3Mio0za-W2BcHFbv_id_9it39x2D-UrvYlAwB_hSSRwSnYD7ULgV4</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1507140449</pqid></control><display><type>article</type><title>Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation</title><source>IEEE/IET Electronic Library (IEL)</source><creator>Csapo, Tamas Gabor ; Nemeth, Geza</creator><creatorcontrib>Csapo, Tamas Gabor ; Nemeth, Geza</creatorcontrib><description>Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of them deal with the occurrence of this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring in phrase boundaries or sentence endings, two alternative extensions are proposed. 
The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors and spectral distortion. The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels and unit selection is applied during synthesis. In perception tests of short speech segments, both methods have been found to improve the baseline excitation in preference and similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.</description><identifier>ISSN: 1932-4553</identifier><identifier>EISSN: 1941-0484</identifier><identifier>DOI: 10.1109/JSTSP.2013.2292037</identifier><identifier>CODEN: IJSTGY</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Biological system modeling ; Boundaries ; Creaky voice ; Excitation ; glottalization ; Hidden Markov models ; High-temperature superconductors ; HMM ; Internet ; irregular phonation ; Methods ; parametric ; Perception ; Phonation ; residual ; Similarity ; Speech ; speech processing ; Speech recognition ; Speech synthesis ; Synthesis ; Training ; vocal fry ; Voice ; voice quality ; Voice simulation</subject><ispartof>IEEE journal of selected topics in signal processing, 2014-04, Vol.8 (2), p.209-220</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) Apr 2014</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c328t-717ea95dd14a351f5c572089fe83741d5305a9b647bc5ae2dd5d14867063e68c3</citedby><cites>FETCH-LOGICAL-c328t-717ea95dd14a351f5c572089fe83741d5305a9b647bc5ae2dd5d14867063e68c3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6674045$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27901,27902,54733</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6674045$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Csapo, Tamas Gabor</creatorcontrib><creatorcontrib>Nemeth, Geza</creatorcontrib><title>Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation</title><title>IEEE journal of selected topics in signal processing</title><addtitle>JSTSP</addtitle><description>Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of them deal with the occurrence of this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring in phrase boundaries or sentence endings, two alternative extensions are proposed. The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors and spectral distortion. 
The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels and unit selection is applied during synthesis. In perception tests of short speech segments, both methods have been found to improve the baseline excitation in preference and similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.</description><subject>Biological system modeling</subject><subject>Boundaries</subject><subject>Creaky voice</subject><subject>Excitation</subject><subject>glottalization</subject><subject>Hidden Markov models</subject><subject>High-temperature superconductors</subject><subject>HMM</subject><subject>Internet</subject><subject>irregular phonation</subject><subject>Methods</subject><subject>parametric</subject><subject>Perception</subject><subject>Phonation</subject><subject>residual</subject><subject>Similarity</subject><subject>Speech</subject><subject>speech processing</subject><subject>Speech recognition</subject><subject>Speech synthesis</subject><subject>Synthesis</subject><subject>Training</subject><subject>vocal fry</subject><subject>Voice</subject><subject>voice quality</subject><subject>Voice 
simulation</subject><issn>1932-4553</issn><issn>1941-0484</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2014</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpdkE1PwkAURRujiYj-Ad1M4sZNcT477VIJKgYjsajLZpg-YKB0cKZN5N87CHHh6t3FuScvN4ouCe4RgrPb53ySj3sUE9ajNKOYyaOoQzJOYsxTfrzLjMZcCHYanXm_xFjIhPBOtHqxJVSmnqOhczBvK-XQhzUakKlR3qjG-MZoVaGxcmoNjTMa5RsAvUD5tm4W4I1Hn6ZZoLcQyzaQ_WCcWrtC98pDiQbf2uw8tj6PTmaq8nBxuN3o_WEw6T_Fo9fHYf9uFGtG0yaWRILKRFkSrpggM6GFpDjNZpAyyUkpGBYqmyZcTrVQQMtSBDRNJE4YJKlm3ehm7904-9WCb4q18RqqStVgW18QQXEWcE4Cev0PXdrW1eG7QGFJOOY8CxTdU9pZ7x3Mio0za-W2BcHFbv_id_9it39x2D-UrvYlAwB_hSSRwSnYD7ULgV4</recordid><startdate>201404</startdate><enddate>201404</enddate><creator>Csapo, Tamas Gabor</creator><creator>Nemeth, Geza</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SP</scope><scope>8FD</scope><scope>H8D</scope><scope>L7M</scope></search><sort><creationdate>201404</creationdate><title>Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation</title><author>Csapo, Tamas Gabor ; Nemeth, Geza</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c328t-717ea95dd14a351f5c572089fe83741d5305a9b647bc5ae2dd5d14867063e68c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2014</creationdate><topic>Biological system modeling</topic><topic>Boundaries</topic><topic>Creaky voice</topic><topic>Excitation</topic><topic>glottalization</topic><topic>Hidden Markov models</topic><topic>High-temperature superconductors</topic><topic>HMM</topic><topic>Internet</topic><topic>irregular 
phonation</topic><topic>Methods</topic><topic>parametric</topic><topic>Perception</topic><topic>Phonation</topic><topic>residual</topic><topic>Similarity</topic><topic>Speech</topic><topic>speech processing</topic><topic>Speech recognition</topic><topic>Speech synthesis</topic><topic>Synthesis</topic><topic>Training</topic><topic>vocal fry</topic><topic>Voice</topic><topic>voice quality</topic><topic>Voice simulation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Csapo, Tamas Gabor</creatorcontrib><creatorcontrib>Nemeth, Geza</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE/IET Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Electronics & Communications Abstracts</collection><collection>Technology Research Database</collection><collection>Aerospace Database</collection><collection>Advanced Technologies Database with Aerospace</collection><jtitle>IEEE journal of selected topics in signal processing</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Csapo, Tamas Gabor</au><au>Nemeth, Geza</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation</atitle><jtitle>IEEE journal of selected topics in signal processing</jtitle><stitle>JSTSP</stitle><date>2014-04</date><risdate>2014</risdate><volume>8</volume><issue>2</issue><spage>209</spage><epage>220</epage><pages>209-220</pages><issn>1932-4553</issn><eissn>1941-0484</eissn><coden>IJSTGY</coden><abstract>Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. 
A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of them deal with the occurrence of this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring in phrase boundaries or sentence endings, two alternative extensions are proposed. The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors and spectral distortion. The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels and unit selection is applied during synthesis. In perception tests of short speech segments, both methods have been found to improve the baseline excitation in preference and similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/JSTSP.2013.2292037</doi><tpages>12</tpages></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1932-4553
ispartof	IEEE journal of selected topics in signal processing, 2014-04, Vol.8 (2), p.209-220
issn	1932-4553 ; 1941-0484
language	eng
recordid	cdi_proquest_journals_1507140449
source	IEEE/IET Electronic Library (IEL)
subjects	Biological system modeling ; Boundaries ; Creaky voice ; Excitation ; glottalization ; Hidden Markov models ; High-temperature superconductors ; HMM ; Internet ; irregular phonation ; Methods ; parametric ; Perception ; Phonation ; residual ; Similarity ; Speech ; speech processing ; Speech recognition ; Speech synthesis ; Synthesis ; Training ; vocal fry ; Voice ; voice quality ; Voice simulation
title	Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T18%3A25%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Modeling%20Irregular%20Voice%20in%20Statistical%20Parametric%20Speech%20Synthesis%20With%20Residual%20Codebook%20Based%20Excitation&rft.jtitle=IEEE%20journal%20of%20selected%20topics%20in%20signal%20processing&rft.au=Csapo,%20Tamas%20Gabor&rft.date=2014-04&rft.volume=8&rft.issue=2&rft.spage=209&rft.epage=220&rft.pages=209-220&rft.issn=1932-4553&rft.eissn=1941-0484&rft.coden=IJSTGY&rft_id=info:doi/10.1109/JSTSP.2013.2292037&rft_dat=%3Cproquest_RIE%3E3245799571%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1507140449&rft_id=info:pmid/&rft_ieee_id=6674045&rfr_iscdi=true
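
The abstract in this record names a rule-based method that applies pitch halving and amplitude scaling of residual periods with random factors. The sketch below is only an illustration of that idea, not the authors' implementation: the function name `irregularize`, the gain range (`lo`, `hi`), the fixed seed, and the choice to halve the pitch by silencing every second residual period are all assumptions made here, and the spectral distortion step is left out.

```python
import random
from typing import List

Period = List[float]  # one pitch-synchronous residual period (samples)


def irregularize(periods: List[Period], lo: float = 0.3, hi: float = 1.0,
                 seed: int = 0) -> List[Period]:
    """Hypothetical sketch of a rule-based irregular-voice transform.

    Assumption: pitch halving is approximated by zeroing every second
    residual period, so excitation pulses end up twice as far apart.
    Assumption: the remaining periods are scaled by a random gain drawn
    uniformly from [lo, hi]; the range is illustrative, not from the paper.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    out: List[Period] = []
    for i, period in enumerate(periods):
        if i % 2 == 1:
            # Silence every second period (pitch halving, as assumed above).
            out.append([0.0] * len(period))
        else:
            # Random amplitude scaling of the kept period.
            gain = rng.uniform(lo, hi)
            out.append([gain * s for s in period])
    return out


# Toy input: four identical residual periods of three samples each.
periods = [[1.0, -0.5, 0.2] for _ in range(4)]
result = irregularize(periods)
```

In this toy run the second and fourth periods come back as all zeros, while the first and third keep their shape at a reduced, randomly chosen amplitude.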