Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation


Detailed description

Bibliographic details
Published in: IEEE journal of selected topics in signal processing 2014-04, Vol.8 (2), p.209-220
Main authors: Csapo, Tamas Gabor; Nemeth, Geza
Format: Article
Language: eng
container_end_page 220
container_issue 2
container_start_page 209
container_title IEEE journal of selected topics in signal processing
container_volume 8
creator Csapo, Tamas Gabor
Nemeth, Geza
description Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of them deal with the occurrence of this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring in phrase boundaries or sentence endings, two alternative extensions are proposed. The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors and spectral distortion. The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels and unit selection is applied during synthesis. In perception tests of short speech segments, both methods have been found to improve the baseline excitation in preference and similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.
doi_str_mv 10.1109/JSTSP.2013.2292037
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_1507140449</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6674045</ieee_id><sourcerecordid>3245799571</sourcerecordid><originalsourceid>FETCH-LOGICAL-c328t-717ea95dd14a351f5c572089fe83741d5305a9b647bc5ae2dd5d14867063e68c3</originalsourceid><addsrcrecordid>eNpdkE1PwkAURRujiYj-Ad1M4sZNcT477VIJKgYjsajLZpg-YKB0cKZN5N87CHHh6t3FuScvN4ouCe4RgrPb53ySj3sUE9ajNKOYyaOoQzJOYsxTfrzLjMZcCHYanXm_xFjIhPBOtHqxJVSmnqOhczBvK-XQhzUakKlR3qjG-MZoVaGxcmoNjTMa5RsAvUD5tm4W4I1Hn6ZZoLcQyzaQ_WCcWrtC98pDiQbf2uw8tj6PTmaq8nBxuN3o_WEw6T_Fo9fHYf9uFGtG0yaWRILKRFkSrpggM6GFpDjNZpAyyUkpGBYqmyZcTrVQQMtSBDRNJE4YJKlm3ehm7904-9WCb4q18RqqStVgW18QQXEWcE4Cev0PXdrW1eG7QGFJOOY8CxTdU9pZ7x3Mio0za-W2BcHFbv_id_9it39x2D-UrvYlAwB_hSSRwSnYD7ULgV4</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1507140449</pqid></control><display><type>article</type><title>Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation</title><source>IEEE/IET Electronic Library (IEL)</source><creator>Csapo, Tamas Gabor ; Nemeth, Geza</creator><creatorcontrib>Csapo, Tamas Gabor ; Nemeth, Geza</creatorcontrib><description>Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of them deal with the occurrence of this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring in phrase boundaries or sentence endings, two alternative extensions are proposed. 
The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors and spectral distortion. The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels and unit selection is applied during synthesis. In perception tests of short speech segments, both methods have been found to improve the baseline excitation in preference and similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.</description><identifier>ISSN: 1932-4553</identifier><identifier>EISSN: 1941-0484</identifier><identifier>DOI: 10.1109/JSTSP.2013.2292037</identifier><identifier>CODEN: IJSTGY</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Biological system modeling ; Boundaries ; Creaky voice ; Excitation ; glottalization ; Hidden Markov models ; High-temperature superconductors ; HMM ; Internet ; irregular phonation ; Methods ; parametric ; Perception ; Phonation ; residual ; Similarity ; Speech ; speech processing ; Speech recognition ; Speech synthesis ; Synthesis ; Training ; vocal fry ; Voice ; voice quality ; Voice simulation</subject><ispartof>IEEE journal of selected topics in signal processing, 2014-04, Vol.8 (2), p.209-220</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) Apr 2014</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c328t-717ea95dd14a351f5c572089fe83741d5305a9b647bc5ae2dd5d14867063e68c3</citedby><cites>FETCH-LOGICAL-c328t-717ea95dd14a351f5c572089fe83741d5305a9b647bc5ae2dd5d14867063e68c3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6674045$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27901,27902,54733</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6674045$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Csapo, Tamas Gabor</creatorcontrib><creatorcontrib>Nemeth, Geza</creatorcontrib><title>Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation</title><title>IEEE journal of selected topics in signal processing</title><addtitle>JSTSP</addtitle><description>Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of them deal with the occurrence of this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring in phrase boundaries or sentence endings, two alternative extensions are proposed. The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors and spectral distortion. 
The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels and unit selection is applied during synthesis. In perception tests of short speech segments, both methods have been found to improve the baseline excitation in preference and similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.</description><subject>Biological system modeling</subject><subject>Boundaries</subject><subject>Creaky voice</subject><subject>Excitation</subject><subject>glottalization</subject><subject>Hidden Markov models</subject><subject>High-temperature superconductors</subject><subject>HMM</subject><subject>Internet</subject><subject>irregular phonation</subject><subject>Methods</subject><subject>parametric</subject><subject>Perception</subject><subject>Phonation</subject><subject>residual</subject><subject>Similarity</subject><subject>Speech</subject><subject>speech processing</subject><subject>Speech recognition</subject><subject>Speech synthesis</subject><subject>Synthesis</subject><subject>Training</subject><subject>vocal fry</subject><subject>Voice</subject><subject>voice quality</subject><subject>Voice 
simulation</subject><issn>1932-4553</issn><issn>1941-0484</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2014</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpdkE1PwkAURRujiYj-Ad1M4sZNcT477VIJKgYjsajLZpg-YKB0cKZN5N87CHHh6t3FuScvN4ouCe4RgrPb53ySj3sUE9ajNKOYyaOoQzJOYsxTfrzLjMZcCHYanXm_xFjIhPBOtHqxJVSmnqOhczBvK-XQhzUakKlR3qjG-MZoVaGxcmoNjTMa5RsAvUD5tm4W4I1Hn6ZZoLcQyzaQ_WCcWrtC98pDiQbf2uw8tj6PTmaq8nBxuN3o_WEw6T_Fo9fHYf9uFGtG0yaWRILKRFkSrpggM6GFpDjNZpAyyUkpGBYqmyZcTrVQQMtSBDRNJE4YJKlm3ehm7904-9WCb4q18RqqStVgW18QQXEWcE4Cev0PXdrW1eG7QGFJOOY8CxTdU9pZ7x3Mio0za-W2BcHFbv_id_9it39x2D-UrvYlAwB_hSSRwSnYD7ULgV4</recordid><startdate>201404</startdate><enddate>201404</enddate><creator>Csapo, Tamas Gabor</creator><creator>Nemeth, Geza</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SP</scope><scope>8FD</scope><scope>H8D</scope><scope>L7M</scope></search><sort><creationdate>201404</creationdate><title>Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation</title><author>Csapo, Tamas Gabor ; Nemeth, Geza</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c328t-717ea95dd14a351f5c572089fe83741d5305a9b647bc5ae2dd5d14867063e68c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2014</creationdate><topic>Biological system modeling</topic><topic>Boundaries</topic><topic>Creaky voice</topic><topic>Excitation</topic><topic>glottalization</topic><topic>Hidden Markov models</topic><topic>High-temperature superconductors</topic><topic>HMM</topic><topic>Internet</topic><topic>irregular 
phonation</topic><topic>Methods</topic><topic>parametric</topic><topic>Perception</topic><topic>Phonation</topic><topic>residual</topic><topic>Similarity</topic><topic>Speech</topic><topic>speech processing</topic><topic>Speech recognition</topic><topic>Speech synthesis</topic><topic>Synthesis</topic><topic>Training</topic><topic>vocal fry</topic><topic>Voice</topic><topic>voice quality</topic><topic>Voice simulation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Csapo, Tamas Gabor</creatorcontrib><creatorcontrib>Nemeth, Geza</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE/IET Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Technology Research Database</collection><collection>Aerospace Database</collection><collection>Advanced Technologies Database with Aerospace</collection><jtitle>IEEE journal of selected topics in signal processing</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Csapo, Tamas Gabor</au><au>Nemeth, Geza</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation</atitle><jtitle>IEEE journal of selected topics in signal processing</jtitle><stitle>JSTSP</stitle><date>2014-04</date><risdate>2014</risdate><volume>8</volume><issue>2</issue><spage>209</spage><epage>220</epage><pages>209-220</pages><issn>1932-4553</issn><eissn>1941-0484</eissn><coden>IJSTGY</coden><abstract>Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. 
A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of them deal with the occurrence of this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring in phrase boundaries or sentence endings, two alternative extensions are proposed. The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors and spectral distortion. The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels and unit selection is applied during synthesis. In perception tests of short speech segments, both methods have been found to improve the baseline excitation in preference and similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/JSTSP.2013.2292037</doi><tpages>12</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1932-4553
ispartof IEEE journal of selected topics in signal processing, 2014-04, Vol.8 (2), p.209-220
issn 1932-4553
1941-0484
language eng
recordid cdi_proquest_journals_1507140449
source IEEE/IET Electronic Library (IEL)
subjects Biological system modeling
Boundaries
Creaky voice
Excitation
glottalization
Hidden Markov models
High-temperature superconductors
HMM
Internet
irregular phonation
Methods
parametric
Perception
Phonation
residual
Similarity
Speech
speech processing
Speech recognition
Speech synthesis
Synthesis
Training
vocal fry
Voice
voice quality
Voice simulation
title Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T18%3A25%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Modeling%20Irregular%20Voice%20in%20Statistical%20Parametric%20Speech%20Synthesis%20With%20Residual%20Codebook%20Based%20Excitation&rft.jtitle=IEEE%20journal%20of%20selected%20topics%20in%20signal%20processing&rft.au=Csapo,%20Tamas%20Gabor&rft.date=2014-04&rft.volume=8&rft.issue=2&rft.spage=209&rft.epage=220&rft.pages=209-220&rft.issn=1932-4553&rft.eissn=1941-0484&rft.coden=IJSTGY&rft_id=info:doi/10.1109/JSTSP.2013.2292037&rft_dat=%3Cproquest_RIE%3E3245799571%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1507140449&rft_id=info:pmid/&rft_ieee_id=6674045&rfr_iscdi=true