Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation
Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of t...
Published in: | IEEE journal of selected topics in signal processing 2014-04, Vol.8 (2), p.209-220 |
---|---|
Main authors: | Csapo, Tamas Gabor ; Nemeth, Geza |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 220 |
---|---|
container_issue | 2 |
container_start_page | 209 |
container_title | IEEE journal of selected topics in signal processing |
container_volume | 8 |
creator | Csapo, Tamas Gabor ; Nemeth, Geza |
description | Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output with speakers producing irregular phonation frequently. A number of excitation models have been proposed recently in the hidden Markov-model speech synthesis framework, but few of them deal with the occurrence of this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring in phrase boundaries or sentence endings, two alternative extensions are proposed. The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors and spectral distortion. The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels and unit selection is applied during synthesis. In perception tests of short speech segments, both methods have been found to improve the baseline excitation in preference and similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems. |
doi_str_mv | 10.1109/JSTSP.2013.2292037 |
format | Article |
fullrecord | Source: IEEE/IET Electronic Library (IEL). Record ID: TN_cdi_proquest_journals_1507140449. IEEE ID: 6674045. Type: article (peer reviewed). Publisher: New York: IEEE. ISSN: 1932-4553. EISSN: 1941-0484. CODEN: IJSTGY. Pages: 12. Rights: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Apr 2014. Link: https://ieeexplore.ieee.org/document/6674045 |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1932-4553 |
ispartof | IEEE journal of selected topics in signal processing, 2014-04, Vol.8 (2), p.209-220 |
issn | 1932-4553 ; 1941-0484 |
language | eng |
recordid | cdi_proquest_journals_1507140449 |
source | IEEE/IET Electronic Library (IEL) |
subjects | Biological system modeling ; Boundaries ; Creaky voice ; Excitation ; glottalization ; Hidden Markov models ; High-temperature superconductors ; HMM ; Internet ; irregular phonation ; Methods ; parametric ; Perception ; Phonation ; residual ; Similarity ; Speech ; speech processing ; Speech recognition ; Speech synthesis ; Synthesis ; Training ; vocal fry ; Voice ; voice quality ; Voice simulation |
title | Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T18%3A25%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Modeling%20Irregular%20Voice%20in%20Statistical%20Parametric%20Speech%20Synthesis%20With%20Residual%20Codebook%20Based%20Excitation&rft.jtitle=IEEE%20journal%20of%20selected%20topics%20in%20signal%20processing&rft.au=Csapo,%20Tamas%20Gabor&rft.date=2014-04&rft.volume=8&rft.issue=2&rft.spage=209&rft.epage=220&rft.pages=209-220&rft.issn=1932-4553&rft.eissn=1941-0484&rft.coden=IJSTGY&rft_id=info:doi/10.1109/JSTSP.2013.2292037&rft_dat=%3Cproquest_RIE%3E3245799571%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1507140449&rft_id=info:pmid/&rft_ieee_id=6674045&rfr_iscdi=true |
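
The rule-based extension named in the abstract (pitch halving plus amplitude scaling of residual periods with random factors) can be illustrated with a rough sketch. This is not the authors' implementation: the function name `irregularize`, the `scale_range` values, and the choice to realize pitch halving by silencing every second residual period are all assumptions made for illustration, and the spectral distortion step is left out.

```python
import random


def irregularize(periods, rng, scale_range=(0.3, 1.0)):
    """Return one flat excitation signal built from residual pitch periods.

    periods     -- list of lists of floats, one list per pitch period
    rng         -- a random.Random instance (passed in for reproducibility)
    scale_range -- hypothetical (low, high) bounds for the random gain
    """
    out = []
    for i, period in enumerate(periods):
        if i % 2 == 1:
            # Assumed form of pitch halving: silence every second period,
            # so excitation pulses arrive half as often.
            out.extend(0.0 for _ in period)
        else:
            # Random amplitude scaling of the period that is kept.
            gain = rng.uniform(*scale_range)
            out.extend(gain * sample for sample in period)
    return out


# Usage: four identical toy "periods" of four samples each.
toy_periods = [[1.0, 0.5, -0.5, -1.0] for _ in range(4)]
signal = irregularize(toy_periods, random.Random(0))
```

Passing the random generator in explicitly keeps the sketch repeatable. A real system would work on pitch-synchronous residual frames taken from the codebook rather than on toy lists.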