Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study
Several high-resource Text to Speech (TTS) systems currently produce natural, well-established human-like speech. In contrast, low-resource languages, including Arabic, have very limited TTS systems due to the lack of resources. We propose a fully unsupervised method for building TTS, including auto...
Main authors: | Baali, Massa ; Hayashi, Tomoki ; Mubarak, Hamdy ; Maiti, Soumi ; Watanabe, Shinji ; El-Hajj, Wassim ; Ali, Ahmed |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | |
---|---|
creator | Baali, Massa ; Hayashi, Tomoki ; Mubarak, Hamdy ; Maiti, Soumi ; Watanabe, Shinji ; El-Hajj, Wassim ; Ali, Ahmed |
description | Several high-resource Text to Speech (TTS) systems currently produce natural,
well-established human-like speech. In contrast, low-resource languages,
including Arabic, have very limited TTS systems due to the lack of resources.
We propose a fully unsupervised method for building TTS, including automatic
data selection and pre-training/fine-tuning strategies for TTS training, using
broadcast news as a case study. We show how careful selection of data, even in
smaller amounts, can improve the efficiency of a TTS system in generating more
natural speech than a system trained on a bigger dataset. We propose
different approaches for 1) data: we applied automatic annotations using
DNSMOS, automatic vowelization, and automatic speech recognition (ASR) for
fixing transcription errors; 2) model: we used transfer learning from a
high-resource language in the TTS model and fine-tuned it with one hour of broadcast
recordings, then used this model to guide a FastSpeech2-based Conformer model
for duration. Our objective evaluation shows a 3.9% character error rate (CER),
while the ground truth has a 1.3% CER. In the subjective evaluation, where 1
is bad and 5 is excellent, our FastSpeech2-based Conformer model achieved a
mean opinion score (MOS) of 4.4 for intelligibility and 4.2 for naturalness,
and many annotators recognized the voice of the broadcaster, which proves the
effectiveness of our proposed unsupervised method. |
doi_str_mv | 10.48550/arxiv.2301.09099 |
format | Article |
date | 2023-01-22 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
oa | free_for_read |
sourcetype | Open Access Repository |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2301.09099 |
language | eng |
recordid | cdi_arxiv_primary_2301_09099 |
source | arXiv.org |
subjects | Computer Science - Computation and Language ; Computer Science - Sound |
title | Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-12T10%3A06%3A26IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unsupervised%20Data%20Selection%20for%20TTS:%20Using%20Arabic%20Broadcast%20News%20as%20a%20Case%20Study&rft.au=Baali,%20Massa&rft.date=2023-01-22&rft_id=info:doi/10.48550/arxiv.2301.09099&rft_dat=%3Carxiv_GOX%3E2301_09099%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |