Planning the development of text-to-speech synthesis models and datasets with dynamic deep learning

Bibliographic details
Published in: Journal of King Saud University. Computer and information sciences 2024-09, Vol.36 (7), p.102131, Article 102131
Main authors: Ahmad, Hawraz A.; Rashid, Tarik A.
Format: Article
Language: English
Online access: Full text
Description
Summary: Text-to-speech (TTS) synthesis is the process of converting natural-language text into speech. Speech synthesisers face a major challenge in recognizing the prosodic elements of written text, such as intonation (the rise and fall of the voice in speaking) and length. In contrast, continuous speech features are influenced by the personality and emotions of the artist. A database is maintained to store the synthesized speech pieces. Its output is determined by how similarly the person utters the words and how capable they are of being implied. In the past few years, the field of text-to-speech synthesis has been heavily impacted by the emergence of deep learning, an AI technology that has gained widespread popularity. This review paper presents a taxonomy of deep learning-based models and architectures and discusses the various datasets used in the TTS process. It also covers the commonly used evaluation metrics. The paper ends with a look at future directions for such systems and identifies some deep learning models that give promising results in this field.
ISSN:1319-1578
DOI:10.1016/j.jksuci.2024.102131