Spectral mismatch as the index of quality of naturalness in synthetic speech

Bibliographic details
Main authors: Kawachale, S.P., Gengaje, S.R., Chitode, J.S.
Format: Conference proceeding
Language: English
Description
Abstract: It is extremely difficult to build a machine that sounds identical to a human. Even the best text-to-speech (TTS) algorithms sound robotic unless human speech itself is used in the synthesis. However, it is not possible to create a database of every possible word in a language. Syllable-based concatenative speech synthesis (CSS) forms new words from existing words in the database. Improper concatenation with respect to the position of the syllable leads to spectral mismatch. A first step toward overcoming this is to estimate spectral mismatch with respect to syllable position. We propose a method based on power spectral density (PSD) to estimate position-dependent spectral mismatch. The PSD is plotted for 10-millisecond samples of original, properly concatenated (PC), and improperly concatenated (IC) words. These samples are then made noise-free so that their low-amplitude peaks can be neglected. Analysis of the PSD locates the formants in the given samples, and the formants of the original, PC, and IC words are then plotted. The formant plots for original and properly concatenated words are very similar for all formants, whereas improper concatenation produces extra peaks in all formants. These extra peaks can be taken as an estimate of spectral mismatch. The results are validated using Marathi text-to-speech synthesis.
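The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not give the sample rate, window, noise-removal procedure, or formant-picking rule, so the periodic Hann window, the relative power floor `rel_floor` (standing in for "made noise-free"), the matching tolerance `tol_hz`, and all function names are assumptions chosen for illustration. PSD peaks are used here only as a rough stand-in for formants.

```python
import cmath
import math


def frame_length(sample_rate, frame_ms=10):
    """Number of samples in one analysis frame (10 ms, as in the abstract)."""
    return round(frame_ms * sample_rate / 1000)


def periodogram(frame, sample_rate):
    """Return (freqs_hz, psd) for one frame using a Hann-windowed DFT."""
    n = len(frame)
    # Periodic Hann window (an assumption; the abstract names no window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    xw = [s * w for s, w in zip(frame, window)]
    norm = sample_rate * sum(w * w for w in window)
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(xw))
        freqs.append(k * sample_rate / n)
        psd.append(abs(acc) ** 2 / norm)
    return freqs, psd


def spectral_peaks(freqs, psd, rel_floor=0.05):
    """Frequencies of local PSD maxima above rel_floor * max(psd).

    The floor discards low-amplitude peaks; it is an assumed stand-in
    for the noise-removal step mentioned in the abstract.
    """
    floor = rel_floor * max(psd)
    return [freqs[k] for k in range(1, len(psd) - 1)
            if psd[k] > floor and psd[k] > psd[k - 1] and psd[k] >= psd[k + 1]]


def extra_peaks(reference_frame, test_frame, sample_rate,
                rel_floor=0.05, tol_hz=150.0):
    """Peaks in test_frame with no reference peak within tol_hz."""
    ref = spectral_peaks(*periodogram(reference_frame, sample_rate), rel_floor)
    tst = spectral_peaks(*periodogram(test_frame, sample_rate), rel_floor)
    return [f for f in tst if all(abs(f - r) > tol_hz for r in ref)]
```

Under these assumptions, `extra_peaks(original_frame, concatenated_frame, sample_rate)` returns the peak frequencies present only in the concatenated frame; an empty list would correspond to the properly concatenated case, and a non-empty list to the extra peaks reported for improper concatenation. A 10 ms frame gives a frequency resolution of only 100 Hz, so closely spaced peaks may merge.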
ISSN: 1555-5798, 2154-5952
DOI: 10.1109/PACRIM.2009.5291267