Quality prediction of synthesized speech based on perceptual quality dimensions
Published in: | Speech Communication 2015-02, Vol.66 (Feb), p.17-35 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | • The feasibility of non-intrusive quality prediction for synthetic speech, which only requires acoustical measurements, is shown. • Diagnostic structuring of the perceptual quality space of TTS sound through perceptual quality dimensions. • A new framework for non-intrusive quality assessment, denoted as perceptual regularization, is introduced. • Detailed comparison of different model types, feature groups, and model assessment schemes. • Insight into model robustness using a large amount of subjective test data (3 subjective tests, 177 rated TTS stimuli).
Instrumental speech-quality prediction for text-to-speech signals is explored in a twofold manner. First, the perceptual quality space of TTS is structured by means of three perceptual quality dimensions which are derived from multiple auditory tests. Second, quality-prediction models are evaluated for each dimension using prosodic and MFCC-based measurands. Linear and nonlinear model types are compared under cross-validation restrictions, giving detailed insight into model-generalizability aspects. Perceptually regularized properties, denoted as quality elements, are introduced in order to encode the quality-indicative effect of individual signal characteristics. These elements integrate a perceptual model reference which is derived in a semi-supervised fashion from natural and synthetic speech. The results highlight the feasibility of instrumental quality prediction for TTS signals provided that broad training material is employed. High prediction accuracy, however, requires nonlinear model structures. |
---|---|
ISSN: | 0167-6393 1872-7182 |
DOI: | 10.1016/j.specom.2014.06.003 |
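
The abstract's comparison of linear and nonlinear model types under cross-validation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the single feature stands in for the prosodic and MFCC-based measurands, k-nearest-neighbour regression stands in for "a nonlinear model", and holding out one whole listening test per fold is only one possible reading of "cross-validation restrictions". The function names (`fit_linear`, `fit_knn`, `leave_one_group_out`, `make_toy_data`) are hypothetical and do not come from the article.

```python
import math
import random


def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b


def fit_knn(xs, ys, k=5):
    """k-nearest-neighbour regression, used here as a simple nonlinear model."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def rmse(model, xs, ys):
    """Root-mean-square error of a fitted model on (xs, ys)."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def leave_one_group_out(data, fit):
    """data: list of (group_id, x, y). Returns {held_out_group: RMSE}."""
    scores = {}
    for held_out in sorted({g for g, _, _ in data}):
        train = [(x, y) for g, x, y in data if g != held_out]
        test = [(x, y) for g, x, y in data if g == held_out]
        model = fit([x for x, _ in train], [y for _, y in train])
        scores[held_out] = rmse(model, [x for x, _ in test], [y for _, y in test])
    return scores


def make_toy_data(seed=0, per_group=50):
    """Synthetic stand-in: three 'listening tests', one feature, nonlinear target."""
    rng = random.Random(seed)
    data = []
    for group in range(3):
        for _ in range(per_group):
            x = rng.uniform(-3.0, 3.0)
            # Assumed toy relation between feature and rating, plus rating noise.
            y = 4.5 - 0.4 * x * x + rng.gauss(0.0, 0.2)
            data.append((group, x, y))
    return data


if __name__ == "__main__":
    data = make_toy_data()
    for name, fit in (("linear", fit_linear), ("k-NN", fit_knn)):
        print(name, leave_one_group_out(data, fit))
```

Because the toy target is constructed to be nonlinear, the k-NN model should show the lower held-out error in every fold; this only demonstrates the evaluation scheme and says nothing about the article's actual results.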