Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis
Published in: Speech Communication, May 2018, Vol. 99, pp. 135-143
Format: Article
Language: English
Highlights:
• We study the impact of adding large-scale listeners' perceptual annotations to the emotional speech modeling process.
• We consider a number of different emotional representations that exploit this perceptual information; these representations also support manipulating the modeled emotion at synthesis time.
• Two large-scale perceptual evaluations were carried out: one to evaluate modeling accuracy and another to evaluate control capabilities at synthesis time.
• We show that adding perceptual information based on listeners' annotations significantly improves emotional speech modeling accuracy.
• We also show that the proposed representations provide notable emotional control capabilities: they allow us to control both emotion recognition rates and perceived emotional strength without degrading the quality of the produced speech.
Abstract: In this paper, we investigate the simultaneous modeling of multiple emotions in DNN-based expressive speech synthesis, and how to represent the emotional labels, such as emotional class and strength, for this task. Our goal is to answer two questions: First, what is the best way to annotate speech data with multiple emotions – should we use the labels that the speaker intended to express, or labels based on listener perception of the resulting speech signals? Second, how should the emotional information be represented as labels for supervised DNN training, e.g., should emotional class and emotional strength be factorized into separate inputs or not? We evaluate on a large-scale corpus of emotional speech from a professional voice actress, additionally annotated with perceived emotional labels from crowdsourced listeners. By comparing DNN-based speech synthesizers that utilize different emotional representations, we assess the impact of these representations and design decisions on human emotion recognition rates, perceived emotional strength, and subjective speech quality. Simultaneously, we also study which representations are most appropriate for controlling the emotional strength of synthetic speech.
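The abstract's central design question – how to encode emotional class and strength as auxiliary DNN inputs – can be made concrete with a small sketch. The Python snippet below is illustrative only: the emotion inventory, function names, and vote format are assumptions for the example, not taken from the paper. It contrasts intended versus perceived labels, and a factorized (class + separate scalar strength) versus non-factorized (strength folded into the class vector) representation.

```python
import numpy as np

# Hypothetical emotion inventory; the abstract does not list the actual classes.
EMOTIONS = ["neutral", "happy", "sad", "angry"]

def intended_label(emotion):
    """Hard one-hot vector from the emotion the speaker intended to express."""
    vec = np.zeros(len(EMOTIONS), dtype=np.float32)
    vec[EMOTIONS.index(emotion)] = 1.0
    return vec

def perceived_label(votes):
    """Soft label from crowdsourced listener votes, e.g. {"happy": 7, "neutral": 3}.
    The per-class vote fraction doubles as a perceived-strength estimate."""
    vec = np.array([votes.get(e, 0) for e in EMOTIONS], dtype=np.float32)
    return vec / max(vec.sum(), 1.0)

def factorized_input(class_vec, strength):
    """Factorized representation: emotion class and a separate scalar strength,
    concatenated into one auxiliary input vector for the acoustic model."""
    return np.concatenate([class_vec, [strength]]).astype(np.float32)

def scaled_input(class_vec, strength):
    """Non-factorized alternative: strength folded into the class vector by scaling."""
    return (strength * class_vec).astype(np.float32)

# At synthesis time, sweeping the strength value gives a control knob for
# perceived emotional intensity while the emotion class is held fixed.
happy = intended_label("happy")
for s in (0.25, 0.5, 1.0):
    print(s, factorized_input(happy, s))
```

Under this sketch, the paper's two questions map onto which labeling function feeds training (intended_label vs. perceived_label) and which input builder the synthesizer uses (factorized_input vs. scaled_input).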
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2018.03.002