Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition
Main authors: | , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Abstract: | We often express emotions verbally in a multifaceted manner: they may
vary in intensity and may be expressed not as a single emotion but as a mixture
of emotions. This wide spectrum of emotions is well studied in the structural
model of emotions, which represents a variety of emotions as derivative products
of primary emotions with varying degrees of intensity. In this paper, we
propose an emotional text-to-speech design to simulate a wider spectrum of
emotions grounded on the structural model. Our proposed design, Daisy-TTS,
incorporates a prosody encoder to learn an emotionally separable prosody
embedding as a proxy for emotion. This emotion representation allows the model
to simulate: (1) primary emotions, as learned from the training samples; (2)
secondary emotions, as a mixture of primary emotions; (3) intensity level, by
scaling the emotion embedding; and (4) emotion polarity, by negating the
emotion embedding. Through a series of perceptual evaluations, Daisy-TTS
demonstrated overall higher emotional speech naturalness and emotion
perceivability compared to the baseline. |
---|---|
DOI: | 10.48550/arxiv.2402.14523 |
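
The four operations named in the abstract (primary, mixture, scaling, negation) can be pictured as simple vector arithmetic on an emotion embedding. The following is a minimal sketch under stated assumptions, not the authors' implementation: the embeddings are random stand-ins rather than outputs of a trained prosody encoder, the emotion labels and `EMBED_DIM` are illustrative, and the weighted-blend rule in `mix` is an assumed way to form a mixture.

```python
import math
import random

# Hypothetical stand-ins for learned primary-emotion embeddings. In a real
# system these vectors would come from a trained prosody encoder.
random.seed(0)
EMBED_DIM = 8  # assumed size, chosen only for illustration
PRIMARY = {
    name: [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for name in ("joy", "sadness", "anger", "surprise")  # illustrative labels
}


def mix(a, b, weight=0.5):
    """Secondary emotion: weighted blend of two primary embeddings (assumed rule)."""
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


def scale(e, intensity):
    """Intensity level: multiply the embedding by a scalar."""
    return [intensity * x for x in e]


def negate(e):
    """Polarity: flip the sign of every component."""
    return [-x for x in e]


def cosine(a, b):
    """Cosine similarity, used here only to inspect the resulting vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical examples of the three derived cases.
blended = mix(PRIMARY["joy"], PRIMARY["sadness"])
strong_joy = scale(PRIMARY["joy"], 2.0)
anti_joy = negate(PRIMARY["joy"])
```

In this sketch, scaling by a positive factor changes only the length of the vector, so `cosine(strong_joy, PRIMARY["joy"])` stays at 1, while negation reverses its direction and gives a cosine of -1. How a synthesizer maps these vectors to audible prosody is outside what the sketch covers.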