CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
Main Authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | To further improve the speaking styles of synthesized speech, current
text-to-speech (TTS) synthesis systems commonly stylize their outputs with
reference speeches rather than relying on the input text alone. These reference
speeches are either selected manually, which is resource-consuming, or selected
by semantic features. However, semantic features contain not only style-related
information but also style-irrelevant information. The style-irrelevant
information in the text can interfere with reference audio selection and lead
to improper speaking styles. To improve reference selection, we propose the
Contrastive Acoustic-Linguistic Module (CALM) to extract a Style-related Text
Feature (STF) from the text. CALM optimizes the correlation between the
speaking style embedding and the extracted STF with contrastive learning. A
certain number of the most appropriate reference speeches for the input text
are then selected by retrieving the speeches with the top STF similarities, and
their style embeddings are summed with weights given by those similarities to
stylize the synthesized speech of the TTS system. Experimental results
demonstrate the effectiveness of the proposed approach: in both objective and
subjective evaluations, the speaking styles of the synthesized speeches
outperform a baseline with semantic-feature-based reference selection. |
---|---|
DOI: | 10.48550/arxiv.2308.16021 |
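
The abstract describes two mechanisms: aligning a Style-related Text Feature (STF) with speaking style embeddings via contrastive learning, and at inference time retrieving the top-similarity reference speeches and fusing their style embeddings with similarity-derived weights. The PyTorch sketch below is only an illustration of those two steps under assumptions not stated in the record: the names `StyleTextEncoder`, `calm_contrastive_loss`, and `select_references`, the symmetric InfoNCE-style loss, the softmax weighting, and all dimensions are hypothetical, not the paper's actual architecture.

```python
# Hypothetical sketch of CALM-style contrastive alignment and reference selection.
# All module/function names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StyleTextEncoder(nn.Module):
    """Maps a semantic text feature to a Style-related Text Feature (STF)."""

    def __init__(self, text_dim: int, style_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, style_dim),
            nn.ReLU(),
            nn.Linear(style_dim, style_dim),
        )

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(text_feat)


def calm_contrastive_loss(stf: torch.Tensor,
                          style_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: paired (text, speech) items in the batch are
    positives, all other pairings are negatives."""
    stf = F.normalize(stf, dim=-1)
    style_emb = F.normalize(style_emb, dim=-1)
    logits = stf @ style_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(stf.size(0), device=stf.device)
    # Symmetric cross-entropy over text->speech and speech->text directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def select_references(query_stf: torch.Tensor,
                      bank_stf: torch.Tensor,
                      bank_style_emb: torch.Tensor,
                      k: int = 5) -> torch.Tensor:
    """Retrieve the top-k reference speeches by STF cosine similarity and
    return the similarity-weighted sum of their style embeddings."""
    sims = F.normalize(query_stf, dim=-1) @ F.normalize(bank_stf, dim=-1).t()  # (1, N)
    top_sims, top_idx = sims.topk(k, dim=-1)
    weights = F.softmax(top_sims, dim=-1)                 # (1, k)
    return weights @ bank_style_emb[top_idx.squeeze(0)]   # (1, style_dim)


if __name__ == "__main__":
    B, text_dim, style_dim = 8, 768, 128
    encoder = StyleTextEncoder(text_dim, style_dim)
    text_feat = torch.randn(B, text_dim)          # e.g. sentence-level semantic features
    style_emb = torch.randn(B, style_dim)         # e.g. reference-encoder style outputs
    loss = calm_contrastive_loss(encoder(text_feat), style_emb)
    loss.backward()

    # Inference: select references for a new sentence from a bank of 100 speeches.
    bank_stf = torch.randn(100, style_dim)
    bank_style = torch.randn(100, style_dim)
    query = encoder(torch.randn(1, text_dim))
    fused_style = select_references(query, bank_stf, bank_style, k=5)
    print(loss.item(), fused_style.shape)         # scalar loss, torch.Size([1, 128])
```

In this sketch the fused style vector would condition the TTS decoder; the softmax over top-k similarities is one plausible reading of "weighted summation by STF similarity" and may differ from the authors' exact weighting scheme.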