Multi-Voice Singing Synthesis From Lyrics

Bibliographic details
Published in:	Circuits, Systems, and Signal Processing, 2023, Vol. 42 (1), p. 307-321
Main authors:	Resna, S.; Rajan, Rajeev
Format:	Article
Language:	English
Subjects:	Acoustics; Annotations; Circuits and Systems; Coders; Electrical Engineering; Electronics and Microelectronics; Engineering; Generative adversarial networks; Instrumentation; Intelligibility; Lyrics; Modules; Multilingualism; Phonemes; Phonetics; Signal processing; Signal, Image and Speech Processing; Singers; Singing; Speech; Speech recognition; Speech synthesis; Synthesis; Text-to-speech
Online access:	Full text

Description
Summary:	This paper proposes a multi-voice singing synthesis framework that converts lyrics to their sung version in a target speaker’s voice. It consists of three blocks: a text-to-speech (TTS) module, a speech-to-singing (STS) module, and an intelligibility enhancement module. In the front end, a TTS converter generates synthesized speech from the lyrics in the target speaker’s voice. The STS module then synthesizes a sung version in the target melody through an encoder–decoder model. Finally, an intelligibility enhancement module based on an audio style transfer scheme improves phonetic intelligibility. The proposed system is evaluated on the LibriSpeech and NUS-48E corpora using both subjective and objective measures. The model is compared with a state-of-the-art multi-voice singing synthesis model based on a generative adversarial network (GAN). The study shows that the proposed model performs on par with the baseline without requiring any phoneme annotations.
ISSN:	0278-081X 1531-5878
DOI:	10.1007/s00034-022-02122-3