RMSSinger: Realistic-Music-Score based Singing Voice Synthesis
We are interested in a challenging task, Realistic-Music-Score based Singing Voice Synthesis (RMS-SVS). RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types (grace, slur, rest, etc.). Though significant progress has been achieved, recent singing...
Gespeichert in:
Hauptverfasser: | , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | We are interested in a challenging task, Realistic-Music-Score based Singing
Voice Synthesis (RMS-SVS). RMS-SVS aims to generate high-quality singing voices
given realistic music scores with different note types (grace, slur, rest,
etc.). Though significant progress has been achieved, recent singing voice
synthesis (SVS) methods are limited to fine-grained music scores, which require
a complicated data collection pipeline with time-consuming manual annotation to
align music notes with phonemes. Furthermore, these manual annotation destroys
the regularity of note durations in music scores, making fine-grained music
scores inconvenient for composing. To tackle these challenges, we propose
RMSSinger, the first RMS-SVS method, which takes realistic music scores as
input, eliminating most of the tedious manual annotation and avoiding the
aforementioned inconvenience. Note that music scores are based on words rather
than phonemes, in RMSSinger, we introduce word-level modeling to avoid the
time-consuming phoneme duration annotation and the complicated phoneme-level
mel-note alignment. Furthermore, we propose the first diffusion-based pitch
modeling method, which ameliorates the naturalness of existing pitch-modeling
methods. To achieve these, we collect a new dataset containing realistic music
scores and singing voices according to these realistic music scores from
professional singers. Extensive experiments on the dataset demonstrate the
effectiveness of our methods. Audio samples are available at
https://rmssinger.github.io/. |
---|---|
DOI: | 10.48550/arxiv.2305.10686 |