Axial Residual Networks for CycleGAN-based Voice Conversion
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | We propose a novel architecture and improved training objectives for
non-parallel voice conversion. Our proposed CycleGAN-based model performs a
shape-preserving transformation directly on a high frequency-resolution
magnitude spectrogram, converting its style (i.e. speaker identity) while
preserving the speech content. Throughout the entire conversion process, the
model does not resort to compressed intermediate representations of any sort
(e.g. mel spectrogram, low resolution spectrogram, decomposed network feature).
We propose an efficient axial residual block architecture to support this
expensive procedure and various modifications to the CycleGAN losses to
stabilize the training process. Experiments show that our proposed model
outperforms Scyclone and performs comparably to or better than CycleGAN-VC2,
even without employing a neural vocoder. |
---|---|
DOI: | 10.48550/arxiv.2102.08075 |
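
The summary describes a shape-preserving transformation on a high frequency-resolution magnitude spectrogram, trained with CycleGAN-style losses. The following is a minimal sketch of that setting, not the authors' implementation: it computes a magnitude spectrogram with NumPy and evaluates the standard, unmodified CycleGAN L1 cycle-consistency loss. The function names, the STFT settings (`n_fft=1024`, `hop=256`), and the toy generators are all assumptions made for illustration; the paper's axial residual blocks and its modified losses are not reproduced here.

```python
import numpy as np


def magnitude_spectrogram(signal, n_fft=1024, hop=256):
    """Magnitude STFT of shape (n_fft // 2 + 1, n_frames).

    n_fft and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames (one per column).
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    # Real FFT along the frame axis, keeping magnitudes only (phase dropped).
    return np.abs(np.fft.rfft(frames, axis=0))


def cycle_consistency_loss(x, g_xy, g_yx):
    """Standard CycleGAN L1 cycle loss: mean |G_yx(G_xy(x)) - x|."""
    return float(np.mean(np.abs(g_yx(g_xy(x)) - x)))


# Hypothetical stand-ins for the two generators. Real generators would be
# neural networks; these only demonstrate the shape-preserving contract.
def toy_g_xy(spec):
    return spec * 2.0


def toy_g_yx(spec):
    return spec * 0.5


rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)   # one second of noise at an assumed 16 kHz
spec = magnitude_spectrogram(wave)  # full-resolution spectrogram, no mel compression
converted = toy_g_xy(spec)          # output has the same shape as the input
loss = cycle_consistency_loss(spec, toy_g_xy, toy_g_yx)
```

Because the toy generators are exact inverses of each other, the cycle loss here is zero; with trained networks it is a penalty that encourages the converted spectrogram to retain enough content to be mapped back to the source.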