The sound of my voice: speaker representation loss for target voice separation
Main authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Content and style representations have been widely studied in the field of
style transfer. In this paper, we propose a new loss function that uses a speaker
content representation for audio source separation, which we call the speaker
representation loss. The objective is to extract the target speaker's voice from
the noisy input and also to remove it from the residual components. Compared to
conventional spectral reconstruction, the proposed framework maximizes the
use of target speaker information by minimizing the distance between the
speaker representations of the reference and the source separation output. We also
propose a triplet speaker representation loss as an additional criterion to
remove the target speaker information from the residual spectrogram output.
The VoiceFilter framework is adopted to evaluate source separation performance
on the VCTK database, and we achieve improved performance compared to the
baseline loss function without any additional network parameters. |
---|---|
DOI: | 10.48550/arxiv.1911.02411 |
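
The two criteria described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not state the distance measure, the margin, or how speaker embeddings are computed, so the cosine distance, the `margin` value, and all function names below are assumptions chosen for illustration. The embeddings are taken as given vectors, for example from a pretrained speaker encoder.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """Cosine distance between two embedding vectors (assumed metric)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def speaker_representation_loss(ref_emb, out_emb):
    """Distance between the reference speaker embedding and the embedding
    of the separated output; smaller means the output sounds more like
    the target speaker."""
    return cosine_distance(ref_emb, out_emb)


def triplet_speaker_representation_loss(ref_emb, out_emb, res_emb, margin=0.5):
    """Hinge-style triplet criterion (hypothetical form): pull the separated
    output towards the reference and push the residual away from it.
    `margin` is an illustrative value, not one taken from the paper."""
    d_pos = cosine_distance(ref_emb, out_emb)  # reference vs. separated output
    d_neg = cosine_distance(ref_emb, res_emb)  # reference vs. residual
    return max(0.0, d_pos - d_neg + margin)
```

In this form the triplet term is zero once the residual is at least `margin` farther from the reference than the separated output is, which matches the stated goal of removing target speaker information from the residual.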