Supervised Speaker Embedding De-Mixing in Two-Speaker Environment
Separating different speaker properties from a multi-speaker environment is challenging. Instead of separating a two-speaker signal in signal space like speech source separation, a speaker embedding de-mixing approach is proposed. The proposed approach separates different speaker properties from a t...
Gespeichert in:
Hauptverfasser: | , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Separating different speaker properties from a multi-speaker environment is
challenging. Instead of separating a two-speaker signal in signal space like
speech source separation, a speaker embedding de-mixing approach is proposed.
The proposed approach separates different speaker properties from a two-speaker
signal in embedding space. The proposed approach contains two steps. In step
one, the clean speaker embeddings are learned and collected by a residual TDNN
based network. In step two, the two-speaker signal and the embedding of one of
the speakers are both input to a speaker embedding de-mixing network. The
de-mixing network is trained to generate the embedding of the other speaker by
reconstruction loss. Speaker identification accuracy and the cosine similarity
score between the clean embeddings and the de-mixed embeddings are used to
evaluate the quality of the obtained embeddings. Experiments are done in two
kind of data: artificial augmented two-speaker data (TIMIT) and real world
recording of two-speaker data (MC-WSJ). Six different speaker embedding
de-mixing architectures are investigated. Comparing with the performance on the
clean speaker embeddings, the obtained results show that one of the proposed
architectures obtained close performance, reaching 96.9% identification
accuracy and 0.89 cosine similarity. |
---|---|
DOI: | 10.48550/arxiv.2001.06397 |