State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations


Bibliographic Details
Published in: Computer Speech & Language, 2020-03, Vol. 60, p. 101026, Article 101026
Authors: Villalba, Jesús, Chen, Nanxin, Snyder, David, Garcia-Romero, Daniel, McCree, Alan, Sell, Gregory, Borgstrom, Jonas, García-Perera, Leibny Paola, Richardson, Fred, Dehak, Réda, Torres-Carrasquillo, Pedro A., Dehak, Najim
Format: Article
Language: English
Abstract:

Highlights:
• Neural network embeddings become the new state of the art in speaker recognition evaluations, improving on i-vector performance by a factor of 2 in some conditions.
• Comparing network architectures for x-vectors, the factorized TDNN performed best in a moderately large setup. However, E-TDNN can also be competitive with a larger training setup.
• Comparing pooling methods, the learnable dictionary encoder performed best, indicating that we can take advantage of multi-modal frame-level hidden representations.
• Angular-margin training objectives performed better in in-domain conditions but not in domain-mismatched conditions.
• Calibration in a new domain can be achieved by MAP adaptation of the out-of-domain score distribution to the new domain using a very limited number of in-domain recordings.

We present a thorough analysis of the systems developed by the JHU-MIT consortium in the context of the NIST Speaker Recognition Evaluation 2018. In the previous NIST evaluation, in 2016, i-vectors were the speaker recognition state of the art. Now, however, neural network embeddings (a.k.a. x-vectors) have risen as the best-performing approach. We show that in some conditions, x-vectors' detection error is reduced by a factor of 2 w.r.t. i-vectors. In this work, we experimented on the Speakers In The Wild (SITW) evaluation, NIST SRE18 VAST (Video Annotation for Speech Technology), and SRE18 CMN2 (Call My Net 2, telephone Tunisian Arabic) to compare network architectures, pooling layers, training objectives, back-end adaptation methods, and calibration techniques. x-Vectors based on factorized and extended TDNN networks achieved unrivaled performance on SITW and CMN2 data. However, for VAST, performance was significantly worse than for SITW. We noted that the VAST audio quality was severely degraded compared to SITW, even though both consist of Internet videos. This degradation caused a strong domain mismatch between the training and VAST data.
Due to this mismatch, large networks performed only slightly better than smaller ones, which also complicated VAST calibration. However, we managed to calibrate VAST by adapting the SITW score distribution to VAST using a small amount of in-domain development data. Regarding pooling methods, the learnable dictionary encoder performed best, suggesting that the representations learned by x-vector encoders are multi-modal. Maximum-margin losses were better than cross-entropy for in-domain data but not for the mismatched VAST data. We also analyzed back-end adaptation methods.
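The learnable dictionary encoder (LDE) pooling highlighted above replaces simple mean/standard-deviation statistics pooling with a learned dictionary of component centers, so that multi-modal frame-level representations can be summarized per mode. The following is a minimal NumPy sketch of the forward pass only; the function name, parameter shapes, and the soft-assignment formulation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def lde_pool(h, mu, s):
    """Sketch of learnable dictionary encoder pooling (forward pass only).

    h  : (T, D) frame-level hidden representations
    mu : (C, D) dictionary component centers (learned parameters)
    s  : (C,)   per-component scale factors (learned parameters)

    Returns a (C*D,) utterance-level embedding of per-component
    weighted mean residuals, capturing multi-modal frame statistics.
    """
    # Negative scaled squared distances -> soft assignment of each frame
    # to each dictionary component.
    d2 = ((h[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (T, C)
    logits = -s[None, :] * d2
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                      # (T, C) softmax over C
    # Weighted mean residual of the frames w.r.t. each component center.
    resid = h[:, None, :] - mu[None, :, :]                 # (T, C, D)
    num = (w[:, :, None] * resid).sum(axis=0)              # (C, D)
    den = w.sum(axis=0)[:, None] + 1e-8
    return (num / den).reshape(-1)                         # (C*D,)
```

In a real system the centers and scales are trained jointly with the encoder by backpropagation; the sketch only shows why the output can represent several modes at once, one residual block per component.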
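The calibration idea above, adapting an out-of-domain score distribution to a new domain with very few in-domain recordings, can be sketched with a simple Gaussian score model: fit target/non-target score Gaussians on out-of-domain data, MAP-adapt the means with a relevance factor, and score trials by the resulting log-likelihood ratio. All function names, the relevance-factor form, and the choice to adapt only the means are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def fit_gaussian(scores):
    # ML estimate of a 1-D Gaussian over raw detector scores.
    return np.mean(scores), np.var(scores)

def map_adapt(prior_mean, prior_var, in_domain_scores, r=4.0):
    # MAP adaptation of the mean with relevance factor r: the adapted mean
    # interpolates between the in-domain ML estimate and the out-of-domain
    # prior; with few in-domain scores the prior dominates.
    n = len(in_domain_scores)
    alpha = n / (n + r)
    mean = alpha * np.mean(in_domain_scores) + (1 - alpha) * prior_mean
    return mean, prior_var  # keep the out-of-domain variance

def calibrate_llr(score, tar, non):
    # Calibrated log-likelihood ratio under the two adapted Gaussians.
    (m_t, v_t), (m_n, v_n) = tar, non
    def logpdf(x, m, v):
        return -0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)
    return logpdf(score, m_t, v_t) - logpdf(score, m_n, v_n)
```

Usage would follow the abstract's setting: fit on plentiful SITW-like scores, adapt with the handful of VAST-like development scores, then map raw VAST trial scores to calibrated LLRs.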
ISSN: 0885-2308, 1095-8363
DOI: 10.1016/j.csl.2019.101026