Progressive channel fusion for more efficient TDNN on speaker verification

Bibliographic details
Published in:	Speech Communication, 2024-09, Vol. 163, p. 103105, Article 103105
Main authors:	Zhao, Zhenduo; Li, Zhuo; Wang, Wenchao; Xu, Ji
Format:	Article
Language:	English
Keywords:	Channel permutation; Progressive channel fusion; Speaker verification; Time delay neural networks
Online access:	Full text

Description
Abstract:	ECAPA-TDNN is one of the most popular time delay neural networks (TDNNs) for speaker verification. While most updates focus on carefully designed auxiliary modules, the depth-first principle has recently shown promising performance. However, empirical experiments show that TDNNs based on one-dimensional convolution (Conv1D) suffer from performance degradation when a large number of vanilla basic blocks are simply added. Because Conv1D naturally has a global receptive field (RF) on the feature dimension, progressive channel fusion (PCF) is proposed to alleviate this issue by introducing group convolution to build a local RF and fusing the subbands progressively. Instead of reducing the number of groups in the convolution layers, as in previous work, a novel channel permutation strategy is introduced to build information flow between groups, so that all basic blocks in the model keep consistent parameter efficiency. The information leakage from lower-frequency bands to higher ones caused by Res2Block is solved at the same time by introducing group-in-group convolution and using channel permutation. Besides the PCF strategy, redundant connections are removed for a more concise model architecture. Experiments on VoxCeleb and CnCeleb achieve state-of-the-art (SOTA) performance, with an average relative improvement of 12.3% on EER and 13.2% on minDCF (0.01), validating the effectiveness of the proposed model.
• Channel fusion is proposed to introduce a local receptive field in the feature dimension.
• A novel branch structure is proposed to enhance multi-scale capability.
• The model is scaled to 40, 58 and 76 layers and achieves SOTA performance.
ISSN:	0167-6393
DOI:	10.1016/j.specom.2024.103105
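
The abstract relies on two ideas that a small sketch may make easier to picture: group convolution restricts each output channel to a local slice of the input channels, and a channel permutation between layers lets information cross group boundaries without reducing the number of groups. The record does not specify the permutation the authors use, so the example below substitutes a ShuffleNet-style interleaving as a stand-in. The function names, the channel count (512), the kernel size (3) and the group count (8) are all assumptions chosen for illustration, not values from the paper.

```python
def grouped_conv1d_params(c_in, c_out, kernel_size, groups):
    """Weight count (bias excluded) of a Conv1D layer split into `groups` groups.

    Each output channel only sees c_in // groups input channels, which is
    what gives a grouped convolution its local receptive field over channels.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * kernel_size


def shuffle_channels(channels, groups):
    """Interleave channels across groups (assumed stand-in permutation).

    Equivalent to viewing the list as a (groups, per_group) grid,
    transposing it, and flattening the result.
    """
    if len(channels) % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Hypothetical layer sizes, for illustration only.
dense = grouped_conv1d_params(512, 512, 3, groups=1)
grouped = grouped_conv1d_params(512, 512, 3, groups=8)
print(dense, grouped)  # 786432 98304

# Eight channels in four groups of two: [0,1] [2,3] [4,5] [6,7].
print(shuffle_channels(list(range(8)), groups=4))  # [0, 2, 4, 6, 1, 3, 5, 7]
```

With the same group count kept in every layer, the weight count per layer stays constant, while the permutation means the next layer's first group reads channels 0 and 2, which came from two different groups in the previous layer.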