Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition



Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, Vol. 32, pp. 2508-2521
Authors: Chen, Hang, Wang, Qing, Du, Jun, Yin, Bao-Cai, Pan, Jia, Lee, Chin-Hui
Format: Article
Language: English
Subjects:
Online Access: Order full text
Description
Summary: A multi-level distortion measure (MLDM) is proposed as an objective to optimize deep neural network-based speech enhancement (SE) in both audio-only and audio-visual scenarios. The aim is to achieve simultaneous improvements in speech quality, intelligibility, and recognition error reduction. Moreover, a comprehensive correlation analysis shows that these three evaluation metrics exhibit high Pearson correlation coefficient (PCC) values with three commonly used optimization objectives: the mean squared error between the ideal ratio mask and the estimated magnitude mask, the scale-invariant signal-to-noise ratio, and a cross-entropy-guided measure. To further improve performance, we leverage the complementarity of the three objectives and propose a correlated multi-level distortion measure (C-MLDM), defined as a weighted combination of MLDM and an average correlation measure based on the three PCCs. Experimental results on the TCD-TIMIT corpus corrupted by additive noise demonstrate that MLDM outperforms systems optimized with each individual objective in both audio-visual and audio-only scenarios, offering improved performance in all three metrics: speech quality, intelligibility, and recognition accuracy. C-MLDM also consistently outperforms MLDM in all test cases. Finally, the generalizability of both MLDM and C-MLDM is confirmed through extensive testing across diverse datasets, SE model architectures, and linguistic conditions.
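The abstract describes MLDM as combining three optimization objectives, among them the mask-domain mean squared error and the scale-invariant signal-to-noise ratio (SI-SNR). The sketch below illustrates such a weighted multi-objective loss in plain NumPy; it is a hypothetical reading of the abstract, not the paper's implementation. The weights `w_mask` and `w_sisnr` are illustrative placeholders, and the paper's third term, the cross-entropy-guided measure, is omitted here because it requires an ASR back-end.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    scale = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = scale * ref          # projection of the estimate onto the reference
    noise = est - target          # residual orthogonal to the scaled reference
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

def mask_mse(est_mask, ideal_mask):
    """Mean squared error between estimated magnitude mask and ideal ratio mask."""
    return float(np.mean((est_mask - ideal_mask) ** 2))

def mldm_loss(est_mask, ideal_mask, est_wave, ref_wave, w_mask=1.0, w_sisnr=0.1):
    """Hypothetical multi-level distortion measure: a weighted sum of the
    mask-domain MSE and the negated waveform-domain SI-SNR (negated so that
    minimizing the loss maximizes SI-SNR). Weights are illustrative only."""
    return w_mask * mask_mse(est_mask, ideal_mask) - w_sisnr * si_snr(est_wave, ref_wave)
```

Each term penalizes distortion at a different level (time-frequency mask vs. waveform), which is the "multi-level" idea the abstract points to.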
ISSN: 2329-9290
2329-9304
DOI: 10.1109/TASLP.2024.3393732