Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition



Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, Vol. 32, pp. 2508-2521
Authors: Chen, Hang, Wang, Qing, Du, Jun, Yin, Bao-Cai, Pan, Jia, Lee, Chin-Hui
Format: Article
Language: English
Subjects:
Online Access: Order full text
Description
Summary: A multi-level distortion measure (MLDM) is proposed as an objective to optimize deep neural network-based speech enhancement (SE) in both audio-only and audio-visual scenarios. The aim is to achieve simultaneous improvements in speech quality, intelligibility, and recognition error reduction. Moreover, a comprehensive correlation analysis shows that these three evaluation metrics exhibit high Pearson correlation coefficient (PCC) values with three commonly used optimization objectives: the mean squared error between the ideal ratio mask and the estimated magnitude mask, the scale-invariant signal-to-noise ratio, and a cross-entropy-guided measure. To further improve performance, we leverage the complementarity of the three objectives and propose a correlated multi-level distortion measure (C-MLDM), defined as a weighted combination of MLDM and an average correlation measure based on the three PCCs. Experimental results on the TCD-TIMIT corpus corrupted by additive noise demonstrate that MLDM outperforms systems optimized with each individual objective in both audio-visual and audio-only scenarios, offering improved performance in all three metrics: speech quality, intelligibility, and recognition accuracy. C-MLDM also consistently outperforms MLDM in all test cases. Finally, the generalizability of both MLDM and C-MLDM is confirmed through extensive testing across diverse datasets, SE model architectures, and linguistic conditions.
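The abstract describes MLDM as combining three optimization objectives, among them the mask-domain mean squared error and the scale-invariant signal-to-noise ratio (SI-SNR). The sketch below illustrates such a weighted multi-objective loss in plain NumPy; it is a hypothetical reading of the abstract, not the paper's implementation. The weights `w_mask` and `w_sisnr` are illustrative placeholders, and the paper's third term, the cross-entropy-guided measure, is omitted here because it requires an ASR back-end.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    scale = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = scale * ref          # projection of the estimate onto the reference
    noise = est - target          # residual orthogonal to the scaled reference
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

def mask_mse(est_mask, ideal_mask):
    """Mean squared error between estimated magnitude mask and ideal ratio mask."""
    return float(np.mean((est_mask - ideal_mask) ** 2))

def mldm_loss(est_mask, ideal_mask, est_wave, ref_wave, w_mask=1.0, w_sisnr=0.1):
    """Hypothetical multi-level distortion measure: a weighted sum of the
    mask-domain MSE and the negated waveform-domain SI-SNR (negated so that
    minimizing the loss maximizes SI-SNR). Weights are illustrative only."""
    return w_mask * mask_mse(est_mask, ideal_mask) - w_sisnr * si_snr(est_wave, ref_wave)
```

Each term penalizes distortion at a different level (time-frequency mask vs. waveform), which is the "multi-level" idea the abstract points to.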
ISSN: 2329-9290
2329-9304
DOI: 10.1109/TASLP.2024.3393732