EMDSQA: A Neural Speech Quality Assessment Model With Speaker Embedding

Bibliographic details
Published in: IEEE Signal Processing Letters, 2024, Vol. 31, pp. 3064-3068
Main authors: Hao, Yiya; Xiong, Feifei; Li, Bei; Ding, Nai; Feng, Jinwei
Format: Article
Language: English
Description
Abstract: We present EMDSQA, a neural speech quality assessment model with speaker embedding. EMDSQA can precisely predict the Mean Opinion Score (MOS) of speech quality during online communications. Intrusive speech quality assessment methods such as perceptual objective listening quality analysis (POLQA) are not practical for online communications because every piece of degraded speech requires a corresponding clean reference. Non-intrusive methods can assess the quality of online speech, but they have not reached the accuracy and robustness required for real-world applications. EMDSQA extracts the speaker embedding using an independent pipeline and feeds it as a prior feature to a self-attention-based MOS prediction model. Because EMDSQA does not need a clean reference, it is practical for real-world communication applications. An open-source test corpus featuring real-world data was also developed. Experimental results show that EMDSQA achieves a Pearson correlation coefficient of 0.92 with human-rated MOS, surpassing other state-of-the-art intrusive and non-intrusive methods.
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2024.3478211
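
Note on the evaluation metric: the abstract reports a Pearson correlation coefficient between predicted MOS and human-rated MOS. The sketch below shows how such a coefficient is typically computed. It is only an illustration: the function name `pearson` and the sample scores are hypothetical and are not taken from the paper or its test corpus.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (unnormalised covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for illustration only (not data from the paper).
human_mos = [1.5, 2.0, 3.0, 4.0, 4.5]
predicted_mos = [1.8, 2.1, 2.9, 3.7, 4.6]

pcc = pearson(human_mos, predicted_mos)
print(f"PCC = {pcc:.3f}")
```

A value near 1 means the predicted scores track the listeners' ratings closely in a linear sense. It does not by itself show that the absolute score levels agree.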