EMDSQA: A Neural Speech Quality Assessment Model With Speaker Embedding
Published in: | IEEE Signal Processing Letters, 2024, Vol. 31, pp. 3064-3068 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | We present a neural speech quality assessment model with speaker embedding. This model, EMDSQA, can precisely predict the Mean Opinion Score (MOS) of speech quality during online communications. Intrusive speech quality assessment methods such as perceptual objective listening quality analysis (POLQA) are not practical for online communications because every piece of degraded speech requires a corresponding clean reference. Non-intrusive methods can assess the quality of online speech, but have not reached the accuracy and robustness required for real-world applications. EMDSQA extracts the speaker embedding using an independent pipeline and feeds it as a prior feature to a self-attention-based MOS prediction model. Since EMDSQA does not need the corresponding clean reference, it is practical for real-world communication applications. An open-source test corpus featuring real-world data was also developed. Experimental results show that EMDSQA achieves a Pearson correlation coefficient of 0.92 with human-rated MOS, surpassing other state-of-the-art intrusive and non-intrusive methods. |
---|---|
ISSN: | 1070-9908 1558-2361 |
DOI: | 10.1109/LSP.2024.3478211 |