Self-Supervised Pre-Trained Speech Representation Based End-to-End Mispronunciation Detection and Diagnosis of Mandarin

Bibliographic Details
Published in:	IEEE Access, 2022, Vol. 10, p. 1-1
Main authors:	Shen, Yunfei; Liu, Qingqing; Fan, Zhixing; Liu, Jiajun; Wumaier, Aishan
Format:	Article
Language:	English
Subjects:	CAPT; Computational modeling; Context modeling; Datasets; Diagnosis; Feature extraction; Hidden Markov models; MDD; Representations; Self-supervised learning; Speech recognition; Task analysis; Training
Online access:	Full text

Description
Summary:	Mispronunciation Detection and Diagnosis (MDD) is a core technology in Computer-Assisted Pronunciation Training (CAPT) and Computer-Assisted Language Learning (CALL). MDD research for Mandarin suffers from a lack of relevant data, making it a typical low-resource scenario. In recent years, self-supervised pre-trained speech representations have developed rapidly and achieved significant performance improvements in low-resource speech recognition, which motivates applying them to MDD tasks. First, we build a Mandarin MDD dataset called PSC-Reading for the passage reading section of the Putonghua Proficiency Test (PSC). We then extend end-to-end MDD systems based on the CTC/Attention hybrid architecture and the Transformer architecture, using features extracted from self-supervised pre-trained speech representation models such as Wav2Vec 2.0 and WavLM to replace conventional speech features like MFCC and Fbank, and conduct experiments on the PSC-Reading dataset. Experimental results show that, compared with the CNN-RNN-CTC baseline model, our WavLM-based model obtains a 20.5% relative improvement in F1 score.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2022.3212417