Optimizing Data Usage for Low-Resource Speech Recognition


Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, Vol. 30, p. 394-403
Main authors: Qian, Yanmin, Zhou, Zhikai
Format: Article
Language: eng
Subjects:
Online access: Order full text
Description
Abstract: Automatic speech recognition has made huge progress recently. However, current modeling strategies still suffer a large performance degradation on low-resource languages with limited training data. In this paper, we propose a series of methods to optimize data usage for low-resource speech recognition. Multilingual speech recognition helps substantially in low-resource scenarios, and our work further exploits the correlation and similarity between languages for multilingual pretraining. We use the posterior of the target language extracted from a language classifier to weight training samples, which biases the model toward the target language during pretraining. Furthermore, we design dynamic curriculum learning for data allocation and length perturbation for data augmentation. Together, these three methods form a new strategy for optimized data usage for low-resource languages. We evaluate the proposed methods by pretraining (PT) the model on rich-resource languages and finetuning (FT) it on the target language with limited data. Experimental results show that the proposed data usage methods obtain a 15% to 25% relative word error rate reduction for different target languages compared with the commonly adopted multilingual PT+FT method on the CommonVoice dataset. The same improvement and conclusion are also observed on the Babel dataset with conversational telephone speech, where a ~40% relative character error rate reduction is obtained for the target low-resource language.
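Two of the abstract's ideas can be illustrated concretely: weighting training samples by the target-language posterior from a language classifier, and perturbing utterance lengths for augmentation. The sketch below is a hypothetical, simplified illustration, not the authors' implementation; the temperature-based normalization and the frame drop/duplicate scheme are assumptions made here for clarity.

```python
import random

def posterior_weights(posteriors, temperature=1.0):
    """Turn language-classifier posteriors for the target language into
    per-utterance sampling weights. Samples that look more like the
    target language get a larger weight (normalization scheme assumed)."""
    scaled = [p ** (1.0 / temperature) for p in posteriors]
    total = sum(scaled)
    return [s / total for s in scaled]

def length_perturb(frames, min_ratio=0.8, max_ratio=1.2, rng=random):
    """Length perturbation, illustrated naively: pick a random length
    ratio and drop (or duplicate) frames to reach the new length."""
    ratio = rng.uniform(min_ratio, max_ratio)
    target_len = max(1, int(len(frames) * ratio))
    if target_len <= len(frames):
        # shorten: keep a sorted random subset of frame indices
        idx = sorted(rng.sample(range(len(frames)), target_len))
    else:
        # lengthen: sample indices with replacement, preserving order
        idx = sorted(rng.choices(range(len(frames)), k=target_len))
    return [frames[i] for i in idx]
```

In practice the weights would drive a weighted sampler over the multilingual pretraining pool, and length perturbation would operate on acoustic feature frames rather than the plain lists used here.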
ISSN: 2329-9290
2329-9304
DOI: 10.1109/TASLP.2022.3140552