Improvement in Automatic Speech Recognition of South Asian Accent Using Transfer Learning of DeepSpeech2
| Published in: | Mathematical Problems in Engineering, 2022-10, Vol. 2022, p. 1-12 |
|---|---|
| Main authors: | , , , |
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Full text |
| ISSN: | 1024-123X, 1563-5147 |
| DOI: | 10.1155/2022/6825555 |
Abstract: Automatic speech recognition (ASR) provides a convenient and fast mode of communication between humans and computers, and its accuracy has improved steadily over time. However, most ASR models are trained on native English accents. While they serve native English speakers well, their accuracy drops drastically for non-native English accents. Our proposed model addresses this limitation for non-native English accents. We fine-tuned the DeepSpeech2 model, pretrained on the native-English-accent LibriSpeech dataset, and retrained it on a subset of the Common Voice dataset containing only South Asian accents, using the proposed novel loss function. We experimented with three different layer configurations of the model to learn the best features for South Asian accents. Three evaluation metrics were used: word error rate (WER), match error rate (MER), and word information loss (WIL). The results show that DeepSpeech2 can perform significantly well for South Asian accents if the weights of the initial convolutional layers are retained while the weights of the deeper layers (i.e., the RNN and fully connected layers) are updated. Our model achieved a WER of 18.08%, the lowest error for non-native English accents in comparison with the original model.
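The layer-wise transfer learning strategy the abstract describes (retain the pretrained convolutional front end, update the RNN and fully connected layers) can be sketched in PyTorch. This is a minimal sketch, not the authors' implementation: the module name `conv` is an assumed name for the convolutional front end of a DeepSpeech2-style network, `load_pretrained_deepspeech2` is a hypothetical loader, and standard CTC loss stands in for the paper's novel loss function, which the abstract does not specify.

```python
import torch
import torch.nn as nn

def freeze_conv_layers(model: nn.Module) -> None:
    """Retain the pretrained convolutional weights (no gradient updates)."""
    for param in model.conv.parameters():  # `conv`: assumed name of the conv front end
        param.requires_grad = False

def fine_tune_step(model, optimizer, ctc_loss, spectrograms, targets,
                   input_lengths, target_lengths):
    """One fine-tuning step on accented speech; only unfrozen layers update."""
    optimizer.zero_grad()
    log_probs = model(spectrograms)  # (time, batch, vocab) log-softmax outputs
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()                  # gradients flow only to RNN/FC parameters
    optimizer.step()
    return loss.item()

# Hypothetical usage:
# model = load_pretrained_deepspeech2("librispeech_checkpoint.pth")  # placeholder loader
# freeze_conv_layers(model)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
# ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
```

Passing only the still-trainable parameters to the optimizer mirrors the best-performing configuration reported in the abstract: the accent-general acoustic features learned by the convolutional layers are preserved, while the accent-specific modeling happens in the deeper layers.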
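The three reported evaluation metrics can be computed with the open-source `jiwer` package, which implements WER, MER, and WIL. The transcript strings below are illustrative only, not from the paper's data.

```python
# pip install jiwer
import jiwer

reference  = "turn the volume down please"   # ground-truth transcript (illustrative)
hypothesis = "turn volume down please"       # ASR output (illustrative)

# WER = (S + D + I) / N, where S, D, I are word substitutions, deletions,
# and insertions, and N is the number of words in the reference.
wer = jiwer.wer(reference, hypothesis)
mer = jiwer.mer(reference, hypothesis)   # match error rate
wil = jiwer.wil(reference, hypothesis)   # word information loss

print(f"WER: {wer:.2%}  MER: {mer:.2%}  WIL: {wil:.2%}")
```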