Developing children’s speech recognition system for low resource Punjabi language
Published in: | Applied Acoustics 2021-07, Vol.178, p.108002, Article 108002 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | Building an automatic speech recognition (ASR) system for children is very challenging, especially when domain-specific training data is absent or insufficient. In this paper, we present our efforts towards developing a children’s ASR system for Punjabi, which is a low-resource language. Since no children’s speech data was available for Punjabi, we first created a small speech corpus consisting of data from both adult and child speakers. Next, an ASR system was developed on a mix of adults’ and children’s speech and tested on children’s speech. Owing to differences in acoustic attributes such as formant frequencies, pitch, and speaking rate between adults’ and children’s speech, the developed ASR system yields a highly degraded recognition rate. To reduce the acoustic mismatch, we explored vocal-tract length normalization (VTLN), explicit pitch modification, and duration modification. All three approaches are observed to be highly effective. To deal with training data scarcity, the role of prosody-modification-based out-of-domain data augmentation is studied. For that purpose, the pitch and speaking rate of the adults’ speech training set are explicitly changed to make it resemble children’s speech. The original and prosody-modified data are then pooled together before learning the acoustic models. This augmentation significantly reduces error rates. In addition, we studied the effect of varying the number of senones, the number of hidden nodes, and the number of hidden layers, as well as early stopping, resulting in a relative improvement (RI) of 32.1% over the baseline system with varied senones. |
---|---|
ISSN: | 0003-682X 1872-910X |
DOI: | 10.1016/j.apacoust.2021.108002 |
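
The augmentation idea in the summary (modify the prosody of adults' speech, then pool it with the original data) can be illustrated with a minimal sketch. It is not the authors' method: plain resampling is used here as a crude stand-in that raises pitch and speaking rate together, whereas the summary describes modifying them explicitly. The function names, the factor of 1.2, and the relative-improvement helper (the usual definition for error rates) are assumptions made for illustration.

```python
import numpy as np


def resample_linear(signal, factor):
    """Resample `signal` by linear interpolation.

    Played back at the original sampling rate, the output is shorter by
    `factor` and its pitch is scaled by `factor`. This couples pitch and
    duration, unlike the explicit modification described in the summary.
    """
    signal = np.asarray(signal, dtype=float)
    n_out = max(1, int(round(len(signal) / factor)))
    # Positions in the input signal at which the output is sampled.
    positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(positions, np.arange(len(signal)), signal)


def augment(adult_utterances, factor=1.2):
    """Pool the original adult utterances with their modified copies."""
    modified = [resample_linear(u, factor) for u in adult_utterances]
    return list(adult_utterances) + modified


def relative_improvement(baseline_error, new_error):
    """Relative improvement in percent (assumed standard definition)."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

With this sketch, a training list of N adult utterances becomes a pooled list of 2N utterances that would then be passed to acoustic-model training; the appropriate modification factors would have to be tuned on held-out children's speech.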