Role of Data Augmentation and Effective Conservation of High-Frequency Contents in the Context of Children’s Speaker Verification System

Bibliographic details
Published in: Circuits, Systems, and Signal Processing, 2024-05, Vol. 43 (5), p. 3139-3159
Main authors: Aziz, Shahid; Shahnawazuddin, S.
Format: Article
Language: English
Description
Summary: Developing an automatic speaker verification (ASV) system for children’s speech presents significant challenges. One major obstacle is the scarcity of domain-specific data. The problem is worse for short speech utterances, a relatively unexplored area in children’s ASV. Voice biometric systems struggle during the enrollment and verification phases when the available speech data are inadequate in both volume and duration. To address data scarcity, this paper explores various in-domain and out-of-domain data augmentation techniques. Out-of-domain data from adult speakers, whose acoustic attributes differ from those of children, are modified using techniques such as voice conversion, prosody modification, and formant modification to make them acoustically similar to children’s speech. In-domain data augmentation involves perturbing the speed of children’s speech. This combined data augmentation approach not only increases the volume of training data but also captures missing target attributes, resulting in a 43.91% reduction in equal error rate (EER) compared to the baseline system. The paper also addresses the challenge of preserving higher-frequency components in children’s speech. It does so by concatenating conventional Mel-frequency cepstral coefficient (MFCC) features with inverse-Mel-frequency cepstral coefficient (IMFCC) features at the frame level. The low canonical correlation between MFCC and IMFCC feature vectors motivates this fusion. The feature concatenation approach, when combined with the proposed data augmentation, reduces the overall EER by 48.51%, demonstrating its effectiveness in improving the performance of children’s ASV systems.
ISSN: 0278-081X, 1531-5878
DOI:10.1007/s00034-024-02598-1
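
The frame-level MFCC and IMFCC fusion mentioned in the summary can be illustrated with a minimal Python sketch. This record does not give the authors' configuration, so everything below is an assumption made for illustration: the helper names (`mel_centers`, `inverse_mel_centers`, `concat_frames`), the common 2595·log10(1 + f/700) mel formula, the choice of 8 filters, and the 8000 Hz upper band edge. The sketch only shows where the filters of each bank would sit and how two per-frame feature vectors are joined; it does not perform full cepstral extraction. The inverse-mel bank is modeled here as the mel bank mirrored in frequency, which places its narrow, closely spaced filters at the high-frequency end.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (common 2595*log10 formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centers(n_filters, f_max):
    """Centre frequencies (Hz) of filters equally spaced on the mel scale in (0, f_max)."""
    step = hz_to_mel(f_max) / (n_filters + 1)
    return [mel_to_hz(step * (i + 1)) for i in range(n_filters)]


def inverse_mel_centers(n_filters, f_max):
    """Mirror the mel centres within [0, f_max]: dense at high, sparse at low frequencies."""
    return sorted(f_max - c for c in mel_centers(n_filters, f_max))


def concat_frames(mfcc, imfcc):
    """Join two per-frame feature sequences frame by frame (lists of lists of floats)."""
    if len(mfcc) != len(imfcc):
        raise ValueError("both feature streams must have the same number of frames")
    return [m + i for m, i in zip(mfcc, imfcc)]


if __name__ == "__main__":
    # Hypothetical settings: 8 filters, band edge at 8000 Hz.
    print([round(c) for c in mel_centers(8, 8000.0)])
    print([round(c) for c in inverse_mel_centers(8, 8000.0)])
    # Two toy frames with 2 coefficients per stream give 4 coefficients per fused frame.
    print(concat_frames([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
```

In a real front end the two coefficient streams would be computed from the same framed signal with identical frame length and hop, so the frame counts match and the fused vector has as many dimensions as the two streams combined.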