ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5
Format: Article
Language: English
Abstract: Automatic speech recognition (ASR) systems have advanced significantly with models like Whisper and Conformer and with self-supervised frameworks such as Wav2vec 2.0 and HuBERT. However, developing robust ASR models for young children's speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. The dataset comprises 41.25 hours of speech with carefully crafted manual transcriptions, collected from 397 speakers across various provinces in China, with balanced gender representation. We provide a comprehensive analysis of speaker demographics, speech duration distribution, and geographic coverage. Additionally, we evaluate ASR performance on models trained from scratch, such as Conformer, as well as on fine-tuned pre-trained models like HuBERT and Whisper, where fine-tuning yields significant performance improvements. Furthermore, we assess speaker verification (SV) on our dataset, showing that, despite the challenges posed by the unique vocal characteristics of young children, the dataset effectively supports both ASR and SV tasks. This dataset is a valuable contribution to Mandarin child speech research and holds potential for applications in educational technology and child-computer interaction. It will be open-source and freely available for all academic purposes.
DOI: 10.48550/arxiv.2409.18584
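
The abstract notes that fine-tuning pre-trained models such as Whisper yields significant improvements on child speech. A minimal sketch of how such fine-tuning might be set up is shown below; it assumes audio/transcript pairs from the dataset are already available locally as 16 kHz waveforms, and the model size, learning rate, and data handling are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch: fine-tuning Whisper on Mandarin child speech.
# Assumptions (not from the paper): (waveform, transcript) pairs are
# available locally at 16 kHz; "openai/whisper-small" and the learning
# rate are placeholders chosen only for illustration.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="zh", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(waveform, transcript):
    """One gradient step on a single (audio, text) pair."""
    # Log-mel features for the encoder; token ids for the decoder targets.
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(input_features=inputs.input_features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In practice one would batch utterances, mask padded label positions with -100, and track character error rate on a held-out child-speech split, but the loop above captures the basic fine-tuning setup the abstract refers to.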