YODAS: Youtube-Oriented Dataset for Audio and Speech
Saved in:

Main authors:
Format: Article
Language: eng
Subjects:
Online access: Order full text
Summary: In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset currently comprising over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, which include manual or automatic subtitles, facilitate supervised model training, while the unlabeled subsets are suited to self-supervised learning applications. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We introduce the collection methodology used for YODAS, which contributes to large-scale speech dataset construction. Subsequently, we provide a comprehensive analysis of the speech and text contained within the dataset. Finally, we describe speech recognition baselines for the top 15 languages.
DOI: 10.48550/arxiv.2406.00899
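
Because the summary states that YODAS is publicly distributed under a Creative Commons license, a minimal sketch of how one might stream a subset programmatically is given below, using the Hugging Face `datasets` library. The dataset identifier `espnet/yodas` and the config name `en000` are assumptions not stated in this record; consult the actual release for the exact names and layout.

```python
# Minimal sketch (not from the paper): streaming one assumed YODAS subset
# via the Hugging Face `datasets` library.
from itertools import islice

from datasets import load_dataset

# Streaming avoids downloading the full corpus (500k+ hours) up front.
# "espnet/yodas" and "en000" are assumed identifiers, not confirmed by this record.
yodas_en = load_dataset("espnet/yodas", "en000", split="train", streaming=True)

# Inspect a few examples; each is expected to pair audio with its subtitle text.
for example in islice(yodas_en, 3):
    print({key: type(value).__name__ for key, value in example.items()})
```

Streaming is used here only to keep the example lightweight; for supervised training or self-supervised pretraining as described in the summary, one would typically download and shard the relevant language subsets locally.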