Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks

Bibliographic details
Published in:	Language resources and evaluation 2023-09, Vol.57 (3), p.1295-1327
Main authors:	Moisio, Anssi; Porjazovski, Dejan; Rouhe, Aku; Getman, Yaroslav; Virkkunen, Anja; AlGhezi, Ragheb; Lennes, Mietta; Grósz, Tamás; Lindén, Krister; Kurimo, Mikko
Format:	Article
Language:	English
Subjects:	Automatic speech recognition; Benchmarks; Colloquial language; Computational Linguistics; Computer Science; Corpus linguistics; Finnish language; Language and Literature; Linguistics; Metadata; Original Paper; Regional dialects; Social Sciences; Source code; Speech; Speech recognition; Spoken language; Spontaneous speech; Voice recognition

Description
Abstract:	The Donate Speech campaign has so far succeeded in gathering approximately 3600 h of ordinary, colloquial Finnish speech into the Lahjoita puhetta (Donate Speech) corpus. The corpus includes over twenty thousand speakers from all the regions of Finland and from all age brackets. The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services. In this paper, we present the collection process and the collected corpus, and showcase its versatility through multiple use cases. The evaluated use cases include automatic speech recognition of spontaneous speech; detection of age, gender, dialect and topic; and metadata analysis. We provide benchmarks for the use cases, as well as downloadable, trained baseline systems with open-source code for reproducibility. One further use case is to verify the metadata and transcripts given in this corpus itself, and to suggest artificial metadata and transcripts for the part of the corpus where they are missing.
ISSN:	1574-020X 1574-0218
DOI:	10.1007/s10579-022-09606-3