The KAS corpus of Slovenian academic writing

The paper presents the KAS corpus of Slovenian academic writing, which consists of almost 65,000 B.A./B.Sc., 16,000 M.A./M.Sc. and 1600 Ph.D. theses (5 million pages or 1.7 billion tokens) gathered from the digital libraries of Slovenian higher education institutions via the Slovenian Open Science p...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Language Resources and Evaluation 2021-06, Vol.55 (2), p.551-583
Hauptverfasser: Erjavec, Tomaž, Fišer, Darja, Ljubešić, Nikola
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The paper presents the KAS corpus of Slovenian academic writing, which consists of almost 65,000 B.A./B.Sc., 16,000 M.A./M.Sc. and 1600 Ph.D. theses (5 million pages or 1.7 billion tokens) gathered from the digital libraries of Slovenian higher education institutions via the Slovenian Open Science portal. We discuss the compilation, meta-data, annotation, and distribution of the corpus, which is made freely available via on-line concordancers and is openly available for research through the CLARIN.SI research infrastructure. We also present the tools for mono- and bilingual term extraction and for thesis structure annotation that were developed in the scope of the project, including the manually annotated datasets used to train these tools. This specialised corpus, large by any standards, represents a substantial and highly useful language resource for the study of Slovenian academic writing and for terminology extraction.
ISSN:1574-020X
1572-8412
1574-0218
DOI:10.1007/s10579-020-09506-4