The KAS corpus of Slovenian academic writing
The paper presents the KAS corpus of Slovenian academic writing, which consists of almost 65,000 B.A./B.Sc., 16,000 M.A./M.Sc. and 1600 Ph.D. theses (5 million pages or 1.7 billion tokens) gathered from the digital libraries of Slovenian higher education institutions via the Slovenian Open Science p...
Gespeichert in:
Veröffentlicht in: | Language Resources and Evaluation 2021-06, Vol.55 (2), p.551-583 |
---|---|
Hauptverfasser: | , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | The paper presents the KAS corpus of Slovenian academic writing, which consists of almost 65,000 B.A./B.Sc., 16,000 M.A./M.Sc. and 1600 Ph.D. theses (5 million pages or 1.7 billion tokens) gathered from the digital libraries of Slovenian higher education institutions via the Slovenian Open Science portal. We discuss the compilation, meta-data, annotation, and distribution of the corpus, which is made freely available via on-line concordancers and is openly available for research through the CLARIN.SI research infrastructure. We also present the tools for mono- and bilingual term extraction and for thesis structure annotation that were developed in the scope of the project, including the manually annotated datasets used to train these tools. This specialised corpus, large by any standards, represents a substantial and highly useful language resource for the study of Slovenian academic writing and for terminology extraction. |
---|---|
ISSN: | 1574-020X 1572-8412 1574-0218 |
DOI: | 10.1007/s10579-020-09506-4 |