Named-Entity Recognition for Modern Tibetan Newspapers: Tagset, Guidelines and Training Data

Bibliographic details
Main authors: Barnett, Robert; Hill, Nathan; Diemberger, Hildegard; Samdrup, Tsering
Format: Dataset
Language: Tibetan (tib)
Description
Summary: This dataset, tagset and guidelines were the output of a six-month incubator project on the feasibility of developing Named-Entity Recognition (NER) for modern Tibetan, primarily for use with contemporary Tibetan-language newspapers and media published inside the PRC. The project was carried out by the Mongolian and Inner Asian Studies Unit at Cambridge University’s Department of Social Anthropology and was funded by an incubator grant from Cambridge Language Sciences. The project title was “Named-Entity Recognition in Tibetan and Mongolian Newspapers.” The Project PI was Dr Hildegard Diemberger (Cambridge), the Coordinator and Lead Author was Dr Robert Barnett (SOAS), and the Senior Advisers were Dr Nathan Hill (SOAS), Dr Marieke Meelen (Cambridge), and Dr Thomas White (Cambridge). Although some forms of NER and other NLP procedures have been developed within China for modern Tibetan (see Liu, Nuo et al., 2011), the data underlying those initiatives have not been made publicly available, so their findings cannot be tested or reproduced. Significant work on developing NLP for Tibetan has been carried out outside China, but it has focused largely on classical Tibetan and religious texts (see Hill & Garrett, Edward, 2017). The Cambridge incubator project therefore produced a tagset, guidelines and training data for developing NER for modern Tibetan, with a focus on historical and political analysis of contemporary newspapers, media and other public documents in Tibetan. We compiled 3.11 million syllables of data in Tibetan extracted from articles downloaded from Chinese-language news aggregator sites within China, primarily tibet.cpc.people.com.cn and tibet.people.com.cn. From this data, we selected texts containing 280,000 syllables in Tibetan, grouped in 26,000 utterances/sentences (available on request). Using Lighttag, an online annotation site, we developed a tagset for NER consisting of 17 tags (plus one for wrong segmentation, for use with segmented data).
We annotated approximately 186,000 syllables, yielding 9,884 annotations. Of these, after discounting flawed data, we produced training data containing c. 6,700 annotations. We carried out the secondary, manual review offline (for our method of converting Lighttag data for offline review, see the attached report “Using Spreadsheets to Review Annotations Offline.pdf”), and found an error rate of 3.6%. The final total of reviewed annotations was 6,624. The dataset, tagset, guidelines and reports were developed and docume
DOI: 10.5281/zenodo.4536515