Unsupervised Topic Labeling of Text Based on Wikipedia Categorization
Published in: | Journal of Systemics, Cybernetics and Informatics, 2019-08, Vol.17 (4), p.1-5 |
---|---|
Main author: | |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | Defining text topicality is often an expensive problem that requires significant resources for text labeling. Though many packages already exist that provide dictionaries of labeled text, synonyms, and part-of-speech tagging, the problem is ongoing as language develops and new meanings of words and phrases emerge. This paper proposes a solution, cheap in human labor, to topic labeling of any text in the majority of languages. The methodology uses links to a naturally emerging corpus of labeled text: Wikipedia. Wikipedia categories are processed to extract a weighted set of topic labels for the analyzed text. The approach is evaluated by processing categorized texts and comparing the similarity of the top-ranked topic labels to the text category. The topic labels extracted using this methodology can be used for comparing the similarity of texts, for assessing the completeness of topic coverage in automated marking of essays, and for coding in qualitative text analysis. The paper contributes to the field of NLP by offering a cheap and organically developing method of topical text labeling, and to the work of qualitative analysts by offering a methodology for the analysis of interview transcripts and other unstructured text. |
---|---|
ISSN: | 1690-4524 |