Unsupervised Topic Labeling of Text Based on Wikipedia Categorization

Bibliographic details
Published in: Journal of Systemics, Cybernetics and Informatics, 2019-08, Vol. 17 (4), p. 1-5
Main author: Tetyana Loskutova
Format: Article
Language: English
Description
Summary: Defining text topicality is often an expensive problem that requires significant resources for text labeling. Although many packages already provide dictionaries of labeled text, synonyms, and part-of-speech tagging, the problem persists as language develops and new meanings of words and phrases emerge. This paper proposes a solution to topic labeling that is cheap in human labor and applies to text in the majority of languages. The methodology uses links to a naturally emerging corpus of labeled text: Wikipedia. Wikipedia categories are processed to extract a weighted set of topic labels for the analyzed text. The approach is evaluated by processing categorized texts and comparing the top-ranked topic labels with each text's category. The topic labels extracted using this methodology can be used for comparing the similarity of texts, for assessing the completeness of topic coverage in automated marking of essays, and for coding in qualitative text analysis. The paper contributes to the field of NLP by offering a cheap and organically developing method of topical text labeling. It contributes to the work of qualitative analysts by offering a methodology for the analysis of interview transcripts and other unstructured text.
ISSN:1690-4524
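
The summary describes extracting a weighted set of topic labels from Wikipedia categories. The following is a minimal sketch of that idea, not the paper's implementation: the summary does not state how weights are computed, so weighting each category by the frequency of the terms that map to it is an assumption. The `TERM_CATEGORIES` table and the `topic_labels` function are hypothetical names, and the table stands in for category lookups that a real pipeline would obtain from Wikipedia.

```python
import re
from collections import Counter

# Hypothetical stand-in for Wikipedia lookups: each term maps to the
# categories of the article it would link to. Illustrative data only.
TERM_CATEGORIES = {
    "neuron": ["Neuroscience", "Cells"],
    "synapse": ["Neuroscience"],
    "protein": ["Biochemistry"],
}


def topic_labels(text, term_categories, top_n=3):
    """Return up to top_n (category, weight) pairs, highest weight first.

    Assumption: a category's weight is the share of category hits it
    receives, where each occurrence of a known term counts once for
    every category attached to that term.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    term_counts = Counter(t for t in tokens if t in term_categories)

    weights = Counter()
    for term, count in term_counts.items():
        for category in term_categories[term]:
            weights[category] += count

    total = sum(weights.values())
    if total == 0:
        return []  # no known terms, so no labels
    return [(cat, hits / total) for cat, hits in weights.most_common(top_n)]


sample = ("A neuron passes a signal to another neuron across a synapse, "
          "releasing a protein.")
labels = topic_labels(sample, TERM_CATEGORIES)
```

To mirror the evaluation mentioned in the summary, the top-ranked label of a text with a known category could be compared against that category; here the sample text's top label is "Neuroscience".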