A variant of n-gram based language-independent text categorization
Published in: | Intelligent data analysis 2014-01, Vol.18 (4), p.677-695 |
---|---|
Main author: | |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | A technique for automated categorization of text documents is presented, based on byte-level n-gram profiles and a new dissimilarity measure between profiles. A K nearest neighbors classifier is used. The technique is language independent. It has been applied to four document collections in English, Chinese and Serbian: Reuters-21578 newswire articles, 20-Newsgroups, Tancorp and Ebart. The evaluation used the micro- and macro-averaged F_1 measure. The results confirm that the presented technique, although very simple, achieves better results than other n-gram based techniques on the Tancorp and 20-Newsgroups corpora. Compared to other state-of-the-art methods, it performs better than the "bag-of-words" K nearest neighbors classifier, and on the 20-Newsgroups corpus it performs even better than the "bag-of-words" Support vector machines classifier. It can be successfully used in a variety of related problems. |
---|---|
ISSN: | 1088-467X, 1571-4128 |
DOI: | 10.3233/IDA-140663 |
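
The pipeline named in the abstract (byte-level n-gram profiles, a dissimilarity measure between profiles, and a K nearest neighbors vote) can be illustrated with a minimal Python sketch. This record does not give the article's new dissimilarity measure, so the `dissimilarity` function below is a stand-in (a sum of squared relative frequency differences) and not the authors' measure. The function names, the n-gram length `n=3`, the profile size `top=500`, and `k=3` are all assumptions chosen for illustration.

```python
from collections import Counter


def ngram_profile(text, n=3, top=500):
    """Map the `top` most frequent byte-level n-grams to normalized frequencies."""
    # Working on raw bytes means no tokenizer or language knowledge is needed.
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values()) or 1
    return {gram: c / total for gram, c in counts.most_common(top)}


def dissimilarity(p, q):
    """Stand-in measure (NOT the article's): squared relative frequency differences."""
    score = 0.0
    for gram in p.keys() | q.keys():
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        # a + b > 0 because the n-gram occurs in at least one of the profiles.
        score += ((a - b) / ((a + b) / 2)) ** 2
    return score


def knn_classify(text, training, k=3, n=3):
    """Label `text` by majority vote among the k least dissimilar training documents."""
    profile = ngram_profile(text, n)
    ranked = sorted(
        training,
        key=lambda item: dissimilarity(profile, ngram_profile(item[0], n)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Here `training` is assumed to be a list of `(document_text, label)` pairs. Because the profiles are built from bytes rather than words, the same code runs unchanged on English, Chinese or Serbian text, which is the sense in which such a technique is language independent.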