Feature subset selection for Arabic document categorization using BPSO-KNN

Bibliographic details
Main authors:	Chantar, H. K., Corne, D. W.
Format:	Conference proceeding
Language:	English
Subjects:	Accuracy; Arabic language processing; feature selection; Particle swarm optimization; Support vector machines; Text categorization; text mining; Training; Vectors

Description
Summary:	Document categorization is an important topic that is central to many applications that demand reasoning about and organisation of text documents, web pages, and so forth. Document classification is commonly achieved by choosing appropriate features (terms) and building a term-frequency inverse-document-frequency (TFIDF) feature vector. In this process, feature selection is a key factor in the accuracy and effectiveness of resulting classifications. For a given task, the right choice of features means accurate classification with suitable levels of computational efficiency. Meanwhile, most document classification work is based on English language documents. In this paper we make three main contributions: (i) we demonstrate successful document classification in the context of Arabic documents (although previous work has demonstrated text classification in Arabic, the datasets used, and the experimental setup, have not been revealed); (ii) we offer our datasets to enable other researchers to compare directly with our results; (iii) we demonstrate a combination of Binary PSO and K nearest neighbour that performs well in selecting good sets of features for this task.
DOI:	10.1109/NaBIC.2011.6089647
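
Illustration:	The summary names a combination of Binary PSO and K nearest neighbour for feature selection but gives no algorithmic detail, so the following is only a minimal sketch of the general technique, not the authors' implementation. It assumes the standard sigmoid-based binary PSO update and uses leave-one-out k-NN accuracy as the fitness of a feature mask. The function names (knn_loo_accuracy, bpso_select), the parameter values, and the toy TFIDF-like data are all hypothetical and chosen for illustration.

```python
import math
import random

# Toy TFIDF-like vectors (made-up values): feature 0 separates the classes,
# the remaining three features are noise.
TOY_X = [
    [0.9, 0.1, 0.5, 0.3],
    [0.8, 0.7, 0.2, 0.9],
    [0.7, 0.4, 0.9, 0.1],
    [0.1, 0.2, 0.4, 0.8],
    [0.2, 0.8, 0.1, 0.2],
    [0.0, 0.5, 0.8, 0.6],
]
TOY_Y = ["sport", "sport", "sport", "politics", "politics", "politics"]


def knn_loo_accuracy(X, y, mask, k=1):
    """Leave-one-out accuracy of k-NN using only the features where mask is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        # Squared Euclidean distance to every other document, nearest first.
        dists = sorted(
            (sum((xi[j] - xo[j]) ** 2 for j in cols), y[o])
            for o, xo in enumerate(X) if o != i
        )
        votes = [label for _, label in dists[:k]]
        pred = max(sorted(set(votes)), key=votes.count)  # majority vote
        correct += pred == y[i]
    return correct / len(X)


def bpso_select(X, y, n_particles=8, n_iters=20, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO over feature masks; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    pos = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(n_particles)]
    vel = [[0.0] * n_feat for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [knn_loo_accuracy(X, y, p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_feat):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, v))  # clamp the velocity
                # Sigmoid maps the velocity to the probability that the bit is 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = knn_loo_accuracy(X, y, pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit
```

Calling bpso_select(TOY_X, TOY_Y) returns a 0/1 mask over the four toy features together with its leave-one-out accuracy. A practical fitness function would usually also penalise large feature subsets and evaluate on held-out documents; both are omitted here for brevity.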