Relative discrimination criterion – A novel feature ranking method for text data

Full description

Bibliographic details
Published in:	Expert Systems with Applications, 2015-05, Vol. 42 (7), p. 3670-3681
Main authors:	Rehman, Abdur; Javed, Kashif; Babri, Haroon A.; Saeed, Mehreen
Format:	Article
Language:	English
Subjects:	Classifiers; Criteria; Discrimination; Document frequency; Expert systems; False positive rate; Feature selection; Ranking; Selectors; Support vector machines; Term count; Text classification; Texts; True positive rate
Online access:	Full text

Description
Summary:	• Discusses characteristics of text data. • Points out that term counts are ignored when calculating term rank. • Proposes a new feature ranking algorithm (RDC) that considers term counts. • Compares the performance of RDC with four feature ranking metrics on four datasets. • RDC shows the highest performance in 65% of the classification cases. The high dimensionality of text data hinders the performance of classifiers, making it necessary to apply feature selection for dimensionality reduction. Most feature ranking metrics for text classification are based on the document frequencies (df) of a term in the positive and negative classes. Considering only document frequencies to rank features favors terms that occur frequently in the larger classes of unbalanced datasets. In this paper we introduce a new feature ranking metric, termed the relative discrimination criterion (RDC), which takes the document frequencies for each term count of a term into account when estimating the usefulness of a term. The performance of RDC is compared with four well-known feature ranking metrics, information gain (IG), chi-squared (CHI), odds ratio (OR) and distinguishing feature selector (DFS), using support vector machine (SVM) and multinomial naive Bayes (MNB) classifiers on four benchmark datasets, namely Reuters, 20 Newsgroups and two subsets of the Ohsumed dataset. Our results, based on macro and micro F1 measures, show that the performance of RDC is superior to that of the other four metrics in 65% of our experimental trials. Also, RDC attains the highest macro and micro F1 values in 69% of the cases.
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2014.12.013
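
The distinction the summary draws between plain document frequency and document frequency per term count can be illustrated with a minimal Python sketch. This is not the paper's RDC formula, which this record does not state; it only shows the two kinds of statistics side by side. The function names (`document_frequency_rates`, `count_profile`) and the toy corpus are hypothetical and chosen for illustration only.

```python
from collections import Counter


def document_frequency_rates(pos_docs, neg_docs, term):
    """Classic df view: fraction of documents in each class that
    contain the term at least once (how often it occurs is ignored)."""
    tpr = sum(term in doc for doc in pos_docs) / len(pos_docs)
    fpr = sum(term in doc for doc in neg_docs) / len(neg_docs)
    return tpr, fpr


def count_profile(docs, term):
    """Term-count view: for each count c >= 1, the fraction of
    documents in which the term occurs exactly c times."""
    if not docs:
        return {}
    counts = Counter(doc.count(term) for doc in docs)
    counts.pop(0, None)  # documents without the term are not part of the profile
    return {c: n / len(docs) for c, n in sorted(counts.items())}


# Hypothetical toy corpus: each document is a list of tokens.
pos_docs = [["oil", "price", "oil"], ["oil", "oil", "oil", "barrel"], ["market"]]
neg_docs = [["oil", "paint"], ["canvas"], ["oil", "brush"], ["museum"]]

tpr, fpr = document_frequency_rates(pos_docs, neg_docs, "oil")
pos_profile = count_profile(pos_docs, "oil")
neg_profile = count_profile(neg_docs, "oil")

print("df rates (tpr, fpr):", tpr, fpr)
print("positive-class count profile:", pos_profile)
print("negative-class count profile:", neg_profile)
```

In this toy corpus the df rates of the two classes are fairly close, while the count profiles differ: the term occurs repeatedly in positive documents and only once in negative ones. That kind of information is lost when only document frequencies are used.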