A Novel Feature Selection Method in the Categorization of Imbalanced Textual Data

Text data distribution is often imbalanced. Imbalanced data is one of the challenges in text classification, as it leads to the loss of performance of classifiers. Many studies have been conducted so far in this regard. The proposed solutions are divided into several general categories, include samp...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	KSII transactions on Internet and information systems 2018-08, Vol.12 (8), p.3725-3748
Hauptverfasser:	Pouramini, Jafar, Minaei-Bidgoli, Behrouze, Esmaeili, Mahdi
Format:	Artikel
Sprache:	kor
Schlagworte:	Feature selection High dimensionality Imbalanced class Text classification
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Text data distribution is often imbalanced. Imbalanced data is one of the challenges in text classification, as it leads to the loss of performance of classifiers. Many studies have been conducted so far in this regard. The proposed solutions are divided into several general categories, include sampling-based and algorithm-based methods. In recent studies, feature selection has also been considered as one of the solutions for the imbalance problem. In this paper, a novel one-sided feature selection known as probabilistic feature selection (PFS) was presented for imbalanced text classification. The PFS is a probabilistic method that is calculated using feature distribution. Compared to the similar methods, the PFS has more parameters. In order to evaluate the performance of the proposed method, the feature selection methods including Gini, MI, FAST and DFS were implemented. To assess the proposed method, the decision tree classifications such as C4.5 and Naive Bayes were used. The results of tests on Reuters-21875 and WebKB figures per F-measure suggested that the proposed feature selection has significantly improved the performance of the classifiers.
ISSN:	1976-7277 1976-7277