Optimizing text classification through efficient feature selection based on quality metric

Bibliographic details
Published in:	Journal of intelligent information systems 2015-12, Vol.45 (3), p.379-396
Main authors:	Lamirel, Jean-Charles; Cuxac, Pascal; Chivukula, Aneesh Sreevallabh; Hajlaoui, Kafil
Format:	Article
Language:	English
Subjects:	Algorithms; Analysis; Artificial Intelligence; Classification; Clusters; Computer Science; Data Structures and Information Theory; Datasets; Feature extraction; Feature selection; Gain; Information Storage and Retrieval; Information systems; Intelligent systems; IT in Business; Mathematical models; Maximization; Methods; Natural Language Processing (NLP); Neural and Evolutionary Computing; Similarity; Studies; Support vector machines; Text categorization; Texts
Online access:	Full text

Description
Abstract:	Feature maximization is a cluster quality metric that favors clusters with maximum feature representation with regard to their associated data. In this paper we show that a simple adaptation of this metric can provide a highly efficient feature selection and feature contrasting model in the context of supervised classification. The method is evaluated on different types of textual datasets. The paper illustrates that the proposed method provides a very significant performance increase, compared to state-of-the-art methods, in all the studied cases, even when a single bag-of-words model is used for data description. Interestingly, the most significant performance gain is obtained when classifying highly unbalanced, highly multidimensional and noisy data with a high degree of similarity between the classes.
ISSN:	0925-9902, 1573-7675
DOI:	10.1007/s10844-014-0317-4
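
The abstract above describes feature selection and feature contrasting driven by a feature maximization metric, without giving formulas. The following Python sketch shows one way such a scheme might look. It rests on assumptions that are not stated in this record: a per-class feature F-measure taken as the harmonic mean of a "feature recall" and a "feature predominance", a selection rule that keeps a feature for a class when its score exceeds both the feature's mean across classes and the global mean, and a contrast gain defined as the score divided by the feature's mean. The function names `feature_f_measure` and `select_and_contrast` are hypothetical.

```python
import numpy as np


def feature_f_measure(X, y):
    """Per-class feature F-measure (an assumed formulation, not taken from the record).

    X: (n_docs, n_features) non-negative weights, e.g. bag-of-words counts.
    y: (n_docs,) class labels.
    Returns (classes, ff) with ff[c, f] in [0, 1].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # S[c, f]: total weight of feature f over the documents of class c.
    S = np.vstack([X[y == c].sum(axis=0) for c in classes])
    per_feature = S.sum(axis=0, keepdims=True)  # weight of f over all classes
    per_class = S.sum(axis=1, keepdims=True)    # weight of all features in class c
    # Feature recall: share of the feature's total weight that falls in the class.
    recall = np.divide(S, per_feature, out=np.zeros_like(S), where=per_feature > 0)
    # Feature predominance: share of the class's total weight carried by the feature.
    predominance = np.divide(S, per_class, out=np.zeros_like(S), where=per_class > 0)
    denom = recall + predominance
    ff = np.divide(2 * recall * predominance, denom,
                   out=np.zeros_like(S), where=denom > 0)
    return classes, ff


def select_and_contrast(ff):
    """Assumed selection rule and contrast gain built on the scores above.

    Returns (selected, gain): selected[c, f] is True when feature f is kept
    for class c; gain[c, f] is the contrast factor (0 where not selected).
    """
    feature_mean = ff.mean(axis=0, keepdims=True)  # mean score of f over classes
    global_mean = ff.mean()                        # mean score over everything
    selected = (ff > feature_mean) & (ff > global_mean)
    ratio = np.divide(ff, feature_mean, out=np.zeros_like(ff), where=feature_mean > 0)
    gain = np.where(selected, ratio, 0.0)
    return selected, gain


# Toy data: feature 0 marks class 0, feature 1 marks class 1,
# feature 2 appears equally in both classes.
X_demo = [[3, 0, 1], [2, 0, 1], [0, 4, 1], [0, 3, 1]]
y_demo = [0, 0, 1, 1]
classes_demo, ff_demo = feature_f_measure(X_demo, y_demo)
selected_demo, gain_demo = select_and_contrast(ff_demo)
kept_demo = selected_demo.any(axis=0)  # features kept for at least one class
```

On the toy data, `kept_demo` retains the two class-specific features and drops the shared one. In a classification pipeline the gain values could be used to reweight the retained features per class before training a classifier; that step is likewise an assumption here and is not described in the record.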