Supervised term weighting centroid-based classifiers for text categorization

Bibliographic details
Published in:	Knowledge and information systems 2013-04, Vol.35 (1), p.61-85
Main authors:	Nguyen, Tam T., Chang, Kuiyu, Hui, Siu Cheung
Format:	Article
Language:	English
Subjects:	Applied sciences; Artificial intelligence; Centroids; Chlorofluorocarbons; Classification; Classifiers; Computer Science; Computer science, control theory, systems; Data Mining and Knowledge Discovery; Data processing. List processing. Character string processing; Database Management; Datasets; Divergence; Documents; Exact sciences and technology; Information Storage and Retrieval; Information Systems and Communication Service; Information Systems Applications (incl. Internet); IT in Business; Memory organisation. Data processing; Regular Paper; Software; Speech and sound recognition and synthesis. Linguistics; Studies; Support vector machines; Text categorization; Texts; Vector space; Visualization; Weighting
Online access:	Full text

Description
Abstract:	In this paper, we study the theoretical properties of the class feature centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). We show that CFC is inherently biased toward the larger (dominant majority) classes, which invariably leads to poor performance on class-imbalanced data. CFC also aggressively prunes terms that appear across all classes, discarding some non-exclusive but useful terms. To overcome these limitations while retaining CFC's intrinsic design goals, we propose an improved centroid-based classifier that uses precise term-class distribution properties instead of the mere presence or absence of terms in classes. Specifically, terms are weighted by the Kullback–Leibler (KL) divergence between pairs of class-conditional term probabilities; we call this the CFC–KL centroid classifier. We then generalize CFC–KL to multi-class data by replacing the KL measure with the multi-class Jensen–Shannon (JS) divergence, yielding CFC–JS. Our proposed supervised term weighting schemes were evaluated on five datasets; KL- and JS-weighted classifiers consistently outperformed baseline CFC and unweighted support vector machines (SVM). We also devise a word cloud visualization approach to highlight the important class-specific words picked out by our KL and JS term weighting schemes, which were otherwise obscured by unsupervised term weighting. The experimental and visualization results show that KL and JS term weighting not only notably improves centroid-based classifiers but also benefits SVM classifiers.
ISSN:	0219-1377 0219-3116
DOI:	10.1007/s10115-012-0559-9
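
Illustration:	The abstract describes weighting terms by a divergence between class-conditional term probabilities. The sketch below is not the authors' formulation, which the abstract does not spell out. It assumes each term is modelled per class as a Bernoulli variable (the smoothed fraction of documents in the class that contain the term), uses the standard KL divergence for two classes, and uses the generalized Jensen–Shannon divergence (entropy of the prior-weighted mixture minus the weighted mean entropy) for several classes. All function names and the toy data are hypothetical.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL(P || Q) in bits between two Bernoulli distributions with rates p and q."""
    # Clip away from 0 and 1 so the logarithms stay finite (an assumption of this sketch).
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log2(p / q) + (1.0 - p) * math.log2((1.0 - p) / (1.0 - q))


def entropy_bernoulli(p):
    """Shannon entropy in bits of a Bernoulli distribution with rate p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def js_bernoulli(probs, priors):
    """Generalized JS divergence of several Bernoulli distributions with given weights."""
    mixture = sum(w * p for w, p in zip(priors, probs))
    mean_entropy = sum(w * entropy_bernoulli(p) for w, p in zip(priors, probs))
    return entropy_bernoulli(mixture) - mean_entropy


def term_class_probs(docs_by_class, term):
    """Laplace-smoothed fraction of documents in each class that contain the term."""
    return {
        label: (sum(term in doc for doc in docs) + 1) / (len(docs) + 2)
        for label, docs in docs_by_class.items()
    }


# Toy corpus (hypothetical): each document is a set of terms.
docs_by_class = {
    "sports": [{"goal", "team"}, {"team", "match"}, {"goal", "the"}],
    "finance": [{"stock", "the"}, {"market", "the"}, {"stock", "bank"}],
}


def kl_weight(term):
    """Two-class weight: divergence of the term's rate in 'sports' from 'finance'."""
    probs = term_class_probs(docs_by_class, term)
    return kl_bernoulli(probs["sports"], probs["finance"])


def js_weight(term):
    """Multi-class weight, using class priors proportional to class size."""
    probs = term_class_probs(docs_by_class, term)
    total = sum(len(docs) for docs in docs_by_class.values())
    labels = list(docs_by_class)
    priors = [len(docs_by_class[label]) / total for label in labels]
    return js_bernoulli([probs[label] for label in labels], priors)
```

Under these assumptions a term concentrated in one class, such as "goal", receives a larger weight than a term spread across both classes, such as "the". Unlike a presence-or-absence rule, the shared term keeps a small nonzero weight instead of being pruned outright.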