SVMs Modeling for Highly Imbalanced Classification

Bibliographic details
Published in:	IEEE transactions on cybernetics 2009-02, Vol.39 (1), p.281-288
Main authors:	Yuchun Tang, Yan-Qing Zhang, N.V. Chawla, S. Krasser
Format:	Article
Language:	English
Subjects:	Algorithms; Area Under Curve; Artificial Intelligence; Bioinformatics; Classification; Classification algorithms; Cleaning; Cluster Analysis; Computational intelligence; Computer science; Computer Simulation; cost-sensitive learning; Cybernetics; Data Interpretation, Statistical; Data mining; granular computing; highly imbalanced classification; Learning systems; Machine learning; Mathematical analysis; Mathematical models; oversampling; Pattern Recognition, Automated - methods; ROC Curve; Sampling methods; Strategy; Studies; Support vector machine classification; Support vector machines; support vector machines (SVMs); undersampling

Description
Summary:	Traditional classification algorithms can be limited in their performance on highly unbalanced data sets. A popular stream of work for countering the problem of class imbalance has been the application of a variety of sampling strategies. In this paper, we focus on designing modifications to support vector machines (SVMs) to appropriately tackle the problem of class imbalance. We incorporate different "rebalance" heuristics in SVM modeling, including cost-sensitive learning, oversampling, and undersampling. These SVM-based strategies are compared with various state-of-the-art approaches on a variety of data sets by using various metrics, including G-mean, area under the receiver operating characteristic curve, F-measure, and area under the precision/recall curve. We show that we are able to surpass or match the previously known best algorithms on each data set. In particular, of the four SVM variations considered in this paper, the novel granular SVMs-repetitive undersampling algorithm (GSVM-RU) is the best in terms of both effectiveness and efficiency. GSVM-RU is effective, as it can minimize the negative effect of information loss while maximizing the positive effect of data cleaning in the undersampling process. GSVM-RU is efficient because it extracts far fewer support vectors and, hence, greatly speeds up SVM prediction.
ISSN:	1083-4419 2168-2267 1941-0492 2168-2275
DOI:	10.1109/TSMCB.2008.2002909
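
Note on the metrics named in the summary: the record does not define G-mean or F-measure, so the sketch below assumes the usual textbook definitions (G-mean as the geometric mean of sensitivity and specificity, F-measure as the weighted harmonic mean of precision and recall). The function names and the example counts are illustrative only and are not taken from the paper. It shows why plain accuracy is a poor guide on highly imbalanced data.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def f_measure(tp, fp, tn, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall (assumed definition)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical data: 5 positives (minority class) and 95 negatives.
y_true = [1] * 5 + [0] * 95

# A classifier that always predicts the majority class.
always_negative = [0] * 100

# A classifier that finds 4 of 5 positives at the cost of 5 false alarms.
rebalanced = [1, 1, 1, 1, 0] + [1] * 5 + [0] * 90

for name, y_pred in [("always negative", always_negative),
                     ("rebalanced", rebalanced)]:
    counts = confusion_counts(y_true, y_pred)
    tp, fp, tn, fn = counts
    accuracy = (tp + tn) / len(y_true)
    print(name, round(accuracy, 3),
          round(g_mean(*counts), 3), round(f_measure(*counts), 3))
```

In this made-up example the majority-class predictor has the higher accuracy (0.95 against 0.94) yet scores zero on both G-mean and F-measure, while the second classifier scores well above zero on both. This is the kind of behaviour that motivates reporting these metrics instead of accuracy.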