A Novel Over-Sampling Method and its Application to Cancer Classification from Gene Expression Data

One of the most critical and frequent problems in biomedical data classification is imbalanced class distribution, where samples from the majority class significantly outnumber the minority class. SMOTE is a well-known general over-sampling method used to address this problem; however, in some cases...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Chem-Bio Informatics Journal 2013, Vol.13, pp.19-29
Hauptverfasser: Dang, Xuan Tho, Hirose, Osamu, Bui, Duong Hung, Saethang, Thammakorn, Tran, Vu Anh, Nguyen, Lan Anh T., Le, Tu Kien T., Kubo, Mamoru, Yamada, Yoichi, Satou, Kenji
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:One of the most critical and frequent problems in biomedical data classification is imbalanced class distribution, where samples from the majority class significantly outnumber the minority class. SMOTE is a well-known general over-sampling method used to address this problem; however, in some cases it cannot improve or even reduces classification performance. To address these issues, we have developed a novel minority over-sampling method named safe-SMOTE. Experimental results from two gene expression datasets for cancer classification (i.e., colon-cancer and leukemia) and six imbalanced benchmark datasets from the UCI Machine Learning Repository showed that our method achieved better sensitivity and G-mean values than both the control method (i.e., no over-sampling) and SMOTE. For example, in the colon-cancer dataset, although the sensitivity and specificity achieved by SMOTE (81.36% and 88.63%) were lower than for the control method (81.59% and 89.50%), safe-SMOTE in contrast had these values increase (81.82% and 90.50%). Similarly, the G-mean value of the control (85.45%) decreased to 84.91% when SMOTE was employed, but increased to 86.04% when using safe-SMOTE. In the leukemia dataset, SMOTE was able to improve the sensitivity and G-mean values with respect to the control; however, safe-SMOTE achieved noticeable, even greater improvements for both of these criteria.
ISSN:1347-6297
1347-0442
DOI:10.1273/cbij.13.19