Oversampling imbalanced data in the string space

Bibliographic details
Published in:	Pattern Recognition Letters, 2018-02, Vol. 103, pp. 32-38
Main authors:	Castellanos, Francisco J.; Valero-Mas, Jose J.; Calvo-Zaragoza, Jorge; Rico-Juan, Juan R.
Format:	Article
Language:	English
Subjects:	Algorithms; Class imbalance problem; Coding; Data analysis; Datasets; Oversampling; SMOTE; String space; Strings
Online access:	Full text

Description
Abstract:
• Oversampling in the string space for addressing imbalanced classification.
• Generating new strings between pairs of instances using the edit distance.
• Experimentation with contour representations of handwritten digits and characters.
• Statistical performance improvement of the classifier with respect to the imbalanced case.

Imbalanced data is a typical problem in supervised classification, which occurs when the different classes are not equally represented. This typically biases the classifier towards the class that contains the majority of the elements. Many methods have been proposed to alleviate this scenario, yet all of them assume that data is represented as feature vectors. In this paper, we propose a strategy to balance a dataset whose samples are encoded as strings. Our approach is based on adapting the well-known Synthetic Minority Over-sampling Technique (SMOTE) algorithm to the string space. More precisely, data generation is achieved with an iterative approach that creates artificial strings within the segment between two given samples of the training set. Results with several datasets and imbalance ratios show that the proposed strategy properly deals with the problem in all cases considered.
ISSN:	0167-8655, 1872-7344
DOI:	10.1016/j.patrec.2018.01.003
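
The abstract describes generating artificial strings "within the segment between two given samples" but does not spell out the procedure. The following Python sketch illustrates one plausible reading, and it is not the authors' implementation. It assumes unit-cost Levenshtein edit distance, and it builds an intermediate string by applying only the first part of an optimal edit script from one string to the other. The function names (`edit_script`, `interpolate`, `oversample`) and the parameters `alpha`, `k`, and `seed` are hypothetical, chosen for illustration.

```python
import random


def edit_script(a, b):
    """Return one optimal list of unit-cost edit operations turning a into b.

    Operations are ordered from the end of the string towards the start,
    so they can be applied one by one without shifting pending positions.
    """
    n, m = len(a), len(b)
    # dp[i][j] = Levenshtein distance between a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # match: no operation
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("sub", i - 1, b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", i - 1, None))
            i -= 1
        else:
            ops.append(("ins", i, b[j - 1]))
            j -= 1
    return ops


def interpolate(a, b, alpha):
    """Apply the first round(alpha * d) operations of a script of length d."""
    ops = edit_script(a, b)
    chars = list(a)
    for kind, pos, ch in ops[: round(alpha * len(ops))]:
        if kind == "sub":
            chars[pos] = ch
        elif kind == "del":
            del chars[pos]
        else:
            chars.insert(pos, ch)
    return "".join(chars)


def oversample(minority, n_new, k=3, seed=0):
    """SMOTE-style loop (assumed): pair a random minority string with one of
    its k nearest neighbours under edit distance and interpolate between them.
    Requires at least two minority strings."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        idx = rng.randrange(len(minority))
        base = minority[idx]
        others = minority[:idx] + minority[idx + 1:]
        neighbours = sorted(others, key=lambda s: len(edit_script(base, s)))[:k]
        partner = rng.choice(neighbours)
        synthetic.append(interpolate(base, partner, rng.random()))
    return synthetic
```

With `alpha = 0` the result is the first string and with `alpha = 1` it is the second. Because the operations come from an optimal script, each intermediate string has edit distances to the two endpoints that sum to the distance between them, which is one way to interpret "within the segment" in a string space. For contour data, the input strings could be chain codes such as `"0012334"`, although the exact encoding used in the paper is not stated in this record.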