Data resampling method based on clustering oversampling and instance hardness threshold

Bibliographic details
Main authors:	MA HUAIYU, ZHANG XIAOGANG, YIN MING, ZHU KUIYU, GAO CUNZHI
Format:	Patent
Language:	Chinese; English
Subjects:	CALCULATING; COMPUTING; COUNTING; HANDLING RECORD CARRIERS; PHYSICS; PRESENTATION OF DATA; RECOGNITION OF DATA; RECORD CARRIERS

Description
Summary:	The invention provides a data resampling method based on clustering oversampling and an instance hardness threshold. The method comprises the following steps: first, the data set is clustered with the K-means method, and the resulting clusters are filtered and assigned sampling weights; next, the SMOTE algorithm is used to oversample the data set, generating new data until the number of minority class samples equals the number of majority class samples, so that the data set becomes class-balanced; finally, the data are cleaned with an instance hardness threshold algorithm to obtain a final balanced data set with fewer noisy points. The method converts a class-imbalanced data set into a balanced one and improves the classifier's prediction performance on minority class samples. 本发明提供了一种基于聚类过采样与实例硬度阈值的数据重采样方法。首先，利用K-means方法对数据集进行聚类处理，并对聚类进行过滤处理和采样权重分配；接着，采用SMOTE算法对数据集进行过采样，生成新的数据使数据集中少数类与多数类样本数量相等，数
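
The three steps in the summary (cluster, filter and weight; SMOTE-style oversampling; hardness-based cleaning) can be illustrated with a minimal Python sketch. This is not the patented procedure: the abstract does not state the filter rule, the weighting formula, or how hardness is measured, so the choices below are assumptions made only for illustration. Clusters are kept if they are at least half minority, sparser clusters receive more weight, and hardness is estimated as the fraction of a sample's k nearest neighbours that carry a different label. Library implementations such as imbalanced-learn's InstanceHardnessThreshold estimate hardness from a classifier's predicted probabilities instead. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_smote(X, y, minority, k=2, k_neighbors=2, seed=0):
    """Cluster, filter, weight, then interpolate minority points until classes are equal."""
    rng = random.Random(seed)
    labels = kmeans(X, k)
    clusters, weights = [], []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        mins = [X[i] for i in idx if y[i] == minority]
        # Assumed filter: keep clusters that are at least half minority
        # and contain at least two minority points.
        if len(mins) >= 2 and 2 * len(mins) >= len(idx):
            pairs = [dist2(a, b) ** 0.5 for i, a in enumerate(mins) for b in mins[i + 1:]]
            clusters.append(mins)
            # Assumed weighting: sparser clusters (larger mean distance) get more samples.
            weights.append(sum(pairs) / len(pairs) + 1e-9)
    if not clusters:
        raise ValueError("no cluster passed the filter")
    counts = Counter(y)
    n_new = max(counts.values()) - counts[minority]
    synthetic = []
    for _ in range(n_new):
        mins = rng.choices(clusters, weights=weights)[0]
        i = rng.randrange(len(mins))
        others = sorted(mins[:i] + mins[i + 1:], key=lambda p: dist2(p, mins[i]))
        neighbor = rng.choice(others[:k_neighbors])
        u = rng.random()
        # SMOTE step: a random point on the segment between a sample and a near neighbour.
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(mins[i], neighbor)))
    return X + synthetic, y + [minority] * n_new


def clean_by_hardness(X, y, k=3, threshold=0.5):
    """Drop samples whose k nearest neighbours mostly carry a different label."""
    keep = []
    for i, p in enumerate(X):
        order = sorted((j for j in range(len(X)) if j != i), key=lambda j: dist2(p, X[j]))
        hardness = sum(y[j] != y[i] for j in order[:k]) / k
        if hardness <= threshold:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def toy_data():
    """Hypothetical data: 20 majority points, 6 minority points, 1 noisy minority point."""
    majority = [(float(i), float(j)) for i in range(5) for j in range(4)]
    minority = [(10 + 0.5 * i, 10 + 0.5 * j) for i in range(3) for j in range(2)]
    noisy = [(2.5, 1.5)]  # minority label, but located inside the majority region
    return majority + minority + noisy, [0] * 20 + [1] * 7


X, y = toy_data()
X_bal, y_bal = cluster_smote(X, y, minority=1)
X_clean, y_clean = clean_by_hardness(X_bal, y_bal)
print(Counter(y), Counter(y_bal), Counter(y_clean))
```

In this toy run the oversampling step only interpolates inside the cluster dominated by minority samples, so no synthetic points are generated around the isolated minority point in the majority region, and the cleaning step then removes that point because all of its nearest neighbours belong to the other class.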