Complete Random Forest Based Class Noise Filtering Learning for Improving the Generalizability of Classifiers

Bibliographic details
Published in: IEEE Transactions on Knowledge and Data Engineering, 2019-11, Vol. 31 (11), p. 2063-2078
Main authors: Xia, Shuyin; Wang, Guoyin; Chen, Zizhong; Duan, Yanlin; Liu, Qun
Format: Article
Language: English
Description
Abstract: Existing noise detection methods rely on classifiers, distance measurements, or the overall data distribution. The "curse of dimensionality" and other restrictions make them insufficiently effective on complex data, e.g., data with different attribute weights, high dimensionality, feature noise, or nonlinearity. This is also the main reason that existing noise filtering methods have not been widely applied and have not formed an effective learning framework. To address this problem, we propose a complete and efficient random forest method (CRF) designed specifically for class noise detection by simulating grid generation and expansion. The CRF is not based on distance measures, the overall distribution, or classifiers; in addition, its voting mechanism enables it to process datasets containing feature noise effectively. Furthermore, we introduce a CRF-based class noise filtering learning framework (CRF-NFL) and derive its mathematical model. The framework is then applied to many widely used classifiers, including some state-of-the-art algorithms, e.g., k-means tree, GBDT, and XGBoost. Moreover, a parallelized version is designed for large-scale data. CRF-NFL shows much better generalizability than the conventional classifiers and the relative density-based method, which is the most effective noise filtering method as far as we know. All of this research has been released as an open source library, called CRF-NFL: http://www.cquptshuyinxia.com/CRF-NFL.html.
ISSN: 1041-4347, 1558-2191
DOI: 10.1109/TKDE.2018.2873791
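
Illustrative sketch (not part of the catalog record or the paper): the abstract describes detecting class noise by voting across randomly built trees, without distance measures or a trained classifier. The Python below shows that general idea only. The function names (`flag_in_tree`, `vote_noise_filter`), the `max_leaf` stopping rule, and the "disagrees with the leaf majority" criterion are assumptions made for illustration; they are not the CRF noise criterion or the CRF-NFL library API.

```python
import random
from collections import Counter


def flag_in_tree(X, y, idx, rng, flagged, max_leaf):
    """Grow one completely random tree over samples `idx`.

    In every leaf, samples whose label differs from the leaf's
    majority label are added to `flagged` (assumed heuristic).
    """
    labels = [y[i] for i in idx]
    # Features that still take more than one value in this node.
    splittable = [f for f in range(len(X[0]))
                  if len({X[i][f] for i in idx}) > 1]
    if len(set(labels)) == 1 or len(idx) <= max_leaf or not splittable:
        majority = Counter(labels).most_common(1)[0][0]
        flagged.update(i for i in idx if y[i] != majority)
        return
    # Random feature, random cut between two adjacent distinct values.
    f = rng.choice(splittable)
    values = sorted({X[i][f] for i in idx})
    k = rng.randrange(len(values) - 1)
    t = (values[k] + values[k + 1]) / 2
    left = [i for i in idx if X[i][f] <= t]
    right = [i for i in idx if X[i][f] > t]
    flag_in_tree(X, y, left, rng, flagged, max_leaf)
    flag_in_tree(X, y, right, rng, flagged, max_leaf)


def vote_noise_filter(X, y, n_trees=25, max_leaf=4, seed=0):
    """Return indices flagged as suspected class noise by most trees."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_trees):
        flagged = set()
        flag_in_tree(X, y, list(range(len(X))), rng, flagged, max_leaf)
        votes.update(flagged)
    return sorted(i for i, v in votes.items() if v > n_trees / 2)


# Toy data: two tight groups; index 3 sits in the class-0 group
# but carries label 1 (a simulated class-noise sample).
X = [(0.0, 0.0)] * 6 + [(10.0, 10.0)] * 6
y = [0, 0, 0, 1, 0, 0] + [1] * 6
suspects = vote_noise_filter(X, y)
print(suspects)  # [3]
```

In a filtering workflow of the kind the abstract outlines, the flagged indices would be dropped before training any downstream classifier. The published criterion and the parallelized implementation are described in the article and the linked library.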