GA-iForest: An Efficient Isolated Forest Framework Based on Genetic Algorithm for Numerical Data Outlier Detection

Bibliographic details
Published in: Transactions of Nanjing University of Aeronautics & Astronautics 2019-12, Vol.36 (6), p.1026-1038
Main authors: Li, Kexin; Li, Jing; Liu, Shuji; Li, Zhao; Bo, Jue; Liu, Biqi
Format: Article
Language: Chinese; English
Description
Summary: In the data age, data quality has become a major concern. Outlier detection, a branch of data mining, bears directly on data quality. The isolation forest (iForest) algorithm is one of the more prominent outlier detection algorithms for numerical data in recent years. As iForest keeps generating isolation trees, the differences between the trees gradually shrink or vanish altogether, which wastes memory and reduces the efficiency of outlier detection. In addition, some of the constructed isolation trees cannot detect outliers. This paper proposes GA-iForest, an improved iForest-based method. It optimizes the isolation forest by selecting the better isolation trees according to their detection accuracy and their difference from one another, thereby discarding duplicate, similar, and poorly performing isolation trees and improving the accuracy and stability of outlier detection. The experimental environment is built on Ubuntu and the Spark platform, and the outlier datasets provided by ODDS are used for testing. The proposed method is evaluated on accuracy, recall, ROC curves, AUC, and execution time. Experimental results show that, compared with the original iForest method, the proposed method not only improves the accuracy and stability of outlier detection but also reduces the number of isolation trees by 20%-40%.
ISSN: 1005-1120
DOI: 10.16356/j.1005-1120.2019.06.015