Determining threshold value on information gain feature selection to increase speed and prediction accuracy of random forest

Bibliographic Details
Published in:	Journal of Big Data 2021-06, Vol.8 (1), p.1-22, Article 84
Main Authors:	Prasetiyowati, Maria Irmina; Maulidevi, Nur Ulfa; Surendro, Kridanto
Format:	Article
Language:	English
Subjects:	Accuracy; Algorithms; Big Data; Communications Engineering; Computational Science and Engineering; Computer Science; Computer Science, Theory & Methods; Data Mining and Knowledge Discovery; Database Management; Datasets; Feature selection; Information Storage and Retrieval; Mathematical Applications in Computer Science; Networks; Random forest; Science & Technology; Standard deviation; Technology; Threshold; Time
Online Access:	Full text
Description
Abstract:	Feature selection is a pre-processing technique that removes unnecessary features and speeds up the algorithm. One approach calculates the information gain of each feature in the dataset and keeps only the features whose information gain meets a threshold. However, this threshold is usually chosen arbitrarily or set to a fixed value of 0.05. This study therefore proposes determining the threshold from the standard deviation of the information gain values of the features in the dataset. The proposed threshold was tested on 10 datasets, both in their original form and after transformation by FFT and IFFT, and classified using Random Forest. On the transformed datasets, the proposed threshold resulted in lower accuracy and longer execution time than Correlation-Based Feature Selection (CBF) and the standard 0.05 threshold. Similarly, the accuracy obtained was lower when using transformed features. On the original datasets, the standard deviation threshold resulted in better Random Forest classification accuracy. Furthermore, using the transformed features with the proposed threshold, excluding the imaginary parts, led to a faster average execution time than the three methods compared.
ISSN:	2196-1115
DOI:	10.1186/s40537-021-00472-4
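
The thresholding idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the exact rule, so the sketch assumes that the threshold is the population standard deviation of the per-feature information gain values and that a feature is kept when its gain is at least the threshold. The function names, the toy data, and the restriction to discrete features are likewise hypothetical.

```python
import math
from collections import Counter
from statistics import pstdev


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Information gain of one discrete feature with respect to the labels."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels that remains after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(columns, labels, threshold=None):
    """Keep features whose information gain is at least the threshold.

    If no threshold is given, use the standard deviation of the gains.
    ASSUMPTION: population standard deviation and a ">=" comparison;
    the abstract does not specify either detail.
    """
    gains = {name: information_gain(values, labels)
             for name, values in columns.items()}
    if threshold is None:
        threshold = pstdev(list(gains.values()))
    selected = [name for name, gain in gains.items() if gain >= threshold]
    return selected, gains, threshold


# Hypothetical toy data: three discrete features, four samples.
toy_labels = [0, 0, 1, 1]
toy_columns = {
    "a": [0, 0, 1, 1],  # separates the classes perfectly
    "b": [0, 1, 0, 1],  # carries no information about the class
    "c": [0, 0, 0, 1],  # partially informative
}

by_std, toy_gains, std_threshold = select_features(toy_columns, toy_labels)
by_fixed, _, _ = select_features(toy_columns, toy_labels, threshold=0.05)
print("standard deviation threshold:", round(std_threshold, 4), by_std)
print("fixed 0.05 threshold:", by_fixed)
```

On this toy data the data-dependent threshold is stricter than the fixed 0.05 value, so it keeps fewer features. Whether that helps accuracy or run time on real datasets is the question the study evaluates.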