A distributed attribute reduction based on neighborhood evidential conflict with Apache Spark


Detailed description

Bibliographic details
Published in: Information Sciences 2024-05, Vol. 668, p. 120521, Article 120521
Main authors: Chen, Yuepeng; Ding, Weiping; Ju, Hengrong; Huang, Jiashuang; Yin, Tao
Format: Article
Language: English
Online access: Full text
Description
Abstract: Attribute reduction is widely employed to improve the efficiency and accuracy of data analysis by eliminating redundant and irrelevant attributes from datasets. However, as datasets grow, sequential execution of such algorithms becomes time-consuming, and distributed computing is needed to parallelize them scalably. This study proposes a novel attribute reduction algorithm for neighborhood decision systems. We introduce two novel metrics: the neighborhood evidential conflict degree (NECD), which measures heterogeneity between samples in a neighborhood, and the neighborhood evidential conflict rate (NECR), which assesses the significance of attributes in the feature space. These metrics are used to evaluate and select attribute subsets during attribute reduction, improving classification accuracy and computational efficiency. We also develop a sequential forward selection attribute reduction method that selects a feature subset using the defined NECR. Finally, we develop a distributed attribute reduction algorithm implemented in Apache Spark. Our approach involves a two-phase Map-Reduce process for K-Nearest Neighbors search, evidence combination, and NECR computation. Used as a measure of feature subset quality, NECR improves the selected subset's ability to approximate the decisions in the data. Experimental results on small and large datasets demonstrate that the proposed algorithm outperforms benchmark algorithms in classification accuracy and computational efficiency.
• Neighborhood evidential conflict measures are presented to quantify attribute significance.
• A sequential forward selection attribute reduction is developed to select a feature subset.
• The research proposes an Apache Spark-based distributed attribute reduction algorithm.
ISSN:0020-0255
DOI:10.1016/j.ins.2024.120521
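
The abstract describes a sequential forward selection driven by a subset-quality score. The record does not define NECD or NECR, so the sketch below only illustrates the general greedy loop. The function names (`forward_select`, `knn_disagreement`), the toy data, and the scoring rule are all hypothetical: the score is a simple stand-in (the fraction of a sample's k nearest neighbours that carry a different label), not the paper's NECR, and treating lower scores as better is an assumption. It runs on a single machine and does not attempt the Spark Map-Reduce phases.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]
Score = Callable[[Sequence[Row], Sequence[int], List[int]], float]


def knn_disagreement(rows: Sequence[Row], labels: Sequence[int],
                     attrs: List[int], k: int = 3) -> float:
    """Stand-in score, NOT the paper's NECR (assumption for illustration).

    For each sample, find its k nearest neighbours using only the
    attributes in `attrs`, and average the share of neighbours whose
    label differs from the sample's own label.
    """
    n = len(rows)
    total = 0.0
    for i in range(n):
        # Squared Euclidean distance restricted to the chosen attributes.
        dists = sorted(
            (sum((rows[i][a] - rows[j][a]) ** 2 for a in attrs), j)
            for j in range(n) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        total += sum(labels[j] != labels[i] for j in neighbours) / len(neighbours)
    return total / n


def forward_select(rows: Sequence[Row], labels: Sequence[int],
                   score: Score, tol: float = 1e-9) -> List[int]:
    """Greedy forward selection: add the attribute that lowers the score
    most, and stop when no remaining attribute improves it by more than tol."""
    remaining = list(range(len(rows[0])))
    selected: List[int] = []
    best = float("inf")
    while remaining:
        cand_score, cand = min(
            (score(rows, labels, selected + [a]), a) for a in remaining
        )
        if best - cand_score <= tol:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected


# Toy data (made up): attribute 0 separates the classes, attribute 1 is noise.
toy_rows = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [0.3, 3.0],
            [5.0, 4.0], [5.1, 8.0], [5.2, 2.0], [5.3, 6.0]]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
selected = forward_select(toy_rows, toy_labels, knn_disagreement)
```

Because the scorer is passed in as a function, a faithful NECR implementation taken from the full text could replace `knn_disagreement` without changing the selection loop.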