Big data analytics approaches for treatment of imbalance and missing values problems on high dimensionality dataset

Bibliographic Details
Main Authors: Muhammed Nor, Muhammed Haziq; Abu Bakar, Mohd Aftar; Ariff, Noratiqah Mohd; Hassan, Hasmirah; Ahmad Tajudin, Siti Amira Nadia
Format: Conference Proceeding
Language: English
Online Access: Full text
Description
Summary: Telecommunications datasets posed challenges because of their high dimensionality, imbalanced classes, and missing values. When these problems were not handled properly, predictions became inaccurate and model performance declined. Because the churned-customer class was far smaller than the active-customer class, the accuracy paradox arose: although the model's accuracy reached 90%, that figure merely matched the actual class distribution. In addition, the large number of features, many of them redundant or unnecessary, added noise, hindered learning, and significantly prolonged training and computation time. The purpose of this study was therefore to determine the effect of feature selection, data imputation, and techniques for handling imbalanced data on model performance. The study proposed improving voluntary churn models by combining techniques for imbalanced and missing data on a high-dimensional dataset. Compared with the other combinations tested, Decision Trees + Mode Imputation + SMOTE with Random Undersampling, with Random Forest as the classifier, produced the highest classification accuracy, AUC, and F1-score. The study also suggested using Dask or PySpark to process the large telecommunications dataset, so that other machine learning algorithms can run faster and more effectively in Python through parallel computing.
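The accuracy paradox mentioned in the summary can be illustrated with a small sketch. The figures below are assumptions chosen for illustration (1,000 hypothetical customers, 10% of whom churn) and are not taken from the study's dataset; the helper names are likewise hypothetical. A baseline that always predicts "active" reaches 90% accuracy while identifying no churners at all, which is why the summary also reports AUC and F1-score.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; defined as 0.0 when there are no tp, fp, or fn."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Assumed class split for illustration: 1 = churned, 0 = active.
y_true = [1] * 100 + [0] * 900

# Trivial baseline: predict the majority class ("active") for everyone.
majority_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, majority_pred)
baseline_f1 = f1_score(y_true, majority_pred)

print(f"accuracy = {baseline_accuracy:.2f}, F1 (churn) = {baseline_f1:.2f}")
```

Under these assumed numbers the baseline's accuracy equals the share of active customers, so accuracy alone cannot distinguish a useful churn model from one that ignores the minority class.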
ISSN: 0094-243X, 1551-7616
DOI: 10.1063/5.0228054