Political social media bot detection: Unveiling cutting-edge feature selection and engineering strategies in machine learning model development


Detailed description

Bibliographic details
Published in: Scientific African, 2024-09, Vol. 25, p. e02269, Article e02269
Main authors: Ellaky, Zineb; Benabbou, Faouzia
Format: Article
Language: English
Online access: Full text
Description
Summary:
• The article offers a comprehensive analysis of existing political SMB detection research.
• It proposes a classification of political SMB detection research.
• The proposed approach uses a diverse set of features and systematic feature selection techniques.
• The dataset is balanced using SMOTE-ENN to address class imbalance and overfitting.
• The results demonstrate the effectiveness of the model compared to state-of-the-art techniques.

Over time, social media bots (SMBs), and political SMBs in particular, have played a crucial role in spreading misinformation, manipulating public opinion, and harassing and intimidating users of online social networks (OSNs). This article studies previous work on the detection and analysis of political SMB activity and addresses critical challenges that significantly affect the effectiveness of SMB detection models: feature engineering, feature selection (FS), and model implementation. Over 33 features were extracted from the TwiBot-20 dataset, covering content, user information, network, behavior, and temporal features. Various FS techniques, including basic, filter, wrapper, embedded, and hybrid methods, are explored and compared to select the optimal features. The optimal features are then used to train multiple machine-learning algorithms. To balance the dataset, the synthetic minority oversampling technique coupled with edited nearest neighbors (SMOTE-ENN) is used. Model performance improved from an initial Area Under the Curve (AUC) of 90.40% and accuracy of 81.60% on the original feature set to 99.50% on the test set and 100% on the training set across all metrics used. Decision Trees, Random Forest, Gradient Boosting, AdaBoost, XGBoost (XGB), and Extra Trees emerge as the most effective algorithms for detecting political SMBs.
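The balancing step named in the summary can be illustrated with a small, standard-library-only sketch of the SMOTE-ENN idea. It is not the authors' implementation: the toy two-feature points, the labels "human" and "bot", the helper names, and the majority-vote cleaning rule with k = 3 are all assumptions made for illustration. In practice a library such as imbalanced-learn (its `SMOTEENN` class in `imblearn.combine`) would likely be used instead.

```python
import math
import random
from collections import Counter


def nearest(points, query, k, skip=None):
    """Indices of the k points closest to `query` (Euclidean), ignoring index `skip`."""
    order = sorted(
        (i for i in range(len(points)) if i != skip),
        key=lambda i: math.dist(points[i], query),
    )
    return order[:k]


def smote(X, y, minority, k=3, rng=None):
    """Add synthetic minority samples until both classes have equal size (binary case)."""
    rng = rng or random.Random(0)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(X) - len(minority_pts)) - len(minority_pts)
    synthetic = []
    for _ in range(max(n_needed, 0)):
        i = rng.randrange(len(minority_pts))
        # Pick one of the k nearest minority neighbours and interpolate towards it.
        j = rng.choice(nearest(minority_pts, minority_pts[i], k, skip=i))
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority_pts[i], minority_pts[j]))
        )
    return list(X) + synthetic, list(y) + [minority] * len(synthetic)


def edited_nearest_neighbours(X, y, k=3):
    """Drop samples whose label disagrees with the majority label of their k neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        votes = Counter(y[j] for j in nearest(X, x, k, skip=i))
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority, k=3, seed=0):
    """Oversample with SMOTE, then clean the result with ENN."""
    X_over, y_over = smote(X, y, minority, k, random.Random(seed))
    return edited_nearest_neighbours(X_over, y_over, k)


# Hypothetical toy data: 9 "human" accounts (one of them sitting inside the
# bot cluster as label noise) and 3 "bot" accounts, two features each.
humans = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (0.2, 0.8), (0.8, 0.2), (0.4, 0.1), (5.5, 5.5)]
bots = [(5, 5), (5, 6), (6, 5)]
X = humans + bots
y = ["human"] * len(humans) + ["bot"] * len(bots)

X_res, y_res = smote_enn(X, y, minority="bot")
print(Counter(y), "->", Counter(y_res))
```

On this toy data the oversampling step raises the bot class to the size of the human class, and the cleaning step then removes the single human point that lies inside the bot cluster. Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.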
ISSN:2468-2276
DOI:10.1016/j.sciaf.2024.e02269