Enhanced Detection of Text and Image Spam Using Cost-Sensitive Deep Learning

Bibliographic details
Published in: Traitement du signal, 2024-06, Vol. 41 (3), p. 1283-1292
Main authors: Mallampati, Deepika; Hegde, Nagaratna P.
Format: Article
Language: English
Description
Summary: In the realm of unwanted digital content, image spam presents a distinct challenge, characterized by its evasion of traditional text-based filters. This study introduces an advanced approach for the classification of image spam through the deployment of hybrid, cost-sensitive machine learning techniques. Images laden with spam (unwanted content) and benign images (ham) are distinguished by employing a combination of textual and visual data, which enriches the interpretative depth of the analysis. By integrating multi-modal features, resilience against fluctuations in input data and noise is significantly improved. The synthesis of textual context and visual elements enables robust generalization across similar instances while compensating for variations in verbal descriptions, thus maintaining consistent model performance in diverse conditions. A novel methodology is presented wherein cost-sensitive (CS) learning is applied to optimize both feature representations and classifier parameters concurrently, using a deep convolutional neural network (CNN) integrated with a support vector machine (SVM) model. This cost-effective strategy is designed to address class imbalances and refine intermediate feature representations, facilitating rapid adaptation to class-dependent costs. The proposed CSCNN-SVM model is evaluated using the ISH dataset, demonstrating superior performance with an accuracy rate of 98.05%, an AUC of 99.01%, and a computational testing duration of one to two seconds. Furthermore, a variety of machine learning techniques including Logistic Regression, Random Forest, Decision Trees, K-Nearest Neighbors, Gaussian Naive Bayes, AdaBoost, and Linear SVM are employed. Utilizing the Spam Hunter Dataset, which consists of real spam emails, these algorithms have proven effective in identifying both text and image spam, achieving comparable levels of accuracy. This innovative, hybrid model not only enhances the detection capabilities of spam classifiers but also contributes significantly to the broader field of digital content management.
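The summary mentions class-dependent costs in an SVM but gives no formulation. As a rough illustration only, the sketch below trains a plain linear SVM by subgradient descent on a hinge loss whose penalty is scaled by a per-class cost. It is not the authors' CSCNN-SVM: there is no CNN, the two-dimensional feature vectors are made up, and the cost values, function names, and hyperparameters are all assumptions chosen for the example.

```python
import random

def train_cost_sensitive_svm(X, y, class_cost, lr=0.1, lam=0.01, epochs=200, seed=0):
    """Linear SVM fitted by subgradient descent on a cost-weighted hinge loss.

    X: list of feature vectors; y: labels in {-1, +1} (here +1 = spam, -1 = ham).
    class_cost: dict mapping each label to its misclassification cost
    (hypothetical values, not taken from the paper).
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            xi, yi = X[i], y[i]
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularisation shrinks the weights on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Hinge subgradient, scaled by the cost of this example's class,
                # so errors on the costly class move the boundary further.
                c = class_cost[yi]
                w = [wj + lr * c * yi * xj for wj, xj in zip(w, xi)]
                b += lr * c * yi
    return w, b

def predict(w, b, x):
    """Return +1 (spam) or -1 (ham) for a single feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy, imbalanced data: six ham examples and two spam examples (made up).
X = [[-2.0, -1.5], [-1.5, -2.0], [-2.5, -2.0], [-2.0, -2.5],
     [-1.0, -1.5], [-1.5, -1.0], [2.0, 1.5], [1.5, 2.0]]
y = [-1, -1, -1, -1, -1, -1, 1, 1]

# Assumed costs: missing a spam image is treated as three times as costly.
class_cost = {-1: 1.0, 1: 3.0}

w, b = train_cost_sensitive_svm(X, y, class_cost)
predictions = [predict(w, b, x) for x in X]
```

Raising the cost of the minority class is one common way to counter class imbalance; in a deep model the same weighting would typically be applied to the loss that drives the feature extractor as well, which is the joint optimization the summary describes.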
ISSN: 0765-0019, 1958-5608
DOI: 10.18280/ts.410317