A Dynamic Analysis Data Preprocessing Technique for Malicious Code Detection with TF-IDF and Sliding Windows

Bibliographic details
Published in: Electronics (Basel), 2024-03, Vol. 13 (5), p. 963
Main authors: Kim, Mihui; Kim, Haesoo
Format: Article
Language: English
Description
Abstract: When using dynamic analysis data to detect malware, time-series data such as API call sequences are used to determine malicious activity through deep learning models such as recurrent neural networks (RNN). However, in API call sequences, APIs are called differently when different programs are executed. To use these data as input for deep learning, preprocessing is performed to unify the size of the data by adding dummy zeros to the data using the zero-padding technique. However, when the standard deviation of the size is significant, the amount of dummy data added increases, making it difficult for the deep learning model to reflect the characteristics of the data. Therefore, this paper proposes a preprocessing technique using term frequency–inverse document frequency (TF-IDF) and a sliding window algorithm. We trained the long short-term memory (LSTM) model on the data with the proposed preprocessing, and the results, with an accuracy of 95.94%, a recall of 97.32%, a precision of 95.71%, and an F1-score of 96.5%, showed that the proposed preprocessing technique is effective.
ISSN: 2079-9292
DOI: 10.3390/electronics13050963
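
Illustrative sketch: The abstract names the ingredients (zero-padding, TF-IDF, a sliding window) but does not say how the authors combine them. The Python sketch below is one plausible reading, not the paper's method. It first measures how much of a zero-padded batch would be dummy data, then weights each API name by TF-IDF and keeps the fixed-size window with the highest total weight. The traces, the smoothed IDF formula, the "<pad>" token, and the function names padding_ratio, tfidf_weights, and best_window are all assumptions made for illustration.

```python
import math
from collections import Counter


def padding_ratio(sequences):
    """Fraction of a zero-padded batch that would consist of dummy values."""
    longest = max(len(seq) for seq in sequences)
    total_cells = longest * len(sequences)
    real_cells = sum(len(seq) for seq in sequences)
    return (total_cells - real_cells) / total_cells


def tfidf_weights(sequences):
    """One dict per sequence, mapping API name to its TF-IDF weight.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1; this choice is an
    assumption, since the abstract does not state the exact formula.
    """
    n_docs = len(sequences)
    doc_freq = Counter()
    for seq in sequences:
        doc_freq.update(set(seq))  # count each API once per sequence
    weights = []
    for seq in sequences:
        counts = Counter(seq)
        weights.append({
            api: (count / len(seq))
            * (math.log((1 + n_docs) / (1 + doc_freq[api])) + 1)
            for api, count in counts.items()
        })
    return weights


def best_window(seq, weights, size):
    """Return the contiguous window of `size` calls with the highest total weight.

    Sequences shorter than `size` are still padded (assumed "<pad>" token),
    but padding is now bounded by `size` rather than by the longest trace.
    """
    if len(seq) <= size:
        return list(seq) + ["<pad>"] * (size - len(seq))
    scores = [weights[api] for api in seq]
    current = sum(scores[:size])
    best_score, best_start = current, 0
    for start in range(1, len(seq) - size + 1):
        # Slide by one call: add the entering score, drop the leaving one.
        current += scores[start + size - 1] - scores[start - 1]
        if current > best_score:
            best_score, best_start = current, start
    return list(seq[best_start:best_start + size])


# Made-up traces for illustration only; not data from the paper.
traces = [
    ["NtOpenFile", "NtReadFile", "NtClose"],
    ["NtOpenFile", "NtReadFile", "NtReadFile", "NtReadFile",
     "CryptEncrypt", "NtWriteFile", "NtClose", "NtClose"],
]

print(f"padding share with plain zero-padding: {padding_ratio(traces):.2%}")
per_trace = tfidf_weights(traces)
fixed = [best_window(seq, w, size=3) for seq, w in zip(traces, per_trace)]
for window in fixed:
    print(window)
```

Every output window has the same length, so the result can be integer-encoded and fed to a sequence model such as an LSTM without padding to the longest trace. Keeping only the single highest-scoring window is a deliberate simplification; keeping several windows per trace would be an equally reasonable variant.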