Developing Software Signature Search Engines Using Paragraph Vector Model: A Triage Approach for Digital Forensics


Bibliographic details
Published in: IEEE Access 2021, Vol. 9, pp. 55814-55832
Main authors: Soltani, Somayeh; Seno, Seyed Amin Hosseini; Budiarto, Rahmat
Format: Article
Language: English
Description
Summary: With the growth of information and communication technology, digital crime has also spread. Advanced storage technologies and their low cost have led to a significant increase in their use, so the high volume of digital data to be analyzed is a challenge for digital forensic investigators. Digital forensic triage solutions aim to alleviate the forensic backlog. A promising triage technique is to quickly find the software packages run on the target system in order to narrow down the search space. In this paper, we propose a software signature search engine (S3E) to identify software on the system under investigation. The document collection of this search engine consists of software signatures, and the query consists of the features extracted from the system's hard disk. We propose a forensic differential analysis model to build software signatures. In addition, we use the paragraph vector model to construct a vector for each software signature and to find similarities between the query vector and the signature vectors. Several design parameters are involved in building a software signature search engine, and distinct values of these parameters lead to different models. We measured the performance of these S3E models against several controlled systems and some pseudo-real systems. The experimental results on both datasets show that some S3E models achieve perfect recall, and many of them have a recall of more than 90%. Moreover, the recall of the S3E models on both datasets is higher than that of the averaged word2vec model and the TF-IDF model.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2021.3071795
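
Illustrative sketch (not part of the catalog record or the paper): the summary describes a retrieval setup in which software signatures form the document collection and features extracted from the disk form the query. The Python below is a minimal sketch of that ranking step using the TF-IDF baseline the summary names as a comparison model, not the paragraph vector model the authors propose. The function names, the smoothed IDF formula, and the toy feature strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_idf(signatures):
    """Smoothed inverse document frequency over the signature collection."""
    n = len(signatures)
    df = Counter()
    for features in signatures.values():
        df.update(set(features))
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf_vector(features, idf):
    """Sparse TF-IDF vector; features unseen in the collection are ignored."""
    tf = Counter(f for f in features if f in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_signatures(signatures, query_features):
    """Return (software name, similarity) pairs, most similar first."""
    idf = build_idf(signatures)
    query_vec = tfidf_vector(query_features, idf)
    scores = {name: cosine(query_vec, tfidf_vector(features, idf))
              for name, features in signatures.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: each signature is a list of file-system or
# registry artifacts; none of these names come from the paper.
SIGNATURES = {
    "browser_x": ["c:/program files/browser_x/bx.exe",
                  "hkcu/software/browser_x",
                  "c:/users/u/appdata/browser_x/profile"],
    "editor_y": ["c:/program files/editor_y/ey.exe",
                 "hkcu/software/editor_y"],
    "chat_z": ["c:/program files/chat_z/cz.exe",
               "hkcu/software/chat_z",
               "c:/users/u/appdata/chat_z/cache"],
}

# Hypothetical features extracted from the disk under investigation.
QUERY = ["c:/program files/browser_x/bx.exe",
         "hkcu/software/browser_x",
         "c:/windows/system32/kernel32.dll"]

ranking = rank_signatures(SIGNATURES, QUERY)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup only the `browser_x` signature shares features with the query, so it ranks first. Under the same assumptions, replacing `tfidf_vector` with a learned embedding would move the sketch closer to the approach the summary describes.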