Differential area analysis for ransomware attack detection within mixed file datasets

The threat from ransomware continues to grow both in the number of affected victims as well as the cost incurred by the people and organisations impacted in a successful attack. In the majority of cases, once a victim has been attacked there remain only two courses of action open to them; either pay...


Bibliographic details
Published in: Computers & security 2021-09, Vol.108, p.102377, Article 102377
Main authors: Davies, Simon R., Macfarlane, Richard, Buchanan, William J.
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue
container_start_page 102377
container_title Computers & security
container_volume 108
creator Davies, Simon R.
Macfarlane, Richard
Buchanan, William J.
description The threat from ransomware continues to grow, both in the number of victims affected and in the cost incurred by the people and organisations impacted by a successful attack. In the majority of cases, once a victim has been attacked only two courses of action remain open to them: pay the ransom or lose their data. One behaviour common to all crypto-ransomware strains is that at some point during their execution they attempt to encrypt the users’ files. This paper demonstrates a technique that can identify when these encrypted files are being generated and that is independent of the ransomware strain. An enhanced mixed-file ransomware data set of more than 130,000 files was developed based on the govdocs1 (Garfinkel, 2020) corpus. This data set was enriched to contain examples of files that reflect the more modern Microsoft file formats, as well as examples of high-entropy file formats such as compressed files and archives. The data set also contained eight different sets of files generated by real-world, high-profile ransomware attacks such as WannaCry, Ryuk, Phobos, Sodinokibi and NetWalker. Previous research (Penrose et al., 2013; Zhao et al., 2011) has highlighted the difficulty of differentiating between compressed and encrypted files using Shannon entropy, as both file types exhibit similar values. One of the experiments described in this paper shows a unique characteristic of the Shannon entropy of encrypted file header fragments. This characteristic was used to differentiate between encrypted files and other high-entropy files such as archives. This discovery was leveraged in the development of a file classification model that used the differential area between the entropy curve of a file under analysis and one generated from random data.
When comparing the entropy plot values of a file under analysis against those generated from a file containing purely random numbers, the greater the correlation between the plots, the higher the confidence that the file under analysis contains encrypted data. The experiments demonstrate a high degree of confidence in the accuracy of the model, which achieved a success rate of more than 99.96% when examining only the first 192 bytes of a file, using a mixed data set of more than 80,000 files. This technique successfully addresses the problem of using file entropy to differentiate compressed and archived files from files encrypted by ransomware in a timely manner.
doi_str_mv 10.1016/j.cose.2021.102377
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2561518248</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0167404821002017</els_id><sourcerecordid>2561518248</sourcerecordid><originalsourceid>FETCH-LOGICAL-c372t-bc8d2fb0591e2b2ca7d4905a6b8271f21a63a4ecbcf7c121713019fe4a274a4f3</originalsourceid><addsrcrecordid>eNp9kE9LAzEQxYMoWKtfwFPA89Ykm92k4EXqXyh4secwm51g1u2mJqm1394t9expYOa94b0fIdeczTjj9W03syHhTDDBx4UolTohE66VKGrB9CmZjCJVSCb1OblIqWOMq1rrCVk9eOcw4pA99BQiAoUB-n3yiboQaYQhhfVuPFDIGewnbTGjzT4MdOfzhx_o2v9gS53vkbaQIWFOl-TMQZ_w6m9Oyerp8X3xUizfnl8X98vClkrkorG6Fa5h1ZyjaIQF1co5q6ButFDcCQ51CRJtY52yXHDFS8bnDiUIJUG6ckpujn83MXxtMWXThW0c8ycjqppXXAupR5U4qmwMKUV0ZhP9GuLecGYO-ExnDvjMAZ854htNd0cTjvm_PUaTrMfBYuvj2N-0wf9n_wWCF3mV</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2561518248</pqid></control><display><type>article</type><title>Differential area analysis for ransomware attack detection within mixed file datasets</title><source>Elsevier ScienceDirect Journals Complete</source><creator>Davies, Simon R. ; Macfarlane, Richard ; Buchanan, William J.</creator><creatorcontrib>Davies, Simon R. ; Macfarlane, Richard ; Buchanan, William J.</creatorcontrib><description>The threat from ransomware continues to grow both in the number of affected victims as well as the cost incurred by the people and organisations impacted in a successful attack. In the majority of cases, once a victim has been attacked there remain only two courses of action open to them; either pay the ransom or lose their data. One common behaviour shared between all crypto ransomware strains is that at some point during their execution they will attempt to encrypt the users’ files. This paper demonstrates a technique that can identify when these encrypted files are being generated and is independent of the strain of the ransomware. 
An enhanced mixed file ransomware data set of more than 130,000 files was developed based on the govdocs1(Garfinkel, 2020) corpus. This data set was enriched to contain examples of files that reflect the more modern Microsoft file formats, as well as examples of high entropy file formats such as compressed files and archives. The data set also contained eight different sets of files that were generated as the result of different real-world high profile ransomware attacks such as WannaCry, Ryuk, Phobos, Sodinokibi and NetWalker. Previous research Penrose et al. (2013); Zhao et al. (2011) has highlighted the difficulty in differentiating between compressed and encrypted files using Shannon entropy as both file types exhibit similar values. One of the experiments described in this paper shows a unique characteristic for the Shannon entropy of encrypted file header fragments. This characteristic was used to differentiate between encrypted files and other high entropy files such as archives. This discovery was leveraged in the development of a file classification model that used the differential area between the entropy curve of a file under analysis and one generated from random data. When comparing the entropy plot values of a file under analysis against one generated by a file containing purely random numbers, the greater the correlation of the plots is, the higher the confidence that the file under analysis contains encrypted data. The experiments demonstrate a high degree of confidence in the accuracy of the model achieving a success rate of more than 99.96% when examining only the first 192 bytes of a file, using a mixed data set of more than 80,000 files. 
This technique successfully addresses the problem of using file entropy to differentiate compressed and archived files from files encrypted by ransomware in a timely manner.</description><identifier>ISSN: 0167-4048</identifier><identifier>EISSN: 1872-6208</identifier><identifier>DOI: 10.1016/j.cose.2021.102377</identifier><language>eng</language><publisher>Amsterdam: Elsevier Ltd</publisher><subject>Archives &amp; records ; Data encryption ; Datasets ; Entropy ; Entropy (Information theory) ; Model accuracy ; Phobos ; Random numbers ; Ransomware ; Ransomware detection ; Test data sets</subject><ispartof>Computers &amp; security, 2021-09, Vol.108, p.102377, Article 102377</ispartof><rights>2021 Elsevier Ltd</rights><rights>Copyright Elsevier Sequoia S.A. Sep 2021</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c372t-bc8d2fb0591e2b2ca7d4905a6b8271f21a63a4ecbcf7c121713019fe4a274a4f3</citedby><cites>FETCH-LOGICAL-c372t-bc8d2fb0591e2b2ca7d4905a6b8271f21a63a4ecbcf7c121713019fe4a274a4f3</cites><orcidid>0000-0001-9377-4539</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://dx.doi.org/10.1016/j.cose.2021.102377$$EHTML$$P50$$Gelsevier$$H</linktohtml><link.rule.ids>314,780,784,3550,27924,27925,45995</link.rule.ids></links><search><creatorcontrib>Davies, Simon R.</creatorcontrib><creatorcontrib>Macfarlane, Richard</creatorcontrib><creatorcontrib>Buchanan, William J.</creatorcontrib><title>Differential area analysis for ransomware attack detection within mixed file datasets</title><title>Computers &amp; security</title><description>The threat from ransomware continues to grow both in the number of affected victims as well as the cost incurred by the people and organisations impacted in a successful attack. 
In the majority of cases, once a victim has been attacked there remain only two courses of action open to them; either pay the ransom or lose their data. One common behaviour shared between all crypto ransomware strains is that at some point during their execution they will attempt to encrypt the users’ files. This paper demonstrates a technique that can identify when these encrypted files are being generated and is independent of the strain of the ransomware. An enhanced mixed file ransomware data set of more than 130,000 files was developed based on the govdocs1(Garfinkel, 2020) corpus. This data set was enriched to contain examples of files that reflect the more modern Microsoft file formats, as well as examples of high entropy file formats such as compressed files and archives. The data set also contained eight different sets of files that were generated as the result of different real-world high profile ransomware attacks such as WannaCry, Ryuk, Phobos, Sodinokibi and NetWalker. Previous research Penrose et al. (2013); Zhao et al. (2011) has highlighted the difficulty in differentiating between compressed and encrypted files using Shannon entropy as both file types exhibit similar values. One of the experiments described in this paper shows a unique characteristic for the Shannon entropy of encrypted file header fragments. This characteristic was used to differentiate between encrypted files and other high entropy files such as archives. This discovery was leveraged in the development of a file classification model that used the differential area between the entropy curve of a file under analysis and one generated from random data. When comparing the entropy plot values of a file under analysis against one generated by a file containing purely random numbers, the greater the correlation of the plots is, the higher the confidence that the file under analysis contains encrypted data. 
The experiments demonstrate a high degree of confidence in the accuracy of the model achieving a success rate of more than 99.96% when examining only the first 192 bytes of a file, using a mixed data set of more than 80,000 files. This technique successfully addresses the problem of using file entropy to differentiate compressed and archived files from files encrypted by ransomware in a timely manner.</description><subject>Archives &amp; records</subject><subject>Data encryption</subject><subject>Datasets</subject><subject>Entropy</subject><subject>Entropy (Information theory)</subject><subject>Model accuracy</subject><subject>Phobos</subject><subject>Random numbers</subject><subject>Ransomware</subject><subject>Ransomware detection</subject><subject>Test data sets</subject><issn>0167-4048</issn><issn>1872-6208</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNp9kE9LAzEQxYMoWKtfwFPA89Ykm92k4EXqXyh4secwm51g1u2mJqm1394t9expYOa94b0fIdeczTjj9W03syHhTDDBx4UolTohE66VKGrB9CmZjCJVSCb1OblIqWOMq1rrCVk9eOcw4pA99BQiAoUB-n3yiboQaYQhhfVuPFDIGewnbTGjzT4MdOfzhx_o2v9gS53vkbaQIWFOl-TMQZ_w6m9Oyerp8X3xUizfnl8X98vClkrkorG6Fa5h1ZyjaIQF1co5q6ButFDcCQ51CRJtY52yXHDFS8bnDiUIJUG6ckpujn83MXxtMWXThW0c8ycjqppXXAupR5U4qmwMKUV0ZhP9GuLecGYO-ExnDvjMAZ854htNd0cTjvm_PUaTrMfBYuvj2N-0wf9n_wWCF3mV</recordid><startdate>202109</startdate><enddate>202109</enddate><creator>Davies, Simon R.</creator><creator>Macfarlane, Richard</creator><creator>Buchanan, William J.</creator><general>Elsevier Ltd</general><general>Elsevier Sequoia S.A</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>K7.</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0001-9377-4539</orcidid></search><sort><creationdate>202109</creationdate><title>Differential area analysis for ransomware attack detection within mixed file 
datasets</title><author>Davies, Simon R. ; Macfarlane, Richard ; Buchanan, William J.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c372t-bc8d2fb0591e2b2ca7d4905a6b8271f21a63a4ecbcf7c121713019fe4a274a4f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Archives &amp; records</topic><topic>Data encryption</topic><topic>Datasets</topic><topic>Entropy</topic><topic>Entropy (Information theory)</topic><topic>Model accuracy</topic><topic>Phobos</topic><topic>Random numbers</topic><topic>Ransomware</topic><topic>Ransomware detection</topic><topic>Test data sets</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Davies, Simon R.</creatorcontrib><creatorcontrib>Macfarlane, Richard</creatorcontrib><creatorcontrib>Buchanan, William J.</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Criminal Justice (Alumni)</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Computers &amp; security</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Davies, Simon R.</au><au>Macfarlane, Richard</au><au>Buchanan, William J.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Differential area analysis for ransomware attack detection within mixed file datasets</atitle><jtitle>Computers &amp; 
security</jtitle><date>2021-09</date><risdate>2021</risdate><volume>108</volume><spage>102377</spage><pages>102377-</pages><artnum>102377</artnum><issn>0167-4048</issn><eissn>1872-6208</eissn><abstract>The threat from ransomware continues to grow both in the number of affected victims as well as the cost incurred by the people and organisations impacted in a successful attack. In the majority of cases, once a victim has been attacked there remain only two courses of action open to them; either pay the ransom or lose their data. One common behaviour shared between all crypto ransomware strains is that at some point during their execution they will attempt to encrypt the users’ files. This paper demonstrates a technique that can identify when these encrypted files are being generated and is independent of the strain of the ransomware. An enhanced mixed file ransomware data set of more than 130,000 files was developed based on the govdocs1(Garfinkel, 2020) corpus. This data set was enriched to contain examples of files that reflect the more modern Microsoft file formats, as well as examples of high entropy file formats such as compressed files and archives. The data set also contained eight different sets of files that were generated as the result of different real-world high profile ransomware attacks such as WannaCry, Ryuk, Phobos, Sodinokibi and NetWalker. Previous research Penrose et al. (2013); Zhao et al. (2011) has highlighted the difficulty in differentiating between compressed and encrypted files using Shannon entropy as both file types exhibit similar values. One of the experiments described in this paper shows a unique characteristic for the Shannon entropy of encrypted file header fragments. This characteristic was used to differentiate between encrypted files and other high entropy files such as archives. 
This discovery was leveraged in the development of a file classification model that used the differential area between the entropy curve of a file under analysis and one generated from random data. When comparing the entropy plot values of a file under analysis against one generated by a file containing purely random numbers, the greater the correlation of the plots is, the higher the confidence that the file under analysis contains encrypted data. The experiments demonstrate a high degree of confidence in the accuracy of the model achieving a success rate of more than 99.96% when examining only the first 192 bytes of a file, using a mixed data set of more than 80,000 files. This technique successfully addresses the problem of using file entropy to differentiate compressed and archived files from files encrypted by ransomware in a timely manner.</abstract><cop>Amsterdam</cop><pub>Elsevier Ltd</pub><doi>10.1016/j.cose.2021.102377</doi><orcidid>https://orcid.org/0000-0001-9377-4539</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0167-4048
ispartof Computers & security, 2021-09, Vol.108, p.102377, Article 102377
issn 0167-4048
1872-6208
language eng
recordid cdi_proquest_journals_2561518248
source Elsevier ScienceDirect Journals Complete
subjects Archives & records
Data encryption
Datasets
Entropy
Entropy (Information theory)
Model accuracy
Phobos
Random numbers
Ransomware
Ransomware detection
Test data sets
title Differential area analysis for ransomware attack detection within mixed file datasets
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T00%3A09%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Differential%20area%20analysis%20for%20ransomware%20attack%20detection%20within%20mixed%20file%20datasets&rft.jtitle=Computers%20&%20security&rft.au=Davies,%20Simon%20R.&rft.date=2021-09&rft.volume=108&rft.spage=102377&rft.pages=102377-&rft.artnum=102377&rft.issn=0167-4048&rft.eissn=1872-6208&rft_id=info:doi/10.1016/j.cose.2021.102377&rft_dat=%3Cproquest_cross%3E2561518248%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2561518248&rft_id=info:pmid/&rft_els_id=S0167404821002017&rfr_iscdi=true