Toward the Detection of Polyglot Files

Standardized file formats play a key role in the development and use of computer software. However, it is possible to abuse standardized file formats by creating a file that is valid in multiple file formats. The resulting polyglot (many languages) file can confound file format identification, allow...

Detailed description

Bibliographic details
Main authors:	Koch, Luke; Oesch, Sean; Adkisson, Mary; Erwin, Sam; Weber, Brian; Chaulagain, Amul
Format:	Article
Language:	eng
Subjects:	Computer Science - Cryptography and Security; Computer Science - Learning

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Koch, Luke; Oesch, Sean; Adkisson, Mary; Erwin, Sam; Weber, Brian; Chaulagain, Amul
description	Standardized file formats play a key role in the development and use of computer software. However, it is possible to abuse standardized file formats by creating a file that is valid in multiple file formats. The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file format identification for feature extraction. File format identification processes that depend on file signatures can be easily evaded thanks to flexibility in the format specifications of certain file formats. Although work has been done to identify file formats using more comprehensive methods than file signatures, accurate identification of polyglot files remains an open problem. Since malware detection systems routinely perform file format-specific feature extraction, polyglot files need to be filtered out prior to ingestion by these systems. Otherwise, malicious content could pass through undetected. To address the problem of polyglot detection, we assembled a data set using the mitra tool. We then evaluated the performance of the most commonly used file identification tool, file. Finally, we demonstrated the accuracy, precision, recall, and F1 score of a range of machine and deep learning models. MalConv2 and CatBoost demonstrated the highest recall on our data set with 95.16% and 95.45%, respectively. These models can be incorporated into a malware detector's file processing pipeline to filter out potentially malicious polyglots before file format-dependent feature extraction takes place.
doi_str_mv	10.48550/arxiv.2203.07561
format	Article
creationdate	2022-03-14
rights	http://creativecommons.org/licenses/by-nc-nd/4.0
oa	free_for_read
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2203.07561
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2203_07561
source	arXiv.org
subjects	Computer Science - Cryptography and Security; Computer Science - Learning
title	Toward the Detection of Polyglot Files
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T10%3A37%3A45IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Toward%20the%20Detection%20of%20Polyglot%20Files&rft.au=Koch,%20Luke&rft.date=2022-03-14&rft_id=info:doi/10.48550/arxiv.2203.07561&rft_dat=%3Carxiv_GOX%3E2203_07561%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true