A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks

Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of the...

Bibliographic Details
Main authors:	Casey, Beatrice; Santos, Joanna C. S; Perry, George
Format:	Article
Language:	eng
Subjects:	Computer Science - Cryptography and Security; Computer Science - Learning

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Casey, Beatrice ; Santos, Joanna C. S ; Perry, George
description	Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what's not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.
doi_str_mv	10.48550/arxiv.2403.10646
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2403_10646</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2403_10646</sourcerecordid><originalsourceid>FETCH-LOGICAL-a676-26d46729f3f40c2483ce3c006e6a8cbedb7b98963c26c2dc042d72ba93e4393e3</originalsourceid><addsrcrecordid>eNotz71OwzAUhmEvDKhwAUz4BhJc2z1JxhLxJwUh0Yg1OraPwQLiyk4qcvdAYfne7ZMexi7WotT1ZiOuMH2FQym1UOVagIZT9rLluzkdaOHR812ckyXeRkf8mfaJMo0TTiGOmfuY-CPatzAS7wjTGMbX4hozOd4uhlImO6cwLbzH_J7P2InHj0zn_12x_vamb--L7unuod12BUIFhQSnoZKNV14LK3WtLCkrBBBgbQ05U5mmbkBZCVY6K7R0lTTYKNLqZ9SKXf7dHmHDPoVPTMvwCxyOQPUNGoBLmw</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks</title><source>arXiv.org</source><creator>Casey, Beatrice ; Santos, Joanna C. S ; Perry, George</creator><creatorcontrib>Casey, Beatrice ; Santos, Joanna C. S ; Perry, George</creatorcontrib><description>Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what's not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. 
We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.</description><identifier>DOI: 10.48550/arxiv.2403.10646</identifier><language>eng</language><subject>Computer Science - Cryptography and Security ; Computer Science - Learning</subject><creationdate>2024-03</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2403.10646$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2403.10646$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Casey, Beatrice</creatorcontrib><creatorcontrib>Santos, Joanna C. S</creatorcontrib><creatorcontrib>Perry, George</creatorcontrib><title>A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks</title><description>Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. 
With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what's not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.</description><subject>Computer Science - Cryptography and Security</subject><subject>Computer Science - Learning</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz71OwzAUhmEvDKhwAUz4BhJc2z1JxhLxJwUh0Yg1OraPwQLiyk4qcvdAYfne7ZMexi7WotT1ZiOuMH2FQym1UOVagIZT9rLluzkdaOHR812ckyXeRkf8mfaJMo0TTiGOmfuY-CPatzAS7wjTGMbX4hozOd4uhlImO6cwLbzH_J7P2InHj0zn_12x_vamb--L7unuod12BUIFhQSnoZKNV14LK3WtLCkrBBBgbQ05U5mmbkBZCVY6K7R0lTTYKNLqZ9SKXf7dHmHDPoVPTMvwCxyOQPUNGoBLmw</recordid><startdate>20240315</startdate><enddate>20240315</enddate><creator>Casey, Beatrice</creator><creator>Santos, Joanna C. S</creator><creator>Perry, George</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240315</creationdate><title>A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks</title><author>Casey, Beatrice ; Santos, Joanna C. 
S ; Perry, George</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a676-26d46729f3f40c2483ce3c006e6a8cbedb7b98963c26c2dc042d72ba93e4393e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Cryptography and Security</topic><topic>Computer Science - Learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Casey, Beatrice</creatorcontrib><creatorcontrib>Santos, Joanna C. S</creatorcontrib><creatorcontrib>Perry, George</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Casey, Beatrice</au><au>Santos, Joanna C. S</au><au>Perry, George</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks</atitle><date>2024-03-15</date><risdate>2024</risdate><abstract>Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what's not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. 
We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.</abstract><doi>10.48550/arxiv.2403.10646</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2403.10646
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2403_10646
source	arXiv.org
subjects	Computer Science - Cryptography and Security ; Computer Science - Learning
title	A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T12%3A12%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Survey%20of%20Source%20Code%20Representations%20for%20Machine%20Learning-Based%20Cybersecurity%20Tasks&rft.au=Casey,%20Beatrice&rft.date=2024-03-15&rft_id=info:doi/10.48550/arxiv.2403.10646&rft_dat=%3Carxiv_GOX%3E2403_10646%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true