A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks

Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of the...

Bibliographic details
Main authors: Casey, Beatrice, Santos, Joanna C. S, Perry, George
Format: Article
Language: eng
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Casey, Beatrice
Santos, Joanna C. S
Perry, George
description Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what's not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.
doi_str_mv 10.48550/arxiv.2403.10646
format Article
fullrecord sourcetype: Open Access Repository; recordtype: article; creationdate: 2024-03-15; rights: http://creativecommons.org/licenses/by/4.0; oa: free_for_read; collection: arXiv Computer Science, arXiv.org
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2403.10646
ispartof
issn
language eng
recordid cdi_arxiv_primary_2403_10646
source arXiv.org
subjects Computer Science - Cryptography and Security
Computer Science - Learning
title A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T12%3A12%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Survey%20of%20Source%20Code%20Representations%20for%20Machine%20Learning-Based%20Cybersecurity%20Tasks&rft.au=Casey,%20Beatrice&rft.date=2024-03-15&rft_id=info:doi/10.48550/arxiv.2403.10646&rft_dat=%3Carxiv_GOX%3E2403_10646%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true