Author Identification in Imbalanced Sets of Source Code Samples

Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task given that samples of known authorship by a set of candidate authors are available. Although very promising results have been...

Bibliographic Details
Main authors:	Chatzicharalampous, E. ; Frantzeskou, G. ; Stamatatos, E.
Format:	Conference proceedings
Language:	eng
Subjects:	byte-level n-grams ; class imbalance ; Forensics ; Measurement ; Natural languages ; Software ; Source code author identification ; Support vector machines ; Text categorization ; Training

container_end_page	797
container_issue
container_start_page	790
container_title
container_volume	1
creator	Chatzicharalampous, E. ; Frantzeskou, G. ; Stamatatos, E.
description	Like natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task, provided that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, evaluations of existing approaches have not focused on the class imbalance problem and its effect on performance. In this paper, we present a systematic experimental study of author identification in skewed training sets, where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined: one follows the profile-based paradigm (where a single representation is produced from all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle highly skewed training sets, while the instance-based method is a better choice for balanced or slightly skewed training sets.
doi_str_mv	10.1109/ICTAI.2012.112
format	Conference Proceeding
isbn	1479902276 9781479902279
eisbn	0769549152 9780769549156
coden	IEEPAD
publisher	IEEE
creationdate	2012-11
tpages	8
fulltext	fulltext_linktorsrc
identifier	ISSN: 1082-3409
ispartof	2012 IEEE 24th International Conference on Tools with Artificial Intelligence, 2012, Vol.1, p.790-797
issn	1082-3409 2375-0197
language	eng
recordid	cdi_ieee_primary_6495124
source	IEEE Electronic Library (IEL) Conference Proceedings
subjects	byte-level n-grams ; class imbalance ; Forensics ; Measurement ; Natural languages ; Software ; Source code author identification ; Support vector machines ; Text categorization ; Training
title	Author Identification in Imbalanced Sets of Source Code Samples
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T06%3A57%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Author%20Identification%20in%20Imbalanced%20Sets%20of%20Source%20Code%20Samples&rft.btitle=2012%20IEEE%2024th%20International%20Conference%20on%20Tools%20with%20Artificial%20Intelligence&rft.au=Chatzicharalampous,%20E.&rft.date=2012-11&rft.volume=1&rft.spage=790&rft.epage=797&rft.pages=790-797&rft.issn=1082-3409&rft.eissn=2375-0197&rft.isbn=1479902276&rft.isbn_list=9781479902279&rft.coden=IEEPAD&rft_id=info:doi/10.1109/ICTAI.2012.112&rft_dat=%3Cieee_6IE%3E6495124%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=0769549152&rft.eisbn_list=9780769549156&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6495124&rfr_iscdi=true