Author Identification in Imbalanced Sets of Source Code Samples

Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task given that samples of known authorship by a set of candidate authors are available. Although very promising results have been...

Detailed description

Bibliographic details
Main authors: Chatzicharalampous, E., Frantzeskou, G., Stamatatos, E.
Format: Conference proceedings
Language: eng
Subjects:
Online access: Order full text
container_end_page 797
container_issue
container_start_page 790
container_title
container_volume 1
creator Chatzicharalampous, E.
Frantzeskou, G.
Stamatatos, E.
description Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task given that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, the evaluation of existing approaches avoids focusing on the class imbalance problem and its effect on the performance. In this paper, we present a systematic experimental study of author identification in skewed training sets where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined, one follows the profile-based paradigm (where a single representation is produced for all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle cases of highly skewed training sets while the instance-based method is a better choice in balanced or slightly-skewed training sets.
doi_str_mv 10.1109/ICTAI.2012.112
format Conference Proceeding
fullrecord <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_6495124</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6495124</ieee_id><sourcerecordid>6495124</sourcerecordid><originalsourceid>FETCH-LOGICAL-i175t-796ac86fe1ea8f476fac6e9b919b8e0523ac02309e1e6a2bea666bf4b546f0a33</originalsourceid><addsrcrecordid>eNotjMtKw0AUQMcX2NZu3biZH0i9885dSQhVAwUXretyk9zBkTYpSbrw7y3o6nDgcIR4VLBSCvC5KndFtdKg9MX1lZhD8OgsKqevxUyb4DJQGG7EXNmACFoHfytmCnKdGQt4L5bj-A0ACoyD3M3ES3GevvpBVi13U4qpoSn1nUydrI41HahruJVbnkbZR7ntz0PDsuxblls6ng48Poi7SIeRl_9ciM_X9a58zzYfb1VZbLKkgpuygJ6a3EdWTHm0wUdqPGONCuucwWlDDWgDeAk86ZrJe19HWzvrI5AxC_H0903MvD8N6UjDz95bdEpb8wv-50xY</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Author Identification in Imbalanced Sets of Source Code Samples</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Chatzicharalampous, E. ; Frantzeskou, G. ; Stamatatos, E.</creator><creatorcontrib>Chatzicharalampous, E. ; Frantzeskou, G. ; Stamatatos, E.</creatorcontrib><description>Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task given that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, the evaluation of existing approaches avoids focusing on the class imbalance problem and its effect on the performance. In this paper, we present a systematic experimental study of author identification in skewed training sets where the training samples are unequally distributed over the candidate authors. 
Two representative author identification methods are examined, one follows the profile-based paradigm (where a single representation is produced for all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle cases of highly skewed training sets while the instance-based method is a better choice in balanced or slightly-skewed training sets.</description><identifier>ISSN: 1082-3409</identifier><identifier>ISBN: 1479902276</identifier><identifier>ISBN: 9781479902279</identifier><identifier>EISSN: 2375-0197</identifier><identifier>EISBN: 0769549152</identifier><identifier>EISBN: 9780769549156</identifier><identifier>DOI: 10.1109/ICTAI.2012.112</identifier><identifier>CODEN: IEEPAD</identifier><language>eng</language><publisher>IEEE</publisher><subject>byte-level n-grams ; class imbalance ; Forensics ; Measurement ; Natural languages ; Software ; Source code author identification ; Support vector machines ; Text categorization ; Training</subject><ispartof>2012 IEEE 24th International Conference on Tools with Artificial Intelligence, 2012, Vol.1, p.790-797</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6495124$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2052,27902,54895</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6495124$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Chatzicharalampous, E.</creatorcontrib><creatorcontrib>Frantzeskou, G.</creatorcontrib><creatorcontrib>Stamatatos, 
E.</creatorcontrib><title>Author Identification in Imbalanced Sets of Source Code Samples</title><title>2012 IEEE 24th International Conference on Tools with Artificial Intelligence</title><addtitle>TAI</addtitle><description>Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task given that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, the evaluation of existing approaches avoids focusing on the class imbalance problem and its effect on the performance. In this paper, we present a systematic experimental study of author identification in skewed training sets where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined, one follows the profile-based paradigm (where a single representation is produced for all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). 
We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle cases of highly skewed training sets while the instance-based method is a better choice in balanced or slightly-skewed training sets.</description><subject>byte-level n-grams</subject><subject>class imbalance</subject><subject>Forensics</subject><subject>Measurement</subject><subject>Natural languages</subject><subject>Software</subject><subject>Source code author identification</subject><subject>Support vector machines</subject><subject>Text categorization</subject><subject>Training</subject><issn>1082-3409</issn><issn>2375-0197</issn><isbn>1479902276</isbn><isbn>9781479902279</isbn><isbn>0769549152</isbn><isbn>9780769549156</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2012</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNotjMtKw0AUQMcX2NZu3biZH0i9885dSQhVAwUXretyk9zBkTYpSbrw7y3o6nDgcIR4VLBSCvC5KndFtdKg9MX1lZhD8OgsKqevxUyb4DJQGG7EXNmACFoHfytmCnKdGQt4L5bj-A0ACoyD3M3ES3GevvpBVi13U4qpoSn1nUydrI41HahruJVbnkbZR7ntz0PDsuxblls6ng48Poi7SIeRl_9ciM_X9a58zzYfb1VZbLKkgpuygJ6a3EdWTHm0wUdqPGONCuucwWlDDWgDeAk86ZrJe19HWzvrI5AxC_H0903MvD8N6UjDz95bdEpb8wv-50xY</recordid><startdate>201211</startdate><enddate>201211</enddate><creator>Chatzicharalampous, E.</creator><creator>Frantzeskou, G.</creator><creator>Stamatatos, E.</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>201211</creationdate><title>Author Identification in Imbalanced Sets of Source Code Samples</title><author>Chatzicharalampous, E. ; Frantzeskou, G. 
; Stamatatos, E.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i175t-796ac86fe1ea8f476fac6e9b919b8e0523ac02309e1e6a2bea666bf4b546f0a33</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2012</creationdate><topic>byte-level n-grams</topic><topic>class imbalance</topic><topic>Forensics</topic><topic>Measurement</topic><topic>Natural languages</topic><topic>Software</topic><topic>Source code author identification</topic><topic>Support vector machines</topic><topic>Text categorization</topic><topic>Training</topic><toplevel>online_resources</toplevel><creatorcontrib>Chatzicharalampous, E.</creatorcontrib><creatorcontrib>Frantzeskou, G.</creatorcontrib><creatorcontrib>Stamatatos, E.</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Chatzicharalampous, E.</au><au>Frantzeskou, G.</au><au>Stamatatos, E.</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Author Identification in Imbalanced Sets of Source Code Samples</atitle><btitle>2012 IEEE 24th International Conference on Tools with Artificial Intelligence</btitle><stitle>TAI</stitle><date>2012-11</date><risdate>2012</risdate><volume>1</volume><spage>790</spage><epage>797</epage><pages>790-797</pages><issn>1082-3409</issn><eissn>2375-0197</eissn><isbn>1479902276</isbn><isbn>9781479902279</isbn><eisbn>0769549152</eisbn><eisbn>9780769549156</eisbn><coden>IEEPAD</coden><abstract>Similarly to natural language texts, source code documents can be 
distinguished by their style. Source code author identification can be viewed as a text classification task given that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, the evaluation of existing approaches avoids focusing on the class imbalance problem and its effect on the performance. In this paper, we present a systematic experimental study of author identification in skewed training sets where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined, one follows the profile-based paradigm (where a single representation is produced for all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle cases of highly skewed training sets while the instance-based method is a better choice in balanced or slightly-skewed training sets.</abstract><pub>IEEE</pub><doi>10.1109/ICTAI.2012.112</doi><tpages>8</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1082-3409
ispartof 2012 IEEE 24th International Conference on Tools with Artificial Intelligence, 2012, Vol.1, p.790-797
issn 1082-3409
2375-0197
language eng
recordid cdi_ieee_primary_6495124
source IEEE Electronic Library (IEL) Conference Proceedings
subjects byte-level n-grams
class imbalance
Forensics
Measurement
Natural languages
Software
Source code author identification
Support vector machines
Text categorization
Training
title Author Identification in Imbalanced Sets of Source Code Samples
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T06%3A57%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Author%20Identification%20in%20Imbalanced%20Sets%20of%20Source%20Code%20Samples&rft.btitle=2012%20IEEE%2024th%20International%20Conference%20on%20Tools%20with%20Artificial%20Intelligence&rft.au=Chatzicharalampous,%20E.&rft.date=2012-11&rft.volume=1&rft.spage=790&rft.epage=797&rft.pages=790-797&rft.issn=1082-3409&rft.eissn=2375-0197&rft.isbn=1479902276&rft.isbn_list=9781479902279&rft.coden=IEEPAD&rft_id=info:doi/10.1109/ICTAI.2012.112&rft_dat=%3Cieee_6IE%3E6495124%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=0769549152&rft.eisbn_list=9780769549156&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6495124&rfr_iscdi=true
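
The description field above contrasts a profile-based paradigm (one representation built from all of an author's training samples) with an instance-based paradigm (one representation per training sample). The following is only a minimal sketch of that distinction, not the method evaluated in the paper. It assumes byte-level n-grams as features (taken from the subject list), and it uses the size of the n-gram set overlap as a stand-in similarity and a nearest-sample rule as a stand-in instance-based classifier. The function names and the parameters `n` and `size` are hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count overlapping byte-level n-grams in one source code sample."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def top_ngrams(counts: Counter, size: int) -> set:
    """Keep the `size` most frequent n-grams as a set."""
    return {gram for gram, _ in counts.most_common(size)}


def profile_based_predict(train: dict, sample: bytes, n: int = 3, size: int = 500) -> str:
    """Profile-based: pool all samples of an author into ONE representation."""
    target = top_ngrams(byte_ngrams(sample, n), size)
    best_author, best_score = None, -1
    for author, samples in train.items():
        pooled = Counter()
        for s in samples:
            pooled.update(byte_ngrams(s, n))  # accumulate counts per author
        score = len(target & top_ngrams(pooled, size))
        if score > best_score:
            best_author, best_score = author, score
    return best_author


def instance_based_predict(train: dict, sample: bytes, n: int = 3, size: int = 500) -> str:
    """Instance-based: every training sample keeps its OWN representation.

    Assumption: the author of the single most similar training sample wins.
    """
    target = top_ngrams(byte_ngrams(sample, n), size)
    best_author, best_score = None, -1
    for author, samples in train.items():
        for s in samples:
            score = len(target & top_ngrams(byte_ngrams(s, n), size))
            if score > best_score:
                best_author, best_score = author, score
    return best_author


# Toy, deliberately imbalanced training set: author "A" has three samples, "B" one.
train = {
    "A": [
        b"for (int i = 0; i < n; i++) { sum += a[i]; }",
        b"for (int j = 0; j < m; j++) { total += b[j]; }",
        b"while (k < n) { k++; }",
    ],
    "B": [b"for(int i=0;i<n;i++)\n{\n\tsum+=a[i];\n}"],
}

query_a = b"for (int x = 0; x < len; x++) { acc += v[x]; }"
query_b = b"for(int k=0;k<n;k++)\n{\n\tacc+=v[k];\n}"

print(profile_based_predict(train, query_a), instance_based_predict(train, query_a))
print(profile_based_predict(train, query_b), instance_based_predict(train, query_b))
```

In this sketch the only structural difference between the two functions is where pooling happens: the profile-based variant merges an author's counts before comparison, so an author with a single sample still gets one profile, while the instance-based variant compares against each sample separately. The toy data is far too small to reproduce the paper's findings on skewed training sets.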