Evaluating few shot and Contrastive learning Methods for Code Clone Detection

Context: Code Clone Detection (CCD) is a software engineering task that is used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of ~95% on the CodeXGLUE benchmark. These models...

Bibliographic details
Published in: arXiv.org 2023-11
Main authors: Khajezade, Mohamad; Fard, Fatemeh Hendijani; Shehata, Mohamed S
Format: Article
Language: eng
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Khajezade, Mohamad
Fard, Fatemeh Hendijani
Shehata, Mohamed S
description Context: Code Clone Detection (CCD) is a software engineering task used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of ~95% on the CodeXGLUE benchmark. These models require large amounts of training data and are mainly fine-tuned on Java or C++ datasets. However, no previous study has evaluated how well these models generalize when only a limited amount of annotated data is available. Objective: The main objective of this research is to assess how well CCD models, as well as few-shot learning algorithms, handle unseen programming problems and new languages (i.e., problems and languages the model was not trained on). Method: We assess the generalizability of state-of-the-art CCD models in few-shot settings (i.e., only a few samples are available for fine-tuning) under three scenarios: i) unseen problems, ii) unseen languages, and iii) a combination of new languages and new problems. We use three datasets (BigCloneBench, POJ-104, and CodeNet) and three languages (Java, C++, and Ruby). Then, we employ Model-Agnostic Meta-Learning (MAML), in which the model learns a meta-learner capable of extracting transferable knowledge from the training set, so that the model can be fine-tuned using only a few samples. Finally, we combine contrastive learning with MAML to study whether it further improves the results of MAML.
format Article
publisher Ithaca: Cornell University Library, arXiv.org
date 2023-11-09
rights 2023. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
oa free_for_read
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2023-11
issn 2331-8422
language eng
recordid cdi_proquest_journals_2651903716
source Free E- Journals
subjects Algorithms
C++ (programming language)
Cloning
Datasets
Deep learning
Evaluation
Languages
Machine learning
Software engineering
title Evaluating few shot and Contrastive learning Methods for Code Clone Detection
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T15%3A57%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Evaluating%20few%20shot%20and%20Contrastive%20learning%20Methods%20for%20Code%20Clone%20Detection&rft.jtitle=arXiv.org&rft.au=Khajezade,%20Mohamad&rft.date=2023-11-09&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2651903716%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2651903716&rft_id=info:pmid/&rfr_iscdi=true