Evaluating few shot and Contrastive learning Methods for Code Clone Detection

Context: Code Clone Detection (CCD) is a software engineering task used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of about 95% on the CodeXGLUE benchmark. These models...
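
As a small illustration of the F1 metric mentioned in the abstract, the following sketch computes F1 for a binary clone/non-clone classifier. The function name `f1_score` and the toy labels are hypothetical and chosen only for illustration; they are not taken from the paper or from the CodeXGLUE benchmark.

```python
def f1_score(y_true, y_pred):
    """Compute F1 for binary labels, where 1 = clone pair and 0 = non-clone pair."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # clones correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-clones flagged as clones
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # clones that were missed
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): six code pairs, four of which are true clones.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

In this toy case there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75.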

Bibliographic Details
Published in:	arXiv.org 2023-11
Main authors:	Khajezade, Mohamad; Fard, Fatemeh Hendijani; Shehata, Mohamed S
Format:	Article
Language:	eng
Subjects:	Algorithms; C++ (programming language); Cloning; Datasets; Deep learning; Evaluation; Languages; Machine learning; Software engineering
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Khajezade, Mohamad; Fard, Fatemeh Hendijani; Shehata, Mohamed S
description	Context: Code Clone Detection (CCD) is a software engineering task used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of about 95% on the CodeXGLUE benchmark. These models require large amounts of training data and are mainly fine-tuned on Java or C++ datasets. However, no previous study evaluates the generalizability of these models when only a limited amount of annotated data is available. Objective: The main objective of this research is to assess how well CCD models and few-shot learning algorithms handle unseen programming problems and new languages (i.e., problems and languages the model was not trained on). Method: We assess the generalizability of state-of-the-art CCD models in few-shot settings (i.e., only a few samples are available for fine-tuning) under three scenarios: i) unseen problems, ii) unseen languages, and iii) a combination of new languages and new problems. We use three datasets (BigCloneBench, POJ-104, and CodeNet) and three languages (Java, C++, and Ruby). Then, we employ Model-Agnostic Meta-Learning (MAML), in which the model learns a meta-learner capable of extracting transferable knowledge from the training set, so that it can be fine-tuned using only a few samples. Finally, we combine contrastive learning with MAML to study whether it can further improve the results of MAML.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2651903716</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2651903716</sourcerecordid><originalsourceid>FETCH-proquest_journals_26519037163</originalsourceid><addsrcrecordid>eNqNysEKgkAQgOElCJLyHQY6C-tuap3N6OKtuyw5prLs1O5or59BD9DpP_zfSkRK6zQ5HpTaiDiEUUqp8kJlmY5EXc3GToYH94AO3xB6YjCuhZIcexN4mBEsGu--okbuqQ3QkV9Ai1BacghnZLzzQG4n1p2xAeNft2J_qW7lNXl6ek0YuBlp8m5Zjcqz9CR1keb6P_UBheI9rQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2651903716</pqid></control><display><type>article</type><title>Evaluating few shot and Contrastive learning Methods for Code Clone Detection</title><source>Free E- Journals</source><creator>Khajezade, Mohamad ; Fard, Fatemeh Hendijani ; Shehata, Mohamed S</creator><creatorcontrib>Khajezade, Mohamad ; Fard, Fatemeh Hendijani ; Shehata, Mohamed S</creatorcontrib><description>Context: Code Clone Detection (CCD) is a software engineering task that is used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of $\sim$95\% on the CodeXGLUE benchmark. These models require many training data, mainly fine-tuned on Java or C++ datasets. However, no previous study evaluates the generalizability of these models where a limited amount of annotated data is available. Objective: The main objective of this research is to assess the ability of the CCD models as well as few shot learning algorithms for unseen programming problems and new languages (i.e., the model is not trained on these problems/languages). 
Method: We assess the generalizability of the state of the art models for CCD in few shot settings (i.e., only a few samples are available for fine-tuning) by setting three scenarios: i) unseen problems, ii) unseen languages, iii) combination of new languages and new problems. We choose three datasets of BigCloneBench, POJ-104, and CodeNet and Java, C++, and Ruby languages. Then, we employ Model Agnostic Meta-learning (MAML), where the model learns a meta-learner capable of extracting transferable knowledge from the train set; so that the model can be fine-tuned using a few samples. Finally, we combine contrastive learning with MAML to further study whether it can improve the results of MAML.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; C++ (programming language) ; Cloning ; Datasets ; Deep learning ; Evaluation ; Languages ; Machine learning ; Software engineering</subject><ispartof>arXiv.org, 2023-11</ispartof><rights>2023. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Khajezade, Mohamad</creatorcontrib><creatorcontrib>Fard, Fatemeh Hendijani</creatorcontrib><creatorcontrib>Shehata, Mohamed S</creatorcontrib><title>Evaluating few shot and Contrastive learning Methods for Code Clone Detection</title><title>arXiv.org</title><description>Context: Code Clone Detection (CCD) is a software engineering task that is used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of $\sim$95\% on the CodeXGLUE benchmark. These models require many training data, mainly fine-tuned on Java or C++ datasets. However, no previous study evaluates the generalizability of these models where a limited amount of annotated data is available. Objective: The main objective of this research is to assess the ability of the CCD models as well as few shot learning algorithms for unseen programming problems and new languages (i.e., the model is not trained on these problems/languages). Method: We assess the generalizability of the state of the art models for CCD in few shot settings (i.e., only a few samples are available for fine-tuning) by setting three scenarios: i) unseen problems, ii) unseen languages, iii) combination of new languages and new problems. We choose three datasets of BigCloneBench, POJ-104, and CodeNet and Java, C++, and Ruby languages. 
Then, we employ Model Agnostic Meta-learning (MAML), where the model learns a meta-learner capable of extracting transferable knowledge from the train set; so that the model can be fine-tuned using a few samples. Finally, we combine contrastive learning with MAML to further study whether it can improve the results of MAML.</description><subject>Algorithms</subject><subject>C++ (programming language)</subject><subject>Cloning</subject><subject>Datasets</subject><subject>Deep learning</subject><subject>Evaluation</subject><subject>Languages</subject><subject>Machine learning</subject><subject>Software engineering</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNqNysEKgkAQgOElCJLyHQY6C-tuap3N6OKtuyw5prLs1O5or59BD9DpP_zfSkRK6zQ5HpTaiDiEUUqp8kJlmY5EXc3GToYH94AO3xB6YjCuhZIcexN4mBEsGu--okbuqQ3QkV9Ai1BacghnZLzzQG4n1p2xAeNft2J_qW7lNXl6ek0YuBlp8m5Zjcqz9CR1keb6P_UBheI9rQ</recordid><startdate>20231109</startdate><enddate>20231109</enddate><creator>Khajezade, Mohamad</creator><creator>Fard, Fatemeh Hendijani</creator><creator>Shehata, Mohamed S</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20231109</creationdate><title>Evaluating few shot and Contrastive learning Methods for Code Clone Detection</title><author>Khajezade, Mohamad ; Fard, Fatemeh Hendijani ; Shehata, Mohamed 
S</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_26519037163</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Algorithms</topic><topic>C++ (programming language)</topic><topic>Cloning</topic><topic>Datasets</topic><topic>Deep learning</topic><topic>Evaluation</topic><topic>Languages</topic><topic>Machine learning</topic><topic>Software engineering</topic><toplevel>online_resources</toplevel><creatorcontrib>Khajezade, Mohamad</creatorcontrib><creatorcontrib>Fard, Fatemeh Hendijani</creatorcontrib><creatorcontrib>Shehata, Mohamed S</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Khajezade, Mohamad</au><au>Fard, Fatemeh Hendijani</au><au>Shehata, Mohamed S</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Evaluating few shot and 
Contrastive learning Methods for Code Clone Detection</atitle><jtitle>arXiv.org</jtitle><date>2023-11-09</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>Context: Code Clone Detection (CCD) is a software engineering task that is used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of $\sim$95\% on the CodeXGLUE benchmark. These models require many training data, mainly fine-tuned on Java or C++ datasets. However, no previous study evaluates the generalizability of these models where a limited amount of annotated data is available. Objective: The main objective of this research is to assess the ability of the CCD models as well as few shot learning algorithms for unseen programming problems and new languages (i.e., the model is not trained on these problems/languages). Method: We assess the generalizability of the state of the art models for CCD in few shot settings (i.e., only a few samples are available for fine-tuning) by setting three scenarios: i) unseen problems, ii) unseen languages, iii) combination of new languages and new problems. We choose three datasets of BigCloneBench, POJ-104, and CodeNet and Java, C++, and Ruby languages. Then, we employ Model Agnostic Meta-learning (MAML), where the model learns a meta-learner capable of extracting transferable knowledge from the train set; so that the model can be fine-tuned using a few samples. Finally, we combine contrastive learning with MAML to further study whether it can improve the results of MAML.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2023-11
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2651903716
source	Free E- Journals
subjects	Algorithms; C++ (programming language); Cloning; Datasets; Deep learning; Evaluation; Languages; Machine learning; Software engineering
title	Evaluating few shot and Contrastive learning Methods for Code Clone Detection
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T15%3A57%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Evaluating%20few%20shot%20and%20Contrastive%20learning%20Methods%20for%20Code%20Clone%20Detection&rft.jtitle=arXiv.org&rft.au=Khajezade,%20Mohamad&rft.date=2023-11-09&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2651903716%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2651903716&rft_id=info:pmid/&rfr_iscdi=true