Evaluating few shot and Contrastive learning Methods for Code Clone Detection
Context: Code Clone Detection (CCD) is a software engineering task used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of approximately 95% on the CodeXGLUE benchmark. These models...
Published in: | arXiv.org 2023-11 |
---|---|
Main authors: | Khajezade, Mohamad; Fard, Fatemeh Hendijani; Shehata, Mohamed S |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Khajezade, Mohamad; Fard, Fatemeh Hendijani; Shehata, Mohamed S |
description | Context: Code Clone Detection (CCD) is a software engineering task used for plagiarism detection, code search, and code comprehension. Recently, deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of approximately 95% on the CodeXGLUE benchmark. These models require large amounts of training data and are mainly fine-tuned on Java or C++ datasets. However, no previous study has evaluated how well these models generalize when only a limited amount of annotated data is available. Objective: The main objective of this research is to assess the ability of CCD models, as well as few-shot learning algorithms, to handle unseen programming problems and new languages (i.e., problems and languages the model was not trained on). Method: We assess the generalizability of state-of-the-art CCD models in few-shot settings (i.e., only a few samples are available for fine-tuning) under three scenarios: i) unseen problems, ii) unseen languages, and iii) a combination of new languages and new problems. We use three datasets (BigCloneBench, POJ-104, and CodeNet) and three languages (Java, C++, and Ruby). Then, we employ Model-Agnostic Meta-Learning (MAML), in which the model learns a meta-learner capable of extracting transferable knowledge from the training set, so that it can be fine-tuned using only a few samples. Finally, we combine contrastive learning with MAML to study whether it further improves the results of MAML. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2651903716</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2651903716</sourcerecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2651903716</pqid></control><display><type>article</type><title>Evaluating few shot and Contrastive learning Methods for Code Clone Detection</title><source>Free E- Journals</source><creator>Khajezade, Mohamad ; Fard, Fatemeh Hendijani ; Shehata, Mohamed S</creator><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; C++ (programming language) ; Cloning ; Datasets ; Deep learning ; Evaluation ; Languages ; Machine learning ; Software engineering</subject><ispartof>arXiv.org, 2023-11</ispartof><rights>2023. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa></display><addata><au>Khajezade, Mohamad</au><au>Fard, Fatemeh Hendijani</au><au>Shehata, Mohamed S</au><atitle>Evaluating few shot and Contrastive learning Methods for Code Clone Detection</atitle><jtitle>arXiv.org</jtitle><date>2023-11-09</date><eissn>2331-8422</eissn><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2651903716 |
source | Free E- Journals |
subjects | Algorithms; C++ (programming language); Cloning; Datasets; Deep learning; Evaluation; Languages; Machine learning; Software engineering |
title | Evaluating few shot and Contrastive learning Methods for Code Clone Detection |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T15%3A57%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Evaluating%20few%20shot%20and%20Contrastive%20learning%20Methods%20for%20Code%20Clone%20Detection&rft.jtitle=arXiv.org&rft.au=Khajezade,%20Mohamad&rft.date=2023-11-09&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2651903716%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2651903716&rft_id=info:pmid/&rfr_iscdi=true |