Multi-Label Code Error Classification Using CodeT5 and ML-KNN

Programming is an essential skill in computer science and in a wide range of engineering-related disciplines. However, occurring errors, often referred to as "bugs" in code, can indeed be challenging to identify and rectify, both for students who are learning to program and for experienced...

Bibliographic details
Published in:	IEEE access 2024, Vol. 12, p. 100805-100820
Main authors:	Amin, Md. Faizul Ibne; Shirafuji, Atsushi; Rahman, Md. Mostafizer; Watanobe, Yutaka
Format:	Article
Language:	eng
Subjects:	Accuracy; Big Data; Codes; CodeT5; data analysis; Education; educational big data; Error analysis; error classification; learning analytics; Learning systems; Machine learning; ML-KNN; multi-label classification; programming learning; Programming profession; Software engineering; Source coding; Task analysis
Online access:	Full text

container_end_page	100820
container_issue
container_start_page	100805
container_title	IEEE access
container_volume	12
creator	Amin, Md. Faizul Ibne; Shirafuji, Atsushi; Rahman, Md. Mostafizer; Watanobe, Yutaka
description	Programming is an essential skill in computer science and in a wide range of engineering-related disciplines. However, occurring errors, often referred to as "bugs" in code, can indeed be challenging to identify and rectify, both for students who are learning to program and for experienced professionals. These errors can lead to unexpected behaviors in programming. Understanding, finding, and effectively dealing with errors is an integral part of programming learning as well as software development. To classify the errors, we propose a multi-label error classification of source code for dealing with programming data by using the ML-KNN classifier with CodeT5 embeddings. In addition, several deep neural network (DNN) models, including GRU, LSTM, BiLSTM, and BiLSTM-A (attention mechanism) are also employed as baseline models to classify the errors. We trained all the models by using a large-scale dataset (original error labels) as well as modified datasets (summarized error labels) of the source code. The average classification accuracy of the proposed model is 95.91% and 84.77% for the original and summarized error-labeled datasets, respectively. The exact match accuracy is 22.57% and 27.22% respectively for the original and summarized error-labeled datasets. The comprehensive experimental results of the proposed approach are promising for multi-label error classification over the baseline models. Moreover, the findings derived from the proposed approach and data-driven analytical results hold significant promise for error classification, programming education, and related research endeavors.
doi_str_mv	10.1109/ACCESS.2024.3430558
format	Article
fullrecord	<record><control><sourceid>doaj_ieee_</sourceid><recordid>TN_cdi_ieee_primary_10602509</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10602509</ieee_id><doaj_id>oai_doaj_org_article_b202799a45b14c6fa65860fc7acf2a25</doaj_id><sourcerecordid>oai_doaj_org_article_b202799a45b14c6fa65860fc7acf2a25</sourcerecordid><originalsourceid>FETCH-LOGICAL-c261t-9bf4989264214bd431f709b1127501c1b4ec0ded3175328de3a02ce2c15cc0b83</originalsourceid><addsrcrecordid>eNpNkM1OAjEURhujiQR5Al3MCwz29m-mCxdkgkoccAGsm7bTkpKRMS0ufHsHhhju5t58N99ZHIQeAU8BsHyeVdV8vZ4STNiUMoo5L2_QiICQOeVU3F7d92iS0h73U_YRL0boZfnTHkNea-ParOoal81j7GJWtTql4IPVx9Adsm0Kh935v-GZPjTZss4_VqsHdOd1m9zkssdo-zrfVO95_fm2qGZ1bomAYy6NZ7KURDACzDSMgi-wNACk4BgsGOYsblxDoeCUlI2jGhPriAVuLTYlHaPFwG06vVffMXzp-Ks6HdQ56OJO6XgMtnXK9B4KKTXjBpgVXgteCuxtoa0nmvCeRQeWjV1K0fl_HmB1EqoGoeokVF2E9q2noRWcc1cNgQnHkv4BJeNu6w</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Multi-Label Code Error Classification Using CodeT5 and ML-KNN</title><source>IEEE Open Access Journals</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><creator>Amin, Md. Faizul Ibne ; Shirafuji, Atsushi ; Rahman, Md. Mostafizer ; Watanobe, Yutaka</creator><creatorcontrib>Amin, Md. Faizul Ibne ; Shirafuji, Atsushi ; Rahman, Md. Mostafizer ; Watanobe, Yutaka</creatorcontrib><description>Programming is an essential skill in computer science and in a wide range of engineering-related disciplines. However, occurring errors, often referred to as "bugs" in code, can indeed be challenging to identify and rectify, both for students who are learning to program and for experienced professionals. These errors can lead to unexpected behaviors in programming. 
Understanding, finding, and effectively dealing with errors is an integral part of programming learning as well as software development. To classify the errors, we propose a multi-label error classification of source code for dealing with programming data by using the ML-KNN classifier with CodeT5 embeddings. In addition, several deep neural network (DNN) models, including GRU, LSTM, BiLSTM, and BiLSTM-A (attention mechanism) are also employed as baseline models to classify the errors. We trained all the models by using a large-scale dataset (original error labels) as well as modified datasets (summarized error labels) of the source code. The average classification accuracy of the proposed model is 95.91% and 84.77% for the original and summarized error-labeled datasets, respectively. The exact match accuracy is 22.57% and 27.22% respectively for the original and summarized error-labeled datasets. The comprehensive experimental results of the proposed approach are promising for multi-label error classification over the baseline models. 
Moreover, the findings derived from the proposed approach and data-driven analytical results hold significant promise for error classification, programming education, and related research endeavors.</description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2024.3430558</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>IEEE</publisher><subject>Accuracy ; Big Data ; Codes ; CodeT5 ; data analysis ; Education ; educational big data ; Error analysis ; error classification ; learning analytics ; Learning systems ; Machine learning ; ML-KNN ; multi-label classification ; programming learning ; Programming profession ; Software engineering ; Source coding ; Task analysis</subject><ispartof>IEEE access, 2024, Vol.12, p.100805-100820</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c261t-9bf4989264214bd431f709b1127501c1b4ec0ded3175328de3a02ce2c15cc0b83</cites><orcidid>0000-0002-0030-3859 ; 0000-0001-9890-4806 ; 0009-0001-0722-3536 ; 0000-0001-9368-7638</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10602509$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,777,781,861,2096,4010,27614,27904,27905,27906,54914</link.rule.ids></links><search><creatorcontrib>Amin, Md. Faizul Ibne</creatorcontrib><creatorcontrib>Shirafuji, Atsushi</creatorcontrib><creatorcontrib>Rahman, Md. Mostafizer</creatorcontrib><creatorcontrib>Watanobe, Yutaka</creatorcontrib><title>Multi-Label Code Error Classification Using CodeT5 and ML-KNN</title><title>IEEE access</title><addtitle>Access</addtitle><description>Programming is an essential skill in computer science and in a wide range of engineering-related disciplines. 
However, occurring errors, often referred to as "bugs" in code, can indeed be challenging to identify and rectify, both for students who are learning to program and for experienced professionals. These errors can lead to unexpected behaviors in programming. Understanding, finding, and effectively dealing with errors is an integral part of programming learning as well as software development. To classify the errors, we propose a multi-label error classification of source code for dealing with programming data by using the ML-KNN classifier with CodeT5 embeddings. In addition, several deep neural network (DNN) models, including GRU, LSTM, BiLSTM, and BiLSTM-A (attention mechanism) are also employed as baseline models to classify the errors. We trained all the models by using a large-scale dataset (original error labels) as well as modified datasets (summarized error labels) of the source code. The average classification accuracy of the proposed model is 95.91% and 84.77% for the original and summarized error-labeled datasets, respectively. The exact match accuracy is 22.57% and 27.22% respectively for the original and summarized error-labeled datasets. The comprehensive experimental results of the proposed approach are promising for multi-label error classification over the baseline models. 
Moreover, the findings derived from the proposed approach and data-driven analytical results hold significant promise for error classification, programming education, and related research endeavors.</description><subject>Accuracy</subject><subject>Big Data</subject><subject>Codes</subject><subject>CodeT5</subject><subject>data analysis</subject><subject>Education</subject><subject>educational big data</subject><subject>Error analysis</subject><subject>error classification</subject><subject>learning analytics</subject><subject>Learning systems</subject><subject>Machine learning</subject><subject>ML-KNN</subject><subject>multi-label classification</subject><subject>programming learning</subject><subject>Programming profession</subject><subject>Software engineering</subject><subject>Source coding</subject><subject>Task analysis</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>RIE</sourceid><sourceid>DOA</sourceid><recordid>eNpNkM1OAjEURhujiQR5Al3MCwz29m-mCxdkgkoccAGsm7bTkpKRMS0ufHsHhhju5t58N99ZHIQeAU8BsHyeVdV8vZ4STNiUMoo5L2_QiICQOeVU3F7d92iS0h73U_YRL0boZfnTHkNea-ParOoal81j7GJWtTql4IPVx9Adsm0Kh935v-GZPjTZss4_VqsHdOd1m9zkssdo-zrfVO95_fm2qGZ1bomAYy6NZ7KURDACzDSMgi-wNACk4BgsGOYsblxDoeCUlI2jGhPriAVuLTYlHaPFwG06vVffMXzp-Ks6HdQ56OJO6XgMtnXK9B4KKTXjBpgVXgteCuxtoa0nmvCeRQeWjV1K0fl_HmB1EqoGoeokVF2E9q2noRWcc1cNgQnHkv4BJeNu6w</recordid><startdate>2024</startdate><enddate>2024</enddate><creator>Amin, Md. Faizul Ibne</creator><creator>Shirafuji, Atsushi</creator><creator>Rahman, Md. 
Mostafizer</creator><creator>Watanobe, Yutaka</creator><general>IEEE</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-0030-3859</orcidid><orcidid>https://orcid.org/0000-0001-9890-4806</orcidid><orcidid>https://orcid.org/0009-0001-0722-3536</orcidid><orcidid>https://orcid.org/0000-0001-9368-7638</orcidid></search><sort><creationdate>2024</creationdate><title>Multi-Label Code Error Classification Using CodeT5 and ML-KNN</title><author>Amin, Md. Faizul Ibne ; Shirafuji, Atsushi ; Rahman, Md. Mostafizer ; Watanobe, Yutaka</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c261t-9bf4989264214bd431f709b1127501c1b4ec0ded3175328de3a02ce2c15cc0b83</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Accuracy</topic><topic>Big Data</topic><topic>Codes</topic><topic>CodeT5</topic><topic>data analysis</topic><topic>Education</topic><topic>educational big data</topic><topic>Error analysis</topic><topic>error classification</topic><topic>learning analytics</topic><topic>Learning systems</topic><topic>Machine learning</topic><topic>ML-KNN</topic><topic>multi-label classification</topic><topic>programming learning</topic><topic>Programming profession</topic><topic>Software engineering</topic><topic>Source coding</topic><topic>Task analysis</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Amin, Md. Faizul Ibne</creatorcontrib><creatorcontrib>Shirafuji, Atsushi</creatorcontrib><creatorcontrib>Rahman, Md. 
Mostafizer</creatorcontrib><creatorcontrib>Watanobe, Yutaka</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Amin, Md. Faizul Ibne</au><au>Shirafuji, Atsushi</au><au>Rahman, Md. Mostafizer</au><au>Watanobe, Yutaka</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Multi-Label Code Error Classification Using CodeT5 and ML-KNN</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2024</date><risdate>2024</risdate><volume>12</volume><spage>100805</spage><epage>100820</epage><pages>100805-100820</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract>Programming is an essential skill in computer science and in a wide range of engineering-related disciplines. However, occurring errors, often referred to as "bugs" in code, can indeed be challenging to identify and rectify, both for students who are learning to program and for experienced professionals. These errors can lead to unexpected behaviors in programming. Understanding, finding, and effectively dealing with errors is an integral part of programming learning as well as software development. To classify the errors, we propose a multi-label error classification of source code for dealing with programming data by using the ML-KNN classifier with CodeT5 embeddings. In addition, several deep neural network (DNN) models, including GRU, LSTM, BiLSTM, and BiLSTM-A (attention mechanism) are also employed as baseline models to classify the errors. 
We trained all the models by using a large-scale dataset (original error labels) as well as modified datasets (summarized error labels) of the source code. The average classification accuracy of the proposed model is 95.91% and 84.77% for the original and summarized error-labeled datasets, respectively. The exact match accuracy is 22.57% and 27.22% respectively for the original and summarized error-labeled datasets. The comprehensive experimental results of the proposed approach are promising for multi-label error classification over the baseline models. Moreover, the findings derived from the proposed approach and data-driven analytical results hold significant promise for error classification, programming education, and related research endeavors.</abstract><pub>IEEE</pub><doi>10.1109/ACCESS.2024.3430558</doi><tpages>16</tpages><orcidid>https://orcid.org/0000-0002-0030-3859</orcidid><orcidid>https://orcid.org/0000-0001-9890-4806</orcidid><orcidid>https://orcid.org/0009-0001-0722-3536</orcidid><orcidid>https://orcid.org/0000-0001-9368-7638</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 2169-3536
ispartof	IEEE access, 2024, Vol.12, p.100805-100820
issn	2169-3536 2169-3536
language	eng
recordid	cdi_ieee_primary_10602509
source	IEEE Open Access Journals; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals
subjects	Accuracy; Big Data; Codes; CodeT5; data analysis; Education; educational big data; Error analysis; error classification; learning analytics; Learning systems; Machine learning; ML-KNN; multi-label classification; programming learning; Programming profession; Software engineering; Source coding; Task analysis
title	Multi-Label Code Error Classification Using CodeT5 and ML-KNN
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-18T13%3A51%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-doaj_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Multi-Label%20Code%20Error%20Classification%20Using%20CodeT5%20and%20ML-KNN&rft.jtitle=IEEE%20access&rft.au=Amin,%20Md.%20Faizul%20Ibne&rft.date=2024&rft.volume=12&rft.spage=100805&rft.epage=100820&rft.pages=100805-100820&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2024.3430558&rft_dat=%3Cdoaj_ieee_%3Eoai_doaj_org_article_b202799a45b14c6fa65860fc7acf2a25%3C/doaj_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10602509&rft_doaj_id=oai_doaj_org_article_b202799a45b14c6fa65860fc7acf2a25&rfr_iscdi=true
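
Note on the reported metrics: the abstract gives an average classification accuracy (95.91% and 84.77%) that is far higher than the exact match accuracy (22.57% and 27.22%). The sketch below shows how such a gap can arise in multi-label classification. It assumes that "average classification accuracy" means the fraction of individual label decisions that are correct, which the record does not confirm; the function names and the toy label matrices are hypothetical and are not taken from the article.

```python
from typing import List

LabelMatrix = List[List[int]]  # one row per sample, one 0/1 entry per label


def exact_match_accuracy(y_true: LabelMatrix, y_pred: LabelMatrix) -> float:
    """Fraction of samples whose whole predicted label set equals the true set."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def per_label_average_accuracy(y_true: LabelMatrix, y_pred: LabelMatrix) -> float:
    """Fraction of individual label decisions that are correct.

    Assumption: this is one plausible reading of "average classification
    accuracy"; the article may define the metric differently.
    """
    total = sum(len(row) for row in y_true)
    correct = sum(
        1
        for t_row, p_row in zip(y_true, y_pred)
        for t, p in zip(t_row, p_row)
        if t == p
    )
    return correct / total


# Hypothetical toy data: 4 samples, 4 error labels each.
Y_TRUE = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0]]
Y_PRED = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0]]

if __name__ == "__main__":
    print("exact match:", exact_match_accuracy(Y_TRUE, Y_PRED))
    print("per-label average:", per_label_average_accuracy(Y_TRUE, Y_PRED))
```

In this toy case only one of four samples is predicted entirely correctly, yet 13 of 16 individual label decisions are right. A single wrong label makes a sample count as a miss for exact match, so the two figures can diverge widely when most labels are absent for most samples.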