NER- RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages

Nowadays, Natural Language Processing (NLP) is an important tool for most people's daily life routines, ranging from understanding speech, translation, named entity recognition (NER), and text categorization, to generative text models such as ChatGPT. Due to the existence of big data and conseq...

Detailed description


Bibliographic details
Main authors:	Abdullah, Abdulhady Abas; Abdulla, Srwa Hasan; Toufiq, Dalia Mohammad; Maghdid, Halgurd S; Rashid, Tarik A; Farho, Pakshan F; Sabr, Shadan Sh; Taher, Akar H; Hamad, Darya S; Veisi, Hadi; Asaad, Aras T
Format:	Article
Language:	eng
Subjects:	Computer Science - Artificial Intelligence; Computer Science - Computation and Language
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Abdullah, Abdulhady Abas ; Abdulla, Srwa Hasan ; Toufiq, Dalia Mohammad ; Maghdid, Halgurd S ; Rashid, Tarik A ; Farho, Pakshan F ; Sabr, Shadan Sh ; Taher, Akar H ; Hamad, Darya S ; Veisi, Hadi ; Asaad, Aras T
description	Nowadays, Natural Language Processing (NLP) is an important tool for most people's daily life routines, ranging from understanding speech, translation, named entity recognition (NER), and text categorization, to generative text models such as ChatGPT. Due to the existence of big data and consequently large corpora for widely used languages like English, Spanish, Turkish, Persian, and many more, these applications have been developed accurately. However, the Kurdish language still requires more corpora and large datasets to be included in NLP applications. This is because Kurdish has a rich linguistic structure, varied dialects, and a limited dataset, which poses unique challenges for Kurdish NLP (KNLP) application development. While several studies have been conducted in KNLP for various applications, Kurdish NER (KNER) remains a challenge for many KNLP tasks, including text analysis and classification. In this work, we address this limitation by proposing a methodology for fine-tuning the pre-trained RoBERTa model for KNER. To this end, we first create a Kurdish corpus, followed by designing a modified model architecture and implementing the training procedures. To evaluate the trained model, a set of experiments is conducted to demonstrate the performance of the KNER model using different tokenization methods and trained models. The experimental results show that fine-tuned RoBERTa with the SentencePiece tokenization method substantially improves KNER performance, achieving a 12.8% improvement in F1-score compared to traditional models, and consequently establishes a new benchmark for KNLP.
doi_str_mv	10.48550/arxiv.2412.15252
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2412_15252</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2412_15252</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2412_152523</originalsourceid><addsrcrecordid>eNqFzrEOgjAYBOAuDkZ9ACf_UYciVJoYR02JEwNhJxVL_RNoTSkib68SnZ0uudwlHyHLKAziPefhVronPgIWRyyIOONsSi6pyChk9iiyXB4gQaNo3hk0-ldCZR2kslFXEMajHyBTpdUGPVoD6_d_Az36GxqobU-dam3nSgW1NLqTWrVzMqlk3arFN2dklYj8dKajprg7bKQbio-qGFW7_4sXb-hA3Q</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>NER- RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages</title><source>arXiv.org</source><creator>Abdullah, Abdulhady Abas ; Abdulla, Srwa Hasan ; Toufiq, Dalia Mohammad ; Maghdid, Halgurd S ; Rashid, Tarik A ; Farho, Pakshan F ; Sabr, Shadan Sh ; Taher, Akar H ; Hamad, Darya S ; Veisi, Hadi ; Asaad, Aras T</creator><creatorcontrib>Abdullah, Abdulhady Abas ; Abdulla, Srwa Hasan ; Toufiq, Dalia Mohammad ; Maghdid, Halgurd S ; Rashid, Tarik A ; Farho, Pakshan F ; Sabr, Shadan Sh ; Taher, Akar H ; Hamad, Darya S ; Veisi, Hadi ; Asaad, Aras T</creatorcontrib><description>Nowadays, Natural Language Processing (NLP) is an important tool for most people's daily life routines, ranging from understanding speech, translation, named entity recognition (NER), and text categorization, to generative text models such as ChatGPT. Due to the existence of big data and consequently large corpora for widely used languages like English, Spanish, Turkish, Persian, and many more, these applications have been developed accurately. However, the Kurdish language still requires more corpora and large datasets to be included in NLP applications. 
This is because Kurdish has a rich linguistic structure, varied dialects, and a limited dataset, which poses unique challenges for Kurdish NLP (KNLP) application development. While several studies have been conducted in KNLP for various applications, Kurdish NER (KNER) remains a challenge for many KNLP tasks, including text analysis and classification. In this work, we address this limitation by proposing a methodology for fine-tuning the pre-trained RoBERTa model for KNER. To this end, we first create a Kurdish corpus, followed by designing a modified model architecture and implementing the training procedures. To evaluate the trained model, a set of experiments is conducted to demonstrate the performance of the KNER model using different tokenization methods and trained models. The experimental results show that fine-tuned RoBERTa with the SentencePiece tokenization method substantially improves KNER performance, achieving a 12.8% improvement in F1-score compared to traditional models, and consequently establishes a new benchmark for KNLP.</description><identifier>DOI: 10.48550/arxiv.2412.15252</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computation and Language</subject><creationdate>2024-12</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2412.15252$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2412.15252$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Abdullah, Abdulhady Abas</creatorcontrib><creatorcontrib>Abdulla, Srwa 
Hasan</creatorcontrib><creatorcontrib>Toufiq, Dalia Mohammad</creatorcontrib><creatorcontrib>Maghdid, Halgurd S</creatorcontrib><creatorcontrib>Rashid, Tarik A</creatorcontrib><creatorcontrib>Farho, Pakshan F</creatorcontrib><creatorcontrib>Sabr, Shadan Sh</creatorcontrib><creatorcontrib>Taher, Akar H</creatorcontrib><creatorcontrib>Hamad, Darya S</creatorcontrib><creatorcontrib>Veisi, Hadi</creatorcontrib><creatorcontrib>Asaad, Aras T</creatorcontrib><title>NER- RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages</title><description>Nowadays, Natural Language Processing (NLP) is an important tool for most people's daily life routines, ranging from understanding speech, translation, named entity recognition (NER), and text categorization, to generative text models such as ChatGPT. Due to the existence of big data and consequently large corpora for widely used languages like English, Spanish, Turkish, Persian, and many more, these applications have been developed accurately. However, the Kurdish language still requires more corpora and large datasets to be included in NLP applications. This is because Kurdish has a rich linguistic structure, varied dialects, and a limited dataset, which poses unique challenges for Kurdish NLP (KNLP) application development. While several studies have been conducted in KNLP for various applications, Kurdish NER (KNER) remains a challenge for many KNLP tasks, including text analysis and classification. In this work, we address this limitation by proposing a methodology for fine-tuning the pre-trained RoBERTa model for KNER. To this end, we first create a Kurdish corpus, followed by designing a modified model architecture and implementing the training procedures. To evaluate the trained model, a set of experiments is conducted to demonstrate the performance of the KNER model using different tokenization methods and trained models. 
The experimental results show that fine-tuned RoBERTa with the SentencePiece tokenization method substantially improves KNER performance, achieving a 12.8% improvement in F1-score compared to traditional models, and consequently establishes a new benchmark for KNLP.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computation and Language</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNqFzrEOgjAYBOAuDkZ9ACf_UYciVJoYR02JEwNhJxVL_RNoTSkib68SnZ0uudwlHyHLKAziPefhVronPgIWRyyIOONsSi6pyChk9iiyXB4gQaNo3hk0-ldCZR2kslFXEMajHyBTpdUGPVoD6_d_Az36GxqobU-dam3nSgW1NLqTWrVzMqlk3arFN2dklYj8dKajprg7bKQbio-qGFW7_4sXb-hA3Q</recordid><startdate>20241215</startdate><enddate>20241215</enddate><creator>Abdullah, Abdulhady Abas</creator><creator>Abdulla, Srwa Hasan</creator><creator>Toufiq, Dalia Mohammad</creator><creator>Maghdid, Halgurd S</creator><creator>Rashid, Tarik A</creator><creator>Farho, Pakshan F</creator><creator>Sabr, Shadan Sh</creator><creator>Taher, Akar H</creator><creator>Hamad, Darya S</creator><creator>Veisi, Hadi</creator><creator>Asaad, Aras T</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20241215</creationdate><title>NER- RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages</title><author>Abdullah, Abdulhady Abas ; Abdulla, Srwa Hasan ; Toufiq, Dalia Mohammad ; Maghdid, Halgurd S ; Rashid, Tarik A ; Farho, Pakshan F ; Sabr, Shadan Sh ; Taher, Akar H ; Hamad, Darya S ; Veisi, Hadi ; Asaad, Aras T</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-arxiv_primary_2412_152523</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Computation and 
Language</topic><toplevel>online_resources</toplevel><creatorcontrib>Abdullah, Abdulhady Abas</creatorcontrib><creatorcontrib>Abdulla, Srwa Hasan</creatorcontrib><creatorcontrib>Toufiq, Dalia Mohammad</creatorcontrib><creatorcontrib>Maghdid, Halgurd S</creatorcontrib><creatorcontrib>Rashid, Tarik A</creatorcontrib><creatorcontrib>Farho, Pakshan F</creatorcontrib><creatorcontrib>Sabr, Shadan Sh</creatorcontrib><creatorcontrib>Taher, Akar H</creatorcontrib><creatorcontrib>Hamad, Darya S</creatorcontrib><creatorcontrib>Veisi, Hadi</creatorcontrib><creatorcontrib>Asaad, Aras T</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Abdullah, Abdulhady Abas</au><au>Abdulla, Srwa Hasan</au><au>Toufiq, Dalia Mohammad</au><au>Maghdid, Halgurd S</au><au>Rashid, Tarik A</au><au>Farho, Pakshan F</au><au>Sabr, Shadan Sh</au><au>Taher, Akar H</au><au>Hamad, Darya S</au><au>Veisi, Hadi</au><au>Asaad, Aras T</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>NER- RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages</atitle><date>2024-12-15</date><risdate>2024</risdate><abstract>Nowadays, Natural Language Processing (NLP) is an important tool for most people's daily life routines, ranging from understanding speech, translation, named entity recognition (NER), and text categorization, to generative text models such as ChatGPT. Due to the existence of big data and consequently large corpora for widely used languages like English, Spanish, Turkish, Persian, and many more, these applications have been developed accurately. However, the Kurdish language still requires more corpora and large datasets to be included in NLP applications. 
This is because Kurdish has a rich linguistic structure, varied dialects, and a limited dataset, which poses unique challenges for Kurdish NLP (KNLP) application development. While several studies have been conducted in KNLP for various applications, Kurdish NER (KNER) remains a challenge for many KNLP tasks, including text analysis and classification. In this work, we address this limitation by proposing a methodology for fine-tuning the pre-trained RoBERTa model for KNER. To this end, we first create a Kurdish corpus, followed by designing a modified model architecture and implementing the training procedures. To evaluate the trained model, a set of experiments is conducted to demonstrate the performance of the KNER model using different tokenization methods and trained models. The experimental results show that fine-tuned RoBERTa with the SentencePiece tokenization method substantially improves KNER performance, achieving a 12.8% improvement in F1-score compared to traditional models, and consequently establishes a new benchmark for KNLP.</abstract><doi>10.48550/arxiv.2412.15252</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2412.15252
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2412_15252
source	arXiv.org
subjects	Computer Science - Artificial Intelligence ; Computer Science - Computation and Language
title	NER- RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-02T09%3A32%3A18IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=NER-%20RoBERTa:%20Fine-Tuning%20RoBERTa%20for%20Named%20Entity%20Recognition%20(NER)%20within%20low-resource%20languages&rft.au=Abdullah,%20Abdulhady%20Abas&rft.date=2024-12-15&rft_id=info:doi/10.48550/arxiv.2412.15252&rft_dat=%3Carxiv_GOX%3E2412_15252%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true