NER-RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) in Low-Resource Languages
Saved in:
Main authors: , , , , , , , , , ,
Format: Article
Language: eng
Subjects:
Online access: Order full text
Abstract: Nowadays, Natural Language Processing (NLP) is an important tool for most
people's daily life routines, ranging from understanding speech, translation,
named entity recognition (NER), and text categorization, to generative text
models such as ChatGPT. Thanks to big data and the consequently large
corpora available for widely used languages such as English, Spanish,
Turkish, and Persian, these applications have reached high accuracy.
However, the
Kurdish language still requires more corpora and large datasets to be included
in NLP applications: its rich linguistic structure, varied dialects, and
limited datasets pose unique challenges for Kurdish NLP (KNLP)
application development. While several studies have been
conducted in KNLP for various applications, Kurdish NER (KNER) remains a
challenge for many KNLP tasks, including text analysis and classification. In
this work, we address this limitation by proposing a methodology for
fine-tuning the pre-trained RoBERTa model for KNER. To this end, we first
create a Kurdish corpus, followed by designing a modified model architecture
and implementing the training procedures. To evaluate the trained model, we
conduct a set of experiments comparing KNER performance across different
tokenization methods and pre-trained models. The experimental
results show that fine-tuning RoBERTa with the SentencePiece tokenization
method substantially improves KNER performance, yielding a 12.8%
improvement in F1-score over traditional models and thus establishing a
new benchmark for KNLP.
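The abstract credits much of the KNER gain to the subword tokenization method. A recurring implementation detail when fine-tuning RoBERTa-style models for NER is aligning word-level BIO entity labels with subword tokens; the minimal sketch below illustrates that alignment step. The `split_subwords` function is a toy stand-in for a real SentencePiece tokenizer, and all names and example data here are illustrative assumptions, not taken from the paper:

```python
# Toy sketch of BIO label alignment across subword tokens, the step
# needed when fine-tuning RoBERTa-style models for NER. A real pipeline
# would use a trained SentencePiece tokenizer; this hypothetical
# splitter just breaks long words in half to show the alignment logic.

def split_subwords(word):
    """Hypothetical subword splitter: split words longer than 4 chars."""
    if len(word) <= 4:
        return [word]
    mid = len(word) // 2
    return [word[:mid], "##" + word[mid:]]

def align_labels(words, labels):
    """Expand word-level BIO labels to subword level.

    The first subword keeps the word's label; continuation subwords of
    a B-X word receive I-X, so entity boundaries survive tokenization.
    """
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, labels):
        pieces = split_subwords(word)
        sub_tokens.extend(pieces)
        sub_labels.append(label)
        cont = "I-" + label[2:] if label.startswith("B-") else label
        sub_labels.extend([cont] * (len(pieces) - 1))
    return sub_tokens, sub_labels

words = ["Hewler", "is", "in", "Kurdistan"]
labels = ["B-LOC", "O", "O", "B-LOC"]
toks, labs = align_labels(words, labels)
print(toks)  # ['Hew', '##ler', 'is', 'in', 'Kurd', '##istan']
print(labs)  # ['B-LOC', 'I-LOC', 'O', 'O', 'B-LOC', 'I-LOC']
```

In practice, many fine-tuning recipes instead assign an ignore index (e.g. -100) to continuation subwords so the loss is computed only on first subwords; propagating I- labels, as above, is the other common choice.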
DOI: 10.48550/arxiv.2412.15252