Shahmukhi named entity recognition by using contextualized word embeddings

Bibliographic Details
Published in: Expert Systems with Applications, 2023-11, Vol. 229, p. 120489, Article 120489
Authors: Tehseen, Amina; Ehsan, Toqeer; Liaqat, Hannan Bin; Kong, Xiangjie; Ali, Amjad; Al-Fuqaha, Ala
Format: Article
Language: English
Online Access: Full text
Description
Abstract: Named Entity Recognition (NER) is an important Natural Language Processing (NLP) task which aims to identify and classify predefined named entities in a given span of text. For many Western and Asian languages, NER is a well-studied and established task; however, little work has been done for Shahmukhi. This paper presents Shahmukhi NER with four key contributions. First, a Bidirectional Long Short-Term Memory (BiLSTM) network based NER model has been developed by incorporating various features, including character and word embeddings and Part of Speech (POS) tagging. Second, transfer learning has been employed by training context-free Word2Vec and contextualized Embeddings from Language Models (ELMo) word representations. The word representations have been trained on a Shahmukhi corpus of 14.9 million words. Third, we prepared a cleaner version of an existing Shahmukhi NER corpus by performing Unicode normalization and tokenization tasks. The corpus has been deduplicated, and results are reported on an unseen evaluation set, which yields valid results. Fourth, we have studied the impact of two annotation schemes, Inside-Outside (IO) and Inside-Outside-Beginning (IOB), for Shahmukhi. Transfer learning was quite helpful in enhancing the performance of the NER models; in particular, ELMo embeddings significantly improved the results by providing contextualized embedding vectors. This is the first study to use character embeddings, POS tagging and transfer learning for Shahmukhi named entity recognition. The IO scheme based model achieved an accuracy of 98.60% with an F-score of 83.75. The IOB scheme based model performed with an accuracy of 98.43% and an F-score of 75.55. These scores are quite promising for an under-resourced, morphologically rich language.

Highlights:
• The Shahmukhi NER corpus is prepared via Unicode normalization and cleaning steps.
• Experiments are performed using various features with BiLSTM NER taggers.
• The impact of IO and IOB annotation schemes on Shahmukhi is studied.
• An unseen corpus has been used for model evaluation.
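The abstract's first contribution is a BiLSTM tagger that combines word, character, and POS-tag features. The sketch below is a minimal illustration of that kind of architecture, not the authors' code; all vocabulary sizes, embedding dimensions, and layer widths are assumptions, and the word embedding layer stands in for the pretrained Word2Vec or ELMo representations.

```python
# Minimal sketch of a BiLSTM NER tagger with word, character, and POS features.
# All sizes below are assumed, not taken from the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, MAX_CHARS = 100, 20                           # sentence / word lengths (assumed)
N_WORDS, N_CHARS, N_POS, N_TAGS = 50_000, 60, 40, 9    # vocabulary sizes (assumed)

# Word-level input; the Embedding weights could be initialised from pretrained
# Word2Vec vectors, or replaced entirely by contextualized ELMo vectors.
word_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="words")
word_emb = layers.Embedding(N_WORDS, 300)(word_in)

# Character-level input: a small BiLSTM builds one vector per word.
char_in = layers.Input(shape=(MAX_LEN, MAX_CHARS), dtype="int32", name="chars")
char_emb = layers.TimeDistributed(layers.Embedding(N_CHARS, 25))(char_in)
char_rep = layers.TimeDistributed(layers.Bidirectional(layers.LSTM(25)))(char_emb)

# POS-tag input, embedded as a dense feature per token.
pos_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="pos")
pos_emb = layers.Embedding(N_POS, 25)(pos_in)

# Concatenate all token features and run the sentence-level BiLSTM tagger.
x = layers.Concatenate()([word_emb, char_rep, pos_emb])
x = layers.Bidirectional(layers.LSTM(100, return_sequences=True))(x)
out = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(x)

model = Model(inputs=[word_in, char_in, pos_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```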
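The corpus-preparation and embedding-training steps (Unicode normalization, deduplication, and training context-free Word2Vec vectors) could look roughly like the following sketch. The normalization form (NFC), the whitespace tokenization, the file name, and all Word2Vec hyperparameters are assumptions for illustration only.

```python
# Sketch of corpus cleaning and Word2Vec training with gensim (assumed tooling).
import unicodedata
from gensim.models import Word2Vec

def clean_corpus(lines):
    """Normalize Unicode (NFC assumed), strip whitespace, drop duplicate lines."""
    seen, cleaned = set(), []
    for line in lines:
        line = unicodedata.normalize("NFC", line).strip()
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return cleaned

# Whitespace splitting is a placeholder for the paper's tokenization step.
with open("shahmukhi_corpus.txt", encoding="utf-8") as f:   # hypothetical file
    sentences = [line.split() for line in clean_corpus(f)]

# Train context-free word vectors on the cleaned corpus (hyperparameters assumed).
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=2, sg=1, workers=4)
w2v.save("shahmukhi_w2v.model")
```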
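Finally, the two annotation schemes compared in the paper differ only in how entity-initial tokens are marked. The toy example below, with placeholder tokens and assumed entity types, shows the same sequence under IO and IOB and a simple IOB-to-IO collapse.

```python
# Illustrative example (not from the paper) of the IO and IOB tagging schemes.
tokens = ["<tok1>", "<tok2>", "<tok3>", "<tok4>"]

# IO: every token inside an entity gets I-TYPE, everything else gets O.
io_tags  = ["I-PERSON", "I-PERSON", "O", "I-LOCATION"]

# IOB: the first token of an entity is marked B-TYPE, later tokens I-TYPE.
iob_tags = ["B-PERSON", "I-PERSON", "O", "B-LOCATION"]

def iob_to_io(tags):
    """Collapse IOB tags to IO by dropping the B-/I- distinction."""
    return ["I-" + t[2:] if t != "O" else "O" for t in tags]

assert iob_to_io(iob_tags) == io_tags
```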
ISSN: 0957-4174
1873-6793
DOI: 10.1016/j.eswa.2023.120489