AdaSL: An Unsupervised Domain Adaptation framework for Arabic multi-dialectal Sequence Labeling

Dialectal Arabic (DA) refers to varieties of everyday spoken languages in the Arab world. These dialects differ according to the country and region of the speaker, and their textual content is constantly growing with the rise of social media networks and web blogs. Although research on Natural Langu...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Information processing & management 2022-07, Vol.59 (4), p.102964, Article 102964
Hauptverfasser: El Mekki, Abdellah, El Mahdaouy, Abdelkader, Berrada, Ismail, Khoumsi, Ahmed
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Dialectal Arabic (DA) refers to varieties of everyday spoken languages in the Arab world. These dialects differ according to the country and region of the speaker, and their textual content is constantly growing with the rise of social media networks and web blogs. Although research on Natural Language Processing (NLP) on standard Arabic, namely Modern Standard Arabic (MSA), has witnessed remarkable progress, research efforts on DA are rather limited. This is due to numerous challenges, such as the scarcity of labeled data as well as the nature and structure of DA. While some recent works have reached decent results on several DA sentence classification tasks, other complex tasks, such as sequence labeling, still suffer from weak performances when it comes to DA varieties with either a limited amount of labeled data or unlabeled data only. Besides, it has been shown that zero-shot transfer learning from models trained on MSA does not perform well on DA. In this paper, we introduce AdaSL, a new unsupervised domain adaptation framework for Arabic multi-dialectal sequence labeling, leveraging unlabeled DA data, labeled MSA data, and existing multilingual and Arabic Pre-trained Language Models (PLMs). The proposed framework relies on four key components: (1) domain adaptive fine-tuning of multilingual/MSA language models on unlabeled DA data, (2) sub-word embedding pooling, (3) iterative self-training on unlabeled DA data, and (4) iterative DA and MSA distribution alignment. We evaluate our framework on multi-dialectal Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The overall results show that the zero-shot transfer learning, using our proposed framework, boosts the performance of the multilingual PLMs by 40.87% in macro-F1 score for the NER task, while it boosts the accuracy by 6.95% for the POS tagging task. For the Arabic PLMs, our proposed framework increases performance by 16.18% macro-F1 for the NER task and 2.22% accuracy for the POS tagging task, and thus, achieving new state-of-the-art zero-shot transfer learning performance for Arabic multi-dialectal sequence labeling. •We present AdaSL, an unsupervised framework for dialectal Arabic sequence labeling.•We introduce a sub-word pooling aggregator for the full representation of words.•We apply AdaSL for multilingual and Arabic Transformer pre-trained language models.•We validate AdaSL on Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.•We achieve new SOTA zero-sh
ISSN:0306-4573
1873-5371
DOI:10.1016/j.ipm.2022.102964