Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | This article presents Lisan, a set of morphologically annotated corpora of
the Yemeni, Sudanese, Iraqi, and Libyan Arabic dialects. Lisan contains around
1.2 million tokens. We collected the content of the corpora from several social
media platforms. The Yemeni corpus (~1.05M tokens) was collected automatically
from Twitter. The corpora of the other three dialects (~50K tokens each) were
collected manually from Facebook and YouTube posts and comments.
Thirty-five (35) annotators who are native speakers of the target dialects
carried out the annotations. The annotators segmented all words in the four
corpora into prefixes, stems, and suffixes and labeled each with morphological
features such as part of speech, lemma, and a gloss in English.
An Arabic Dialect Annotation Toolkit (ADAT) was developed for the purpose of the
annotation. The annotators were trained on a set of guidelines and on how to use
ADAT. We developed ADAT to assist the annotators and to ensure compatibility
with the SAMA and Curras tagsets. The tool is open source, and the four corpora
are also available online. |
---|---|
DOI: | 10.48550/arxiv.2212.06468 |
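
The annotation scheme described in the summary (each word split into prefixes, a stem, and suffixes, with a part of speech, a lemma, and an English gloss) can be pictured as a small record type. The following is a minimal sketch only: the class names, field names, tag labels (`CONJ`, `NOUN`, `POSS_PRON`), and the transliterated example word are all illustrative assumptions, not the actual Lisan file format, the ADAT data model, or the SAMA/Curras tagsets.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    surface: str  # text of this prefix, stem, or suffix
    pos: str      # part-of-speech label (placeholder, not a real tagset value)


@dataclass
class AnnotatedToken:
    word: str      # the original word as it appears in the text
    stem: Segment  # exactly one stem per word in this sketch
    lemma: str     # dictionary form of the word
    gloss: str     # short English gloss
    prefixes: List[Segment] = field(default_factory=list)
    suffixes: List[Segment] = field(default_factory=list)

    def segments(self) -> List[Segment]:
        """Return all segments in surface order: prefixes, stem, suffixes."""
        return [*self.prefixes, self.stem, *self.suffixes]

    def reassembled(self) -> str:
        """Concatenate the segment surfaces back into a single string."""
        return "".join(seg.surface for seg in self.segments())


# Hypothetical example in Latin transliteration: "w" (and) + "byt" (house) + "hm" (their).
token = AnnotatedToken(
    word="wbythm",
    stem=Segment("byt", "NOUN"),
    lemma="byt",
    gloss="house",
    prefixes=[Segment("w", "CONJ")],
    suffixes=[Segment("hm", "POSS_PRON")],
)

# A simple consistency check: the segments should spell out the original word.
is_consistent = token.reassembled() == token.word
```

A check like `is_consistent` is one way a segmentation could be sanity-tested, under the assumption that segments are stored as plain substrings of the word; real annotations may normalize spelling, in which case the comparison would not hold directly.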