Freely Available Arabic Corpora: A Scoping Review


Bibliographic details
Published in: Computer Methods and Programs in Biomedicine Update, 2022, Vol. 2, p. 100049, Article 100049
Main authors: Ahmed, Arfan; Ali, Nashva; Alzubaidi, Mahmood; Zaghouani, Wajdi; Abd-alrazaq, Alaa A; Househ, Mowafa
Format: Article
Language: English
Online access: Full text
Description
Summary:
• We identified 48 sources of freely available Arabic corpora, mainly from peer-reviewed literature.
• Arabic is underrepresented among freely available corpora.
• This review is the first of its kind to follow PRISMA guidelines and search the most common IT databases.
• Our findings may help those building ML/AI models for the Arabic language to identify sources, and may encourage researchers to make more databases available.

Corpora play a vital role in training machine learning (ML) models and in building systems that use natural language processing (NLP). It can be challenging for researchers to access corpora in a language other than English, and even more so if the corpora are not available free of cost. Arabic is the native language of over 250 million people and, because the Quran, the core text of Islam, is written in Arabic, it is used by more than 1.5 billion Muslims. This review highlights peer-reviewed literature reporting free and accessible Arabic corpora. We aimed to benefit researchers by providing insights into freely available and accessible Arabic corpora, allowing them to achieve their research goals with ease. We conducted a scoping review following PRISMA guidelines, searching the most common information technology (IT) databases to identify Arabic corpora that are accessible and free of cost. We identified a total of 48 sources of accessible Arabic corpora available free of cost. We present our findings by category, with direct links where available, to help readers understand the corpora. The results were classified by corpus type into five categories based on their primary purpose. Arabic is underrepresented among freely available corpora, as most such corpora are in English.
Although previous studies have searched for corpora, ours is the first of its kind: it follows the PRISMA guidelines and includes peer-reviewed articles obtained by searching the most common IT databases, together with source recommendations from language experts.
ISSN: 2666-9900
DOI: 10.1016/j.cmpbup.2022.100049