WiHArD: Wikipedia based Hierarchical Arabic Dataset

Bibliographic details
Main author: Djelloul BOUCHIHA
Format: Dataset
Language: English
Description
Summary: WiHArD (Wikipedia based Hierarchical Arabic Dataset) is a hierarchical Arabic dataset of 6027 texts extracted from the Wikipedia website. WiHArD is structured into three "level 1" classes and nine "level 2" classes:
• "Level 1" classes are Culture (ثقافة), History (تاريخ) and Math (رياضيات). Texts in this level describe general notions related to these domains.
• "Level 2" classes are Clothes (ملابس), Food_drinks (طعام و شراب), Tourism (سياحة), Events (أحداث), Inventions (اختراعات), Monuments (أثار), Algebra (جبر), Analysis (تحليل) and Geometry (هندسة). Texts in this level describe specific notions related to these sub-domains.
Four files are shared for the benefit of the NLP and AI communities, especially researchers working on the Arabic language:
1. WiHArD_Directory_Hierarchy.zip contains the directory hierarchy.
2. WiHArD.csv, a CSV file of three columns: the "text" column contains the Arabic texts; the "category_path" and "category_code" columns contain the category path and the category code, respectively.
3. WiHArD_Level1.csv, a CSV file restricted to the texts of the first level, namely Culture (ثقافة), History (تاريخ) and Math (رياضيات).
4. WiHArD_Level2.csv, a CSV file restricted to the texts of the second level, namely Clothes (ملابس), Food_drinks (طعام و شراب), Tourism (سياحة), Events (أحداث), Inventions (اختراعات), Monuments (أثار), Algebra (جبر), Analysis (تحليل) and Geometry (هندسة).
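As an illustration of the three-column layout described for WiHArD.csv, the following is a minimal Python sketch that reads the file and counts texts per category code. Only the file name and the column names "text", "category_path" and "category_code" come from the description. The UTF-8 encoding, the presence of a header row, the file's location in the working directory, and the helper names read_rows and count_by_code are assumptions made for this example.

```python
import csv
from collections import Counter
from pathlib import Path

# Column names as given in the dataset description.
COLUMNS = ("text", "category_path", "category_code")

# Wikipedia texts can be long; raise the csv module's per-field limit
# (131072 characters by default) so large cells are not rejected.
csv.field_size_limit(10**9)


def read_rows(lines):
    """Parse CSV lines (an open file or any iterable of strings) into dicts."""
    reader = csv.DictReader(lines)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return list(reader)


def count_by_code(rows):
    """Count how many texts fall under each category_code value."""
    return Counter(row["category_code"] for row in rows)


# Assumption: WiHArD.csv sits in the working directory and is UTF-8 encoded.
# The block does nothing if the file is absent.
if __name__ == "__main__" and Path("WiHArD.csv").exists():
    with open("WiHArD.csv", encoding="utf-8", newline="") as handle:
        rows = read_rows(handle)
    print(len(rows), "texts")
    for code, n in count_by_code(rows).most_common():
        print(code, n)
```

The same two helpers should apply unchanged to WiHArD_Level1.csv and WiHArD_Level2.csv if those files share the column layout, which the description does not state explicitly.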
DOI: 10.17632/kdkryh5rs2.1