WiHArD: Wikipedia based Hierarchical Arabic Dataset
Main author: | |
---|---|
Format: | Dataset |
Language: | English |
Summary: | WiHArD (Wikipedia based Hierarchical Arabic Dataset) is a hierarchical Arabic dataset of 6027 texts extracted from the Wikipedia website. WiHArD is structured into three "level 1" classes and nine "level 2" classes:
• "Level 1" classes are Culture (ثقافة), History (تاريخ) and Math (رياضيات). Texts in this level describe general notions related to these domains.
• "Level 2" classes are Clothes (ملابس), Food_drinks (طعام و شراب), Tourism (سياحة), Events (أحداث), Inventions (اختراعات), Monuments (أثار), Algebra (جبر), Analysis (تحليل) and Geometry (هندسة). Texts in this level describe specific notions related to these sub-domains.
Four files are shared for the benefit of the NLP and AI communities, especially researchers working on the Arabic language:
1. WiHArD_Directory_Hierarchy.zip contains the directory hierarchy.
2. WiHArD.csv, a CSV file with three columns: the "text" column contains the Arabic texts; the "category_path" and "category_code" columns contain the category path and the category code, respectively.
3. WiHArD_Level1.csv, a CSV file restricted to the texts of the first level, namely Culture (ثقافة), History (تاريخ) and Math (رياضيات).
4. WiHArD_Level2.csv, a CSV file restricted to the texts of the second level, namely Clothes (ملابس), Food_drinks (طعام و شراب), Tourism (سياحة), Events (أحداث), Inventions (اختراعات), Monuments (أثار), Algebra (جبر), Analysis (تحليل) and Geometry (هندسة). |
DOI: | 10.17632/kdkryh5rs2.1 |
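
The three-column layout described for WiHArD.csv ("text", "category_path", "category_code") suggests a simple way to check the class distribution after download. The following is a minimal sketch, not part of the dataset itself: the helper name `count_by_column` is hypothetical, and it assumes the file is comma-delimited with a header row carrying exactly those column names. The sample rows are placeholders for illustration only and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Column names as stated in the dataset description.
EXPECTED_COLUMNS = {"text", "category_path", "category_code"}


def count_by_column(csv_file, column="category_path"):
    """Count rows per distinct value of `column` in a WiHArD-style CSV.

    `csv_file` is any open text file or file-like object.
    Assumes a header row and comma delimiters (not confirmed by the source).
    """
    reader = csv.DictReader(csv_file)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return Counter(row[column] for row in reader)


# Placeholder rows for illustration only; NOT real dataset content.
sample = io.StringIO(
    "text,category_path,category_code\n"
    "placeholder text 1,PATH_A,CODE_A\n"
    "placeholder text 2,PATH_A,CODE_A\n"
    "placeholder text 3,PATH_B,CODE_B\n"
)
counts = count_by_column(sample)
print(counts)
```

To run this against the real file, one would pass an open file handle instead of the sample, for example `open("WiHArD.csv", encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption and may need adjusting. The same function works on WiHArD_Level1.csv and WiHArD_Level2.csv if they share the same columns, which the description does not state explicitly.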