Towards acquisition of a thematic Persian corpus from the Tebyan Portal: TebCorp

Bibliographic details
Main authors: Khalifehsoltani, S N, Cholmaghani, A, Vahdani, A, Moallemi, R
Format: Conference proceedings
Language: English
Description
Abstract: The TebCorp collection is a large thematic corpus of modern Persian, consisting of 500 MB of text from the Tebyan Portal. TebCorp contains more than 93,000 articles in 1,097 topics, with more than 44 million total words and about 550,000 distinct words, which makes it suitable for information retrieval research. In this paper we attempt to exploit the Tebyan Portal, which contains a vast number of prominent Persian articles, as a linguistic resource for building a multipurpose thematic corpus for Persian. We present details on building the corpus, including information retrieval and collection assessment, and conclude with practical information about the corpus.
DOI: 10.1109/ICCET.2010.5485685