CESS-A System to Categorize Bangla Web Text Documents

Technology has evolved remarkably, which has led to an exponential increase in the availability of digital text documents of disparate domains over the Internet. This makes the retrieval of the information a very much time- and resource-consuming task. Thus, a system that can categorize such documen...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:ACM transactions on Asian and low-resource language information processing 2020-08, Vol.19 (5), p.1-18
Hauptverfasser: Dhar, Ankita, Mukherjee, Himadri, Dash, Niladri Sekhar, Roy, Kaushik
Format: Artikel
Sprache:eng
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Technology has evolved remarkably, which has led to an exponential increase in the availability of digital text documents of disparate domains over the Internet. This makes the retrieval of the information a very much time- and resource-consuming task. Thus, a system that can categorize such documents based on their domains can truly help the users in obtaining the required information with relative ease and also reduce the workload of the search engines. This article presents a text categorization system (CESS) that categorizes text document using newly proposed hybrid features that combines term frequency-inverse document frequency-inverse class frequency and modified chi-square methods. Experiments were performed on real-world Bangla documents from eight domains comprises of 24,29,857 tokens, and the highest accuracy of 99.91% has been obtained with multilayer perceptron-based classification. Also, the experiments were tested on Reuters-21578 and 20 Newsgroups datasets and obtained accuracies of 97.29% and 94.67%, respectively, to show the language-independent nature of the system.
ISSN:2375-4699
2375-4702
DOI:10.1145/3398070