A modified bond energy algorithm with fuzzy merging and its application to Arabic text document clustering

Bibliographic details
Published in: Expert Systems with Applications, 2020-11, Vol. 159, p. 113598, Article 113598
Main authors: AlMahmoud, Rana Husni, Hammo, Bassam, Faris, Hossam
Format: Article
Language: English
Description
Summary:
• The study describes the essential phases of Arabic text clustering.
• A bond energy algorithm for text document clustering is presented.
• A fuzzy merge algorithm is explored to improve clustering.
• Several clustering algorithms are compared and evaluated on Arabic datasets.

Conventional text document clustering algorithms suffer from several shortcomings, such as slow convergence on very high-dimensional data, sensitivity to initial values, and cluster descriptions that are hard to interpret. Although many clustering algorithms have been developed for English and other languages, very few have tackled the problem of clustering documents in the under-resourced Arabic language. In this work, we propose a modified version of the Bond Energy Algorithm (BEA) combined with a fuzzy merging technique to solve the problem of Arabic text document clustering. The proposed algorithm, Clustering Arabic Documents based on Bond Energy (CADBE), attempts to identify and display natural variable clusters within very large data sets. CADBE clusters Arabic documents in three steps: the first step instantiates a cluster affinity matrix using the BEA, the second uses a novel method to partition the cluster matrix automatically into small coherent clusters, and the last uses a fuzzy merging technique to merge similar clusters based on the associations and interrelations between the resulting clusters. Experimental results showed that the proposed algorithm outperformed conventional clustering algorithms such as Expectation–Maximization (EM), Single Linkage, and UPGMA in terms of clustering purity and entropy. It also outperformed k-means, k-means++, spherical k-means, and CoclusMod in most test cases. CADBE has several further merits. First, unlike traditional clustering algorithms, it does not require the number of clusters to be specified. In addition, it produces clusters with distinct boundaries, which makes its results more objective. Finally, it is deterministic, so it is insensitive to the order in which documents are presented to the algorithm.
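The abstract evaluates clustering quality by purity and entropy. As a rough illustration of what those two measures compute, the following minimal sketch uses their common textbook definitions. The function names, the base-2 logarithm, and the unnormalized, size-weighted entropy are assumptions made here for illustration; the record does not state the exact formulas the authors used.

```python
from collections import Counter
from math import log2


def _group_by_cluster(clusters, labels):
    """Collect the true labels of the documents assigned to each cluster."""
    groups = {}
    for cluster_id, label in zip(clusters, labels):
        groups.setdefault(cluster_id, []).append(label)
    return groups


def purity(clusters, labels):
    """Share of documents that carry the majority label of their cluster (higher is better)."""
    groups = _group_by_cluster(clusters, labels)
    majority_total = sum(Counter(ys).most_common(1)[0][1] for ys in groups.values())
    return majority_total / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the per-cluster label entropy in bits (lower is better).

    Assumption: no normalization by the log of the number of classes.
    """
    groups = _group_by_cluster(clusters, labels)
    n = len(labels)
    total = 0.0
    for ys in groups.values():
        size = len(ys)
        cluster_entropy = -sum(
            (count / size) * log2(count / size) for count in Counter(ys).values()
        )
        total += (size / n) * cluster_entropy
    return total


# Toy example: six documents, two clusters, two true categories.
assigned = [0, 0, 0, 1, 1, 1]
truth = ["sport", "sport", "economy", "economy", "economy", "economy"]
print(purity(assigned, truth))   # 5 of 6 documents match their cluster's majority label
print(entropy(assigned, truth))  # only cluster 0 is mixed, so only it contributes
```

A perfect clustering gives a purity of 1 and an entropy of 0, which is why the abstract treats higher purity and lower entropy as better.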
ISSN: 0957-4174; 1873-6793
DOI: 10.1016/j.eswa.2020.113598