A novel ensemble statistical topic extraction method for scientific publications based on optimization clustering

Bibliographic details
Published in:	Multimedia tools and applications 2021, Vol.80 (1), p.37-82
Main authors:	Abasi, Ammar Kamal, Khader, Ahamad Tajudin, Al-Betar, Mohammed Azmi, Naim, Syibrah, Makhadmeh, Sharif Naser, Alyasseri, Zaid Abdi Alkareem
Format:	Article
Language:	English
Subjects:	Algorithms; Clustering; Computer Communication Networks; Computer Science; Data Structures and Information Theory; Documents; Indexing; Information retrieval; Multimedia Information Systems; Optimization; Scientific papers; Special Purpose and Application-Based Systems
Online access:	Full text

Description
Summary:	Automatic topic extraction (TE) from scientific publications provides a very compact summary of the clusters’ contents, which often makes information easier to locate. TE also enables us to define the boundaries of scientific fields. Text Document Clustering (TDC) is, in general, the first step of topic identification: it identifies the documents that address a related subject matter. Metaheuristics are typically used as efficient approaches for TDC. The multi-verse optimizer (MVO) is a stochastic population-based algorithm that was recently proposed and has been successfully used to tackle many hard optimization problems. In the TE process, each statistical TE method focuses on different aspects of the language feature space. The aim of this paper is to design a novel ensemble method for automatic TE from a collection of scientific publications, using MVO as the clustering algorithm. The automatic TE methods used in our approach are term frequency-inverse document frequency (TF-IDF), most frequent based keyword extraction (TF), co-occurrence statistical information-based keyword extraction (CSI), TextRank (TR), and mutual information (MI). Each automatic TE method provides a group of candidate topics to the proposed ensemble method. Next, the ensemble approach prunes the candidate topic set by applying a specific filtering heuristic. Then, the candidates’ scores are recalculated based on the prescribed metrics. After that, dynamic threshold functions are applied to select a set of topics for the given scientific publications. The findings emphasized the efficiency and effectiveness of the refined candidate set. The results also showed that the new topics improved the system’s quality. The proposed method achieved better precision and recall on a similar dataset compared to the state-of-the-art TE methods.
ISSN:	1380-7501 1573-7721
DOI:	10.1007/s11042-020-09504-2
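
Illustration:	The summary above describes a pipeline in which several statistical extractors propose candidate topics, the ensemble filters the candidates, rescores them, and applies a dynamic threshold. The following Python sketch shows that general flow for two of the named extractors (TF and TF-IDF) only. It is not the authors' method: the record does not specify the filtering heuristic, the rescoring metrics, or the threshold functions, so the vote-based filter (`min_votes`), the rank-based rescoring, and the mean-score threshold below are all assumptions made for illustration, as are the function names. The MVO clustering step is omitted.

```python
import math
from collections import Counter


def tf_scores(doc_tokens):
    """Term frequency, normalised by document length."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: count / total for term, count in counts.items()}


def tfidf_scores(doc_tokens, corpus):
    """TF-IDF using a smoothed IDF: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    scores = {}
    for term, tf in tf_scores(doc_tokens).items():
        df = sum(1 for doc in corpus if term in doc)
        scores[term] = tf * (math.log((1 + n_docs) / (1 + df)) + 1.0)
    return scores


def top_k(scores, k):
    """The k best-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def ensemble_topics(doc_tokens, corpus, k=5, min_votes=2):
    """Combine candidate lists from TF and TF-IDF (hypothetical ensemble)."""
    candidate_lists = [
        top_k(tf_scores(doc_tokens), k),
        top_k(tfidf_scores(doc_tokens, corpus), k),
    ]
    votes = Counter()
    rescored = {}
    for candidates in candidate_lists:
        for rank, term in enumerate(candidates):
            votes[term] += 1
            # Assumed rescoring: higher-ranked candidates earn more credit.
            rescored[term] = rescored.get(term, 0.0) + (k - rank) / k
    # Assumed filtering heuristic: keep terms proposed by enough extractors.
    kept = {t: s for t, s in rescored.items() if votes[t] >= min_votes}
    if not kept:
        return []
    # Assumed dynamic threshold: the mean score of the surviving candidates.
    threshold = sum(kept.values()) / len(kept)
    selected = [t for t, s in kept.items() if s >= threshold]
    return sorted(selected, key=lambda t: (-kept[t], t))


if __name__ == "__main__":
    corpus = [
        "method clustering topic clustering method topic clustering optimizer method".split(),
        "method network routing network protocol".split(),
        "method image retrieval image indexing".split(),
    ]
    print(ensemble_topics(corpus[0], corpus, k=3))
```

In this toy corpus the word "method" occurs in every document, so TF-IDF ranks it lower than plain TF does; the threshold adapts to the scores of whichever candidates survive the filter, which is the sense in which it is "dynamic" here.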