An ensemble clustering approach for topic discovery using implicit text segmentation

Text segmentation (TS) is the process of dividing multi-topic text collections into cohesive segments using topic boundaries. Similarly, text clustering has been renowned as a major concern when it comes to multi-topic text collections, as they are distinguished by sub-topic structure and their cont...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Journal of information science 2021-08, Vol.47 (4), p.431-457
Hauptverfasser:	Memon, Muhammad Qasim, Lu, Yu, Chen, Penghe, Memon, Aasma, Pathan, Muhammad Salman, Zardari, Zulfiqar Ali
Format:	Artikel
Sprache:	eng
Schlagworte:	Algorithms Clustering Data retrieval Segmentation Segments
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Text segmentation (TS) is the process of dividing multi-topic text collections into cohesive segments using topic boundaries. Similarly, text clustering has been renowned as a major concern when it comes to multi-topic text collections, as they are distinguished by sub-topic structure and their contents are not associated with each other. Existing clustering approaches follow the TS method which relies on word frequencies and may not be suitable to cluster multi-topic text collections. In this work, we propose a new ensemble clustering approach (ECA) is a novel topic-modelling-based clustering approach, which induces the combination of TS and text clustering. We improvised a LDA-onto (LDA-ontology) is a TS-based model, which presents a deterioration of a document into segments (i.e. sub-documents), wherein each sub-document is associated with exactly one sub-topic. We deal with the problem of clustering when it comes to a document that is intrinsically related to various topics and its topical structure is missing. ECA is tested through well-known datasets in order to provide a comprehensive presentation and validation of clustering algorithms using LDA-onto. ECA exhibits the semantic relations of keywords in sub-documents and resultant clusters belong to original documents that they contain. Moreover, present research sheds the light on clustering performances and it indicates that there is no difference over performances (in terms of F-measure) when the number of topics changes. Our findings give above par results in order to analyse the problem of text clustering in a broader spectrum without applying dimension reduction techniques over high sparse data. Specifically, ECA provides an efficient and significant framework than the traditional and segment-based approach, such that achieved results are statistically significant with an average improvement of over 10.2%. For the most part, proposed framework can be evaluated in applications where meaningful data retrieval is useful, such as document summarization, text retrieval, novelty and topic detection.
ISSN:	0165-5515 1741-6485
DOI:	10.1177/0165551520911590