An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit
Published in: Information processing & management, 2020-03, Vol. 57 (2), p. 102034, Article 102034
Main authors: , , ,
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract:
•Document clustering with document embedding representations combined with k-means clustering delivered the best performance.
•The number of epochs required for optimal training of document embeddings is, in general, inversely proportional to the document length.
•Document clusters can be interpreted via top terms extracted by combining tf-idf scores with word embedding similarities.
•The Adjusted Rand Index and Adjusted Mutual Information are the most appropriate extrinsic evaluation measures for clustering.
Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user-generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data, as such text is notoriously short and noisy, and results are often not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations, derived from term frequency-inverse document frequency (tf-idf) matrices and word embedding models, combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation of the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over datasets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all datasets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words-based approach using tf-idf weights combined with embedding distance measures.
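The cluster-interpretation step mentioned in the abstract (top terms ranked by tf-idf weight) can be sketched in plain Python. This is an illustrative sketch only: the function name and whitespace tokenisation are assumptions, and the word-embedding-similarity component the authors combine with the tf-idf weights is omitted here.

```python
import math
from collections import Counter

def top_terms_per_cluster(docs, labels, k=3):
    """Rank each cluster's terms by the summed tf-idf weight over member documents."""
    n = len(docs)
    # Document frequency of each term across the whole corpus
    df = Counter(t for doc in docs for t in set(doc.split()))
    idf = {t: math.log(n / df_t) for t, df_t in df.items()}
    result = {}
    for c in set(labels):
        members = [docs[i] for i in range(n) if labels[i] == c]
        scores = Counter()
        for doc in members:
            tokens = doc.split()
            tf = Counter(tokens)
            for t, f in tf.items():
                # Term frequency normalised by document length, weighted by idf
                scores[t] += (f / len(tokens)) * idf[t]
        result[c] = [t for t, _ in scores.most_common(k)]
    return result
```

For example, on a toy corpus clustered into two groups, terms shared by every document receive zero idf weight, so each cluster is summarised by its distinctive vocabulary.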
ISSN: 0306-4573, 1873-5371
DOI: 10.1016/j.ipm.2019.04.002
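The abstract recommends the Adjusted Rand Index as an extrinsic clustering measure. A minimal chance-corrected implementation in plain Python is sketched below (the function name is illustrative; a library such as scikit-learn provides an equivalent `adjusted_rand_score`).

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two clusterings of the same items."""
    n = len(labels_true)
    # Contingency counts: how many items fall in each (true, predicted) cluster pair
    contingency = Counter(zip(labels_true, labels_pred))
    # Numbers of agreeing item pairs within cells, rows, and columns
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

Because the index is corrected for chance, relabelling the clusters (swapping cluster ids) still yields 1.0 for a perfect match, while random assignments score near zero.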