A Semi-supervised Approach of Cluster-Based Topic Modeling for Effective Tweet Hashtag Recommendation

Bibliographic details
Published in: SN computer science 2024-10, Vol.5 (7), p.951, Article 951
Main authors: Pattanayak, Pradipta Kumar; Tripathy, Rudra M.; Padhy, Sudarsan
Format: Article
Language: English
Online access: Full text
Description
Abstract: Twitter has emerged as a significant source of data for text summarization, topic modeling, document clustering, information retrieval, sentiment analysis, and related tasks. Hashtags let Twitter users categorize their tweets, because they provide the essential meta-information that connects tweets to their underlying themes. However, the majority of tweets do not have hashtags, which makes it challenging to search for a particular theme. The proposed model is designed to recommend appropriate hashtags for tweets by categorizing them into sports, politics, health, or technology. In this paper, we propose a novel heuristic for recommending hashtags for tweets. We take 20,000 tweets, 5000 from each of the four specified topics, along with their respective hashtags. These hashtags were manually assigned by a group of experts and were subsequently excluded during the topic modeling process. A basic data-cleaning technique is applied to clean and tokenize the tweets. The Word2Vec technique is then used to vectorize the tokens, which captures the semantic meaning of the words in the tweets and overcomes data sparsity issues. The dimension of the data is reduced using Singular Value Decomposition (SVD) followed by t-SNE (t-distributed Stochastic Neighbor Embedding). The reduced data is divided into four clusters, and a semi-supervised method is introduced to link these clusters to the aforementioned topics, which eventually helps to produce the hashtag for a list of tweets. Compared with existing techniques, our model performs better with respect to precision, recall, and F1-score.
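The abstract does not spell out how clusters are linked to topics. As a minimal sketch, assuming the link is a majority vote over a small set of tweets whose topic is known, the step could look like the following. The function names, the toy cluster ids, and the seed labels are all hypothetical and are not taken from the paper; the upstream cleaning, Word2Vec, SVD, t-SNE, and clustering steps are not reproduced here.

```python
from collections import Counter, defaultdict


def link_clusters_to_topics(cluster_ids, seed_labels):
    """Map each cluster id to the majority topic among its labeled seed tweets.

    cluster_ids: list with one cluster id per tweet.
    seed_labels: dict {tweet index: topic} for the small labeled subset.
    Assumption: majority voting stands in for the paper's linking method.
    """
    votes = defaultdict(Counter)
    for idx, topic in seed_labels.items():
        votes[cluster_ids[idx]][topic] += 1
    # Clusters that contain no labeled seed tweet stay unmapped.
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


def recommend_hashtags(cluster_ids, cluster_to_topic):
    """Return '#topic' per tweet, or None if its cluster has no linked topic."""
    result = []
    for c in cluster_ids:
        topic = cluster_to_topic.get(c)
        result.append(f"#{topic}" if topic is not None else None)
    return result


# Toy data (hypothetical): cluster ids for 8 tweets, as a clustering step
# with four clusters might produce them.
cluster_ids = [0, 0, 1, 1, 2, 2, 3, 3]
# Hypothetical seed set: indices of a few tweets with known topics.
seed_labels = {0: "sports", 2: "politics", 3: "politics",
               4: "health", 7: "technology"}

cluster_to_topic = link_clusters_to_topics(cluster_ids, seed_labels)
hashtags = recommend_hashtags(cluster_ids, cluster_to_topic)
```

Here only five of the eight toy tweets carry a label, yet every tweet receives a hashtag through its cluster, which is the sense in which such a scheme is semi-supervised.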
ISSN:2661-8907
2662-995X
DOI:10.1007/s42979-024-03299-x