Improvisation of Reddit flair detection using TF-IDF and countvectorizer
Main authors: | , , , , |
---|---|
Format: | Conference proceedings |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | The internet has become an essential part of everyday life. It makes many tasks easier, such as online payments. As the number of internet users grows, so does the volume of data. Social media platforms such as Facebook, Instagram, Reddit and Twitter are widely popular. Reddit, a social networking website where users create posts, discussion groups, etc., generates a massive amount of data daily. This data needs to be organized and analyzed so that it can be labeled and divided into specific categories. Posts on the platform are organized into categories defined by Reddit, known as "subreddits". This study designs a model that detects the flair (category) of a Reddit post. The dataset was scraped by the authors from the Reddit Application Programming Interface (API) using the PRAW library, and any instance with even a single blank column was removed. The text requires word embedding; that is, words with similar meanings are represented analogously. Word embedding is done using techniques such as CountVectorizer and TF-IDF. Several algorithms are used to predict the flairs, including Logistic Regression, Decision Tree, Random Forest and Gaussian Naive Bayes. The results of this study illustrate the importance of the proposed work. Limited research exists on flair identification on Reddit data, which adds to the uniqueness of this work. |
---|---|
ISSN: | 0094-243X 1551-7616 |
DOI: | 10.1063/5.0181369 |
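
The summary names two text representations, count vectors and TF-IDF, without showing how they differ. The following is a minimal standard-library sketch of both, not the authors' implementation. The post titles are invented for illustration, the function names (`tokenize`, `count_vectors`, `tfidf_vectors`) are hypothetical, and the weighting assumes a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would also
    # strip punctuation and possibly remove stop words.
    return text.lower().split()


def count_vectors(docs):
    """Return (vocabulary, rows), where each row holds raw term counts."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n
        rows.append(row)
    return vocab, rows


def tfidf_vectors(count_rows):
    """Reweight count rows by smoothed IDF, then L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Assumed smoothed IDF; rarer terms receive larger weights.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    out = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        out.append([x / norm for x in weighted])
    return out


# Invented example titles, standing in for scraped Reddit posts.
posts = [
    "budget session in parliament today",
    "cricket match today",
    "parliament passes budget",
]
vocab, counts = count_vectors(posts)
tfidf = tfidf_vectors(counts)
```

In the second title, "cricket" and "today" both have a raw count of 1, but "cricket" occurs in only one title and so receives the larger TF-IDF weight. Rows such as `counts` or `tfidf` are the kind of numeric input a classifier like logistic regression or a random forest would be trained on, paired with each post's flair as the label.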