Towards Sentiment Analysis for Romanian Twitter Content

Bibliographic details
Published in:	Algorithms 2022-10, Vol.15 (10), p.357
Main authors:	Neagu, Dan Claudiu; Rus, Andrei Bogdan; Grec, Mihai; Boroianu, Mihai Augustin; Bogdan, Nicolae; Gal, Attila
Format:	Article
Language:	English
Subjects:	Classification; Classifiers; Computational linguistics; COVID-19 vaccines; Data mining; Datasets; Deep learning; Digital media; Language processing; Languages; Machine learning; Natural language interfaces; natural language processing; Privacy; Sentiment analysis; Social media; Social networks; Translating and interpreting; Twitter; underrepresented language

Description
Abstract:	With the growing popularity of social media platforms such as Twitter and Facebook, sentiment analysis (SA) of microblogging content has become crucially important. The literature reports good results for well-resourced languages such as English, Spanish or German, but open research space still exists for underrepresented languages such as Romanian, which lack public training datasets and pretrained word embeddings. Most research on Romanian SA treats the task as binary classification (positive vs. negative), using a single public dataset of product reviews. In this paper, we respond to the need of a media surveillance project for a custom multinomial SA classifier to be used in a restrictive and specific production setup. We describe in detail how such a classifier was built with the help of an English dataset (around 15,000 tweets) translated to Romanian with a public translation service. We test the most popular classification methods applicable to SA, including standard machine learning, deep learning and BERT. As we could not find any results for multinomial sentiment classification (positive, negative and neutral) in Romanian, we set two benchmark accuracies of ≈78% using standard machine learning and ≈81% using BERT. Furthermore, we demonstrate that the automatic translation service does not degrade learning performance by comparing the accuracies achieved by models trained on the original dataset with those of models trained on the translated data.
ISSN:	1999-4893
DOI:	10.3390/a15100357