Code-Mixed Sentiment Analysis using Transformer for Twitter Social Media Data

Bibliographic Details
Published in: International journal of advanced computer science & applications, 2023, Vol. 14 (10)
Main authors: Astuti, Laksmita Widya; Sari, Yunita; Suprapto
Format: Article
Language: English
Online access: Full text
Description
Abstract: The underrepresentation of the Indonesian language in the field of Natural Language Processing (NLP) can be attributed to several key factors, including the absence of annotated datasets, limited language resources, and a lack of standardization in these resources. One notable linguistic phenomenon in Indonesia is code-mixing between Bahasa Indonesia and English, which is influenced by various sociolinguistic factors, including individual speaker characteristics, the linguistic environment, the societal status of languages, and everyday language usage. To address the challenges posed by code-mixed data, this research project created a code-mixed dataset for sentiment analysis. The dataset was constructed from keywords derived from the sociolinguistic phenomenon observed among teenagers in South Jakarta. Using this newly developed dataset, we conducted a series of experiments employing different pre-processing techniques and pre-trained models. The results demonstrate that the IndoBERTweet pre-trained model is highly effective for sentiment analysis on Indonesian-English code-mixed data, yielding an average precision of 76.07%, a recall of 75.52%, an F1 score of 75.51%, and an accuracy of 76.56%.
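The abstract describes fine-tuning the IndoBERTweet pre-trained model on Twitter data. The paper's exact pipeline is not given here, so the sketch below is only illustrative: it shows the kind of tweet normalization commonly paired with IndoBERTweet (mentions and URLs replaced by placeholder tokens, lowercasing for the uncased model) and a hypothetical model-loading helper. The checkpoint id `indolem/indobertweet-base-uncased` refers to the publicly released IndoBERTweet model; the three-class label setup and the `build_model` helper are assumptions, not the authors' code.

```python
import re

def preprocess_tweet(text: str) -> str:
    """Normalize a raw tweet before tokenization (illustrative sketch).

    User mentions and URLs are mapped to placeholder tokens, a common
    Twitter-NLP convention; the text is lowercased for an uncased model.
    """
    text = re.sub(r"@\w+", "@USER", text)          # @someone -> @USER
    text = re.sub(r"https?://\S+", "HTTPURL", text)  # links -> HTTPURL
    return text.lower().strip()

def build_model(num_labels: int = 3):
    """Load IndoBERTweet for sequence classification (hypothetical setup).

    num_labels=3 assumes a negative/neutral/positive scheme; the classifier
    head is randomly initialized and must be fine-tuned before use.
    """
    # Lazy import keeps the preprocessing function dependency-free.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "indolem/indobertweet-base-uncased", num_labels=num_labels)
    return tokenizer, model
```

For example, a code-mixed tweet such as `"Gue happy banget sama service-nya @tokopedia! https://t.co/x"` would be normalized to `"gue happy banget sama service-nya @user httpurl"` before being tokenized and passed to the fine-tuned classifier.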
ISSN:2158-107X
2156-5570
DOI:10.14569/IJACSA.2023.0141053