A comparative evaluation of pre-processing techniques and their interactions for Twitter sentiment analysis

Bibliographic details
Published in: Expert Systems with Applications 2018-11, Vol.110, p.298-310
Main authors: Symeonidis, Symeon; Effrosynidis, Dimitrios; Arampatzis, Avi
Format: Article
Language: English
Description
Summary:
• Experimental comparison of sixteen preprocessing techniques for Sentiment Analysis.
• Use of two Twitter datasets and four popular machine learning algorithms.
• Evaluation of the techniques' resulting classification accuracy.
• Lemmatization, number removal, and contraction replacement increase accuracy.
• An ablation and combination study was carried out to check interactions among techniques.

Pre-processing is the first step in text classification, and choosing the right pre-processing techniques can improve classification effectiveness. We experimentally compare 16 commonly used pre-processing techniques on two Twitter datasets for Sentiment Analysis, employing four popular machine learning algorithms, namely Linear SVC, Bernoulli Naïve Bayes, Logistic Regression, and Convolutional Neural Networks. We evaluate the pre-processing techniques on their resulting classification accuracy and on the number of features they produce. We find that techniques such as lemmatization, removing numbers, and replacing contractions improve accuracy, while others, such as removing punctuation, do not. Finally, to investigate interactions (desirable or otherwise) between the techniques when they are employed simultaneously in a pipeline fashion, an ablation and combination study is conducted. The results of ablation and combination show the significance of techniques such as replacing numbers and replacing repetitions of punctuation.
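The pipeline and ablation ideas in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the contraction table, the placeholder tokens (`multiExclamation`, `multiQuestion`), and all function names are assumptions made for illustration, and only three of the sixteen techniques are mimicked. Vocabulary size stands in here for the "number of features" criterion.

```python
import re

# Hypothetical, deliberately tiny contraction table (assumption);
# a realistic table would be far larger and handle curly apostrophes.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "i'm": "i am",
}
_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def replace_contractions(text):
    """Expand contractions found in the table (expansions are lowercase)."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def remove_numbers(text):
    """Delete every run of digits."""
    return re.sub(r"\d+", "", text)


def replace_punctuation_repetitions(text):
    """Map runs such as '!!!' or '???' to placeholder tokens (names assumed)."""
    text = re.sub(r"!{2,}", " multiExclamation ", text)
    return re.sub(r"\?{2,}", " multiQuestion ", text)


# Techniques in the order they are applied when all are enabled.
TECHNIQUES = {
    "contractions": replace_contractions,
    "numbers": remove_numbers,
    "punct_repetitions": replace_punctuation_repetitions,
}


def preprocess(text, enabled):
    """Apply the enabled techniques in order, then normalize whitespace."""
    for name in enabled:
        text = TECHNIQUES[name](text)
    return " ".join(text.split())


def ablation_variants(names):
    """Yield (removed technique, remaining techniques) for each technique."""
    for removed in names:
        yield removed, [n for n in names if n != removed]


def vocabulary_size(texts):
    """Count distinct whitespace-separated tokens, a rough feature count."""
    return len({tok for t in texts for tok in t.lower().split()})
```

In an ablation run along these lines, each variant returned by `ablation_variants(list(TECHNIQUES))` would be used to preprocess the same tweets, and the resulting classifier accuracy and `vocabulary_size` would be compared against the full pipeline to see which technique's removal matters most.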
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2018.06.022