Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection



Bibliographic details
Published in:	Information processing & management 2023-05, Vol.60 (3), p.103294, Article 103294
Main authors:	Martínez, Rubén Yáñez; Blanco, Guillermo; Lourenço, Anália
Format:	Article
Language:	English
Subjects:	Corpus annotation; Density-based clustering; Science & Technology; Semi-supervised learning; Social media; Stance detection; Transformer embeddings
Online access:	Full text

Description
Summary:
• Two public annotated corpora to evaluate stance detection classifiers over tweets in Spanish.
• Semi-supervised learning model for stance annotation in Spanish social posts.
• Combination of sentence-level deep learning embeddings and density-based clustering was applied to explore the corpora.

The paper presents new annotated corpora for performing stance detection on Spanish Twitter data, most notably health-related tweets. The objectives of this research are threefold: (1) to develop a manually annotated benchmark corpus for emotion recognition taking into account different variants of Spanish in social posts; (2) to evaluate the efficiency of semi-supervised models for extending such a corpus with unlabelled posts; and (3) to describe such short text corpora via specialised topic modelling. A corpus of 2,801 tweets about COVID-19 vaccination was annotated by three native speakers as in favour (904), against (674) or neither (1,223), with a Fleiss’ kappa score of 0.725. Results show that the self-training method with an SVM base estimator can alleviate annotation work while ensuring high model performance. The self-training model outperformed the other approaches and produced a corpus of 11,204 tweets with a macro-averaged F1 score of 0.94. The combination of sentence-level deep learning embeddings and density-based clustering was applied to explore the contents of both corpora. Topic quality was measured in terms of the trustworthiness and the validation index.
ISSN:	0306-4573, 1873-5371
DOI:	10.1016/j.ipm.2023.103294
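
The summary reports inter-annotator agreement as a Fleiss' kappa score for three annotators and three stance labels. As a rough illustration of how such a score is computed, the following is a minimal sketch in plain Python. The function name `fleiss_kappa`, the label strings, and the toy ratings are assumptions made for this example; they are not taken from the paper or its corpus, and the authors' actual tooling is not stated in this record.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Compute Fleiss' kappa.

    ratings: list of items, each a list of labels (one label per rater).
    categories: list of all possible labels.
    Every item must have been rated by the same number of raters (>= 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # How many raters chose each category, per item.
    counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    proportions = [sum(c[cat] for c in counts) / total for cat in categories]
    p_expected = sum(p ** 2 for p in proportions)

    if p_expected == 1.0:
        # All ratings fall in one category; kappa is undefined in this case.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: four tweets, three annotators each.
LABELS = ["favour", "against", "neither"]
toy_ratings = [
    ["favour", "favour", "favour"],
    ["against", "against", "against"],
    ["neither", "neither", "neither"],
    ["favour", "favour", "neither"],
]
toy_kappa = fleiss_kappa(toy_ratings, LABELS)
print(f"Fleiss' kappa on toy data: {toy_kappa:.3f}")
```

Values close to 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance. The toy ratings only exercise the formula and say nothing about the published corpus.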