TweetNorm: a benchmark for lexical normalization of Spanish tweets

Bibliographic Details
Published in:	Language Resources and Evaluation 2015-12, Vol.49 (4), p.883-905
Main authors:	Alegria, Iñaki; Aranberri, Nora; Comas, Pere R.; Fresno, Víctor; Gamallo, Pablo; Padró, Lluis; San Vicente, Iñaki; Turmo, Jordi; Zubiaga, Arkaitz
Format:	Article
Language:	English
Subjects:	Abundance; Benchmarking; Benchmarks; Computational Linguistics; Computer mediated communication; Computer Science; Corpus; Corpus linguistics; Digital media; Evaluation; Language; Language and Literature; Lexical normalization; Lexicografia; Linguistics; Mitjans de comunicació social; Natural language processing; Normalització lingüística; Organizations; Original Paper; Social media; Social networks; Social Sciences; Spanish language; Standard language; Texts; Twitter
Online access:	Full text

Description
Abstract:	The language used in social media is often characterized by an abundance of informal and non-standard writing. Normalizing this non-standard language can be crucial to facilitating subsequent textual processing and, consequently, to boosting the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of the systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework as well as an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation have brought to light the types of words that the submitted systems handled best, and point to the main shortcomings to be addressed in future work.
ISSN:	1574-020X 1572-8412 1574-0218
DOI:	10.1007/s10579-015-9315-6