TwitterNEED: A hybrid approach for named entity extraction and disambiguation for tweet

Twitter is a rich source of continuously and instantly updated information. Shortness and informality of tweets are challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We be...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Natural language engineering 2016-05, Vol.22 (3), p.423-456
Hauptverfasser:	HABIB, MENA B., VAN KEULEN, MAURICE
Format:	Artikel
Sprache:	eng
Schlagworte:	Classification Computational linguistics Computer mediated communication Customer feedback Feature extraction Information retrieval Information technology Internet Knowledge bases (artificial intelligence) Natural language processing R&D Research & development Search engines Social networks Support vector machines Telematics User behavior User generated content
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Twitter is a rich source of continuously and instantly updated information. Shortness and informality of tweets are challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high extraction recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates using features derived from the disambiguation phase in addition to other word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain a list of entity candidates from the YAGO Knowledge Base in addition to top-ranked pages from the Google search engine for each extracted mention. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. For evaluation, five data sets are used to evaluate the extraction approach, and three of them to evaluate both the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results compared to our competitors DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.
ISSN:	1351-3249 1469-8110
DOI:	10.1017/S1351324915000194