Multi-level out-of-vocabulary words handling approach

Bibliographic details
Published in:	Knowledge-Based Systems, 2022-09, Vol. 251, p. 108911, Article 108911
Main authors:	Lochter, Johannes V.; Silva, Renato M.; Almeida, Tiago A.
Format:	Article
Language:	English
Subjects:	Distributed vector representation; Machine learning; Natural language processing; Out-of-vocabulary words

Description
Abstract:	Distributed representation models can generate a vector representation only for words that belong to a finite vocabulary collected from the training data. If out-of-vocabulary (OOV) words are not handled properly, they can impair the performance of machine learning methods in a given natural language processing task. This study offers a new methodology based on the consolidated top-down human reading theory, which may serve as a strong basis for developing new techniques to deal with the OOV problem. For this, we present MLOH, a Multi-Level OOV Handling approach, based on three chained strategies: analogy, decoding, and prediction. The techniques available in the literature, in general, are limited since they often resolve specific types of OOV words, such as those that can be inferred by analyzing their morphological structure or context. Compared to the process used by human readers to infer unknown words, using a single strategy is generally not effective. We evaluated MLOH performance on tasks that can be highly affected by OOV words, such as part-of-speech tagging, named entity recognition, and text categorization of short and noisy texts. The results indicate that the proposed approach is promising since it could handle most of the OOV words presented, is more generalist, and obtained competitive performance in all experiments.
ISSN:	0950-7051, 1872-7409
DOI:	10.1016/j.knosys.2022.108911
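
Illustrative note: the abstract describes three chained strategies (analogy, decoding, prediction) but does not specify how each one works. The Python sketch below only illustrates the general idea of a chained fallback, in which each strategy is tried in order until one yields a vector. Everything in it is an assumption made for illustration and is not taken from the article: the toy vocabulary `VOCAB`, the function names, and the stand-in logic (spelling similarity for analogy, splitting into two known pieces for decoding, averaging known context words for prediction).

```python
from difflib import get_close_matches
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = List[float]
Strategy = Callable[[str, Sequence[str]], Optional[Vector]]

# Toy vocabulary with made-up 2-d vectors (hypothetical, illustration only).
VOCAB: Dict[str, Vector] = {
    "play": [1.0, 0.0],
    "ing": [0.0, 1.0],
    "good": [0.5, 0.5],
    "very": [0.25, 0.75],
}


def mean(vectors: Sequence[Vector]) -> Vector:
    """Component-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def by_analogy(word: str, context: Sequence[str]) -> Optional[Vector]:
    # Stand-in: reuse the vector of a similarly spelled known word.
    match = get_close_matches(word, list(VOCAB), n=1, cutoff=0.8)
    return VOCAB[match[0]] if match else None


def by_decoding(word: str, context: Sequence[str]) -> Optional[Vector]:
    # Stand-in: split the word into two known pieces and average them.
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in VOCAB and tail in VOCAB:
            return mean([VOCAB[head], VOCAB[tail]])
    return None


def by_prediction(word: str, context: Sequence[str]) -> Optional[Vector]:
    # Stand-in: average the vectors of the known context words.
    known = [VOCAB[token] for token in context if token in VOCAB]
    return mean(known) if known else None


# The order of the chain follows the order named in the abstract.
CHAIN: Tuple[Tuple[str, Strategy], ...] = (
    ("analogy", by_analogy),
    ("decoding", by_decoding),
    ("prediction", by_prediction),
)


def handle_oov(word: str, context: Sequence[str] = ()) -> Tuple[str, Optional[Vector]]:
    """Return (name of the level that answered, vector or None)."""
    if word in VOCAB:
        return "vocabulary", VOCAB[word]
    for name, strategy in CHAIN:
        vector = strategy(word, context)
        if vector is not None:
            return name, vector
    return "unresolved", None
```

In this sketch, a misspelling such as "goood" would be resolved at the first level, "playing" at the second, and an unrelated token only at the third, provided its context contains known words. A single-strategy handler would leave at least two of these three cases unresolved, which is the limitation the abstract attributes to existing techniques.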