Recovering Word Forms by Context for Morphologically Rich Languages

Bibliographic details
Published in:	Journal of mathematical sciences (New York, N.Y.), 2023-07, Vol. 273 (4), p. 527-532
Main authors:	Alekseev, A. M.; Nikolenko, S. I.
Format:	Article
Language:	English
Subjects:	Computational linguistics; Language processing; Machine learning; Machine translation; Mathematics; Mathematics and Statistics; Natural language interfaces; Training

Description
Abstract:	In this work, we focus on “sentence-level unlemmatization,” the task of generating a grammatical sentence given a lemmatized one; this task is usually easy for humans but may present problems for machine learning models. We treat this setting as a machine translation problem and, as a first attempt, apply a sequence-to-sequence model to the texts of Russian Wikipedia articles, quantitatively evaluate the effect of different training set sizes, and achieve a BLEU score of 67.3 using the largest training set available. We discuss preliminary results and flaws of traditional machine translation evaluation methods for this task and suggest directions for future research.
ISSN:	1072-3374 1573-8795
DOI:	10.1007/s10958-023-06518-7