The Impact of Multilinguality and Tokenization on Statistical Machine Translation

Multilingual neural machine translation systems have achieved state-of-the-art translation quality, especially for low-resource languages, yet statistical machine translation systems have not been trained and examined in a similar multilingual setup. This work defines a multilingual statisti...

Bibliographic details
Main authors:	Asvarov, Alidar; Grabovoy, Andrey
Format:	Conference proceedings
Language:	eng
Subjects:	Buildings; Costs; machine translation; Measurement; moses; multilingual; Production; statistical machine translation; Technological innovation; tokenization; Training; Training data
Online access:	Full text

Description
Abstract:	Multilingual neural machine translation systems have achieved state-of-the-art translation quality, especially for low-resource languages, yet statistical machine translation systems have not been trained and examined in a similar multilingual setup. This work defines a multilingual statistical machine translation system as a many-to-one system capable of translating from any of a predefined set of source languages into a single target language. We study how the multilingual setting affects translation quality compared to a regular one-to-one machine translation system, and we examine how this setting affects related languages with different amounts of training data. The research is conducted on multiple languages from different language families. We also study the impact of different tokenizers and preprocessing methods. Specifically, we compare the default Moses tokenizer with the SentencePiece tokenizer, as well as dedicated Chinese and Japanese word splitters. We also investigate the impact of lowercasing and conduct our experiments on data of different sizes. We find that multilinguality gives a small gain across all of the metrics. Languages with a sufficient amount of good-quality training data do not affect the quality of related languages with lower-quality data. The SentencePiece tokenizer shows lower BLEU scores on average but outperforms the other tokenizers on the chrF++ and METEOR metrics. Lowercasing increases the scores of all metrics in all scenarios.
ISSN:	2305-7254, 2343-0737
DOI:	10.23919/FRUCT61870.2024.10516416