PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation
https://aclanthology.org/2023.loresmt-1.3
Format: Article
Language: English
Abstract: Multilingual pre-training significantly improves many multilingual NLP tasks,
including machine translation. Most existing methods are based on some variants
of masked language modeling and text-denoising objectives on monolingual data.
Pre-training on monolingual data alone, however, ignores the parallel data
available for many language pairs. Other works do integrate human-generated
parallel translation data into their pre-training; such data is helpful but
remains limited even for high-resource language pairs. This paper introduces
a novel semi-supervised
method, SPDG, that generates high-quality pseudo-parallel data for multilingual
pre-training. First, a denoising model is pre-trained on monolingual data to
reorder, add, remove, and substitute words, enhancing the pre-training
documents' quality. Then, we generate different pseudo-translations for each
pre-training document using dictionaries for word-by-word translation and
applying the pre-trained denoising model. The resulting pseudo-parallel data is
then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our
experiments show that PEACH outperforms existing approaches used in training
mT5 and mBART on various translation tasks, including supervised, zero- and
few-shot scenarios. Moreover, PEACH's ability to transfer knowledge between
similar languages makes it particularly useful for low-resource languages. Our
results demonstrate that, given high-quality dictionaries for generating
accurate pseudo-parallel data, PEACH can be valuable for low-resource languages.
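The abstract outlines a concrete pipeline: pre-train a denoising model on corrupted monolingual text, draft a pseudo-translation of each document word by word from a dictionary, then let the denoiser repair the draft. Below is a minimal Python sketch of that flow under stated assumptions: the toy dictionary, the corruption probabilities, and the helper names (`word_by_word`, `corrupt`, `pseudo_translation`) are all illustrative, and the `denoiser` argument is a stand-in for the paper's pre-trained model, not its implementation.

```python
import random

# Toy German-English dictionary; the entries and language pair are
# illustrative assumptions, not the resources used in the paper.
DE_EN = {"das": "the", "alte": "old", "haus": "house",
         "steht": "stands", "dort": "there"}

def word_by_word(tokens, dictionary):
    """Dictionary lookup per token; out-of-vocabulary tokens pass
    through unchanged (a plausible fallback, not specified in the abstract)."""
    return [dictionary.get(t.lower(), t) for t in tokens]

def corrupt(tokens, vocab, p=0.15, rng=random):
    """Build a noisy input for denoising pre-training by randomly removing,
    substituting, adding, and reordering words -- the four corruption
    operations the abstract names. Probabilities are assumptions."""
    out = []
    for t in tokens:
        r = rng.random()
        if r < p:                              # remove the word
            continue
        elif r < 2 * p:                        # substitute a random word
            out.append(rng.choice(vocab))
        elif r < 3 * p:                        # add a spurious word
            out.extend([t, rng.choice(vocab)])
        else:
            out.append(t)
    if len(out) > 1 and rng.random() < p:      # reorder an adjacent pair
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def pseudo_translation(src_tokens, dictionary, denoiser):
    """SPDG-style generation: a word-by-word dictionary draft is cleaned
    up by the pre-trained denoising model, which is expected to fix word
    order and lexical choice."""
    return denoiser(word_by_word(src_tokens, dictionary))

# Denoising pre-training pair: (corrupted sentence, original sentence).
sentence = "the old house stands there".split()
pair = (corrupt(sentence, list(DE_EN.values())), sentence)

# Pseudo-parallel pair; an identity lambda stubs the denoising model.
src = "das alte Haus steht dort".split()
print(src, "->", pseudo_translation(src, DE_EN, denoiser=lambda ts: ts))
```

In the method the abstract describes, the resulting (document, pseudo-translation) pairs then feed the sequence-to-sequence pre-training of PEACH; the identity lambda above only marks where the pre-trained denoising model would plug in.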
DOI: 10.48550/arxiv.2304.01282