Neural machine translation-oriented data selection and training method

Bibliographic details
Main authors: JIANG YANHONG, YANG MURUN, LIU XINGYU
Format: Patent
Language: Chinese; English
Description
Summary: The invention discloses a data selection and training method for neural machine translation. The method comprises the following steps: constructing a monolingual corpus; preprocessing the monolingual corpus by cleaning, filtering, word segmentation and sub-word segmentation to obtain training data; fine-tuning a pre-trained model on the training data through language modeling; encoding the monolingual data of the two languages, comparing the vector similarity of the encoded sentences, and merging the two sentences with the highest similarity into a pseudo-bilingual pair to construct a pseudo-parallel corpus; processing the pseudo-parallel corpus with the word segmentation and sub-word segmentation method of the pre-trained model, and initializing the encoder parameters of a neural machine translation framework from the pre-trained model; pre-training a neural machine translation model on the processed pseudo-parallel corpus; and fine-tuning the neural mach
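
The similarity-based pairing step described in the summary can be illustrated with a minimal sketch. It is not the patented implementation: the summary only says "vector similarity", so the use of cosine similarity, the greedy best-match rule per source sentence, and the names cosine and mine_pseudo_pairs are all assumptions made for illustration. Sentence vectors are assumed to have been produced beforehand by the fine-tuned pre-trained model; here they are hand-written toy values.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros).

    Assumption: the source does not name the similarity measure.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_pseudo_pairs(
    src_sents: Sequence[str],
    src_vecs: Sequence[Sequence[float]],
    tgt_sents: Sequence[str],
    tgt_vecs: Sequence[Sequence[float]],
) -> List[Tuple[str, str, float]]:
    """Pair each source sentence with its most similar target sentence.

    Hypothetical helper: a greedy best match per source sentence.
    Returns (source, target, similarity) triples.
    """
    if not tgt_vecs:
        return []
    pairs = []
    for sent, vec in zip(src_sents, src_vecs):
        # Score every target vector and keep the index with the highest similarity.
        best_idx, best_sim = max(
            ((j, cosine(vec, tv)) for j, tv in enumerate(tgt_vecs)),
            key=lambda item: item[1],
        )
        pairs.append((sent, tgt_sents[best_idx], best_sim))
    return pairs


# Toy data: the vectors stand in for encoder outputs and are made up.
src_sents = ["the cat sleeps", "I drink milk"]
src_vecs = [[1.0, 0.0], [0.0, 1.0]]
tgt_sents = ["je bois du lait", "le chat dort"]
tgt_vecs = [[0.1, 0.9], [0.9, 0.1]]

pairs = mine_pseudo_pairs(src_sents, src_vecs, tgt_sents, tgt_vecs)
for src, tgt, sim in pairs:
    print(f"{src} ||| {tgt} ||| {sim:.3f}")
```

The exhaustive comparison is quadratic in corpus size, so a real pipeline would probably need an approximate nearest-neighbour index and a similarity threshold to discard weak matches. Neither is specified in the summary.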