RESHAPE: Reverse-Edited Synthetic Hypotheses for Automatic Post-Editing

Bibliographic Details
Published in:	IEEE Access, 2022, Vol. 10, pp. 28274-28282
Main Authors:	Lee, Wonkee; Jung, Baikjin; Shin, Jaehun; Lee, Jong-Hyeok
Format:	Article
Language:	English
Subjects:	Automatic post-editing; back-translation; Data models; Datasets; Decoding; decoding strategy; Editing; Feeds; Limiting; machine translation; synthetic data generation; Training; Training data; Transformers
Online Access:	Full text

Description
Abstract:	Synthetic training data has been extensively used to train Automatic Post-Editing (APE) models in many recent studies because the quantity of human-created data has been considered insufficient. However, the most widely used synthetic APE dataset, eSCAPE, overlooks respecting the minimal editing property of genuine data, and this defect may have been a limiting factor for the performance of APE models. This article suggests adapting back-translation to APE to constrain edit distance, while using stochastic sampling in decoding to maintain the diversity of outputs, to create a new synthetic APE dataset, RESHAPE. Our experiments show that (1) RESHAPE contains more samples resembling genuine APE data than eSCAPE does, and (2) using RESHAPE as new training data improves APE models' performance substantially over using eSCAPE.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2022.3154768