Data Augmentation Using Pretrained Models in Japanese Grammatical Error Correction

Bibliographic Details
Published in: Transactions of the Japanese Society for Artificial Intelligence, 2023/07/01, Vol. 38(4), pp. A-L41_1-10
Main authors: Kato, Hideyoshi; Okabe, Masaaki; Kitano, Michiharu; Yadohisa, Hiroshi
Format: Article
Language: eng; jpn
Description
Abstract: Grammatical error correction (GEC) is commonly framed as a machine translation task that converts an ungrammatical sentence into a grammatical one. This task requires a large amount of parallel data consisting of pairs of ungrammatical and grammatical sentences. However, for the Japanese GEC task, only a limited amount of large-scale parallel data is available. Therefore, data augmentation (DA), which generates pseudo-parallel data, is being actively researched. Many previous studies have focused on generating ungrammatical sentences rather than grammatical ones. To address this gap, this study proposes BERT-DA, a DA algorithm that generates correct sentences using a pre-trained BERT model. In our experiments, we focused on two factors: the source data and the amount of data generated. Accounting for these factors made BERT-DA more effective. Based on evaluation results across multiple domains, the BERT-DA model outperformed the existing system in terms of Max Match and GLEU+.
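The abstract describes BERT-DA only at a high level. As an illustrative sketch of the general masked-language-model augmentation idea (mask tokens in a seed sentence, then refill them with model predictions to obtain a new grammatical sentence), the toy code below may help; `toy_mlm_predict` and all names here are hypothetical stand-ins, not the authors' implementation, which uses a real pretrained BERT model.

```python
import random

def toy_mlm_predict(tokens, position):
    # Hypothetical stand-in for a pretrained BERT masked LM: a real model
    # would predict the most likely token for the masked position from its
    # surrounding context.
    alternatives = {"movie": "film", "good": "great", "very": "really"}
    return alternatives.get(tokens[position], tokens[position])

def bert_da_augment(sentence, mask_ratio=0.3, seed=0):
    # Mask a fraction of token positions and refill them with LM
    # predictions, yielding a new (pseudo-)grammatical sentence that can
    # later be paired with an error-injected version as pseudo-parallel data.
    rng = random.Random(seed)
    tokens = sentence.split()
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    augmented = list(tokens)
    for pos in positions:
        augmented[pos] = toy_mlm_predict(tokens, pos)
    return " ".join(augmented)

print(bert_da_augment("the movie was very good"))
```

In a real BERT-DA setting, the predictor would be a pretrained masked language model and the two factors the abstract highlights (choice of source data and amount of generated data) become hyperparameters of this loop.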
ISSN: 1346-0714; 1346-8030
DOI: 10.1527/tjsai.38-4_A-L41