Data Augmentation Using Pretrained Models in Japanese Grammatical Error Correction
Grammatical error correction (GEC) is commonly formulated as a machine translation task that converts an ungrammatical sentence into a grammatical one. This task requires a large amount of parallel data consisting of pairs of ungrammatical and grammatical sentences. However, for the Japanese GEC...
Saved in:
Published in: | Transactions of the Japanese Society for Artificial Intelligence, 2023/07/01, Vol.38(4), pp.A-L41_1-10 |
---|---|
Main authors: | Kato, Hideyoshi ; Okabe, Masaaki ; Kitano, Michiharu ; Yadohisa, Hiroshi |
Format: | Article |
Language: | eng ; jpn |
Subjects: | Algorithms ; Data augmentation ; Error correction ; Error correction & detection ; Machine translation ; natural language processing ; proofreading ; pseudo data generation ; spelling check |
Online access: | Full text |
container_end_page | 10 |
---|---|
container_issue | 4 |
container_start_page | A-L41_1 |
container_title | Transactions of the Japanese Society for Artificial Intelligence |
container_volume | 38 |
creator | Kato, Hideyoshi ; Okabe, Masaaki ; Kitano, Michiharu ; Yadohisa, Hiroshi |
description | Grammatical error correction (GEC) is commonly formulated as a machine translation task that converts an ungrammatical sentence into a grammatical one. This task requires a large amount of parallel data consisting of pairs of ungrammatical and grammatical sentences. However, only a limited amount of large-scale parallel data is available for the Japanese GEC task. Therefore, data augmentation (DA), which generates pseudo-parallel data, is actively being researched. Many previous studies have focused on generating ungrammatical sentences rather than grammatical ones. To address this problem, this study proposes BERT-DA, a DA algorithm that generates correct sentences using a pretrained BERT model. Our experiments focused on two factors: the source data and the amount of data generated. Accounting for both factors made BERT-DA more effective. In evaluations across multiple domains, the BERT-DA model outperformed existing systems in terms of Max Match and GLEU+. |
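The paper's BERT-DA algorithm is not reproduced here, but the general mask-and-fill style of pseudo-parallel data generation that the abstract alludes to can be sketched as follows. This is a minimal illustration under stated assumptions: the `fill_mask` predictor is a hypothetical stand-in (stubbed with a lookup table so the sketch is self-contained), whereas a real setup would rank fill-in candidates with a pretrained masked language model such as BERT. Which side of the pair gets masked (grammatical targets vs. ungrammatical sources) differs across DA methods, and this sketch makes no claim about BERT-DA's exact procedure.

```python
import random

# Hypothetical stand-in for a pretrained masked language model.
# A real implementation would query BERT for fill-in candidates;
# a lookup table keeps the sketch self-contained and runnable.
MLM_STUB = {
    "went": ["walked", "traveled"],
    "quickly": ["rapidly", "swiftly"],
}

def fill_mask(tokens, idx):
    """Return candidate fillers for the masked position idx."""
    original = tokens[idx]
    # Fall back to the original token when the stub has no candidates.
    return MLM_STUB.get(original, [original])

def generate_pseudo_pairs(sentences, n_masks_per_sentence=2, seed=0):
    """Mask one token per pass and let the (stubbed) MLM refill it.

    Each filled-in sentence is paired with the original sentence,
    yielding pseudo-parallel (generated, original) pairs of the kind
    used to augment GEC training data.
    """
    rng = random.Random(seed)
    pairs = []
    for sent in sentences:
        tokens = sent.split()
        for _ in range(n_masks_per_sentence):
            idx = rng.randrange(len(tokens))
            for candidate in fill_mask(tokens, idx):
                generated = tokens[:idx] + [candidate] + tokens[idx + 1:]
                pairs.append((" ".join(generated), sent))
    return pairs

pairs = generate_pseudo_pairs(["she went home quickly"])
for generated, original in pairs:
    print(generated, "->", original)
```

In practice the candidate tokens would be the MLM's top-k predictions at the masked position, and low-probability fills could be filtered out before the pairs enter training.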
doi_str_mv | 10.1527/tjsai.38-4_A-L41 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 1346-0714 |
ispartof | Transactions of the Japanese Society for Artificial Intelligence, 2023/07/01, Vol.38(4), pp.A-L41_1-10 |
issn | 1346-0714 ; 1346-8030 |
language | eng ; jpn |
recordid | cdi_proquest_journals_2864892394 |
source | J-STAGE Free; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals |
subjects | Algorithms ; Data augmentation ; Error correction ; Error correction & detection ; Machine translation ; natural language processing ; proofreading ; pseudo data generation ; spelling check |
title | Data Augmentation Using Pretrained Models in Japanese Grammatical Error Correction |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-03T22%3A36%3A49IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Data%20Augmentation%20Using%20Pretrained%20Models%20in%20Japanese%20Grammatical%20Error%20Correction&rft.jtitle=Transactions%20of%20the%20Japanese%20Society%20for%20Artificial%20Intelligence&rft.au=Kato,%20Hideyoshi&rft.date=2023-07-01&rft.volume=38&rft.issue=4&rft.spage=A-L41_1&rft.epage=10&rft.pages=A-L41_1-10&rft.artnum=38-4_A-L41&rft.issn=1346-0714&rft.eissn=1346-8030&rft_id=info:doi/10.1527/tjsai.38-4_A-L41&rft_dat=%3Cproquest_cross%3E2864892394%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2864892394&rft_id=info:pmid/&rfr_iscdi=true |