Generative Data Augmentation for Commonsense Reasoning

Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance. However, manual curation of training examples is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit on. We investigate G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting. Our approach generates synthetic examples using pretrained language models, and selects the most informative and diverse set of examples for data augmentation. In experiments with multiple commonsense reasoning benchmarks, G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation, and establishes a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA. Further, in addition to improvements in in-distribution accuracy, G-DAUG^C-augmented training also enhances out-of-distribution generalization, showing greater robustness against adversarial or perturbed examples. Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance. Our findings encourage future research toward generative data augmentation to enhance both in-distribution learning and out-of-distribution generalization.
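The abstract describes a two-stage pipeline: synthetic training examples are first sampled from a pretrained language model, and a subset is then selected to maximize informativeness and diversity before augmenting the original training set. The sketch below illustrates only the selection step under simplifying assumptions; the `informativeness` callable, the Jaccard-based diversity measure, and the greedy combination of the two are hypothetical stand-ins for the paper's actual scoring and selection procedures, not the authors' implementation.

```python
# Minimal, self-contained sketch of a "select informative and diverse
# synthetic examples" step. All names and heuristics here are illustrative
# assumptions, not the G-DAUG^C implementation.
from typing import Callable, List, Set


def token_set(text: str) -> Set[str]:
    """Lowercased whitespace tokens, used for a crude diversity measure."""
    return set(text.lower().split())


def jaccard_distance(a: str, b: str) -> float:
    """1 - Jaccard similarity of token sets; higher means more dissimilar."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def select_augmentation_set(
    candidates: List[str],
    informativeness: Callable[[str], float],
    k: int,
) -> List[str]:
    """Greedily pick k candidates, balancing informativeness and diversity.

    `candidates` stands in for synthetic examples sampled from a pretrained
    language model; `informativeness` is any scoring function (for example,
    the uncertainty of a task model trained on the original data).
    """
    selected: List[str] = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def combined(c: str) -> float:
            # Diversity = distance to the closest already-selected example.
            diversity = min((jaccard_distance(c, s) for s in selected), default=1.0)
            return informativeness(c) + diversity

        best = max(pool, key=combined)
        selected.append(best)
        pool.remove(best)
    return selected


if __name__ == "__main__":
    fake_candidates = [
        "The man put the ice in the cooler because it was hot outside.",
        "The man put the ice in the cooler because the cooler was warm.",
        "She watered the plant because the soil was dry.",
    ]
    # Placeholder informativeness score: slightly favors longer examples.
    print(select_augmentation_set(fake_candidates, lambda s: len(s) / 100.0, k=2))
```

In the paper's setting the candidates would be generated commonsense questions with answer options, and the informativeness score would come from a trained task model; the trivial length-based score above only keeps the example self-contained.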

Bibliographic Details
Published in: arXiv.org, 2020-11
Main Authors: Yang, Yiben; Malaviya, Chaitanya; Fernandez, Jared; Swayamdipta, Swabha; Le Bras, Ronan; Wang, Ji-Ping; Bhagavatula, Chandra; Choi, Yejin; Downey, Doug
Format: Article
Language: eng
Subjects: Annotations; Computer Science - Computation and Language; Data augmentation; Learning; Reasoning; Training
Online Access: Full text
container_title arXiv.org
creator Yang, Yiben
Malaviya, Chaitanya
Fernandez, Jared
Swayamdipta, Swabha
Le Bras, Ronan
Wang, Ji-Ping
Bhagavatula, Chandra
Choi, Yejin
Downey, Doug
description Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance. However, manual curation of training examples is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit on. We investigate G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting. Our approach generates synthetic examples using pretrained language models, and selects the most informative and diverse set of examples for data augmentation. In experiments with multiple commonsense reasoning benchmarks, G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation, and establishes a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA. Further, in addition to improvements in in-distribution accuracy, G-DAUG^C-augmented training also enhances out-of-distribution generalization, showing greater robustness against adversarial or perturbed examples. Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance. Our findings encourage future research toward generative data augmentation to enhance both in-distribution learning and out-of-distribution generalization.
doi_str_mv 10.48550/arxiv.2004.11546
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2020-11
issn 2331-8422
language eng
recordid cdi_arxiv_primary_2004_11546
source arXiv.org; Free E-Journals
subjects Annotations
Computer Science - Computation and Language
Data augmentation
Learning
Reasoning
Training
title Generative Data Augmentation for Commonsense Reasoning