Generative Data Augmentation for Commonsense Reasoning
Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance. However, manual curation of training examples is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit on. We investigate G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting. Our approach generates synthetic examples using pretrained language models, and selects the most informative and diverse set of examples for data augmentation. In experiments with multiple commonsense reasoning benchmarks, G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation, and establishes a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA. Further, in addition to improvements in in-distribution accuracy, G-DAUG^C-augmented training also enhances out-of-distribution generalization, showing greater robustness against adversarial or perturbed examples. Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance. Our findings encourage future research toward generative data augmentation to enhance both in-distribution learning and out-of-distribution generalization.
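The abstract outlines a generate-then-select recipe: synthesize candidate training examples with a pretrained language model, then keep only an informative and diverse subset for augmentation. The sketch below illustrates that general idea only; the model names (gpt2, all-MiniLM-L6-v2), the prompt, and the greedy cosine-diversity heuristic are illustrative assumptions, not the paper's actual generation setup or its informativeness/diversity selection criteria.

```python
# Minimal sketch of a generate-then-select augmentation loop, assuming
# Hugging Face transformers for generation and sentence-transformers for
# embeddings. The selection heuristic is a stand-in for the paper's
# informativeness/diversity criteria, chosen only for illustration.
import torch
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# 1) Generate synthetic candidates from a prompt with a pretrained LM.
generator = pipeline("text-generation", model="gpt2")
prompt = "Commonsense question: Why would someone carry an umbrella?"
candidates = [
    out["generated_text"]
    for out in generator(prompt, max_new_tokens=30,
                         num_return_sequences=8, do_sample=True)
]

# 2) Greedily keep a diverse subset: repeatedly add the candidate that is
#    least similar (by cosine similarity) to anything already selected.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = torch.nn.functional.normalize(
    torch.tensor(embedder.encode(candidates)), dim=-1)

selected = [0]
while len(selected) < 4:
    sims = emb @ emb[selected].T           # similarity to selected items
    closest = sims.max(dim=1).values       # nearest selected neighbor per candidate
    closest[selected] = float("inf")       # never re-pick a selected candidate
    selected.append(int(closest.argmin())) # pick the most novel candidate

synthetic_examples = [candidates[i] for i in selected]
print(synthetic_examples)
```

In the setting the abstract describes, such selected synthetic examples would augment the original human-annotated training data when fine-tuning the downstream commonsense reasoning model.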
Saved in:
Published in: | arXiv.org, 2020-11 |
---|---|
Main authors: | Yang, Yiben; Malaviya, Chaitanya; Fernandez, Jared; Swayamdipta, Swabha; Le Bras, Ronan; Wang, Ji-Ping; Bhagavatula, Chandra; Choi, Yejin; Downey, Doug |
Format: | Article |
Language: | English |
Subjects: | Annotations; Computer Science - Computation and Language; Data augmentation; Learning; Reasoning; Training |
Online access: | Full text |
DOI: | 10.48550/arXiv.2004.11546 |
EISSN: | 2331-8422 |