Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension

Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: Humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop. This allows us to explore questions such as the reproducibility of the adversarial effect, transfer from data collected with varying model-in-the-loop strengths, and generalization to data collected without a model. We find that training on adversarially collected samples leads to strong generalization to non-adversarially collected datasets, yet with progressive performance deterioration with increasingly stronger models-in-the-loop. Furthermore, we find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop. When trained on data collected with a BiDAF model in the loop, RoBERTa achieves 39.9 F1 on questions that it cannot answer when trained on SQuAD, only marginally lower than when trained on data collected using RoBERTa itself (41.0 F1).
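The following is a minimal sketch, in Python, of the model-in-the-loop acceptance check summarized in the abstract: an annotator's question is kept only if the model in the loop fails to answer it. The predict callable, the word-overlap F1 scorer, and the acceptance threshold are illustrative assumptions, not the authors' exact implementation.

# Sketch of a "beat the model" filter for adversarial human annotation.
# Assumptions (not from the paper): a generic predict(passage, question) -> str
# callable stands in for the model in the loop (e.g., BiDAF, BERT, RoBERTa),
# and a simple word-overlap F1 with a fixed threshold decides failure.
from collections import Counter
from typing import Callable

def token_f1(prediction: str, gold: str) -> float:
    # Word-overlap F1 between a predicted and a gold answer span.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def beats_the_model(passage: str,
                    question: str,
                    human_answer: str,
                    predict: Callable[[str, str], str],
                    f1_threshold: float = 0.4) -> bool:
    # Keep the annotation only if the model's answer scores below the threshold.
    model_answer = predict(passage, question)
    return token_f1(model_answer, human_answer) < f1_threshold

if __name__ == "__main__":
    # Toy usage with a stand-in "model" that always returns the first sentence.
    dummy_predict = lambda passage, question: passage.split(".")[0]
    passage = "The Eiffel Tower was completed in 1889. It is 330 metres tall."
    print(beats_the_model(passage, "How tall is the Eiffel Tower?",
                          "330 metres", dummy_predict))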

Bibliographic Details
Published in: Transactions of the Association for Computational Linguistics, 2020-01, Vol. 8, p. 662-678
Main authors: Bartolo, Max; Roberts, Alastair; Welbl, Johannes; Riedel, Sebastian; Stenetorp, Pontus
Format: Article
Language: English
Subjects: Annotations; Datasets; Generalization; Performance degradation; Questions; Reading comprehension
Online access: Full text
DOI: 10.1162/tacl_a_00338
Publisher: MIT Press
ISSN: 2307-387X