Cross-functional Analysis of Generalization in Behavioral Learning

In behavioral testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs. Optimizing performance on the behavioral tests during training would improve coverage of phenomena not sufficiently re...

Bibliographic details
Published in:	Transactions of the Association for Computational Linguistics, 2023-08, Vol.11, p.1066-1081
Main authors:	Luz de Araujo, Pedro Henrique; Roth, Benjamin
Format:	Article
Language:	English
Subjects:	Behavior; Computer science; Data mining; Feedback; Functional analysis; Generalization; Learning; Linguistics; Machine learning; Natural language processing; Optimization; Reading comprehension; Regularization; Sentiment analysis
Online access:	Full text

container_end_page	1081
container_issue
container_start_page	1066
container_title	Transactions of the Association for Computational Linguistics
container_volume	11
creator	Luz de Araujo, Pedro Henrique; Roth, Benjamin
description	In behavioral testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs. Optimizing performance on the behavioral tests during training would improve coverage of phenomena not sufficiently represented in the i.i.d. data and could lead to seemingly more robust models. However, there is the risk that the model narrowly captures spurious correlations from the behavioral test suite, leading to overestimation and misrepresentation of model performance, one of the original pitfalls of traditional evaluation. In this work, we introduce an analysis method for evaluating behavioral learning that considers generalization across dimensions of different granularity levels. We optimize behavior-specific loss functions and evaluate models on several partitions of the behavioral test suite controlled to leave out specific phenomena. An aggregate score measures generalization to unseen functionalities (or overfitting). We use the method to examine three representative NLP tasks (sentiment analysis, paraphrase identification, and reading comprehension) and compare the impact of a diverse set of regularization and domain generalization methods on generalization performance.
doi_str_mv	10.1162/tacl_a_00590
format	Article
cop	One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA
date	2023-08-15
doaj_id	oai_doaj_org_article_08a134b2cab2421499773335120e819f
eissn	2307-387X
genre	article
oa	free_for_read
pqid	2893946803
pub	MIT Press
rights	2023. This work is published under https://creativecommons.org/licenses/by/4.0/legalcode (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
sourceid	proquest_mit_j
toplevel	peer_reviewed; online_resources
tpages	16
fulltext	fulltext
identifier	ISSN: 2307-387X
ispartof	Transactions of the Association for Computational Linguistics, 2023-08, Vol.11, p.1066-1081
issn	2307-387X 2307-387X
language	eng
recordid	cdi_mit_journals_10_1162_tacl_a_00590
source	DOAJ Directory of Open Access Journals; Free E-Journal (出版社公開部分のみ）; ProQuest Central (Alumni); ProQuest Central UK/Ireland; ProQuest Central
subjects	Behavior; Computer science; Data mining; Feedback; Functional analysis; Generalization; Learning; Linguistics; Machine learning; Natural language processing; Optimization; Reading comprehension; Regularization; Sentiment analysis
title	Cross-functional Analysis of Generalization in Behavioral Learning
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T06%3A40%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_mit_j&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Cross-functional%20Analysis%20of%20Generalization%20in%20Behavioral%20Learning&rft.jtitle=Transactions%20of%20the%20Association%20for%20Computational%20Linguistics&rft.au=Luz%20de%20Araujo,%20Pedro%20Henrique&rft.date=2023-08-15&rft.volume=11&rft.spage=1066&rft.epage=1081&rft.pages=1066-1081&rft.issn=2307-387X&rft.eissn=2307-387X&rft_id=info:doi/10.1162/tacl_a_00590&rft_dat=%3Cproquest_mit_j%3E2893946803%3C/proquest_mit_j%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2893946803&rft_id=info:pmid/&rft_doaj_id=oai_doaj_org_article_08a134b2cab2421499773335120e819f&rfr_iscdi=true