Defending against Insertion-based Textual Backdoor Attacks via Attribution
Textual backdoor attacks, a novel class of attacks, have been shown to be effective at implanting a backdoor in a model during training. Defending against such backdoor attacks has become urgent and important. In this paper, we propose AttDef, an efficient attribution-based pipeline to defend against two insertion-based poisoning attacks, BadNL and InSent. Specifically, we regard tokens with larger attribution scores as potential triggers, since words with larger attribution contribute more to the false prediction and are therefore more likely to be poison triggers. Additionally, we utilize an external pre-trained language model to distinguish whether an input is poisoned or not. We show that our proposed method generalizes sufficiently well in two common attack scenarios (poisoning training data and poisoning test data) and consistently improves on previous methods. For instance, AttDef successfully mitigates both attacks with an average accuracy of 79.97% (up 56.59%) and 48.34% (up 3.99%) under pre-training and post-training attack defense, respectively, achieving new state-of-the-art performance on prediction recovery over four benchmark datasets.
Saved in:
Published in: | arXiv.org, 2023-08 |
---|---|
Main authors: | Li, Jiazhao; Wu, Zhuofeng; Wei, Ping; Xiao, Chaowei; Vinod Vydiswaran, V G |
Format: | Article |
Language: | eng |
Subjects: | Insertion; Poisoning; Poisons; Training |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Li, Jiazhao; Wu, Zhuofeng; Wei, Ping; Xiao, Chaowei; Vinod Vydiswaran, V G |
description | Textual backdoor attacks, a novel class of attacks, have been shown to be effective at implanting a backdoor in a model during training. Defending against such backdoor attacks has become urgent and important. In this paper, we propose AttDef, an efficient attribution-based pipeline to defend against two insertion-based poisoning attacks, BadNL and InSent. Specifically, we regard tokens with larger attribution scores as potential triggers, since words with larger attribution contribute more to the false prediction and are therefore more likely to be poison triggers. Additionally, we utilize an external pre-trained language model to distinguish whether an input is poisoned or not. We show that our proposed method generalizes sufficiently well in two common attack scenarios (poisoning training data and poisoning test data) and consistently improves on previous methods. For instance, AttDef successfully mitigates both attacks with an average accuracy of 79.97% (up 56.59%) and 48.34% (up 3.99%) under pre-training and post-training attack defense, respectively, achieving new state-of-the-art performance on prediction recovery over four benchmark datasets. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-08 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2809962006 |
source | Free E-Journals |
subjects | Insertion; Poisoning; Poisons; Training |
title | Defending against Insertion-based Textual Backdoor Attacks via Attribution |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T04%3A45%3A06IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Defending%20against%20Insertion-based%20Textual%20Backdoor%20Attacks%20via%20Attribution&rft.jtitle=arXiv.org&rft.au=Li,%20Jiazhao&rft.date=2023-08-07&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2809962006%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2809962006&rft_id=info:pmid/&rfr_iscdi=true |
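The abstract sketches AttDef's core intuition: tokens with outsized attribution scores are treated as likely backdoor triggers and filtered out before the final prediction. Below is a minimal, self-contained sketch of that idea, assuming a gradient-times-embedding attribution scorer, a toy classifier, a mask token id of 0, and a mean-ratio threshold; these specifics are illustrative assumptions, not the authors' implementation (which, per the abstract, additionally uses an external pre-trained language model to decide whether an input is poisoned at all).

```python
# Illustrative sketch of attribution-based trigger filtering.
# Model, attribution scorer, and threshold are all assumptions for exposition.
import torch
import torch.nn.functional as F

class ToyClassifier(torch.nn.Module):
    """Stand-in text classifier: mean-pooled token embeddings -> 2-way logits."""
    def __init__(self, vocab_size=1000, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim)
        self.out = torch.nn.Linear(dim, 2)

    def forward(self, ids):
        return self.out(self.emb(ids).mean(dim=0, keepdim=True))

def token_attributions(model, ids, label):
    """Gradient-x-embedding saliency per token (one common attribution choice)."""
    emb = model.emb(ids).detach().requires_grad_(True)   # leaf tensor for grads
    logits = model.out(emb.mean(dim=0, keepdim=True))    # replays the forward pass
    F.cross_entropy(logits, torch.tensor([label])).backward()
    return (emb.grad * emb).sum(dim=-1).abs()            # one score per token

def filter_suspected_triggers(model, ids, mask_id=0, ratio=2.0):
    """Mask tokens whose attribution is far above the mean, then re-predict."""
    label = model(ids).argmax(dim=-1).item()
    scores = token_attributions(model, ids, label)
    suspect = scores > ratio * scores.mean()             # heuristic threshold (assumption)
    cleaned = ids.clone()
    cleaned[suspect] = mask_id                           # neutralize suspected triggers
    return model(cleaned).argmax(dim=-1).item(), suspect

model = ToyClassifier()
ids = torch.randint(1, 1000, (12,))                      # a random 12-token "sentence"
pred, flagged = filter_suspected_triggers(model, ids)
print("recovered prediction:", pred,
      "| flagged positions:", flagged.nonzero().flatten().tolist())
```

The sketch only shows the attribution-and-mask step; how AttDef actually scores attributions, sets its threshold, and combines this with the poisoned-input detector is specified in the paper itself, not in this record.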