Learning diverse attacks on large language models for robust red-teaming and safety tuning

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.
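
For readers who want a concrete picture of the objective involved, here is a minimal sketch in PyTorch. It is not the authors' implementation: the trajectory-balance loss is a standard GFlowNet objective, and the names (log_pf for the attacker's log-likelihood of a sampled prompt, log_reward for a toxicity-classifier-based reward) are illustrative assumptions about how such a setup could be wired.

    import torch

    # Sketch of a GFlowNet trajectory-balance objective for red-teaming.
    # Assumption (not from the paper's code): log_pf is the attacker LM's
    # log-probability of a sampled prompt, and log_reward scores how
    # reliably that prompt elicits a harmful response from the target LLM,
    # e.g. via an auxiliary toxicity classifier.

    log_z = torch.nn.Parameter(torch.zeros(1))  # learned estimate of log Z

    def trajectory_balance_loss(log_pf: torch.Tensor,
                                log_reward: torch.Tensor) -> torch.Tensor:
        # Trajectory balance: (log Z + log p_F(x) - log R(x))^2. Minimizing
        # it pushes the attacker toward sampling prompts with probability
        # proportional to reward, spreading mass over many attack modes
        # instead of collapsing onto a single high-reward prompt.
        return (log_z + log_pf - log_reward).pow(2).mean()

    # Dummy batch standing in for prompts sampled from the attacker:
    loss = trajectory_balance_loss(torch.tensor([-42.0]), torch.tensor([-1.5]))
    loss.backward()  # in practice, gradients also flow into the attacker

Sampling proportionally to reward, rather than maximizing it, is what distinguishes this from the RL fine-tuning that the abstract describes as prone to mode collapse.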


Bibliographic details
Main authors: Lee, Seanie; Kim, Minsu; Cherif, Lynn; Dobre, David; Lee, Juho; Hwang, Sung Ju; Kawaguchi, Kenji; Gidel, Gauthier; Bengio, Yoshua; Malkin, Nikolay; Jain, Moksh
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Cryptography and Security; Computer Science - Learning
creator Lee, Seanie
Kim, Minsu
Cherif, Lynn
Dobre, David
Lee, Juho
Hwang, Sung Ju
Kawaguchi, Kenji
Gidel, Gauthier
Bengio, Yoshua
Malkin, Nikolay
Jain, Moksh
description Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.
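The "secondary smoothing phase" mentioned in the description can be pictured as a maximum-likelihood pass over high-reward prompts collected during GFlowNet fine-tuning. The sketch below is an assumption-laden illustration, not the authors' code: the buffer contents, reward threshold, and Hugging-Face-style model interface are all hypothetical.

    import torch.nn.functional as F

    def mle_smoothing_step(attacker, tokenizer, buffer, threshold, optimizer):
        # Hypothetical smoothing pass: refit the attacker by maximum
        # likelihood on buffered prompts whose reward cleared a threshold.
        for prompt, reward in buffer:
            if reward < threshold:
                continue  # keep only prompts that reliably elicited harm
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            logits = attacker(ids).logits[:, :-1]  # next-token predictions
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   ids[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

The point of such a pass would be to consolidate the diverse high-reward prompts discovered during GFlowNet training into a single well-behaved sampling distribution.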
doi_str_mv 10.48550/arxiv.2405.18540
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2405.18540
language eng
recordid cdi_arxiv_primary_2405_18540
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Cryptography and Security
Computer Science - Learning
title Learning diverse attacks on large language models for robust red-teaming and safety tuning
url https://arxiv.org/abs/2405.18540