Learning diverse attacks on large language models for robust red-teaming and safety tuning

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.
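
For readers who want a concrete picture of the objective involved, here is a minimal sketch in PyTorch. It is not the authors' implementation: the trajectory-balance loss is a standard GFlowNet objective, and the names (log_pf for the attacker's log-likelihood of a sampled prompt, log_reward for a toxicity-classifier-based reward) are illustrative assumptions about how such a setup could be wired.

    import torch

    # Sketch of a GFlowNet trajectory-balance objective for red-teaming.
    # Assumption (not from the paper's code): log_pf is the attacker LM's
    # log-probability of a sampled prompt, and log_reward scores how
    # reliably that prompt elicits a harmful response from the target LLM,
    # e.g. via an auxiliary toxicity classifier.

    log_z = torch.nn.Parameter(torch.zeros(1))  # learned estimate of log Z

    def trajectory_balance_loss(log_pf: torch.Tensor,
                                log_reward: torch.Tensor) -> torch.Tensor:
        # Trajectory balance: (log Z + log p_F(x) - log R(x))^2. Minimizing
        # it pushes the attacker toward sampling prompts with probability
        # proportional to reward, spreading mass over many attack modes
        # instead of collapsing onto a single high-reward prompt.
        return (log_z + log_pf - log_reward).pow(2).mean()

    # Dummy batch standing in for prompts sampled from the attacker:
    loss = trajectory_balance_loss(torch.tensor([-42.0]), torch.tensor([-1.5]))
    loss.backward()  # in practice, gradients also flow into the attacker

Sampling proportionally to reward, rather than maximizing it, is what distinguishes this from the RL fine-tuning that the abstract describes as prone to mode collapse.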


Bibliographic details
Main authors: Lee, Seanie; Kim, Minsu; Cherif, Lynn; Dobre, David; Lee, Juho; Hwang, Sung Ju; Kawaguchi, Kenji; Gidel, Gauthier; Bengio, Yoshua; Malkin, Nikolay; Jain, Moksh
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Cryptography and Security; Computer Science - Learning
creator Lee, Seanie
Kim, Minsu
Cherif, Lynn
Dobre, David
Lee, Juho
Hwang, Sung Ju
Kawaguchi, Kenji
Gidel, Gauthier
Bengio, Yoshua
Malkin, Nikolay
Jain, Moksh
description Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.
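The "secondary smoothing phase" mentioned in the description can be pictured as a maximum-likelihood pass over high-reward prompts collected during GFlowNet fine-tuning. The sketch below is an assumption-laden illustration, not the authors' code: the buffer contents, reward threshold, and Hugging-Face-style model interface are all hypothetical.

    import torch.nn.functional as F

    def mle_smoothing_step(attacker, tokenizer, buffer, threshold, optimizer):
        # Hypothetical smoothing pass: refit the attacker by maximum
        # likelihood on buffered prompts whose reward cleared a threshold.
        for prompt, reward in buffer:
            if reward < threshold:
                continue  # keep only prompts that reliably elicited harm
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            logits = attacker(ids).logits[:, :-1]  # next-token predictions
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   ids[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

The point of such a pass would be to consolidate the diverse high-reward prompts discovered during GFlowNet training into a single well-behaved sampling distribution.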
doi_str_mv 10.48550/arxiv.2405.18540
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2405.18540
language eng
recordid cdi_arxiv_primary_2405_18540
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Cryptography and Security
Computer Science - Learning
title Learning diverse attacks on large language models for robust red-teaming and safety tuning
url https://arxiv.org/abs/2405.18540