Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of s...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	arXiv.org 2024-10
Hauptverfasser:	Ren, Qibing, Li, Hao, Liu, Dongrui, Xie, Zhanxu, Lu, Xiaoya, Yu, Qiao, Sha, Lei, Junchi Yan, Ma, Lizhuang, Shao, Jing
Format:	Artikel
Sprache:	eng
Schlagworte:	Datasets Large language models
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Ren, Qibing Li, Hao Liu, Dongrui Xie, Zhanxu Lu, Xiaoya Yu, Qiao Sha, Lei Junchi Yan Ma, Lizhuang Shao, Jing
description	This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs' knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3116751463</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3116751463</sourcerecordid><originalsourceid>FETCH-proquest_journals_31167514633</originalsourceid><addsrcrecordid>eNqNzbsKwjAYhuEgCBbtPfzgHGiSHsRNqiLSuujiVGKb2kNoNAev3wxegNM3PC98MxRQxgjexJQuUGjMEEURTTOaJCxAl73QvJdwV04bIdstlE7aHlunJyiKEs5eH1rwEXbW8noE22nlnh1cfY2b3tTqI7RoIJdOmBWat1waEf52idbHwy0_4ZdWb--2GvzR5KlihKRZQuKUsf-qL4OoPaA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3116751463</pqid></control><display><type>article</type><title>Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues</title><source>Open Access: Freely Accessible Journals by multiple vendors</source><creator>Ren, Qibing ; Li, Hao ; Liu, Dongrui ; Xie, Zhanxu ; Lu, Xiaoya ; Yu, Qiao ; Sha, Lei ; Junchi Yan ; Ma, Lizhuang ; Shao, Jing</creator><creatorcontrib>Ren, Qibing ; Li, Hao ; Liu, Dongrui ; Xie, Zhanxu ; Lu, Xiaoya ; Yu, Qiao ; Sha, Lei ; Junchi Yan ; Ma, Lizhuang ; Shao, Jing</creatorcontrib><description>This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs' knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Datasets ; Large language models</subject><ispartof>arXiv.org, 2024-10</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Ren, Qibing</creatorcontrib><creatorcontrib>Li, Hao</creatorcontrib><creatorcontrib>Liu, Dongrui</creatorcontrib><creatorcontrib>Xie, Zhanxu</creatorcontrib><creatorcontrib>Lu, Xiaoya</creatorcontrib><creatorcontrib>Yu, Qiao</creatorcontrib><creatorcontrib>Sha, Lei</creatorcontrib><creatorcontrib>Junchi Yan</creatorcontrib><creatorcontrib>Ma, Lizhuang</creatorcontrib><creatorcontrib>Shao, Jing</creatorcontrib><title>Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues</title><title>arXiv.org</title><description>This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs' knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack.</description><subject>Datasets</subject><subject>Large language models</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNqNzbsKwjAYhuEgCBbtPfzgHGiSHsRNqiLSuujiVGKb2kNoNAev3wxegNM3PC98MxRQxgjexJQuUGjMEEURTTOaJCxAl73QvJdwV04bIdstlE7aHlunJyiKEs5eH1rwEXbW8noE22nlnh1cfY2b3tTqI7RoIJdOmBWat1waEf52idbHwy0_4ZdWb--2GvzR5KlihKRZQuKUsf-qL4OoPaA</recordid><startdate>20241014</startdate><enddate>20241014</enddate><creator>Ren, Qibing</creator><creator>Li, Hao</creator><creator>Liu, Dongrui</creator><creator>Xie, Zhanxu</creator><creator>Lu, Xiaoya</creator><creator>Yu, Qiao</creator><creator>Sha, Lei</creator><creator>Junchi Yan</creator><creator>Ma, Lizhuang</creator><creator>Shao, Jing</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20241014</creationdate><title>Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues</title><author>Ren, Qibing ; Li, Hao ; Liu, Dongrui ; Xie, Zhanxu ; Lu, Xiaoya ; Yu, Qiao ; Sha, Lei ; Junchi Yan ; Ma, Lizhuang ; Shao, Jing</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31167514633</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Datasets</topic><topic>Large language models</topic><toplevel>online_resources</toplevel><creatorcontrib>Ren, Qibing</creatorcontrib><creatorcontrib>Li, Hao</creatorcontrib><creatorcontrib>Liu, Dongrui</creatorcontrib><creatorcontrib>Xie, Zhanxu</creatorcontrib><creatorcontrib>Lu, Xiaoya</creatorcontrib><creatorcontrib>Yu, Qiao</creatorcontrib><creatorcontrib>Sha, Lei</creatorcontrib><creatorcontrib>Junchi Yan</creatorcontrib><creatorcontrib>Ma, Lizhuang</creatorcontrib><creatorcontrib>Shao, Jing</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>SciTech Premium Collection (Proquest) (PQ_SDU_P3)</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content (ProQuest)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ren, Qibing</au><au>Li, Hao</au><au>Liu, Dongrui</au><au>Xie, Zhanxu</au><au>Lu, Xiaoya</au><au>Yu, Qiao</au><au>Sha, Lei</au><au>Junchi Yan</au><au>Ma, Lizhuang</au><au>Shao, Jing</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues</atitle><jtitle>arXiv.org</jtitle><date>2024-10-14</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs' knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-10
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_3116751463
source	Open Access: Freely Accessible Journals by multiple vendors
subjects	Datasets Large language models
title	Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-11T17%3A25%3A02IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Derail%20Yourself:%20Multi-turn%20LLM%20Jailbreak%20Attack%20through%20Self-discovered%20Clues&rft.jtitle=arXiv.org&rft.au=Ren,%20Qibing&rft.date=2024-10-14&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3116751463%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3116751463&rft_id=info:pmid/&rfr_iscdi=true