RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process

In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:arXiv.org 2024-11
Hauptverfasser: Wang, Peiran, Liu, Xiaogeng, Xiao, Chaowei
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Wang, Peiran
Liu, Xiaogeng
Xiao, Chaowei
description In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves integrating the decomposition of the jailbreak prompt into the user's original query into a one-shot learning example to effectively teach the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user's prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3116452568</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3116452568</sourcerecordid><originalsourceid>FETCH-proquest_journals_31164525683</originalsourceid><addsrcrecordid>eNqNi9EKgjAYRkcQJOU7DLoWdHMm3UUW0ZVE0KVM_dWpOds_e_4MeoCuDnznOwviMM4DLw4ZWxEXsfV9n0U7JgR3yOMGabKnCVQwlGqo6VWqPjcgO3qwVhYdtY3RU91QSW9gjYK37L1cIpQ0Nfo52rktZmpUVunhOxaAuCHLSvYI7o9rsj2f7seLNxr9mgBt1urJDLPKeBBEoWAiivl_rw_4L0Eu</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3116452568</pqid></control><display><type>article</type><title>RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process</title><source>Free E- Journals</source><creator>Wang, Peiran ; Liu, Xiaogeng ; Xiao, Chaowei</creator><creatorcontrib>Wang, Peiran ; Liu, Xiaogeng ; Xiao, Chaowei</creatorcontrib><description>In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves integrating the decomposition of the jailbreak prompt into the user's original query into a one-shot learning example to effectively teach the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user's prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Decomposition ; Ethics ; Large language models ; Prompt engineering ; Retrieval</subject><ispartof>arXiv.org, 2024-11</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Wang, Peiran</creatorcontrib><creatorcontrib>Liu, Xiaogeng</creatorcontrib><creatorcontrib>Xiao, Chaowei</creatorcontrib><title>RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process</title><title>arXiv.org</title><description>In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves integrating the decomposition of the jailbreak prompt into the user's original query into a one-shot learning example to effectively teach the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user's prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.</description><subject>Decomposition</subject><subject>Ethics</subject><subject>Large language models</subject><subject>Prompt engineering</subject><subject>Retrieval</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNi9EKgjAYRkcQJOU7DLoWdHMm3UUW0ZVE0KVM_dWpOds_e_4MeoCuDnznOwviMM4DLw4ZWxEXsfV9n0U7JgR3yOMGabKnCVQwlGqo6VWqPjcgO3qwVhYdtY3RU91QSW9gjYK37L1cIpQ0Nfo52rktZmpUVunhOxaAuCHLSvYI7o9rsj2f7seLNxr9mgBt1urJDLPKeBBEoWAiivl_rw_4L0Eu</recordid><startdate>20241129</startdate><enddate>20241129</enddate><creator>Wang, Peiran</creator><creator>Liu, Xiaogeng</creator><creator>Xiao, Chaowei</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20241129</creationdate><title>RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process</title><author>Wang, Peiran ; Liu, Xiaogeng ; Xiao, Chaowei</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31164525683</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Decomposition</topic><topic>Ethics</topic><topic>Large language models</topic><topic>Prompt engineering</topic><topic>Retrieval</topic><toplevel>online_resources</toplevel><creatorcontrib>Wang, Peiran</creatorcontrib><creatorcontrib>Liu, Xiaogeng</creatorcontrib><creatorcontrib>Xiao, Chaowei</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Wang, Peiran</au><au>Liu, Xiaogeng</au><au>Xiao, Chaowei</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process</atitle><jtitle>arXiv.org</jtitle><date>2024-11-29</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves integrating the decomposition of the jailbreak prompt into the user's original query into a one-shot learning example to effectively teach the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user's prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-11
issn 2331-8422
language eng
recordid cdi_proquest_journals_3116452568
source Free E- Journals
subjects Decomposition
Ethics
Large language models
Prompt engineering
Retrieval
title RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-20T00%3A42%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=RePD:%20Defending%20Jailbreak%20Attack%20through%20a%20Retrieval-based%20Prompt%20Decomposition%20Process&rft.jtitle=arXiv.org&rft.au=Wang,%20Peiran&rft.date=2024-11-29&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3116452568%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3116452568&rft_id=info:pmid/&rfr_iscdi=true