Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation

Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation bi...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Cantini, Riccardo, Cosenza, Giada, Orsino, Alessio, Talia, Domenico
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Cantini, Riccardo
Cosenza, Giada
Orsino, Alessio
Talia, Domenico
description Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation biases, along with common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to effectively reveal hidden biases of LLMs, testing their adversarial robustness against jailbreak prompts specially crafted for bias elicitation. Extensive experiments are conducted using the most widespread LLMs at different scales, confirming that LLMs can still be manipulated to produce biased or inappropriate responses, despite their advanced capabilities and sophisticated alignment processes. Our findings underscore the importance of enhancing mitigation techniques to address these safety issues, toward a more sustainable and inclusive artificial intelligence.
doi_str_mv 10.48550/arxiv.2407.08441
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2407_08441</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2407_08441</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2407_084413</originalsourceid><addsrcrecordid>eNqFjkEPwVAQhN_FQfADnOwfUC0VblJCREhE3JvFajaePtl9Gv332sbdZWaSmUw-Y_pRGMTz6TQcoXy4CMZxOAvCeRxHbVMkQrBHyWrNszdW4eBuZBVOhNaWsGTU4UaIFrBDthchfMBR3PPlFe5OIFElVc4zSG4FiaIwWji5y1t9XjXgXXMCa8tX9ujZ5V3TuqNV6v28Ywab9Xm1HTaE6Uv4iVKmNWnakE7-L76MyUoT</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation</title><source>arXiv.org</source><creator>Cantini, Riccardo ; Cosenza, Giada ; Orsino, Alessio ; Talia, Domenico</creator><creatorcontrib>Cantini, Riccardo ; Cosenza, Giada ; Orsino, Alessio ; Talia, Domenico</creatorcontrib><description>Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation biases, along with common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to effectively reveal hidden biases of LLMs, testing their adversarial robustness against jailbreak prompts specially crafted for bias elicitation. Extensive experiments are conducted using the most widespread LLMs at different scales, confirming that LLMs can still be manipulated to produce biased or inappropriate responses, despite their advanced capabilities and sophisticated alignment processes. Our findings underscore the importance of enhancing mitigation techniques to address these safety issues, toward a more sustainable and inclusive artificial intelligence.</description><identifier>DOI: 10.48550/arxiv.2407.08441</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computation and Language</subject><creationdate>2024-07</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2407.08441$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2407.08441$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Cantini, Riccardo</creatorcontrib><creatorcontrib>Cosenza, Giada</creatorcontrib><creatorcontrib>Orsino, Alessio</creatorcontrib><creatorcontrib>Talia, Domenico</creatorcontrib><title>Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation</title><description>Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation biases, along with common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to effectively reveal hidden biases of LLMs, testing their adversarial robustness against jailbreak prompts specially crafted for bias elicitation. Extensive experiments are conducted using the most widespread LLMs at different scales, confirming that LLMs can still be manipulated to produce biased or inappropriate responses, despite their advanced capabilities and sophisticated alignment processes. Our findings underscore the importance of enhancing mitigation techniques to address these safety issues, toward a more sustainable and inclusive artificial intelligence.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computation and Language</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNqFjkEPwVAQhN_FQfADnOwfUC0VblJCREhE3JvFajaePtl9Gv332sbdZWaSmUw-Y_pRGMTz6TQcoXy4CMZxOAvCeRxHbVMkQrBHyWrNszdW4eBuZBVOhNaWsGTU4UaIFrBDthchfMBR3PPlFe5OIFElVc4zSG4FiaIwWji5y1t9XjXgXXMCa8tX9ujZ5V3TuqNV6v28Ywab9Xm1HTaE6Uv4iVKmNWnakE7-L76MyUoT</recordid><startdate>20240711</startdate><enddate>20240711</enddate><creator>Cantini, Riccardo</creator><creator>Cosenza, Giada</creator><creator>Orsino, Alessio</creator><creator>Talia, Domenico</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240711</creationdate><title>Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation</title><author>Cantini, Riccardo ; Cosenza, Giada ; Orsino, Alessio ; Talia, Domenico</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-arxiv_primary_2407_084413</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Computation and Language</topic><toplevel>online_resources</toplevel><creatorcontrib>Cantini, Riccardo</creatorcontrib><creatorcontrib>Cosenza, Giada</creatorcontrib><creatorcontrib>Orsino, Alessio</creatorcontrib><creatorcontrib>Talia, Domenico</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Cantini, Riccardo</au><au>Cosenza, Giada</au><au>Orsino, Alessio</au><au>Talia, Domenico</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation</atitle><date>2024-07-11</date><risdate>2024</risdate><abstract>Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation biases, along with common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to effectively reveal hidden biases of LLMs, testing their adversarial robustness against jailbreak prompts specially crafted for bias elicitation. Extensive experiments are conducted using the most widespread LLMs at different scales, confirming that LLMs can still be manipulated to produce biased or inappropriate responses, despite their advanced capabilities and sophisticated alignment processes. Our findings underscore the importance of enhancing mitigation techniques to address these safety issues, toward a more sustainable and inclusive artificial intelligence.</abstract><doi>10.48550/arxiv.2407.08441</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2407.08441
ispartof
issn
language eng
recordid cdi_arxiv_primary_2407_08441
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computation and Language
title Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T05%3A59%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Are%20Large%20Language%20Models%20Really%20Bias-Free?%20Jailbreak%20Prompts%20for%20Assessing%20Adversarial%20Robustness%20to%20Bias%20Elicitation&rft.au=Cantini,%20Riccardo&rft.date=2024-07-11&rft_id=info:doi/10.48550/arxiv.2407.08441&rft_dat=%3Carxiv_GOX%3E2407_08441%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true