HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment
With the introduction of the transformers architecture, LLMs have revolutionized the NLP field with ever more powerful models. Nevertheless, their development came up with several challenges. The exponential growth in computational power and reasoning capabilities of language models has heightened c...
Gespeichert in:
Hauptverfasser: | , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | With the introduction of the transformers architecture, LLMs have
revolutionized the NLP field with ever more powerful models. Nevertheless,
their development came up with several challenges. The exponential growth in
computational power and reasoning capabilities of language models has
heightened concerns about their security. As models become more powerful,
ensuring their safety has become a crucial focus in research. This paper aims
to address gaps in the current literature on jailbreaking techniques and the
evaluation of LLM vulnerabilities. Our contributions include the creation of a
novel dataset designed to assess the harmfulness of model outputs across
multiple harm levels, as well as a focus on fine-grained harm-level analysis.
Using this framework, we provide a comprehensive benchmark of state-of-the-art
jailbreaking attacks, specifically targeting the Vicuna 13B v1.5 model.
Additionally, we examine how quantization techniques, such as AWQ and GPTQ,
influence the alignment and robustness of models, revealing trade-offs between
enhanced robustness with regards to transfer attacks and potential increases in
vulnerability on direct ones. This study aims to demonstrate the influence of
harmful input queries on the complexity of jailbreaking techniques, as well as
to deepen our understanding of LLM vulnerabilities and improve methods for
assessing model robustness when confronted with harmful content, particularly
in the context of compression strategies. |
---|---|
DOI: | 10.48550/arxiv.2411.06835 |