Link: http://arxiv.org/abs/2501.12210v1
PDF Link: http://arxiv.org/pdf/2501.12210v1
Summary: Generative large language models (LLMs) such as LLaMA and ChatGPT have significantly transformed daily life and work by providing advanced insights.
However, as jailbreak attacks continue to circumvent built-in safety mechanisms, exploiting carefully crafted scenarios or tokens, the safety risks of LLMs have come into focus.
While numerous defense strategies, such as prompt detection, modification, and model fine-tuning, have been proposed to counter these attacks, a critical question arises: do these defenses compromise the utility and usability of LLMs for legitimate users? Existing research predominantly focuses on the effectiveness of defense strategies without thoroughly examining their impact on performance, leaving a gap in understanding the trade-offs between LLM safety and performance.
Our research addresses this gap by conducting a comprehensive study of the utility degradation, safety elevation, and exaggerated-safety escalation of LLMs under jailbreak defense strategies.
We propose USEBench, a novel benchmark designed to evaluate these aspects, along with USEIndex, a comprehensive metric for assessing overall model performance (an illustrative sketch of one possible composite-metric formulation appears after this entry).
Through experiments on seven state-of-the-art LLMs, we found that mainstream jailbreak defenses fail to ensure both safety and performance simultaneously.
Although model fine-tuning performs best overall, its effectiveness varies across LLMs.
Furthermore, vertical comparisons reveal that developers commonly prioritize performance over safety when iterating on or fine-tuning their LLMs.
Published on arXiv: 2025-01-21T15:24:29Z
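
The summary names USEIndex as a single metric combining utility, safety, and exaggerated-safety behavior, but does not state how the individual scores are aggregated. The sketch below is purely illustrative: it assumes a harmonic-mean-style composite over three hypothetical per-model rates in [0, 1], and is not the paper's actual definition of USEIndex.

```python
# Illustrative sketch only: the real USEIndex is defined in the paper and is
# not reproduced here. Assumed inputs (all hypothetical rates in [0, 1]):
#   utility            - task performance, higher is better
#   safety             - resistance to jailbreak prompts, higher is better
#   exaggerated_safety - over-refusal of benign prompts, lower is better

def composite_index(utility: float, safety: float, exaggerated_safety: float) -> float:
    """Combine three rates into a single score in [0, 1].

    A harmonic-mean-style aggregation penalizes models that trade one
    dimension for another, echoing the abstract's finding that defenses
    rarely preserve both safety and performance at once.
    """
    compliance = 1.0 - exaggerated_safety  # reward answering benign prompts
    scores = [utility, safety, compliance]
    if any(s <= 0.0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


if __name__ == "__main__":
    # Hypothetical numbers for two defended models; not results from the paper.
    print(composite_index(utility=0.82, safety=0.95, exaggerated_safety=0.30))  # ~0.81
    print(composite_index(utility=0.60, safety=0.99, exaggerated_safety=0.55))  # ~0.61
```

Under this assumed aggregation, the second model's higher safety score cannot compensate for its lost utility and increased over-refusal, which is the kind of trade-off the benchmark is described as measuring.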