
Efficient Safety Retrofitting Against Jailbreaking for LLMs

Link: http://arxiv.org/abs/2502.13603v1

PDF Link: http://arxiv.org/pdf/2502.13603v1

Summary: Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data, bypassing the need for explicit reward models.

Its simplicity enables easy adaptation to various domains and safety requirements.

This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs.
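
For context, a minimal sketch of the standard DPO objective referenced here, written in PyTorch. This is the generic textbook formulation (Bradley-Terry preference likelihood over policy/reference log-ratios), not necessarily the paper's exact training code; the function name and arguments are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy to prefer the 'chosen' response over
    the 'rejected' one relative to a frozen reference model, with no reward model."""
    # Log-ratios of policy vs. reference for each response in the preference pair
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin, scaled by the temperature-like coefficient beta
    margin = beta * (chosen_logratios - rejected_logratios)
    # Negative log-sigmoid of the margin, averaged over the batch
    return -F.logsigmoid(margin).mean()
```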

We introduce Egida, a dataset expanded from multiple sources, which includes 27 different safety topics and 18 different attack styles, complemented with synthetic and human labels.

This data is used to boost the safety of state-of-the-art LLMs (Llama-3.1-8B/70B-Instruct, Qwen-2.5-7B/72B-Instruct) across topics and attack styles.

In addition to safety evaluations, we assess their post-alignment performance degradation on general-purpose tasks and their tendency to over-refusal.

Following the proposed methodology, trained models reduce their Attack Success Rate by 10%-30%, using small training efforts (2,000 samples) with low computational cost ($3 for 8B models, $20 for 72B models).
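
As a reference point, Attack Success Rate is typically computed as the fraction of adversarial prompts whose responses a safety judge labels unsafe. A minimal sketch, assuming a hypothetical `is_unsafe` judge callable (e.g. a Llama-Guard-style classifier); the exact judging protocol used in the paper may differ.

```python
def attack_success_rate(responses, is_unsafe):
    """Fraction of adversarial prompts whose model responses are judged unsafe."""
    unsafe_count = sum(1 for response in responses if is_unsafe(response))
    return unsafe_count / len(responses)
```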

Safety-aligned models generalize to unseen topics and attack styles, with the most successful attack style reaching a success rate around 5%.

Size and family are found to strongly influence model malleability towards safety, pointing at the importance of pre-training choices.

To validate our findings, a large independent assessment of human preference agreement with Llama-Guard-3-8B is conducted by the authors, and the associated dataset Egida-HSafe is released.

Overall, this study illustrates how affordable and accessible it is to enhance LLM safety using DPO, while outlining its current limitations.

All datasets and models are released to enable reproducibility and further research.

Published on arXiv on: 2025-02-19T10:33:18Z