
Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training

Link: http://arxiv.org/abs/2502.11455v1

PDF Link: http://arxiv.org/pdf/2502.11455v1

Summary: Safety alignment is critical in pre-training large language models (LLMs) to generate responses aligned with human values and refuse harmful queries.

Unlike LLMs, the current safety alignment of vision language models (VLMs) is often achieved with post-hoc safety fine-tuning.

However, these methods are less effective against white-box attacks.

To address this, we propose $\textit{Adversary-aware DPO (ADPO)}$, a novel training framework that explicitly considers adversarial perturbations during training.

$\textit{Adversary-aware DPO (ADPO)}$ integrates adversarial training into DPO to enhance the safety alignment of VLMs under worst-case adversarial perturbations.

$\textit{ADPO}$ introduces two key components: (1) an adversarially trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversarial-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions.
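
As a rough illustration only (not the paper's exact formulation), the adversarial-aware objective can be pictured as the standard DPO loss evaluated under a worst-case perturbation $\delta$ of the visual input, bounded by an assumed budget $\epsilon$; the symbols $\pi_\theta$, $\pi_{\text{ref}}$, $\beta$, winner $y_w$, and loser $y_l$ follow the usual DPO notation and are assumptions here:

$$
\mathcal{L}_{\text{ADPO}}(\theta) = \mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \max_{\|\delta\| \le \epsilon} \; -\log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x + \delta)}{\pi_{\text{ref}}(y_w \mid x + \delta)} - \beta \log \frac{\pi_\theta(y_l \mid x + \delta)}{\pi_{\text{ref}}(y_l \mid x + \delta)} \right) \right]
$$

In this sketch, the inner maximization plays the role of the jailbreak attacker, while the reference model $\pi_{\text{ref}}$ is itself adversarially trained as described in component (1).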

By combining these innovations, $\textit{ADPO}$ ensures that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks.

Extensive experiments demonstrate that $\textit{ADPO}$ outperforms baselines in the safety alignment and general utility of VLMs.

Published on arXiv on: 2025-02-17T05:28:47Z