Link: http://arxiv.org/abs/2502.11455v1
PDF Link: http://arxiv.org/pdf/2502.11455v1
Summary: Safety alignment is critical in pre-training large language models (LLMs) to generate responses aligned with human values and refuse harmful queries. Unlike LLMs, the current safety alignment of VLMs is often achieved with post-hoc safety fine-tuning. However, these methods are less effective against white-box attacks. To address this, we propose $\textit{Adversary-aware DPO (ADPO)}$, a novel training framework that explicitly considers adversarial perturbations. $\textit{Adversary-aware DPO (ADPO)}$ integrates adversarial training into DPO to enhance the safety alignment of VLMs under worst-case adversarial perturbations. $\textit{ADPO}$ introduces two key components: (1) an adversarially trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversarial-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining these innovations, $\textit{ADPO}$ ensures that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that $\textit{ADPO}$ outperforms baselines in the safety alignment and general utility of VLMs.
Published on arXiv on: 2025-02-17T05:28:47Z
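The two ingredients the abstract describes, a preference loss computed against an adversarially trained reference model and worst-case input perturbations, can be illustrated with a short sketch. The snippet below is not the paper's implementation: it assumes a PyTorch setup, uses the standard DPO objective, a generic PGD inner loop to approximate the worst-case image perturbation, and a toy linear stand-in model with placeholder log-probs (`safe_logp`, the `eps`/`alpha`/`steps` values) purely for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_win_logp, policy_lose_logp, ref_win_logp, ref_lose_logp, beta=0.1):
    # Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin)),
    # where the reference log-probs would come from the adversarially trained reference model.
    policy_margin = policy_win_logp - policy_lose_logp
    ref_margin = ref_win_logp - ref_lose_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

def pgd_perturb(image, loss_fn, steps=3, eps=8 / 255, alpha=2 / 255):
    # Projected gradient ascent in an L-inf ball: search for the image
    # perturbation that maximizes loss_fn (a worst-case visual attack).
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(image + delta)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascent step on the perturbation
            delta.clamp_(-eps, eps)              # project back into the eps-ball
        delta.grad.zero_()
    return (image + delta).detach()

# Toy stand-in for a VLM: a linear map from image pixels to the log-prob of
# the preferred (safe) response. Everything below is illustrative only.
policy = torch.nn.Linear(3 * 8 * 8, 1)
image = torch.rand(1, 3, 8, 8)

def safe_logp(img):
    return policy(img.flatten(1)).squeeze()

# Worst-case image: push the model away from the preferred response,
# then compute the preference loss on that perturbed input.
adv_image = pgd_perturb(image, lambda img: -safe_logp(img))
policy.zero_grad()  # discard gradients accumulated during the attack
loss = dpo_loss(
    policy_win_logp=safe_logp(adv_image),
    policy_lose_logp=torch.tensor(-5.0),   # placeholder log-prob of the rejected response
    ref_win_logp=torch.tensor(-1.0),       # placeholder from the adversarially trained reference
    ref_lose_logp=torch.tensor(-4.0),
    beta=0.1,
)
loss.backward()
```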