
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment

Link: http://arxiv.org/abs/2504.02193v1

PDF Link: http://arxiv.org/pdf/2504.02193v1

Summary: Aligning large language models (LLMs) with human values is an increasingly critical step in post-training.

Direct Preference Optimization (DPO) has emerged as a simple, yet effective alternative to reinforcement learning from human feedback (RLHF).
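For context, here is a minimal PyTorch-style sketch of the standard DPO objective (the general formulation, not the paper's own code). It assumes per-sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model have already been computed.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over per-sequence log-probabilities.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of the chosen/rejected response under the policy
    or the frozen reference model.
    """
    # Implicit rewards: log-prob ratios of the policy vs. the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO minimizes -log sigmoid of the chosen-minus-rejected reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```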

Synthetic preference data, with its low cost and high quality, enables effective alignment through single- or multi-model generated preference data.
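As a rough illustration of the two data regimes compared in the paper, the sketch below builds single-model pairs (the target model generates both chosen and rejected responses) versus multi-model pairs (a stronger model supplies the chosen response). The `generate` and `judge` helpers are hypothetical placeholders, not the paper's actual pipeline.

```python
def build_single_model_pair(prompt, target_model, judge):
    """Single-model generation: the target model produces both responses;
    a judge (hypothetical scoring function) picks chosen vs. rejected."""
    candidates = [target_model.generate(prompt) for _ in range(2)]
    candidates.sort(key=judge, reverse=True)
    return {"prompt": prompt, "chosen": candidates[0], "rejected": candidates[-1]}

def build_multi_model_pair(prompt, target_model, strong_model):
    """Multi-model generation: a stronger model (e.g. GPT-4o) supplies the
    chosen response, paired with the target model's own rejected response."""
    return {
        "prompt": prompt,
        "chosen": strong_model.generate(prompt),
        "rejected": target_model.generate(prompt),
    }
```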

Our study reveals a striking, safety-specific phenomenon associated with DPO alignment: Although multi-model generated data enhances performance on general tasks (ARC, Hellaswag, MMLU, TruthfulQA, Winogrande) by providing diverse responses, it also tends to facilitate reward hacking during training.

This can lead to a high attack success rate (ASR) when models encounter jailbreaking prompts.
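ASR is typically measured as the fraction of jailbreak prompts for which the model's output is judged harmful. A hypothetical sketch, where `is_harmful` stands in for whatever safety judge the evaluation uses:

```python
def attack_success_rate(model, jailbreak_prompts, is_harmful):
    """Fraction of jailbreak prompts that elicit a harmful completion.

    `is_harmful` is a hypothetical classifier (e.g. a safety judge model)
    returning True when a response complies with the harmful request.
    """
    successes = sum(is_harmful(model.generate(p)) for p in jailbreak_prompts)
    return successes / len(jailbreak_prompts)
```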

The issue is particularly pronounced when employing stronger models like GPT-4o or larger models in the same family to generate chosen responses paired with target model self-generated rejected responses, resulting in dramatically poorer safety outcomes.

Furthermore, with respect to safety, using solely self-generated responses (single-model generation) for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models, whether used directly as chosen data or as part of a multi-model response pool.

We demonstrate that multi-model preference data exhibits high linear separability between chosen and rejected responses, which allows models to exploit superficial cues rather than internalizing robust safety constraints.
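One common way to quantify this kind of linear separability is to fit a linear probe on response embeddings and measure how well it distinguishes chosen from rejected responses. The sketch below is an assumption about such an analysis, not the paper's exact methodology; the embeddings could come from any fixed encoder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_separability(chosen_embeddings, rejected_embeddings):
    """Probe how linearly separable chosen vs. rejected responses are.

    Inputs are (n, d) arrays of response embeddings; high cross-validated
    accuracy of a linear classifier indicates surface-level cues that a
    DPO-trained model could exploit instead of learning safety constraints.
    """
    X = np.concatenate([chosen_embeddings, rejected_embeddings])
    y = np.concatenate([np.ones(len(chosen_embeddings)),
                        np.zeros(len(rejected_embeddings))])
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, X, y, cv=5).mean()
```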

Our experiments, conducted on models from the Llama, Mistral, and Qwen families, consistently validate these findings.

Published on arXiv on: 2025-04-03T00:36:40Z