Link: http://arxiv.org/abs/2508.13743v1
PDF Link: http://arxiv.org/pdf/2508.13743v1
Summary: Large language models (LLMs), while increasingly used in domains requiringfactual rigor, often display a troubling behavior: sycophancy, the tendency toalign with user beliefs regardless of correctness.
This tendency is reinforcedby preference-based alignment techniques that optimize for user satisfactionbut can undermine truthfulness.
While relatively benign in casual dialogue,sycophancy poses serious risks in high-stakes settings such as scientificquestion answering (QA), where model outputs may shape collaborative reasoning,decision-making, and knowledge formation.
Despite its importance, thisphenomenon remains underexamined in factual QA contexts.
We address this gap byintroducing a unified evaluation framework to quantify the impact ofsycophantic context on model behavior in scientific QA, measuring how muchuser-imposed social pressure distorts model outputs.
The framework incorporatesadversarial prompting setups and targeted metrics, such as misleadingresistance and sycophancy resistance, that capture a model's ability tomaintain factual consistency under misleading cues.
Systematic evaluationsacross open-source and proprietary models reveal pervasive sycophantictendencies, driven more by alignment strategy than by model size.
To mitigatethis issue, we propose Pressure-Tune, a lightweight post-training method thatfine-tunes models on synthetic adversarial dialogues paired withchain-of-thought rationales.
These rationales reject user misinformation whilereinforcing factual commitments.
Experiments on challenging scientific QAbenchmarks show that Pressure-Tune significantly enhances sycophancy resistancewithout compromising accuracy or responsiveness to valid feedback, offering apractical pathway toward more truthful and principled model behavior.
Published on arXiv on: 2025-08-19T11:30:52Z