Link: http://arxiv.org/abs/2507.06489v1
PDF Link: http://arxiv.org/pdf/2507.06489v1
Summary: Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to ensure transparency, trust, and safety in human-AI interactions across many high-stakes applications.
In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks.
We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes.
We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defence techniques are largely ineffective or counterproductive.
Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.
Published on arXiv: 2025-07-09T02:19:46Z
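
The paper's attack framework is not detailed in this summary, so the following is only a minimal, hypothetical sketch of the general setup the abstract describes: apply a subtle, semantic-preserving perturbation to a question, then re-elicit the model's verbal confidence and compare it against the original. The names `perturb`, `CONFIDENCE_PROMPT`, and `query_model` are illustrative assumptions, not the authors' code or any specific library API.

```python
# Hypothetical sketch (not the paper's framework): a character-level,
# semantic-preserving perturbation applied to a question before
# re-eliciting verbal confidence from an LLM.
import random


def perturb(question: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Swap adjacent characters inside a random word to produce a subtle,
    meaning-preserving perturbation of the input question."""
    rng = random.Random(seed)
    words = question.split()
    for _ in range(n_swaps):
        idx = rng.randrange(len(words))
        w = words[idx]
        if len(w) > 3:
            i = rng.randrange(1, len(w) - 2)
            words[idx] = w[:i] + w[i + 1] + w[i] + w[i + 2:]
    return " ".join(words)


CONFIDENCE_PROMPT = (
    "Answer the question, then state your confidence as a percentage.\n"
    "Question: {question}\n"
    "Answer and confidence:"
)


def query_model(prompt: str) -> tuple[str, float]:
    """Placeholder for an LLM call returning (answer, verbal confidence)."""
    raise NotImplementedError("plug in your LLM client here")


if __name__ == "__main__":
    original = "What is the capital of Australia?"
    attacked = perturb(original, n_swaps=2)
    print(CONFIDENCE_PROMPT.format(question=original))
    print(CONFIDENCE_PROMPT.format(question=attacked))
    # In an actual evaluation, both prompts would be sent to the model via
    # query_model, and the two elicited confidences (and answers) compared
    # to measure robustness under the semantic-preserving perturbation.
```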