Link: http://arxiv.org/abs/2508.19980v1
PDF Link: http://arxiv.org/pdf/2508.19980v1
Summary: As language models are increasingly deployed as autonomous agents inhigh-stakes settings, ensuring that they reliably follow user-defined rules hasbecome a critical safety concern.
To this end, we study whether language modelsexhibit contextual robustness, or the capability to adhere to context-dependentsafety specifications.
For this analysis, we develop a benchmark (PasswordEval)that measures whether language models can correctly determine when a userrequest is authorized (i.
e.
, with a correct password).
We find that currentopen- and closed-source models struggle with this seemingly simple task, andthat, perhaps surprisingly, reasoning capabilities do not generally improveperformance.
In fact, we find that reasoning traces frequently leakconfidential information, which calls into question whether reasoning tracesshould be exposed to users in such applications.
We also scale the difficultyof our evaluation along multiple axes: (i) by adding adversarial user pressurethrough various jailbreaking strategies, and (ii) through longer multi-turnconversations where password verification is more challenging.
Overall, ourresults suggest that current frontier models are not well-suited to handlingconfidential information, and that reasoning capabilities may need to betrained in a different manner to make them safer for release in high-stakessettings.
Published on arXiv on: 2025-08-27T15:39:46Z