Link: http://arxiv.org/abs/2506.13726v1
PDF Link: http://arxiv.org/pdf/2506.13726v1
Summary: The introduction of advanced reasoning capabilities has improved the problem-solving performance of large language models, particularly on math and coding benchmarks. However, it remains unclear whether these reasoning models are more or less vulnerable to adversarial prompt attacks than their non-reasoning counterparts. In this work, we present a systematic evaluation of weaknesses in advanced reasoning models compared to similar non-reasoning models across a diverse set of prompt-based attack categories. Using experimental data, we find that on average the reasoning-augmented models are \emph{slightly more robust} than non-reasoning models (42.51\% vs. 45.53\% attack success rate, lower is better). However, this overall trend masks significant category-specific differences: for certain attack types the reasoning models are substantially \emph{more vulnerable} (e.g., up to 32 percentage points worse on a tree-of-attacks prompt), while for others they are markedly \emph{more robust} (e.g., 29.8 points better on cross-site scripting injection). Our findings highlight the nuanced security implications of advanced reasoning in language models and emphasize the importance of stress-testing safety across diverse adversarial techniques.
Published on arXiv on: 2025-06-16T17:32:18Z
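
A minimal sketch of the averaging effect described in the abstract: how an overall mean attack success rate (ASR) can hide large per-category reversals. The category names and numbers below are hypothetical placeholders, not data from the paper, and the functions are illustrative only.

```python
# Hypothetical per-category attack success rates (fractions, lower is better).
# These values are invented to illustrate the "average masks category differences" point.
reasoning_asr = {"tree_of_attacks": 0.80, "xss_injection": 0.20, "prompt_leak": 0.30}
baseline_asr  = {"tree_of_attacks": 0.48, "xss_injection": 0.50, "prompt_leak": 0.35}

def mean_asr(asr_by_category):
    """Unweighted mean ASR across attack categories."""
    return sum(asr_by_category.values()) / len(asr_by_category)

print(f"reasoning model mean ASR: {mean_asr(reasoning_asr):.2%}")
print(f"baseline model mean ASR:  {mean_asr(baseline_asr):.2%}")

# Per-category gaps in percentage points
# (positive = reasoning model is more vulnerable in that category).
for category in reasoning_asr:
    gap = (reasoning_asr[category] - baseline_asr[category]) * 100
    print(f"{category}: {gap:+.1f} pp")
```

With these made-up numbers the two models have similar overall means, yet the per-category gaps swing in opposite directions, which is the pattern the abstract reports for the real evaluation.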