Link: http://arxiv.org/abs/2511.02376v1
PDF Link: http://arxiv.org/pdf/2511.02376v1
Summary: Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations.
We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to a 95% attack success rate on Llama-3.1-8B within six turns, a 24 percent improvement over single-turn baselines.
AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests and then iteratively refines them.
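To make the temperature-manager idea concrete, here is a minimal sketch of how sampling temperature might be adjusted between turns based on the previous failure mode. All class names, thresholds, and adjustment rules below are assumptions for illustration; the abstract does not specify the paper's actual parameters or logic.

```python
# Hypothetical sketch of an adaptive temperature manager in the spirit of the
# abstract's description. The failure-mode labels and step sizes are assumed,
# not taken from the paper.

class TemperatureManager:
    """Adjusts the sampling temperature between attack turns
    based on the observed failure mode of the previous attempt."""

    def __init__(self, temperature=0.7, low=0.1, high=1.5):
        self.temperature = temperature
        self.low = low    # floor for conservative sampling
        self.high = high  # ceiling for diverse sampling

    def update(self, failure_mode):
        # "refusal": the target flatly refused -> raise temperature to
        #            explore more diverse phrasings on the next turn.
        # "off_topic": the rewrite drifted from the goal -> lower
        #              temperature to sample more conservatively.
        if failure_mode == "refusal":
            self.temperature = min(self.high, self.temperature + 0.2)
        elif failure_mode == "off_topic":
            self.temperature = max(self.low, self.temperature - 0.2)
        return self.temperature


mgr = TemperatureManager()
mgr.update("refusal")  # nudges temperature upward, capped at `high`
```

The design choice here is the standard one for such controllers: bounded additive updates keyed on a coarse failure classification, so repeated refusals steadily increase sampling diversity without the temperature running away.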
Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches.
These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
Published on arXiv on: 2025-11-04T08:56:28Z