
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Link: http://arxiv.org/abs/2511.02376v1

PDF Link: http://arxiv.org/pdf/2511.02376v1

Summary: Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations.

We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to a 95% attack success rate on Llama-3.1-8B within six turns, a 24 percent improvement over single-turn baselines.

AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests, then iteratively refines them.
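The summary gives no implementation details, but the temperature-manager idea can be illustrated with a generic adaptive sampling schedule: nudge the temperature up when one kind of failure suggests more varied rewording is needed, and down when another suggests sampling is too diffuse. The function name, failure-mode labels, step size, and bounds below are all hypothetical, not taken from the AutoAdv paper.

```python
# Hypothetical sketch of a failure-mode-driven temperature schedule.
# All names and constants are illustrative assumptions.
def next_temperature(current: float, failure_mode: str) -> float:
    if failure_mode == "refusal":
        # A refusal suggests the next attempt should vary more: raise temperature.
        return min(round(current + 0.2, 2), 1.5)
    if failure_mode == "off_topic":
        # An off-topic reply suggests sampling is too diffuse: lower temperature.
        return max(round(current - 0.2, 2), 0.1)
    return current  # leave the temperature unchanged for other outcomes

# Example trajectory over three hypothetical failed turns.
temps = [0.7]
for mode in ["refusal", "refusal", "off_topic"]:
    temps.append(next_temperature(temps[-1], mode))
# temps == [0.7, 0.9, 1.1, 0.9]
```

The clamping bounds matter for a loop like this: without them, repeated failures of one kind would push the sampling temperature to degenerate values.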

Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches.

These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.

Published on arXiv: 2025-11-04T08:56:28Z