Link: http://arxiv.org/abs/2503.10619v1
PDF Link: http://arxiv.org/pdf/2503.10619v1
Summary: We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective.
Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses.
By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege reveals how minor concessions can accumulate into fully disallowed outputs.
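The following is a minimal sketch of the breadth-first, multi-turn tree search the summary describes; the function names (query_model, branch_prompts, score_compliance, extract_leaks), the data structure, and the thresholds are illustrative assumptions, not the authors' implementation.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """One conversation state (branch) in the search tree."""
        history: list                                # (prompt, response) turns so far
        leaks: list = field(default_factory=list)    # partial policy leaks harvested so far
        score: float = 0.0                           # compliance score of the latest response

    def query_model(history, prompt):
        # Placeholder: send the conversation history plus the new prompt to the target LLM.
        return "model response"

    def branch_prompts(node, width):
        # Placeholder: generate `width` adversarial follow-ups, each re-injecting the
        # partial leaks collected along this branch into the next query.
        context = " ".join(node.leaks)
        return [f"follow-up {i} building on: {context}" for i in range(width)]

    def score_compliance(response):
        # Placeholder: 0.0 = full refusal, 1.0 = fully disallowed output.
        return 0.1

    def extract_leaks(response):
        # Placeholder: pull out partially compliant fragments from the response.
        return []

    def siege_search(goal_prompt, width=3, max_turns=5, success_threshold=0.9):
        """Breadth-first expansion of the conversation tree, turn by turn."""
        frontier = [Node(history=[])]
        for turn in range(max_turns):
            next_frontier = []
            for node in frontier:
                prompts = [goal_prompt] if turn == 0 else branch_prompts(node, width)
                for prompt in prompts:
                    response = query_model(node.history, prompt)
                    child = Node(
                        history=node.history + [(prompt, response)],
                        leaks=node.leaks + extract_leaks(response),
                        score=score_compliance(response),
                    )
                    if child.score >= success_threshold:
                        return child        # fully disallowed output reached
                    next_frontier.append(child)
            # Breadth-first: keep every child (a real system would likely prune
            # to the most compliant branches to bound the query budget).
            frontier = next_frontier
        return max(frontier, key=lambda n: n.score) if frontier else None

In this sketch, each branch carries forward the leaks accumulated on its path, so later prompts can build on earlier partial compliance rather than starting from scratch.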
Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT.
This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
Published on arXiv on: 2025-03-13T17:57:32Z