
Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search

Link: http://arxiv.org/abs/2503.10619v1

PDF Link: http://arxiv.org/pdf/2503.10619v1

Summary: We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective.

Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses.

By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege reveals how minor concessions can accumulate into fully disallowed outputs.
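
The abstract does not include code, so the following is only a minimal sketch of the breadth-first, leak-re-injecting search it describes. The helpers `query_model`, `compliance_score`, and `mutate_prompt` are hypothetical placeholders standing in for the paper's actual attacker, judge, and prompt-generation components.

```python
# A minimal sketch of the breadth-first multi-turn attack described above.
# query_model, compliance_score, and mutate_prompt are hypothetical
# placeholders, not the authors' released implementation.
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """One partial attack conversation in the search tree."""
    history: list = field(default_factory=list)  # (prompt, response) turns so far
    leaks: list = field(default_factory=list)    # partial-compliance fragments collected so far


def query_model(history, prompt):
    """Placeholder for a call to the target LLM; returns its reply as a string."""
    raise NotImplementedError


def compliance_score(response):
    """Placeholder judge: 1.0 = fully disallowed output, 0.0 = complete refusal."""
    raise NotImplementedError


def mutate_prompt(goal, leaks):
    """Placeholder: craft follow-up prompts that re-inject earlier partial leaks."""
    return [f"{goal} (expanding on: {leak})" for leak in leaks] or [goal]


def breadth_first_attack(goal, branching=3, max_turns=5, success_threshold=0.9):
    frontier = [Conversation()]                  # every node at depth d holds d turns
    for _ in range(max_turns):
        next_frontier = []
        for conv in frontier:
            # Branch several adversarial prompts from each partial conversation.
            for prompt in mutate_prompt(goal, conv.leaks)[:branching]:
                response = query_model(conv.history, prompt)
                score = compliance_score(response)
                child = Conversation(
                    history=conv.history + [(prompt, response)],
                    leaks=conv.leaks + ([response] if score > 0 else []),
                )
                if score >= success_threshold:   # fully disallowed output reached
                    return child
                next_frontier.append(child)
        frontier = next_frontier
    return None                                  # turn budget exhausted without success
```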

Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT.

This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.

Published on arXiv: 2025-03-13T17:57:32Z