Link: http://arxiv.org/abs/2503.21598v1
PDF Link: http://arxiv.org/pdf/2503.21598v1
Summary: Large Language Models (LLMs) have transformed task automation and content generation across various domains while incorporating safety filters to prevent misuse. We introduce a novel jailbreaking framework that employs distributed prompt processing combined with iterative refinements to bypass these safety measures, particularly in generating malicious code. Our architecture consists of four key modules: prompt segmentation, parallel processing, response aggregation, and LLM-based jury evaluation.
Tested on 500 malicious prompts across 10 cybersecurity categories, the framework achieves a 73.2% Success Rate (SR) in generating malicious code. Notably, our comparative analysis reveals that traditional single-LLM judge evaluation overestimates SRs (93.8%) compared to our LLM jury system (73.2%), with manual verification confirming that single-judge assessments often accept incomplete implementations. Moreover, we demonstrate that our distributed architecture improves SRs by 12% over the non-distributed approach in an ablation study, highlighting both the effectiveness of distributed prompt processing and the importance of robust evaluation methodologies in assessing jailbreak attempts.
Published on arXiv on: 2025-03-27T15:19:55Z
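
For orientation, a minimal sketch of how the four modules named in the abstract (prompt segmentation, parallel processing, response aggregation, LLM jury evaluation) might fit together in an iterative loop. The function names, the `query_llm` client, the segment count, and the majority-vote threshold are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of the described distributed pipeline; assumes a generic
# query_llm(prompt) -> str client. Names and thresholds are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def segment_prompt(prompt: str, n_segments: int) -> List[str]:
    """Module 1: split the prompt into smaller sub-prompts."""
    words = prompt.split()
    size = max(1, len(words) // n_segments)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def process_segments(segments: List[str], query_llm: Callable[[str], str]) -> List[str]:
    """Module 2: send each segment to the target LLM in parallel."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        return list(pool.map(query_llm, segments))


def aggregate_responses(responses: List[str], query_llm: Callable[[str], str]) -> str:
    """Module 3: ask an LLM to merge the partial responses into one output."""
    merge_prompt = ("Combine the following partial answers into a single coherent program:\n\n"
                    + "\n---\n".join(responses))
    return query_llm(merge_prompt)


def jury_evaluate(candidate: str, jurors: List[Callable[[str], str]],
                  threshold: float = 0.5) -> bool:
    """Module 4: several LLM jurors vote on whether the output is a complete implementation."""
    votes = []
    for juror in jurors:
        verdict = juror("Does the following code fully implement the requested "
                        f"functionality? Answer YES or NO.\n\n{candidate}")
        votes.append(verdict.strip().upper().startswith("YES"))
    return sum(votes) / len(votes) > threshold


def run_pipeline(prompt: str, query_llm: Callable[[str], str],
                 jurors: List[Callable[[str], str]], max_iterations: int = 3) -> Optional[str]:
    """Iterative refinement: repeat the four modules until the jury accepts or the budget runs out."""
    for _ in range(max_iterations):
        segments = segment_prompt(prompt, n_segments=4)
        responses = process_segments(segments, query_llm)
        candidate = aggregate_responses(responses, query_llm)
        if jury_evaluate(candidate, jurors):
            return candidate
    return None
```

The jury step is what separates the two SR figures quoted above: a single judge corresponds to `jurors` containing one model, whereas the multi-juror majority vote is the stricter evaluation that rejects incomplete implementations.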