Link: http://arxiv.org/abs/2503.21598v1
PDF Link: http://arxiv.org/pdf/2503.21598v1
Summary: Large Language Models (LLMs) have transformed task automation and content generation across various domains while incorporating safety filters to prevent misuse. We introduce a novel jailbreaking framework that employs distributed prompt processing combined with iterative refinements to bypass these safety measures, particularly in generating malicious code. Our architecture consists of four key modules: prompt segmentation, parallel processing, response aggregation, and LLM-based jury evaluation.
Tested on 500 malicious prompts across 10 cybersecurity categories, the framework achieves a 73.2% Success Rate (SR) in generating malicious code. Notably, our comparative analysis reveals that traditional single-LLM judge evaluation overestimates SRs (93.8%) compared to our LLM jury system (73.2%), with manual verification confirming that single-judge assessments often accept incomplete implementations. Moreover, we demonstrate that our distributed architecture improves SRs by 12% over the non-distributed approach in an ablation study, highlighting both the effectiveness of distributed prompt processing and the importance of robust evaluation methodologies in assessing jailbreak attempts.
Published on arXiv on: 2025-03-27T15:19:55Z
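
For orientation, a minimal sketch of how the four modules named in the abstract (prompt segmentation, parallel processing, response aggregation, LLM jury evaluation) might fit together in an iterative loop. The function names, the `query_llm` client, the segment count, and the majority-vote threshold are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of the described distributed pipeline; assumes a generic
# query_llm(prompt) -> str client. Names and thresholds are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def segment_prompt(prompt: str, n_segments: int) -> List[str]:
    """Module 1: split the prompt into smaller sub-prompts."""
    words = prompt.split()
    size = max(1, len(words) // n_segments)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def process_segments(segments: List[str], query_llm: Callable[[str], str]) -> List[str]:
    """Module 2: send each segment to the target LLM in parallel."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        return list(pool.map(query_llm, segments))


def aggregate_responses(responses: List[str], query_llm: Callable[[str], str]) -> str:
    """Module 3: ask an LLM to merge the partial responses into one output."""
    merge_prompt = ("Combine the following partial answers into a single coherent program:\n\n"
                    + "\n---\n".join(responses))
    return query_llm(merge_prompt)


def jury_evaluate(candidate: str, jurors: List[Callable[[str], str]],
                  threshold: float = 0.5) -> bool:
    """Module 4: several LLM jurors vote on whether the output is a complete implementation."""
    votes = []
    for juror in jurors:
        verdict = juror("Does the following code fully implement the requested "
                        f"functionality? Answer YES or NO.\n\n{candidate}")
        votes.append(verdict.strip().upper().startswith("YES"))
    return sum(votes) / len(votes) > threshold


def run_pipeline(prompt: str, query_llm: Callable[[str], str],
                 jurors: List[Callable[[str], str]], max_iterations: int = 3) -> Optional[str]:
    """Iterative refinement: repeat the four modules until the jury accepts or the budget runs out."""
    for _ in range(max_iterations):
        segments = segment_prompt(prompt, n_segments=4)
        responses = process_segments(segments, query_llm)
        candidate = aggregate_responses(responses, query_llm)
        if jury_evaluate(candidate, jurors):
            return candidate
    return None
```

The jury step is what separates the two SR figures quoted above: a single judge corresponds to `jurors` containing one model, whereas the multi-juror majority vote is the stricter evaluation that rejects incomplete implementations.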