Link: http://arxiv.org/abs/2506.18543v1
PDF Link: http://arxiv.org/pdf/2506.18543v1
Summary: The widespread deployment of large language models (LLMs) has raised critical concerns over their vulnerability to jailbreak attacks, i.e., adversarial prompts that bypass alignment mechanisms and elicit harmful or policy-violating outputs. While proprietary models like GPT-4 have undergone extensive evaluation, the robustness of emerging open-source alternatives such as DeepSeek remains largely underexplored, despite their growing adoption in real-world applications. In this paper, we present the first systematic jailbreak evaluation of DeepSeek-series models, comparing them with GPT-3.5 and GPT-4 using the HarmBench benchmark. We evaluate seven representative attack strategies across 510 harmful behaviors categorized by both function and semantic domain. Our analysis reveals that DeepSeek's Mixture-of-Experts (MoE) architecture introduces routing sparsity that offers selective robustness against optimization-based attacks such as TAP-T, but leads to significantly higher vulnerability under prompt-based and manually engineered attacks. In contrast, GPT-4 Turbo demonstrates stronger and more consistent safety alignment across diverse behaviors, likely due to its dense Transformer design and reinforcement learning from human feedback. Fine-grained behavioral analysis and case studies further show that DeepSeek often routes adversarial prompts to under-aligned expert modules, resulting in inconsistent refusal behaviors. These findings highlight a fundamental trade-off between architectural efficiency and alignment generalization, emphasizing the need for targeted safety tuning and modular alignment strategies to ensure secure deployment of open-source LLMs.
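To make the "routing sparsity" claim concrete, the sketch below shows a generic top-k Mixture-of-Experts layer: each token's gate activates only k of n experts, so different prompts may be handled by different expert subsets, which is the mechanism the abstract links to inconsistent refusal behavior. The layer sizes, expert count, and k are illustrative assumptions and do not reflect DeepSeek's actual configuration.

```python
# Minimal sketch of top-k Mixture-of-Experts routing (illustrative only;
# d_model, n_experts, and k are assumptions, not DeepSeek's configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)  # router scoring each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        logits = self.gate(x)                              # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)  # keep only k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Routing sparsity: each token is processed by only k of n_experts,
        # so behavior (including refusals) depends on which experts fire.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(4, 512)          # 4 token embeddings
print(TopKMoELayer()(tokens).shape)   # torch.Size([4, 512])
```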
Published on arXiv on: 2025-06-23T11:53:31Z