
Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks

Link: http://arxiv.org/abs/2506.18543v1

PDF Link: http://arxiv.org/pdf/2506.18543v1

Summary: The widespread deployment of large language models (LLMs) has raised critical concerns over their vulnerability to jailbreak attacks, i.e., adversarial prompts that bypass alignment mechanisms and elicit harmful or policy-violating outputs. While proprietary models like GPT-4 have undergone extensive evaluation, the robustness of emerging open-source alternatives such as DeepSeek remains largely underexplored, despite their growing adoption in real-world applications. In this paper, we present the first systematic jailbreak evaluation of DeepSeek-series models, comparing them with GPT-3.5 and GPT-4 using the HarmBench benchmark. We evaluate seven representative attack strategies across 510 harmful behaviors categorized by both function and semantic domain. Our analysis reveals that DeepSeek's Mixture-of-Experts (MoE) architecture introduces routing sparsity that offers selective robustness against optimization-based attacks such as TAP-T, but leads to significantly higher vulnerability under prompt-based and manually engineered attacks. In contrast, GPT-4 Turbo demonstrates stronger and more consistent safety alignment across diverse behaviors, likely due to its dense Transformer design and reinforcement learning from human feedback. Fine-grained behavioral analysis and case studies further show that DeepSeek often routes adversarial prompts to under-aligned expert modules, resulting in inconsistent refusal behaviors. These findings highlight a fundamental trade-off between architectural efficiency and alignment generalization, emphasizing the need for targeted safety tuning and modular alignment strategies to ensure secure deployment of open-source LLMs.
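To make the benchmark setup described above more concrete, here is a minimal Python sketch of what an evaluation sweep over models, attack strategies, and harmful behaviors might look like. The helper names (`apply_attack`, `query_model`, `is_harmful`) and the attack-success-rate bookkeeping are hypothetical placeholders for illustration only; they are not HarmBench's actual API and not the authors' code.

```python
# Hypothetical stand-ins for the real components: HarmBench behaviors,
# attack implementations (e.g. TAP-T or manual jailbreak prompts),
# target LLM APIs, and a harmfulness judge for each completion.
def apply_attack(attack: str, behavior: str) -> str:
    """Wrap a harmful behavior in an adversarial prompt (placeholder)."""
    return f"[{attack}] {behavior}"

def query_model(model: str, prompt: str) -> str:
    """Send the prompt to the target model and return its completion (placeholder)."""
    return "I cannot help with that."

def is_harmful(behavior: str, completion: str) -> bool:
    """Judge whether the completion actually fulfills the harmful behavior (placeholder)."""
    return False

def evaluate(models, attacks, behaviors):
    """Compute attack success rate (ASR) per (model, attack) pair."""
    asr = {}
    for model in models:
        for attack in attacks:
            hits = 0
            for behavior in behaviors:
                prompt = apply_attack(attack, behavior)
                completion = query_model(model, prompt)
                if is_harmful(behavior, completion):
                    hits += 1
            asr[(model, attack)] = hits / len(behaviors)
    return asr

# Example sweep mirroring the paper's setup: several attack strategies
# applied to every harmful behavior, for each model under test.
results = evaluate(
    models=["deepseek-chat", "gpt-3.5-turbo", "gpt-4-turbo"],
    attacks=["direct_request", "tap_t", "manual_jailbreak"],
    behaviors=["behavior_001", "behavior_002"],  # the paper uses 510 behaviors
)
for (model, attack), rate in sorted(results.items()):
    print(f"{model:15s} {attack:18s} ASR={rate:.2%}")
```

Comparing the resulting ASR values per (model, attack) pair is how one would surface the kind of trade-off the paper reports, e.g. lower success rates for optimization-based attacks against DeepSeek but higher rates for prompt-based ones.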

Published on arXiv on: 2025-06-23T11:53:31Z