Link: http://arxiv.org/abs/2502.18935v1
PDF Link: http://arxiv.org/pdf/2502.18935v1
Summary: Large language models (LLMs) have demonstrated remarkable capabilities across various applications, highlighting the urgent need for comprehensive safety evaluations.
In particular, the enhanced Chinese language proficiency of LLMs, combined with the unique characteristics and complexity of Chinese expressions, has driven the emergence of Chinese-specific benchmarks for safety assessment.
However, these benchmarks generally fall short in effectively exposing LLM safety vulnerabilities.
To address this gap, we introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in LLMs, featuring a refined hierarchical safety taxonomy tailored to the Chinese context.
To improve generation efficiency, we employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to enhance assessment effectiveness and leverages LLMs to automatically scale up the dataset through in-context learning.
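The abstract does not detail AJPE's internals, but the core idea of scaling a dataset through in-context learning can be illustrated with a minimal sketch. The sketch below assumes a simple few-shot loop in which seed jailbreak templates serve as demonstrations and an LLM is prompted to produce a new variant; the function names (expand_jailbreak_prompt, complete) and the prompt wording are hypothetical, not taken from the paper.

```python
from typing import Callable, List


def expand_jailbreak_prompt(
    seed_prompts: List[str],
    target_query: str,
    complete: Callable[[str], str],
    n_examples: int = 3,
) -> str:
    """Generate one new jailbreak-style prompt via few-shot in-context learning.

    seed_prompts: hand-crafted jailbreak templates used as demonstrations
    target_query: the unsafe question the new template should embed
    complete: any text-completion function, e.g. a thin wrapper around an LLM API
    """
    # Format a few seed templates as numbered demonstrations.
    demos = "\n\n".join(
        f"Example {i + 1}:\n{p}" for i, p in enumerate(seed_prompts[:n_examples])
    )
    # Ask the model to imitate the demonstrated style for a new query.
    meta_prompt = (
        "Below are examples of adversarial prompt templates used to probe "
        "LLM safety filters.\n\n"
        f"{demos}\n\n"
        "Following the same style, write one new template that embeds the "
        f'question: "{target_query}"'
    )
    return complete(meta_prompt)


# Usage sketch: plug in any completion backend and a pool of seed templates,
# then iterate over a list of target queries to scale up the dataset.
```

In practice such a loop would be run over a taxonomy of harmful queries, with generated prompts filtered for quality before inclusion in the benchmark; this sketch only shows the in-context-learning expansion step.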
The proposed JailBench is extensively evaluated on 13 mainstream LLMs and achieves the highest attack success rate against ChatGPT compared to existing Chinese benchmarks, underscoring its efficacy in identifying latent vulnerabilities in LLMs and illustrating the substantial room for improvement in the security and trustworthiness of LLMs within the Chinese context.
Our benchmark is publicly available at https://github.com/STAIR-BUPT/JailBench.
Published on arXiv on: 2025-02-26T08:36:42Z