Link: http://arxiv.org/abs/2508.10390v1
PDF Link: http://arxiv.org/pdf/2508.10390v1
Summary: Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs.
Unfortunately, many existing red-teaming datasets contain such unsuitable prompts.
To evaluate attacks accurately, these datasets need to be assessed and cleaned for maliciousness.
However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), whose detection accuracy is inconsistent across harm types.
To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and detection of jailbroken responses.
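
The abstract does not specify MDH's internals. As a minimal sketch of one way LLM-based annotation can be combined with minimal human oversight, the snippet below routes only low-confidence LLM judgements to a human annotator; the confidence threshold, routing rule, and the llm_judge/human_review helpers are hypothetical, not the paper's actual design:

    # Hybrid LLM + human annotation loop in the spirit of MDH (illustrative only).
    def llm_judge(prompt: str) -> tuple[str, float]:
        """Hypothetical LLM call returning a (label, confidence) pair,
        where label is 'malicious' or 'benign'."""
        raise NotImplementedError  # plug in your LLM client here

    def human_review(prompt: str) -> str:
        """Hypothetical fallback: ask a human annotator for a label."""
        return input(f"Label ('malicious'/'benign') for: {prompt!r} ")

    def annotate(prompts: list[str], threshold: float = 0.9) -> dict[str, str]:
        labels = {}
        for p in prompts:
            label, confidence = llm_judge(p)
            # Only low-confidence cases are escalated, keeping human effort minimal.
            labels[p] = label if confidence >= threshold else human_review(p)
        return labels
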
Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought.
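
The abstract does not disclose the D-Attack or DH-CoT payloads. For orientation only, this shows where a developer message sits in an OpenAI-style chat request; the message content is a harmless placeholder and the model name is an assumption:

    # Developer messages sit above user messages in the instruction hierarchy,
    # which is why they are an attractive surface for red-teaming (per the abstract).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="o1",  # illustrative; developer-role support varies by model
        messages=[
            {"role": "developer", "content": "You are a strict math tutor."},
            {"role": "user", "content": "Explain the quadratic formula."},
        ],
    )
    print(response.choices[0].message.content)
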
Code, datasets, judgements, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.
Published on arXiv on: 2025-08-14T06:46:56Z