Link: http://arxiv.org/abs/2505.16765v1
PDF Link: http://arxiv.org/pdf/2505.16765v1
Summary: Jailbreak attacks pose a serious threat to large language models (LLMs) by bypassing built-in safety mechanisms and leading to harmful outputs.
Studying these attacks is crucial for identifying vulnerabilities and improving model security.
This paper presents a systematic survey of jailbreak methods from the novel perspective of stealth.
We find that existing attacks struggle to simultaneously achieve toxic stealth (concealing toxic content) and linguistic stealth (maintaining linguistic naturalness).
Motivated by this, we propose StegoAttack, a fully stealthy jailbreak attack that uses steganography to hide the harmful query within benign, semantically coherent text.
The attack then prompts the LLM to extract the hidden query and respond in an encrypted manner.
This approach effectively hides malicious intent while preserving naturalness, allowing it to evade both built-in and external safety mechanisms.
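The abstract does not spell out the encoding scheme, so the following is only an illustrative sketch of one simple form of linguistic steganography: an acrostic, where each word of the hidden query becomes the leading word of a sentence in an otherwise benign cover paragraph. The function names (hide_query, extract_query) and the acrostic scheme itself are assumptions for illustration, not the paper's confirmed method.

    # Hypothetical acrostic-style steganography sketch; the abstract does not
    # specify StegoAttack's actual encoding, so this is illustrative only.
    def hide_query(query: str, cover_sentences: list[str]) -> str:
        """Embed each word of `query` as the leading word of a cover sentence."""
        words = query.split()
        if len(cover_sentences) < len(words):
            raise ValueError("need at least one cover sentence per hidden word")
        # Prepend one hidden word to each cover sentence.
        stego = [f"{w.capitalize()} {s}" for w, s in zip(words, cover_sentences)]
        return " ".join(stego)

    def extract_query(stego_text: str) -> str:
        """Recover the hidden query by reading the first word of each sentence."""
        sentences = [s.strip() for s in stego_text.split(".") if s.strip()]
        return " ".join(s.split()[0].lower() for s in sentences)

Under this kind of scheme, a keyword-based detector scanning the stego text sees only natural-looking cover sentences, which matches the combination of toxic and linguistic stealth the abstract describes.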
We evaluate StegoAttack on four safety-aligned LLMs from major providers, benchmarking against eight state-of-the-art methods.
StegoAttack achieves an average attack success rate (ASR) of 92.00%, outperforming the strongest baseline by 11.0%.
Its ASR drops by less than 1% even under external detection (e.g., LlamaGuard).
Moreover, it attains the best overall scores on stealth-detection metrics, demonstrating both high efficacy and exceptional stealth.
The code is available at https://anonymous.4open.science/r/StegoAttack-Jail66
Published on arXiv on: 2025-05-22T15:07:34Z