arxiv papers 1 min read

Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions

Link: http://arxiv.org/abs/2502.04322v1

PDF Link: http://arxiv.org/pdf/2502.04322v1

Summary: Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior.

While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative: two attributes easily elicited in multi-step, multilingual interactions.

Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework.

Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HarmScore in both open-source and proprietary LLMs across four safety benchmarks.

Our work reveals a critical yet often overlooked vulnerability: malicious users can easily exploit common interaction patterns for harmful intentions.

Published on arXiv on: 2025-02-06T18:59:02Z