
Bypassing Safety Guardrails in LLMs Using Humor

Link: http://arxiv.org/abs/2504.06577v1

PDF Link: http://arxiv.org/pdf/2504.06577v1

Summary: In this paper, we show it is possible to bypass the safety guardrails of large language models (LLMs) through a humorous prompt that includes the unsafe request.

In particular, our method does not edit the unsafe request and follows a fixed template; it is simple to implement and does not need additional LLMs to craft prompts.
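
The summary does not reproduce the template itself, but the mechanics it describes reduce to plain string interpolation: a fixed humorous wrapper into which the unsafe request is inserted verbatim, with no rewriting and no auxiliary prompt-crafting model. Below is a minimal Python sketch of that structure; `HUMOR_TEMPLATE` and `build_prompt` are hypothetical names, and the wrapper text is a placeholder rather than the paper's actual wording.

```python
# Minimal sketch of the fixed-template idea described above.
# HUMOR_TEMPLATE is a hypothetical placeholder; the paper's actual
# humorous framing is intentionally not reproduced here.
HUMOR_TEMPLATE = (
    "<fixed humorous setup>\n"
    "{request}\n"
    "<fixed humorous closing>"
)

def build_prompt(request: str) -> str:
    """Embed the request verbatim in the fixed template.

    The request is never edited, and no auxiliary LLM is used to
    craft the prompt, matching the method as summarized above.
    """
    return HUMOR_TEMPLATE.format(request=request)

if __name__ == "__main__":
    # Placeholder request; the point is only the fixed-template structure.
    print(build_prompt("<request goes here>"))
```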

Extensive experiments show the effectiveness of our method across different LLMs.

We also show that both removing humor from and adding more humor to our method can reduce its effectiveness; excessive humor possibly distracts the LLM from fulfilling the unsafe request.

Thus, we argue that LLM jailbreaking occurs when there is a proper balance between focus on the unsafe request and the presence of humor.

Published on arXiv: 2025-04-09T04:58:14Z