Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

Link: http://arxiv.org/abs/2505.15753v1

PDF Link: http://arxiv.org/pdf/2505.15753v1

Summary: Large Language Models (LLMs) are known to be vulnerable to jailbreakingattacks, wherein adversaries exploit carefully engineered prompts to induceharmful or unethical responses.

Such threats have raised critical concernsabout the safety and reliability of LLMs in real-world deployment.

Whileexisting defense mechanisms partially mitigate such risks, subsequentadvancements in adversarial techniques have enabled novel jailbreaking methodsto circumvent these protections, exposing the limitations of static defenseframeworks.

In this work, we explore defending against evolving jailbreakingthreats through the lens of context retrieval.

First, we conduct a preliminarystudy demonstrating that even a minimal set of safety-aligned examples againsta particular jailbreak can significantly enhance robustness against this attackpattern.

Building on this insight, we further leverage the retrieval-augmentedgeneration (RAG) techniques and propose Safety Context Retrieval (SCR), ascalable and robust safeguarding paradigm for LLMs against jailbreaking.

Ourcomprehensive experiments demonstrate how SCR achieves superior defensiveperformance against both established and emerging jailbreaking tactics,contributing a new paradigm to LLM safety.

Our code will be available uponpublication.

Published on arXiv on: 2025-05-21T16:58:14Z