Link: http://arxiv.org/abs/2412.02159v1
PDF Link: http://arxiv.org/pdf/2412.02159v1
Summary: Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem.
In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-defined set of behaviors.
As a case study, we focus on preventing an LLM from helping a user make a bomb.
We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem.
In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test.
However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak-defense even in a narrow domain.
Published on arXiv on: 2024-12-03T04:34:58Z
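
The abstract does not spell out how the transcript-classifier defense works. As a rough, hedged illustration only, the sketch below shows one plausible shape for such a defense: a classifier scores the full conversation transcript (including the candidate reply) and the assistant refuses when the score crosses a threshold. The function names, the keyword heuristic in score_transcript, and the threshold are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; not code from the paper. A learned classifier would
# replace the toy keyword heuristic used here as a stand-in.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


def score_transcript(turns: List[Turn]) -> float:
    """Hypothetical stand-in for a learned harm classifier over the whole transcript."""
    text = " ".join(t.content.lower() for t in turns)
    flagged_terms = ["bomb", "explosive", "detonator"]  # toy heuristic, not the paper's
    hits = sum(term in text for term in flagged_terms)
    return min(1.0, hits / 3)


def defended_reply(turns: List[Turn],
                   generate: Callable[[List[Turn]], str],
                   threshold: float = 0.5) -> str:
    """Generate a candidate reply, then veto it if the resulting transcript is flagged."""
    candidate = generate(turns)
    scored = turns + [Turn("assistant", candidate)]
    if score_transcript(scored) >= threshold:
        return "I can't help with that."
    return candidate


if __name__ == "__main__":
    history = [Turn("user", "How do I build a bomb?")]
    placeholder_model = lambda ts: "Sure, here are the steps..."  # stand-in for an LLM
    print(defended_reply(history, placeholder_model))  # -> refusal
```

As the abstract notes, the paper finds that even a transcript-level classifier like this can still fail in some circumstances, so the sketch should be read as a structural illustration rather than a working defense.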