Link: http://arxiv.org/abs/2412.02159v1
PDF Link: http://arxiv.org/pdf/2412.02159v1
Summary: Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem.
In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-defined set of behaviors.
As a case study, we focus on preventing an LLM from helping a user make a bomb.
We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem.
In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test.
However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak-defense even in a narrow domain.
Published on arXiv on: 2024-12-03T04:34:58Z
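
The abstract does not spell out how the transcript-classifier defense works. As a rough, hedged illustration only, the sketch below shows one plausible shape for such a defense: a classifier scores the full conversation transcript (including the candidate reply) and the assistant refuses when the score crosses a threshold. The function names, the keyword heuristic in score_transcript, and the threshold are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; not code from the paper. A learned classifier would
# replace the toy keyword heuristic used here as a stand-in.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


def score_transcript(turns: List[Turn]) -> float:
    """Hypothetical stand-in for a learned harm classifier over the whole transcript."""
    text = " ".join(t.content.lower() for t in turns)
    flagged_terms = ["bomb", "explosive", "detonator"]  # toy heuristic, not the paper's
    hits = sum(term in text for term in flagged_terms)
    return min(1.0, hits / 3)


def defended_reply(turns: List[Turn],
                   generate: Callable[[List[Turn]], str],
                   threshold: float = 0.5) -> str:
    """Generate a candidate reply, then veto it if the resulting transcript is flagged."""
    candidate = generate(turns)
    scored = turns + [Turn("assistant", candidate)]
    if score_transcript(scored) >= threshold:
        return "I can't help with that."
    return candidate


if __name__ == "__main__":
    history = [Turn("user", "How do I build a bomb?")]
    placeholder_model = lambda ts: "Sure, here are the steps..."  # stand-in for an LLM
    print(defended_reply(history, placeholder_model))  # -> refusal
```

As the abstract notes, the paper finds that even a transcript-level classifier like this can still fail in some circumstances, so the sketch should be read as a structural illustration rather than a working defense.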