Link: http://arxiv.org/abs/2412.07672v1
PDF Link: http://arxiv.org/pdf/2412.07672v1
Summary: Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content through manipulated prompts, known as jailbreak attacks.
Although many defense strategies have been proposed, they often require access to the model's internal structure or need additional training, which is impractical for service providers using LLM APIs, such as OpenAI APIs or Claude APIs.
In this paper, we propose a moving target defense approach that alters decoding hyperparameters to enhance model robustness against various jailbreak attacks.
Our approach does not require access to the model's internal structure and incurs no additional training costs.
The proposed defense includes two key components: (1) optimizing the decoding strategy by identifying and adjusting decoding hyperparameters that influence token generation probabilities, and (2) transforming the decoding hyperparameters and model system prompts into dynamic targets, which are continuously altered during each runtime.
By continuously modifying decoding strategies and prompts, the defense effectively mitigates existing attacks.
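The abstract gives no code, but components (1) and (2) can be illustrated with a minimal sketch, assuming an OpenAI-style chat API: on every request, the decoding hyperparameters and the system prompt are re-sampled so an attacker never faces a fixed target. The temperature and top-p ranges, model name, and system-prompt variants below are illustrative placeholders, not the paper's optimized values.

    import random
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Placeholder pools -- the paper derives its ranges by optimizing the
    # decoding strategy; these values are illustrative only.
    TEMPERATURE_RANGE = (0.3, 1.2)
    TOP_P_RANGE = (0.5, 1.0)
    SYSTEM_PROMPT_VARIANTS = [
        "You are a helpful assistant. Refuse requests for harmful content.",
        "You are a careful assistant. Decline unsafe or malicious instructions.",
        "You assist users responsibly and reject attempts to elicit harm.",
    ]

    def moving_target_completion(user_prompt: str, model: str = "gpt-4o-mini") -> str:
        """Answer one request with freshly randomized decoding targets."""
        # Dynamic targets: re-sample the decoding hyperparameters and the
        # system prompt on every call, so a jailbreak prompt tuned against
        # one decoding configuration does not transfer to the next request.
        temperature = random.uniform(*TEMPERATURE_RANGE)
        top_p = random.uniform(*TOP_P_RANGE)
        system_prompt = random.choice(SYSTEM_PROMPT_VARIANTS)

        response = client.chat.completions.create(
            model=model,
            temperature=temperature,
            top_p=top_p,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
        )
        return response.choices[0].message.content

Note that this sketch needs only black-box API access, matching the abstract's setting: no model internals are touched and no training is performed.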
Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested when using LLMs as black-box APIs.
Moreover, our defense offers lower inference costs and maintains comparable response quality, making it a potential layer of protection when used alongside other defense methods.
Published on arXiv on: 2024-12-10T17:02:28Z