arxiv papers 1 min read

Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs

Link: http://arxiv.org/abs/2501.02018v1

PDF Link: http://arxiv.org/pdf/2501.02018v1

Summary: Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, or adversarial attacks used to elicit high-risk behavior from a model.

Jailbreaks have been exploited by cybercriminals and blackhat actors to cause significant harm, highlighting the critical need to safeguard widely-deployed models.

Safeguarding approaches, which include fine-tuning models or having LLMs "self-reflect", may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict "normal" model behavior.

Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area.

In this work, we introduce a novel safeguard, called SafeNudge, that combines Controlled Text Generation with "nudging", or using text interventions to change the behavior of a model.

SafeNudge triggers during text generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by 30% by guiding the LLM towards a safe response.

It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs.

Further, we allow for tunable SPTs.

SafeNudge is open-source and available through https://pypi.org/, and is compatible with models loaded with the Hugging Face "transformers" library.
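To make the general idea of "nudging" concrete, the sketch below illustrates one way a text intervention could be injected mid-generation with the Hugging Face "transformers" library. This is a hypothetical illustration, not the SafeNudge implementation: the model name, the keyword-based risk check, and the nudge string are placeholder assumptions, and the paper's actual detection and steering mechanism is not reproduced here.

```python
# Hypothetical sketch: inject a text "nudge" while generation is in progress.
# Assumptions (not from the paper): gpt2 as the model, a keyword risk heuristic,
# and a fixed reminder string as the intervention.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model
NUDGE_TEXT = "\n[Reminder: respond safely and refuse harmful requests.]\n"  # placeholder intervention

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def looks_risky(text: str) -> bool:
    """Placeholder heuristic; a real safeguard would use a learned detector."""
    return any(word in text.lower() for word in ("bomb", "exploit", "bypass"))

def generate_with_nudge(prompt: str, max_new_tokens: int = 80, chunk: int = 20) -> str:
    """Generate in small chunks; if the partial output looks risky, append the nudge and continue."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    produced = 0
    while produced < max_new_tokens:
        out = model.generate(ids, max_new_tokens=chunk, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
        new_text = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        produced += chunk
        if looks_risky(new_text):
            # Text intervention: steer subsequent tokens toward a safe response.
            nudge_ids = tokenizer(NUDGE_TEXT, return_tensors="pt").input_ids
            ids = torch.cat([out, nudge_ids], dim=-1)
        else:
            ids = out
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(generate_with_nudge("Explain how to stay safe online."))
```

Because the intervention is applied only when the partial output trips the check, unaffected generations proceed normally, which is consistent with the paper's claim of minimal added latency.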

Published on arXiv on: 2025-01-02T15:15:38Z