Link: http://arxiv.org/abs/2411.19038v1
PDF Link: http://arxiv.org/pdf/2411.19038v1
Summary: In recent years, conversational large language models (LLMs) have shown tremendous success in tasks such as casual conversation, question answering, and personalized dialogue, making significant advancements in domains like virtual assistance, social interaction, and online customer engagement. However, they often generate responses that are not aligned with human values (e.g., ethical standards, safety, or social norms), leading to potentially unsafe or inappropriate outputs. While several techniques have been proposed to address this problem, they come with a cost, requiring computationally expensive training or dramatically increasing the inference time. In this paper, we present DIESEL, a lightweight inference guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesired concepts from the response. DIESEL can function either as a standalone safeguard or as an additional layer of defense, enhancing response safety by reranking the LLM's proposed tokens based on their similarity to predefined negative concepts in the latent space. This approach provides an efficient and effective solution for maintaining alignment with human values. Our evaluation demonstrates DIESEL's effectiveness on state-of-the-art conversational models (e.g., Llama 3), even in challenging jailbreaking scenarios that test the limits of response safety. We further show that DIESEL can be generalized to use cases other than safety, providing a versatile solution for general-purpose response filtering with minimal computational overhead.
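To make the reranking idea concrete, below is a minimal sketch of the general mechanism the abstract describes: at each decoding step, the top-k candidate tokens proposed by the LLM are scored against embeddings of predefined negative concepts in a latent space, and candidates that sit close to those concepts are penalized before the next token is chosen. This is not the authors' DIESEL implementation; the base model (gpt2), the sentence embedder (all-MiniLM-L6-v2), the negative-concept list, the penalty weight alpha, and the top-k value are all illustrative assumptions.

```python
# Hedged sketch of latent-space token reranking for safety filtering.
# NOT the paper's DIESEL implementation: model names, alpha, top_k, and
# the negative-concept list below are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # stand-in for any autoregressive LLM
lm = AutoModelForCausalLM.from_pretrained("gpt2")
embedder = SentenceTransformer("all-MiniLM-L6-v2")       # assumed latent-space encoder

negative_concepts = ["violence", "self-harm", "illegal activity"]  # illustrative concepts
neg_emb = embedder.encode(negative_concepts, convert_to_tensor=True)

def rerank_step(prompt_ids, top_k=10, alpha=5.0):
    """Pick the next token, penalizing candidates similar to negative concepts."""
    with torch.no_grad():
        logits = lm(prompt_ids).logits[0, -1]            # next-token logits
    top_logits, top_ids = torch.topk(logits, top_k)

    # Embed each candidate continuation and measure its similarity
    # to the negative-concept embeddings.
    candidates = [tokenizer.decode(torch.cat([prompt_ids[0], tid.view(1)]))
                  for tid in top_ids]
    cand_emb = embedder.encode(candidates, convert_to_tensor=True)
    max_sim = util.cos_sim(cand_emb, neg_emb).max(dim=1).values.cpu()

    # Rerank: original model score minus a similarity penalty.
    scores = top_logits - alpha * max_sim
    return top_ids[scores.argmax()].item()

prompt_ids = tokenizer("How do I", return_tensors="pt").input_ids
next_token = rerank_step(prompt_ids)
print(tokenizer.decode([next_token]))
```

In this sketch the only extra work per decoding step is embedding the k candidate continuations and computing cosine similarities, which is consistent with the abstract's claim of low computational overhead relative to retraining-based alignment methods.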
Published on arXiv on: 2024-11-28T10:33:11Z