Link: http://arxiv.org/abs/2504.01533v1
PDF Link: http://arxiv.org/pdf/2504.01533v1
Summary: Large Language Models (LLMs) face threats from jailbreak prompts.
Existing methods for defending against jailbreak attacks are primarily based on auxiliary models.
These strategies, however, often require extensive data collection or training.
We propose LightDefense, a lightweight defense mechanism targeted at white-box models, which utilizes a safety-oriented direction to adjust the probabilities of tokens in the vocabulary, making safety disclaimers appear among the top tokens after sorting tokens by probability in descending order.
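The abstract does not specify how the safety-oriented direction is computed or applied, so the following is only a minimal sketch of the general idea: shifting next-token logits along a vocabulary-sized "safety" vector so that disclaimer-starting tokens rise toward the top of the probability-sorted vocabulary. The function name, the placeholder token ids, and the way the direction is built are all assumptions for illustration, not the paper's actual procedure.

```python
import torch

def adjust_logits_with_safety_direction(logits: torch.Tensor,
                                         safety_direction: torch.Tensor,
                                         strength: float = 1.0) -> torch.Tensor:
    """Shift next-token logits along a safety-oriented direction.

    `safety_direction` is a vocabulary-sized vector (hypothetical here) with
    positive weight on tokens that begin safety disclaimers; adding it to the
    logits raises those tokens in the probability ranking.
    """
    return logits + strength * safety_direction

# Usage sketch with placeholder values.
vocab_size = 32000
logits = torch.randn(vocab_size)                  # model's next-token logits
safety_direction = torch.zeros(vocab_size)
disclaimer_token_ids = [306, 2609]                # hypothetical ids, e.g. "I", "cannot"
safety_direction[disclaimer_token_ids] = 5.0      # emphasize disclaimer tokens

adjusted = adjust_logits_with_safety_direction(logits, safety_direction, strength=1.0)
top_tokens = torch.topk(torch.softmax(adjusted, dim=-1), k=5).indices
```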
We further innovatively leverage the LLM's uncertainty about prompts to measure their harmfulness and adaptively adjust defense strength, effectively balancing safety and helpfulness.
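The abstract likewise leaves the uncertainty measure and the adaptation rule unspecified. As one plausible reading, uncertainty could be proxied by the entropy of the next-token distribution and mapped to a larger defense strength for more uncertain (potentially harmful) prompts; the linear mapping below is a placeholder assumption, not the paper's rule.

```python
import torch

def adaptive_defense_strength(logits: torch.Tensor,
                              base_strength: float = 1.0,
                              scale: float = 2.0) -> float:
    """Map next-token entropy (an uncertainty proxy) to a defense strength.

    Higher normalized entropy is treated as higher prompt harmfulness, so the
    safety-direction adjustment is applied more strongly.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum()
    max_entropy = torch.log(torch.tensor(float(logits.numel())))
    uncertainty = (entropy / max_entropy).item()   # normalized to [0, 1]
    return base_strength + scale * uncertainty
```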
The effectiveness of LightDefense in defending against 5 attack methods across 2 target LLMs, without compromising helpfulness to benign user queries, highlights its potential as a novel and lightweight defense mechanism, enhancing the security of LLMs.
Published on arXiv on: 2025-04-02T09:21:26Z